etcd Performance Degradation and High Latency

Critical Incident Response

Diagnose etcd performance issues causing API server slowdowns and cluster instability.

Prompt: My kubectl commands are timing out and the API server is super slow — I think etcd might be the bottleneck. How do I check if etcd latency is too high and what's causing it?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When etcd performance degrades, start by confirming the bottleneck is actually etcd-related by checking API server latency patterns, then immediately check disk I/O latency (the most common culprit). From there, investigate database size, leader election frequency, and network latency between peers before looking at resource pressure.

1. Confirm etcd is causing API server slowness
First, verify that API server latency is actually caused by etcd, not client-side issues or scheduler bottlenecks. Check `apiserver_request_duration_seconds` for elevated p95/p99 latency and correlate timing with etcd metrics. If API server latency spikes correlate with etcd performance degradation (you'll see this in etcd_disk_* metrics), you've confirmed the bottleneck. The insight on API server request latency impacting operations explains how etcd issues propagate upstream to kubectl timeouts.
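As a hedged sketch of this check: the API server exports `etcd_request_duration_seconds`, a histogram of its own calls into etcd. On a live cluster you would dump metrics with `kubectl get --raw /metrics > apiserver-metrics.txt`; the sample below is fabricated to illustrate the calculation.

```shell
# Fabricated sample of the API server's etcd request-latency histogram.
# On a real cluster: kubectl get --raw /metrics > apiserver-metrics.txt
cat > apiserver-metrics.txt <<'EOF'
etcd_request_duration_seconds_bucket{operation="get",type="/registry/pods",le="0.5"} 9700
etcd_request_duration_seconds_bucket{operation="get",type="/registry/pods",le="+Inf"} 10000
EOF
# Fraction of API-server -> etcd requests slower than 500ms; more than a few
# percent points at etcd itself rather than a client-side problem.
awk '/le="0.5"/ {fast=$NF} /le="\+Inf"/ {total=$NF}
     END { printf "slow_fraction=%.3f\n", (total-fast)/total }' apiserver-metrics.txt
```

If this fraction is elevated while client-side and scheduler metrics look normal, the bottleneck is downstream of the API server.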
2. Check disk I/O latency on etcd nodes
Disk latency is the #1 cause of etcd performance problems. Check `etcd_disk_wal_fsync_duration_seconds` — if p99 is above 10ms (or worse, above 100ms), your disk is too slow. etcd writes every transaction to WAL (write-ahead log) synchronously, so any disk stall directly blocks write operations. Look at `kubernetes_diskio_io_service_size_stats` for the etcd container to confirm disk I/O bottlenecks. The insight on disk I/O bottlenecks masquerading as application slowness applies here — slow storage manifests as API timeouts, not obvious disk errors.
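A minimal sketch of the fsync check, using a fabricated metrics dump (on a live cluster you would scrape etcd's metrics endpoint; the address and port are assumptions set by `--listen-metrics-urls`):

```shell
# Fabricated etcd WAL fsync histogram. On a real cluster, something like:
#   curl -s http://127.0.0.1:2381/metrics > etcd-metrics.txt
cat > etcd-metrics.txt <<'EOF'
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 800
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 950
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000
EOF
# The usual target is p99 < 10ms, so more than 1% of fsyncs landing above
# the le="0.01" bucket means the disk is too slow for etcd.
awk '/le="0.01"/ {fast=$NF} /le="\+Inf"/ {total=$NF}
     END { printf "fsync_over_10ms=%.1f%%\n", 100*(total-fast)/total }' etcd-metrics.txt
```

Here 5% of fsyncs exceed 10ms, well past the 1% that a p99 < 10ms target allows.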
3. Check etcd database size and compaction status
Large etcd databases slow down all operations, especially range queries and watch operations. Check `etcd_mvcc_db_total_size_in_bytes` — if it's over 2GB, you need compaction. Also check `etcd_mvcc_db_total_size_in_use_in_bytes` vs total size to see fragmentation. Enable automatic compaction if not already configured, and consider setting shorter TTLs on events and other ephemeral objects. A bloated database amplifies disk I/O issues.
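The fragmentation check can be sketched as below. The byte values are fabricated samples; on a live cluster they would come from the two metrics above (or, assuming a recent etcdctl, the `dbSize` / `dbSizeInUse` fields of `etcdctl endpoint status -w json`).

```shell
# Fabricated samples of the two size gauges.
total=2684354560    # etcd_mvcc_db_total_size_in_bytes  (~2.5 GiB)
in_use=1073741824   # etcd_mvcc_db_total_size_in_use_in_bytes (1 GiB)

# Percentage of the backend file that is dead space.
frag=$(( 100 * (total - in_use) / total ))
echo "fragmentation=${frag}%"

# If fragmentation stays high after compaction, reclaim space per member,
# off-peak (defrag briefly blocks the member):
#   etcdctl compaction <revision>
#   etcdctl defrag
```

Compaction removes old revisions but does not shrink the file on disk; only defragmentation returns the space, which is why both numbers matter.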
4. Look for frequent leader elections
Check `etcd_server_leader_changes_seen_total` — if you're seeing more than 3 leader changes per hour, that's a sign of cluster instability. Each election disrupts write operations for 1-2 seconds. Watch for `etcd_server_has_leader` dropping to 0. As the insight on frequent leader elections explains, this is usually a symptom of underlying disk latency, network issues, or resource pressure, not a root cause itself. If you're seeing elections, the root cause is likely already identified in steps 2-3.
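Since `etcd_server_leader_changes_seen_total` is a monotonic counter, the per-hour rate is just the difference between two samples. A sketch with fabricated values:

```shell
# Fabricated counter samples taken one hour apart.
an_hour_ago=12
now=17
elections=$(( now - an_hour_ago ))
echo "elections_last_hour=${elections}"   # >3 per hour -> check disk and network first
```

Five elections in an hour is well past the ~3/hour threshold, so this cluster would warrant going back over the disk and network checks.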
5. Check network latency between etcd peers
etcd uses Raft consensus, so network latency between peers directly impacts write latency. Check `etcd_network_peer_round_trip_time_seconds` — if p99 is above 50ms, network delays are contributing to performance issues. High network latency can also trigger leader elections. Look at `kubernetes_network_errors` for the etcd pods to identify packet loss or network instability. In cloud environments, ensure etcd nodes are in the same availability zone or have low-latency networking between zones.
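The Raft math behind this can be sketched directly: a write commits only after the leader hears from a quorum, so peer RTT sets a hard floor on write latency. The RTT values below are fabricated samples for a hypothetical 5-member cluster.

```shell
# Sample p99 RTTs (ms) from the leader to its four followers.
rtts_ms="2 15 48 65"

# Quorum in a 5-member cluster is 3 members: the leader plus its 2 fastest
# followers, so the 2nd-fastest follower RTT is the commit-latency floor.
floor=$(printf '%s\n' $rtts_ms | sort -n | sed -n '2p')
echo "quorum_rtt_floor_ms=${floor}"
```

This is also why keeping a slow, far-away member is relatively harmless in a 5-member cluster (it rarely lands in the quorum) but directly hurts writes in a 3-member one.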
6. Review CPU and memory pressure on etcd nodes
Finally, check `kubernetes_cpu_usage` and `kubernetes_memory_usage` for etcd pods. etcd is CPU-intensive during compaction and memory-intensive with many watchers. If CPU is consistently above 70% or memory is near limits, resource contention could be causing performance degradation. In practice, however, this is rarely the root cause unless you have thousands of nodes — disk I/O is almost always the culprit. If resources look fine but you're still seeing issues, circle back to disk latency with more detailed storage backend metrics.
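A hedged sketch of the threshold check, using fabricated utilization samples (on a live cluster the readings would come from something like `kubectl -n kube-system top pod -l component=etcd`, then compared against the pod's limits):

```shell
# Fabricated samples: percent of the pod's CPU and memory limits in use.
cpu_pct=45
mem_pct=95

# Thresholds from the step above: sustained CPU > 70% or memory near limits.
if [ "$cpu_pct" -gt 70 ]; then echo "cpu pressure on etcd"; fi
if [ "$mem_pct" -gt 90 ]; then echo "memory pressure on etcd"; fi
```

Here only the memory check fires; per the step above, if neither fires and latency persists, return to the disk I/O metrics.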


Related Insights

Frequent Leader Elections Indicate Cluster Instability (warning)
Multiple leader changes within a short timeframe signal network issues, resource contention, or disk latency problems. Each election disrupts write operations and degrades cluster performance.
API Server Request Latency Impacting Kubernetes Operations (warning)
When etcd experiences performance issues, the Kubernetes API server latency increases, causing kubectl timeouts, scheduler delays, and controller lag that impacts cluster operations.
Disk I/O Bottleneck Masquerading as Application Slowness (critical)
Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
High Latency from Slow API Server or Scheduler (warning)
Elevated apiserver_request_duration_seconds and apiserver_request_total errors indicate API server overload or scheduler bottlenecks, causing slow pod scheduling, kubectl timeouts, and degraded cluster responsiveness.


Monitoring Interfaces

Kubernetes, Datadog