etcd Performance Degradation and High Latency

Critical Incident Response

Diagnose etcd performance issues causing API server slowdowns and cluster instability.

Prompt: My kubectl commands are timing out and the API server is super slow — I think etcd might be the bottleneck. How do I check if etcd latency is too high and what's causing it?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When etcd performance degrades, start by confirming the bottleneck is actually etcd-related by checking API server latency patterns, then immediately check disk I/O latency (the most common culprit). From there, investigate database size, leader election frequency, and network latency between peers before looking at resource pressure.

1. Confirm etcd is causing API server slowness
First, verify that API server latency is actually caused by etcd, not client-side issues or scheduler bottlenecks. Check `apiserver_request_duration_seconds` for elevated p95/p99 latency and correlate timing with etcd metrics. If API server latency spikes correlate with etcd performance degradation (you'll see this in etcd_disk_* metrics), you've confirmed the bottleneck. The insight on API server request latency impacting operations explains how etcd issues propagate upstream to kubectl timeouts.
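As a hedged sketch of this check: the API server exports `etcd_request_duration_seconds`, a histogram of its own calls into etcd. On a live cluster you would dump metrics with `kubectl get --raw /metrics > apiserver-metrics.txt`; the sample below is fabricated to illustrate the calculation.

```shell
# Fabricated sample of the API server's etcd request-latency histogram.
# On a real cluster: kubectl get --raw /metrics > apiserver-metrics.txt
cat > apiserver-metrics.txt <<'EOF'
etcd_request_duration_seconds_bucket{operation="get",type="/registry/pods",le="0.5"} 9700
etcd_request_duration_seconds_bucket{operation="get",type="/registry/pods",le="+Inf"} 10000
EOF
# Fraction of API-server -> etcd requests slower than 500ms; more than a few
# percent points at etcd itself rather than a client-side problem.
awk '/le="0.5"/ {fast=$NF} /le="\+Inf"/ {total=$NF}
     END { printf "slow_fraction=%.3f\n", (total-fast)/total }' apiserver-metrics.txt
```

If this fraction is elevated while client-side and scheduler metrics look normal, the bottleneck is downstream of the API server.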
2. Check disk I/O latency on etcd nodes
Disk latency is the #1 cause of etcd performance problems. Check `etcd_disk_wal_fsync_duration_seconds` — if p99 is above 10ms (or worse, above 100ms), your disk is too slow. etcd writes every transaction to WAL (write-ahead log) synchronously, so any disk stall directly blocks write operations. Look at `kubernetes_diskio_io_service_size_stats` for the etcd container to confirm disk I/O bottlenecks. The insight on disk I/O bottlenecks masquerading as application slowness applies here — slow storage manifests as API timeouts, not obvious disk errors.
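A minimal sketch of the fsync check, using a fabricated metrics dump (on a live cluster you would scrape etcd's metrics endpoint; the address and port are assumptions set by `--listen-metrics-urls`):

```shell
# Fabricated etcd WAL fsync histogram. On a real cluster, something like:
#   curl -s http://127.0.0.1:2381/metrics > etcd-metrics.txt
cat > etcd-metrics.txt <<'EOF'
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 800
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 950
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000
EOF
# The usual target is p99 < 10ms, so more than 1% of fsyncs landing above
# the le="0.01" bucket means the disk is too slow for etcd.
awk '/le="0.01"/ {fast=$NF} /le="\+Inf"/ {total=$NF}
     END { printf "fsync_over_10ms=%.1f%%\n", 100*(total-fast)/total }' etcd-metrics.txt
```

Here 5% of fsyncs exceed 10ms, well past the 1% that a p99 < 10ms target allows.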
3. Check etcd database size and compaction status
Large etcd databases slow down all operations, especially range queries and watch operations. Check `etcd_mvcc_db_total_size_in_bytes` — if it's over 2GB, you need compaction. Also check `etcd_mvcc_db_total_size_in_use_in_bytes` vs total size to see fragmentation. Enable automatic compaction if not already configured, and consider setting shorter TTLs on events and other ephemeral objects. A bloated database amplifies disk I/O issues.
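The fragmentation check can be sketched as below. The byte values are fabricated samples; on a live cluster they would come from the two metrics above (or, assuming a recent etcdctl, the `dbSize` / `dbSizeInUse` fields of `etcdctl endpoint status -w json`).

```shell
# Fabricated samples of the two size gauges.
total=2684354560    # etcd_mvcc_db_total_size_in_bytes  (~2.5 GiB)
in_use=1073741824   # etcd_mvcc_db_total_size_in_use_in_bytes (1 GiB)

# Percentage of the backend file that is dead space.
frag=$(( 100 * (total - in_use) / total ))
echo "fragmentation=${frag}%"

# If fragmentation stays high after compaction, reclaim space per member,
# off-peak (defrag briefly blocks the member):
#   etcdctl compaction <revision>
#   etcdctl defrag
```

Compaction removes old revisions but does not shrink the file on disk; only defragmentation returns the space, which is why both numbers matter.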
4. Look for frequent leader elections
Check `etcd_server_leader_changes_seen_total` — if you're seeing more than 3 leader changes per hour, that's a sign of cluster instability. Each election disrupts write operations for 1-2 seconds. Watch for `etcd_server_has_leader` dropping to 0. As the insight on frequent leader elections explains, this is usually a symptom of underlying disk latency, network issues, or resource pressure, not a root cause itself. If you're seeing elections, the root cause is likely already identified in steps 2-3.
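Since `etcd_server_leader_changes_seen_total` is a monotonic counter, the per-hour rate is just the difference between two samples. A sketch with fabricated values:

```shell
# Fabricated counter samples taken one hour apart.
an_hour_ago=12
now=17
elections=$(( now - an_hour_ago ))
echo "elections_last_hour=${elections}"   # >3 per hour -> check disk and network first
```

Five elections in an hour is well past the ~3/hour threshold, so this cluster would warrant going back over the disk and network checks.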
5. Check network latency between etcd peers
etcd uses Raft consensus, so network latency between peers directly impacts write latency. Check `etcd_network_peer_round_trip_time_seconds` — if p99 is above 50ms, network delays are contributing to performance issues. High network latency can also trigger leader elections. Look at `kubernetes_network_errors` for the etcd pods to identify packet loss or network instability. In cloud environments, ensure etcd nodes are in the same availability zone or have low-latency networking between zones.
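The Raft math behind this can be sketched directly: a write commits only after the leader hears from a quorum, so peer RTT sets a hard floor on write latency. The RTT values below are fabricated samples for a hypothetical 5-member cluster.

```shell
# Sample p99 RTTs (ms) from the leader to its four followers.
rtts_ms="2 15 48 65"

# Quorum in a 5-member cluster is 3 members: the leader plus its 2 fastest
# followers, so the 2nd-fastest follower RTT is the commit-latency floor.
floor=$(printf '%s\n' $rtts_ms | sort -n | sed -n '2p')
echo "quorum_rtt_floor_ms=${floor}"
```

This is also why keeping a slow, far-away member is relatively harmless in a 5-member cluster (it rarely lands in the quorum) but directly hurts writes in a 3-member one.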
6. Review CPU and memory pressure on etcd nodes
Finally, check `kubernetes_cpu_usage` and `kubernetes_memory_usage` for etcd pods. etcd is CPU-intensive during compaction and memory-intensive with many watchers. If CPU is consistently above 70% or memory is near limits, resource contention could be causing performance degradation. In practice, however, this is rarely the root cause unless you have thousands of nodes — disk I/O is almost always the culprit. If resources look fine but you're still seeing issues, circle back to disk latency with more detailed storage backend metrics.
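A hedged sketch of the threshold check, using fabricated utilization samples (on a live cluster the readings would come from something like `kubectl -n kube-system top pod -l component=etcd`, then compared against the pod's limits):

```shell
# Fabricated samples: percent of the pod's CPU and memory limits in use.
cpu_pct=45
mem_pct=95

# Thresholds from the step above: sustained CPU > 70% or memory near limits.
if [ "$cpu_pct" -gt 70 ]; then echo "cpu pressure on etcd"; fi
if [ "$mem_pct" -gt 90 ]; then echo "memory pressure on etcd"; fi
```

Here only the memory check fires; per the step above, if neither fires and latency persists, return to the disk I/O metrics.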


Related Insights

Frequent Leader Elections Indicate Cluster Instability (warning)
Multiple leader changes within a short timeframe signal network issues, resource contention, or disk latency problems. Each election disrupts write operations and degrades cluster performance.
API Server Request Latency Impacting Kubernetes Operations (warning)
When etcd experiences performance issues, the Kubernetes API server latency increases, causing kubectl timeouts, scheduler delays, and controller lag that impacts cluster operations.
Disk I/O Bottleneck Masquerading as Application Slowness (critical)
Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
High Latency from Slow API Server or Scheduler (warning)
Elevated apiserver_request_duration_seconds and apiserver_request_total errors indicate API server overload or scheduler bottlenecks, causing slow pod scheduling, kubectl timeouts, and degraded cluster responsiveness.


Monitoring Interfaces

Kubernetes, Datadog