High disk fsync latency (>10ms) causes etcd performance degradation, leading to slow API responses, leader elections, and cluster instability. This is the most critical performance bottleneck for etcd.
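The 10ms fsync threshold maps directly to a Prometheus alerting rule; a minimal sketch, assuming the standard etcd_disk_wal_fsync_duration_seconds_bucket histogram is scraped (the alert name, rate window, and severity label are illustrative):

```yaml
# Fire when p99 WAL fsync latency stays above 10ms (0.01s) for 10 minutes.
- alert: EtcdHighFsyncLatency
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "etcd p99 WAL fsync latency above 10ms"
```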
When the etcd database reaches its space quota (default 2GB), it raises a NOSPACE alarm and rejects all writes, leaving the cluster read-only until the alarm is cleared.
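It is better to alert before the quota is hit; a sketch of a Prometheus rule warning at 90% of the configured quota (the alert name, ratio, and hold time are illustrative):

```yaml
# Warn before NOSPACE fires: database has consumed 90% of the backend quota.
- alert: EtcdDatabaseQuotaNearFull
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.90
  for: 5m
  labels:
    severity: warning
```

Recovery from an active NOSPACE alarm typically involves compacting old revisions, defragmenting each member, and then disarming the alarm with `etcdctl alarm disarm`.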
etcd maintains a full history of all changes through MVCC revisions. Without regular compaction, the database grows indefinitely, consuming disk space and degrading performance.
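Compaction can be automated with etcd's built-in flags; an illustrative fragment of an etcd container spec (the 1h retention value is an example, not a recommendation):

```yaml
# Enable periodic auto-compaction so MVCC history does not grow unbounded.
containers:
  - name: etcd
    command:
      - etcd
      - --auto-compaction-mode=periodic    # compact on a time schedule
      - --auto-compaction-retention=1h     # keep 1 hour of history (example value)
```

Note that compaction reclaims logical space only; physical disk space is returned by a separate defragment step.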
Multiple leader changes within a short timeframe signal network issues, resource contention, or disk latency problems. Each election disrupts write operations and degrades cluster performance.
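Election churn can be detected from the leader-change counter; a sketch of a Prometheus rule (the 3-changes-per-hour threshold is an illustrative convention, not from the text):

```yaml
# Fire on frequent leader turnover, which points at network, disk, or CPU trouble.
- alert: EtcdFrequentLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
  labels:
    severity: warning
```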
High network round-trip time between etcd cluster members (>50ms) delays Raft consensus operations, causing proposal timeouts, failed heartbeats, and potential leader elections.
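The 50ms RTT threshold can be expressed against etcd's peer round-trip histogram; a minimal sketch (window and severity illustrative):

```yaml
# Fire when p99 peer round-trip time exceeds 50ms (0.05s).
- alert: EtcdHighPeerRoundTripTime
  expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
```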
Growing etcd_server_proposals_pending count indicates that write proposals are queuing up faster than they can be committed, suggesting performance bottlenecks or consensus issues.
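Since etcd_server_proposals_pending is a gauge, a sustained non-trivial value is the signal; a sketch (the threshold of 10 and the 10m hold are illustrative assumptions):

```yaml
# A persistently non-empty proposal queue suggests commits cannot keep up.
- alert: EtcdProposalsPending
  expr: etcd_server_proposals_pending > 10
  for: 10m
  labels:
    severity: warning
```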
High etcd_server_slow_apply_total count indicates that applying committed entries to the state machine is taking too long (>100ms), typically due to disk latency or large database size.
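Because etcd_server_slow_apply_total is a cumulative counter, alert on its rate of increase rather than its absolute value; a sketch (window illustrative):

```yaml
# Fire when slow applies (>100ms) are still occurring in the recent window.
- alert: EtcdSlowApplies
  expr: increase(etcd_server_slow_apply_total[10m]) > 0
  for: 10m
  labels:
    severity: warning
```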
When etcd experiences performance issues, the Kubernetes API server latency increases, causing kubectl timeouts, scheduler delays, and controller lag that impacts cluster operations.
When cilium_kvstore_quorum_errors_datadog increments, the cluster has lost consensus with the backing KVStore (etcd/consul). This prevents policy propagation, service discovery updates, and can cause cluster-wide connectivity failures as agents cannot sync state.
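Any increment of this counter is actionable; a sketch of a matching rule, using the metric name exactly as given in the text (its labels and availability depend on the Cilium deployment):

```yaml
# Fire as soon as quorum errors against the backing KVStore are observed.
- alert: CiliumKvstoreQuorumErrors
  expr: increase(cilium_kvstore_quorum_errors_datadog[5m]) > 0
  labels:
    severity: critical
```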
Periodic etcd snapshots can cause temporary latency spikes, especially on HDDs, as the snapshot process competes for disk I/O with normal operations.
When etcd pods crash or enter crash-loop states due to data corruption, PVC issues, or member ID problems, Milvus loses its metadata store, causing all coordinator components to fail and bringing down the entire cluster.
High latency in pulsar_metadata_store_ops_latency_ms_bucket indicates ZooKeeper is struggling with metadata operations (topic creation, subscription management, ownership changes). This blocks broker operations and causes cascading failures across the cluster.
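The metric name indicates a millisecond-scale histogram, so a quantile rule applies; a sketch (the 100ms threshold is an illustrative assumption):

```yaml
# Fire when p99 metadata-store operation latency exceeds 100ms.
- alert: PulsarMetadataStoreSlow
  expr: histogram_quantile(0.99, rate(pulsar_metadata_store_ops_latency_ms_bucket[5m])) > 100
  for: 10m
  labels:
    severity: warning
```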
etcd database size (etcd_mvcc_db_total_size_in_use_in_bytes) exceeding 2GB indicates control plane stress and can cause API server slowness and cluster instability.
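The 2GB size threshold translates directly to a gauge comparison; a minimal sketch (alert name and severity illustrative):

```yaml
# Fire when the in-use database size crosses 2 GiB (2147483648 bytes).
- alert: EtcdDatabaseSizeHigh
  expr: etcd_mvcc_db_total_size_in_use_in_bytes > 2147483648
  for: 15m
  labels:
    severity: warning
```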
A high etcd_server_read_index_slow count, or read-index failures, indicates that read index operations (required for linearizable reads) are timing out, degrading read performance and weakening consistency guarantees.
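A rule sketch using the metric name as given in the text; note that exact metric names vary across etcd versions (recent releases expose counters such as etcd_server_slow_read_indexes_total), so the expression should be adapted to the metrics actually scraped:

```yaml
# Fire when slow read-index operations are still occurring in the recent window.
- alert: EtcdSlowReadIndexes
  expr: increase(etcd_server_read_index_slow[5m]) > 0
  labels:
    severity: warning
```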