Cilium · etcd

KVStore Quorum Loss Cascade

critical · reliability
Updated Sep 2, 2025

When cilium_kvstore_quorum_errors_datadog increments, the Cilium agents have lost quorum with the backing KVStore (etcd or Consul). This blocks policy propagation and service-discovery updates, and can cause cluster-wide connectivity failures because agents can no longer sync state.

How to detect:

Monitor cilium_kvstore_quorum_errors_datadog for non-zero values. Check cilium_kvstore_initial_sync_completed to verify agents have completed initial sync. High cilium_kvstore_events_queue_seconds_datadog indicates events are backing up due to KVStore unavailability.

Recommended action:

- Verify etcd cluster health and network connectivity to the etcd endpoints.
- Check 'cilium status' for the KVStore connection status.
- Review etcd logs for consensus failures or leadership elections.
- Ensure etcd has sufficient resources and is not experiencing split-brain.
- Consider implementing etcd monitoring with alerting on leader elections and member health.
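One quick split-brain indicator is whether all etcd members agree on the leader. The sketch below checks a simplified, hypothetical rendering of per-member status; in practice you would feed it the output of a real etcdctl endpoint status query (the actual output format differs, so the parsing here is illustrative only).

```shell
# Hypothetical per-member status; on a real cluster, derive this from:
#   etcdctl endpoint status --cluster
# On a healthy cluster every member reports the same leader ID.
status='member=10.0.0.1 leader=abc123
member=10.0.0.2 leader=abc123
member=10.0.0.3 leader=def456'

# Count distinct leader IDs across members
leaders=$(echo "$status" | awk '{print $2}' | sort -u | wc -l)

if [ "$leaders" -gt 1 ]; then
  echo "WARNING: etcd members disagree on leader (possible split-brain)"
fi
```

If members disagree, inspect etcd logs for repeated leader elections and confirm network connectivity between members before restarting anything.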