etcd Cluster Instability and Cascading Failures
Severity: critical

When etcd pods crash or enter crash-loop states due to data corruption, PVC issues, or member ID conflicts, Milvus loses its metadata store; every coordinator component then fails, bringing down the entire cluster.
Monitor etcd pod health, restart counts, and PVC status in Kubernetes, and track etcd client connection errors in Milvus logs. Alert immediately on etcd pod crashes, PVCs stuck in Pending, or repeated restarts. Watch for rootcoord, querycoord, and datacoord connection failures that follow etcd issues.
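A minimal sketch of scanning coordinator logs for etcd client errors. The sample log lines and their format are illustrative only (real Milvus log formatting differs); in practice you would pipe `kubectl logs` from the rootcoord/querycoord/datacoord pods into the same filter.

```shell
# Write a hypothetical sample of coordinator log lines (illustrative format only).
# In production, replace the heredoc with:
#   kubectl -n milvus logs deploy/milvus-rootcoord | ...
cat <<'EOF' > /tmp/milvus_sample.log
[WARN] rootcoord: etcd client connection lost, retrying
[INFO] datacoord: segment flush complete
[ERROR] querycoord: etcd request timed out: context deadline exceeded
EOF

# Flag lines where an etcd error keyword follows an etcd mention.
grep -icE 'etcd.*(lost|refused|deadline)' /tmp/milvus_sample.log
```

The same pattern can feed an alerting pipeline: a nonzero match count on recent logs is a cheap early signal that coordinators are about to lose their metadata store.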
Pre-configure a StorageClass for the etcd PVCs before deployment, and take automated backups (snapshots) of the etcd data directory at /bitnami/etcd/data. For member_id errors, delete the corrupted /bitnami/etcd/data/member_id file. When multiple pods are crash-looping, scale the StatefulSet to 1 replica, delete the orphaned PVCs, restore from backup, then scale back up. Monitor etcd disk space and alert above 70% utilization.
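The remediation steps above can be sketched as a runbook script. All resource names here (namespace `milvus`, StatefulSet `my-etcd`, PVC names, the snapshot path) are assumptions for illustration; adjust them to your deployment. The script defaults to a dry run that only prints each command, so nothing is executed against a cluster unless you opt in with RUN=1.

```shell
#!/bin/sh
# Dry-run sketch of the etcd recovery sequence. Set RUN=1 to execute for real;
# by default each command is only echoed with a "+ " prefix.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

# For member_id errors: remove the corrupted file on the affected pod.
run kubectl -n milvus exec my-etcd-0 -- rm /bitnami/etcd/data/member_id

# Multi-pod crash recovery:
# 1. Scale the etcd StatefulSet down to a single member.
run kubectl -n milvus scale statefulset my-etcd --replicas=1
# 2. Delete the PVCs of the removed members (indices 1 and 2 for a 3-node cluster).
run kubectl -n milvus delete pvc data-my-etcd-1 data-my-etcd-2
# 3. Restore the surviving member's data dir from a backup snapshot
#    (snapshot path is hypothetical; use your actual backup location).
run kubectl -n milvus exec my-etcd-0 -- etcdctl snapshot restore \
    /bitnami/etcd/snapshots/latest.db --data-dir /bitnami/etcd/data
# 4. Scale back to the full replica count.
run kubectl -n milvus scale statefulset my-etcd --replicas=3
```

Keeping the destructive steps behind a dry-run wrapper makes the runbook safe to rehearse; the printed plan can be reviewed before re-running with RUN=1.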