
etcd Cluster Instability Causing Cascading Failures

Severity: critical | Category: reliability | Updated Jan 1, 2025

When etcd pods crash or enter crash loops due to data corruption, PVC problems, or stale member IDs, Milvus loses its metadata store. All coordinator components then fail, bringing down the entire cluster.

How to detect:

- Monitor etcd pod health, restart counts, and PVC status in Kubernetes.
- Track etcd client connection errors in Milvus logs.
- Alert immediately on etcd pod crashes, PVCs stuck in Pending, or repeated restarts.
- Watch for rootcoord, querycoord, and datacoord connection failures that follow etcd issues.
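The checks above can be sketched as a single shell helper. This is a hedged example, not the official Milvus tooling: the namespace (milvus), the etcd label (app.kubernetes.io/name=etcd), and the rootcoord deployment name (milvus-rootcoord) are assumptions that follow common Helm-chart conventions; adjust them to your deployment.

```shell
# Sketch: surface the etcd health signals described above.
# Assumed names: namespace "milvus", label app.kubernetes.io/name=etcd,
# deployment "milvus-rootcoord" -- verify against your install.
check_etcd_health() {
  ns="${1:-milvus}"

  # Pod phase and restart counts for each etcd member
  kubectl -n "$ns" get pods -l app.kubernetes.io/name=etcd \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount

  # PVCs stuck in Pending usually mean StorageClass/provisioning trouble
  kubectl -n "$ns" get pvc -l app.kubernetes.io/name=etcd

  # Recent etcd client errors in coordinator logs
  kubectl -n "$ns" logs deploy/milvus-rootcoord --tail=200 2>/dev/null \
    | grep -iE 'etcd|connection refused' || true
}
```

Run it periodically (or wire the same queries into your alerting system) so etcd trouble is caught before the coordinators start failing.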

Recommended action:

- Pre-configure a StorageClass for the etcd PVCs before deployment.
- Automate backups of etcd data using snapshots of /bitnami/etcd/data.
- For member_id errors, delete the corrupted /bitnami/etcd/data/member_id file.
- For multi-pod crashes, scale the StatefulSet to 1 replica, delete the affected PVCs, restore from backup, then scale back up.
- Monitor etcd disk space and alert at >70% utilization.
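The recovery steps above can be sketched as shell helpers. This is a hedged outline, not a tested runbook: the namespace (milvus), the StatefulSet name (my-release-etcd), and the PVC naming (data-&lt;sts&gt;-&lt;ordinal&gt;) are assumptions based on the Bitnami etcd chart commonly used by Milvus Helm installs; the /bitnami/etcd/data path comes from the text above. Verify every name against your cluster before running anything destructive.

```shell
# Assumed names: namespace "milvus", StatefulSet "my-release-etcd",
# Bitnami data path /bitnami/etcd/data -- adjust to your deployment.
NS="milvus"
STS="my-release-etcd"

# 1) Backup: take an etcd snapshot from a healthy member.
backup_etcd() {
  kubectl -n "$NS" exec "${STS}-0" -- \
    etcdctl snapshot save /bitnami/etcd/data/snapshot.db
}

# 2) member_id error: remove the stale file, then restart the pod
#    so it re-registers with the cluster.
fix_member_id() {
  pod="$1"
  kubectl -n "$NS" exec "$pod" -- rm -f /bitnami/etcd/data/member_id
  kubectl -n "$NS" delete pod "$pod"
}

# 3) Multi-pod crash: scale to a single member, delete the PVCs of the
#    crashed members, restore from backup, then scale back out.
recover_cluster() {
  kubectl -n "$NS" scale statefulset "$STS" --replicas=1
  kubectl -n "$NS" delete pvc "data-${STS}-1" "data-${STS}-2"
  # restore the snapshot into member 0 here, then:
  kubectl -n "$NS" scale statefulset "$STS" --replicas=3
}
```

Deleting PVCs and restoring snapshots are destructive operations; take a fresh backup first and run the steps one at a time.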