
etcd Cluster Instability Causing Cascading Failures

Severity: critical | Category: reliability | Updated Jan 1, 2025

When etcd pods crash or enter crash loops due to data corruption, PVC problems, or stale member IDs, Milvus loses its metadata store. All coordinator components then fail, bringing down the entire cluster.

How to detect:

- Monitor etcd pod health, restart counts, and PVC status in Kubernetes.
- Track etcd client connection errors in Milvus logs.
- Alert immediately on etcd pod crashes, PVCs stuck in Pending, or repeated restarts.
- Watch for rootcoord, querycoord, and datacoord connection failures that follow etcd issues.
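The checks above can be sketched as a single shell helper. This is a hedged example, not the official Milvus tooling: the namespace (milvus), the etcd label (app.kubernetes.io/name=etcd), and the rootcoord deployment name (milvus-rootcoord) are assumptions that follow common Helm-chart conventions; adjust them to your deployment.

```shell
# Sketch: surface the etcd health signals described above.
# Assumed names: namespace "milvus", label app.kubernetes.io/name=etcd,
# deployment "milvus-rootcoord" -- verify against your install.
check_etcd_health() {
  ns="${1:-milvus}"

  # Pod phase and restart counts for each etcd member
  kubectl -n "$ns" get pods -l app.kubernetes.io/name=etcd \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount

  # PVCs stuck in Pending usually mean StorageClass/provisioning trouble
  kubectl -n "$ns" get pvc -l app.kubernetes.io/name=etcd

  # Recent etcd client errors in coordinator logs
  kubectl -n "$ns" logs deploy/milvus-rootcoord --tail=200 2>/dev/null \
    | grep -iE 'etcd|connection refused' || true
}
```

Run it periodically (or wire the same queries into your alerting system) so etcd trouble is caught before the coordinators start failing.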

Recommended action:

- Pre-configure a StorageClass for the etcd PVCs before deployment.
- Automate backups of etcd data using snapshots of /bitnami/etcd/data.
- For member_id errors, delete the corrupted /bitnami/etcd/data/member_id file.
- For multi-pod crashes, scale the StatefulSet to 1 replica, delete the affected PVCs, restore from backup, then scale back up.
- Monitor etcd disk space and alert at >70% utilization.
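The recovery steps above can be sketched as shell helpers. This is a hedged outline, not a tested runbook: the namespace (milvus), the StatefulSet name (my-release-etcd), and the PVC naming (data-&lt;sts&gt;-&lt;ordinal&gt;) are assumptions based on the Bitnami etcd chart commonly used by Milvus Helm installs; the /bitnami/etcd/data path comes from the text above. Verify every name against your cluster before running anything destructive.

```shell
# Assumed names: namespace "milvus", StatefulSet "my-release-etcd",
# Bitnami data path /bitnami/etcd/data -- adjust to your deployment.
NS="milvus"
STS="my-release-etcd"

# 1) Backup: take an etcd snapshot from a healthy member.
backup_etcd() {
  kubectl -n "$NS" exec "${STS}-0" -- \
    etcdctl snapshot save /bitnami/etcd/data/snapshot.db
}

# 2) member_id error: remove the stale file, then restart the pod
#    so it re-registers with the cluster.
fix_member_id() {
  pod="$1"
  kubectl -n "$NS" exec "$pod" -- rm -f /bitnami/etcd/data/member_id
  kubectl -n "$NS" delete pod "$pod"
}

# 3) Multi-pod crash: scale to a single member, delete the PVCs of the
#    crashed members, restore from backup, then scale back out.
recover_cluster() {
  kubectl -n "$NS" scale statefulset "$STS" --replicas=1
  kubectl -n "$NS" delete pvc "data-${STS}-1" "data-${STS}-2"
  # restore the snapshot into member 0 here, then:
  kubectl -n "$NS" scale statefulset "$STS" --replicas=3
}
```

Deleting PVCs and restoring snapshots are destructive operations; take a fresh backup first and run the steps one at a time.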