Snapshot Operations Causing Latency Spikes

info

latency

Updated Jun 4, 2025

Periodic etcd snapshots can cause temporary latency spikes, especially on HDDs, as the snapshot process competes for disk I/O with normal operations.

Sources

Performance (etcd.io)

Technologies:

etcd

Symptoms of this issue are visible in etcd metrics and logs.

How to detect:

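Monitor the etcd_snap_fsync_duration_seconds and etcd_snap_db_save_total_duration_seconds histograms for latency spikes. Compare the times at which snapshots occur with the times at which API server latency increases; spikes that repeatedly begin shortly after a snapshot point to this issue.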
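The correlation step can be sketched in a few lines of Python. This is a minimal illustration, not an official etcd tool. It assumes you have already exported two lists of timestamps in seconds: when snapshots happened and when API server latency spikes began. The function name, the 30-second window, and the sample data are all hypothetical.

```python
from bisect import bisect_right


def fraction_of_spikes_after_snapshots(snapshot_times, spike_times, window_seconds=30.0):
    """Return the share of latency spikes that begin within
    window_seconds after the most recent snapshot."""
    if not spike_times:
        return 0.0
    snapshots = sorted(snapshot_times)
    hits = 0
    for spike in spike_times:
        # Index of the latest snapshot at or before this spike (-1 if none).
        i = bisect_right(snapshots, spike) - 1
        if i >= 0 and spike - snapshots[i] <= window_seconds:
            hits += 1
    return hits / len(spike_times)


# Hypothetical data: snapshot start times and latency spike start times.
snapshots = [0.0, 200.0, 400.0, 600.0]
spikes = [5.0, 130.0, 203.5, 410.0]

share = fraction_of_spikes_after_snapshots(snapshots, spikes)
print(f"{share:.0%} of spikes began within 30 s of a snapshot")
```

A high share suggests snapshots are a likely contributor. A low share suggests the latency has another cause. The window size is a judgment call and should roughly match how long a snapshot takes on the disk in question.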

Recommended action:

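Use SSDs instead of HDDs to minimize snapshot impact. Adjust the snapshot count threshold (--snapshot-count) to balance snapshot frequency against database recovery time: a higher value triggers snapshots less often but leaves more entries to replay on recovery, and a lower value does the reverse. If snapshot windows are predictable, schedule high-priority workloads outside them.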
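To get a feel for how --snapshot-count affects snapshot frequency, the following back-of-the-envelope sketch estimates the time between snapshots. It assumes a steady write rate and that each write counts as one entry toward the threshold. The helper name and the sample numbers are hypothetical, not etcd defaults or measurements.

```python
def seconds_between_snapshots(snapshot_count, writes_per_second):
    """Estimate the interval between snapshots, assuming a steady write
    rate where each write counts as one entry toward the threshold."""
    if writes_per_second <= 0:
        raise ValueError("writes_per_second must be positive")
    return snapshot_count / writes_per_second


# Hypothetical cluster handling 500 writes per second.
for count in (10_000, 100_000):
    interval = seconds_between_snapshots(count, writes_per_second=500)
    print(f"--snapshot-count={count}: one snapshot roughly every {interval:.0f} s")
```

Real write rates fluctuate, so treat the result as an order-of-magnitude guide for choosing a threshold and for guessing when snapshot windows are likely to fall.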