Flapping OSDs Indicate Network or Hardware Instability
warning · reliability · Updated Jan 7, 2026
OSDs that repeatedly flip between up and down (flapping) because of failed heartbeats point to underlying network problems, overloaded OSDs, or failing hardware. Each transition triggers peering and recovery, so sustained flapping keeps the cluster in continuous recovery and severely degrades performance.
How to detect:
Monitor the cluster logs for 'heartbeat_check: no reply from osd.X' and 'wrongly marked me down' messages. Watch for OSDs that repeatedly transition between the up and down states. Track the ceph_count_up_osds and ceph_count_in_osds metrics; frequent changes in either one suggest flapping.
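The log check above can be sketched as a small Python scan. This is a minimal illustration, not a definitive parser: it matches only the two phrases quoted above, it assumes a peer address may optionally appear between "from" and "osd.N", and the function names and the threshold of 3 are hypothetical.

```python
import re
from collections import Counter

# Phrases taken from the detection guidance. The optional token between
# "from" and "osd.N" is an assumption: some log lines may carry a peer
# address in that position.
NO_REPLY = re.compile(r"heartbeat_check: no reply from (?:\S+ )?osd\.(\d+)")
WRONGLY_DOWN = "wrongly marked me down"


def summarize_flap_signals(lines):
    """Count heartbeat failures per unresponsive peer, plus
    'wrongly marked me down' events seen in the given log lines."""
    no_reply = Counter()
    wrongly_down = 0
    for line in lines:
        match = NO_REPLY.search(line)
        if match:
            no_reply[f"osd.{match.group(1)}"] += 1
        if WRONGLY_DOWN in line:
            wrongly_down += 1
    return no_reply, wrongly_down


def suspects(no_reply, threshold=3):
    """Return peers reported unresponsive at least `threshold` times.
    The default threshold is an arbitrary illustrative cutoff."""
    return sorted(osd for osd, count in no_reply.items() if count >= threshold)
```

To use it, pass an open log file (or any iterable of lines) to summarize_flap_signals and review the peers returned by suspects. A peer that many different OSDs report as unresponsive is more likely to have a local problem than a cluster-wide one.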
Recommended action:
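Test network latency between OSD nodes (round-trip time should be under 1 ms on the cluster network). Check for interface errors and dropped packets with netstat or ip (for example, netstat -i or ip -s link). Verify that the OSD node is not overloaded (high CPU or memory usage). While investigating the root cause, temporarily increase osd_heartbeat_grace so that peers tolerate slower heartbeats, or mon_osd_down_out_interval so that a down OSD is not marked out as quickly, and restore the original values once the cause is fixed.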
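The latency check can be automated with a short Python helper. This is a sketch under stated assumptions: it expects the "rtt min/avg/max/mdev" summary line printed by Linux iputils ping (other ping implementations format this differently), it compares the average round-trip time against the 1 ms guideline, and the function names are hypothetical.

```python
import re

# Summary line as printed by Linux iputils ping (assumed format), e.g.
# "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms"
RTT_LINE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_rtt(ping_output):
    """Return the min/avg/max/mdev round-trip times in milliseconds,
    or None if no summary line is found."""
    match = RTT_LINE.search(ping_output)
    if match is None:
        return None
    lo, avg, hi, mdev = (float(value) for value in match.groups())
    return {"min": lo, "avg": avg, "max": hi, "mdev": mdev}


def latency_ok(ping_output, limit_ms=1.0):
    """True only if a summary line exists and the average RTT is below
    limit_ms. Missing output is treated as a failure, not a pass."""
    rtt = parse_rtt(ping_output)
    return rtt is not None and rtt["avg"] < limit_ms
```

Run ping -c 10 against each peer's cluster-network address from every OSD node and pass the captured output to latency_ok. Comparing the max value as well as the average can expose intermittent spikes that an average hides.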