Ceph

Flapping OSDs Indicate Network or Hardware Instability

warning
reliability
Updated Jan 7, 2026

OSDs that are repeatedly marked down after failed heartbeats and then report themselves up again (flapping) indicate underlying network problems, overloaded OSDs, or failing hardware. Each transition triggers peering and recovery, so sustained flapping keeps the cluster in continuous recovery and severely impacts performance.

How to detect:

Monitor cluster logs for 'heartbeat_check: no reply from osd.X' and 'wrongly marked me down' messages. Watch for OSDs that repeatedly transition between the up and down states. Track ceph_count_up_osds and ceph_count_in_osds over time; frequent changes in either value point to flapping.
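The log check above can be sketched as a small Python scan that counts both message types per OSD. This is a minimal illustration, not a Ceph tool: the regular expressions assume line shapes that vary between Ceph releases, and the function names and the threshold of 3 are hypothetical choices.

```python
import re
from collections import Counter

# Assumed line shapes; real Ceph log formats differ between releases.
# The optional group tolerates an address printed before the OSD name.
NO_REPLY = re.compile(r"heartbeat_check: no reply from (?:\S+ )?osd\.(\d+)")
# Assumes the reporting OSD name appears earlier on the same line.
WRONGLY_DOWN = re.compile(r"osd\.(\d+)\b.*wrongly marked me down")


def count_flap_signals(lines):
    """Count heartbeat-failure and wrongly-marked-down messages per OSD id."""
    no_reply = Counter()
    wrongly_down = Counter()
    for line in lines:
        match = NO_REPLY.search(line)
        if match:
            no_reply[int(match.group(1))] += 1
            continue
        match = WRONGLY_DOWN.search(line)
        if match:
            wrongly_down[int(match.group(1))] += 1
    return no_reply, wrongly_down


def suspects(no_reply, wrongly_down, threshold=3):
    """Return OSD ids whose combined message count reaches the threshold."""
    totals = no_reply + wrongly_down
    return sorted(osd for osd, count in totals.items() if count >= threshold)
```

Feeding it the lines of a cluster log (for example, `count_flap_signals(open(path))` with a path of your choosing) yields two counters; OSDs returned by `suspects` are candidates for closer inspection, not a confirmed diagnosis.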

Recommended action:

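Test network latency between OSD nodes (round-trip time should be under 1 ms on the cluster network). Check interface error and drop counters with netstat or ip (for example, netstat -i or ip -s link). Verify that the OSD nodes are not overloaded (sustained high CPU or memory use). While investigating the root cause, temporarily increase osd_heartbeat_grace or mon_osd_down_out_interval, and restore the original values once the cause is fixed.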
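As a rough sketch of the latency check and the temporary tuning, the Python below flags peers whose median round-trip time reaches the 1 ms limit and builds the matching `ceph config set` and `ceph config rm` command strings without running them. The helper names, the peer names, and the values 30 and 900 are illustrative assumptions, not recommended settings; collect the latency samples with whatever tool you already use.

```python
import statistics

LATENCY_LIMIT_MS = 1.0  # cluster-network target stated in the guidance above


def slow_links(rtt_ms_by_peer, limit_ms=LATENCY_LIMIT_MS):
    """Return peers whose median round-trip time is at or above the limit."""
    return sorted(
        peer
        for peer, samples in rtt_ms_by_peer.items()
        if samples and statistics.median(samples) >= limit_ms
    )


def temporary_overrides(heartbeat_grace=30, down_out_interval=900):
    """Build command strings to apply and later revert temporary settings.

    The default values here are illustrative assumptions only.
    """
    apply = [
        # Set globally so that both monitors and OSDs read the grace period.
        f"ceph config set global osd_heartbeat_grace {heartbeat_grace}",
        f"ceph config set mon mon_osd_down_out_interval {down_out_interval}",
    ]
    revert = [
        "ceph config rm global osd_heartbeat_grace",
        "ceph config rm mon mon_osd_down_out_interval",
    ]
    return apply, revert
```

Keeping the revert commands next to the apply commands makes it harder to leave the relaxed settings in place after the investigation ends, since longer grace periods also delay detection of OSDs that have genuinely failed.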