Flapping OSDs Indicate Network or Hardware Instability

warning

reliability

Updated Jan 7, 2026

OSDs that repeatedly flip between up and down (flapping) because of failed heartbeats point to underlying network problems, overloaded OSDs, or failing hardware. Each transition triggers peering and recovery, so sustained flapping keeps the cluster in continuous recovery and severely degrades performance.

Sources

How to Troubleshoot Ceph Performance Issues - OneUptime (oneuptime.com)

Troubleshooting Guide, Red Hat Ceph Storage 3 | Red Hat Customer Portal (access.redhat.com)

Technologies:

Ceph

Symptoms of this issue are visible in Ceph metrics and logs.

How to detect:

Monitor the cluster and OSD logs for 'heartbeat_check: no reply from osd.X' and 'wrongly marked me down' messages. Watch for the same OSDs moving between up and down repeatedly within a short period. Check ceph_count_up_osds and ceph_count_in_osds for frequent changes; in a stable cluster both values stay flat.
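The detection steps above can be sketched in Python. This is a minimal illustration, not a definitive parser: the function names scan_flapping_signals and count_state_changes are hypothetical, the sample log lines are made up for the example (real prefixes and timestamps vary by Ceph release), and the metric samples are assumed to be successive readings of an up-OSD count such as ceph_count_up_osds.

```python
import re
from collections import Counter

# Matches the target of a failed heartbeat, with or without an address before "osd.N".
NO_REPLY = re.compile(r"heartbeat_check: no reply from (?:\S+ )?osd\.(\d+)")
WRONGLY_DOWN = "wrongly marked me down"


def scan_flapping_signals(lines):
    """Return (missed heartbeats per target OSD id, number of 'wrongly marked me down' lines)."""
    missed = Counter()
    wrongly_down = 0
    for line in lines:
        match = NO_REPLY.search(line)
        if match:
            missed[int(match.group(1))] += 1
        if WRONGLY_DOWN in line:
            wrongly_down += 1
    return missed, wrongly_down


def count_state_changes(samples):
    """Count how often consecutive samples of an up-OSD count differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Illustrative lines only; not captured from a real cluster.
sample_log = [
    "osd.1 10 heartbeat_check: no reply from osd.2 since back ...",
    "osd.3 10 heartbeat_check: no reply from 10.0.0.5:6801 osd.2 since back ...",
    "osd.2 11 log [WRN] : map e11 wrongly marked me down",
    "osd.4 11 unrelated message",
]
missed, wrongly_down = scan_flapping_signals(sample_log)
print(missed.most_common(), wrongly_down)

# Hypothetical successive readings of the up-OSD count.
up_osd_samples = [12, 12, 11, 12, 11, 11, 12]
print(count_state_changes(up_osd_samples))
```

An OSD that dominates the missed-heartbeat counts, or an up-OSD count that changes many times within a short window, is a reasonable starting point for the checks under the recommended action. What counts as "many" is left to the operator; no threshold is implied by the sources.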

Recommended action:

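Test network latency between OSD nodes (it should be under 1 ms on the cluster network). Check the interfaces for errors and dropped packets with netstat or ip, for example 'netstat -i' or 'ip -s link'. Verify that the OSD nodes are not overloaded (high CPU or memory usage). While investigating the root cause, temporarily increase osd_heartbeat_grace (default 20 seconds) or mon_osd_down_out_interval (default 600 seconds) to reduce flapping and the rebalancing it triggers.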
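The latency check can be sketched as follows. This is a minimal example that assumes the summary line format printed by Linux iputils ping ('rtt min/avg/max/mdev = ... ms'); other ping implementations label the line differently. The helper names parse_rtt_ms and latency_ok are hypothetical, and the sample output is illustrative rather than taken from a real cluster.

```python
import re

# Summary line format assumed from Linux iputils ping.
RTT_SUMMARY = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_rtt_ms(ping_output):
    """Return (min, avg, max, mdev) in milliseconds, or None if no summary line is found."""
    match = RTT_SUMMARY.search(ping_output)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def latency_ok(ping_output, threshold_ms=1.0):
    """True when the average RTT is below the threshold; False if it is not or cannot be parsed."""
    rtt = parse_rtt_ms(ping_output)
    return rtt is not None and rtt[1] < threshold_ms


# Illustrative output only.
sample = (
    "5 packets transmitted, 5 received, 0% packet loss, time 4005ms\n"
    "rtt min/avg/max/mdev = 0.112/0.148/0.201/0.031 ms\n"
)
print(parse_rtt_ms(sample), latency_ok(sample))
```

The input text could come from running ping with a fixed packet count (for example 'ping -c 5' against a peer's cluster-network address) on each OSD node. A high max or mdev value alongside a low average may still be worth a look, since intermittent spikes are enough to miss heartbeats.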