Clock Skew Breaks Monitor Quorum and Cluster Operations

critical

reliability

Updated Jun 4, 2024

When clock drift between monitors exceeds 50 ms (0.05 s, the default mon_clock_drift_allowed), monitors mark each other as down, which breaks quorum and blocks cluster operations. Monitors then appear out of quorum even though they are otherwise healthy.

Sources

Troubleshooting Guide Red Hat Ceph Storage 3 | Red Hat Customer Portal (access.redhat.com)

Chapter 4. Troubleshooting Ceph Monitors - Red Hat Documentation (docs.redhat.com)

Chapter 4. Troubleshooting Ceph Monitors (docs.redhat.com)

Technologies:

Ceph

Symptoms of this issue are visible in Ceph metrics and logs

How to detect:

Check for 'clock skew' messages in the `ceph health detail` output or in the cluster logs. An affected monitor is reported as 'mon.X is down (out of quorum)' together with a 'clock skew X > max 0.05s' error. Verify that NTP or chrony is synchronized on every monitor node.
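The check above can be scripted. The following is a minimal sketch in Python that scans text captured from `ceph health detail` for clock skew lines and extracts the reported skew and limit. The function name `find_clock_skew` and the sample text are illustrative assumptions; the sample follows the message shape quoted above, but the exact line layout can differ between Ceph releases.

```python
import re

# Matches the monitor name, the measured skew, and the allowed maximum in a
# line such as "mon.a addr ... clock skew 0.08235s > max 0.05s".
# The line layout is an assumption based on the message quoted in this note.
SKEW_RE = re.compile(r"(mon\.\S+).*?clock skew (-?[\d.]+)s > max ([\d.]+)s")


def find_clock_skew(health_text):
    """Return a list of (monitor, skew_seconds, max_seconds) tuples."""
    findings = []
    for line in health_text.splitlines():
        match = SKEW_RE.search(line)
        if match:
            findings.append(
                (match.group(1), float(match.group(2)), float(match.group(3)))
            )
    return findings


# Illustrative sample only, not output captured from a real cluster.
SAMPLE = """\
mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)
mon.a addr 127.0.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
"""

if __name__ == "__main__":
    for name, skew, limit in find_clock_skew(SAMPLE):
        print(f"{name}: skew {skew}s exceeds limit {limit}s")
```

In practice the input would be the saved output of `ceph health detail` or a slice of the cluster log rather than the hard-coded sample.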

Recommended action:

Synchronize the clocks on all monitor nodes immediately using NTP or chrony. If the monitors rely on remote NTP servers that are affected by network issues, deploy a local NTP server. Verify network connectivity between the monitors. Do not change mon_clock_drift_allowed from its default without testing, because it affects cluster stability.
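To confirm that synchronization has taken effect on a node, the local offset reported by chrony can be compared against the drift limit. The sketch below parses the "System time" line of `chronyc tracking` output. The helper names, the sample text, and the choice of half the limit as a per-node budget are assumptions for illustration: Ceph measures skew between monitors, so a node's offset from its NTP source is only an approximation of what the monitors see.

```python
import re

# Default mon_clock_drift_allowed in seconds, as stated in this note.
DRIFT_ALLOWED = 0.05

# "System time     : 0.031200000 seconds fast of NTP time"
SYSTEM_TIME_RE = re.compile(
    r"^System time\s*:\s*([\d.]+) seconds (fast|slow) of NTP time",
    re.MULTILINE,
)


def chrony_offset_seconds(tracking_text):
    """Return the signed offset (positive means the local clock is ahead),
    or None if no "System time" line is present."""
    match = SYSTEM_TIME_RE.search(tracking_text)
    if match is None:
        return None
    offset = float(match.group(1))
    return offset if match.group(2) == "fast" else -offset


def within_budget(offset, budget=DRIFT_ALLOWED / 2):
    """Assumed rule of thumb: if every node stays within half the limit of
    the same reference, no pair of nodes can differ by more than the limit."""
    return abs(offset) <= budget


# Illustrative sample only; the hostname is hypothetical.
SAMPLE = """\
Reference ID    : C0A80001 (ntp.example.internal)
Stratum         : 3
System time     : 0.031200000 seconds fast of NTP time
Last offset     : +0.000412000 seconds
"""

if __name__ == "__main__":
    offset = chrony_offset_seconds(SAMPLE)
    status = "ok" if within_budget(offset) else "too far from NTP time"
    print(f"offset {offset:+.6f}s: {status}")
```

Running the same check on each monitor node after restarting the time service gives a quick indication of whether the clock skew warning should clear.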