Ceph

Clock Skew Breaks Monitor Quorum and Cluster Operations

critical
reliability
Updated Jun 4, 2024

When the clocks on monitor nodes drift apart by more than 0.05 s (50 ms, the default `mon_clock_drift_allowed`), monitors mark each other as down, which breaks quorum and blocks cluster operations. The affected monitors appear out of quorum even though the daemons themselves are healthy.

How to detect:

Look for 'clock skew' messages in `ceph health detail` output or in the cluster logs. An affected monitor is reported as 'mon.X is down (out of quorum)' together with a 'clock skew X > max 0.05s' error. Then verify that NTP or chrony is synchronized on every monitor node.
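
The check above can be scripted. The following is a minimal sketch that scans health text for the 'clock skew X > max 0.05s' pattern. The helper name `find_clock_skew`, the regular expression, and the sample text are illustrative assumptions; the exact wording of the message may differ between Ceph releases, so adjust the pattern to what your cluster prints.

```python
import re

# Pattern modeled on the 'clock skew X > max 0.05s' message quoted above.
# Assumption: the monitor name precedes the phrase and values end in 's'.
SKEW_RE = re.compile(
    r"(?P<mon>mon\.\S+)\s+clock skew\s+(?P<skew>-?\d+(?:\.\d+)?)s"
    r"\s*>\s*max\s+(?P<limit>\d+(?:\.\d+)?)s"
)


def find_clock_skew(health_text):
    """Return a list of (monitor, skew_seconds, max_seconds) tuples."""
    findings = []
    for line in health_text.splitlines():
        match = SKEW_RE.search(line)
        if match:
            findings.append(
                (
                    match.group("mon"),
                    float(match.group("skew")),
                    float(match.group("limit")),
                )
            )
    return findings


# Hypothetical sample text, not captured from a real cluster.
SAMPLE = """HEALTH_WARN clock skew detected on mon.b
    mon.b clock skew 0.0825s > max 0.05s (latency 0.0012s)
"""

if __name__ == "__main__":
    for mon, skew, limit in find_clock_skew(SAMPLE):
        print(f"{mon}: skew {skew:.4f}s exceeds limit {limit:.2f}s")
```

In practice the text would come from running `ceph health detail` on an admin node, for example through `subprocess.run(["ceph", "health", "detail"], capture_output=True, text=True).stdout`.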

Recommended action:

Synchronize the clocks on all monitor nodes immediately using NTP or chrony. If the monitors rely on remote time servers reached over an unreliable network, deploy a local NTP server instead. Verify network connectivity between monitors. Do not change `mon_clock_drift_allowed` from its default without testing, because it affects cluster stability.
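
To confirm that synchronization has taken effect on a monitor node, the local offset reported by chrony can be compared with the 0.05 s limit. This is a rough sketch that assumes chrony is in use and that `chronyc tracking` prints a 'System time' line in the form shown in the sample; the helper names and the sample text are hypothetical. Note that it measures each host's offset from its NTP source, which is only a proxy for the skew between monitors.

```python
import re

# Assumed format of the 'System time' line in `chronyc tracking` output.
TRACKING_RE = re.compile(
    r"System time\s*:\s*(\d+(?:\.\d+)?)\s+seconds\s+(fast|slow)\s+of NTP time"
)


def system_offset_seconds(tracking_text):
    """Return the signed offset in seconds (positive = fast), or None."""
    match = TRACKING_RE.search(tracking_text)
    if match is None:
        return None
    offset = float(match.group(1))
    return offset if match.group(2) == "fast" else -offset


def within_allowed_drift(offset, allowed=0.05):
    """Compare an offset with the default mon_clock_drift_allowed (0.05 s)."""
    return abs(offset) <= allowed


# Hypothetical sample text, not captured from a real host.
SAMPLE = """Reference ID    : C0A80001 (ntp.example.internal)
Stratum         : 3
System time     : 0.000020390 seconds slow of NTP time
"""

if __name__ == "__main__":
    offset = system_offset_seconds(SAMPLE)
    if offset is None:
        print("could not find a 'System time' line")
    else:
        status = "ok" if within_allowed_drift(offset) else "exceeds limit"
        print(f"offset {offset:+.6f}s: {status}")
```

Running this on every monitor node after fixing time synchronization gives a quick sanity check, but the cluster's own health output remains the authoritative signal.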