Grafana / CockroachDB

Physical Cluster Replication Lag Silently Growing Before Failover

critical
Replication
Updated Sep 28, 2023

In CockroachDB physical cluster replication (PCR) setups monitored via Grafana, replication lag (the gap between the standby's replicated_time and the current wall-clock time) can grow unnoticed if only the DB Console is used. During failover, that lag translates directly into data loss or an extended RTO, because the standby cluster is further behind than expected.

How to detect:

Monitor physical_replication.replicated_time_seconds (timestamp of standby's consistent data) and calculate replication lag (current time - replicated_time). Alert when lag exceeds SLO threshold (e.g., >5 minutes for critical systems). Cross-reference with physical_replication.logical_bytes (ingestion rate) and standby cluster's Prometheus metrics. If lag climbs while primary cluster load is stable, investigate network latency, standby disk I/O, or insufficient standby resources.
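The lag calculation above is simple enough to sketch directly. A minimal Python sketch, assuming the scraped metric value is a Unix timestamp in seconds and the 5-minute SLO from the text (the threshold and units are assumptions to adapt to your environment):

```python
import time
from typing import Optional

LAG_THRESHOLD_SECONDS = 5 * 60  # example SLO from the text: 5 minutes for critical systems

def replication_lag_seconds(replicated_time_epoch: float,
                            now: Optional[float] = None) -> float:
    """Lag = current wall-clock time minus the standby's consistent replicated_time."""
    if now is None:
        now = time.time()
    return now - replicated_time_epoch

def lag_breaches_slo(replicated_time_epoch: float,
                     now: Optional[float] = None,
                     threshold: float = LAG_THRESHOLD_SECONDS) -> bool:
    """True when replication lag exceeds the alerting threshold."""
    return replication_lag_seconds(replicated_time_epoch, now) > threshold
```

In Prometheus itself the equivalent alert expression would be something like `time() - physical_replication_replicated_time_seconds > 300`, assuming the usual dot-to-underscore metric name translation in the exporter; verify the exact metric name your exporter emits before wiring up the alert.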

Recommended action:

Instrument both the primary and standby CockroachDB clusters with separate Prometheus exporters and Grafana dashboards. Configure alerts for replication lag exceeding the acceptable threshold and for physical_replication.logical_bytes stalling (which indicates a replication stream failure). Before a planned failover, verify that replicated_time is within the acceptable lag window. During incidents, use Prometheus-backed metrics, not the DB Console alone, to assess standby readiness.
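The pre-failover readiness check above can be automated against the Prometheus HTTP API (`/api/v1/query`). A minimal sketch; the Prometheus URL, the translated metric name, and the 300-second threshold are assumptions to adapt to your deployment:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
LAG_QUERY = "time() - physical_replication_replicated_time_seconds"  # assumed metric name
MAX_ACCEPTABLE_LAG = 300  # seconds; align with your SLO

def parse_lag(prom_response: dict) -> float:
    """Extract the lag from a Prometheus instant-query response body."""
    result = prom_response["data"]["result"]
    if not result:
        # No samples means the replication metric is absent: treat the standby
        # as not ready rather than silently passing the check.
        raise RuntimeError("replication metric absent; standby readiness unknown")
    # Take the worst (largest) lag across any returned series.
    return max(float(series["value"][1]) for series in result)

def standby_ready(prom_response: dict, max_lag: float = MAX_ACCEPTABLE_LAG) -> bool:
    """True when the standby's replication lag is within the failover window."""
    return parse_lag(prom_response) <= max_lag

def fetch_lag_response(base_url: str = PROMETHEUS_URL) -> dict:
    """Run the instant query against the Prometheus HTTP API."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": LAG_QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Running `standby_ready(fetch_lag_response())` as a gate in the failover runbook makes the "verify replicated_time is within the acceptable lag window" step an explicit, scriptable check rather than a manual dashboard glance.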