CockroachDB, Grafana, Prometheus

Physical cluster replication lag escalation

warning
Replication
Updated Sep 28, 2023

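Growing replication lag in CockroachDB physical cluster replication (PCR) indicates that the standby cluster cannot keep pace with writes on the primary. Unchecked lag degrades failover readiness and widens the potential data-loss window, because writes the standby has not yet received are unavailable after a failover.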

How to detect:

SHOW VIRTUAL CLUSTER <name> WITH REPLICATION STATUS reports a replication_lag that exceeds the SLO threshold (e.g., 60s) or that trends upward over a sustained period. The physical_replication.replicated_time_seconds metric (Prometheus) shows lag growth: the replicated time falls progressively further behind the current time.
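The detection rule above can be sketched in a few lines of Python. This is a minimal illustration, not an official alerting rule. It assumes that physical_replication.replicated_time_seconds reports the replicated time as seconds since the Unix epoch, so that lag is the sample's wall-clock time minus the metric value. The function names, the 60s default, and the "three consecutive increases" definition of a sustained trend are hypothetical choices for this example.

```python
from typing import List, Tuple


def replication_lag_seconds(sample_time: float, replicated_time: float) -> float:
    """Lag = wall-clock time of the sample minus the replicated timestamp.

    Assumes both values are seconds since the Unix epoch.
    """
    return max(0.0, sample_time - replicated_time)


def lag_alert(
    samples: List[Tuple[float, float]],
    slo_seconds: float = 60.0,   # hypothetical SLO threshold
    min_rising: int = 3,         # hypothetical "sustained trend" length
) -> str:
    """Classify a series of (sample_time, replicated_time) pairs, oldest first.

    Returns "breach" if the latest lag exceeds the SLO, "rising" if the lag
    increased across each of the last `min_rising` intervals, else "ok".
    """
    lags = [replication_lag_seconds(t, r) for t, r in samples]
    if not lags:
        return "ok"
    if lags[-1] > slo_seconds:
        return "breach"
    recent = lags[-(min_rising + 1):]
    rising = len(recent) == min_rising + 1 and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    return "rising" if rising else "ok"
```

Reporting "rising" separately from "breach" lets an operator react to a sustained upward trend before the SLO threshold is actually crossed.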

Recommended action:

When replication lag grows, review the replication job status first, then:
(1) Check standby cluster resource utilization (CPU, disk I/O) for bottlenecks.
(2) Verify that network bandwidth between the clusters is sufficient.
(3) Investigate physical_replication.logical_bytes for sudden spikes in data volume.
(4) Consider scaling the standby cluster if lag persists under normal load.
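Step (3) can be approximated by comparing the most recent ingest rate against a baseline. The sketch below assumes that physical_replication.logical_bytes behaves as a monotonically increasing counter that is sampled periodically. The function names and the 3x spike factor are hypothetical and should be tuned to the workload.

```python
from statistics import median
from typing import List, Tuple


def byte_rates(samples: List[Tuple[float, float]]) -> List[float]:
    """Convert (time_seconds, counter_value) samples, oldest first,
    into per-interval rates in bytes per second."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        # Skip non-advancing timestamps and counter resets.
        if dt <= 0 or v1 < v0:
            continue
        rates.append((v1 - v0) / dt)
    return rates


def is_volume_spike(samples: List[Tuple[float, float]], factor: float = 3.0) -> bool:
    """Flag a spike when the latest rate exceeds `factor` times the
    median of the earlier rates (hypothetical heuristic)."""
    rates = byte_rates(samples)
    if len(rates) < 3:
        return False  # not enough history to form a baseline
    baseline = median(rates[:-1])
    return baseline > 0 and rates[-1] > factor * baseline
```

A median baseline is used so that a single earlier burst does not mask a new spike. If a spike coincides with the onset of lag, the cause is more likely a change in primary write volume than a standby resource bottleneck.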