Physical cluster replication lag escalation

warning

Replication

Updated Sep 28, 2023

Growing replication lag in CockroachDB physical cluster replication (PCR) indicates that the standby cluster cannot keep pace with writes on the primary. Unchecked lag degrades failover readiness and widens the potential data-loss window.

Sources

Physical Cluster Replication (www.cockroachlabs.com)

Physical Cluster Replication Monitoring (cockroachlabs.com)

Technologies:

CockroachDB: The root cause of this issue originates in CockroachDB.

Grafana: Symptoms of this issue are visible in Grafana metrics and logs.

Prometheus: Symptoms of this issue are visible in Prometheus metrics and logs.

How to detect:

SHOW VIRTUAL CLUSTER WITH REPLICATION STATUS shows replication_lag exceeding the SLO threshold (e.g., 60s) or trending upward over a sustained period. In Prometheus, the physical_replication.replicated_time_seconds metric falls progressively further behind the current time, which indicates growing lag.
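The detection rule can be sketched as a small check over scraped samples. This is a minimal illustration, not CockroachDB tooling: it assumes physical_replication.replicated_time_seconds reports the replicated timestamp in seconds since the Unix epoch, so that lag is the scrape time minus that value. The names lag_seconds, assess_lag, and SLO_SECONDS are hypothetical, and the 60-second threshold is the example value from the text.

```python
from typing import Sequence, Tuple

SLO_SECONDS = 60.0  # example threshold from the text; adjust to your own SLO


def lag_seconds(sample_time: float, replicated_time: float) -> float:
    """Lag is the scrape time minus the replicated timestamp (assumed epoch seconds)."""
    return max(0.0, sample_time - replicated_time)


def assess_lag(samples: Sequence[Tuple[float, float]], slo: float = SLO_SECONDS) -> str:
    """Classify (sample_time, replicated_time) pairs, ordered oldest first."""
    if not samples:
        return "no-data"
    lags = [lag_seconds(t, r) for t, r in samples]
    # The most recent lag is over the SLO threshold.
    if lags[-1] > slo:
        return "breach"
    # Sustained upward trend: each lag is strictly higher than the one before it.
    if len(lags) >= 3 and all(b > a for a, b in zip(lags, lags[1:])):
        return "rising"
    return "ok"
```

Requiring at least three strictly increasing samples is one simple way to express "sustained upward trend"; a production alert would more likely use a longer window or a slope over time.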

Recommended action:

When replication lag grows:

(1) Check standby cluster resource utilization (CPU, disk I/O) for bottlenecks.
(2) Verify that network bandwidth between the clusters is sufficient.
(3) Investigate physical_replication.logical_bytes for sudden spikes in data volume.
(4) Consider scaling the standby cluster if lag persists under normal load.

Also review the replication job status.
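For the data-volume check, one possible approach is to turn successive samples of physical_replication.logical_bytes into per-interval rates and compare the latest rate with a recent baseline. The sketch below assumes the metric behaves as a monotonically increasing counter. The helper names byte_rates and has_volume_spike and the 3x factor are hypothetical choices for illustration.

```python
from statistics import median
from typing import List, Sequence, Tuple


def byte_rates(samples: Sequence[Tuple[float, float]]) -> List[float]:
    """Convert (epoch_seconds, counter_value) samples into bytes/second per interval."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # Skip counter resets and out-of-order samples.
        if t1 > t0 and v1 >= v0:
            rates.append((v1 - v0) / (t1 - t0))
    return rates


def has_volume_spike(samples: Sequence[Tuple[float, float]], factor: float = 3.0) -> bool:
    """True when the latest rate exceeds `factor` times the median of earlier rates."""
    rates = byte_rates(samples)
    if len(rates) < 3:
        return False  # not enough history to form a baseline
    baseline = median(rates[:-1])
    # A zero baseline is treated as "no usable baseline" rather than a spike.
    return baseline > 0 and rates[-1] > factor * baseline
```

If this check reports a spike while standby CPU, disk I/O, and network look healthy, the lag is more likely driven by a burst of writes on the primary than by a capacity shortfall on the standby.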