Replication Lag Accumulation

Severity: warning

Category: Replication

Updated: Jul 24, 2024

Bucket or site replication can fall behind because of network issues, an unavailable target, or too few replication workers. The backlog then keeps growing and puts recovery point objectives (RPO) at risk.
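To see why a backlog grows steadily rather than just spiking, the toy model below treats replication as a queue. Each interval adds new writes, and the workers drain up to a fixed capacity. The function name `simulate_backlog` and the rates are illustrative assumptions, not MinIO internals or defaults.

```python
def simulate_backlog(
    write_rate: int, drain_capacity: int, intervals: int, initial_backlog: int = 0
) -> list[int]:
    """Toy queue model (assumption, not MinIO's scheduler): pending objects after each interval."""
    backlog = initial_backlog
    history = []
    for _ in range(intervals):
        backlog += write_rate                    # new objects queued for replication
        backlog -= min(backlog, drain_capacity)  # workers replicate what capacity allows
        history.append(backlog)
    return history


# Healthy: drain capacity exceeds the write rate, so the queue stays empty.
print(simulate_backlog(write_rate=800, drain_capacity=1000, intervals=5))
# [0, 0, 0, 0, 0]

# Degraded target or too few workers: capacity falls below the write rate,
# and the backlog grows by the shortfall (300 objects) every interval.
print(simulate_backlog(write_rate=800, drain_capacity=500, intervals=5))
# [300, 600, 900, 1200, 1500]
```

The point of the model is that any sustained gap between write rate and replication capacity produces linear, unbounded growth. That is why a rising trend matters more than any single reading.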


Technologies: MinIO

Symptoms of this issue appear in MinIO metrics and logs. Relevant metrics:

minio.replication.objects_pending.total

minio.replication.active_workers

minio.replication.throughput_bytes

minio.replication.queue_depth

How to detect:

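Watch for any of these signals:

- Replication queue depth (minio.replication.queue_depth) climbs steadily.
- Pending objects (minio.replication.objects_pending.total) grow over time.
- Active workers (minio.replication.active_workers) stay at their configured maximum while the backlog keeps increasing.
- Replication throughput (minio.replication.throughput_bytes) falls below its baseline under normal write load.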
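The detection rules above can be sketched as a check over a short, time-ordered window of metric samples. Several pieces are assumptions for illustration: the `ReplicationSample` record, the function names, the 50% throughput tolerance, and how samples are collected (for example, scraped from MinIO's Prometheus endpoint). Thresholds should be tuned against your own baseline.

```python
from dataclasses import dataclass


@dataclass
class ReplicationSample:
    """One reading (hypothetical record; fields mirror the metrics listed above)."""
    objects_pending: int      # minio.replication.objects_pending.total
    active_workers: int       # minio.replication.active_workers
    throughput_bytes: float   # minio.replication.throughput_bytes (assumed per-interval rate)
    queue_depth: int          # minio.replication.queue_depth


def is_growing(values: list[float]) -> bool:
    """True if the series never decreases and ends higher than it started."""
    return (
        len(values) >= 2
        and values[-1] > values[0]
        and all(b >= a for a, b in zip(values, values[1:]))
    )


def detect_replication_lag(
    samples: list[ReplicationSample],
    max_workers: int,
    baseline_throughput: float,
    throughput_tolerance: float = 0.5,  # assumption: flag throughput below 50% of baseline
) -> list[str]:
    """Return the lag symptoms present in a time-ordered window of samples.

    Assumes the window was taken under normal write load, so a throughput
    drop is not simply the result of fewer writes.
    """
    if not samples:
        return []
    findings = []
    if is_growing([s.queue_depth for s in samples]):
        findings.append("queue_depth_rising")
    if is_growing([s.objects_pending for s in samples]):
        findings.append("pending_growing")
        # Workers pinned at the ceiling while the backlog still grows suggests saturation.
        if all(s.active_workers >= max_workers for s in samples):
            findings.append("workers_saturated")
    if samples[-1].throughput_bytes < baseline_throughput * throughput_tolerance:
        findings.append("throughput_below_baseline")
    return findings


window = [
    ReplicationSample(1_000, 16, 4.0e8, 50),
    ReplicationSample(1_800, 16, 2.5e8, 90),
    ReplicationSample(2_700, 16, 1.5e8, 140),
]
print(detect_replication_lag(window, max_workers=16, baseline_throughput=4.0e8))
# ['queue_depth_rising', 'pending_growing', 'workers_saturated', 'throughput_below_baseline']
```

Requiring a non-decreasing series, rather than comparing two points, avoids alerting on a single noisy sample. Real deployments may prefer a smoothed trend or regression slope instead.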

Recommended action:

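Inspect replication status and backlog with 'mc admin replicate status' (site replication) or 'mc replicate status' (bucket replication). 'mc admin replicate resync status' reports the progress of a site replication resync. Verify target cluster health, and measure network throughput between source and target with hperf. Check whether replication workers are saturated (CPU- or network-bound). Use 'mc admin replicate resync start' to manually trigger a resync and recover replication. Alert when the backlog exceeds your RPO tolerance (for example, more than one hour of writes).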
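The "more than one hour of writes" threshold can be made concrete by dividing the pending-object count by the average write rate. The sketch below does that and also estimates how long the backlog would take to drain. The function names, the objects-per-second inputs, and the one-hour default are illustrative assumptions, not MinIO settings or APIs.

```python
import math


def backlog_seconds_of_writes(objects_pending: int, write_rate_per_s: float) -> float:
    """Seconds of incoming writes that the pending backlog represents."""
    if write_rate_per_s <= 0:
        raise ValueError("write rate must be positive to express backlog as time")
    return objects_pending / write_rate_per_s


def estimated_drain_seconds(
    objects_pending: int, replication_rate_per_s: float, write_rate_per_s: float
) -> float:
    """Time to clear the backlog if both rates hold steady.

    New writes keep joining the queue while it drains, so only the surplus of
    replication over writes actually reduces the backlog.
    """
    if objects_pending == 0:
        return 0.0
    net_drain = replication_rate_per_s - write_rate_per_s
    if net_drain <= 0:
        return math.inf  # replication cannot catch up: the backlog keeps accumulating
    return objects_pending / net_drain


def should_alert(
    objects_pending: int, write_rate_per_s: float, rpo_seconds: float = 3600.0
) -> bool:
    """Alert when the backlog exceeds the RPO tolerance (default: one hour of writes)."""
    return backlog_seconds_of_writes(objects_pending, write_rate_per_s) > rpo_seconds


pending = 2_160_000
print(backlog_seconds_of_writes(pending, write_rate_per_s=500))  # 4320.0 (1.2 hours)
print(should_alert(pending, write_rate_per_s=500))               # True
print(estimated_drain_seconds(pending, replication_rate_per_s=800, write_rate_per_s=500))  # 7200.0
```

The drain estimate subtracts the write rate because writes continue during recovery. A target that replicates at 800 objects/s against 500 objects/s of writes clears only 300 objects/s of backlog, not 800.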