Redis Replication Lag Spike Causing Stale Reads

Incident Response

Redis replicas are falling behind the master, causing stale data to be served to read queries and risking data inconsistency.

Prompt: Our Redis replicas are showing 30+ seconds of replication lag and we're getting complaints about stale data - help me diagnose if this is a network issue, resource constraint, or if the master is writing too fast for replicas to keep up.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing Redis replication lag, start by confirming the lag severity and checking for network connectivity issues between master and replicas. Then investigate whether the replication backlog is sized appropriately for your write rate, and finally assess if the master's write throughput or network bandwidth is overwhelming the replicas.

1. Confirm replication lag severity across all replicas
Check `redis.replication.master_last_io_seconds_ago` on each replica node to confirm the reported 30+ second lag. This metric reports how many seconds have passed since the replica last exchanged data with the master, so it measures link idleness rather than the amount of data outstanding. If only one replica shows lag while the others are healthy, the problem is isolated to that node. If all replicas show similar lag patterns, the issue is more likely at the master or network level.
2. Check if the replication link is down or unstable
Look at `redis.replication.master_link_down_since_seconds` on replicas and `redis.clients.connected_slaves` on the master. If `master_link_down_since_seconds` is non-zero, the replication link is down and no data is being replicated, which usually points to a network connectivity problem. If `connected_slaves` is lower than expected, some replicas have disconnected. Network instability between master and replicas is a common cause of sudden replication lag spikes.
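The first two checks can be sketched as a small classifier over a replica's `INFO replication` output. This is a minimal sketch, assuming the metrics above map to the `INFO` fields `master_link_status`, `master_last_io_seconds_ago`, and `master_link_down_since_seconds`; the function names and the 30-second threshold are illustrative, not part of any monitoring product.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse raw INFO output (key:value lines, '#' section headers) into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def classify_replica(info: dict[str, str], max_idle_seconds: int = 30) -> str:
    """Return a short health label for one replica (hypothetical helper)."""
    if info.get("master_link_status") != "up":
        # master_link_down_since_seconds is only reported while the link is down.
        down_for = info.get("master_link_down_since_seconds", "unknown")
        return f"link down for {down_for}s"
    idle = int(info.get("master_last_io_seconds_ago", -1))
    if idle > max_idle_seconds:
        return f"link up but idle for {idle}s"
    return "healthy"


if __name__ == "__main__":
    sample = (
        "# Replication\r\n"
        "role:slave\r\n"
        "master_link_status:up\r\n"
        "master_last_io_seconds_ago:45\r\n"
    )
    print(classify_replica(parse_info(sample)))
```

Running the same classifier against every replica makes it easy to see whether one node or the whole fleet is affected.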
3. Measure replication offset divergence to quantify the backlog
Compare `redis.replication.offset` on the master against the same metric on each replica. The difference tells you how many bytes of replication data the replica is behind. If this gap grows over time, replicas are falling further behind and cannot keep up with the master's write rate. A stable gap means the replica is keeping pace but not catching up, and a shrinking gap suggests temporary congestion that is clearing.
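The offset comparison above can be sketched directly from the master's `INFO replication` output, which reports `master_repl_offset` plus one `slaveN:` line per connected replica. This is a minimal sketch; `replica_gaps` and `gap_trend` are hypothetical helper names, and the sample addresses are made up.

```python
import re

SLAVE_LINE = re.compile(r"^slave\d+:(.*)$")


def replica_gaps(master_info_text: str) -> dict[str, int]:
    """Return bytes-behind per replica, keyed by 'ip:port'."""
    master_offset = None
    replicas = []
    for line in master_info_text.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        match = SLAVE_LINE.match(line)
        if match:
            # e.g. ip=10.0.0.2,port=6379,state=online,offset=400,lag=1
            attrs = dict(pair.split("=", 1) for pair in match.group(1).split(","))
            replicas.append(attrs)
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return {f"{r['ip']}:{r['port']}": master_offset - int(r["offset"]) for r in replicas}


def gap_trend(samples: list[int], tolerance_bytes: int = 0) -> str:
    """Classify a series of gap samples (oldest first) for one replica."""
    if len(samples) < 2:
        return "unknown"
    change = samples[-1] - samples[0]
    if change > tolerance_bytes:
        return "growing"
    if change < -tolerance_bytes:
        return "shrinking"
    return "stable"
```

Sampling `replica_gaps` every few seconds and feeding each replica's series to `gap_trend` distinguishes a replica that is recovering from one that keeps falling behind.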
4. Check if replication backlog is too small for your write pattern
Verify that `redis.replication.repl_backlog_size` is large enough to handle temporary disconnections without triggering full resyncs. Calculate the required size: (bytes written per second from `redis.net.output`) × (maximum expected disconnection time in seconds) × 2. Because `redis.net.output` also counts replies sent to clients, it gives a conservative estimate; the growth rate of the master's replication offset is a tighter measure of the write rate. If replicas frequently perform full resyncs instead of partial resyncs after brief network hiccups, the backlog is too small, and the repeated full resyncs put heavy load on the master.
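The sizing formula above can be worked through as follows. This is a minimal sketch, assuming the write rate is estimated from two samples of the master's replication offset; both function names are hypothetical and the numbers are illustrative.

```python
import math


def write_rate_bytes_per_sec(offset_start: int, offset_end: int, interval_seconds: float) -> float:
    """Estimate the replication write rate from two master offset samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (offset_end - offset_start) / interval_seconds


def required_backlog_bytes(
    write_rate: float,
    max_disconnect_seconds: float,
    safety_factor: float = 2.0,
) -> int:
    """write rate x maximum expected disconnection time x safety factor."""
    return math.ceil(write_rate * max_disconnect_seconds * safety_factor)


if __name__ == "__main__":
    # The offset grew by 300,000,000 bytes over 60 s, so about 5 MB/s of writes.
    rate = write_rate_bytes_per_sec(0, 300_000_000, 60)
    # Tolerate a 60 s disconnection with a 2x margin.
    print(required_backlog_bytes(rate, 60))
```

With these example figures the result is 600,000,000 bytes, which would then be compared against the configured `repl-backlog-size`.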
5. Assess master write throughput and replica capacity
Check `redis.net.output` on the master to see how many bytes per second it is sending. A sudden spike in write activity can overwhelm replicas, especially if they are resource-constrained (CPU, memory, or disk I/O on systems with persistence enabled). Compare this against historical baselines: if the master is suddenly writing 10x more data than usual, replicas sized for normal load will not keep up without scaling.
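The baseline comparison above can be sketched as a ratio against the median of recent samples. This is a minimal sketch; `write_spike_factor` is a hypothetical helper, and the choice of median as the baseline is an assumption rather than something the playbook prescribes.

```python
from statistics import median


def write_spike_factor(current_bytes_per_sec: float, history_bytes_per_sec: list[float]) -> float:
    """Return how many times the current output rate exceeds the historical median."""
    baseline = median(history_bytes_per_sec)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current_bytes_per_sec / baseline
```

A factor near 1 points away from the master's write rate as the cause, while a factor of 10 or more matches the overload scenario described above.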
6. Investigate network throughput bottlenecks between master and replicas
Compare `redis.net.output` on the master with `redis.net.input` on replicas to identify network bandwidth constraints. If the master is pushing 500 MB/s but replicas are only receiving 50 MB/s, you have a network bottleneck, which could be insufficient bandwidth, packet loss, or congestion on the replication network. Keep in mind that the master's output also includes replies to clients, so compare like for like where possible. This is especially common in cross-AZ or cross-region replication setups where network capacity is limited.
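The throughput comparison above can be sketched as a delivery ratio per replica. This is a minimal sketch; the helper names, host labels, and the 0.9 threshold are illustrative assumptions, and the master rate passed in should ideally be the replication traffic destined for one replica rather than total output.

```python
def delivery_ratio(master_out_bps: float, replica_in_bps: float) -> float:
    """Fraction of the master's outbound rate that a replica is receiving."""
    if master_out_bps <= 0:
        raise ValueError("master_out_bps must be positive")
    return replica_in_bps / master_out_bps


def find_bottlenecked(
    master_out_bps: float,
    replica_in_by_host: dict[str, float],
    min_ratio: float = 0.9,
) -> list[str]:
    """Return the replicas whose inbound rate falls below min_ratio of the master's rate."""
    return sorted(
        host
        for host, rate in replica_in_by_host.items()
        if delivery_ratio(master_out_bps, rate) < min_ratio
    )


if __name__ == "__main__":
    # 500 MB/s outbound versus 50 MB/s and 490 MB/s inbound.
    print(find_bottlenecked(500e6, {"replica-a": 50e6, "replica-b": 490e6}))
```

Replicas flagged here are candidates for a closer look at the network path, such as cross-AZ links or saturated interfaces.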

Monitoring Interfaces

Redis Prometheus
Redis Datadog
Redis Native Metrics