Redis Replication Lag Spike Causing Stale Reads
warning
Incident Response
Redis replicas are falling behind the master, causing stale data to be served to read queries and risking data inconsistency.
Prompt: “Our Redis replicas are showing 30+ seconds of replication lag and we're getting complaints about stale data - help me diagnose if this is a network issue, resource constraint, or if the master is writing too fast for replicas to keep up.”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing Redis replication lag, start by confirming the lag severity and checking for network connectivity issues between master and replicas. Then investigate whether the replication backlog is sized appropriately for your write rate, and finally assess if the master's write throughput or network bandwidth is overwhelming the replicas.
1. Confirm replication lag severity across all replicas
Check `redis.replication.master_last_io_seconds_ago` on each replica node to confirm the reported 30+ second lag. This metric reports how many seconds have passed since the replica last received data from the master. Because the master pings its replicas periodically even when there are no writes, a healthy replica keeps this value low, and a value that keeps climbing means the link has gone quiet. If only one replica shows lag while the others are healthy, you've isolated the problem to that node. If all replicas show similar lag patterns, the issue is likely at the master or network level.
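As a rough illustration of this triage, the sketch below parses raw `INFO replication` text and classifies whether lag is isolated or fleet-wide. The helper names `parse_info` and `lag_scope`, the replica labels, and the 30-second threshold are assumptions for the example, not part of any Redis client library.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the key:value lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def lag_scope(last_io_by_replica: dict[str, int], threshold_s: int = 30) -> str:
    """Say whether lag is absent, isolated to some replicas, or affects all."""
    lagging = [name for name, secs in last_io_by_replica.items()
               if secs >= threshold_s]
    if not lagging:
        return "healthy"
    if len(lagging) == len(last_io_by_replica):
        return "all replicas lagging: suspect master or network"
    return "isolated: " + ", ".join(sorted(lagging))
```

Feeding each replica's `master_last_io_seconds_ago` value into `lag_scope` gives a first answer to "one node or all of them?" before digging further.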
2. Check if the replication link is down or unstable
Look at `redis.replication.master_link_down_since_seconds` on replicas and `redis.clients.connected_slaves` on the master. If `master_link_down_since_seconds` is non-zero, the replication link is currently down and no replication is happening, which usually points to a connectivity problem between that replica and the master. If `connected_slaves` is lower than expected, some replicas have disconnected. Network instability between master and replicas is a frequent cause of sudden replication lag spikes, so rule it out early.
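The link checks above can be sketched as a small function over already-parsed `INFO` fields. This assumes the raw field names `connected_slaves`, `master_link_status`, and `master_link_down_since_seconds` as they appear in `INFO replication` output; the function name `link_findings` and the replica labels are hypothetical.

```python
def link_findings(master_info: dict[str, str],
                  replica_infos: dict[str, dict[str, str]],
                  expected_replicas: int) -> list[str]:
    """Return human-readable findings about missing or down replication links."""
    findings = []
    connected = int(master_info.get("connected_slaves", "0"))
    if connected < expected_replicas:
        findings.append(
            f"master sees {connected}/{expected_replicas} replicas connected")
    for name, info in sorted(replica_infos.items()):
        status = info.get("master_link_status", "unknown")
        if status != "up":
            # This field is only reported by Redis while the link is down.
            down_for = info.get("master_link_down_since_seconds", "?")
            findings.append(f"{name}: link {status} for {down_for}s")
    return findings
```

An empty list means both views agree that every expected replica is attached; any entry is a reason to look at the network path before tuning Redis itself.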
3. Measure replication offset divergence to quantify the backlog
Compare `redis.replication.offset` on the master against the same metric on each replica. The difference tells you how many bytes of replication data the replica is behind. If this gap is growing over time, the replica is falling further behind and can't keep up with the master's write rate. A stable gap means the replica is keeping pace but not catching up, and a shrinking gap suggests temporary congestion that has since resolved.
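A minimal sketch of the gap calculation and trend check, assuming you sample the master and replica offsets at the same moments. The function names and the `tolerance` parameter are illustrative choices, not part of any monitoring API.

```python
def offset_gap(master_offset: int, replica_offset: int) -> int:
    """Bytes of replication stream the replica has not yet applied."""
    return max(master_offset - replica_offset, 0)


def gap_trend(gaps: list[int], tolerance: int = 0) -> str:
    """Compare the first and last sampled gaps to classify the trend."""
    if len(gaps) < 2:
        return "unknown"
    delta = gaps[-1] - gaps[0]
    if delta > tolerance:
        return "growing"
    if delta < -tolerance:
        return "shrinking"
    return "stable"
```

Setting `tolerance` to a few kilobytes avoids flapping between states when the gap jitters around a steady value.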
4. Check if replication backlog is too small for your write pattern
Verify that `redis.replication.repl_backlog_size` is large enough to handle temporary disconnections without triggering full resyncs. Calculate the required size: (bytes written per second from `redis.net.output`) × (maximum expected disconnection time in seconds) × 2. Note that `redis.net.output` counts all outbound traffic, including replies to clients, so it overestimates the replication stream; treat the result as an upper bound. If replicas are frequently doing full resyncs instead of partial resyncs after brief network hiccups, the backlog is too small, and the resulting resync storms put heavy load on the master.
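The sizing formula can be worked through in a few lines. The numbers below (5 MiB/s of writes, a 60-second outage) are made-up inputs for illustration; the function names are hypothetical.

```python
import math


def required_backlog_bytes(write_bytes_per_s: float,
                           max_disconnect_s: float,
                           safety_factor: float = 2.0) -> int:
    """Backlog size = write rate x longest expected disconnection x safety factor."""
    return math.ceil(write_bytes_per_s * max_disconnect_s * safety_factor)


def backlog_is_sufficient(current_backlog_bytes: int,
                          write_bytes_per_s: float,
                          max_disconnect_s: float) -> bool:
    """True if the configured backlog covers the computed requirement."""
    needed = required_backlog_bytes(write_bytes_per_s, max_disconnect_s)
    return current_backlog_bytes >= needed


# Example: 5 MiB/s of writes and a 60 s outage need about 600 MiB of backlog.
needed = required_backlog_bytes(5 * 1024 * 1024, 60)
```

With those example inputs, a backlog left at the 1 MiB Redis default would be exhausted in a fraction of a second, so any reconnect would fall back to a full resync.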
5. Assess master write throughput and replica capacity
Check `redis.net.output` on the master to see how many bytes per second it is sending, which rises with write volume because every write is streamed to the replicas. A sudden spike in write activity can overwhelm replicas, especially if they're resource-constrained (CPU, memory, or disk I/O on systems with persistence enabled). Compare this against historical baselines: if the master is suddenly writing 10x more data than usual, replicas sized for normal load won't keep up without scaling.
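One simple way to express the baseline comparison is a ratio against the median of recent samples. This is only a sketch: the choice of median as the baseline and the name `spike_ratio` are assumptions, and your monitoring system may offer its own anomaly detection.

```python
from statistics import median


def spike_ratio(current_bytes_per_s: float,
                baseline_samples: list[float]) -> float:
    """How many times above the historical median the current rate is."""
    baseline = median(baseline_samples)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current_bytes_per_s / baseline
```

A ratio near 1 points away from write volume and toward replica resources or the network; a ratio near 10 matches the spike scenario described above.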
6. Investigate network throughput bottlenecks between master and replicas
Compare `redis.net.output` on the master with `redis.net.input` on replicas to identify network bandwidth constraints. If the master is pushing 500 MB/s but replicas are only receiving 50 MB/s, you have a network bottleneck, which could be insufficient bandwidth, packet loss, or congestion on the replication network. This is especially common in cross-AZ or cross-region replication setups where network capacity is limited.
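If you are working from raw `INFO stats` counters rather than a rate metric, the comparison might look like the sketch below. It assumes two samples of the cumulative `total_net_output_bytes` (master) and `total_net_input_bytes` (replica) counters taken a known interval apart; the function names are illustrative. Because the master's output also includes client replies and traffic to other replicas, treat the ratio as a rough signal rather than an exact measurement.

```python
def rate_bytes_per_s(first_total: int, second_total: int,
                     interval_s: float) -> float:
    """Convert two samples of a cumulative byte counter into a rate."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (second_total - first_total) / interval_s


def receive_ratio(master_out_rate: float, replica_in_rate: float) -> float:
    """Fraction of the master's send rate that the replica is receiving."""
    if master_out_rate <= 0:
        return 1.0  # Nothing is being sent, so nothing can be lost.
    return replica_in_rate / master_out_rate
```

With the figures from the step above (500 MB/s out, 50 MB/s in), `receive_ratio` comes out at about 0.1, which would justify a closer look at bandwidth and packet loss on the replication path.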
Related Insights
Replica Lag Hides Behind Stale Cache Hits
critical
Redis replica falling behind master (redis.replication.master_last_io_seconds_ago increasing) continues serving stale cached data with normal hit rates, masking synchronization issues until applications encounter data inconsistencies or master failover fails.
Replication Backlog Too Small for Network Instability
warning
When redis.replication.repl_backlog_size is too small relative to write rate and network latency, replicas require full resync after brief disconnections, causing replication storms and increased master load.
Monitoring Interfaces
Redis Native Metrics