Replication Lag Causing Stale Reads and Failover Risk
critical
Incident Response
Redis replicas falling behind master due to high write load, network issues, or resource constraints, causing stale reads and increasing data loss risk during failover.
Prompt: “Our Redis replicas are showing replication lag of 30+ seconds and we're getting complaints about stale data being served to users. I'm worried about data loss if we have a failover right now. What metrics should I check to understand why replication is lagging and how can I determine if we need to scale or if this is a temporary spike?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating Redis replication lag, start by quantifying the actual lag and offset divergence between master and replicas, then verify replica connectivity before diving into root causes like inadequate backlog size, network instability, or master overload. The key is distinguishing between temporary spikes and systemic issues that require architectural changes.
1. Quantify the replication lag and offset divergence
Check `redis.replication.master_last_io_seconds_ago` on each replica — values above 30 seconds indicate problematic lag. Compare `redis.replication.offset` between the master and each replica to see the actual byte divergence. This baseline measurement tells you if you're dealing with a temporary spike or sustained lag, and which replicas are most affected.
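This baseline check can be sketched by parsing `INFO replication` output from the master and a replica. The `INFO` text below is a hypothetical sample; in practice you would fetch it with `redis-cli -h <host> INFO replication` or a client library. The field names (`master_repl_offset`, `slave_repl_offset`, `master_last_io_seconds_ago`) are real Redis INFO keys.

```python
def parse_info(info_text: str) -> dict:
    """Parse the key:value lines of a Redis INFO section into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers like "# Replication"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields

# Hypothetical INFO replication output from the master and one replica.
MASTER_INFO = """# Replication
role:master
connected_slaves:2
master_repl_offset:123456789
"""

REPLICA_INFO = """# Replication
role:slave
master_link_status:up
master_last_io_seconds_ago:34
slave_repl_offset:123001234
"""

master = parse_info(MASTER_INFO)
replica = parse_info(REPLICA_INFO)

lag_seconds = int(replica["master_last_io_seconds_ago"])
offset_divergence = int(master["master_repl_offset"]) - int(replica["slave_repl_offset"])

print(f"lag: {lag_seconds}s, offset divergence: {offset_divergence} bytes")
if lag_seconds > 30:
    print("replica lag exceeds the 30s threshold")
```

Running this per replica and comparing results shows whether lag is uniform (pointing at the master) or isolated to one replica (pointing at that host or its network path).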
2. Verify replica connectivity and link status
On the master, check `redis.clients.connected_slaves` to confirm all expected replicas are connected. On each replica, look for `master_link_down_since_seconds` being non-zero, which indicates connection drops. Frequent disconnections combined with lag point to network instability rather than just throughput issues.
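As a sketch, the connectivity check above can be automated over already-parsed INFO fields. The field names are real Redis INFO keys, but the expected replica count and the sample data are assumptions for illustration.

```python
EXPECTED_REPLICAS = 3  # assumption: your topology expects 3 replicas

def link_health(connected_slaves: int, replicas: list[dict]) -> list[str]:
    """Return human-readable connectivity warnings for a master and its replicas."""
    warnings = []
    if connected_slaves < EXPECTED_REPLICAS:
        warnings.append(
            f"only {connected_slaves}/{EXPECTED_REPLICAS} replicas connected to master"
        )
    for r in replicas:
        # master_link_down_since_seconds only appears while the link is down.
        down_for = int(r.get("master_link_down_since_seconds", 0))
        if r.get("master_link_status") != "up" or down_for > 0:
            warnings.append(
                f"{r['name']}: link {r.get('master_link_status')}, down for {down_for}s"
            )
    return warnings

# Hypothetical snapshot: one healthy replica, one with a dropped link.
replicas = [
    {"name": "replica-1", "master_link_status": "up"},
    {"name": "replica-2", "master_link_status": "down",
     "master_link_down_since_seconds": "12"},
]
for w in link_health(connected_slaves=2, replicas=replicas):
    print(w)
```

Tracking how often these warnings fire over time is what separates a one-off blip from the network instability pattern described above.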
3. Assess replication backlog adequacy
Check if `redis.replication.repl_backlog_size` is large enough for your write rate and network conditions. Calculate required backlog as (write_bytes_per_sec × max_expected_disconnection_seconds × 2). If replicas are frequently triggering full resyncs instead of partial resyncs after brief network hiccups, your backlog is too small and you're creating replication storms that increase master load.
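The sizing formula above is simple enough to sketch directly. The traffic numbers here are hypothetical; measure your real write rate from the growth of `master_repl_offset` over time.

```python
def required_backlog_bytes(write_bytes_per_sec: float,
                           max_disconnect_seconds: float,
                           safety_factor: float = 2.0) -> int:
    """write rate x worst-case disconnection window x safety margin."""
    return int(write_bytes_per_sec * max_disconnect_seconds * safety_factor)

# Example: 5 MB/s of writes, tolerate 60 s disconnections.
needed = required_backlog_bytes(5 * 1024 * 1024, 60)
print(f"set repl-backlog-size to at least {needed} bytes (~{needed / 2**20:.0f} MB)")
```

If the computed value is much larger than your configured `repl-backlog-size`, brief disconnections will fall outside the backlog window and force full resyncs.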
4. Check master resource saturation
Look at master CPU utilization, network bandwidth saturation, and memory usage. If the master is overloaded (CPU >80%, network bandwidth maxed out), it can't serve replication traffic fast enough regardless of replica or network health. High write throughput combined with complex data structures (large lists, sorted sets) amplifies this issue.
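A minimal sketch of this check, using the thresholds above. The sampled values are hypothetical; in practice they come from your host monitoring (CPU %, NIC throughput versus link capacity).

```python
def master_saturated(cpu_pct: float, net_bytes_per_sec: float,
                     link_capacity_bytes_per_sec: float) -> list[str]:
    """Return the saturation conditions currently tripped on the master."""
    tripped = []
    if cpu_pct > 80:
        tripped.append(f"CPU at {cpu_pct:.0f}% (>80%)")
    if net_bytes_per_sec / link_capacity_bytes_per_sec > 0.9:
        tripped.append("network link over 90% utilized")
    return tripped

# Example: 92% CPU, 118 MB/s on a 125 MB/s (1 Gbit) link.
for reason in master_saturated(92, 118e6, 125e6):
    print(reason)
```

If either condition trips, fix the master's headroom first; tuning replicas or the backlog will not help while the master cannot push replication traffic.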
5. Investigate network latency and bandwidth
Measure network latency and packet loss between master and replicas, especially if you see `master_link_down_since_seconds` events. Network issues combined with an undersized backlog cause the full-resync pattern described in the replication backlog insight. If replicas are in different availability zones or regions, sustained high write rates may exceed available bandwidth.
6. Determine failover risk and data loss exposure
Calculate potential data loss by multiplying current write rate by `redis.replication.master_last_io_seconds_ago` — this is roughly how much data would be lost in a failover right now. Check if applications are reading from lagging replicas and serving stale data. High `keyspace.hits` on replicas can mask the lag issue because caching appears to be working, but users are seeing outdated values.
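The exposure estimate above is a one-line calculation, sketched here with hypothetical numbers. Derive `write_bytes_per_sec` from the growth rate of `master_repl_offset`, and take the lag from the worst replica that could be promoted.

```python
def failover_loss_estimate(write_bytes_per_sec: float,
                           worst_lag_seconds: float) -> float:
    """Approximate bytes of acknowledged writes a failover would discard."""
    return write_bytes_per_sec * worst_lag_seconds

# Example: 2 MB/s write rate, worst replica 34 s behind.
loss = failover_loss_estimate(write_bytes_per_sec=2 * 1024 * 1024,
                              worst_lag_seconds=34)
print(f"estimated exposure: ~{loss / 2**20:.0f} MB of writes")
```

This number is what you weigh against the cost of throttling writes or delaying a planned failover until replicas catch up.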
Related Insights
Replica Lag Hides Behind Stale Cache Hits
critical
Redis replica falling behind master (redis.replication.master_last_io_seconds_ago increasing) continues serving stale cached data with normal hit rates, masking synchronization issues until applications encounter data inconsistencies or master failover fails.
Replication Backlog Too Small for Network Instability
warning
When redis.replication.repl_backlog_size is too small relative to write rate and network latency, replicas require full resync after brief disconnections, causing replication storms and increased master load.