Replication Lag Causing Stale Reads and Failover Risk
critical
Incident Response
Redis replicas falling behind master due to high write load, network issues, or resource constraints, causing stale reads and increasing data loss risk during failover.
Prompt: “Our Redis replicas are showing replication lag of 30+ seconds and we're getting complaints about stale data being served to users. I'm worried about data loss if we have a failover right now. What metrics should I check to understand why replication is lagging and how can I determine if we need to scale or if this is a temporary spike?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating Redis replication lag, start by quantifying the actual lag and offset divergence between master and replicas, then verify replica connectivity before diving into root causes like inadequate backlog size, network instability, or master overload. The key is distinguishing between temporary spikes and systemic issues that require architectural changes.
1. Quantify the replication lag and offset divergence
Check `redis.replication.master_last_io_seconds_ago` on each replica — values above 30 seconds indicate problematic lag. Compare `redis.replication.offset` between the master and each replica to see the actual byte divergence. This baseline measurement tells you if you're dealing with a temporary spike or sustained lag, and which replicas are most affected.
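This baseline check can be sketched by parsing `INFO replication` output from the master and a replica. The `INFO` text below is a hypothetical sample; in practice you would fetch it with `redis-cli -h <host> INFO replication` or a client library. The field names (`master_repl_offset`, `slave_repl_offset`, `master_last_io_seconds_ago`) are real Redis INFO keys.

```python
def parse_info(info_text: str) -> dict:
    """Parse the key:value lines of a Redis INFO section into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers like "# Replication"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields

# Hypothetical INFO replication output from the master and one replica.
MASTER_INFO = """# Replication
role:master
connected_slaves:2
master_repl_offset:123456789
"""

REPLICA_INFO = """# Replication
role:slave
master_link_status:up
master_last_io_seconds_ago:34
slave_repl_offset:123001234
"""

master = parse_info(MASTER_INFO)
replica = parse_info(REPLICA_INFO)

lag_seconds = int(replica["master_last_io_seconds_ago"])
offset_divergence = int(master["master_repl_offset"]) - int(replica["slave_repl_offset"])

print(f"lag: {lag_seconds}s, offset divergence: {offset_divergence} bytes")
if lag_seconds > 30:
    print("replica lag exceeds the 30s threshold")
```

Running this per replica and comparing results shows whether lag is uniform (pointing at the master) or isolated to one replica (pointing at that host or its network path).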
2. Verify replica connectivity and link status
On the master, check `redis.clients.connected_slaves` to confirm all expected replicas are connected. On each replica, look for `master_link_down_since_seconds` being non-zero, which indicates connection drops. Frequent disconnections combined with lag point to network instability rather than just throughput issues.
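As a sketch, the connectivity check above can be automated over already-parsed INFO fields. The field names are real Redis INFO keys, but the expected replica count and the sample data are assumptions for illustration.

```python
EXPECTED_REPLICAS = 3  # assumption: your topology expects 3 replicas

def link_health(connected_slaves: int, replicas: list[dict]) -> list[str]:
    """Return human-readable connectivity warnings for a master and its replicas."""
    warnings = []
    if connected_slaves < EXPECTED_REPLICAS:
        warnings.append(
            f"only {connected_slaves}/{EXPECTED_REPLICAS} replicas connected to master"
        )
    for r in replicas:
        # master_link_down_since_seconds only appears while the link is down.
        down_for = int(r.get("master_link_down_since_seconds", 0))
        if r.get("master_link_status") != "up" or down_for > 0:
            warnings.append(
                f"{r['name']}: link {r.get('master_link_status')}, down for {down_for}s"
            )
    return warnings

# Hypothetical snapshot: one healthy replica, one with a dropped link.
replicas = [
    {"name": "replica-1", "master_link_status": "up"},
    {"name": "replica-2", "master_link_status": "down",
     "master_link_down_since_seconds": "12"},
]
for w in link_health(connected_slaves=2, replicas=replicas):
    print(w)
```

Tracking how often these warnings fire over time is what separates a one-off blip from the network instability pattern described above.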
3. Assess replication backlog adequacy
Check if `redis.replication.repl_backlog_size` is large enough for your write rate and network conditions. Calculate required backlog as (write_bytes_per_sec × max_expected_disconnection_seconds × 2). If replicas are frequently triggering full resyncs instead of partial resyncs after brief network hiccups, your backlog is too small and you're creating replication storms that increase master load.
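The sizing formula above is simple enough to sketch directly. The traffic numbers here are hypothetical; measure your real write rate from the growth of `master_repl_offset` over time.

```python
def required_backlog_bytes(write_bytes_per_sec: float,
                           max_disconnect_seconds: float,
                           safety_factor: float = 2.0) -> int:
    """write rate x worst-case disconnection window x safety margin."""
    return int(write_bytes_per_sec * max_disconnect_seconds * safety_factor)

# Example: 5 MB/s of writes, tolerate 60 s disconnections.
needed = required_backlog_bytes(5 * 1024 * 1024, 60)
print(f"set repl-backlog-size to at least {needed} bytes (~{needed / 2**20:.0f} MB)")
```

If the computed value is much larger than your configured `repl-backlog-size`, brief disconnections will fall outside the backlog window and force full resyncs.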
4. Check master resource saturation
Look at master CPU utilization, network bandwidth saturation, and memory usage. If the master is overloaded (CPU >80%, network bandwidth maxed out), it can't serve replication traffic fast enough regardless of replica or network health. High write throughput combined with complex data structures (large lists, sorted sets) amplifies this issue.
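A minimal sketch of this check, using the thresholds above. The sampled values are hypothetical; in practice they come from your host monitoring (CPU %, NIC throughput versus link capacity).

```python
def master_saturated(cpu_pct: float, net_bytes_per_sec: float,
                     link_capacity_bytes_per_sec: float) -> list[str]:
    """Return the saturation conditions currently tripped on the master."""
    tripped = []
    if cpu_pct > 80:
        tripped.append(f"CPU at {cpu_pct:.0f}% (>80%)")
    if net_bytes_per_sec / link_capacity_bytes_per_sec > 0.9:
        tripped.append("network link over 90% utilized")
    return tripped

# Example: 92% CPU, 118 MB/s on a 125 MB/s (1 Gbit) link.
for reason in master_saturated(92, 118e6, 125e6):
    print(reason)
```

If either condition trips, fix the master's headroom first; tuning replicas or the backlog will not help while the master cannot push replication traffic.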
5. Investigate network latency and bandwidth
Measure network latency and packet loss between master and replicas, especially if you see `master_link_down_since_seconds` events. Network issues combined with an undersized backlog cause the full-resync pattern described in the replication backlog insight. If replicas are in different availability zones or regions, sustained high write rates may exceed available bandwidth.
6. Determine failover risk and data loss exposure
Calculate potential data loss by multiplying current write rate by `redis.replication.master_last_io_seconds_ago` — this is roughly how much data would be lost in a failover right now. Check if applications are reading from lagging replicas and serving stale data. High `keyspace.hits` on replicas can mask the lag issue because caching appears to be working, but users are seeing outdated values.
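The exposure estimate above is a one-line calculation, sketched here with hypothetical numbers. Derive `write_bytes_per_sec` from the growth rate of `master_repl_offset`, and take the lag from the worst replica that could be promoted.

```python
def failover_loss_estimate(write_bytes_per_sec: float,
                           worst_lag_seconds: float) -> float:
    """Approximate bytes of acknowledged writes a failover would discard."""
    return write_bytes_per_sec * worst_lag_seconds

# Example: 2 MB/s write rate, worst replica 34 s behind.
loss = failover_loss_estimate(write_bytes_per_sec=2 * 1024 * 1024,
                              worst_lag_seconds=34)
print(f"estimated exposure: ~{loss / 2**20:.0f} MB of writes")
```

This number is what you weigh against the cost of throttling writes or delaying a planned failover until replicas catch up.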
Related Insights
Replica Lag Hides Behind Stale Cache Hits
critical
Redis replica falling behind master (redis.replication.master_last_io_seconds_ago increasing) continues serving stale cached data with normal hit rates, masking synchronization issues until applications encounter data inconsistencies or master failover fails.
Replication Backlog Too Small for Network Instability
warning
When redis.replication.repl_backlog_size is too small relative to write rate and network latency, replicas require full resync after brief disconnections, causing replication storms and increased master load.