Redis Failover Decision - Manual vs Automatic

Incident Response (severity: critical)

Primary Redis node is unhealthy but Sentinel hasn't triggered automatic failover - deciding whether to wait or manually promote a replica.

Prompt: Our Redis master is degraded with high latency but Sentinel hasn't triggered failover yet - help me understand if I should manually promote a replica with FAILOVER command or wait for automatic failover, and what's the data loss risk either way?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When facing a Redis failover decision, first verify that replicas are healthy and synchronized enough to be viable failover targets. Then quantify the data loss risk by checking replication offsets and understand why Sentinel hasn't triggered automatically—this helps you decide whether manual intervention is warranted or if you should address the underlying issue first.

1. Check replica synchronization status and lag
Start by checking `redis.replication.master_last_io_seconds_ago` on each replica—if this exceeds 30 seconds, replicas are significantly lagging and may not be safe failover targets. Compare `redis.replication.offset` between master and each replica to see how far behind they are; even a small offset difference represents data you'll lose on failover. The `replica-lag-hides-behind-stale-cache-hits` insight warns that replicas can appear healthy while serving stale data, so don't rely on cache hit rates to judge replica health.
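The viability check above can be sketched as a small parser over the `# Replication` section of Redis `INFO` output. Field names (`master_link_status`, `master_last_io_seconds_ago`) match Redis's INFO format; the 30-second threshold is this playbook's rule of thumb, not a Redis default.

```python
# Sketch: judge whether a replica is a viable failover target from
# its INFO replication fields. The 30s cutoff is an assumption taken
# from the step above, not a Redis built-in.

def parse_info(raw: str) -> dict:
    """Parse the colon-delimited lines of a Redis INFO section."""
    fields = {}
    for line in raw.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

def replica_is_viable(info: dict, max_io_age_s: int = 30) -> bool:
    """Viable only if the link is up and the replica has heard from
    the master within max_io_age_s seconds."""
    link_up = info.get("master_link_status") == "up"
    io_age = int(info.get("master_last_io_seconds_ago", "9999"))
    return link_up and io_age <= max_io_age_s

sample = """# Replication
role:slave
master_link_status:up
master_last_io_seconds_ago:3
slave_repl_offset:104858000"""

print(replica_is_viable(parse_info(sample)))  # True: link up, last I/O 3s ago
```

Run this against `redis-cli INFO replication` output from each replica; any replica that fails the check should be excluded from promotion candidates before you compare offsets.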
2. Assess replication link health and connectivity
Check `redis.replication.master_link_down_since_seconds` on each replica—if non-zero, the replication link is broken and that replica isn't receiving updates. Verify `redis.clients.connected_slaves` on the master matches your expected replica count; missing replicas indicate connectivity or configuration issues. If replicas are disconnected, manual failover is risky because they're not getting recent data, but automatic failover also won't work until Sentinel detects enough healthy replicas.
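A minimal sketch of this link-health audit, assuming you have already collected `INFO` fields from the master and each replica into dicts (the dict shapes and replica names here are illustrative; `master_link_down_since_seconds` only appears in Redis INFO when the link is actually down, so a missing field is treated as healthy):

```python
# Sketch: flag replication-link problems from the master's
# connected_slaves count and each replica's link-down timer.

def link_problems(master_info: dict,
                  replica_infos: list,
                  expected_replicas: int) -> list:
    """Return human-readable problems; empty list means links look healthy."""
    problems = []
    connected = int(master_info.get("connected_slaves", "0"))
    if connected < expected_replicas:
        problems.append(
            f"master sees {connected}/{expected_replicas} replicas connected"
        )
    for name, info in replica_infos:
        down_s = int(info.get("master_link_down_since_seconds", "0"))
        if down_s > 0:
            problems.append(f"{name}: replication link down for {down_s}s")
    return problems

master = {"connected_slaves": "1"}
replicas = [
    ("replica-a", {"master_link_down_since_seconds": "0"}),
    ("replica-b", {"master_link_down_since_seconds": "42"}),
]
for p in link_problems(master, replicas, expected_replicas=2):
    print(p)
```

Any non-empty result means both manual and automatic failover are compromised until connectivity is restored, which matches the caveat in the step above.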
3. Quantify potential data loss from failover
Calculate the `redis.replication.offset` difference between the master and your best-positioned replica; this is the number of replication-stream bytes you'll lose if you fail over now. Also check `redis.rdb.changes_since_last_save` to understand how much data hasn't been persisted to disk yet; if both master and replicas fail before the next save, that data is gone. If the offset difference is growing rapidly, waiting longer increases data loss risk, which may tip the decision toward manual failover.
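The offset arithmetic is simple enough to script: subtract the best replica's offset from the master's, and sample twice to see whether the gap is growing. The sample numbers below are made up for illustration.

```python
# Sketch: quantify data at risk before deciding to fail over.
# Offsets are byte positions in the replication stream
# (master_repl_offset on the master vs slave_repl_offset on a replica).

def bytes_at_risk(master_offset: int, best_replica_offset: int) -> int:
    """Replication-stream bytes the best replica has not yet received."""
    return max(0, master_offset - best_replica_offset)

def risk_growth_rate(sample_1: tuple, sample_2: tuple, interval_s: float) -> float:
    """Bytes-at-risk growth per second between two samples taken
    interval_s apart; positive means waiting increases data loss."""
    return (bytes_at_risk(*sample_2) - bytes_at_risk(*sample_1)) / interval_s

# Two samples 10s apart: (master offset, best replica offset)
first = (1_000_000, 999_000)     # 1,000 bytes behind
second = (1_200_000, 1_179_000)  # 21,000 bytes behind
print(bytes_at_risk(*second))                 # 21000
print(risk_growth_rate(first, second, 10.0))  # 2000.0 bytes/sec
```

A positive and accelerating growth rate is the quantitative version of the step's conclusion: the longer you wait, the more you lose.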
4. Verify replication backlog is adequate for recovery
Check if `redis.replication.repl_backlog_size` is large enough for your write rate and network conditions—calculate (write_bytes_per_sec × max_expected_disconnection_seconds × 2). The `replication-backlog-too-small-for-network-instability` insight shows that undersized backlogs cause replicas to require full resync after brief disconnections, which would make them unavailable during recovery. If your backlog is too small and replicas keep disconnecting, you're in a replication storm that manual failover won't fix—you need to increase the backlog size first.
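The sizing rule from this step (write rate × longest expected disconnection × 2) works out as follows; the traffic figures are example values, not recommendations.

```python
# Sketch of the playbook's backlog sizing rule: the backlog must hold
# all writes produced during the longest expected disconnection, with
# a 2x safety factor, or replicas fall back to a full resync.

def recommended_backlog_bytes(write_bytes_per_sec: int,
                              max_disconnect_s: int,
                              safety_factor: int = 2) -> int:
    """Backlog needed so replicas can partial-resync after a disconnect."""
    return write_bytes_per_sec * max_disconnect_s * safety_factor

def backlog_is_adequate(current_backlog: int,
                        write_bps: int,
                        max_disconnect_s: int) -> bool:
    return current_backlog >= recommended_backlog_bytes(write_bps, max_disconnect_s)

# Example: 5 MB/s of writes, tolerate 60s disconnections -> 600 MB needed
needed = recommended_backlog_bytes(5 * 1024 * 1024, 60)
print(needed)  # 629145600
# Redis's default repl-backlog-size of 1 MB is far too small for this load
print(backlog_is_adequate(1 * 1024 * 1024, 5 * 1024 * 1024, 60))  # False
```

If `redis.replication.repl_backlog_size` is below the computed figure, fix that (it can be raised at runtime with `CONFIG SET repl-backlog-size`) before considering failover, per the step above.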
5. Understand why Sentinel hasn't triggered failover
Review Sentinel configuration (down-after-milliseconds, quorum settings) and check if the master is responding to Sentinel pings but just degraded with high latency. Sentinel only triggers automatic failover when the master is truly unreachable for down-after-milliseconds; if the master is slow but responsive, Sentinel considers it healthy even if application performance is terrible. Check `redis.uptime` on the master—if it's very low, the master may have recently restarted and Sentinel's timers haven't expired yet. This tells you whether waiting for automatic failover is realistic or if manual intervention is your only option.
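Sentinel's precondition as described above can be modeled with a few lines: it only acts when the master is unreachable (not merely slow) for `down-after-milliseconds` and a quorum of Sentinels agree. This is a deliberately simplified model for reasoning about the decision, not Sentinel's actual implementation.

```python
# Simplified model, per the step above: Sentinel fails over only when
# the master stops answering pings for down-after-milliseconds AND
# quorum is reached. A slow-but-responsive master never qualifies.

def sentinel_will_failover(master_answers_pings: bool,
                           unreachable_ms: int,
                           down_after_ms: int,
                           sentinels_reporting_down: int,
                           quorum: int) -> bool:
    if master_answers_pings:
        # Degraded latency doesn't count: Sentinel sees a healthy master
        # no matter how bad application performance is.
        return False
    return unreachable_ms >= down_after_ms and sentinels_reporting_down >= quorum

# Degraded-but-responsive master: Sentinel will never act on its own
print(sentinel_will_failover(True, 0, 30_000, 0, 2))        # False
# Hard-down master past the timeout with quorum reached
print(sentinel_will_failover(False, 45_000, 30_000, 2, 2))  # True
```

If your situation maps to the first case, waiting for automatic failover is unrealistic and manual promotion is the only path, exactly as the step concludes.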
6. Check for application-level issues masking as Redis problems
The `unacked-mutex-expiration-cascading-failures` insight shows a pattern where application-level mutex expirations cause cascading worker failures every 5 minutes, often misdiagnosed as Redis issues. Look for zrevrangebyscore commands executing every 5 minutes and high hget usage—if present, your 'Redis degradation' might actually be worker processes getting stuck on unacked tasks. Failover won't fix this; you need to increase mutex TTLs or fix the underlying worker issue first.
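A quick heuristic for the 5-minute signature described above: check whether a suspect command's invocation times recur on a roughly 300-second cycle. The tolerance and minimum-cycle values here are assumptions for illustration, and the timestamps would come from your own command monitoring.

```python
# Heuristic sketch: does a command (e.g. zrevrangebyscore) fire on a
# ~5-minute cycle? Regular recurrence suggests the mutex-expiration
# cascade described above rather than genuine Redis degradation.

def recurs_on_cycle(timestamps: list,
                    period_s: float = 300,
                    tolerance_s: float = 15,
                    min_cycles: int = 3) -> bool:
    """True if consecutive gaps between timestamps cluster around period_s."""
    if len(timestamps) < min_cycles + 1:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(abs(g - period_s) <= tolerance_s for g in gaps)

# Epoch-second offsets of observed zrevrangebyscore bursts
bursts = [0, 301, 598, 902, 1200]
print(recurs_on_cycle(bursts))  # True: cascading-failure signature
```

A positive result means failover would only move the symptom to the new master; fix the mutex TTLs or the stuck workers first.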

Monitoring Interfaces

Redis Native Metrics
Redis Prometheus
Redis Datadog