Redis Cluster Failover and Split-Brain Diagnosis
Critical Incident Response
A Redis cluster is experiencing failover issues or a split-brain scenario in which multiple nodes believe they are master, risking data inconsistency and loss.
Prompt: “Our Redis Sentinel is showing conflicting master nodes after a network partition. Some sentinels think node A is master while others think node B is master. We're getting write conflicts and I'm worried about data loss. How do I diagnose which node should actually be master and safely resolve this split-brain situation?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing Redis Sentinel split-brain scenarios, the first priority is identifying which node has the most authoritative data by comparing replication offsets across all candidate masters. Then verify Sentinel consensus and replication topology to understand the scope of the split. Finally, assess data divergence and network partition timing to determine safe resolution steps without data loss.
1. Compare replication offsets across conflicting master candidates
Check `redis.replication.offset` on both nodes that sentinels believe are master. This corresponds to Redis's `master_repl_offset`, a byte position in the replication stream: the node with the higher offset has processed more write operations and contains the most recent data state. This is your strongest signal for which node should be the authoritative master. If the offsets diverge significantly, you have real data divergence and need to assess which writes can safely be discarded.
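A minimal sketch of the offset comparison. The hostnames `node-a`/`node-b`, the `/tmp` paths, and the offset values are all illustrative assumptions; in production you would capture the dumps with `redis-cli -h <host> INFO replication`.

```shell
# In production, capture each candidate's replication info first:
#   redis-cli -h node-a INFO replication > /tmp/node-a.info
#   redis-cli -h node-b INFO replication > /tmp/node-b.info

# Extract master_repl_offset from an INFO replication dump (strips CR from RESP output).
extract_offset() {
  awk -F: '/^master_repl_offset:/ { gsub(/\r/, ""); print $2 }' "$1"
}

# Simulated dumps with illustrative offsets, not from a live cluster:
cat > /tmp/node-a.info <<'EOF'
role:master
master_repl_offset:1745832
EOF
cat > /tmp/node-b.info <<'EOF'
role:master
master_repl_offset:1745310
EOF

off_a=$(extract_offset /tmp/node-a.info)
off_b=$(extract_offset /tmp/node-b.info)
if [ "$off_a" -gt "$off_b" ]; then
  echo "node-a is ahead by $((off_a - off_b)) bytes of replication stream"
fi
```

The byte difference quantifies how many writes the trailing node never received.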
2. Verify current role and replication topology on each node
Run `INFO replication` on both candidate masters to see each node's self-reported role, and check `redis.clients.connected_slaves` to see how many replicas are connected to each. A true master should have replicas connected. If both nodes report `role:master` with connected replicas, you have a confirmed split-brain. If one shows zero connected replicas, it is likely the isolated node that incorrectly promoted itself.
3. Check when the network partition occurred
Look at `redis.replication.master_link_down_since_seconds` on nodes that were replicas before the split. This tells you how long those nodes have been isolated, and therefore when the partition happened. Cross-reference with `redis.replication.master_last_io_seconds_ago` to understand replication lag before the split. If the master link was down for longer than your Sentinel `down-after-milliseconds` setting (commonly 30s), that explains why failover was triggered.
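The timing comparison reduces to a unit-consistent check, sketched here with illustrative numbers (both values are assumptions, not from a live cluster):

```shell
DOWN_AFTER_MS=30000   # down-after-milliseconds from sentinel.conf (30s is a common setting)
LINK_DOWN_S=47        # master_link_down_since_seconds from a former replica's INFO replication

# Compare in the same unit (milliseconds) to decide whether failover was legitimate.
if [ $((LINK_DOWN_S * 1000)) -gt "$DOWN_AFTER_MS" ]; then
  echo "link down ${LINK_DOWN_S}s exceeds down-after-milliseconds: failover trigger expected"
else
  echo "link outage shorter than down-after-milliseconds: failover should not have fired"
fi
```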
4. Assess data loss exposure from unsaved changes
Check `redis.rdb.changes_since_last_save` on both candidate masters to understand how many write operations would be lost if you need to demote one node. If a node has accumulated thousands of changes since last save and you demote it, those writes are gone unless you have AOF enabled. This metric helps you quantify the data loss risk of each resolution path and whether you need to attempt manual data reconciliation.
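A sketch of quantifying that exposure from a captured `INFO persistence` dump. The hostname, path, and values are illustrative assumptions.

```shell
# In production, capture persistence stats for each candidate:
#   redis-cli -h node-b INFO persistence > /tmp/node-b.persist

# Pull one key's value out of an INFO dump (strips CR from RESP output).
pfield() { awk -F: -v k="$2" '$1 == k { gsub(/\r/, ""); print $2 }' "$1"; }

# Simulated dump (illustrative values):
cat > /tmp/node-b.persist <<'EOF'
rdb_changes_since_last_save:18234
aof_enabled:0
EOF

changes=$(pfield /tmp/node-b.persist rdb_changes_since_last_save)
aof=$(pfield /tmp/node-b.persist aof_enabled)
if [ "$aof" = "0" ] && [ "$changes" -gt 0 ]; then
  echo "demoting node-b would lose up to $changes unsaved writes (no AOF)"
fi
```

With `aof_enabled:1` the unsaved-RDB-changes count matters much less, since the writes are recoverable from the append-only file.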
5. Query all Sentinel instances for their view of the master
Run `SENTINEL get-master-addr-by-name <master-name>` on every Sentinel instance to see which nodes have consensus and which are outliers. Majority wins in Sentinel — if 2 out of 3 Sentinels agree on node A, that's your canonical master. If Sentinels are evenly split, you likely have a network partition that's still active, and resolving the split-brain requires first fixing the network connectivity between Sentinel instances.
6. Review Sentinel logs for failover triggers and quorum voting
Check Sentinel logs on all instances for `+sdown`, `+odown`, and `+failover` messages around the partition time. Look for which Sentinels voted for failover and whether quorum was legitimately achieved. If you see failover triggered without proper quorum (quorum configured too low, or network partition prevented majority vote), this points to a configuration issue that needs fixing after resolving the immediate split-brain to prevent recurrence.
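A sketch of the log triage. The log lines below are fabricated samples that approximate Sentinel's event format, with hypothetical PIDs, timestamps, and addresses; only the grep pattern is the point.

```shell
# Simulated Sentinel log excerpt (illustrative, approximating Sentinel's event format):
cat > /tmp/sentinel.log <<'EOF'
4821:X 10 Mar 14:02:11.321 # +sdown master mymaster 10.0.0.5 6379
4821:X 10 Mar 14:02:12.104 # +odown master mymaster 10.0.0.5 6379 #quorum 2/2
4821:X 10 Mar 14:02:13.230 # +failover-state-select-slave master mymaster 10.0.0.5 6379
4821:X 10 Mar 14:02:13.512 * +slave slave 10.0.0.7:6379 10.0.0.7 6379 @ mymaster 10.0.0.5 6379
EOF

# Pull out the failover-relevant events around the partition window:
grep -E '\+(sdown|odown|failover)' /tmp/sentinel.log
```

The `#quorum` annotation on the `+odown` line is where you verify whether the vote count actually met your configured quorum.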
7. Check for application-layer cascading failures from Redis unavailability
If your application uses Redis for distributed locks or queues, be aware that split-brain scenarios can trigger cascading worker failures. Watch for patterns where workers fail every few minutes as mutex locks expire and new workers become stuck trying to restore state. If you see this pattern correlating with `ZREVRANGEBYSCORE` commands every 5 minutes, you may need to increase mutex TTLs to prevent the worker cascade while you resolve the Redis split-brain.
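A minimal sketch of the TTL sanity check. The lock key name `locks:worker-pool`, the threshold, and the current TTL are all hypothetical assumptions; in production you would read the live value as shown in the comment.

```shell
# In production, read the live TTL of your lock key (hypothetical key name):
#   CURRENT_TTL=$(redis-cli TTL locks:worker-pool)
MIN_TTL=300      # seconds; should comfortably outlast the incident window (assumption)
CURRENT_TTL=45   # illustrative value, not from a live cluster

if [ "$CURRENT_TTL" -lt "$MIN_TTL" ]; then
  echo "mutex TTL (${CURRENT_TTL}s) is below ${MIN_TTL}s: raise it to avoid worker cascades"
fi
```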