Replica Lag Hides Behind Stale Cache Hits

critical

Replication

Updated Oct 20, 2015

A Redis replica that falls behind its master (redis.replication.master_last_io_seconds_ago increasing) keeps serving stale cached data at a normal hit rate. The healthy-looking hit rate masks the synchronization problem until applications encounter data inconsistencies or a master failover fails.

Sources

Monitoring RDS MySQL performance metrics | Datadog (www.datadoghq.com)

Technologies:

Redis: the root cause of this issue originates in Redis.

redis.replication.master_last_io_seconds_ago

redis.replication.master_link_down_since_seconds

redis.replication.offset

redis.clients.connected_slaves

redis.keyspace.hits

How to detect:

Alert when redis.replication.master_last_io_seconds_ago on a replica exceeds the expected replication delay threshold (e.g., >30 seconds), especially if redis.replication.master_link_down_since_seconds is non-zero. Track how far redis.replication.offset on each replica diverges from the master's value. Do not treat a high redis.keyspace.hits rate on a replica as a sign of health: a lagging replica still answers reads from its stale dataset, so the hit rate stays high and can mask the issue.
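The checks above can be sketched in Python as a pure function over the output of Redis's INFO replication command. This is a minimal illustration, not a monitoring agent's implementation: the function names, the 1,000,000-byte offset gap, and the handling of negative values are assumptions, while the field names (master_last_io_seconds_ago, master_link_down_since_seconds, master_repl_offset, slave_repl_offset) are the ones INFO reports.

```python
from typing import Dict, List


def parse_info_replication(raw: str) -> Dict[str, str]:
    """Turn the text of `INFO replication` into a flat {field: value} dict."""
    fields: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replica_lag_alerts(
    replica: Dict[str, str],
    master: Dict[str, str],
    max_io_seconds: int = 30,         # assumed, taken from the ">30 seconds" example
    max_offset_gap: int = 1_000_000,  # assumed byte gap; tune per workload
) -> List[str]:
    """Return the reasons a replica looks stale (empty list if it looks healthy)."""
    alerts: List[str] = []

    # Seconds since the replica last heard from the master.
    # Assumption: a negative or missing value is treated as "no recent contact".
    io_age = int(replica.get("master_last_io_seconds_ago", "-1"))
    if io_age < 0 or io_age > max_io_seconds:
        alerts.append(f"last master I/O {io_age}s ago (limit {max_io_seconds}s)")

    # Any non-zero value here means the replication link is reported down.
    down_for = int(replica.get("master_link_down_since_seconds", "0"))
    if down_for != 0:
        alerts.append(f"master link down (master_link_down_since_seconds={down_for})")

    # Byte distance between the master's and the replica's replication offsets.
    gap = int(master.get("master_repl_offset", "0")) - int(
        replica.get("slave_repl_offset", "0")
    )
    if gap > max_offset_gap:
        alerts.append(f"replica is {gap} bytes behind master")

    return alerts
```

Note that the keyspace hit rate is deliberately not an input: as described above, it stays high on a lagging replica and says nothing about freshness.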

Recommended action:

Investigate network connectivity between master and replica. Check redis.replication.repl_backlog_size to confirm the replication backlog is large enough to cover temporary disconnections; if it is too small, a reconnecting replica cannot resynchronize partially and must perform a full resync. Verify that the master is not overloaded (check redis.stats.instantaneous_ops_per_sec). Consider increasing replication timeout values (the repl-timeout setting). Implement application-level read-after-write consistency checks when serving reads from replicas.
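One possible shape for the read-after-write check is sketched below in Python. It assumes the application can observe the master's replication offset after a write and the replica's offset before a read (for example from master_repl_offset and slave_repl_offset in INFO replication); the ReadRouter class and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadRouter:
    """Decide whether a replica is fresh enough to serve a read after a write."""

    # Highest master replication offset observed right after one of our writes.
    last_write_offset: int = 0

    def record_write(self, master_repl_offset: int) -> None:
        """Remember how far the master's replication stream had advanced."""
        self.last_write_offset = max(self.last_write_offset, master_repl_offset)

    def choose_target(self, replica_repl_offset: int) -> str:
        """Read from the replica only once it has caught up to our last write."""
        if replica_repl_offset >= self.last_write_offset:
            return "replica"
        return "master"


router = ReadRouter()
router.record_write(master_repl_offset=1500)
print(router.choose_target(replica_repl_offset=1200))  # replica is behind the write
print(router.choose_target(replica_repl_offset=1500))  # replica has caught up
```

Falling back to the master trades extra master load for consistency, so in practice the offsets would likely be sampled periodically rather than fetched on every request.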