Redis Backup and Disaster Recovery Verification

Proactive Health

Verifying that the Redis backup strategy (RDB snapshots, AOF logs) is working correctly and that data can be recovered in disaster scenarios.

Prompt: “Our Redis backups show rdb_last_save_time from 6 hours ago but we configured hourly snapshots - help me verify if backups are actually running correctly, whether AOF is healthy, and how much data we'd lose if Redis crashed right now.”

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When investigating Redis backup failures, start by checking whether RDB snapshots are actually failing or are merely misconfigured, then assess your current data loss exposure by examining unsaved changes and AOF status. Finally, investigate performance bottlenecks that might prevent persistence operations from completing.

1. Check if RDB snapshots are silently failing

First, examine `redis-persistence-rdb-last-bgsave-status`. If it's 0 (error) instead of 1 (ok), your RDB saves are failing, not just delayed. (Redis itself reports this field as `err` or `ok` in `INFO persistence`.) A failing status combined with a stale rdb_last_save_time means Redis tried to save but hit an error, typically disk I/O problems, a full disk, or missing write permissions on the dump directory. Rule this out first when backups appear to stop working.

redis.persistence.rdb_last_bgsave_status
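The check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the playbook: the helper names `parse_info` and `rdb_save_state`, the one-hour `max_age_sec` default, and the sample values are all assumptions. It accepts both the `ok`/`err` strings Redis reports and the 1/0 encoding used by the metric.

```python
import time

def parse_info(text):
    """Parse raw `INFO persistence` output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        info[key] = value
    return info

def rdb_save_state(info, now=None, max_age_sec=3600):
    """Classify the RDB state as 'failing', 'stale', or 'ok'."""
    now = time.time() if now is None else now
    # Redis reports "ok"/"err"; the metric in this playbook uses 1/0.
    status = str(info.get("rdb_last_bgsave_status", "")).lower()
    if status not in ("ok", "1"):
        return "failing"
    age_sec = now - int(info["rdb_last_save_time"])
    return "stale" if age_sec > max_age_sec else "ok"

# Hypothetical sample: last save 6 hours ago and the last attempt failed.
sample = """# Persistence
rdb_changes_since_last_save:4821
rdb_last_save_time:1700000000
rdb_last_bgsave_status:err
"""
info = parse_info(sample)
print(rdb_save_state(info, now=1700000000 + 6 * 3600))
```

A result of 'stale' with a healthy status suggests a trigger or configuration issue rather than a failing save, which is what the next step examines.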

2. Verify RDB save triggers and configuration alignment

Check your Redis config for save directives (e.g., 'save 3600 1', which snapshots once at least 1 change has accumulated and 3600 seconds have passed since the last save) and compare against `redis-rdb-changes-since-last-save`. If this metric is below the change threshold of your save rule (say, under 100 changes when the rule requires 100), Redis won't trigger a snapshot even after an hour. Also verify `redis-uptime`: if Redis recently restarted, the 6-hour-old RDB file might be from before the restart rather than a sign of current backup failure.

redis.rdb.changes_since_last_save
redis.uptime
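To make the trigger logic concrete, the sketch below evaluates a `save` value against the current counters. It assumes the value arrives in the space-separated form that `CONFIG GET save` returns (for example "3600 1 300 100"); the function names and sample numbers are hypothetical.

```python
def parse_save_rules(save_value):
    """Turn a `save` value such as '3600 1 300 100' into (seconds, changes) pairs."""
    parts = save_value.split()
    if len(parts) % 2 != 0:
        raise ValueError("save value must contain seconds/changes pairs")
    return [(int(parts[i]), int(parts[i + 1])) for i in range(0, len(parts), 2)]

def rules_due(save_value, changes, seconds_since_save):
    """Return the save rules whose time and change thresholds are both met."""
    return [
        (seconds, min_changes)
        for seconds, min_changes in parse_save_rules(save_value)
        if seconds_since_save > seconds and changes >= min_changes
    ]

# Six hours since the last save with 4821 pending changes: the hourly rule
# should have fired, so a stale snapshot points to a failure, not configuration.
print(rules_due("3600 1", changes=4821, seconds_since_save=6 * 3600))

# Six hours since the last save but no writes: nothing is due, so the old
# timestamp is expected behavior.
print(rules_due("3600 1", changes=0, seconds_since_save=6 * 3600))
```

An empty `save` value yields no rules at all, which would mean automatic snapshots are disabled.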

3. Assess immediate data loss exposure

Look at `redis-rdb-changes-since-last-save` to quantify how many changes have not yet been captured in an RDB snapshot. If this number is high (thousands or more), check `redis-persistence-aof-enabled` to see how much of that is actually at risk. If AOF is enabled (value 1), you have much better durability: with appendfsync everysec you'd lose roughly the last second of writes in a crash, versus potentially 6 hours with RDB-only. This tells you whether you have a backup problem or a backup + durability problem.

redis.rdb.changes_since_last_save
redis.persistence.aof_enabled
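The exposure estimate can be expressed as a small function. This is a rough sketch under stated assumptions: the per-policy windows are approximations (the one-second figure for everysec comes from the step above, zero for always is an idealization), and the function and key names are made up for illustration. The `info` dict is assumed to hold fields from `INFO persistence`.

```python
import time

# Approximate worst-case loss window per appendfsync policy, in seconds.
# None marks a window that depends on when the operating system flushes.
AOF_LOSS_WINDOW_SEC = {"always": 0, "everysec": 1, "no": None}

def data_loss_exposure(info, appendfsync="everysec", now=None):
    """Estimate how much recent data a crash right now could lose."""
    now = time.time() if now is None else now
    unsaved = int(info.get("rdb_changes_since_last_save", 0))
    if int(info.get("aof_enabled", 0)) == 1:
        # With AOF on, the fsync policy bounds the loss, not the RDB schedule.
        return {
            "mode": "aof",
            "window_sec": AOF_LOSS_WINDOW_SEC[appendfsync],
            "rdb_unsaved_changes": unsaved,
        }
    # RDB-only: everything since the last successful snapshot is at risk.
    window = int(now - int(info["rdb_last_save_time"]))
    return {"mode": "rdb-only", "window_sec": window, "rdb_unsaved_changes": unsaved}

# Hypothetical values matching the scenario: last snapshot 6 hours ago.
rdb_only = {
    "aof_enabled": 0,
    "rdb_last_save_time": 1700000000,
    "rdb_changes_since_last_save": 4821,
}
print(data_loss_exposure(rdb_only, now=1700000000 + 6 * 3600))
```

In the RDB-only case the window is the full six hours; with AOF enabled and everysec, the same counters yield a window of about one second.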

4. Verify AOF health and rewrite patterns

If AOF is enabled, check `redis-persistence-aof-current-size` and `redis-persistence-aof-last-rewrite-time-sec` (the duration of the most recent rewrite). An AOF file that has grown to multiple GB without being rewritten points to the `aof-persistence-latency-from-synchronous-disk-writes` problem, and slow disk I/O can cascade to affect RDB saves too. If AOF rewrites are taking many seconds, your disk subsystem may be too slow to reliably handle both AOF and RDB persistence simultaneously.

AOF Persistence Latency from Synchronous Disk Writes
redis.persistence.aof_current_size
redis.persistence.aof_last_rewrite_time_sec
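A minimal sketch of these AOF checks follows, operating on a dict of `INFO persistence` fields. The thresholds (2 GB, 10 seconds) and the function name are illustrative assumptions, not recommended values. Besides the two metrics above, it also reads `aof_last_bgrewrite_status` and `aof_delayed_fsync`, two further fields from the same INFO section.

```python
def aof_findings(info, max_size_bytes=2 * 1024**3, max_rewrite_sec=10):
    """Return a list of human-readable AOF warnings (empty when healthy)."""
    if int(info.get("aof_enabled", 0)) != 1:
        return ["AOF is disabled"]
    findings = []
    size = int(info.get("aof_current_size", 0))
    # Redis reports -1 here when no rewrite has completed since startup.
    rewrite_sec = int(info.get("aof_last_rewrite_time_sec", -1))
    if size > max_size_bytes and rewrite_sec < 0:
        findings.append("large AOF and no rewrite has completed since startup")
    if rewrite_sec > max_rewrite_sec:
        findings.append(f"last rewrite took {rewrite_sec}s")
    if str(info.get("aof_last_bgrewrite_status", "ok")) != "ok":
        findings.append("last background rewrite failed")
    if int(info.get("aof_delayed_fsync", 0)) > 0:
        findings.append("fsync calls were delayed, which suggests slow disk I/O")
    return findings

# Hypothetical sample: a 3 GB AOF whose last rewrite took 42 seconds.
sample = {
    "aof_enabled": 1,
    "aof_current_size": 3 * 1024**3,
    "aof_last_rewrite_time_sec": 42,
    "aof_last_bgrewrite_status": "ok",
    "aof_delayed_fsync": 7,
}
for finding in aof_findings(sample):
    print(finding)
```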

5. Check for persistence performance bottlenecks

Examine `redis-persistence-rdb-last-bgsave-time-sec` to see how long RDB saves are taking. If this is increasing over time or exceeds 60 seconds on a moderately sized dataset, disk I/O contention is the likely cause. The `redis-appendfsync-blocking-gunicorn-timeout` insight shows how persistence issues can cascade: if appendfsync is set to 'always' or 'everysec' and disk writes take over 1 second, this can block application threads and prevent RDB snapshots from completing on time.

Redis appendfsync blocking causes Gunicorn worker timeout
redis.persistence.rdb_last_bgsave_time_sec
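One way to judge whether save durations are "increasing over time" is a least-squares slope over recent samples of the metric. The sketch below assumes you already collect those samples at regular intervals; the function name, the 60-second limit taken from the step above, and the sample series are illustrative.

```python
def bgsave_trend(samples, limit_sec=60):
    """Summarize rdb_last_bgsave_time_sec samples, ordered oldest to newest."""
    # Redis reports -1 until a background save has completed; ignore those.
    valid = [s for s in samples if s >= 0]
    if not valid:
        return {"over_limit": False, "slope": 0.0}
    n = len(valid)
    mean_x = (n - 1) / 2
    mean_y = sum(valid) / n
    denom = sum((i - mean_x) ** 2 for i in range(n))
    # Least-squares slope: seconds of extra save time per sample interval.
    slope = 0.0 if denom == 0 else sum(
        (i - mean_x) * (y - mean_y) for i, y in enumerate(valid)
    ) / denom
    return {"over_limit": valid[-1] > limit_sec, "slope": slope}

# Hypothetical series: saves slow from 12s to 70s across three snapshots.
print(bgsave_trend([12, 35, 70]))
```

A positive slope together with `over_limit` being true matches the disk I/O contention pattern described above; a flat slope with a single slow save is more likely a one-off.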

6. Review appendfsync configuration for durability tradeoffs

Check your current appendfsync setting in Redis config. If it's set to 'always', every write blocks on disk fsync, which can prevent RDB background saves from completing under write load. The `redis-appendfsync-blocking-gunicorn-timeout` insight shows this can cause worker timeouts. Consider 'everysec' for a balance between durability (1 second max loss) and allowing RDB to complete, or 'no' if RDB snapshots are your primary backup strategy and you can tolerate losing data since the last snapshot.

Redis appendfsync blocking causes Gunicorn worker timeout
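The tradeoff described in this step can be written down as a simple decision rule. This is only a sketch of the reasoning above, with hypothetical parameter names and a default one-hour snapshot interval; it is not a substitute for testing the setting under your own write load.

```python
def suggest_appendfsync(tolerable_loss_sec, rdb_is_primary_backup=False,
                        snapshot_interval_sec=3600):
    """Pick an appendfsync policy from the durability requirement."""
    # Losing everything since the last snapshot is acceptable: let the OS flush.
    if rdb_is_primary_backup and tolerable_loss_sec >= snapshot_interval_sec:
        return "no"
    # About one second of loss is acceptable: the balanced choice.
    if tolerable_loss_sec >= 1:
        return "everysec"
    # No loss is acceptable: every write waits for fsync.
    return "always"

print(suggest_appendfsync(tolerable_loss_sec=5))
print(suggest_appendfsync(tolerable_loss_sec=7200, rdb_is_primary_backup=True))
print(suggest_appendfsync(tolerable_loss_sec=0))
```

The three calls yield everysec, no, and always respectively. If the rule returns 'always', revisit steps 4 and 5 first to confirm the disk can sustain an fsync on every write.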