High Latency Spikes from Slow Commands
warning
Incident Response
Redis is experiencing intermittent latency spikes caused by slow commands, blocking operations, or persistence overhead, degrading application performance.
Prompt: “We're seeing Redis latency spike to 500ms+ randomly throughout the day, causing timeouts in our application. Our P99 latency is normally under 10ms. I need to figure out what's causing these spikes - is it slow commands, memory issues, or something with persistence? What should I look at in SLOWLOG and what other metrics correlate with these latency events?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating Redis latency spikes jumping from <10ms to 500ms+, start by examining the SLOWLOG to identify which commands are blocking the event loop. Then determine if the culprits are expensive O(N) operations on large datasets, persistence overhead from AOF/RDB, or a combination creating cascading connection pool exhaustion. Correlate command frequency patterns with spike timing to understand whether it's occasional heavy operations or sustained high-frequency calls.
1. Check the Redis SLOWLOG for command patterns
Start by examining `redis-slowlog-length` and `redis-slowlog-micros-95percentile` to confirm that slowlog entries correlate with your 500ms+ latency spikes. If the 95th percentile is approaching or exceeding 500,000 microseconds during spike windows, you've confirmed slow commands are blocking Redis's single-threaded event loop. Query the actual slowlog entries (SLOWLOG GET 100) to see which specific commands, key patterns, and argument sizes are triggering the delays—this gives you the direct evidence of what's blocking.
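Once you have the raw entries from SLOWLOG GET, it helps to filter out the noise and keep only the commands slow enough to explain the spikes. A minimal sketch, assuming entries shaped like redis-py's `slowlog_get()` output (the sample data below is illustrative, not from a real instance):

```python
# Sketch: filter SLOWLOG entries that could explain 500ms+ spikes.
# Assumes entries shaped like redis-py's slowlog_get() output; the
# sample data below is illustrative, not from a real instance.

SPIKE_THRESHOLD_US = 500_000  # 500 ms, expressed in microseconds

def spike_candidates(entries, threshold_us=SPIKE_THRESHOLD_US):
    """Return (command, duration_us) pairs at or above the threshold,
    slowest first."""
    hits = [
        (b" ".join(e["command"]).decode(), e["duration"])
        for e in entries
        if e["duration"] >= threshold_us
    ]
    return sorted(hits, key=lambda h: h[1], reverse=True)

sample = [
    {"command": [b"HGETALL", b"session:123"], "duration": 612_000},
    {"command": [b"GET", b"user:42"], "duration": 180},
    {"command": [b"KEYS", b"cache:*"], "duration": 1_450_000},
]
print(spike_candidates(sample))
# → [('KEYS cache:*', 1450000), ('HGETALL session:123', 612000)]
```

Sorting by duration rather than recency surfaces the worst offenders first, which is usually what you want when correlating against spike windows.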
2. Identify expensive O(N) operations on large datasets
Examine `redis-commands-usec` and `redis-commands-usec-per-call` to find which commands are consuming the most CPU time. The `command-latency-spikes-from-expensive-operations` insight shows that KEYS, SMEMBERS, HGETALL, and unbounded LRANGE/ZRANGE are common culprits—these have O(N) complexity and block Redis when operating on large collections. If you see KEYS with thousands of microseconds per call or SMEMBERS on sets with 10K+ members, you've found your latency source. Check `redis-command-calls` to see if these expensive operations are being called frequently enough to explain your intermittent spikes.
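The per-call cost is what separates a genuinely expensive command from a merely busy one. A small sketch of that ranking, using illustrative INFO commandstats-style numbers (not real output):

```python
# Sketch: rank commands by average microseconds per call from
# INFO commandstats-style data (illustrative numbers, not real output).

stats = {
    "keys":     {"calls": 120,       "usec": 96_000_000},  # ~800 ms/call
    "smembers": {"calls": 9_000,     "usec": 45_000_000},  # ~5 ms/call
    "get":      {"calls": 2_000_000, "usec": 6_000_000},   # ~3 us/call
}

def hot_spots(stats, min_usec_per_call=1_000):
    """Commands averaging >= min_usec_per_call, worst offenders first."""
    ranked = [
        (cmd, s["usec"] // s["calls"])
        for cmd, s in stats.items()
        if s["usec"] / s["calls"] >= min_usec_per_call
    ]
    return sorted(ranked, key=lambda r: r[1], reverse=True)

print(hot_spots(stats))
# → [('keys', 800000), ('smembers', 5000)]
```

Note how GET dominates total call volume but disappears from the ranking: high frequency with microsecond latency is healthy, while a KEYS averaging hundreds of milliseconds per call is a direct match for the spike profile.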
3. Examine AOF and RDB persistence blocking
Check if `redis-persistence-aof-last-rewrite-time-sec` or `redis-persistence-rdb-last-bgsave-time-sec` show multi-second durations that align with your latency spike timing. The `aof-persistence-latency-from-synchronous-disk-writes` insight explains that appendfsync=everysec can cause write latency when disk I/O is slow, especially as the AOF file grows between rewrites. The `redis-appendfsync-blocking-gunicorn-timeout` pattern shows this can block writes for over 1 second on caches of several hundred MB. If persistence durations exceed 1-2 seconds and correlate with your spikes, consider switching from appendfsync=everysec to appendfsync=no (accepting weaker durability) or lowering auto-aof-rewrite-percentage to trigger more frequent AOF rewrites.
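If persistence timing does line up with the spikes, the tuning direction above looks roughly like this in redis.conf. The values are illustrative, not a drop-in recommendation, and `appendfsync no` trades durability for latency:

```conf
# Illustrative redis.conf settings, not a drop-in recommendation.
# everysec fsyncs once per second but can still block writes when the
# disk is slow; "no" delegates flushing to the OS (weakest durability).
appendfsync no

# Lowering the rewrite threshold triggers AOF rewrites more often,
# keeping the file (and each rewrite) smaller. The default is 100.
auto-aof-rewrite-percentage 50
auto-aof-rewrite-min-size 64mb
```

Verify the effect by watching whether `redis-persistence-aof-last-rewrite-time-sec` drops and whether spike timing decouples from rewrite timing.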
4. Monitor for connection pool exhaustion from blocked operations
When `redis-slowlog-length` trends upward during latency events, slow operations can hold client connections and exhaust your application's connection pool—even when Redis CPU appears healthy. The `slow-query-backlog-masks-redis-connection-pool-exhaustion` insight describes this cascade failure pattern. If you're seeing application-side connection timeout errors ("unable to acquire Redis connection") concurrent with Redis latency spikes, your slow operations are blocking application threads and starving the connection pool. This is a secondary effect that amplifies the impact of the slow commands you identified in steps 1-2.
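A back-of-the-envelope Little's-law estimate makes this cascade concrete: the average number of connections held at once is roughly request rate times latency, so a latency spike alone can blow past a fixed pool. A sketch with illustrative numbers:

```python
# Sketch: Little's-law estimate of how command latency consumes a
# connection pool (numbers are illustrative).

def connections_in_use(request_rate_per_s, avg_latency_s):
    """Average concurrently held connections = arrival rate x latency."""
    return request_rate_per_s * avg_latency_s

POOL_SIZE = 50

# Healthy: 2,000 req/s at 5 ms each -> ~10 connections busy.
healthy = connections_in_use(2_000, 0.005)

# During a spike: the same traffic at 500 ms each -> ~1,000 connections
# demanded, far beyond the pool; callers block waiting for a connection.
spike = connections_in_use(2_000, 0.5)

print(healthy, spike, spike > POOL_SIZE)
# → 10.0 1000.0 True
```

This is why the pool exhausts with no change in traffic: the 100x latency increase alone multiplies connection demand 100x.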
5. Correlate command frequency patterns with spike timing
Use `redis-command-calls` alongside `redis-commands-usec` to distinguish between occasionally-slow commands and high-frequency moderate-latency commands that saturate Redis. A command averaging 50ms but called 200 times per second contributes 10 seconds of blocking per second (impossible on single-threaded Redis, so it queues), versus a 500ms command called once per minute. Look for spikes in call frequency during your latency windows—if HGETALL calls jump 10x during certain application workflows, that explains the intermittent nature of your latency spikes.
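The arithmetic above can be sketched as a utilization calculation for Redis's single command-processing thread: anything over 1.0 means work arrives faster than it can be served, so requests queue.

```python
# Sketch: estimate what fraction of Redis's single thread a command
# pattern demands (utilization > 1.0 means requests queue).

def utilization(calls_per_s, avg_latency_s):
    return calls_per_s * avg_latency_s

# 50 ms command called 200x/s: demands 10 s of work per wall-clock
# second -> massive queueing on a single-threaded server.
frequent = utilization(200, 0.050)

# 500 ms command called once a minute: <1% utilization on average,
# but each call still blocks everything behind it for half a second.
rare = utilization(1 / 60, 0.5)

print(frequent, round(rare, 4))
# → 10.0 0.0083
```

The two failure modes look different on a latency graph: sustained saturation produces broad plateaus of elevated latency, while the rare heavy command produces narrow, isolated spikes.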
Related Insights
Command Latency Spikes from Expensive O(N) Operations
warning
Certain Redis commands have O(N) complexity (KEYS, SMEMBERS, HGETALL, LRANGE without a limit) and can cause latency spikes when executed on large data structures. Tracking redis.commands.usec per command identifies hot spots.
AOF Persistence Latency from Synchronous Disk Writes
warning
When AOF persistence is enabled with appendfsync always or everysec, slow disk I/O can cause write latency spikes. redis.persistence.aof_last_rewrite_time_sec increasing significantly indicates AOF file growth without rewrite, amplifying disk I/O overhead.
Slow Query Backlog Masks Redis Connection Pool Exhaustion
warning
Redis slowlog entries accumulating (redis.slowlog.length rising) can indicate operations blocking on network or disk I/O, exhausting connection pools and causing cascading failures in dependent services even when Redis CPU appears healthy.
Redis appendfsync blocking causes Gunicorn worker timeout
critical
Relevant Metrics
Monitoring Interfaces
Redis Native Metrics