Redis CPU Saturation and Slowlog Spike

Incident Response (critical)

Redis EngineCPU is pegged at 100% while slow queries accumulate in the slowlog, causing request latency spikes and timeouts.

Prompt: Redis EngineCPU is pegged at 100% and the slowlog is filling up with queries taking over 500ms - help me identify which commands are blocking the single-threaded engine and whether I need to optimize queries or scale vertically.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When Redis CPU hits 100% with slowlog accumulation, start by examining the slowlog itself to identify which specific commands are blocking the single-threaded engine. Then trace those commands back to their CPU consumption patterns to distinguish between expensive O(N) operations, cache deletion table scans, and high-frequency moderate-cost commands. Finally, determine whether you need query optimization or vertical scaling by comparing per-command costs against overall throughput.

1. Examine slowlog entries for command patterns
Start with `redis.slowlog.length` and `redis.slowlog.micros.95percentile` to understand the severity of the backlog. If slowlog length is growing and p95 latency is above 500ms, query the slowlog directly to see which specific commands (KEYS, SMEMBERS, HGETALL, etc.) are appearing most frequently. This is your most direct view into what's blocking the single-threaded Redis engine right now.
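As a sketch of this step, the entries returned by redis-py's `slowlog_get()` (a list of dicts with `command` and `duration` fields, duration in microseconds) can be aggregated by command name to surface the worst offenders. The sample entries below are hypothetical:

```python
from collections import defaultdict

def top_slowlog_commands(entries, top_n=5):
    """Aggregate slowlog entries by command name and rank by total time."""
    totals = defaultdict(lambda: {"count": 0, "usec": 0})
    for entry in entries:
        # redis-py returns the command as a list of byte strings, e.g. [b'KEYS', b'user:*']
        name = entry["command"][0].decode().upper()
        totals[name]["count"] += 1
        totals[name]["usec"] += entry["duration"]
    return sorted(totals.items(), key=lambda kv: kv[1]["usec"], reverse=True)[:top_n]

# Hypothetical entries in the shape slowlog_get() returns
sample = [
    {"command": [b"KEYS", b"user:*"], "duration": 850_000},
    {"command": [b"HGETALL", b"session:42"], "duration": 520_000},
    {"command": [b"KEYS", b"cache:*"], "duration": 910_000},
]
print(top_slowlog_commands(sample))
```

Ranking by total microseconds rather than entry count matters here: a command that appears twice but burns 1.7 seconds is a bigger problem than one that appears often but cheaply.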
2. Identify O(N) operations consuming CPU time
Check `redis.commands.usec` per command to find total CPU time consumed, and compare it with `redis.commands.usec_per_call` to understand per-call cost. Commands like KEYS, SMEMBERS, HGETALL, and unbounded LRANGE operations have O(N) complexity and will show both high total microseconds and high per-call averages. These are your primary optimization targets—replace KEYS with SCAN, use SSCAN/HSCAN instead of full-set operations, and limit LRANGE with reasonable start/stop values.
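A minimal sketch of this comparison: the text form of `INFO commandstats` can be parsed into per-command `calls`, `usec`, and `usec_per_call`, then ranked by total CPU time. The sample numbers below are made up for illustration:

```python
def parse_commandstats(info_text):
    """Parse `INFO commandstats` text into {command: {calls, usec, usec_per_call}}."""
    stats = {}
    for line in info_text.strip().splitlines():
        if not line.startswith("cmdstat_"):
            continue
        name, fields = line.split(":", 1)
        parsed = dict(f.split("=") for f in fields.split(","))
        stats[name[len("cmdstat_"):]] = {
            "calls": int(parsed["calls"]),
            "usec": int(parsed["usec"]),
            "usec_per_call": float(parsed["usec_per_call"]),
        }
    return stats

# Hypothetical INFO commandstats output
raw = """\
cmdstat_get:calls=120000,usec=480000,usec_per_call=4.00
cmdstat_keys:calls=300,usec=255000000,usec_per_call=850000.00
cmdstat_hgetall:calls=9000,usec=54000000,usec_per_call=6000.00
"""
stats = parse_commandstats(raw)
worst = max(stats, key=lambda c: stats[c]["usec"])
print(worst, stats[worst])
```

In this made-up sample, KEYS dominates both total microseconds and per-call cost even though GET is called 400x more often, which is exactly the O(N) signature this step is looking for.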
3. Check for cache deletion operations causing table scans
Look for DEL or pattern-based deletion commands in `redis.commands.calls` that correlate with CPU spikes. Pattern-based cache invalidation (for example, `KEYS pattern` followed by `DEL`) forces Redis to scan the entire keyspace to find matching keys, which can push CPU utilization above 83% during otherwise normal traffic. If deletion operations are frequent, this is often the primary culprit and requires moving to expiration-based invalidation strategies instead.
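One common expiration-based alternative is to set TTLs with jitter so a large cohort of cache keys does not expire, and get rebuilt, all at once. A minimal sketch, where the key name, base TTL, and jitter fraction are purely illustrative:

```python
import random

def jittered_ttl(base_seconds, jitter_fraction=0.1):
    """Spread expirations so keys don't all lapse (and refill) simultaneously."""
    jitter = base_seconds * jitter_fraction
    return int(base_seconds + random.uniform(-jitter, jitter))

# With redis-py this would be used roughly as (connection details hypothetical):
#   r.set("cache:user:42", payload, ex=jittered_ttl(3600))
ttl = jittered_ttl(3600)
print(ttl)
```

Letting Redis expire keys lazily avoids the keyspace scans that pattern-based deletion triggers, at the cost of briefly serving stale data until the TTL lapses.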
4. Compare command frequency against per-call cost
Cross-reference `redis.commands.calls` with `redis.commands.usec` and `redis.commands.usec_per_call` to find high-frequency commands. A command taking 10ms per call but executed 1000 times per second (10 seconds of CPU time) is far more damaging than a 100ms command executed once per minute. Focus optimization efforts on commands with high total CPU time, not just high per-call latency.
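The arithmetic above can be made concrete; `cpu_seconds_per_second` is a hypothetical helper for illustration, not a Redis metric:

```python
def cpu_seconds_per_second(calls_per_sec, usec_per_call):
    """Engine CPU time consumed per wall-clock second by one command."""
    return calls_per_sec * usec_per_call / 1_000_000

# The frequent moderate-cost command dominates:
frequent = cpu_seconds_per_second(1000, 10_000)  # 10 ms per call, 1000 calls/s
rare = cpu_seconds_per_second(1 / 60, 100_000)   # 100 ms per call, once a minute
print(frequent, rare)  # frequent is 10.0 CPU-seconds/second; rare is ~0.0017
```

Note that 10 CPU-seconds demanded per wall-clock second is far beyond what a single-threaded engine can serve, which is why total CPU time, not per-call latency, should drive the optimization order.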
5. Assess whether connection pool exhaustion is amplifying the issue
If `redis.slowlog.length` is climbing while slow operations accumulate, check if these long-running commands are exhausting connection pools and blocking application threads. This creates a cascading failure where even healthy Redis operations appear slow because clients are queued waiting for connections. If this pattern exists alongside persistence operations (RDB snapshots), the issue may be I/O-bound rather than CPU-bound, requiring investigation into disk performance.
6. Evaluate overall throughput to determine scaling needs
Finally, check `redis.stats.instantaneous_ops_per_sec` to understand if the workload itself has simply outgrown your Redis instance. If you've optimized away expensive O(N) operations and cache deletions but CPU is still saturated at high but reasonable ops/sec, you likely need vertical scaling (larger instance) or horizontal scaling (sharding/clustering). Compare your current ops/sec against your instance's documented capacity to make this determination.
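The decision at the end of this step can be sketched as a rough triage rule. The thresholds here (80% CPU, 50%-of-capacity ops/sec) are illustrative assumptions, not Redis defaults; substitute your instance's documented capacity:

```python
def scaling_recommendation(ops_per_sec, capacity_ops_per_sec, cpu_pct):
    """Rough triage: saturated CPU at sane throughput points to scaling, not queries."""
    if cpu_pct < 80:
        return "healthy"
    if ops_per_sec < 0.5 * capacity_ops_per_sec:
        # CPU is hot but throughput is low: expensive commands are the problem
        return "optimize queries"
    # Throughput is near documented capacity: the workload has outgrown the instance
    return "scale up or shard"

print(scaling_recommendation(90_000, 100_000, 100))  # scale up or shard
```

The middle branch captures the core insight of this playbook: 100% CPU at modest ops/sec means the engine is burning time on a few expensive commands, and scaling up would only postpone the problem.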

Related Insights

Command Latency Spikes from Expensive O(N) Operations
warning
Certain Redis commands have O(N) complexity (KEYS, SMEMBERS, HGETALL, LRANGE without limits) and can cause latency spikes when executed on large data structures. Tracking redis.commands.usec per command identifies these hot spots.
Slow Query Backlog Masks Redis Connection Pool Exhaustion
warning
Redis slowlog entries accumulating (redis.slowlog.length rising) can indicate operations blocking on network or disk I/O, exhausting connection pools and causing cascading failures in dependent services even when Redis CPU appears healthy.
Redis CPU utilization spikes during cache key deletions
critical
Serial Execution Masking Redis Cache Effectiveness
warning
Event loop blocking creates false appearance of cache ineffectiveness - Redis cache hits are fast individually, but serial request processing prevents concurrent cache lookups from improving overall throughput during traffic bursts.

Monitoring Interfaces

Redis Datadog
Redis Prometheus
Redis Native Metrics