Redis Memory Eviction Crisis During Peak Traffic

Critical Incident Response

Production Redis cache is evicting keys aggressively and cache hit rate is dropping, risking database overload and degraded performance.

Prompt: My Redis cache is evicting keys like crazy and the cache hit rate just dropped from 95% to 60% - help me figure out if I need to scale up, tune eviction policies, or if there's a memory leak in the application.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When Redis evictions spike and cache hit rates drop, start by confirming memory pressure against configured limits, then check fragmentation before looking at eviction policy fit and TTL patterns. This systematic approach distinguishes among three root causes: insufficient capacity, configuration misfit, and application-side memory leaks.

1. Confirm memory pressure and quantify eviction severity
First, check if `redis.memory.used` is consistently above 90% of `redis.memory.maxmemory` — this is the primary trigger for evictions. Compare current `redis.keys.evicted` rate to your baseline; a 3-5x spike indicates acute pressure. Calculate your cache hit rate using `redis.keyspace.hits / (redis.keyspace.hits + redis.keyspace.misses)` to confirm it's actually degraded. This establishes whether you're truly memory-constrained or if something else is causing cache misses.
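The hit-rate and eviction checks above can be sketched in Python. The `keyspace_hits`, `keyspace_misses`, and `evicted_keys` field names match Redis's `INFO` output; the sample values and the historical eviction baseline are hypothetical.

```python
# Sketch: compute cache hit rate and eviction spike factor from the text
# produced by `redis-cli INFO stats`. Sample values below are illustrative.

def parse_info(info_text: str) -> dict:
    """Parse `key:value` lines from a Redis INFO dump into a dict."""
    fields = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def hit_rate(stats: dict) -> float:
    """keyspace_hits / (keyspace_hits + keyspace_misses)."""
    hits = int(stats["keyspace_hits"])
    misses = int(stats["keyspace_misses"])
    return hits / (hits + misses) if hits + misses else 1.0

sample = """# Stats
keyspace_hits:600000
keyspace_misses:400000
evicted_keys:52000
"""
stats = parse_info(sample)
print(f"hit rate: {hit_rate(stats):.0%}")        # 60%, down from the 95% baseline
baseline_evictions = 10_000                       # assumed historical baseline
spike = int(stats["evicted_keys"]) / baseline_evictions
print(f"eviction spike: {spike:.1f}x baseline")   # 5.2x, in the acute-pressure range
```

In practice you would feed this the live `INFO stats` output and compare the spike factor against your own monitoring baseline rather than a hard-coded constant.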
2. Check memory fragmentation ratio
Calculate `redis.memory.fragmentation_ratio` (RSS / used memory). If it's consistently above 1.5, you're experiencing fragmentation that's amplifying memory pressure — your actual physical memory consumption is 50%+ higher than logical usage. This means Redis might be hitting limits even when `redis.memory.used` looks healthy. High fragmentation combined with evictions suggests you need defragmentation (activedefrag on Redis 4.0+) or a controlled restart during low traffic, not necessarily more capacity.
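The ratio is simple arithmetic over two `INFO memory` fields. A minimal sketch, with illustrative byte counts:

```python
# Sketch: derive mem_fragmentation_ratio (RSS / logical used memory) and
# flag the >1.5 threshold described above. Values are illustrative.

def fragmentation_ratio(used_memory_rss: int, used_memory: int) -> float:
    """Physical (RSS) memory divided by logical used memory, as Redis reports it."""
    return used_memory_rss / used_memory

rss = 6_400_000_000      # 6.4 GB resident (used_memory_rss)
used = 4_000_000_000     # 4.0 GB logical (used_memory)
ratio = fragmentation_ratio(rss, used)
print(f"fragmentation ratio: {ratio:.2f}")  # 1.60, above the 1.5 threshold
if ratio > 1.5:
    print("consider activedefrag (Redis >= 4.0) or a controlled restart")
```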
3. Verify eviction policy matches your workload
Check your configured `maxmemory-policy` (via `CONFIG GET maxmemory-policy` or the config file). If you're storing session data or critical user state, `allkeys-lru` will indiscriminately evict sessions and force logouts; you need `volatile-lru` or `noeviction` instead. For pure cache workloads, `allkeys-lru` is appropriate. A policy mismatch evicts hot data that should be retained, tanking your hit rate even with adequate memory. Session loss from exactly this mismatch is a common production incident pattern.
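The mismatch check can be sketched as a lookup table. The policy names are real `maxmemory-policy` values; the workload labels and the mapping itself are assumptions you would adapt to your own data model.

```python
# Sketch: flag a maxmemory-policy mismatch for a given workload type.
# Workload labels are hypothetical; policy names are real Redis values.

SAFE_POLICIES = {
    # Session/state data must never be evicted arbitrarily.
    "session_store": {"volatile-lru", "volatile-ttl", "noeviction"},
    # A pure cache can evict any key under pressure.
    "pure_cache": {"allkeys-lru", "allkeys-lfu"},
}

def policy_mismatch(workload: str, configured_policy: str) -> bool:
    """True if the configured policy risks evicting data the workload must keep."""
    return configured_policy not in SAFE_POLICIES[workload]

# A session store running allkeys-lru will evict live sessions (forced logouts):
print(policy_mismatch("session_store", "allkeys-lru"))  # True, mismatch
print(policy_mismatch("pure_cache", "allkeys-lru"))     # False, appropriate
```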
4. Analyze TTL patterns and key accumulation
Compare `redis.db.keys` to `redis.db.expires` — if a large percentage of keys have no TTL, you're accumulating data indefinitely and will eventually exhaust memory. Check if `redis.db.keys` is growing unbounded over time. Keys without appropriate expiration will crowd out hot data even with LRU policies. This often indicates application bugs where developers forget to set TTLs on cached objects, causing gradual memory exhaustion that looks like you need more capacity but is really a configuration/code issue.
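The keys-to-expires comparison can be read straight off the `# Keyspace` section of `INFO`. The line format below matches Redis's output; the sample numbers are illustrative.

```python
# Sketch: compute TTL coverage (expires / keys) from a Keyspace INFO line.
# Sample values are illustrative.

def ttl_coverage(keyspace_line: str) -> float:
    """Parse 'db0:keys=N,expires=M,avg_ttl=...' and return expires / keys."""
    _, _, fields = keyspace_line.partition(":")
    stats = dict(item.split("=") for item in fields.split(","))
    return int(stats["expires"]) / int(stats["keys"])

line = "db0:keys=1200000,expires=180000,avg_ttl=3600000"
coverage = ttl_coverage(line)
print(f"{coverage:.0%} of keys have a TTL")  # 15%; most keys accumulate indefinitely
```

Low coverage like this points at application code that caches objects without calling `EXPIRE` (or `SET ... EX`), not at a capacity shortfall.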
5. Look for application memory leaks or unbounded growth
Track `redis.memory.peak` and `redis.memory.used` over several days. If memory usage keeps climbing without plateau, even after evictions, you likely have an application-side leak — code writing increasingly large values, duplicate keys with different timestamps, or missing cleanup logic. Correlate memory growth with application deployments or traffic patterns. If usage grows linearly with time rather than stabilizing, scaling won't solve it — you need to find and fix the leak in your application code.
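"Grows linearly rather than stabilizing" can be tested numerically with a least-squares slope over daily memory samples. A sketch with hypothetical readings:

```python
# Sketch: detect unbounded growth by fitting a least-squares slope to
# daily used_memory samples. Sample values are illustrative.

def memory_slope(samples: list[float]) -> float:
    """Least-squares slope of memory usage per sample interval (e.g. GB/day)."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Seven days of used_memory in GB, climbing steadily despite evictions:
daily_used_gb = [3.1, 3.4, 3.8, 4.1, 4.5, 4.8, 5.2]
slope = memory_slope(daily_used_gb)
print(f"growth: {slope:.2f} GB/day")  # 0.35 GB/day: linear growth, suspect a leak
if slope > 0.1:  # threshold is an assumption; tune to your instance size
    print("memory grows linearly; scaling will not fix this")
```

A healthy cache under LRU shows a slope near zero once it reaches `maxmemory`; a persistent positive slope across deployments is the leak signature.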
6. Determine if you need to scale up or scale out
If you've ruled out fragmentation, policy mismatches, and leaks, and `redis.memory.used` consistently exceeds 80% of `redis.memory.maxmemory` even during normal traffic, you genuinely need more capacity. For read-heavy workloads, add read replicas to distribute load. For memory constraints, either increase `maxmemory` if physical RAM allows, or add shards/nodes to distribute data. The key is that scaling should be the last resort after confirming it's a capacity problem, not a configuration or application bug.
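The last-resort logic of the playbook can be summarized as a decision function. Thresholds follow the steps above; the input values are hypothetical measurements.

```python
# Sketch: decide whether scaling is warranted, only after fragmentation,
# policy mismatch, and leaks have been ruled out. Inputs are illustrative.

def needs_more_capacity(used: int, maxmemory: int,
                        frag_ratio: float,
                        policy_mismatched: bool,
                        leak_suspected: bool) -> bool:
    """Scaling is the last resort: fix config/application issues first."""
    if frag_ratio > 1.5 or policy_mismatched or leak_suspected:
        return False  # address the underlying issue before adding capacity
    return used / maxmemory > 0.8  # sustained >80% under normal traffic

decision = needs_more_capacity(used=7_200_000_000, maxmemory=8_000_000_000,
                               frag_ratio=1.1,
                               policy_mismatched=False,
                               leak_suspected=False)
print(decision)  # True: 90% utilization with other causes ruled out
```

Whether "more capacity" means raising `maxmemory`, adding read replicas, or sharding then depends on whether the bottleneck is memory or read throughput, as described above.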

Monitoring Interfaces

Redis Datadog
Redis Native Metrics
Redis Prometheus