SRE Scenarios - Schema

Goal-oriented SRE workflows for infrastructure technologies. Archetypal questions, diagnostic paths, and decision logic grounded in real-world evidence.

Technology	Scenario	Category	Severity
Apache Kafka	ZooKeeper to KRaft Migration Planning: We're still running Kafka with ZooKeeper and need to migrate to KRaft before upgrading to Kafka 4.0. What's involved in this migration and can we do it without downtime?	Migration	warning
Apache Kafka	Under-Replicated Partitions Appearing: I'm seeing under-replicated partitions in my Kafka cluster monitoring. ISR count is dropping on several partitions. What's causing this and how urgent is it to fix?	Incident Response	critical
Apache Kafka	Consumer Lag Spike During Peak Traffic: My Kafka consumer lag just spiked from 100 messages to 50,000 messages in the last 10 minutes. Help me figure out what's wrong and whether I need to scale up my consumer group.	Capacity Planning	critical
Kubernetes	Pod OOMKilled and Eviction Under Memory Pressure: My pods keep getting OOMKilled or evicted with 'The node was low on resource: memory'. Help me figure out if this is a resource request/limit problem, node pressure issue, or QoS class misconfiguration.	Incident Response	critical
Kubernetes	etcd Performance Degradation and High Latency: My kubectl commands are timing out and the API server is super slow. I think etcd might be the bottleneck. How do I check if etcd latency is too high and what's causing it?	Incident Response	critical
NGINX	Worker Connections Exhausted During Traffic Spike: My NGINX server is showing 'worker_connections are not enough' errors and dropping connections. How do I determine if I need to increase worker_connections or add more worker processes based on my current traffic patterns?	Capacity Planning	critical
NGINX	Upstream Backend Timeout Causing 502 Errors: I'm getting 502 Bad Gateway errors from NGINX and my logs show 'upstream timed out'. How do I tell if this is a backend performance issue versus an NGINX timeout configuration problem?	Incident Response	critical
NGINX	Rate Limiting Configuration for DDoS Protection: We're experiencing what looks like a DDoS attack with massive request spikes. How should I configure NGINX rate limiting (limit_req) with the right burst and nodelay settings to protect my backends without blocking legitimate users?	Incident Response	critical
PostgreSQL	Transaction ID Wraparound Risk: I'm getting warnings about transaction ID wraparound in my PostgreSQL logs. The age of the oldest transaction is over 1.5 billion. How urgent is this and what do I need to do?	Incident Response	critical
PostgreSQL	Transaction ID Wraparound Emergency Response: I'm getting FATAL errors about transaction ID wraparound in PostgreSQL and the database is warning it will shut down in 1 million transactions. The age of the oldest transaction is over 2 billion. What do I need to do right now to prevent an outage?	Incident Response	critical
PostgreSQL	Replication Lag Threatening Data Consistency: My PostgreSQL read replicas are falling behind the primary by 30 seconds and climbing. Help me diagnose if this is a resource bottleneck, network issue, or replication slot problem before it causes an outage.	Incident Response	critical
Redis	Redis CPU Saturation and Slowlog Spike: Redis EngineCPU is pegged at 100% and the slowlog is filling up with queries taking over 500ms. Help me identify which commands are blocking the single-threaded engine and whether I need to optimize queries or scale vertically.	Incident Response	critical
Redis	Replication Lag Causing Stale Reads and Failover Risk: Our Redis replicas are showing replication lag of 30+ seconds and we're getting complaints about stale data being served to users. I'm worried about data loss if we have a failover right now. What metrics should I check to understand why replication is lagging and how can I determine if we need to scale or if this is a temporary spike?	Incident Response	critical
Redis	Redis Memory Eviction Crisis During Peak Traffic: My Redis cache is evicting keys like crazy and the cache hit rate just dropped from 95% to 60%. Help me figure out if I need to scale up, tune eviction policies, or if there's a memory leak in the application.	Incident Response	critical