Goal-oriented SRE workflows for infrastructure technologies. Archetypal questions, diagnostic paths, and decision logic grounded in real-world evidence.

Technology Scenario Category Severity
Apache Kafka
ZooKeeper to KRaft Migration Planning
We're still running Kafka with ZooKeeper and need to migrate to KRaft before upgrading to Kafka 4.0. What's involved in this migration and can we do it without downtime?
Category: Migration; Severity: warning
Apache Kafka
Under-Replicated Partitions Appearing
I'm seeing under-replicated partitions in my Kafka cluster monitoring. ISR count is dropping on several partitions. What's causing this and how urgent is it to fix?
Category: Incident Response; Severity: critical
Apache Kafka
Consumer Lag Spike During Peak Traffic
My Kafka consumer lag just spiked from 100 messages to 50,000 messages in the last 10 minutes. Help me figure out what's wrong and whether I need to scale up my consumer group.
Category: Capacity Planning; Severity: critical
Kubernetes
Pod OOMKilled and Eviction Under Memory Pressure
My pods keep getting OOMKilled or evicted with 'The node was low on resource: memory'. Help me figure out if this is a resource request/limit problem, a node pressure issue, or a QoS class misconfiguration.
Category: Incident Response; Severity: critical
Kubernetes
etcd Performance Degradation and High Latency
My kubectl commands are timing out and the API server is super slow. I think etcd might be the bottleneck. How do I check if etcd latency is too high and what's causing it?
Category: Incident Response; Severity: critical
Kubernetes
Node NotReady Status and Kubelet Failures
One of my Kubernetes nodes just went NotReady and pods are being rescheduled. How do I figure out if it's a kubelet issue, resource exhaustion, or a network problem?
Category: Incident Response; Severity: critical
NGINX
Worker Connections Exhausted During Traffic Spike
My NGINX server is showing 'worker_connections are not enough' errors and dropping connections. How do I determine if I need to increase worker_connections or add more worker processes based on my current traffic patterns?
Category: Capacity Planning; Severity: critical
NGINX
Upstream Backend Timeout Causing 502 Errors
I'm getting 502 Bad Gateway errors from NGINX and my logs show 'upstream timed out'. How do I tell if this is a backend performance issue versus an NGINX timeout configuration problem?
Category: Incident Response; Severity: critical
NGINX
Rate Limiting Configuration for DDoS Protection
We're experiencing what looks like a DDoS attack with massive request spikes. How should I configure NGINX rate limiting (limit_req) with the right burst and nodelay settings to protect my backends without blocking legitimate users?
Category: Incident Response; Severity: critical
PostgreSQL
Replication Lag Threatening Data Consistency
My PostgreSQL read replicas are falling behind the primary by 30 seconds and climbing. Help me diagnose if this is a resource bottleneck, network issue, or replication slot problem before it causes an outage.
Category: Incident Response; Severity: critical
PostgreSQL
Transaction ID Wraparound Emergency Approaching
My PostgreSQL database is showing warnings about transaction ID wraparound with age at 1.8 billion. How urgent is this, what happens if I hit the limit, and what's the safest way to prevent emergency autovacuum or shutdown?
Category: Incident Response; Severity: critical
Redis
Redis CPU Saturation and Slowlog Spike
Redis EngineCPU is pegged at 100% and the slowlog is filling up with queries taking over 500 ms. Help me identify which commands are blocking the single-threaded engine and whether I need to optimize queries or scale vertically.
Category: Incident Response; Severity: critical
Redis
Replication Lag Causing Stale Reads and Failover Risk
Our Redis replicas are showing replication lag of 30+ seconds and we're getting complaints about stale data being served to users. I'm worried about data loss if we have a failover right now. What metrics should I check to understand why replication is lagging and how can I determine if we need to scale or if this is a temporary spike?
Category: Incident Response; Severity: critical
Redis
Redis Memory Eviction Crisis During Peak Traffic
My Redis cache is evicting keys like crazy and the cache hit rate just dropped from 95% to 60%. Help me figure out if I need to scale up, tune eviction policies, or if there's a memory leak in the application.
Category: Incident Response; Severity: critical
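
Several of the scenarios above quote concrete numbers (consumer lag growing from 100 to 50,000 messages in 10 minutes, a cache hit rate falling from 95% to 60%, a transaction ID age of 1.8 billion). As a rough illustration of how those figures might be turned into first-pass triage numbers, here is a minimal Python sketch. The function names are hypothetical and not part of any workflow listed here, and the 2**31 wraparound horizon is a simplifying assumption: PostgreSQL stops accepting new transactions somewhat before that point.

```python
# Back-of-envelope triage arithmetic for figures quoted in the scenarios.
# All function names are illustrative, not part of any tool.

XID_WRAPAROUND_HORIZON = 2**31  # assumed horizon; real safety margins sit below this


def lag_growth_per_sec(lag_start: int, lag_end: int, minutes: float) -> float:
    """Net rate (messages/second) at which producers outpace consumers."""
    return (lag_end - lag_start) / (minutes * 60)


def backend_load_multiplier(hit_pct_before: int, hit_pct_after: int) -> float:
    """How much more traffic reaches the backend when the cache hit rate drops.

    Uses integer percentages so the miss rates are computed exactly.
    """
    return (100 - hit_pct_after) / (100 - hit_pct_before)


def xid_headroom(xid_age: int) -> int:
    """Transactions remaining before the assumed wraparound horizon."""
    return XID_WRAPAROUND_HORIZON - xid_age


if __name__ == "__main__":
    print(f"Kafka lag growth: {lag_growth_per_sec(100, 50_000, 10):.2f} msg/s")
    print(f"Backend load after hit-rate drop: {backend_load_multiplier(95, 60):.1f}x")
    print(f"XID headroom: {xid_headroom(1_800_000_000):,} transactions")
```

These numbers only frame urgency; they do not diagnose a cause. For example, a lag growth rate says how far consumption trails production, not whether the fix is more consumers or a faster handler.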