Pod OOMKilled and Eviction Under Memory Pressure

Critical Incident Response

Diagnose why pods are being killed or evicted due to memory constraints and determine if resource requests/limits need adjustment.

Prompt: My pods keep getting OOMKilled or evicted with 'The node was low on resource: memory' — help me figure out if this is a resource request/limit problem, node pressure issue, or QoS class misconfiguration.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing OOMKilled or evicted pods, start by confirming the termination reason and checking whether resource requests/limits are even configured — missing or misconfigured limits are the most common culprit. Then compare actual memory usage to limits to determine if this is legitimate usage exceeding capacity or undersized limits. Finally, investigate node-level memory pressure and QoS class issues that can cause evictions even when pod-level limits aren't exceeded.

1. Confirm the actual termination reason and OOM history
First thing: verify the pod is actually being OOMKilled versus other termination reasons. Check `kubectl describe pod` for events showing 'OOMKilled' or exit code 137, and inspect container status for `lastState.terminated.reason` showing OOMKilled. This confirms you're dealing with a memory issue rather than crashes, liveness probe failures, or other termination causes. The `oom-kill-history-detected` insight will show if this is a recurring pattern.
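A quick check sequence for this step (a sketch — substitute your own pod name and namespace; these commands need access to a live cluster):

```shell
# Show recent events and the last container state, including OOMKilled
# and eviction messages
kubectl describe pod <pod> | grep -A5 "Last State"

# Read the last termination reason and exit code directly;
# "OOMKilled" with exit code 137 (128 + SIGKILL) confirms a memory kill
kubectl get pod <pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" (exit "}{.lastState.terminated.exitCode}{")"}{"\n"}{end}'
```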
2. Check if memory requests and limits are configured
Before diving deeper, verify the basics: does the pod spec actually define memory requests and limits? Missing limits let containers consume unbounded memory until the kernel OOM killer intervenes; missing requests cause the scheduler to misplace pods and create resource contention. Run `kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'` — if you see empty results, that's your smoking gun. Set explicit values like requests: 256Mi, limits: 512Mi as a starting point.
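To inspect and then set those values, something like the following works (a sketch — the Deployment name `my-app` and the 256Mi/512Mi numbers are hypothetical starting points, not recommendations for your workload):

```shell
# Empty output here means no requests or limits are set at all
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'

# Apply an explicit starting point: 256Mi requested, 512Mi hard cap
kubectl set resources deployment my-app \
  --requests=memory=256Mi --limits=memory=512Mi
```

Note that `kubectl set resources` triggers a rolling restart of the Deployment's pods, since it changes the pod template.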
3. Compare actual memory usage to configured limits
Check whether the app is legitimately exceeding its limits or if limits are just set too low. Compare `kubernetes_memory_usage` to `kubernetes_memory_limits` — if usage consistently approaches or exceeds limits (say, >90%), you need to either increase limits or investigate why the app is consuming so much memory. Use `kubectl top pod` for real-time usage. If the working set is steadily climbing toward limits, the `memory-pressure-from-working-set-growth` insight indicates you need bigger limits, memory optimization, or a memory leak investigation.
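A minimal way to put usage and limits side by side from the CLI (assumes metrics-server is installed, which `kubectl top` requires):

```shell
# Per-container real-time working set memory
kubectl top pod <pod> --containers

# Configured memory limits to compare against
kubectl get pod <pod> -o jsonpath='{range .spec.containers[*]}{.name}{": limit "}{.resources.limits.memory}{"\n"}{end}'
```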
4. Check for kernel OOM kills before Kubernetes limits are hit
Here's the tricky one: pods can be OOMKilled by the Linux kernel *before* hitting their Kubernetes memory limits, especially when requests are much lower than limits (like requests: 512Mi, limits: 2Gi). Check whether, at the moment of the OOM kill, `kubernetes_memory_usage` had climbed past `kubernetes_memory_requested` while still sitting below `kubernetes_memory_limits`. This happens when the node runs out of allocatable memory: the scheduler only reserves what pods request, so overly conservative requests let the node overcommit and exhaust memory first. Fix this by increasing requests to 70-80% of limits, or setting requests equal to limits for critical workloads.
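Applying the 70-80% guidance to the 2Gi example above might look like this (a sketch — `my-app` is a hypothetical Deployment name, and 1536Mi is simply 75% of the 2048Mi limit):

```shell
# Raise requests to ~75% of the 2Gi limit so the scheduler reserves
# memory closer to what the pod actually uses at peak
kubectl set resources deployment my-app \
  --requests=memory=1536Mi --limits=memory=2Gi
```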
5. Investigate node-level memory pressure and capacity
Check if the node itself is under memory pressure by comparing total `kubernetes_memory_capacity` to the sum of all pod requests on that node. Run `kubectl describe node` and look for 'MemoryPressure' condition or eviction events mentioning 'The node was low on resource: memory'. Even with autoscaling enabled, memory exhaustion can trigger faster than new nodes can spin up — memory isn't compressible like CPU, so exhaustion is catastrophic. The `aks-memory-pressure-pod-eviction` insight shows this often results in random low-priority pod kills.
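A few commands that surface node pressure quickly (a sketch; `kubectl top nodes` assumes metrics-server):

```shell
# MemoryPressure condition per node; "True" means the kubelet
# is actively reclaiming memory and may evict pods
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

# Current node-level usage vs. capacity
kubectl top nodes

# Sum of requests/limits already scheduled onto a node
kubectl describe node <node> | grep -A10 "Allocated resources"
```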
6. Analyze memory growth trends for leaks or spikes
Distinguish between steady memory growth (leak) and transient spikes (traffic patterns). Graph `kubernetes_memory_usage` over hours or days — if you see a linear upward trend, you likely have a memory leak that will eventually OOM regardless of limits. The `predicted-oom-within-hours` insight can forecast OOM based on growth rate, giving you advance warning. For spikes, correlate with traffic patterns, batch jobs, or cache warming. Leaks require code fixes; spikes require right-sizing limits for peak usage.
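When a dashboard isn't handy, a crude sampling loop can still reveal the shape of the curve (a sketch — samples `kubectl top` once a minute into a CSV; `mem-trend.csv` is an arbitrary filename):

```shell
# Sample pod memory every 60s; a linearly climbing series suggests a
# leak, while spikes that recover point to traffic or batch patterns
while true; do
  echo "$(date +%s),$(kubectl top pod <pod> --no-headers | awk '{print $3}')" >> mem-trend.csv
  sleep 60
done
```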
7. Review QoS class and pod priority for eviction behavior
Kubernetes evicts pods differently based on QoS class: BestEffort (no requests/limits) gets killed first, then Burstable (requests < limits), then Guaranteed (requests = limits). Check your pod's QoS with `kubectl get pod <pod> -o jsonpath='{.status.qosClass}'`. If critical pods are BestEffort or Burstable, they're vulnerable to eviction when nodes face pressure. Set requests equal to limits for Guaranteed QoS on critical workloads, and use pod disruption budgets to prevent mass evictions.
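The steps above can be sketched as follows (the names `my-app` and `my-app-pdb` are hypothetical; note a PDB guards against voluntary disruptions like drains, not OOM kills themselves):

```shell
# Current QoS class: BestEffort, Burstable, or Guaranteed
kubectl get pod <pod> -o jsonpath='{.status.qosClass}'

# Guaranteed QoS requires requests == limits for every container, e.g.:
#   resources:
#     requests: {memory: 512Mi, cpu: 250m}
#     limits:   {memory: 512Mi, cpu: 250m}

# Hypothetical PodDisruptionBudget to cap voluntary disruptions
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
EOF
```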

Related Insights

- Pod CrashLoop from OOMKill (critical) — Pods enter CrashLoopBackOff when containers are repeatedly killed by the OOM (Out of Memory) killer, indicating memory limits are too restrictive or memory leaks exist.
- Kubernetes Pod Memory OOM Before Limits Reached (critical) — Kubernetes pods may be terminated by the OS out-of-memory killer before reaching their configured memory limits, especially when memory requests are set significantly lower than limits. This disconnect between reservation and actual usage causes unexpected pod evictions.
- Memory Pressure from Working Set Growth (warning) — Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.
- Memory pressure triggers random pod eviction in AKS clusters (critical)
- Previous OOM kill detected in container history (critical)
- Predicted OOM kill within 2 hours based on memory growth trend (warning)
- Missing Kubernetes resource limits cause OOM kills (critical)
- Missing Resource Requests Cause Unpredictable Pod Placement (critical) — Pods deployed without CPU/memory requests lead to scheduler misplacement, resource contention, and OOMKilled containers. The scheduler cannot reserve appropriate resources, resulting in too many pods on single nodes and performance degradation.

Monitoring Interfaces

Kubernetes, Datadog