Pod OOMKilled and Eviction Under Memory Pressure

Critical Incident Response

Diagnose why pods are being killed or evicted due to memory constraints and determine if resource requests/limits need adjustment.

Prompt: My pods keep getting OOMKilled or evicted with 'The node was low on resource: memory' — help me figure out if this is a resource request/limit problem, node pressure issue, or QoS class misconfiguration.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing OOMKilled or evicted pods, start by confirming the termination reason and checking whether resource requests/limits are even configured — missing or misconfigured limits are the most common culprit. Then compare actual memory usage to limits to determine if this is legitimate usage exceeding capacity or undersized limits. Finally, investigate node-level memory pressure and QoS class issues that can cause evictions even when pod-level limits aren't exceeded.

1. Confirm the actual termination reason and OOM history
First thing: verify the pod is actually being OOMKilled versus other termination reasons. Check `kubectl describe pod` for events showing 'OOMKilled' or exit code 137, and inspect container status for `lastState.terminated.reason` showing OOMKilled. This confirms you're dealing with a memory issue rather than crashes, liveness probe failures, or other termination causes. The `oom-kill-history-detected` insight will show if this is a recurring pattern.
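A quick check sequence for this step (a sketch — substitute your own pod name and namespace; these commands need access to a live cluster):

```shell
# Show recent events and the last container state, including OOMKilled
# and eviction messages
kubectl describe pod <pod> | grep -A5 "Last State"

# Read the last termination reason and exit code directly;
# "OOMKilled" with exit code 137 (128 + SIGKILL) confirms a memory kill
kubectl get pod <pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" (exit "}{.lastState.terminated.exitCode}{")"}{"\n"}{end}'
```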
2. Check if memory requests and limits are configured
Before diving deeper, verify the basics: does the pod spec actually define memory requests and limits? Missing limits let containers consume unbounded memory until the kernel OOM killer intervenes; missing requests cause the scheduler to misplace pods and create resource contention. Run `kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'` — if you see empty results, that's your smoking gun. Set explicit values like requests: 256Mi, limits: 512Mi as a starting point.
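To inspect and then set those values, something like the following works (a sketch — the Deployment name `my-app` and the 256Mi/512Mi numbers are hypothetical starting points, not recommendations for your workload):

```shell
# Empty output here means no requests or limits are set at all
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'

# Apply an explicit starting point: 256Mi requested, 512Mi hard cap
kubectl set resources deployment my-app \
  --requests=memory=256Mi --limits=memory=512Mi
```

Note that `kubectl set resources` triggers a rolling restart of the Deployment's pods, since it changes the pod template.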
3. Compare actual memory usage to configured limits
Check whether the app is legitimately exceeding its limits or if limits are just set too low. Compare `kubernetes_memory_usage` to `kubernetes_memory_limits` — if usage consistently approaches or exceeds limits (say, >90%), you need to either increase limits or investigate why the app is consuming so much memory. Use `kubectl top pod` for real-time usage. If the working set is steadily climbing toward limits, the `memory-pressure-from-working-set-growth` insight indicates you need bigger limits, memory optimization, or a memory leak investigation.
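A minimal way to put usage and limits side by side from the CLI (assumes metrics-server is installed, which `kubectl top` requires):

```shell
# Per-container real-time working set memory
kubectl top pod <pod> --containers

# Configured memory limits to compare against
kubectl get pod <pod> -o jsonpath='{range .spec.containers[*]}{.name}{": limit "}{.resources.limits.memory}{"\n"}{end}'
```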
4. Check for kernel OOM kills before Kubernetes limits are hit
Here's the tricky one: pods can be OOMKilled by the Linux kernel *before* hitting their Kubernetes memory limits, especially when requests are much lower than limits (like requests: 512Mi, limits: 2Gi). Check whether, at the moment of the OOM kill, `kubernetes_memory_usage` had climbed past `kubernetes_memory_requested` while still sitting below `kubernetes_memory_limits`. This happens when the node runs out of allocatable memory: the scheduler only reserves what pods request, so overly conservative requests let the node overcommit and exhaust memory first. Fix this by increasing requests to 70-80% of limits, or setting requests equal to limits for critical workloads.
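Applying the 70-80% guidance to the 2Gi example above might look like this (a sketch — `my-app` is a hypothetical Deployment name, and 1536Mi is simply 75% of the 2048Mi limit):

```shell
# Raise requests to ~75% of the 2Gi limit so the scheduler reserves
# memory closer to what the pod actually uses at peak
kubectl set resources deployment my-app \
  --requests=memory=1536Mi --limits=memory=2Gi
```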
5. Investigate node-level memory pressure and capacity
Check if the node itself is under memory pressure by comparing total `kubernetes_memory_capacity` to the sum of all pod requests on that node. Run `kubectl describe node` and look for 'MemoryPressure' condition or eviction events mentioning 'The node was low on resource: memory'. Even with autoscaling enabled, memory exhaustion can trigger faster than new nodes can spin up — memory isn't compressible like CPU, so exhaustion is catastrophic. The `aks-memory-pressure-pod-eviction` insight shows this often results in random low-priority pod kills.
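A few commands that surface node pressure quickly (a sketch; `kubectl top nodes` assumes metrics-server):

```shell
# MemoryPressure condition per node; "True" means the kubelet
# is actively reclaiming memory and may evict pods
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

# Current node-level usage vs. capacity
kubectl top nodes

# Sum of requests/limits already scheduled onto a node
kubectl describe node <node> | grep -A10 "Allocated resources"
```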
6. Analyze memory growth trends for leaks or spikes
Distinguish between steady memory growth (leak) and transient spikes (traffic patterns). Graph `kubernetes_memory_usage` over hours or days — if you see a linear upward trend, you likely have a memory leak that will eventually OOM regardless of limits. The `predicted-oom-within-hours` insight can forecast OOM based on growth rate, giving you advance warning. For spikes, correlate with traffic patterns, batch jobs, or cache warming. Leaks require code fixes; spikes require right-sizing limits for peak usage.
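When a dashboard isn't handy, a crude sampling loop can still reveal the shape of the curve (a sketch — samples `kubectl top` once a minute into a CSV; `mem-trend.csv` is an arbitrary filename):

```shell
# Sample pod memory every 60s; a linearly climbing series suggests a
# leak, while spikes that recover point to traffic or batch patterns
while true; do
  echo "$(date +%s),$(kubectl top pod <pod> --no-headers | awk '{print $3}')" >> mem-trend.csv
  sleep 60
done
```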
7. Review QoS class and pod priority for eviction behavior
Kubernetes evicts pods differently based on QoS class: BestEffort (no requests/limits) gets killed first, then Burstable (requests < limits), then Guaranteed (requests = limits). Check your pod's QoS with `kubectl get pod <pod> -o jsonpath='{.status.qosClass}'`. If critical pods are BestEffort or Burstable, they're vulnerable to eviction when nodes face pressure. Set requests equal to limits for Guaranteed QoS on critical workloads, and use pod disruption budgets to prevent mass evictions.
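The steps above can be sketched as follows (the names `my-app` and `my-app-pdb` are hypothetical; note a PDB guards against voluntary disruptions like drains, not OOM kills themselves):

```shell
# Current QoS class: BestEffort, Burstable, or Guaranteed
kubectl get pod <pod> -o jsonpath='{.status.qosClass}'

# Guaranteed QoS requires requests == limits for every container, e.g.:
#   resources:
#     requests: {memory: 512Mi, cpu: 250m}
#     limits:   {memory: 512Mi, cpu: 250m}

# Hypothetical PodDisruptionBudget to cap voluntary disruptions
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
EOF
```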

Related Insights

- Pod CrashLoop from OOMKill (critical) — Pods enter CrashLoopBackOff when containers are repeatedly killed by the OOM (Out of Memory) killer, indicating memory limits are too restrictive or memory leaks exist.
- Kubernetes Pod Memory OOM Before Limits Reached (critical) — Kubernetes pods may be terminated by the OS out-of-memory killer before reaching their configured memory limits, especially when memory requests are set significantly lower than limits. This disconnect between reservation and actual usage causes unexpected pod evictions.
- Memory Pressure from Working Set Growth (warning) — Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.
- Memory pressure triggers random pod eviction in AKS clusters (critical)
- Previous OOM kill detected in container history (critical)
- Predicted OOM kill within 2 hours based on memory growth trend (warning)
- Missing Kubernetes resource limits cause OOM kills (critical)
- Missing Resource Requests Cause Unpredictable Pod Placement (critical) — Pods deployed without CPU/memory requests lead to scheduler misplacement, resource contention, and OOMKilled containers. The scheduler cannot reserve appropriate resources, resulting in too many pods on single nodes and performance degradation.

Monitoring Interfaces

Kubernetes, Datadog