Agent Memory Usage Spike and OOMKill

Severity: critical

Resource Contention | Updated Feb 23, 2026

CrewAI web pods are OOMKilled and restarted when their memory limits are too low for concurrent agent workloads, especially with high WEB_CONCURRENCY and RAILS_MAX_THREADS settings. The restarts cause service disruptions.
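To show why the concurrency settings matter, here is a rough back-of-the-envelope model. It assumes a Puma-style server in which WEB_CONCURRENCY sets the number of worker processes and RAILS_MAX_THREADS the threads per worker, and that each busy thread holds one agent session whose footprint is roughly crewai.session.memory.average. The function name and the per-worker and per-session figures are hypothetical inputs; replace them with measured values.

```python
def estimate_pod_memory_gib(web_concurrency: int, rails_max_threads: int,
                            worker_base_gib: float, session_avg_gib: float,
                            pod_overhead_gib: float = 0.0) -> float:
    """Hypothetical model of peak pod memory when every thread serves one session.

    Assumes WEB_CONCURRENCY = worker processes and RAILS_MAX_THREADS = threads
    per worker (Puma-style); all memory inputs are user-supplied estimates.
    """
    if web_concurrency < 1 or rails_max_threads < 1:
        raise ValueError("concurrency settings must be >= 1")
    # Upper bound on agent sessions that can be in flight at once in this pod.
    concurrent_sessions = web_concurrency * rails_max_threads
    return (pod_overhead_gib
            + web_concurrency * worker_base_gib        # fixed cost per worker
            + concurrent_sessions * session_avg_gib)   # cost per active session


# Hypothetical inputs: 0.5 GiB per worker, 0.4 GiB per agent session.
print(estimate_pod_memory_gib(4, 5, 0.5, 0.4))            # 4 workers x 5 threads
print(round(estimate_pod_memory_gib(4, 2, 0.5, 0.4), 2))  # same workers, 2 threads
```

With these made-up inputs, lowering threads from 5 to 2 drops the estimate from 10.0 to about 5.2 GiB. The point is the multiplicative relationship between the two settings, not the specific numbers.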

Sources

Troubleshooting - CrewAI Platform Helm Chart (enterprise-docs.crewai.com)

Technologies:

CrewAI: symptoms of this issue are visible in the following CrewAI metrics and logs:

crewai.pod.memory.usage

crewai.pod.memory.limit

crewai.pod.oomkill.count

crewai.session.memory.average

Kubernetes: Kubernetes metrics correlate with this issue and help confirm the diagnosis.
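One way to confirm OOMKills from the Kubernetes side is to inspect container statuses: a container that was restarted after exceeding its memory limit reports lastState.terminated.reason as OOMKilled. The sketch below parses the PodList JSON produced by `kubectl get pods -n <namespace> -o json`; the function name is hypothetical and the namespace is a placeholder.

```python
import json


def find_oomkilled(pods_json: str) -> list[tuple[str, str, int]]:
    """Return (pod, container, restartCount) for containers last terminated by OOMKilled.

    pods_json is the output of `kubectl get pods -o json` (a PodList).
    """
    pod_list = json.loads(pods_json)
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        # containerStatuses can be missing, e.g. for pods that are still pending.
        for status in pod.get("status", {}).get("containerStatuses") or []:
            # lastState records the previous termination after a restart.
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod_name,
                             status.get("name", "<unknown>"),
                             status.get("restartCount", 0)))
    return hits
```

Save the kubectl output to a file and pass its contents to find_oomkilled; a rising restartCount on the returned containers lines up with increases in crewai.pod.oomkill.count.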

How to detect:

Monitor pod memory usage trends and OOMKilled events. Track memory consumption per agent session and correlate it with the concurrency settings. Alert when memory usage exceeds 85% of the limit or when an OOMKilled event occurs. Watch for memory leak patterns, such as usage that rises steadily over time without dropping back.
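The alert rules above can be sketched as a check over a window of samples for one pod. This assumes samples of the listed CrewAI metrics are already collected in time order and that crewai.pod.oomkill.count is a cumulative counter. The class and function names and the 0.1 GiB/hour leak threshold are illustrative assumptions, not platform defaults; a least-squares slope stands in for "steady increase over time".

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class MemorySample:
    t_seconds: float     # sample timestamp in seconds
    usage_bytes: float   # crewai.pod.memory.usage
    limit_bytes: float   # crewai.pod.memory.limit
    oomkill_count: int   # crewai.pod.oomkill.count (assumed cumulative)


def usage_slope_bytes_per_hour(samples: list[MemorySample]) -> float:
    """Least-squares slope of memory usage over time, in bytes per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(s.t_seconds for s in samples) / n
    mean_u = sum(s.usage_bytes for s in samples) / n
    var_t = sum((s.t_seconds - mean_t) ** 2 for s in samples)
    if var_t == 0:
        return 0.0
    cov = sum((s.t_seconds - mean_t) * (s.usage_bytes - mean_u) for s in samples)
    return cov / var_t * 3600


def evaluate(samples: list[MemorySample], threshold: float = 0.85,
             leak_bytes_per_hour: float = 0.1 * GIB) -> list[str]:
    """Return alert messages for one pod's samples, ordered oldest first."""
    if not samples:
        return []
    alerts = []
    latest = samples[-1]
    if latest.usage_bytes > threshold * latest.limit_bytes:
        alerts.append(f"usage above {threshold:.0%} of limit")
    # Any increase of the cumulative counter inside the window means an OOMKill.
    if latest.oomkill_count > samples[0].oomkill_count:
        alerts.append("OOMKilled event in window")
    if usage_slope_bytes_per_hour(samples) > leak_bytes_per_hour:
        alerts.append("steady memory growth (possible leak)")
    return alerts
```

A single slope over a short window can be fooled by a burst of traffic, so in practice the leak check is more useful over hours or days of samples.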

Recommended action:

Increase pod memory limits based on observed usage (for example, from 12Gi to 16Gi). Reduce WEB_CONCURRENCY and RAILS_MAX_THREADS to lower the memory footprint of each pod. Configure horizontal pod autoscaling on memory usage. Profile agent workloads to identify memory leaks. Keep monitoring actual memory usage patterns so limits can be right-sized without over-provisioning.
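The sizing and autoscaling steps reduce to two small calculations. recommend_limit_gib (a hypothetical helper) picks the smallest limit that keeps an observed peak under a target utilization, reusing the 85% alert threshold as an assumed target. hpa_desired_replicas reproduces the core Kubernetes HorizontalPodAutoscaler rule, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). It leaves out the HPA's tolerance band and stabilization behavior. Also note that HPA memory utilization is measured against the container's memory request, not its limit.

```python
import math


def recommend_limit_gib(peak_usage_gib: float, target_utilization: float = 0.85,
                        step_gib: int = 1) -> int:
    """Smallest limit, rounded up to step_gib, keeping the peak at or below target."""
    raw = peak_usage_gib / target_utilization
    return math.ceil(raw / step_gib) * step_gib


def hpa_desired_replicas(current_replicas: int, current_utilization: float,
                         target_utilization: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetric / targetMetric).

    Utilization values are percentages of the memory request. The real
    controller also applies a tolerance and stabilization windows, omitted here.
    """
    return math.ceil(current_replicas * current_utilization / target_utilization)
```

For a hypothetical observed peak of 13.2 GiB, recommend_limit_gib(13.2) returns 16. Three replicas averaging 90% of their memory request against a 70% target scale to 4.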