Agent Memory Usage Spike and OOMKill
critical
CrewAI web pods are OOMKilled and restarted when their memory limits are too low for concurrent agent workloads, especially with high WEB_CONCURRENCY and RAILS_MAX_THREADS settings. The restarts cause service disruptions.
Monitor pod memory usage trends and OOMKilled events. Track memory consumption per agent session and correlate it with the concurrency settings. Alert when memory usage exceeds 85% of the pod's memory limit or when an OOMKilled event occurs. Watch for memory leak patterns (a steady increase over time that does not fall back after load subsides).
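The alerting rule and the leak check above can be sketched in a few lines of Python. This is a minimal illustration, not the monitoring stack's actual configuration: the function names, the `min_slope` parameter, and the idea of feeding in a plain list of usage samples are all assumptions made for the example. Only the 85% threshold comes from the guidance above.

```python
from statistics import mean

# From the guidance above: alert when usage exceeds 85% of the limit.
ALERT_THRESHOLD = 0.85


def usage_ratio(used_bytes: int, limit_bytes: int) -> float:
    """Fraction of the pod's memory limit currently in use."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return used_bytes / limit_bytes


def should_alert(used_bytes: int, limit_bytes: int, oom_killed: bool = False) -> bool:
    """Alert on any OOMKilled event or when usage exceeds the threshold."""
    return oom_killed or usage_ratio(used_bytes, limit_bytes) > ALERT_THRESHOLD


def growth_per_sample(samples: list[float]) -> float:
    """Least-squares slope of memory usage against the sample index."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples: list[float], min_slope: float) -> bool:
    """Heuristic: flag a steady upward trend. min_slope is a hypothetical
    tuning knob in the same units as the samples, per sampling interval."""
    return growth_per_sample(samples) >= min_slope
```

A positive slope alone does not prove a leak, since traffic growth produces the same shape. Treat the result as a prompt to profile the workload rather than as a diagnosis.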
Increase pod memory limits (e.g., from 12Gi to 16Gi) based on observed usage. Reduce WEB_CONCURRENCY and RAILS_MAX_THREADS to lower the memory footprint per pod. Implement horizontal pod autoscaling based on memory usage. Profile agent workloads to identify memory leaks. Keep monitoring actual memory usage so that limits can be right-sized without over-provisioning.
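The right-sizing arithmetic behind these steps can be sketched as follows. The function names are hypothetical, and every input (observed peak, per-worker footprint, fixed overhead) is a placeholder that would have to come from real measurements. The per-worker figure should be measured at the RAILS_MAX_THREADS value actually in use, since thread count affects it. The replica formula is the ratio rule described in the Kubernetes Horizontal Pod Autoscaler documentation, simplified here to ignore its tolerance band and stabilization behavior.

```python
import math


def recommended_limit_gib(observed_peak_gib: float, target_utilization: float = 0.85) -> int:
    """Smallest whole-GiB limit that keeps the observed peak at or below
    the target utilization (85% mirrors the alert threshold)."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(observed_peak_gib / target_utilization)


def max_workers(limit_gib: float, per_worker_gib: float,
                overhead_gib: float = 0.0, target_utilization: float = 0.85) -> int:
    """Largest WEB_CONCURRENCY that fits under the target utilization,
    assuming memory scales roughly linearly with the worker count."""
    if per_worker_gib <= 0:
        raise ValueError("per_worker_gib must be positive")
    budget = limit_gib * target_utilization - overhead_gib
    return max(0, math.floor(budget / per_worker_gib))


def hpa_desired_replicas(current_replicas: int, current_utilization: float,
                         target_utilization: float) -> int:
    """Simplified HPA rule: ceil(current_replicas * current / target)."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    return math.ceil(current_replicas * current_utilization / target_utilization)
```

For example, with a hypothetical observed peak of 13 GiB, `recommended_limit_gib(13.0)` gives 16, and with a hypothetical 3 GiB per worker plus 1 GiB of overhead, `max_workers(16, 3.0, 1.0)` gives 4. The linear-scaling assumption is a simplification, so validate any derived setting under real load before rolling it out.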