CrewAI web pods experience OOMKilled restarts when memory limits are insufficient for concurrent agent workloads, especially with high WEB_CONCURRENCY and RAILS_MAX_THREADS settings, causing service disruptions.
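A back-of-the-envelope sizing check helps pick a limit that survives peak concurrency. The per-thread and base footprints below are illustrative assumptions, not measured CrewAI numbers:

```python
# Rough memory-sizing check: the container limit should cover every
# worker/thread pair plus interpreter overhead. All figures here are
# illustrative assumptions, not measured values.

def required_limit_mb(web_concurrency: int, max_threads: int,
                      mb_per_thread: int = 150, base_mb: int = 300) -> int:
    """Estimate the memory limit needed to avoid OOMKilled restarts."""
    return base_mb + web_concurrency * max_threads * mb_per_thread

# e.g. WEB_CONCURRENCY=4, RAILS_MAX_THREADS=5 with the assumed footprints:
print(required_limit_mb(4, 5))  # 300 + 4*5*150 = 3300 MB
```

Comparing this estimate against the pod's configured memory limit flags undersized pods before they start flapping.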
CrewAI crew builds fail when BuildKit cannot authenticate to container registries, causing silent build failures and preventing crew deployment updates. Network connectivity issues or missing registry secrets compound the problem.
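A pre-build check that the target registry has credentials configured turns the silent failure into an early, explicit one. This sketch parses the standard Docker `config.json` auth file; the config path and registry host are assumptions:

```python
# Sketch: verify that local Docker/BuildKit credentials exist for the
# target registry before building, so auth failures surface early.
import json
from pathlib import Path

def has_registry_auth(registry: str,
                      config_path: str = "~/.docker/config.json") -> bool:
    """True if config.json has an auth entry for the given registry host."""
    path = Path(config_path).expanduser()
    if not path.exists():
        return False
    auths = json.loads(path.read_text()).get("auths", {})
    return registry in auths

if not has_registry_auth("registry.example.com"):
    print("no credentials for registry.example.com; run `docker login` first")
```

In Kubernetes, the equivalent check is confirming the `dockerconfigjson`-type secret referenced by the build exists and is non-empty.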
Multi-agent CrewAI systems, especially hierarchical processes with extended deliberations, can consume API tokens at unsustainable rates, causing budget overruns and operational cost spikes without proper monitoring.
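A hard per-crew token budget is a simple guard against runaway deliberation costs. This is a minimal sketch; the `TokenBudget` class, exception name, and numbers are illustrative, not a CrewAI API:

```python
# Sketch of a per-crew token budget guard: accumulate usage reported
# after each LLM call and abort the run before costs spike.

class TokenBudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"{self.used} tokens used, budget is {self.max_tokens}")

budget = TokenBudget(max_tokens=10_000)
budget.record(1_200, 800)  # fine: 2,000 of 10,000 used
```

Raising an exception (rather than just logging) forces a deliberate decision about whether a long hierarchical deliberation is worth continuing.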
Default in-memory execution of CrewAI crews means that server restarts or process crashes during long-running workflows result in complete state loss, forcing full restart of 20+ step tasks and wasting resources.
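Checkpointing each completed step to durable storage lets a restarted process resume rather than redo all prior work. A minimal sketch, assuming a simple (name, callable) step list and a local JSON file; none of this is a built-in CrewAI mechanism:

```python
# Sketch: persist each completed step's output to disk so a crashed
# long-running workflow resumes instead of restarting from step one.
import json
from pathlib import Path

def run_with_checkpoints(steps, state_file="crew_state.json"):
    """steps: list of (name, fn) pairs; returns {name: result}."""
    path = Path(state_file)
    done = json.loads(path.read_text()) if path.exists() else {}
    for name, fn in steps:
        if name in done:
            continue  # completed before the crash; skip re-execution
        done[name] = fn()
        path.write_text(json.dumps(done))  # checkpoint after every step
    return done
```

On a rerun after a crash, previously completed steps are skipped, so only the remaining portion of a 20+ step task is re-executed.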
Non-deterministic LLM behavior and tool failures can cause agent task success rates to drift downward over time without obvious errors, impacting output quality and user satisfaction.
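Because no single failure raises an error, drift is best caught with a rolling-window success-rate monitor. A minimal sketch with illustrative window size and threshold:

```python
# Sketch: a rolling-window success-rate monitor that flags gradual
# drift even when individual task failures raise no errors.
from collections import deque

class SuccessRateMonitor:
    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        self.results.append(success)

    def is_degraded(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        return sum(self.results) / len(self.results) < self.threshold
```

Wiring `record()` into a task-completion callback and alerting on `is_degraded()` turns a slow, invisible decline into an actionable signal.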
In hierarchical CrewAI processes, the manager agent becomes a single point of serialization, throttling overall throughput when coordinating multiple specialist agents, especially under high task volume.
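The effect can be seen with a back-of-the-envelope queueing model: if each task needs some serial manager attention plus parallelizable specialist work, throughput saturates at the inverse of the manager's per-task time no matter how many workers are added. All timings below are illustrative:

```python
# Back-of-the-envelope model of the manager bottleneck: t_manager seconds
# of serial coordination per task caps throughput at 1 / t_manager,
# regardless of how many specialist workers run in parallel.

def crew_throughput(workers: int, t_manager: float, t_worker: float) -> float:
    """Tasks per second for a hierarchical crew with a single manager."""
    return min(workers / (t_manager + t_worker), 1.0 / t_manager)

print(crew_throughput(2, 1.0, 4.0))   # 0.4 tasks/s -- workers are the limit
print(crew_throughput(20, 1.0, 4.0))  # 1.0 tasks/s -- the manager is the limit
```

Past the saturation point, adding specialists buys nothing; only reducing manager work per task (or sharding across crews) raises throughput.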
AI agents may execute self-looping behavior that is invisible in raw logs but detectable through Phoenix's graph-based trace visualization. These loops inflate latency and token costs while degrading user experience.
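The same loops can also be flagged programmatically by counting repeated (action, input) pairs in exported spans. A minimal sketch; the span dict shape is an assumption, and in practice the spans would come from a trace backend such as Phoenix:

```python
# Sketch: detect self-loops by counting repeated (action, input) pairs
# in a flat list of agent spans pulled from a trace backend.
from collections import Counter

def find_loops(spans, min_repeats: int = 3):
    """Return (action, input) pairs an agent repeated suspiciously often."""
    counts = Counter((s["action"], s["input"]) for s in spans)
    return [pair for pair, n in counts.items() if n >= min_repeats]

spans = [{"action": "search", "input": "q1"}] * 4 + \
        [{"action": "summarize", "input": "doc"}]
print(find_loops(spans))  # [('search', 'q1')]
```

Running this as a post-hoc check on traces complements the visual inspection and can feed an alert when loop counts spike.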
Agents appear to execute tools — their traces contain Action/Observation steps — but the tools are never actually invoked, so the observations are fabricated and the failures stay silent. This breaks the tool-use contract and produces incorrect outputs without obvious errors.
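One way to catch this is to record real invocations in a tool wrapper and cross-check them against the actions the trace claims. The wrapper and trace shapes below are assumptions, not CrewAI internals:

```python
# Sketch: cross-check claimed Action steps against tools that actually
# ran. Fabricated observations show up as claimed actions with no
# matching recorded invocation.

invoked = []

def tracked(name, fn):
    """Wrap a tool so every real call leaves evidence in `invoked`."""
    def wrapper(*args, **kwargs):
        invoked.append(name)  # proof the tool really ran
        return fn(*args, **kwargs)
    return wrapper

def fabricated_actions(claimed_actions):
    """Actions the trace claims happened but no tool call backs up."""
    remaining = list(invoked)
    missing = []
    for action in claimed_actions:
        if action in remaining:
            remaining.remove(action)
        else:
            missing.append(action)
    return missing

search = tracked("search", lambda q: f"results for {q}")
search("llm agents")
print(fabricated_actions(["search", "calculator"]))  # ['calculator']
```

Any non-empty result means the agent produced an Observation for a tool that never executed — exactly the silent contract break described above.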
Expired SSL certificates for CrewAI telemetry endpoints (telemetry.crewai.com) cause trace export failures, resulting in silent observability loss without stopping crew execution.
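Since crews keep running while export fails, a proactive certificate-expiry check is worth more than waiting for symptoms. This stdlib-only sketch parses the `notAfter` string format that `ssl.getpeercert()` returns; the dates below are illustrative, not the real telemetry certificate:

```python
# Sketch: compute days until a certificate's notAfter date so expiry
# can be alerted on before trace export silently breaks.
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after is in the OpenSSL format ssl.getpeercert() returns."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - (now if now is not None else time.time())) / 86_400

# e.g. a cert that expired Jan 1 2024, checked one day later (negative
# result means already expired):
print(days_until_expiry("Jan 1 00:00:00 2024 GMT",
                        now=ssl.cert_time_to_seconds("Jan 2 00:00:00 2024 GMT")))
```

In production the `notAfter` value would come from a live TLS handshake against the telemetry endpoint, with an alert when the remaining days drop below a renewal threshold.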