Self-hosted LangSmith observability stacks crash when ClickHouse runs out of disk space during async trace insertion, manifesting as NOT_ENOUGH_SPACE errors that prevent trace ingestion.
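One way to catch this condition before inserts start failing is to poll ClickHouse's `system.disks` table and flag disks that are running low. The sketch below assumes the rows have already been fetched with whatever ClickHouse client the deployment uses; the `DiskRow` type, the `disks_low_on_space` helper, and the 15% threshold are illustrative choices, not part of LangSmith or ClickHouse.

```python
from typing import Iterable, List, NamedTuple

# Query to run against ClickHouse; system.disks exposes these columns.
DISK_QUERY = "SELECT name, free_space, total_space FROM system.disks"


class DiskRow(NamedTuple):
    """One row of the query above (hypothetical container type)."""
    name: str
    free_space: int   # bytes
    total_space: int  # bytes


def disks_low_on_space(rows: Iterable[DiskRow], min_free_ratio: float = 0.15) -> List[str]:
    """Return the names of disks whose free fraction is below min_free_ratio."""
    low = []
    for row in rows:
        if row.total_space <= 0:
            continue  # skip disks that report no capacity
        if row.free_space / row.total_space < min_free_ratio:
            low.append(row.name)
    return low


if __name__ == "__main__":
    gib = 1024 ** 3
    sample = [
        DiskRow("default", free_space=5 * gib, total_space=100 * gib),
        DiskRow("cold", free_space=400 * gib, total_space=1000 * gib),
    ]
    print(disks_low_on_space(sample))
```

The threshold is an assumption to tune per deployment; background merges need temporary working space, so alerting well before the disk is full is the safer choice.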
Fargate tasks crash with storage errors when the default 20 GiB of ephemeral storage fills up. The tasks restart repeatedly, and the restarts do not resolve the problem until the storage allocation is increased.
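As a rough illustration of raising the allocation, the sketch below builds keyword arguments that could be passed to boto3's `register_task_definition`, including the `ephemeralStorage` setting. The helper names, the CPU and memory sizes, and the example image are assumptions for illustration; the 21 to 200 GiB range is the configurable range for Fargate ephemeral storage.

```python
MIN_GIB = 21   # smallest configurable ephemeralStorage.sizeInGiB
MAX_GIB = 200  # largest configurable value on Fargate


def ephemeral_storage(size_gib: int) -> dict:
    """Build the ephemeralStorage block, rejecting out-of-range sizes."""
    if not MIN_GIB <= size_gib <= MAX_GIB:
        raise ValueError(f"sizeInGiB must be between {MIN_GIB} and {MAX_GIB}, got {size_gib}")
    return {"sizeInGiB": size_gib}


def task_definition_kwargs(family: str, image: str, size_gib: int) -> dict:
    """Keyword arguments for ecs_client.register_task_definition(**kwargs)."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "1024",     # 1 vCPU (illustrative size)
        "memory": "2048",  # 2 GiB (illustrative size)
        "containerDefinitions": [
            {"name": family, "image": image, "essential": True},
        ],
        "ephemeralStorage": ephemeral_storage(size_gib),
    }


if __name__ == "__main__":
    # Hypothetical family and image names.
    print(task_definition_kwargs("trace-worker", "example/trace-worker:latest", 50))
```

Validating the size locally is a design convenience: it surfaces a bad value before the API call rather than in a failed registration.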
Job startup times vary unpredictably due to Docker image pulls, cache misses, or host availability. This variability compounds when multiple jobs run in parallel, making total workflow duration unpredictable.
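The compounding effect can be seen in a small simulation: a workflow that waits for all parallel jobs is gated by the slowest one, so the chance of hitting at least one slow start grows with the number of jobs. The startup model below (a 20-second base, plus a 90-second image pull on 30% of starts) is entirely hypothetical and only meant to show the shape of the effect.

```python
import random
import statistics


def simulated_wait(parallel_jobs: int, trials: int = 2000, seed: int = 0) -> float:
    """Mean time (seconds) until the slowest of `parallel_jobs` has started."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    waits = []
    for _ in range(trials):
        # Hypothetical model: 20 s base, plus 90 s when the image cache misses.
        startups = [
            20 + (90 if rng.random() < 0.3 else 0)
            for _ in range(parallel_jobs)
        ]
        waits.append(max(startups))  # the workflow waits for the slowest job
    return statistics.mean(waits)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(simulated_wait(n), 1))
```

Under this model the probability that at least one of n jobs misses the cache is 1 - 0.7^n, which is about 94% for eight jobs, so wide fan-outs almost always pay the slow-start penalty.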
AI agents and observability sidecars consume excessive CPU when processing high trace volumes, leading to throttling that impacts agent decision-making latency and reliability.
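Throttling of this kind can usually be confirmed from inside the container by reading the cgroup CPU statistics. The sketch below parses the `nr_periods` and `nr_throttled` counters from the text of a `cpu.stat` file; the file location (for example `/sys/fs/cgroup/cpu.stat` under cgroup v2) depends on the host setup and is an assumption, as is the `throttle_ratio` helper name.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # each line is "<counter> <integer value>"
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit enforced, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    sample = """usage_usec 9000000
nr_periods 1000
nr_throttled 250
throttled_usec 5000000
"""
    print(throttle_ratio(sample))
```

The counters are cumulative, so a monitoring loop would normally compare two readings taken some seconds apart rather than use the lifetime ratio.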
High CPU reservation in ECS clusters leads to task placement failures where new tasks remain pending indefinitely, preventing service scaling and causing cascading latency issues.
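The placement failure is easier to reason about with numbers. The sketch below computes a cluster's CPU reservation from per-instance registered and remaining CPU units, and checks whether any single instance can fit a new task. The `Instance` type and helper names are hypothetical; in practice the values would come from the `registeredResources` and `remainingResources` fields that ECS reports for container instances.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    """CPU units for one container instance (1024 units = 1 vCPU)."""
    registered_cpu: int
    remaining_cpu: int


def cpu_reservation_percent(instances: List[Instance]) -> float:
    """Share of registered CPU units already reserved by running tasks."""
    total = sum(i.registered_cpu for i in instances)
    if total == 0:
        return 0.0
    reserved = sum(i.registered_cpu - i.remaining_cpu for i in instances)
    return 100.0 * reserved / total


def can_place(instances: List[Instance], task_cpu: int) -> bool:
    """A task must fit on a single instance; free units cannot be pooled."""
    return any(i.remaining_cpu >= task_cpu for i in instances)


if __name__ == "__main__":
    cluster = [Instance(2048, 512), Instance(2048, 512)]
    print(cpu_reservation_percent(cluster))  # 75.0
    print(can_place(cluster, 1024))          # False
```

In this example the cluster is only 75% reserved and has 1024 free units in total, yet a 1024-unit task stays pending because no single instance has enough room. Reservation well below 100% can therefore still block placement.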
ECS terminates containers that exceed their hard memory limit, causing unexpected task failures. Containers often fail without warning as memory usage approaches the limit.
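Two small helpers illustrate the usual handling, as a sketch under stated assumptions. The first builds the `memoryReservation` (soft) and `memory` (hard) fields of a container definition, leaving headroom between them. The second is a heuristic that inspects a stopped container's `exitCode` and `reason`, as returned when describing a task, for signs of an out-of-memory kill. The function names are hypothetical, and exit code 137 only indicates SIGKILL, so treat a match on the code alone as a hint.

```python
def container_memory_settings(soft_mib: int, hard_mib: int) -> dict:
    """Soft and hard memory limits (MiB) for an ECS container definition."""
    if soft_mib <= 0 or hard_mib <= soft_mib:
        # ECS requires the hard limit to exceed the soft reservation.
        raise ValueError("need 0 < soft_mib < hard_mib")
    return {"memoryReservation": soft_mib, "memory": hard_mib}


def looks_like_oom_kill(container: dict) -> bool:
    """Heuristic check on a stopped container's exitCode and reason."""
    reason = container.get("reason") or ""
    if "OutOfMemory" in reason:
        return True
    # 137 = 128 + SIGKILL(9); consistent with an OOM kill but not proof of one.
    return container.get("exitCode") == 137


if __name__ == "__main__":
    print(container_memory_settings(soft_mib=512, hard_mib=1024))
    stopped = {
        "exitCode": 137,
        "reason": "OutOfMemoryError: Container killed due to memory usage",
    }
    print(looks_like_oom_kill(stopped))
```

Alerting on memory utilization well below the hard limit is what removes the "without warning" part; the detection helper only explains a failure after the fact.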