Worker OOM Kills Tasks Without Clear Failure
critical | Resource Contention | Updated Jan 24, 2026
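Tasks fail with exit code -9 (SIGKILL) or lose their connection to the scheduler when the worker process exceeds its memory limit. Because SIGKILL cannot be caught or handled, the process gets no chance to write a final log entry, so task logs often show no clear error.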
How to detect:
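Monitor for task failures with exit code -9 (reported as 137 by shells and container runtimes) or 'lost connection to worker' errors. Track worker memory usage over time and correlate memory peaks with the timestamps of task failures. Check whether task heartbeats stopped shortly before each failure.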
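As a minimal sketch of the exit-code check described above, the helper below flags return codes that indicate SIGKILL. The function names are hypothetical, and the check is only a heuristic: SIGKILL can also come from a manual kill or an orchestrator timeout, so a match should be confirmed against worker memory metrics.

```python
SIGKILL_NUM = 9  # numeric value of SIGKILL on Linux and other POSIX systems


def signal_from_returncode(returncode: int):
    """Return the signal number that ended a process, or None otherwise."""
    if returncode < 0:
        # Python's subprocess convention: -N means "killed by signal N".
        return -returncode
    if returncode > 128:
        # Shell convention: 128 + N means "killed by signal N".
        return returncode - 128
    return None


def looks_like_oom_kill(returncode: int) -> bool:
    """Heuristic: True when the process was ended by SIGKILL."""
    return signal_from_returncode(returncode) == SIGKILL_NUM
```

A monitoring job could apply looks_like_oom_kill to each failed task's return code and raise an alert only when the failure also coincides with a memory peak on the same worker.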
Recommended action:
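Process data in chunks using generators or pandas chunking, and prefer stream processing over loading full datasets into memory. Increase the worker memory allocation if the workload genuinely needs it. Inside tasks, release references to large objects once they are no longer needed and call gc.collect() to reclaim them promptly. Cap executor parallelism so that the combined memory of concurrently running tasks stays within the worker's limit.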
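The chunking advice above can be sketched with the standard library alone. This example assumes CSV input with a numeric column; iter_chunks and sum_column are hypothetical helper names, not part of any scheduler or library API. Only one chunk of rows is held in memory at a time.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items from `rows`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def sum_column(lines: Iterable[str], column: str, chunk_size: int = 1000) -> float:
    """Sum one numeric CSV column without loading the whole file."""
    reader = csv.DictReader(lines)  # reads rows lazily from `lines`
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        # Each chunk is discarded after this step, bounding peak memory.
        total += sum(float(row[column]) for row in chunk)
    return total
```

Passing an open file object as `lines` streams the file from disk. With pandas, read_csv accepts a chunksize argument that yields DataFrames in a similar way.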