OOM kills drop during worker crash loops, creating a false improvement signal
warning | availability | Updated Feb 6, 2026 (via Exa)
How to detect:
OOM kill metrics decrease while workers are stuck in crash loops and restart backoff periods. The drop looks like an improvement, but the workers are not processing any tasks: a worker that is waiting out a backoff is not running, so it cannot be OOM-killed.
Recommended action:
Always correlate OOM kill metrics with pod/worker health status and task processing rates. Before treating a drop in OOM kills as an improvement, confirm that workers are actively processing tasks rather than stuck in crash loops. Monitor OOM events and worker availability together, never in isolation.
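
The correlation described above can be sketched as a small check that compares two observation windows. This is a minimal illustration, not a reference implementation: the `WindowStats` fields, the `classify_oom_drop` function, and both thresholds (`min_ready_ratio`, `min_throughput_ratio`) are hypothetical names and values chosen for the example, and the numbers would need to come from whatever metrics backend is in use.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated metrics for one observation window (hypothetical fields)."""
    oom_kills: int        # OOM kill events observed in the window
    ready_workers: int    # workers reporting healthy/ready
    total_workers: int    # workers expected to be running
    tasks_processed: int  # tasks completed in the window


def classify_oom_drop(prev: WindowStats, curr: WindowStats,
                      min_ready_ratio: float = 0.8,
                      min_throughput_ratio: float = 0.5) -> str:
    """Label a change in OOM kills between two windows.

    Thresholds are illustrative assumptions; tune them per workload.
    """
    # No decrease in OOM kills, so there is nothing to validate.
    if curr.oom_kills >= prev.oom_kills:
        return "no_drop"

    # Share of expected workers that are actually ready.
    if curr.total_workers:
        ready_ratio = curr.ready_workers / curr.total_workers
    else:
        ready_ratio = 0.0

    # Task throughput relative to the previous window.
    if prev.tasks_processed > 0:
        throughput_ratio = curr.tasks_processed / prev.tasks_processed
    else:
        throughput_ratio = 1.0 if curr.tasks_processed > 0 else 0.0

    # Fewer OOM kills with unhealthy workers or collapsed throughput
    # suggests crash loops or backoff, not a real improvement.
    if ready_ratio < min_ready_ratio or throughput_ratio < min_throughput_ratio:
        return "suspect_drop"
    return "genuine_drop"


if __name__ == "__main__":
    before = WindowStats(oom_kills=12, ready_workers=10, total_workers=10, tasks_processed=1000)
    crash_loop = WindowStats(oom_kills=2, ready_workers=3, total_workers=10, tasks_processed=50)
    print(classify_oom_drop(before, crash_loop))
```

A drop is only labelled `genuine_drop` when both worker readiness and task throughput hold up. Requiring both signals is a deliberate choice in this sketch: readiness alone can look fine briefly between restarts, while throughput shows whether work is actually being done.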