Flow runs stuck in PENDING state after worker crash
critical | availability | Updated Feb 5, 2025 (via Exa)
How to detect:
When a Kubernetes worker exits unexpectedly after a flow run has been marked PENDING but before the Kubernetes Job is scheduled, the flow run remains in PENDING indefinitely. This is a race condition: marking the flow run as PENDING and submitting the Job to Kubernetes are two separate, non-atomic steps, so a crash between them leaves the run PENDING with no Job to advance it.
Recommended action:
Monitor for flow runs that stay in PENDING beyond the expected duration; stuck runs require manual intervention to reset. Prevention: set PREFECT_CLIENT_RETRY_EXTRA_CODES so the client retries on additional HTTP status codes, which reduces worker crashes. Long-term: a fix requires storing state locally or implementing transactional job submission.
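
The monitoring step above can be sketched as a small check over flow run records. This is a minimal illustration, not Prefect's API: the FlowRunRecord class, its field names, and the find_stuck_pending helper are hypothetical, and in a real setup the records would have to be fetched from the Prefect API. The 10-minute threshold is likewise an assumption to be tuned to the expected scheduling delay.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FlowRunRecord:
    """Hypothetical, simplified view of a flow run (not Prefect's own model)."""
    run_id: str
    state: str             # e.g. "PENDING", "RUNNING", "COMPLETED"
    state_since: datetime  # when the run entered its current state


def find_stuck_pending(runs, max_pending, now=None):
    """Return the runs that have been PENDING for longer than max_pending."""
    now = now or datetime.now(timezone.utc)
    return [
        run for run in runs
        if run.state == "PENDING" and now - run.state_since > max_pending
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    runs = [
        FlowRunRecord("run-a", "PENDING", now - timedelta(minutes=30)),
        FlowRunRecord("run-b", "PENDING", now - timedelta(minutes=2)),
        FlowRunRecord("run-c", "RUNNING", now - timedelta(hours=1)),
    ]
    # Assumed threshold: anything PENDING for more than 10 minutes is suspect.
    for run in find_stuck_pending(runs, timedelta(minutes=10), now=now):
        print(f"{run.run_id} has been PENDING since {run.state_since.isoformat()}")
```

A check like this only flags candidates. Resetting a flagged run is still a manual step, as noted above, because a run that is legitimately waiting on a slow scheduler looks the same as one orphaned by a worker crash.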