Prefect, Kubernetes

Ghost runs persist when runners die without server notification

critical
availability
Updated Nov 27, 2024 (via Exa)
How to detect:

When a runner is killed (for example, by a Kubernetes node eviction), the Prefect server loses track of it. Its runs can remain stuck in the 'executing' state for hours, or be marked Failed while the runner is in fact still executing. Without a heartbeat mechanism, the server has no way to detect that a runner has died.

Recommended action:

Adopt the runner heartbeat mechanism (added in PR #16410). Configure automations that detect zombie runs when heartbeats stop, and set up timeout automations that fail and retry stuck runs. Monitor the prefect.flow_run.crash and prefect.agent.heartbeat metrics.
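
The detection rule behind a heartbeat-based automation can be sketched in plain Python. This is an illustration only and does not use the Prefect API: the RunRecord type, the find_zombie_runs helper, and the 90-second threshold are all hypothetical, and the sketch assumes the last heartbeat timestamp of each run is available from somewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunRecord:
    """Hypothetical record of a flow run; not a Prefect type."""
    run_id: str
    state: str
    last_heartbeat: datetime


def find_zombie_runs(runs, now, max_silence=timedelta(seconds=90)):
    """Return IDs of runs that claim to be executing but whose last
    heartbeat is older than max_silence (an assumed threshold)."""
    return [
        r.run_id
        for r in runs
        if r.state == "executing" and now - r.last_heartbeat > max_silence
    ]


now = datetime(2024, 11, 27, 12, 0, 0, tzinfo=timezone.utc)
runs = [
    RunRecord("run-a", "executing", now - timedelta(seconds=20)),  # healthy
    RunRecord("run-b", "executing", now - timedelta(hours=2)),     # silent for hours
    RunRecord("run-c", "completed", now - timedelta(hours=5)),     # finished, ignored
]
print(find_zombie_runs(runs, now))  # prints ['run-b']
```

A run flagged this way would be a candidate for the fail-and-retry step; the threshold should comfortably exceed the heartbeat interval so that a single delayed heartbeat does not fail a healthy run.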