Trino

Coordinator CPU exhausted polling dead tasks from terminated queries

critical
Resource Contention
Updated Oct 8, 2024 (via Exa)
Technologies:
How to detect:

The Trino coordinator reaches 98-99% CPU utilization while continuously attempting to fetch task information for queries that were terminated weeks or months earlier. TaskInfoFetcher threads repeatedly fail with connection errors when trying to reach workers that no longer have those tasks.
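One way to confirm this pattern is to scan the coordinator log for TaskInfoFetcher lines that mention old query IDs. The following is a minimal sketch, not an official tool. It relies on Trino query IDs beginning with a creation timestamp (YYYYMMDD_HHMMSS_NNNNN_xxxxx). It also assumes that the failing lines contain the string "TaskInfoFetcher" together with a query or task ID, which may not match your log layout. The function name `stale_query_ids` and the 7-day cutoff are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Trino query IDs begin with their creation timestamp:
# YYYYMMDD_HHMMSS_NNNNN_xxxxx (a task ID appends ".stage.partition.attempt").
QUERY_ID = re.compile(r"\b(\d{8}_\d{6})_\d{5}_[a-z0-9]{5}\b")


def stale_query_ids(log_lines, now, max_age=timedelta(days=7)):
    """Return {query_id: age} for TaskInfoFetcher lines that reference old queries.

    Assumption: relevant log lines contain the literal text "TaskInfoFetcher"
    and embed the query (or task) ID somewhere in the message.
    """
    stale = {}
    for line in log_lines:
        if "TaskInfoFetcher" not in line:
            continue
        for match in QUERY_ID.finditer(line):
            created = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
            age = now - created
            if age > max_age:
                stale[match.group(0)] = age
    return stale
```

Calling it with the lines of the coordinator's server log and `datetime.utcnow()` returns the query IDs still being polled long after creation. A non-empty result while few queries are running is consistent with the symptom described above.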

Recommended action:

Upgrade Trino from version 427 to a recent release that contains the fixes from PRs #20021 and #20023. Monitor coordinator CPU (trino.jvm.process_cpu_load) alongside the running query count (trino.execution.running_queries) to detect CPU usage that is disproportionate to the workload. If connection refusals persist, investigate worker node connectivity.
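The "disproportionate CPU usage" check can be sketched as a simple rule over paired samples of the two metrics. This is only an illustration under stated assumptions. It treats the CPU metric as a 0.0-1.0 fraction. The thresholds (95% CPU, at most 5 running queries, 3 consecutive samples) are arbitrary starting points to tune. The names `CoordinatorSample`, `is_disproportionate`, and `sustained` are hypothetical. How the samples are collected from your monitoring system is left out.

```python
from dataclasses import dataclass


@dataclass
class CoordinatorSample:
    process_cpu_load: float  # assumed to be a fraction between 0.0 and 1.0
    running_queries: int     # value of the running query count metric


def is_disproportionate(sample, cpu_threshold=0.95, max_running_queries=5):
    """True when CPU is near saturation although few queries are running."""
    return (sample.process_cpu_load >= cpu_threshold
            and sample.running_queries <= max_running_queries)


def sustained(samples, min_consecutive=3, **thresholds):
    """True if the condition holds for min_consecutive samples in a row."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if is_disproportionate(sample, **thresholds) else 0
        if streak >= min_consecutive:
            return True
    return False
```

Requiring several consecutive samples avoids alerting on short CPU spikes, such as those caused by planning a large query.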