Presto queries fail unexpectedly because worker nodes are terminated by cloud provider spot instance interruptions; the cause is identified faster when spot interruption metadata is available.
Presto queries fail sporadically with SocketTimeoutException when communicating with the Hive metastore, often under high query concurrency or when the metastore is under-resourced relative to query load.
Presto coordinator unable to find nodes to run queries, indicated by 'No nodes available to run the query' errors, combined with increasing queued queries while running queries remain stable or decrease.
Growing queues of tasks waiting for execution across executor pools, indicating thread pool saturation and potential query slowdown as splits wait for processing resources.
Elevated rate of user-error failures, suggesting widespread query syntax errors, schema changes, or permission problems affecting multiple queries.
High rate of abandoned queries where clients disconnect before completion, potentially indicating query timeouts, impatient users, or client application crashes.
Presto queries fail with 'Query exceeded max memory size' or 'Query exceeded local memory limit' errors, often caused by inefficient join ordering where the larger table is on the right (build) side of the join, forcing Presto to build an oversized in-memory hash table, which a broadcast join additionally replicates to every worker.
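
The condition described for the 'No nodes available to run the query' case (queued queries increasing while running queries remain stable or decrease) can be sketched as a small check over two metric windows. This is a minimal illustration, not part of Presto: the function name, the use of equal-length sample windows, and the choice to compare only the first and last sample of each window are all assumptions made for the example.

```python
from typing import Sequence


def queue_starvation(queued: Sequence[int], running: Sequence[int]) -> bool:
    """Hypothetical helper: True when queued queries rise over the window
    while running queries stay flat or fall over the same window.

    Both sequences are assumed to hold samples taken at the same times,
    oldest first.
    """
    # Need at least two aligned samples to talk about a trend.
    if len(queued) < 2 or len(queued) != len(running):
        return False

    # Assumption: a trend is judged by comparing window endpoints only.
    queued_rising = queued[-1] > queued[0]
    running_not_rising = running[-1] <= running[0]
    return queued_rising and running_not_rising


if __name__ == "__main__":
    # Queue grows from 2 to 9 while running queries drop from 10 to 8.
    print(queue_starvation([2, 5, 9], [10, 10, 8]))
    # Queue grows, but running queries grow too, so workers are picking up load.
    print(queue_starvation([2, 5, 9], [10, 12, 15]))
```

Comparing endpoints keeps the sketch short but is sensitive to noisy samples; a real alert would more likely smooth the series or require the condition to hold for several consecutive windows.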