Apache Flink

Job Restart Storm

critical
reliability

Updated Nov 18, 2025

Frequent job restarts indicate systemic instability caused by resource exhaustion, external dependency failures, or misconfiguration. Each restart adds recovery time, so repeated restarts compound into extended downtime.

How to detect:

Track flink_jobmanager_job_restarts and flink_jobmanager_job_downtime. If restarts spike and downtime accumulates while flink_jobmanager_job_uptime stays low (the job never runs for long before failing again), check the logs for OutOfMemoryError, checkpoint failures, or external dependency errors. Cross-reference with flink_taskmanager_status_jvm_memory_heap_used and the failed checkpoint metrics to separate memory pressure from checkpointing problems.
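
As a rough illustration of the detection rule above, the following Python sketch flags a restart storm from samples of the cumulative restart counter. The Sample type, the function names, and the 10-minute window with a threshold of 3 restarts are assumptions for illustration, not values from Flink or from this page. Fetching the samples from a metrics backend is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One scrape of the cumulative restart counter (hypothetical type)."""
    timestamp_s: float  # scrape time, in seconds
    restarts: int       # cumulative restart count at that time


def restarts_in_window(samples: Sequence[Sample], window_s: float) -> int:
    """Count restarts within the trailing window ending at the latest sample.

    Samples are assumed to be sorted by timestamp, oldest first.
    """
    if not samples:
        return 0
    latest = samples[-1]
    cutoff = latest.timestamp_s - window_s
    in_window = [s for s in samples if s.timestamp_s >= cutoff]
    baseline = in_window[0].restarts
    # Clamp at zero in case the counter was reset (e.g. job resubmitted).
    return max(0, latest.restarts - baseline)


def is_restart_storm(
    samples: Sequence[Sample],
    window_s: float = 600.0,   # assumed window: 10 minutes
    max_restarts: int = 3,     # assumed threshold
) -> bool:
    """Return True when the window contains more restarts than allowed."""
    return restarts_in_window(samples, window_s) > max_restarts


if __name__ == "__main__":
    history = [Sample(0, 0), Sample(300, 1), Sample(600, 2), Sample(900, 7)]
    print(restarts_in_window(history, 600.0))  # 6
    print(is_restart_storm(history))           # True
```

The window and threshold should be tuned to the job's normal behavior: a job that restarts once a week and a job that restarts six times in ten minutes call for very different responses.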

Recommended action:

Review the JobManager and TaskManager logs for recurring exception patterns. Common causes and fixes: insufficient memory (increase the allocation), checkpoint timeouts (tune the checkpoint interval and timeout), and external sink failures (verify connectivity and permissions). Configure an exponential backoff restart strategy so that a failing job does not retry immediately and amplify the storm. Monitor recovery time to confirm it meets SLOs.
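
To show why exponential backoff dampens a retry storm, the sketch below computes the delay before each successive restart attempt. It is a simplified model, not Flink's implementation: jitter and the reset of the backoff after a stable run are omitted, and the parameter values are illustrative assumptions. In Flink the corresponding behavior is selected with the exponential-delay restart strategy. Check the configuration reference for your Flink version for the exact option keys and defaults.

```python
from typing import List


def backoff_schedule(
    initial_s: float,
    multiplier: float,
    max_s: float,
    attempts: int,
) -> List[float]:
    """Return the wait (in seconds) before each of `attempts` restarts.

    The delay starts at `initial_s`, grows by `multiplier` after every
    failed attempt, and is capped at `max_s`.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


if __name__ == "__main__":
    # Assumed values: 1 s initial delay, doubling each time, capped at 60 s.
    schedule = backoff_schedule(initial_s=1.0, multiplier=2.0, max_s=60.0, attempts=8)
    print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
    print(sum(schedule))  # 183.0
```

With these example values, eight consecutive failures are spread over about three minutes instead of hitting a struggling sink or an exhausted cluster back to back. The cap matters for SLOs: it bounds how long the job waits once the underlying problem is fixed.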