Job Restart Storm
critical
Frequent job restarts indicate systemic instability caused by resource exhaustion, external dependency failures, or configuration issues. Left unaddressed, the restarts compound into extended downtime.
Track flink_jobmanager_job_restarts and flink_jobmanager_job_downtime. When restarts spike and downtime accumulates while flink_jobmanager_job_uptime remains low, the job is failing again shortly after each recovery. Search the logs for OutOfMemoryError, checkpoint failures, or external dependency errors. Cross-reference with flink_taskmanager_status_jvm_memory_heap_used and the failed checkpoint metrics to separate memory pressure from checkpointing problems.
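The detection rule above can be sketched as a small check over scraped metric samples. This is a minimal illustration, not a definitive alert rule: the `Sample` type, the `is_restart_storm` function, and the thresholds (5 restarts in 10 minutes, uptime under 60 seconds) are assumptions chosen for the example, and the uptime value is assumed to be reported in milliseconds.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of the job-level metrics (hypothetical container type)."""
    timestamp_s: float  # scrape time in seconds
    restarts: int       # cumulative value of flink_jobmanager_job_restarts
    uptime_ms: int      # flink_jobmanager_job_uptime, assumed to be milliseconds


def is_restart_storm(
    samples: Sequence[Sample],
    window_s: float = 600.0,      # assumed look-back window
    max_restarts: int = 5,        # assumed restart threshold within the window
    min_uptime_ms: int = 60_000,  # assumed "uptime remains low" threshold
) -> bool:
    """Return True when restarts spike while uptime stays low.

    `samples` must be ordered by timestamp, oldest first.
    """
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Keep only the samples that fall inside the look-back window.
    in_window = [s for s in samples if latest.timestamp_s - s.timestamp_s <= window_s]
    restarts_in_window = latest.restarts - in_window[0].restarts
    return restarts_in_window >= max_restarts and latest.uptime_ms < min_uptime_ms


# A job that restarted 6 times in 6 minutes and has been up for only 20 s.
unstable = [Sample(t, r, 20_000) for t, r in [(0, 0), (120, 2), (240, 4), (360, 6)]]
# A job with a single restart and 50 minutes of uninterrupted uptime.
stable = [Sample(t, r, 3_000_000) for t, r in [(0, 0), (120, 0), (240, 0), (360, 1)]]
```

With these inputs, `is_restart_storm(unstable)` is true and `is_restart_storm(stable)` is false. In practice the same condition would usually live in the alerting system's own rule language; the thresholds should be tuned to the job's normal recovery behavior.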
Review the JobManager and TaskManager logs for recurring exception patterns. Common causes and fixes: insufficient memory (increase the memory allocation), checkpoint timeouts (tune the checkpoint interval and timeout), and external sink failures (verify connectivity and permissions). Configure an exponential backoff restart strategy so that a failing job does not retry immediately and trigger a restart storm. Monitor recovery time to confirm that it meets your SLOs.
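To show why exponential backoff dampens a retry storm, here is a minimal sketch of the delay schedule such a strategy produces. Flink ships its own exponential-delay restart strategy, so this code is not something to deploy. The function name, parameter names, and default values are illustrative assumptions, and jitter is omitted to keep the output deterministic.

```python
from typing import List


def backoff_delays(
    initial_s: float = 1.0,   # assumed delay before the first restart attempt
    multiplier: float = 2.0,  # assumed growth factor after each failure
    max_s: float = 60.0,      # assumed upper bound on the delay
    attempts: int = 8,
) -> List[float]:
    """Return the wait time in seconds before each consecutive restart attempt."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Cap the delay so recovery is never postponed indefinitely.
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


print(backoff_delays())
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Eight consecutive failures are spread over roughly three minutes instead of retrying back to back, which gives an overloaded sink or a recovering dependency time to stabilize. A real strategy would typically also reset the delay after the job has run stably for some time and add random jitter so that multiple jobs do not restart in lockstep.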