Checkpoint Failure Cascade
Severity: critical
Rising checkpoint failures indicate upstream backpressure, state growth issues, or resource exhaustion that will eventually cause job restarts and data loss.
Monitor the ratio of failed to completed checkpoints. When flink_jobmanager_job_numberoffailedcheckpoints increases while flink_jobmanager_job_numberofcompletedcheckpoints stagnates, investigate the checkpoint duration, alignment time, and size metrics. A growing flink_jobmanager_job_lastcheckpointsize paired with an increasing flink_jobmanager_job_lastcheckpointduration signals state growth issues.
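The comparison described above can be sketched in Python as a check over two consecutive metric samples. This is a minimal illustration, not part of any Flink or monitoring API: the `CheckpointSample` type, the `assess_checkpoints` function, and the returned field names are all hypothetical, and how the samples are scraped is left to the monitoring stack in use.

```python
from dataclasses import dataclass


@dataclass
class CheckpointSample:
    """One scrape of the job's checkpoint metrics (hypothetical container)."""
    failed: int            # cumulative failed checkpoint count
    completed: int         # cumulative completed checkpoint count
    last_size_bytes: int   # size of the last checkpoint
    last_duration_ms: int  # duration of the last checkpoint


def assess_checkpoints(prev: CheckpointSample, curr: CheckpointSample) -> dict:
    """Compare two samples and flag the patterns described in the text."""
    # Clamp at zero so a counter reset (e.g. after a restart) does not
    # produce a negative delta.
    new_failed = max(0, curr.failed - prev.failed)
    new_completed = max(0, curr.completed - prev.completed)
    attempts = new_failed + new_completed

    return {
        # Share of checkpoints in this interval that failed.
        "failure_ratio": new_failed / attempts if attempts else 0.0,
        # Failures rising while completions stagnate.
        "cascade_suspected": new_failed > 0 and new_completed == 0,
        # Size and duration both growing points at state growth.
        "state_growth_suspected": (
            curr.last_size_bytes > prev.last_size_bytes
            and curr.last_duration_ms > prev.last_duration_ms
        ),
    }
```

In practice a single pair of samples is noisy, so an alert would more plausibly require the condition to hold across several consecutive intervals before firing.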
Check the checkpoint logs for the root cause (network issues, slow sinks, or state backend problems). If checkpoint duration is high, consider increasing the checkpoint timeout or the job's parallelism. If alignment time is high, address backpressure in upstream operators. If checkpoint size keeps growing, audit state retention policies and look for state leaks in user code.
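The remediation rules above can be written down as a small decision helper. This is a sketch under stated assumptions: the `triage_checkpoints` function is hypothetical, and the two thresholds (duration at 80% of the timeout, alignment at 50% of the duration) are illustrative values chosen for the example, not recommendations from Flink.

```python
def triage_checkpoints(
    duration_ms: int,
    timeout_ms: int,
    alignment_ms: int,
    size_growing: bool,
    duration_fraction: float = 0.8,   # assumed threshold, tune per job
    alignment_fraction: float = 0.5,  # assumed threshold, tune per job
) -> list:
    """Map observed checkpoint symptoms to the remediation steps in the text."""
    actions = []

    # Duration close to the configured timeout: checkpoints risk expiring.
    if duration_ms >= duration_fraction * timeout_ms:
        actions.append("increase checkpoint timeout or parallelism")

    # Alignment dominating the duration: barriers are stuck behind backpressure.
    if duration_ms > 0 and alignment_ms >= alignment_fraction * duration_ms:
        actions.append("address backpressure in upstream operators")

    # Checkpoint size trending upward: possible unbounded state.
    if size_growing:
        actions.append("audit state retention policies and look for state leaks")

    # No metric-based symptom matched: fall back to the logs.
    if not actions:
        actions.append("inspect checkpoint logs for root cause")

    return actions
```

For example, `triage_checkpoints(550_000, 600_000, 400_000, False)` returns both the timeout/parallelism and the backpressure suggestions, since the duration is near the timeout and alignment accounts for most of it.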