Apache Flink

Checkpoint Failure Cascade

Severity: critical · Category: reliability
Updated Aug 15, 2025

Rising checkpoint failures indicate upstream backpressure, state growth issues, or resource exhaustion that will eventually cause job restarts and data loss.

How to detect:

Monitor the ratio of failed to completed checkpoints. When flink_jobmanager_job_numberOfFailedCheckpoints increases while flink_jobmanager_job_numberOfCompletedCheckpoints stagnates, investigate checkpoint duration, alignment time, and size metrics. A growing flink_jobmanager_job_lastCheckpointSize paired with a rising flink_jobmanager_job_lastCheckpointDuration signals state growth.
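The detection rule above can be sketched as a check over two successive metric scrapes. The metric names match Flink's Prometheus reporter, but the snapshot dicts, the helper name, and the 10% failure-ratio threshold are illustrative assumptions, not Flink defaults:

```python
# Sketch of the detection rule: flag when checkpoint failures rise while
# completions stagnate between two metric scrapes. Snapshot dicts and the
# threshold are assumptions for illustration.

FAILED = "flink_jobmanager_job_numberOfFailedCheckpoints"
COMPLETED = "flink_jobmanager_job_numberOfCompletedCheckpoints"

def checkpoint_cascade_suspected(prev: dict, curr: dict,
                                 max_failure_ratio: float = 0.1) -> bool:
    """Return True when new failures outpace new completions."""
    new_failed = curr[FAILED] - prev[FAILED]
    new_completed = curr[COMPLETED] - prev[COMPLETED]
    if new_failed <= 0:
        return False   # no new failures since the last scrape
    if new_completed == 0:
        return True    # failing while completions stagnate
    return new_failed / (new_failed + new_completed) > max_failure_ratio

# Example: 3 new failures vs. 1 new completion -> ratio 0.75, alert fires
prev = {FAILED: 2, COMPLETED: 100}
curr = {FAILED: 5, COMPLETED: 101}
print(checkpoint_cascade_suspected(prev, curr))  # -> True
```

In practice the same rule is usually expressed directly in the alerting layer (e.g. as a rate comparison in PromQL) rather than in application code.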

Recommended action:

Inspect JobManager and TaskManager logs for the root cause (network issues, slow sinks, state backend problems). If checkpoint duration is high, consider raising the checkpoint timeout or increasing parallelism. If alignment time is high, address backpressure in upstream operators; unaligned checkpoints can also reduce alignment time for backpressured jobs. For growing checkpoint size, audit state retention policies (TTL, window cleanup) and look for state leaks in user code.
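The remediation steps above map to a handful of settings in flink-conf.yaml. This is a sketch only: option names vary across Flink versions, and the values are illustrative, so verify both against your release's configuration reference before applying:

```yaml
# Illustrative flink-conf.yaml adjustments; names/values are not defaults.
execution.checkpointing.timeout: 20 min   # raise if checkpoints time out
execution.checkpointing.unaligned: true   # cut alignment time under backpressure
state.backend: rocksdb                    # keep large state on disk, not heap
state.backend.incremental: true           # checkpoint only changed state
```

Raising the timeout treats the symptom; the backpressure or state-growth cause still needs to be fixed, or checkpoints will eventually exceed any timeout.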