Apache Flink

Checkpoint Failure Cascade

Severity: critical · Category: reliability
Updated Aug 15, 2025

Rising checkpoint failures indicate upstream backpressure, state growth issues, or resource exhaustion that will eventually cause job restarts and data loss.

How to detect:

Monitor the ratio of failed to completed checkpoints. When flink_jobmanager_job_numberOfFailedCheckpoints increases while flink_jobmanager_job_numberOfCompletedCheckpoints stagnates, investigate checkpoint duration, alignment time, and size metrics. A growing flink_jobmanager_job_lastCheckpointSize paired with a rising flink_jobmanager_job_lastCheckpointDuration signals state growth.
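The detection rule above can be sketched as a check over two successive metric scrapes. The metric names match Flink's Prometheus reporter, but the snapshot dicts, the helper name, and the 10% failure-ratio threshold are illustrative assumptions, not Flink defaults:

```python
# Sketch of the detection rule: flag when checkpoint failures rise while
# completions stagnate between two metric scrapes. Snapshot dicts and the
# threshold are assumptions for illustration.

FAILED = "flink_jobmanager_job_numberOfFailedCheckpoints"
COMPLETED = "flink_jobmanager_job_numberOfCompletedCheckpoints"

def checkpoint_cascade_suspected(prev: dict, curr: dict,
                                 max_failure_ratio: float = 0.1) -> bool:
    """Return True when new failures outpace new completions."""
    new_failed = curr[FAILED] - prev[FAILED]
    new_completed = curr[COMPLETED] - prev[COMPLETED]
    if new_failed <= 0:
        return False   # no new failures since the last scrape
    if new_completed == 0:
        return True    # failing while completions stagnate
    return new_failed / (new_failed + new_completed) > max_failure_ratio

# Example: 3 new failures vs. 1 new completion -> ratio 0.75, alert fires
prev = {FAILED: 2, COMPLETED: 100}
curr = {FAILED: 5, COMPLETED: 101}
print(checkpoint_cascade_suspected(prev, curr))  # -> True
```

In practice the same rule is usually expressed directly in the alerting layer (e.g. as a rate comparison in PromQL) rather than in application code.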

Recommended action:

Inspect JobManager and TaskManager logs for the root cause (network issues, slow sinks, state backend problems). If checkpoint duration is high, consider raising the checkpoint timeout or increasing parallelism. If alignment time is high, address backpressure in upstream operators; unaligned checkpoints can also reduce alignment time for backpressured jobs. For growing checkpoint size, audit state retention policies (TTL, window cleanup) and look for state leaks in user code.
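The remediation steps above map to a handful of settings in flink-conf.yaml. This is a sketch only: option names vary across Flink versions, and the values are illustrative, so verify both against your release's configuration reference before applying:

```yaml
# Illustrative flink-conf.yaml adjustments; names/values are not defaults.
execution.checkpointing.timeout: 20 min   # raise if checkpoints time out
execution.checkpointing.unaligned: true   # cut alignment time under backpressure
state.backend: rocksdb                    # keep large state on disk, not heap
state.backend.incremental: true           # checkpoint only changed state
```

Raising the timeout treats the symptom; the backpressure or state-growth cause still needs to be fixed, or checkpoints will eventually exceed any timeout.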