Hadoop HDFS

HDFS Storage Capacity Exhaustion Blocking Checkpoints

critical
Resource Contention · Updated Jan 27, 2026

Checkpoint writes to HDFS fail when the cluster's storage capacity is exhausted, causing checkpoint timeouts and leaving the job without a recent persisted state to recover from, which undermines fault tolerance.

How to detect:

Monitor HDFS capacity metrics for low free space, typically less than 10% remaining (for example, `DFS Remaining` in the `hdfs dfsadmin -report` summary, or the NameNode's `CapacityRemaining` metric). Watch checkpoint failure logs for storage-related errors (e.g. out-of-space or disk-quota exceptions) and for an increasing rate of checkpoint timeouts. Track the growth trend of the checkpoint directory's size so exhaustion can be predicted before it occurs.
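As a minimal detection sketch, the cluster summary printed by `hdfs dfsadmin -report` can be parsed to compute the remaining-capacity fraction and raise an alert below the 10% threshold. The sample report text and the 10% threshold here are illustrative assumptions; the `Configured Capacity` / `DFS Remaining` line format matches typical Hadoop 2.x/3.x output, but verify against your version.

```python
import re

LOW_SPACE_THRESHOLD = 0.10  # alert when under 10% of capacity remains (assumed policy)


def remaining_fraction(report: str) -> float:
    """Parse the first (cluster-summary) 'Configured Capacity' and
    'DFS Remaining' byte counts from `hdfs dfsadmin -report` output."""
    capacity = int(re.search(r"Configured Capacity:\s*(\d+)", report).group(1))
    remaining = int(re.search(r"DFS Remaining:\s*(\d+)", report).group(1))
    return remaining / capacity


def low_space(report: str) -> bool:
    """True when free HDFS capacity has dropped below the alert threshold."""
    return remaining_fraction(report) < LOW_SPACE_THRESHOLD


# Hypothetical summary block as emitted by `hdfs dfsadmin -report`:
sample = """Configured Capacity: 1000000000000 (931.32 GB)
Present Capacity: 950000000000 (884.75 GB)
DFS Remaining: 80000000000 (74.51 GB)
DFS Used: 870000000000 (810.24 GB)
DFS Used%: 91.58%"""

print(low_space(sample))  # 8% remaining, below 10% -> True
```

In practice the report text would come from `subprocess.run(["hdfs", "dfsadmin", "-report"], ...)` on a node with the Hadoop client configured, or the same numbers can be scraped from the NameNode JMX endpoint.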

Recommended action:

Delete unnecessary data from checkpoint directories, i.e. old checkpoints that fall outside the retention policy (be careful not to remove checkpoints a running job may still restore from). Archive historical data to cheaper storage tiers. Add DataNode capacity by expanding the cluster. Review and lower Flink's `state.checkpoints.num-retained` setting to reduce the checkpoint storage footprint.
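The retention-based cleanup above can be sketched as a small pruning helper: given the `chk-N` checkpoint directories under a job's checkpoint root, keep the newest N and flag the rest for deletion. The checkpoint root path, the retained count, and the listing input are all illustrative assumptions; actual deletion would be something like `hdfs dfs -rm -r <path>`, and it should only target checkpoints of jobs that are no longer running, since Flink itself enforces `state.checkpoints.num-retained` for live jobs.

```python
NUM_RETAINED = 3  # mirrors the assumed state.checkpoints.num-retained value


def prune_candidates(paths: list[str], num_retained: int = NUM_RETAINED) -> list[str]:
    """Given Flink checkpoint directory paths ending in 'chk-<id>',
    return the ones beyond the newest `num_retained`, oldest first."""
    by_id = sorted(paths, key=lambda p: int(p.rsplit("chk-", 1)[1]))
    return by_id[:-num_retained] if num_retained else by_id


# Hypothetical listing of a cancelled job's checkpoint root:
dirs = [
    "/flink/checkpoints/job-a/chk-10",
    "/flink/checkpoints/job-a/chk-2",
    "/flink/checkpoints/job-a/chk-7",
    "/flink/checkpoints/job-a/chk-5",
]

for path in prune_candidates(dirs, num_retained=2):
    # In a real script: subprocess.run(["hdfs", "dfs", "-rm", "-r", path], check=True)
    print("would delete:", path)
```

Sorting by the numeric checkpoint id rather than lexicographically matters here: a string sort would order `chk-10` before `chk-2` and delete the wrong directories.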