Hadoop HDFS

HDFS Under-Replication Threatening Checkpoint Recovery

critical
reliability · Updated Dec 5, 2024

When DataNode failures or network issues leave HDFS blocks with fewer replicas than configured, fault tolerance for externalized checkpoints and savepoints is compromised, risking data loss on recovery.

How to detect:

Run hdfs dfsadmin -report and alert when the under-replicated block count rises above zero or live DataNodes fall below the expected count; compare effective replication against the configured factor (dfs.replication, typically 3). Also watch for checkpoint restore failures that reference missing blocks.
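The check above can be scripted. A minimal sketch: parse the under-replicated block count out of the dfsadmin report and alert when it is nonzero. The REPORT here-doc is hypothetical sample output; on a live cluster, replace it with REPORT=$(hdfs dfsadmin -report).

```shell
#!/bin/sh
# Hypothetical sample of `hdfs dfsadmin -report` output (real clusters
# print many more fields); replace with: REPORT=$(hdfs dfsadmin -report)
REPORT='Configured Capacity: 1099511627776 (1 TB)
Under replicated blocks: 42
Blocks with corrupt replicas: 0
Missing blocks: 0'

# Extract the count after "Under replicated blocks: "
UNDER=$(printf '%s\n' "$REPORT" | awk -F': ' '/Under replicated blocks/ {print $2}')

if [ "$UNDER" -gt 0 ]; then
  echo "ALERT: $UNDER under-replicated blocks"
else
  echo "OK: replication healthy"
fi
```

Wiring this into an existing alerting pipeline (cron plus a pager webhook, or a metrics exporter) is left to the operator; the parsing step is the only HDFS-specific part.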

Recommended action:

Investigate the root cause: failing DataNodes, network partitions, or insufficient cluster capacity. Monitor HDFS automatic re-replication progress. Run hdfs fsck on the checkpoint directory to identify affected files, and use hdfs dfs -setrep or hdfs balancer to re-trigger replication or redistribute blocks if needed. Increase the checkpoint retention count (state.checkpoints.num-retained) so older checkpoints remain available as recovery options while re-replication completes.
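The retention setting mentioned above is a standard Flink option. A minimal sketch of the relevant flink-conf.yaml fragment, with an illustrative retention count and checkpoint path (both are assumptions, not values from this runbook):

```yaml
# Retain more completed checkpoints so an older, fully replicated
# checkpoint is still available if the newest one references missing
# blocks. Flink's default is 1; 3 here is an illustrative value.
state.checkpoints.num-retained: 3

# Illustrative HDFS checkpoint directory; adjust to your deployment.
state.checkpoints.dir: hdfs:///flink/checkpoints
```

Higher retention trades extra HDFS space for more recovery options during an under-replication window, so size it against the checkpoint directory's capacity.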