Hadoop HDFS

HDFS Under-Replication Threatening Checkpoint Recovery

critical
reliability · Updated Dec 5, 2024

When DataNode failures or network issues leave HDFS blocks with fewer replicas than configured, fault tolerance for externalized checkpoints and savepoints is compromised, risking data loss on recovery.

How to detect:

Run hdfs dfsadmin -report and alert when the under-replicated block count rises above zero or live DataNodes fall below the expected count; compare effective replication against the configured factor (dfs.replication, typically 3). Also watch for checkpoint restore failures that reference missing blocks.
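The check above can be scripted. A minimal sketch: parse the under-replicated block count out of the dfsadmin report and alert when it is nonzero. The REPORT here-doc is hypothetical sample output; on a live cluster, replace it with REPORT=$(hdfs dfsadmin -report).

```shell
#!/bin/sh
# Hypothetical sample of `hdfs dfsadmin -report` output (real clusters
# print many more fields); replace with: REPORT=$(hdfs dfsadmin -report)
REPORT='Configured Capacity: 1099511627776 (1 TB)
Under replicated blocks: 42
Blocks with corrupt replicas: 0
Missing blocks: 0'

# Extract the count after "Under replicated blocks: "
UNDER=$(printf '%s\n' "$REPORT" | awk -F': ' '/Under replicated blocks/ {print $2}')

if [ "$UNDER" -gt 0 ]; then
  echo "ALERT: $UNDER under-replicated blocks"
else
  echo "OK: replication healthy"
fi
```

Wiring this into an existing alerting pipeline (cron plus a pager webhook, or a metrics exporter) is left to the operator; the parsing step is the only HDFS-specific part.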

Recommended action:

Investigate the root cause: failing DataNodes, network partitions, or insufficient cluster capacity. Monitor HDFS automatic re-replication progress. Run hdfs fsck on the checkpoint directory to identify affected files, and use hdfs dfs -setrep or hdfs balancer to re-trigger replication or redistribute blocks if needed. Increase the checkpoint retention count (state.checkpoints.num-retained) so older checkpoints remain available as recovery options while re-replication completes.
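The retention setting mentioned above is a standard Flink option. A minimal sketch of the relevant flink-conf.yaml fragment, with an illustrative retention count and checkpoint path (both are assumptions, not values from this runbook):

```yaml
# Retain more completed checkpoints so an older, fully replicated
# checkpoint is still available if the newest one references missing
# blocks. Flink's default is 1; 3 here is an illustrative value.
state.checkpoints.num-retained: 3

# Illustrative HDFS checkpoint directory; adjust to your deployment.
state.checkpoints.dir: hdfs:///flink/checkpoints
```

Higher retention trades extra HDFS space for more recovery options during an under-replication window, so size it against the checkpoint directory's capacity.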