HDFS Under-Replication Threatening Checkpoint Recovery
Severity: critical
Insufficient data block replicas in HDFS, caused by DataNode failures or network issues, compromise fault tolerance for externalized checkpoints and savepoints and risk data loss on recovery.
Monitor hdfs dfsadmin -report for the under-replicated, corrupt, and missing block counts, and for live versus dead DataNode status. Run hdfs fsck on the checkpoint directory to compare actual block replication against the configured factor (dfs.replication, typically 3). Watch for checkpoint restore failures referencing missing blocks.
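The checks above can be scripted. The following is a minimal Python sketch, assuming the report contains lines such as "Under replicated blocks: N", "Missing blocks: N", and "Live datanodes (N):"; the exact layout varies between Hadoop versions, so the patterns may need adjusting. The function names parse_dfsadmin_report and fetch_report are hypothetical and not part of any HDFS tooling.

```python
import re
import subprocess

# Patterns are an assumption about the report layout; verify them against
# the output of your own Hadoop version before relying on them.
_PATTERNS = {
    "under_replicated": r"^\s*Under replicated blocks:\s*(\d+)",
    "corrupt_replicas": r"^\s*Blocks with corrupt replicas:\s*(\d+)",
    "missing": r"^\s*Missing blocks:\s*(\d+)",
    "live_datanodes": r"^\s*Live datanodes \((\d+)\):",
    "dead_datanodes": r"^\s*Dead datanodes \((\d+)\):",
}


def parse_dfsadmin_report(report: str) -> dict:
    """Extract block and DataNode counts; a value is None if its line is absent."""
    result = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, report, re.MULTILINE)
        result[name] = int(match.group(1)) if match else None
    return result


def fetch_report() -> str:
    """Run the real CLI; requires an HDFS client and sufficient privileges."""
    completed = subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout
```

A monitoring job could call parse_dfsadmin_report(fetch_report()) on a schedule and alert when under_replicated or dead_datanodes is non-zero, or when missing is greater than zero.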
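Investigate the root cause: failing DataNodes, network partitions, or insufficient cluster capacity. Monitor HDFS automatic re-replication progress by tracking the under-replicated block count over time. If re-replication stalls, run hdfs dfs -setrep -w with the target factor on the checkpoint directory to re-request replication, or run hdfs balancer to redistribute blocks across DataNodes. Increase the checkpoint retention count (state.checkpoints.num-retained) to maintain recovery options during re-replication.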
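Whether re-replication is making progress can be judged from successive under-replicated block counts. The sketch below is one possible heuristic, not an HDFS feature: the function name classify_rereplication, the stall_window parameter, and the status labels are all illustrative assumptions.

```python
from typing import Sequence


def classify_rereplication(samples: Sequence[int], stall_window: int = 3) -> str:
    """Classify re-replication from under-replicated counts, oldest first.

    Returns "recovered" when the latest count is zero, "unknown" when there
    are fewer than stall_window samples, "stalled" when the latest count is
    not lower than the first count in the window, otherwise "progressing".
    """
    if not samples:
        raise ValueError("need at least one sample")
    if samples[-1] == 0:
        return "recovered"
    recent = samples[-stall_window:]
    if len(recent) < stall_window:
        return "unknown"
    if recent[-1] >= recent[0]:
        return "stalled"
    return "progressing"
```

A "stalled" result over several polling intervals is a reasonable trigger for manual intervention; the window size should be chosen relative to the polling interval and cluster size.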