HDFS Checkpoint Metadata Collision Causing Job Failures
Severity: critical | Category: reliability | Updated Dec 4, 2020
When Flink jobs that checkpoint to HDFS fail and restart without high availability (HA), reusing a fixed job ID causes a FileAlreadyExistsException on checkpoint '_metadata' files in HDFS, leading to JobManager crashes and restart loops.
How to detect:
Detect by monitoring checkpoint logs for FileAlreadyExistsException errors that reference '_metadata' files in HDFS checkpoint directories, combined with a JobManager restart frequency above the normal baseline.
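The log-scanning half of that detection can be sketched as a small Python filter. The log lines and paths below are illustrative (the exact message format depends on the Flink version and HDFS client); only the FileAlreadyExistsException-plus-'_metadata' signature is taken from the symptom described above.

```python
import re

# Hypothetical sample of JobManager log lines; paths and job ID are placeholders.
LOG_LINES = [
    "2020-12-04 10:01:22 INFO  CheckpointCoordinator - Triggering checkpoint 1",
    "2020-12-04 10:01:25 ERROR CheckpointCoordinator - "
    "org.apache.hadoop.fs.FileAlreadyExistsException: "
    "/flink/checkpoints/fixedjob0001/chk-1/_metadata already exists",
    "2020-12-04 10:02:10 ERROR CheckpointCoordinator - "
    "org.apache.hadoop.fs.FileAlreadyExistsException: "
    "/flink/checkpoints/fixedjob0001/chk-2/_metadata already exists",
]

# Collision signature: FileAlreadyExistsException mentioning a _metadata path.
COLLISION = re.compile(r"FileAlreadyExistsException.*?(\S*/_metadata)")

def find_metadata_collisions(lines):
    """Return the checkpoint metadata paths that triggered a collision."""
    hits = []
    for line in lines:
        m = COLLISION.search(line)
        if m:
            hits.append(m.group(1))
    return hits

hits = find_metadata_collisions(LOG_LINES)
```

In practice this filter would run against the JobManager log stream, and an alert would fire when the match count over a window is nonzero while restart frequency is also elevated.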
Recommended action:
Enable Flink's high-availability (HA) mode so a persistent job ID can safely survive restarts, or switch to generating a unique job ID per run. Clean up stale checkpoint directories in HDFS before restarting a job. Configure externalized checkpoints with RETAIN_ON_CANCELLATION to preserve state while avoiding metadata collisions.
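As one possible shape for those settings, the flink-conf.yaml fragment below sketches ZooKeeper-based HA plus externalized checkpoint retention. Key names follow recent Flink releases but should be checked against your version's configuration reference; hosts and HDFS paths are placeholders.

```yaml
# Sketch only -- verify key names against your Flink version's docs.

# HA mode: lets the job keep a stable ID across JobManager restarts
# without colliding on previously written checkpoint paths.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha

# Externalized checkpoints, retained when the job is cancelled.
state.checkpoints.dir: hdfs:///flink/checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```

If HA is not an option, the alternative named above (a fresh job ID per run) avoids the collision because each run writes under its own checkpoint subdirectory.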