Frequent job restarts indicate systemic instability, typically from resource exhaustion, external dependency failures, or misconfiguration; repeated restart-recovery cycles compound into extended downtime.
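One way to quantify "frequent" is to sample Flink's `numRestarts` job metric periodically and compute a restart rate. A minimal sketch; the sampling scheme, function name, and any alert threshold you compare against are assumptions, not part of the Flink API:

```python
# Sketch: derive a restart rate from periodic samples of Flink's
# `numRestarts` job metric. Helper name and thresholds are illustrative.
def restart_rate_per_hour(samples):
    """samples: list of (unix_seconds, num_restarts) tuples, oldest first."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    elapsed_hours = (t1 - t0) / 3600.0
    if elapsed_hours <= 0:
        return 0.0
    return (r1 - r0) / elapsed_hours

# Example: 3 restarts observed over 30 minutes -> 6 restarts/hour
samples = [(0, 10), (900, 11), (1800, 13)]
assert restart_rate_per_hour(samples) == 6.0
```

Using a rate rather than the raw counter avoids alerting on a job that restarted often last week but has since stabilized.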
Rising checkpoint failures indicate upstream backpressure, state growth issues, or resource exhaustion that will eventually cause job restarts and data loss.
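The failure trend can be computed from the `counts.completed` and `counts.failed` fields that Flink's `/jobs/<jobid>/checkpoints` REST endpoint returns. A sketch under assumptions: the 10% alert threshold and function names are illustrative choices, not Flink defaults:

```python
# Sketch: recent checkpoint failure ratio from the counts exposed by
# Flink's /jobs/<jobid>/checkpoints REST endpoint. Threshold is assumed.
def checkpoint_failure_ratio(completed: int, failed: int) -> float:
    total = completed + failed
    return failed / total if total else 0.0

def should_alert(completed: int, failed: int, threshold: float = 0.10) -> bool:
    return checkpoint_failure_ratio(completed, failed) > threshold

assert should_alert(completed=8, failed=2)       # 20% failed -> alert
assert not should_alert(completed=99, failed=1)  # 1% failed -> healthy
```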
Growing source lag combined with stable or decreasing throughput indicates the job cannot keep pace with input rate, leading to increasing latency and eventual processing failure.
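This pattern can be detected by fitting a trend to each metric series and requiring lag to rise while throughput stays flat or falls. The metric sources (e.g. Kafka's `records-lag-max`, Flink's `numRecordsOutPerSecond`) and the slope tolerances below are assumptions for illustration:

```python
# Sketch: detect "lag grows while throughput is flat" via least-squares
# slopes over fixed-interval samples. Tolerances are assumed, not standard.
def slope(values):
    """Least-squares slope of values sampled at a fixed interval."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0

def falling_behind(lag_samples, throughput_samples,
                   lag_slope_min=1.0, tput_slope_max=0.0):
    return slope(lag_samples) > lag_slope_min and \
           slope(throughput_samples) <= tput_slope_max

assert falling_behind([100, 220, 310, 450], [5000, 5010, 4990, 4980])
assert not falling_behind([100, 100, 105, 100], [5000, 5200, 5400, 5600])
```

Checking both series together avoids false alarms during traffic spikes, where lag and throughput rise in tandem.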
Too few registered TaskManagers or free task slots prevents jobs from being scheduled or scaled out, leaving the cluster unable to absorb load increases.
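Slot headroom can be checked from the `slots-total` and `slots-available` fields of Flink's `/overview` REST endpoint. A sketch; the 20% minimum-free-fraction policy is an assumption:

```python
# Sketch: cluster slot headroom check using fields from Flink's /overview
# REST response (`slots-total`, `slots-available`). Threshold is assumed.
def has_headroom(slots_total: int, slots_available: int,
                 min_free_fraction: float = 0.20) -> bool:
    if slots_total == 0:
        return False  # no TaskManagers registered at all
    return slots_available / slots_total >= min_free_fraction

assert not has_headroom(slots_total=10, slots_available=1)  # 10% free
assert has_headroom(slots_total=10, slots_available=3)      # 30% free
```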
Operators experiencing backpressure prevent watermarks from advancing, causing time-based operations (windows, temporal joins) to produce no results despite incoming data.
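A stalled watermark is visible as an operator's `currentOutputWatermark` metric failing to advance over wall-clock time. A sketch under assumptions: the five-minute stall window is an illustrative policy:

```python
# Sketch: flag a stalled watermark by comparing successive samples of an
# operator's currentOutputWatermark metric. Stall window is assumed.
def watermark_stalled(samples, max_stall_ms=300_000):
    """samples: list of (wall_clock_ms, watermark_ms), oldest first."""
    if len(samples) < 2:
        return False
    (t0, w0), (t1, w1) = samples[0], samples[-1]
    return w1 <= w0 and (t1 - t0) >= max_stall_ms

# Watermark unchanged across 10 minutes of wall-clock time -> stalled
assert watermark_stalled([(0, 1_000), (600_000, 1_000)])
assert not watermark_stalled([(0, 1_000), (600_000, 500_000)])
```

This catches the symptom described above: records keep arriving, but windows never fire because event time has stopped advancing.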
High garbage collection pressure on TaskManagers causes processing slowdowns that create backpressure, increased state size, and eventually full GC pauses lasting minutes.
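GC pressure is commonly expressed as the fraction of wall-clock time spent in GC, derived from deltas of Flink's `Status.JVM.GarbageCollector.<name>.Time` metric (milliseconds). A sketch; the 10% alert threshold is an assumption:

```python
# Sketch: fraction of wall-clock time spent in GC, from deltas of Flink's
# Status.JVM.GarbageCollector.<name>.Time metric. Threshold is assumed.
def gc_time_fraction(gc_ms_delta: float, wall_ms_delta: float) -> float:
    return gc_ms_delta / wall_ms_delta if wall_ms_delta > 0 else 0.0

# 9 s of GC inside a 60 s window -> 15% of time in GC
assert gc_time_fraction(9_000, 60_000) == 0.15
assert gc_time_fraction(9_000, 60_000) > 0.10  # above an assumed 10% threshold
```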
Depleted network buffers block data shuffling between operators, causing backpressure and throughput collapse in jobs with high shuffle volume.
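Buffer depletion shows up as sustained high values of Flink's `inPoolUsage` / `outPoolUsage` network metrics (a 0.0 to 1.0 fraction). A sketch; requiring every sample in the window to exceed 0.95 is an assumed policy, not a Flink default:

```python
# Sketch: treat sustained high inPoolUsage/outPoolUsage (Flink network
# buffer pool usage, 0.0-1.0) as buffer exhaustion. Policy is assumed.
def buffers_exhausted(pool_usage_samples, threshold=0.95):
    return len(pool_usage_samples) > 0 and \
           all(u >= threshold for u in pool_usage_samples)

assert buffers_exhausted([0.97, 0.99, 1.0, 0.96])      # pinned at the ceiling
assert not buffers_exhausted([0.97, 0.40, 1.0, 0.96])  # usage recovered
```

Requiring the whole window to be saturated filters out momentary spikes that the credit-based flow control absorbs on its own.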
Slow RocksDB flushes cause write stalls that propagate upstream as backpressure, degrading throughput and increasing checkpoint durations.
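Flush health can be watched once RocksDB's native metrics are enabled in Flink (the `state.backend.rocksdb.metrics.*` options). A sketch under assumptions: the 1 s flush budget, the violation count, and the function name are illustrative:

```python
# Sketch: flag write-stall risk when sampled RocksDB flush durations
# repeatedly exceed a budget. Budget and violation count are assumed;
# requires RocksDB native metrics enabled (state.backend.rocksdb.metrics.*).
def flush_budget_exceeded(flush_ms_samples, budget_ms=1_000, min_violations=3):
    return sum(1 for d in flush_ms_samples if d > budget_ms) >= min_violations

assert flush_budget_exceeded([800, 1_200, 1_500, 2_000])   # 3 slow flushes
assert not flush_budget_exceeded([800, 1_200, 600, 700])   # only 1 slow flush
```

Counting violations rather than averaging keeps one pathological flush from masking an otherwise healthy trend, and vice versa.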