Consumer Lag Divergence
Warning: Growing source lag combined with stable or decreasing throughput indicates that the job cannot keep pace with the input rate, leading to increasing latency and eventual processing failure.
Track source-specific lag metrics (Kafka records-lag-max) alongside flink_operator_recordsinpersecond. When lag increases while records-in-per-second remains flat or decreases, the job is falling behind. Cross-reference with flink_taskmanager_status_jvm_cpu_load and operator throughput (flink_operator_recordsoutpersecond) to identify resource vs. logic bottlenecks.
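The divergence check described above can be sketched as a trend comparison over two evenly sampled series. This is a minimal illustration, not a definitive detector: the function names, the least-squares trend, and the `rate_tolerance` default are assumptions, and the samples are assumed to have already been pulled from your metrics backend.

```python
from statistics import mean


def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def lag_diverging(lag, records_in, rate_tolerance=0.01):
    """Return True when lag trends upward while throughput is flat or falling.

    lag:        samples of the source lag metric (e.g. Kafka records-lag-max)
    records_in: samples of records-in-per-second over the same window
    rate_tolerance: hypothetical cutoff; throughput growing by less than this
                    fraction of its mean per sample interval counts as "flat"
    """
    if len(lag) < 2 or len(records_in) < 2:
        return False
    lag_growing = slope(lag) > 0
    rate_flat_or_falling = slope(records_in) <= rate_tolerance * mean(records_in)
    return lag_growing and rate_flat_or_falling
```

For example, `lag_diverging([100, 200, 300, 400], [1000, 1000, 1000, 1000])` flags the job as falling behind, while the same lag series paired with steadily rising throughput does not. In practice the window length and tolerance would need tuning against normal traffic patterns.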
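If CPU usage is low (<75%), investigate operator logic for inefficiencies or blocking calls to external dependencies. If CPU usage is high (>80%), enable auto-scaling or manually increase parallelism. Check for data skew causing uneven load distribution across subtasks. For Kafka sources, verify that the partition count is at least the source parallelism, and ideally a multiple of it: source subtasks beyond the partition count receive no data, and a non-multiple assigns more partitions to some subtasks than others.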
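The remediation steps above can be expressed as a small triage routine. This is a sketch under stated assumptions: the function name, the returned labels, and the `skew_ratio` default are hypothetical; CPU is taken as a percentage; and the 75-80% band, which the thresholds above leave open, is reported as inconclusive rather than guessed.

```python
from statistics import mean


def triage(cpu_percent, subtask_rates, partitions, parallelism, skew_ratio=2.0):
    """Return a list of suggested follow-ups for a job whose lag is growing.

    cpu_percent:   TaskManager JVM CPU load, expressed as 0-100
    subtask_rates: records-in-per-second for each subtask of one operator
    partitions:    Kafka partition count of the source topic
    parallelism:   source operator parallelism
    skew_ratio:    hypothetical cutoff; busiest subtask vs. the average
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")

    findings = []

    # CPU thresholds follow the text: <75% low, >80% high.
    if cpu_percent > 80:
        findings.append("scale_out")
    elif cpu_percent < 75:
        findings.append("check_operator_logic")
    else:
        findings.append("cpu_inconclusive")

    # Skew: one subtask handling far more than the average share.
    avg = mean(subtask_rates) if subtask_rates else 0
    if avg > 0 and max(subtask_rates) / avg >= skew_ratio:
        findings.append("data_skew")

    # Partition layout: fewer partitions than subtasks leaves some idle;
    # a non-multiple gives some subtasks more partitions than others.
    if partitions < parallelism:
        findings.append("idle_subtasks")
    elif partitions % parallelism != 0:
        findings.append("uneven_partitions")

    return findings
```

A call such as `triage(40, [1000, 100, 100, 100], 3, 4)` returns `["check_operator_logic", "data_skew", "idle_subtasks"]`, which points at operator logic, a hot key or partition, and a source running with more subtasks than partitions.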