Apache FlinkApache Kafka

Consumer Lag Divergence

warning
latencyUpdated Aug 15, 2025

Growing source lag combined with stable or decreasing throughput indicates the job cannot keep pace with input rate, leading to increasing latency and eventual processing failure.

How to detect:

Track source-specific lag metrics (Kafka records-lag-max) alongside flink_operator_recordsinpersecond. When lag increases while records-in-per-second remains flat or decreases, the job is falling behind. Cross-reference with flink_taskmanager_status_jvm_cpu_load and operator throughput (flink_operator_recordsoutpersecond) to identify resource vs. logic bottlenecks.

Recommended action:

If CPU usage is low (<75%), investigate operator logic for inefficiencies or external dependency calls. If CPU is high (>80%), enable auto-scaling or manually increase parallelism. Check for data skew causing uneven load distribution across subtasks. For Kafka sources, verify partition count matches parallelism for optimal distribution.