Consumer Lag Divergence

warning

latencyUpdated Aug 15, 2025

Growing source lag combined with stable or decreasing throughput indicates the job cannot keep pace with input rate, leading to increasing latency and eventual processing failure.

Sources

Troubleshooting Apache Flink Applications: Identifying Bottlenecksmedium.com

Troubleshoot performance issues - Managed Service for Apache Flinkdocs.aws.amazon.com

Mastering Apache Flink in Production: A Guide to Monitoring ...bigdataboutique.com

Technologies:

Apache FlinkSymptoms of this issue are visible in Apache Flink metrics and logs

Apache KafkaThe root cause of this issue originates in Apache Kafka

How to detect:

Track source-specific lag metrics (Kafka records-lag-max) alongside flink_operator_recordsinpersecond. When lag increases while records-in-per-second remains flat or decreases, the job is falling behind. Cross-reference with flink_taskmanager_status_jvm_cpu_load and operator throughput (flink_operator_recordsoutpersecond) to identify resource vs. logic bottlenecks.

Recommended action:

If CPU usage is low (<75%), investigate operator logic for inefficiencies or external dependency calls. If CPU is high (>80%), enable auto-scaling or manually increase parallelism. Check for data skew causing uneven load distribution across subtasks. For Kafka sources, verify partition count matches parallelism for optimal distribution.