Consumer lag escalates rapidly while broker health metrics remain normal

critical

performance

Updated Mar 4, 2026 (via Exa)

Sources

How Our Kafka Consumer Fell 14 Million Messages Behind | by The CS Engineer | Mar, 2026 | Medium (medium.com)

Technologies:

Confluent Platform (subject)

Apache Kafka: symptoms of this issue are visible in Apache Kafka metrics and logs.

How to detect:

Consumer group lag climbs rapidly (from 500,000 to 14 million messages, roughly a 28-fold increase, within hours) while every broker-side health indicator looks normal: CPU at 55%, zero under-replicated partitions, steady network saturation, a high page cache hit ratio, and no ISR churn. No error alerts fire and no pods crash.
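This detection rule compares two independent signals: consumer group lag, computed per partition as log-end offset minus committed offset, and broker health. The following is a minimal Python sketch that assumes offsets and broker metrics have already been collected (for example, from your monitoring stack). The `BrokerHealth` field names, the 80% CPU limit, and the 2x growth factor are illustrative assumptions, not values taken from the source.

```python
from dataclasses import dataclass

# A partition is identified here by (topic, partition number).
PartitionKey = tuple[str, int]


@dataclass(frozen=True)
class BrokerHealth:
    """Snapshot of broker-side indicators (hypothetical field names)."""
    cpu_percent: float
    under_replicated_partitions: int
    isr_shrinks_per_sec: float


def total_lag(end_offsets: dict[PartitionKey, int],
              committed: dict[PartitionKey, int]) -> int:
    """Sum of (log-end offset - committed offset) across all partitions.

    A partition with no committed offset is counted from offset 0; this is a
    simplifying assumption, since the real start depends on auto.offset.reset.
    """
    return sum(max(end - committed.get(tp, 0), 0) for tp, end in end_offsets.items())


def brokers_look_healthy(h: BrokerHealth, cpu_limit: float = 80.0) -> bool:
    """True when no broker-side indicator suggests trouble (thresholds are assumptions)."""
    return (h.cpu_percent < cpu_limit
            and h.under_replicated_partitions == 0
            and h.isr_shrinks_per_sec == 0)


def consumer_side_lag_alert(lag_samples: list[int], health: BrokerHealth,
                            growth_factor: float = 2.0) -> bool:
    """Flag lag that grew by at least growth_factor over the sampled window
    while the brokers look healthy, pointing at the consumer side."""
    if len(lag_samples) < 2 or lag_samples[0] <= 0:
        return False  # no usable baseline to compare against
    lag_escalated = lag_samples[-1] >= growth_factor * lag_samples[0]
    return lag_escalated and brokers_look_healthy(health)


if __name__ == "__main__":
    lag = total_lag({("orders", 0): 1_000, ("orders", 1): 500}, {("orders", 0): 900})
    print(lag)  # 600
    healthy = BrokerHealth(cpu_percent=55.0, under_replicated_partitions=0,
                           isr_shrinks_per_sec=0.0)
    print(consumer_side_lag_alert([500_000, 3_000_000, 14_000_000], healthy))  # True
```

This sketch deliberately keeps the lag signal separate from the broker check, so a healthy broker snapshot never suppresses the lag signal. Instead, the combination of the two is what narrows the search to consumer-side causes.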

Recommended action:

Monitor the kafka.consumer_group.lag metric independently of broker metrics. When lag escalates while broker metrics stay healthy, investigate consumer-side processing capacity and throughput. Check the consumer processing rate, consumer fetch settings, and downstream processing bottlenecks instead of focusing solely on broker health. Freeze deployments during the investigation so that no additional variables are introduced.
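Checking consumer processing capacity reduces to simple rate arithmetic. Lag grows at (produce rate - consume rate), so the lag slope plus the measured consume rate gives the produce rate the group must match. The following Python sketch assumes throughput scales linearly with the number of consumers and uses hypothetical numbers (a 3-hour window, 4 consumers, 12 partitions) that are not taken from the source.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityReport:
    lag_growth_per_sec: float   # > 0 means the group is falling further behind
    produce_rate: float         # inferred as consume rate + lag growth rate
    consumers_needed: int       # naive linear estimate, capped at partition count
    partition_limited: bool     # True if the estimate exceeds the partition count


def capacity_report(lag_start: int, lag_end: int, window_sec: float,
                    consume_rate: float, consumers: int, partitions: int) -> CapacityReport:
    """Infer the produce rate from lag growth and estimate how many consumers
    would keep up, assuming throughput scales linearly with consumer count."""
    if window_sec <= 0 or consume_rate <= 0 or consumers <= 0 or partitions <= 0:
        raise ValueError("window, rate, consumers and partitions must be positive")
    growth = (lag_end - lag_start) / window_sec
    produce_rate = consume_rate + growth
    per_consumer = consume_rate / consumers
    needed = max(1, math.ceil(produce_rate / per_consumer))
    # A consumer group cannot use more active consumers than there are partitions.
    return CapacityReport(growth, produce_rate, min(needed, partitions), needed > partitions)


def seconds_to_drain(lag: int, consume_rate: float, produce_rate: float) -> float:
    """Time to clear the backlog; infinite if consumers do not outpace producers."""
    surplus = consume_rate - produce_rate
    return math.inf if surplus <= 0 else lag / surplus


if __name__ == "__main__":
    # Hypothetical: lag 500,000 -> 14,000,000 over 3 hours, 4 consumers
    # processing 2,000 msg/s in total, 12 partitions.
    report = capacity_report(500_000, 14_000_000, 3 * 3600, 2_000, 4, 12)
    print(report)
    # After scaling to report.consumers_needed (7) consumers at 500 msg/s each:
    print(seconds_to_drain(14_000_000, 7 * 500, report.produce_rate))  # 56000.0
```

The linear-scaling estimate is meaningful only when per-message processing in the consumer is the bottleneck. If a downstream dependency is saturated, adding consumers will not raise the consume rate. In that case, the measured per-consumer rate stays flat as the group grows, which points back to the downstream bottlenecks named above.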