Consumer lag escalates rapidly while broker health metrics remain normal
critical | performance | Updated Mar 4, 2026 (via Exa)
Technologies:
How to detect:
Consumer group lag grows rapidly (from 500,000 to 14 million messages within hours) while every broker-side health indicator looks normal: CPU at 55%, zero under-replicated partitions, stable network utilization, a high page cache hit ratio, and no ISR churn. No error alerts fire and no pods crash, so the lag metric is the only signal that something is wrong.
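The detection condition can be sketched as a small check that combines a lag growth rate with a broker health snapshot. This is only an illustration: `BrokerSnapshot`, `silent_lag_alert`, and every threshold below are hypothetical names and values, not part of any monitoring product, and the 3-hour window in the usage example is an assumption standing in for "within hours".

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class BrokerSnapshot:
    """Hypothetical summary of broker-side health at one point in time."""
    cpu_percent: float
    under_replicated_partitions: int
    isr_changes_per_min: float


def brokers_look_healthy(b: BrokerSnapshot, max_cpu: float = 80.0) -> bool:
    # Assumed thresholds: CPU below max_cpu, no under-replicated
    # partitions, and no ISR churn.
    return (
        b.cpu_percent < max_cpu
        and b.under_replicated_partitions == 0
        and b.isr_changes_per_min == 0
    )


def lag_growth_per_sec(samples: Sequence[Tuple[float, int]]) -> float:
    """Average lag growth (messages/second) between first and last sample.

    Each sample is (unix_timestamp_seconds, total_group_lag).
    """
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (lag1 - lag0) / (t1 - t0)


def silent_lag_alert(
    samples: Sequence[Tuple[float, int]],
    broker: BrokerSnapshot,
    min_growth_per_sec: float = 10.0,  # assumed alerting threshold
) -> bool:
    # Fire only in the case described above: lag is climbing
    # even though the brokers look fine.
    return brokers_look_healthy(broker) and (
        lag_growth_per_sec(samples) > min_growth_per_sec
    )


if __name__ == "__main__":
    # Lag from the incident description, assuming a 3-hour window.
    samples = [(0, 500_000), (3 * 3600, 14_000_000)]
    broker = BrokerSnapshot(cpu_percent=55.0,
                            under_replicated_partitions=0,
                            isr_changes_per_min=0.0)
    print(lag_growth_per_sec(samples))        # average messages/second
    print(silent_lag_alert(samples, broker))  # lag rising, brokers healthy
```

Alerting on the growth rate rather than on an absolute lag value lets the check fire early in the escalation, before the backlog reaches millions of messages.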
Recommended action:
Monitor the kafka.consumer_group.lag metric independently of broker metrics, and alert on it directly. When lag escalates while broker metrics stay healthy, shift the investigation to the consumer side instead of focusing solely on broker health: check the consumer processing rate, the consumer fetch settings, and any downstream processing bottlenecks. Freeze deployments during the investigation so that no additional variables are introduced.
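As a rough aid for the consumer-side investigation, the arithmetic behind "processing capacity versus throughput" can be sketched as follows. The function name, its parameters, and the example rates are all hypothetical; the sketch assumes every consumer processes at roughly the same rate and relies on the fact that a Kafka consumer group cannot have more active consumers than the topic has partitions.

```python
import math


def consumer_capacity_report(
    produce_rate: float,   # messages/second written to the topic
    consume_rate: float,   # messages/second processed by the whole group
    consumers: int,        # active consumers in the group
    partitions: int,       # partitions in the topic
) -> dict:
    """Compare group throughput with the incoming rate.

    Assumes consumers share the load evenly, which is a simplification.
    """
    if consumers <= 0 or consume_rate <= 0:
        raise ValueError("consumers and consume_rate must be positive")

    per_consumer = consume_rate / consumers
    deficit = produce_rate - consume_rate  # > 0 means lag keeps growing

    # Consumers needed just to keep pace with producers (not to drain lag).
    needed = math.ceil(produce_rate / per_consumer)

    return {
        "per_consumer_rate": per_consumer,
        "lag_growth_per_sec": max(deficit, 0.0),
        "consumers_to_keep_up": needed,
        # Beyond the partition count, extra consumers would sit idle, so
        # scaling out alone cannot close the gap in that case.
        "limited_by_partitions": needed > partitions,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 6 consumers handling 3,750 msg/s in total
    # against 5,000 msg/s produced, on a 12-partition topic.
    report = consumer_capacity_report(5000.0, 3750.0,
                                      consumers=6, partitions=12)
    for key, value in report.items():
        print(f"{key}: {value}")
```

If `limited_by_partitions` comes back true, adding consumers will not help on its own, which points toward per-message processing cost, fetch settings, or a downstream bottleneck as the next things to examine.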