Consumer Lag Spike During Peak Traffic
critical
Capacity Planning
Consumer lag suddenly increases, indicating consumers are falling behind producers and risking stale downstream data and backpressure failures.
Prompt: “My Kafka consumer lag just spiked from 100 messages to 50,000 messages in the last 10 minutes. Help me figure out what's wrong and whether I need to scale up my consumer group.”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When consumer lag spikes suddenly, first verify that offsets are actually advancing to rule out a stuck consumer. Then compare producer throughput to consumer throughput to determine if this is a traffic spike overwhelming your consumers or a consumer-side processing slowdown. Finally, check consumer group membership and fetch performance to identify whether you need to scale horizontally or fix a performance bottleneck.
1. Verify consumer offsets are advancing
Check `kafka.consumer_group.offset` over a 5-10 minute window to confirm offsets are actually moving forward. If the offset hasn't budged while lag increased from 100 to 50,000 messages, you have a stuck consumer (crash, deadlock, or unhandled exception) rather than a throughput problem. This is the fastest way to distinguish 'consumer can't keep up' from 'consumer isn't running at all.' If offsets are frozen for >10 minutes, restart the consumer and review logs for crashes or blocked threads.
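The stuck-consumer check above can be sketched as a small helper over periodic metric samples. The function name and sampling shape are illustrative, not part of any Kafka client API:

```python
def is_consumer_stuck(offset_samples, lag_samples):
    """offset_samples / lag_samples: readings of kafka.consumer_group.offset
    and group lag, oldest to newest, over a ~5-10 minute window.
    A frozen offset while lag grows points at a crashed or deadlocked
    consumer rather than a throughput problem."""
    offset_frozen = len(set(offset_samples)) == 1
    lag_growing = lag_samples[-1] > lag_samples[0]
    return offset_frozen and lag_growing
```

For example, `is_consumer_stuck([1200, 1200, 1200], [100, 20_000, 50_000])` returns `True` (offset frozen, lag exploding), while a slowly advancing offset returns `False` and points you at throughput instead.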
2. Compare producer rate to consumer consumption rate
Look at `kafka.messages_in.rate` (or `kafka.topic.message_rate`) and compare it to `kafka.consumer.records_consumed_rate` over the last 15-30 minutes. If the incoming rate suddenly doubled from 1,000 msg/sec to 2,000 msg/sec while your consumer rate stayed flat at 500 msg/sec, you've got a traffic spike that's overwhelming your consumers. If the producer rate is steady but consumer rate suddenly dropped, you have a consumer-side processing bottleneck or resource contention. This tells you whether the problem is 'too much input' or 'slow processing.'
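A minimal sketch of this comparison, assuming you already have recent rate samples in msg/sec (the function name and the 1.5x jump threshold are illustrative choices, not Kafka constants):

```python
def classify_lag_cause(producer_rates, consumer_rates, jump_factor=1.5):
    """producer_rates / consumer_rates: msg/sec samples, oldest to newest,
    e.g. from kafka.messages_in.rate and kafka.consumer.records_consumed_rate.
    Separates 'too much input' from 'slow processing'."""
    prod_jump = producer_rates[-1] >= producer_rates[0] * jump_factor
    cons_drop = consumer_rates[-1] <= consumer_rates[0] / jump_factor
    if prod_jump and not cons_drop:
        return "traffic spike"
    if cons_drop and not prod_jump:
        return "processing slowdown"
    if prod_jump and cons_drop:
        return "both"
    return "inconclusive"
```

With the numbers from the example above, `classify_lag_cause([1000, 1500, 2000], [500, 500, 500])` returns `"traffic spike"`: input doubled while consumption stayed flat.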
3. Monitor lag growth rate to assess urgency
Track `kafka.consumer.records_lag_max` and `kafka.consumer_group.lag_sum` to see if lag is accelerating, stabilizing, or decreasing. If lag is growing faster than 50,000 messages per minute, you're in critical territory and need immediate action (scaling or throttling). If lag spiked to 50k but is now holding steady or slowly decreasing, your consumers may catch up on their own as traffic normalizes. The insight on consumer lag escalation shows that exponential growth (500k to 14 million) despite healthy brokers means you need to scale consumer capacity, not fix infrastructure.
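The urgency triage can be expressed as a slope calculation over lag samples. This is a sketch: the 50,000 msg/min critical threshold mirrors the guidance above and should be tuned per workload.

```python
def lag_urgency(lag_samples, interval_sec, critical_per_min=50_000):
    """lag_samples: kafka.consumer_group.lag_sum readings, oldest to newest,
    taken interval_sec apart. Classifies the trend as critical, growing,
    or stable/recovering."""
    elapsed = (len(lag_samples) - 1) * interval_sec
    per_min = (lag_samples[-1] - lag_samples[0]) / elapsed * 60
    if per_min >= critical_per_min:
        return "critical: scale or throttle now"
    if per_min > 0:
        return "growing: watch closely"
    return "stable or recovering"
```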
4. Check consumer group membership for drops or rebalancing
Review `kafka.consumer_group.members` to see if your consumer group size changed recently. If you normally run 10 consumers but now only have 7 active, some consumers crashed or were killed, reducing your total parallelism and processing capacity. Even a brief rebalancing event can cause a temporary lag spike as partitions are reassigned. Cross-reference this with application logs or pod restarts in your orchestrator (Kubernetes, etc.) to identify if consumers are flapping or unhealthy.
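Both failure modes in this step, lost consumers and rebalance churn, can be flagged from membership samples. A sketch, with an illustrative flapping threshold of three size changes in the window:

```python
def membership_alert(expected_members, member_samples):
    """member_samples: kafka.consumer_group.members readings, oldest to
    newest; expected_members is your normal group size. Returns a list of
    alert strings (empty when membership looks healthy)."""
    alerts = []
    current = member_samples[-1]
    if current < expected_members:
        alerts.append(f"{expected_members - current} consumer(s) missing")
    # Frequent size changes between samples suggest rebalance churn.
    changes = sum(1 for a, b in zip(member_samples, member_samples[1:]) if a != b)
    if changes >= 3:
        alerts.append("group is flapping (repeated rebalances)")
    return alerts
```

For the scenario above (10 expected, now 7), this reports the 3 missing consumers; cross-reference the timestamps with pod restarts to find why they died.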
5. Measure consumer fetch performance and latency
Check `kafka.consumer.fetch_rate` and `kafka.consumer.fetch_latency_avg` to understand if the bottleneck is on the broker side (network, disk I/O) or consumer side (processing logic). If fetch latency is high (>100ms) or fetch rate is abnormally low, you may have slow brokers or network issues between consumer and broker. If fetch metrics are healthy but lag keeps growing, the problem is consumer processing speed—your application code can't process messages fast enough. Also review `kafka.consumer.records_per_request_avg` to see if you're fetching efficiently in batches.
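The broker-side vs consumer-side decision can be sketched as a simple classifier. The 100 ms latency and minimum fetch-rate thresholds are illustrative starting points, not Kafka defaults:

```python
def fetch_bottleneck(fetch_latency_ms, fetch_rate, lag_growing,
                     latency_threshold_ms=100, min_fetch_rate=1.0):
    """fetch_latency_ms / fetch_rate: current kafka.consumer.fetch_latency_avg
    and kafka.consumer.fetch_rate readings; lag_growing: whether lag is
    still increasing. Points at the likely side of the bottleneck."""
    if fetch_latency_ms > latency_threshold_ms or fetch_rate < min_fetch_rate:
        return "broker/network side: slow fetches"
    if lag_growing:
        return "consumer side: processing too slow"
    return "healthy"
```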
6. Determine if you need to scale the consumer group
Calculate your maximum theoretical throughput: if you have 12 partitions and 4 consumers, each consumer handles 3 partitions. If your current `kafka.consumer.records_consumed_rate` is maxed out (CPU at 90%+, processing at full capacity) but still can't keep up, you need horizontal scaling—add more consumer instances up to the number of partitions. If consumers are underutilized (<50% CPU) but still slow, the issue is inefficient processing logic, slow downstream dependencies, or resource bottlenecks (database, external API). As a rule of thumb, a sustained lag >100k messages for 5+ minutes with healthy brokers means it's time to scale up.
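The sizing arithmetic above can be captured in a back-of-envelope helper (illustrative, and it assumes per-consumer throughput scales roughly linearly with instance count):

```python
import math

def consumers_needed(incoming_rate, per_consumer_rate, partitions):
    """How many consumer instances are needed for incoming_rate msg/sec,
    given per_consumer_rate msg/sec per instance at full utilization.
    Capped at the partition count: consumers beyond that sit idle, since
    a partition is consumed by at most one member of a group."""
    needed = math.ceil(incoming_rate / per_consumer_rate)
    return min(needed, partitions)
```

For example, at 2,000 msg/sec incoming and 500 msg/sec per instance, `consumers_needed(2000, 500, 12)` gives 4 instances; if per-instance throughput were only 125 msg/sec, the answer caps at the 12 partitions, meaning you also need faster processing or more partitions.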
Technologies
Related Insights
High or growing consumer lag indicates processing bottleneck
warning
Consumer lag increases steadily due to slow processing
warning
Consumer lag escalates rapidly while broker health metrics remain normal
critical
Consumer lag prevents real-time message processing
warning
Consumer offset stops advancing indicating stuck consumer
critical
High-volume spike causes delayed processing and missed event windows
warning
Consumer Lag Divergence
warning
Growing source lag combined with stable or decreasing throughput indicates the job cannot keep pace with input rate, leading to increasing latency and eventual processing failure.
Relevant Metrics
kafka_consumergroup_lag
kafka.consumer_group.lag_sum
kafka.consumer.records_lag_avg
kafka.consumer.records_lag_max
kafka.consumer.fetch_rate
kafka.consumer.records_consumed_rate
kafka.messages_in.rate
kafka.topic.message_rate
kafka.consumer.fetch_latency_avg
kafka.consumer_group.members
kafka.consumer.records_per_request_avg
kafka.consumer_group.offset
Monitoring Interfaces
Kafka Native