Consumer Lag Spike During Peak Traffic
critical
Capacity Planning
Consumer lag suddenly increases, indicating consumers are falling behind producers and risking stale downstream data and backpressure failures.
Prompt: “My Kafka consumer lag just spiked from 100 messages to 50,000 messages in the last 10 minutes. Help me figure out what's wrong and whether I need to scale up my consumer group.”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When consumer lag spikes suddenly, first verify that offsets are actually advancing to rule out a stuck consumer. Then compare producer throughput to consumer throughput to determine if this is a traffic spike overwhelming your consumers or a consumer-side processing slowdown. Finally, check consumer group membership and fetch performance to identify whether you need to scale horizontally or fix a performance bottleneck.
1. Verify consumer offsets are advancing
Check `kafka.consumer_group.offset` over a 5-10 minute window to confirm offsets are actually moving forward. If the offset hasn't budged while lag increased from 100 to 50,000 messages, you have a stuck consumer (crash, deadlock, or unhandled exception) rather than a throughput problem. This is the fastest way to distinguish 'consumer can't keep up' from 'consumer isn't running at all.' If offsets are frozen for >10 minutes, restart the consumer and review logs for crashes or blocked threads.
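The stuck-consumer check above can be sketched as a small helper over periodic metric samples. The function name and sampling shape are illustrative, not part of any Kafka client API:

```python
def is_consumer_stuck(offset_samples, lag_samples):
    """offset_samples / lag_samples: readings of kafka.consumer_group.offset
    and group lag, oldest to newest, over a ~5-10 minute window.
    A frozen offset while lag grows points at a crashed or deadlocked
    consumer rather than a throughput problem."""
    offset_frozen = len(set(offset_samples)) == 1
    lag_growing = lag_samples[-1] > lag_samples[0]
    return offset_frozen and lag_growing
```

For example, `is_consumer_stuck([1200, 1200, 1200], [100, 20_000, 50_000])` returns `True` (offset frozen, lag exploding), while a slowly advancing offset returns `False` and points you at throughput instead.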
2. Compare producer rate to consumer consumption rate
Look at `kafka.messages_in.rate` (or `kafka.topic.message_rate`) and compare it to `kafka.consumer.records_consumed_rate` over the last 15-30 minutes. If the incoming rate suddenly doubled from 1,000 msg/sec to 2,000 msg/sec while your consumer rate stayed flat at 500 msg/sec, you've got a traffic spike that's overwhelming your consumers. If the producer rate is steady but consumer rate suddenly dropped, you have a consumer-side processing bottleneck or resource contention. This tells you whether the problem is 'too much input' or 'slow processing.'
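A minimal sketch of this comparison, assuming you already have recent rate samples in msg/sec (the function name and the 1.5x jump threshold are illustrative choices, not Kafka constants):

```python
def classify_lag_cause(producer_rates, consumer_rates, jump_factor=1.5):
    """producer_rates / consumer_rates: msg/sec samples, oldest to newest,
    e.g. from kafka.messages_in.rate and kafka.consumer.records_consumed_rate.
    Separates 'too much input' from 'slow processing'."""
    prod_jump = producer_rates[-1] >= producer_rates[0] * jump_factor
    cons_drop = consumer_rates[-1] <= consumer_rates[0] / jump_factor
    if prod_jump and not cons_drop:
        return "traffic spike"
    if cons_drop and not prod_jump:
        return "processing slowdown"
    if prod_jump and cons_drop:
        return "both"
    return "inconclusive"
```

With the numbers from the example above, `classify_lag_cause([1000, 1500, 2000], [500, 500, 500])` returns `"traffic spike"`: input doubled while consumption stayed flat.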
3. Monitor lag growth rate to assess urgency
Track `kafka.consumer.records_lag_max` and `kafka.consumer_group.lag_sum` to see if lag is accelerating, stabilizing, or decreasing. If lag is growing faster than 50,000 messages per minute, you're in critical territory and need immediate action (scaling or throttling). If lag spiked to 50k but is now holding steady or slowly decreasing, your consumers may catch up on their own as traffic normalizes. The insight on consumer lag escalation shows that exponential growth (500k to 14 million) despite healthy brokers means you need to scale consumer capacity, not fix infrastructure.
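The urgency triage can be expressed as a slope calculation over lag samples. This is a sketch: the 50,000 msg/min critical threshold mirrors the guidance above and should be tuned per workload.

```python
def lag_urgency(lag_samples, interval_sec, critical_per_min=50_000):
    """lag_samples: kafka.consumer_group.lag_sum readings, oldest to newest,
    taken interval_sec apart. Classifies the trend as critical, growing,
    or stable/recovering."""
    elapsed = (len(lag_samples) - 1) * interval_sec
    per_min = (lag_samples[-1] - lag_samples[0]) / elapsed * 60
    if per_min >= critical_per_min:
        return "critical: scale or throttle now"
    if per_min > 0:
        return "growing: watch closely"
    return "stable or recovering"
```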
4. Check consumer group membership for drops or rebalancing
Review `kafka.consumer_group.members` to see if your consumer group size changed recently. If you normally run 10 consumers but now only have 7 active, some consumers crashed or were killed, reducing your total parallelism and processing capacity. Even a brief rebalancing event can cause a temporary lag spike as partitions are reassigned. Cross-reference this with application logs or pod restarts in your orchestrator (Kubernetes, etc.) to identify if consumers are flapping or unhealthy.
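Both failure modes in this step, lost consumers and rebalance churn, can be flagged from membership samples. A sketch, with an illustrative flapping threshold of three size changes in the window:

```python
def membership_alert(expected_members, member_samples):
    """member_samples: kafka.consumer_group.members readings, oldest to
    newest; expected_members is your normal group size. Returns a list of
    alert strings (empty when membership looks healthy)."""
    alerts = []
    current = member_samples[-1]
    if current < expected_members:
        alerts.append(f"{expected_members - current} consumer(s) missing")
    # Frequent size changes between samples suggest rebalance churn.
    changes = sum(1 for a, b in zip(member_samples, member_samples[1:]) if a != b)
    if changes >= 3:
        alerts.append("group is flapping (repeated rebalances)")
    return alerts
```

For the scenario above (10 expected, now 7), this reports the 3 missing consumers; cross-reference the timestamps with pod restarts to find why they died.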
5. Measure consumer fetch performance and latency
Check `kafka.consumer.fetch_rate` and `kafka.consumer.fetch_latency_avg` to understand if the bottleneck is on the broker side (network, disk I/O) or consumer side (processing logic). If fetch latency is high (>100ms) or fetch rate is abnormally low, you may have slow brokers or network issues between consumer and broker. If fetch metrics are healthy but lag keeps growing, the problem is consumer processing speed—your application code can't process messages fast enough. Also review `kafka.consumer.records_per_request_avg` to see if you're fetching efficiently in batches.
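The broker-side vs consumer-side decision can be sketched as a simple classifier. The 100 ms latency and minimum fetch-rate thresholds are illustrative starting points, not Kafka defaults:

```python
def fetch_bottleneck(fetch_latency_ms, fetch_rate, lag_growing,
                     latency_threshold_ms=100, min_fetch_rate=1.0):
    """fetch_latency_ms / fetch_rate: current kafka.consumer.fetch_latency_avg
    and kafka.consumer.fetch_rate readings; lag_growing: whether lag is
    still increasing. Points at the likely side of the bottleneck."""
    if fetch_latency_ms > latency_threshold_ms or fetch_rate < min_fetch_rate:
        return "broker/network side: slow fetches"
    if lag_growing:
        return "consumer side: processing too slow"
    return "healthy"
```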
6. Determine if you need to scale the consumer group
Calculate your maximum theoretical throughput: if you have 12 partitions and 4 consumers, each consumer handles 3 partitions. If your current `kafka.consumer.records_consumed_rate` is maxed out (CPU at 90%+, processing at full capacity) but still can't keep up, you need horizontal scaling—add more consumer instances up to the number of partitions. If consumers are underutilized (<50% CPU) but still slow, the issue is inefficient processing logic, slow downstream dependencies, or resource bottlenecks (database, external API). As a rule of thumb, a sustained lag >100k messages for 5+ minutes with healthy brokers means it's time to scale up.
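The sizing arithmetic above can be captured in a back-of-envelope helper (illustrative, and it assumes per-consumer throughput scales roughly linearly with instance count):

```python
import math

def consumers_needed(incoming_rate, per_consumer_rate, partitions):
    """How many consumer instances are needed for incoming_rate msg/sec,
    given per_consumer_rate msg/sec per instance at full utilization.
    Capped at the partition count: consumers beyond that sit idle, since
    a partition is consumed by at most one member of a group."""
    needed = math.ceil(incoming_rate / per_consumer_rate)
    return min(needed, partitions)
```

For example, at 2,000 msg/sec incoming and 500 msg/sec per instance, `consumers_needed(2000, 500, 12)` gives 4 instances; if per-instance throughput were only 125 msg/sec, the answer caps at the 12 partitions, meaning you also need faster processing or more partitions.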
Technologies
Related Insights
High or growing consumer lag indicates processing bottleneck
warning
Consumer lag increases steadily due to slow processing
warning
Consumer lag escalates rapidly while broker health metrics remain normal
critical
Consumer lag prevents real-time message processing
warning
Consumer offset stops advancing indicating stuck consumer
critical
High-volume spike causes delayed processing and missed event windows
warning
Consumer Lag Divergence
warning
Growing source lag combined with stable or decreasing throughput indicates the job cannot keep pace with input rate, leading to increasing latency and eventual processing failure.
Relevant Metrics
kafka_consumergroup_lag
kafka.consumer_group.lag_sum
kafka.consumer.records_lag_avg
kafka.consumer.records_lag_max
kafka.consumer.fetch_rate
kafka.consumer.records_consumed_rate
kafka.messages_in.rate
kafka.topic.message_rate
kafka.consumer.fetch_latency_avg
kafka.consumer_group.members
kafka.consumer.records_per_request_avg
kafka.consumer_group.offset
Monitoring Interfaces
Kafka Native