Consumer Group Rebalancing Storm During Deployment

Critical · Incident Response

A rolling deployment or consumer restart triggers cascading rebalances that halt consumption, creating a stop-the-world effect.

Prompt: We're doing a rolling deployment of our Kafka consumers and the entire consumer group has been stuck rebalancing for 30 minutes. Processing is completely stopped. What's happening and how do we fix it?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When a Kafka consumer group is stuck in a rebalancing storm during deployment, first confirm continuous member churn, then immediately check for the two critical failure modes: rebalance spiral livelock (where consumers can't complete processing within max.poll.interval.ms) and static membership zombie partitions (where failed consumers hold partitions hostage). These account for the majority of prolonged rebalance scenarios and require different remediation approaches.

1. Confirm continuous rebalance activity via member count oscillation
Check `kafka_consumergroup_members` (Prometheus) or `kafka.consumer_group.members` (Datadog) for rapid fluctuations over the last 30 minutes. In a healthy rolling deployment, you'd see the count dip briefly as each pod restarts, then stabilize. If the member count is oscillating wildly (e.g., bouncing between 8 and 12 members every 30-60 seconds), you have a rebalance storm rather than normal deployment churn. This confirms consumers are repeatedly joining and leaving the group instead of stabilizing after reassignment.
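
The "oscillation vs. brief dip" distinction above can be sketched as a check over scraped member-count samples. The 30-minute window matches the step; the threshold of four direction changes is an assumption, not a Kafka-defined constant:

```python
# Classify consumer-group member-count samples as a rebalance storm.
# Assumes you have scraped the member-count metric into (epoch_sec, count) pairs.

def is_rebalance_storm(samples, window_sec=1800, min_flips=4):
    """samples: list of (epoch_sec, member_count), oldest first."""
    recent = [c for t, c in samples if t >= samples[-1][0] - window_sec]
    flips, direction = 0, 0
    for prev, cur in zip(recent, recent[1:]):
        step = (cur > prev) - (cur < prev)   # +1 rising, -1 falling, 0 flat
        if step and direction and step != direction:
            flips += 1                        # member count changed direction
        if step:
            direction = step
    return flips >= min_flips

# Healthy rolling deploy: one brief dip, then the count stabilizes.
healthy = [(i * 60, c) for i, c in enumerate([10, 9, 10, 10, 10, 10, 10])]
# Storm: membership oscillates on every sample.
storm = [(i * 60, c) for i, c in enumerate([10, 8, 12, 8, 11, 7, 12, 8])]
print(is_rebalance_storm(healthy), is_rebalance_storm(storm))
```

Counting direction changes rather than raw deltas keeps a single slow pod restart (one dip, one recovery) from being misread as a storm.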
2. Check for rebalance spiral livelock caused by max.poll.interval.ms violations
This is the most common cause of prolonged rebalance storms. If consumers are processing slow records (poison pills, database timeouts) and exceeding max.poll.interval.ms, they send LeaveGroup requests, rejoin, get reassigned the same partitions, fetch the same slow batch, and repeat indefinitely. You'll see 100% lag on specific partitions while consumer heartbeats remain healthy. Check consumer logs for "member ... sending LeaveGroup request" followed immediately by rejoin messages. If max.poll.interval.ms is under 5 minutes and you're processing complex records, this is almost certainly your culprit.
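
The leave-then-rejoin signature described above can be counted mechanically. This is a sketch over raw consumer client log lines; the substrings approximate the Apache Kafka Java client's wording, so adjust the patterns to your client library's exact messages:

```python
# Count LeaveGroup -> rejoin cycles per member in consumer client logs.
import re
from collections import Counter

LEAVE = re.compile(r"Member (\S+) sending LeaveGroup request")

def leave_rejoin_cycles(log_lines):
    cycles = Counter()
    pending = set()                      # members that left, awaiting rejoin
    for line in log_lines:
        m = LEAVE.search(line)
        if m:
            pending.add(m.group(1))
        elif "joining group" in line.lower() and pending:
            for member in pending:
                cycles[member] += 1      # left and came straight back
            pending.clear()
    return cycles

logs = [
    "Member consumer-1-abc sending LeaveGroup request to coordinator (poll interval exceeded)",
    "(Re-)joining group",
    "Member consumer-1-abc sending LeaveGroup request to coordinator (poll interval exceeded)",
    "(Re-)joining group",
]
print(leave_rejoin_cycles(logs))
```

A member accumulating cycles every few minutes, rather than once per deploy, is the spiral: it keeps reclaiming the same partitions and hitting the same slow batch.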
3. Identify static membership zombie partitions holding resources hostage
If you're using static membership (group.instance.id configured), a consumer that crashes or hits max.poll.interval.ms preserves its partition assignment for session.timeout.ms even after leaving. If the failure is deterministic (same poison message), that consumer reclaims the same partitions and fails again indefinitely while the broker refuses to reassign those partitions to healthy consumers. Check broker logs for "static member ... is fenced" or use the kafka-consumer-groups CLI to see members with long-lived assignments despite restarts. This creates per-partition 100% lag that won't self-heal.
4. Examine lag distribution to determine if it's a full stop or partial starvation
Compare `kafka.consumer_group.lag_sum` and `kafka.consumer.records_lag_max` against per-partition lag. If cooperative rebalancing is enabled (CooperativeStickyAssignor), only partitions assigned to sick consumers may show 100% lag while healthy consumers maintain throughput on their partitions. Aggregate throughput metrics like `kafka.consumer.records_consumed_rate` might only dip 5-10%, masking that a subset of data is completely stalled. If lag is growing uniformly across all partitions, the entire group is stuck. If lag is isolated to specific partitions, you have a partial failure that cooperative rebalancing is hiding.
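
The full-stop vs. partial-starvation split can be sketched directly from per-partition lag growth over a window. The 1000-record threshold is an illustrative assumption; calibrate it to your throughput:

```python
# Distinguish uniform stall from cooperative-rebalance partial starvation.

def classify_lag(lag_growth_per_partition, stalled_threshold=1000):
    """lag_growth_per_partition: {partition: lag increase over the window}."""
    stalled = [p for p, g in lag_growth_per_partition.items()
               if g >= stalled_threshold]
    frac = len(stalled) / len(lag_growth_per_partition)
    if frac == 0:
        return "healthy", stalled
    if frac == 1:
        return "full-stop", stalled        # whole group stuck rebalancing
    return "partial-starvation", stalled   # only sick consumers' partitions

# Two of six partitions stalled; aggregate rate would barely dip.
growth = {0: 0, 1: 12, 2: 5400, 3: 4900, 4: 8, 5: 3}
state, stuck = classify_lag(growth)
print(state, sorted(stuck))
```

This is why per-partition lag matters: the aggregate `lag_sum` for this example looks like a modest bump, while partitions 2 and 3 are receiving no service at all.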
5. Review consumer configuration for rebalance sensitivity
Check max.poll.interval.ms (default 300000ms/5min), session.timeout.ms (default 45000ms/45s since Kafka 3.0, 10000ms/10s before), and heartbeat.interval.ms (default 3000ms/3s). If max.poll.interval.ms is too low for your batch processing time, or session.timeout.ms is too aggressive for your network conditions, consumers will be kicked out of the group prematurely. During a deployment, temporary slowness (GC pauses, container startup) can trigger false-positive failures. A common baseline for batch processors: max.poll.interval.ms at 900000ms (15min), session.timeout.ms at 45000ms (45s), and heartbeat.interval.ms at 15000ms (15s, one third of the session timeout) to decouple processing time from liveness detection.
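
The batch-processor baseline above, expressed as a consumer config fragment (the `max.poll.records` line is an added suggestion, not part of the runbook, and should be tuned to your per-record cost):

```properties
# 15 min between poll() calls before the member is evicted
max.poll.interval.ms=900000
# 45 s without heartbeats before the broker declares the member dead
session.timeout.ms=45000
# keep at <= 1/3 of session.timeout.ms
heartbeat.interval.ms=15000
# assumption: smaller batches shrink per-poll processing time
max.poll.records=100
```
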
6. Immediate remediation: force a consumer group reset or rotate instance IDs
If Step 2 identified a rebalance spiral, increase max.poll.interval.ms to 900000ms and redeploy. If Step 3 identified static membership zombies, change each consumer's group.instance.id (append a timestamp or UUID) to force new assignments, or use the Kafka Admin API (removeMembersFromConsumerGroup) to explicitly remove the stuck static members. For immediate relief without config changes, perform a controlled consumer group reset: stop all consumers, then either delete the group (kafka-consumer-groups --delete) or run kafka-consumer-groups --reset-offsets with an explicit policy (note that resetting offsets moves the consumption position and can cause reprocessing or skipped records), and restart with staggered delays (30s between pods) to avoid a thundering-herd rebalance.
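
The instance-ID rotation in this step can be sketched as a small helper that stamps every pod's group.instance.id with the same one-time suffix. The base IDs are hypothetical; apply the output to each pod's consumer config before the staggered restart:

```python
# Rotate group.instance.id values so the broker treats each consumer as a
# brand-new static member and hands out fresh assignments, instead of
# re-fencing the old (possibly zombie) registrations.
import uuid

def rotated_instance_ids(base_ids):
    """Append one shared, one-time suffix to every base instance ID."""
    suffix = uuid.uuid4().hex[:8]
    return {base: f"{base}-{suffix}" for base in base_ids}

new_ids = rotated_instance_ids(["orders-pod-0", "orders-pod-1", "orders-pod-2"])
for base, rotated in new_ids.items():
    print(base, "->", rotated)
```

Using a single shared suffix per rollout (rather than a random suffix per pod) keeps the IDs deterministic within one deploy, so a pod that crashes and restarts mid-rollout still reclaims its own new identity.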

Related Insights

- Kafka Consumer Group Rebalance Storm Triggering Lambda Restarts (warning): Frequent Kafka consumer group rebalances (detected via `kafka_consumergroup_members` changes) can trigger Lambda function restarts (`fullRestarts` metric), causing processing interruptions, increased cold starts (`InitDuration`), and temporary offset lag spikes as Lambda event source mappings rejoin the consumer group.
- High consumer rebalance rate causes processing instability (warning)
- Consumer Group Member Instability from Frequent Rebalances (warning): Frequent changes in consumer group membership trigger rebalances, causing processing pauses, increased latency, and temporary unavailability.
- Consumer rebalancing causes periodic lag spikes (info)
- Rebalance spiral livelock prevents partition processing when max.poll.interval.ms exceeded (critical)
- Cooperative rebalancing masks partition starvation in aggregate metrics (warning)
- Static membership holds partitions hostage when consumer fails repeatedly (critical)

Relevant Metrics

- `kafka_consumergroup_members` / `kafka.consumer_group.members`
- `kafka.consumer_group.lag_sum`
- `kafka.consumer.records_lag_max`
- `kafka.consumer.records_consumed_rate`

Monitoring Interfaces

- Kafka Datadog
- Kafka Prometheus
- Kafka Native