Consumer Group Rebalancing Storm During Deployment

Critical · Incident Response

A rolling deployment or consumer restart triggers cascading rebalances that halt consumption, creating a stop-the-world effect.

Prompt: We're doing a rolling deployment of our Kafka consumers and the entire consumer group has been stuck rebalancing for 30 minutes. Processing is completely stopped. What's happening and how do we fix it?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When a Kafka consumer group is stuck in a rebalancing storm during deployment, first confirm continuous member churn, then immediately check for the two critical failure modes: rebalance spiral livelock (where consumers can't complete processing within max.poll.interval.ms) and static membership zombie partitions (where failed consumers hold partitions hostage). These account for the majority of prolonged rebalance scenarios and require different remediation approaches.

1. Confirm continuous rebalance activity via member count oscillation
Check `kafka_consumergroup_members` (Prometheus) or `kafka.consumer_group.members` (Datadog) for rapid fluctuations over the last 30 minutes. In a healthy rolling deployment, you'd see the count dip briefly as each pod restarts, then stabilize. If the member count is oscillating wildly (e.g., bouncing between 8 and 12 members every 30-60 seconds), you have a rebalance storm rather than normal deployment churn. This confirms consumers are repeatedly joining and leaving the group instead of stabilizing after reassignment.
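
The "oscillation vs. brief dip" distinction above can be sketched as a check over scraped member-count samples. The 30-minute window matches the step; the threshold of four direction changes is an assumption, not a Kafka-defined constant:

```python
# Classify consumer-group member-count samples as a rebalance storm.
# Assumes you have scraped the member-count metric into (epoch_sec, count) pairs.

def is_rebalance_storm(samples, window_sec=1800, min_flips=4):
    """samples: list of (epoch_sec, member_count), oldest first."""
    recent = [c for t, c in samples if t >= samples[-1][0] - window_sec]
    flips, direction = 0, 0
    for prev, cur in zip(recent, recent[1:]):
        step = (cur > prev) - (cur < prev)   # +1 rising, -1 falling, 0 flat
        if step and direction and step != direction:
            flips += 1                        # member count changed direction
        if step:
            direction = step
    return flips >= min_flips

# Healthy rolling deploy: one brief dip, then the count stabilizes.
healthy = [(i * 60, c) for i, c in enumerate([10, 9, 10, 10, 10, 10, 10])]
# Storm: membership oscillates on every sample.
storm = [(i * 60, c) for i, c in enumerate([10, 8, 12, 8, 11, 7, 12, 8])]
print(is_rebalance_storm(healthy), is_rebalance_storm(storm))
```

Counting direction changes rather than raw deltas keeps a single slow pod restart (one dip, one recovery) from being misread as a storm.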
2. Check for rebalance spiral livelock caused by max.poll.interval.ms violations
This is the most common cause of prolonged rebalance storms. If consumers are processing slow records (poison pills, database timeouts) and exceeding max.poll.interval.ms, they send LeaveGroup requests, rejoin, get reassigned the same partitions, fetch the same slow batch, and repeat indefinitely. You'll see 100% lag on specific partitions while consumer heartbeats remain healthy. Check consumer logs for "member ... sending LeaveGroup request" followed immediately by rejoin messages. If max.poll.interval.ms is under 5 minutes and you're processing complex records, this is almost certainly your culprit.
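
The leave-then-rejoin signature described above can be counted mechanically. This is a sketch over raw consumer client log lines; the substrings approximate the Apache Kafka Java client's wording, so adjust the patterns to your client library's exact messages:

```python
# Count LeaveGroup -> rejoin cycles per member in consumer client logs.
import re
from collections import Counter

LEAVE = re.compile(r"Member (\S+) sending LeaveGroup request")

def leave_rejoin_cycles(log_lines):
    cycles = Counter()
    pending = set()                      # members that left, awaiting rejoin
    for line in log_lines:
        m = LEAVE.search(line)
        if m:
            pending.add(m.group(1))
        elif "joining group" in line.lower() and pending:
            for member in pending:
                cycles[member] += 1      # left and came straight back
            pending.clear()
    return cycles

logs = [
    "Member consumer-1-abc sending LeaveGroup request to coordinator (poll interval exceeded)",
    "(Re-)joining group",
    "Member consumer-1-abc sending LeaveGroup request to coordinator (poll interval exceeded)",
    "(Re-)joining group",
]
print(leave_rejoin_cycles(logs))
```

A member accumulating cycles every few minutes, rather than once per deploy, is the spiral: it keeps reclaiming the same partitions and hitting the same slow batch.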
3. Identify static membership zombie partitions holding resources hostage
If you're using static membership (group.instance.id configured), a consumer that crashes or hits max.poll.interval.ms preserves its partition assignment for session.timeout.ms even after leaving. If the failure is deterministic (same poison message), that consumer reclaims the same partitions and fails again indefinitely while the broker refuses to reassign those partitions to healthy consumers. Check broker logs for "static member ... is fenced" or use the kafka-consumer-groups CLI to see members with long-lived assignments despite restarts. This creates per-partition 100% lag that won't self-heal.
4. Examine lag distribution to determine if it's a full stop or partial starvation
Compare `kafka.consumer_group.lag_sum` and `kafka.consumer.records_lag_max` against per-partition lag. If cooperative rebalancing is enabled (CooperativeStickyAssignor), only partitions assigned to sick consumers may show 100% lag while healthy consumers maintain throughput on their partitions. Aggregate throughput metrics like `kafka.consumer.records_consumed_rate` might only dip 5-10%, masking that a subset of data is completely stalled. If lag is growing uniformly across all partitions, the entire group is stuck. If lag is isolated to specific partitions, you have a partial failure that cooperative rebalancing is hiding.
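
The full-stop vs. partial-starvation split can be sketched directly from per-partition lag growth over a window. The 1000-record threshold is an illustrative assumption; calibrate it to your throughput:

```python
# Distinguish uniform stall from cooperative-rebalance partial starvation.

def classify_lag(lag_growth_per_partition, stalled_threshold=1000):
    """lag_growth_per_partition: {partition: lag increase over the window}."""
    stalled = [p for p, g in lag_growth_per_partition.items()
               if g >= stalled_threshold]
    frac = len(stalled) / len(lag_growth_per_partition)
    if frac == 0:
        return "healthy", stalled
    if frac == 1:
        return "full-stop", stalled        # whole group stuck rebalancing
    return "partial-starvation", stalled   # only sick consumers' partitions

# Two of six partitions stalled; aggregate rate would barely dip.
growth = {0: 0, 1: 12, 2: 5400, 3: 4900, 4: 8, 5: 3}
state, stuck = classify_lag(growth)
print(state, sorted(stuck))
```

This is why per-partition lag matters: the aggregate `lag_sum` for this example looks like a modest bump, while partitions 2 and 3 are receiving no service at all.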
5. Review consumer configuration for rebalance sensitivity
Check max.poll.interval.ms (default 300000ms/5min), session.timeout.ms (default 45000ms/45s since Kafka 3.0, 10000ms/10s before), and heartbeat.interval.ms (default 3000ms/3s). If max.poll.interval.ms is too low for your batch processing time, or session.timeout.ms is too aggressive for your network conditions, consumers will be kicked out of the group prematurely. During a deployment, temporary slowness (GC pauses, container startup) can trigger false-positive failures. A common baseline for batch processors: max.poll.interval.ms at 900000ms (15min), session.timeout.ms at 45000ms (45s), and heartbeat.interval.ms at 15000ms (15s, one third of the session timeout) to decouple processing time from liveness detection.
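
The batch-processor baseline above, expressed as a consumer config fragment (the `max.poll.records` line is an added suggestion, not part of the runbook, and should be tuned to your per-record cost):

```properties
# 15 min between poll() calls before the member is evicted
max.poll.interval.ms=900000
# 45 s without heartbeats before the broker declares the member dead
session.timeout.ms=45000
# keep at <= 1/3 of session.timeout.ms
heartbeat.interval.ms=15000
# assumption: smaller batches shrink per-poll processing time
max.poll.records=100
```
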
6. Immediate remediation: force a consumer group reset or rotate instance IDs
If Step 2 identified a rebalance spiral, increase max.poll.interval.ms to 900000ms and redeploy. If Step 3 identified static membership zombies, change each consumer's group.instance.id (append a timestamp or UUID) to force new assignments, or use the Kafka Admin API (removeMembersFromConsumerGroup) to explicitly remove the stuck static members. For immediate relief without config changes, perform a controlled consumer group reset: stop all consumers, then either delete the group (kafka-consumer-groups --delete) or run kafka-consumer-groups --reset-offsets with an explicit policy (note that resetting offsets moves the consumption position and can cause reprocessing or skipped records), and restart with staggered delays (30s between pods) to avoid a thundering-herd rebalance.
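
The instance-ID rotation in this step can be sketched as a small helper that stamps every pod's group.instance.id with the same one-time suffix. The base IDs are hypothetical; apply the output to each pod's consumer config before the staggered restart:

```python
# Rotate group.instance.id values so the broker treats each consumer as a
# brand-new static member and hands out fresh assignments, instead of
# re-fencing the old (possibly zombie) registrations.
import uuid

def rotated_instance_ids(base_ids):
    """Append one shared, one-time suffix to every base instance ID."""
    suffix = uuid.uuid4().hex[:8]
    return {base: f"{base}-{suffix}" for base in base_ids}

new_ids = rotated_instance_ids(["orders-pod-0", "orders-pod-1", "orders-pod-2"])
for base, rotated in new_ids.items():
    print(base, "->", rotated)
```

Using a single shared suffix per rollout (rather than a random suffix per pod) keeps the IDs deterministic within one deploy, so a pod that crashes and restarts mid-rollout still reclaims its own new identity.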

Related Insights

- Kafka Consumer Group Rebalance Storm Triggering Lambda Restarts (warning): Frequent Kafka consumer group rebalances (detected via `kafka_consumergroup_members` changes) can trigger Lambda function restarts (`fullRestarts` metric), causing processing interruptions, increased cold starts (`InitDuration`), and temporary offset lag spikes as Lambda event source mappings rejoin the consumer group.
- High consumer rebalance rate causes processing instability (warning)
- Consumer Group Member Instability from Frequent Rebalances (warning): Frequent changes in consumer group membership trigger rebalances, causing processing pauses, increased latency, and temporary unavailability.
- Consumer rebalancing causes periodic lag spikes (info)
- Rebalance spiral livelock prevents partition processing when max.poll.interval.ms exceeded (critical)
- Cooperative rebalancing masks partition starvation in aggregate metrics (warning)
- Static membership holds partitions hostage when consumer fails repeatedly (critical)

Relevant Metrics

- `kafka_consumergroup_members` / `kafka.consumer_group.members`
- `kafka.consumer_group.lag_sum`
- `kafka.consumer.records_lag_max`
- `kafka.consumer.records_consumed_rate`

Monitoring Interfaces

- Kafka Datadog
- Kafka Prometheus
- Kafka Native