Cooperative rebalancing masks partition starvation in aggregate metrics

warning

performance

Updated Mar 5, 2026 (via Exa)

Sources

The Rebalance Spiral: Debugging Cooperative Sticky Assigner Livelocks in Kafka Consumer Groups (azguards.com)

Technologies:

Confluent Platform (subject)

Apache Kafka: The root cause of this issue originates in Apache Kafka

How to detect:

Under CooperativeStickyAssignor, a rebalance proceeds in two phases (REVOKE → ASSIGN), and only the partitions that change owner are revoked, so unaffected consumers continue processing. When one consumer enters a rebalance spiral, healthy consumers maintain throughput on their partitions. Only the partitions assigned to the sick consumer stall: nothing is consumed from them, and their lag grows for as long as the spiral lasts. If that consumer owns about 5% of the traffic, aggregate throughput may dip by only 5%, masking the fact that 5% of the data is completely stagnant.
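The masking effect can be illustrated with a toy model. This is a minimal sketch, not taken from the source: the function name simulate_stall, the 20-partition layout, and the uniform produce rate are all assumptions chosen so that one stalled partition carries 5% of the traffic.

```python
from typing import List, Set, Tuple


def simulate_stall(
    num_partitions: int, stalled: Set[int], produce_rate: int, ticks: int
) -> Tuple[float, List[int]]:
    """Toy model: every partition receives produce_rate records per tick.

    Healthy partitions are consumed immediately; stalled partitions are not
    consumed at all, so their lag grows by produce_rate every tick.
    Returns (consumed / produced ratio, per-partition lag).
    """
    lag = [0] * num_partitions
    consumed = 0
    for _ in range(ticks):
        for p in range(num_partitions):
            if p in stalled:
                lag[p] += produce_rate  # the sick consumer never fetches these
            else:
                consumed += produce_rate  # healthy consumers keep up
    produced = num_partitions * produce_rate * ticks
    return consumed / produced, lag


# Hypothetical group: 20 partitions, partition 7 owned by the sick consumer.
ratio, lag = simulate_stall(num_partitions=20, stalled={7}, produce_rate=100, ticks=10)
print(f"aggregate consumed/produced: {ratio:.0%}")  # 95%
print(f"lag on partition 7: {lag[7]}")              # 1000
print(f"max lag elsewhere: {max(l for i, l in enumerate(lag) if i != 7)}")  # 0
```

The aggregate ratio looks like a minor dip, while the per-partition view shows one partition that has made no progress at all.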

Recommended action:

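Monitor per-partition lag and per-client-ID consumption rates, not just cluster-wide aggregates. Correlate a high rebalance rate (for example, the consumer's rebalance-rate-per-hour metric) with a stable heartbeat-response-time-max: heartbeats are sent from a background thread, so healthy heartbeats alongside frequent rebalances point to the application (polling) thread rather than the network or the broker. Look for specific client IDs showing a records-consumed-rate of zero while the global rate remains healthy. Alert on sawtooth patterns in records-lag-max for individual partitions.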
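Two of the checks above can be sketched in a few lines. This is an illustrative sketch only: it assumes the metrics have already been scraped into plain Python structures (a dict of records-consumed-rate per client ID, and a list of lag samples for one partition). The function names, the idle_threshold and drop_fraction parameters, and the sample values are hypothetical, not part of any Kafka API.

```python
from typing import Dict, List, Sequence


def starved_clients(rates: Dict[str, float], idle_threshold: float = 0.0) -> List[str]:
    """Client IDs consuming nothing while the group as a whole still consumes."""
    if sum(rates.values()) <= 0:
        return []  # whole group is idle: not the masked-starvation case
    return sorted(cid for cid, rate in rates.items() if rate <= idle_threshold)


def count_sawtooth_cycles(lag: Sequence[float], drop_fraction: float = 0.5) -> int:
    """Count sharp drops in a lag series.

    A drop is a sample that falls by at least drop_fraction relative to the
    running peak seen since the previous drop (lag climbs, then collapses).
    """
    cycles = 0
    peak = lag[0] if lag else 0
    for value in lag[1:]:
        if peak > 0 and value <= peak * (1 - drop_fraction):
            cycles += 1
            peak = value  # start tracking the next climb
        else:
            peak = max(peak, value)
    return cycles


# Hypothetical scraped values.
rates = {"consumer-1": 950.0, "consumer-2": 1010.0, "consumer-3": 0.0}
lag_samples = [0, 100, 200, 300, 50, 150, 250, 350, 40, 140]

print(starved_clients(rates))              # ['consumer-3']
print(count_sawtooth_cycles(lag_samples))  # 2
```

A simple alert rule under these assumptions would fire when starved_clients returns a non-empty list, or when a partition shows two or more sawtooth cycles within the evaluation window. Suitable thresholds depend on the workload.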