Rebalance spiral livelock prevents partition processing when max.poll.interval.ms exceeded

critical

availability

Updated Mar 5, 2026 (via Exa)

Sources

The Rebalance Spiral: Debugging Cooperative Sticky Assigner Livelocks in Kafka Consumer Groups (azguards.com)

Technologies:

Confluent Platform (subject)

Apache Kafka: the root cause of this issue originates in Apache Kafka.

How to detect:

A consumer exceeds max.poll.interval.ms because of slow record processing, sends a LeaveGroup request, rejoins the group, and is handed the same partitions again by the CooperativeStickyAssignor. The consumer then fetches the same batch, hits the same slow record, and the cycle repeats indefinitely. The affected partitions make no progress, so their lag only grows, while the heartbeat thread remains healthy and the consumer does not look crashed.
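
The cycle can be illustrated with a toy model. This is a minimal sketch and not Kafka client code: the class and method names are hypothetical, and it assumes offsets are committed only after a whole batch has been processed within the poll interval. The values 300000 ms and 500 records are used purely as illustrative settings.

```java
import java.util.List;

public class RebalanceSpiralSketch {

    /**
     * Toy model of the spiral (hypothetical helper, not a Kafka API).
     * Each cycle "polls" up to maxPollRecords records starting at the
     * committed offset. If processing the batch takes longer than
     * maxPollIntervalMs, the consumer is treated as evicted, nothing is
     * committed, and the next cycle refetches the same batch.
     */
    static int committedOffsetAfter(List<Long> recordMillis, int maxPollRecords,
                                    long maxPollIntervalMs, int cycles) {
        int committed = 0;
        for (int c = 0; c < cycles && committed < recordMillis.size(); c++) {
            int end = Math.min(committed + maxPollRecords, recordMillis.size());
            long batchMillis = 0;
            for (int i = committed; i < end; i++) {
                batchMillis += recordMillis.get(i);
            }
            if (batchMillis <= maxPollIntervalMs) {
                committed = end; // batch finished in time, offsets advance
            }
            // Otherwise: LeaveGroup, rejoin, same partitions, same offset.
        }
        return committed;
    }

    public static void main(String[] args) {
        // One record needs 400 s of processing; the others are fast.
        List<Long> times = List.of(10L, 10L, 400_000L, 10L);

        // 300 s poll interval: the batch never completes, offset stays at 0.
        System.out.println(committedOffsetAfter(times, 500, 300_000L, 10));

        // 900 s poll interval: the batch completes on the first cycle.
        System.out.println(committedOffsetAfter(times, 500, 900_000L, 10));
    }
}
```

In this model the committed offset is the signal to watch: when it stays flat across repeated rejoin cycles while the consumer still appears alive, the pattern matches the livelock described above.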

Recommended action:

Increase max.poll.interval.ms to 900000 ms (15 minutes) to decouple processing time from liveness detection. Keep session.timeout.ms at 45000 ms for fast crash detection. Set heartbeat.interval.ms to one third of session.timeout.ms (15000 ms); Kafka's guidance is to keep it no higher than one third. Reduce max.poll.records to 50 to limit the blast radius of slow records.
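
The recommended values can be written as consumer properties. The sketch below uses only java.util.Properties with the standard Kafka consumer config names as plain strings. The class and method names are hypothetical, and the per-record budget helper is an illustrative calculation, not part of any Kafka API.

```java
import java.util.Properties;

public class SpiralSafeConsumerConfig {

    /** Overrides taken from the recommendation above (values in ms, except records). */
    static Properties recommendedOverrides() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", "900000"); // 15 minutes of processing headroom
        props.setProperty("session.timeout.ms", "45000");    // crash detection stays fast
        props.setProperty("heartbeat.interval.ms", "15000"); // one third of session.timeout.ms
        props.setProperty("max.poll.records", "50");         // smaller batches per poll
        return props;
    }

    /**
     * Average time each record may take when a full batch is returned,
     * before the poll interval is exceeded (hypothetical helper).
     */
    static long perRecordBudgetMs(Properties props) {
        long intervalMs = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        long maxRecords = Long.parseLong(props.getProperty("max.poll.records"));
        return intervalMs / maxRecords;
    }

    public static void main(String[] args) {
        Properties props = recommendedOverrides();
        // 900000 ms / 50 records = 18000 ms average budget per record.
        System.out.println(perRecordBudgetMs(props));
    }
}
```

These properties would be merged with the usual bootstrap, group, and deserializer settings before constructing a consumer. Note that the budget is an average over a full batch: a single record that takes longer than 15 minutes on its own would still exceed max.poll.interval.ms.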