Rebalance spiral livelock prevents partition processing when max.poll.interval.ms is exceeded
critical | availability | Updated Mar 5, 2026
Technologies:
How to detect:
The consumer exceeds max.poll.interval.ms because record processing is slow, sends a LeaveGroup request, rejoins the group, and CooperativeStickyAssignor assigns it the same partitions again. Because the offsets for the slow batch were never committed, the consumer fetches the same batch, hits the same slow record, and the cycle repeats indefinitely. Committed offsets on the affected partitions stop advancing, so lag grows without bound, while the heartbeat thread remains healthy.
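The symptom above (repeated reassignment of the same partition with no commit progress) can be turned into a simple check. The following is a minimal sketch, not part of any Kafka client: RebalanceSample, stalled_partitions, and the min_repeats threshold are hypothetical names, and it assumes you already record the committed offset of each partition every time it is reassigned (for example from a rebalance listener or from group-lag metrics).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebalanceSample:
    """One observation taken each time a partition is (re)assigned. Hypothetical type."""
    partition: int
    committed_offset: int  # last committed offset at the moment of reassignment


def stalled_partitions(samples, min_repeats=3):
    """Return partitions whose latest `min_repeats` reassignments showed no commit progress."""
    streak = {}       # partition -> consecutive reassignments at the same committed offset
    last_offset = {}  # partition -> committed offset seen at the previous reassignment
    for s in samples:
        if last_offset.get(s.partition) == s.committed_offset:
            streak[s.partition] = streak.get(s.partition, 1) + 1
        else:
            streak[s.partition] = 1  # first sighting, or the offset moved: reset
        last_offset[s.partition] = s.committed_offset
    return sorted(p for p, n in streak.items() if n >= min_repeats)


if __name__ == "__main__":
    history = [
        RebalanceSample(0, 100), RebalanceSample(1, 50),
        RebalanceSample(0, 100), RebalanceSample(1, 80),
        RebalanceSample(0, 100),
    ]
    print(stalled_partitions(history))
```

In this example partition 0 is reassigned three times at offset 100 and is reported, while partition 1 advances from 50 to 80 and is not. The threshold of three is an arbitrary assumption; tune it to how often rebalances normally occur in your group.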
Recommended action:
Increase max.poll.interval.ms to 900000ms (15 minutes) so that slow record processing is not treated as a failed consumer. Keep session.timeout.ms at 45000ms for fast crash detection. Set heartbeat.interval.ms no higher than 1/3 of session.timeout.ms (at most 15000ms). Reduce max.poll.records to 50 so that each poll returns a smaller batch, which limits the blast radius of a slow record.
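The relationships between these settings can be sanity-checked before deployment. The sketch below is illustrative only: it uses the property names and values from this card in a plain dictionary, and check_consumer_timeouts and per_record_budget_ms are hypothetical helpers, not part of any Kafka client. The second check (session timeout shorter than poll interval) reflects this card's intent rather than a rule enforced by the broker.

```python
# Values recommended in this card, keyed by Kafka consumer property name.
RECOMMENDED = {
    "max.poll.interval.ms": 900_000,   # 15 minutes
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 15_000,
    "max.poll.records": 50,
}


def check_consumer_timeouts(cfg):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = []
    # Heartbeats should fit at least three times into one session timeout.
    if cfg["heartbeat.interval.ms"] * 3 > cfg["session.timeout.ms"]:
        problems.append("heartbeat.interval.ms is higher than 1/3 of session.timeout.ms")
    # Assumption from this card: crash detection should be faster than the processing bound.
    if cfg["session.timeout.ms"] >= cfg["max.poll.interval.ms"]:
        problems.append("session.timeout.ms is not shorter than max.poll.interval.ms")
    return problems


def per_record_budget_ms(cfg):
    """Average time a record may take before a full batch exceeds max.poll.interval.ms."""
    return cfg["max.poll.interval.ms"] / cfg["max.poll.records"]


if __name__ == "__main__":
    print(check_consumer_timeouts(RECOMMENDED))
    print(per_record_budget_ms(RECOMMENDED))
```

With the recommended values, a full batch of 50 records leaves an average budget of 18000ms (18 seconds) per record. A single record that takes longer than the whole 15-minute interval would still trigger the spiral, so the budget is an average, not a guarantee.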