Cooperative rebalancing masks partition starvation in aggregate metrics
Warning: Under CooperativeStickyAssignor, a rebalance runs in two phases (REVOKE → ASSIGN), and only the partitions that change owner are revoked, so unaffected consumers keep processing. When one consumer enters a rebalance spiral, healthy consumers maintain throughput on their own partitions. Only the partitions assigned to the sick consumer stall, and their lag grows without bound. Aggregate throughput may dip by only about 5% (for example, one stalled consumer out of 20 with evenly spread load), masking the fact that 5% of the data is completely stagnant.
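The arithmetic behind the masking effect can be sketched as follows. This is an illustrative model only: the group size of 20, the per-consumer rate of 1000 records/s, and the function name `simulate_stall` are assumptions made for the example, not values from a real cluster.

```python
def simulate_stall(consumers=20, rate_per_consumer=1000.0, stalled=1, seconds=60):
    """Model a group with evenly spread load where `stalled` consumers make no progress.

    Returns (fraction of normal aggregate throughput,
             records of lag accumulated per stalled consumer after `seconds`).
    """
    healthy = consumers - stalled
    # Healthy consumers keep their partitions, so the aggregate only loses the stalled share.
    aggregate_fraction = healthy / consumers
    # Producers keep writing to the stalled partitions, so lag there grows linearly.
    stalled_lag = rate_per_consumer * seconds
    return aggregate_fraction, stalled_lag


fraction, lag = simulate_stall()
print(f"aggregate throughput: {fraction:.0%} of normal")
print(f"lag per stalled consumer after 60 s: {lag:.0f} records")
```

In this model the aggregate graph shows a shallow dip while the stalled partitions fall further behind every second, which is why the aggregate alone is a poor alarm signal.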
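Monitor per-partition lag and per-client-ID consumption rates, not just cluster-wide aggregates. Correlate a high rebalance-rate with a stable heartbeat-response-time-max to identify application thread issues: heartbeats that stay healthy while rebalances continue suggest the background heartbeat thread is fine and the processing thread is the one stalling, for example by exceeding max.poll.interval.ms. Look for specific client IDs showing a records-consumed-rate of zero while the global rate remains healthy. Alert on sawtooth patterns in records-lag-max for individual partitions.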
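A minimal sketch of these two checks, assuming the metrics have already been scraped into plain Python structures. The function names, the thresholds, and the sample values are hypothetical; how the metrics are collected (JMX, a metrics reporter, an exporter) is left out.

```python
from typing import Dict, List, Sequence


def find_starved_clients(rates: Dict[str, float], healthy_global: float = 1.0) -> List[str]:
    """Return client IDs with a zero records-consumed-rate while the group as a whole is active.

    `rates` maps client ID -> records-consumed-rate. `healthy_global` is an assumed
    threshold below which the whole group is treated as idle rather than partially starved.
    """
    if sum(rates.values()) < healthy_global:
        return []  # the entire group is idle; this is not the masking case
    return sorted(cid for cid, rate in rates.items() if rate <= 0.0)


def count_lag_resets(lag_samples: Sequence[float], drop_fraction: float = 0.5) -> int:
    """Count sharp drops in a per-partition lag series (the falling edges of a sawtooth).

    A reset is a sample that falls by more than `drop_fraction` of the previous sample.
    """
    resets = 0
    for prev, cur in zip(lag_samples, lag_samples[1:]):
        if prev > 0 and cur < prev * (1 - drop_fraction):
            resets += 1
    return resets


# Hypothetical sample data: one client has stopped consuming while the others are healthy.
rates = {"consumer-a": 950.0, "consumer-b": 1020.0, "consumer-c": 0.0}
print(find_starved_clients(rates))

# Hypothetical lag series for one partition: climbs, collapses, climbs again.
lag = [0, 100, 200, 300, 20, 120, 220, 10, 110]
print(count_lag_resets(lag))
```

An alert could then fire when `find_starved_clients` returns a non-empty list, or when `count_lag_resets` over a sliding window exceeds some count chosen for the workload.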