Apache Kafka

ISR Shrink/Expand Oscillation Reveals Replica Instability

warning
Replication
Updated Dec 20, 2025

Frequent ISR shrink and expand events mean that replicas are repeatedly falling behind the leader and catching up again. This points to intermittent broker issues, network instability, or insufficient resources for the replication workload.

How to detect:

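Monitor the ISR expand rate (kafka.replication.isr_expands.rate, or kafka.server.ReplicaManager.IsrExpandsPerSec over JMX) together with the corresponding ISR shrink rate (kafka.replication.isr_shrinks.rate, or kafka.server.ReplicaManager.IsrShrinksPerSec), and watch for high variance in both. Sustained non-zero rates for shrinks and expands at the same time (oscillation) indicate instability. In a healthy cluster both metrics stay near zero.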
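The detection rule above can be sketched as a small check over sampled metric values. This is a minimal illustration, not a reference implementation: the function name is_isr_oscillating and the parameters min_rate and min_active_fraction are hypothetical, and it assumes the shrink and expand rates have already been exported as two equal-length lists of per-interval samples.

```python
from typing import Sequence


def is_isr_oscillating(
    shrink_rates: Sequence[float],
    expand_rates: Sequence[float],
    min_rate: float = 0.0,
    min_active_fraction: float = 0.5,
) -> bool:
    """Return True when both ISR shrinks and expands are non-zero
    in a sustained share of the sampled intervals.

    min_rate and min_active_fraction are illustrative thresholds
    (assumptions); tune them against your own baseline.
    """
    if len(shrink_rates) != len(expand_rates):
        raise ValueError("shrink and expand series must have the same length")
    if not shrink_rates:
        return False

    n = len(shrink_rates)
    # Count the intervals in which each metric exceeded the noise floor.
    shrink_active = sum(1 for r in shrink_rates if r > min_rate)
    expand_active = sum(1 for r in expand_rates if r > min_rate)

    # Oscillation: both metrics are active for a sustained part of the window.
    return (
        shrink_active / n >= min_active_fraction
        and expand_active / n >= min_active_fraction
    )
```

A single shrink followed by a single expand within a long window, as happens around a planned broker restart, stays below the fraction threshold and is not flagged. Only repeated activity in both series is.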

Recommended action:

Investigate follower replica health and network latency between brokers. Check the replica.lag.time.max.ms setting: it may need to be increased if the network is slow. Review broker resource utilization (CPU, memory, disk I/O). Verify that replica.fetch.max.bytes and replica.fetch.min.bytes are appropriately configured for your message sizes.
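As a rough aid for the configuration review, the sketch below compares the three broker settings named above against measurements you supply. Everything apart from the setting names is an assumption: the function review_replication_config, its parameters, and the rtt_multiplier heuristic are hypothetical placeholders, and the comparisons are starting points for investigation, not rules taken from Kafka itself.

```python
from typing import List, Mapping


def review_replication_config(
    config: Mapping[str, int],
    largest_batch_bytes: int,
    broker_rtt_ms: float,
    rtt_multiplier: int = 100,
) -> List[str]:
    """Return human-readable findings for replication-related settings.

    config must contain the broker's effective values for the three keys
    read below. largest_batch_bytes and broker_rtt_ms are measurements you
    provide; rtt_multiplier is an illustrative safety factor (assumption).
    """
    lag_time_ms = int(config["replica.lag.time.max.ms"])
    fetch_max = int(config["replica.fetch.max.bytes"])
    fetch_min = int(config["replica.fetch.min.bytes"])

    findings: List[str] = []

    # A lag window that is small relative to inter-broker latency leaves
    # followers little slack before they are dropped from the ISR.
    if lag_time_ms < rtt_multiplier * broker_rtt_ms:
        findings.append(
            "replica.lag.time.max.ms looks tight for the measured latency"
        )

    # Fetch size smaller than the largest batch slows follower catch-up.
    if fetch_max < largest_batch_bytes:
        findings.append(
            "replica.fetch.max.bytes is smaller than the largest observed batch"
        )

    # A minimum above the maximum is an inconsistent pair of settings.
    if fetch_min > fetch_max:
        findings.append(
            "replica.fetch.min.bytes exceeds replica.fetch.max.bytes"
        )

    return findings
```

An empty list means none of these heuristics fired. It does not rule out resource or network problems, which still need to be checked on the brokers themselves.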