Under-Replicated Partitions Appearing
Critical · Incident Response
Some partition replicas are not in sync with the leader, indicating replication lag and reduced data durability guarantees.
Prompt: “I'm seeing under-replicated partitions in my Kafka cluster monitoring. ISR count is dropping on several partitions. What's causing this and how urgent is it to fix?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating under-replicated partitions in Kafka, start by confirming the scope of the problem and identifying which brokers are consistently falling out of ISR. Then check for ISR oscillation patterns to distinguish between stable degradation and intermittent issues. Finally, investigate broker health, resource saturation, and network connectivity to identify the root cause — focusing on the brokers that are repeatedly removed from ISR across multiple partitions.
1. Confirm under-replication and identify affected brokers
Check `kafka.replication.under_replicated_partitions` and `kafka.server.ReplicaManager.UnderReplicatedPartitions` to confirm the problem exists (any value > 0 is concerning). Compare `kafka.partition.replicas_in_sync` against `kafka.partition.replicas` across partitions to see how many replicas have dropped out of ISR. Identify which specific brokers are consistently falling out of ISR across multiple partitions — this points to a broker-level issue rather than a partition-specific problem.
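The broker-identification logic in this step can be sketched as a small script. This is an illustrative sketch, not a real Kafka API: the `find_suspect_brokers` helper and the input shape (dicts mirroring `kafka.partition.replicas` and `kafka.partition.replicas_in_sync`) are assumptions for demonstration.

```python
from collections import Counter

def find_suspect_brokers(partitions):
    """Count how often each broker is missing from ISR across partitions.

    Each entry carries the partition's full replica set ("replicas") and
    its current in-sync subset ("isr"); both field names are hypothetical.
    """
    out_of_sync = Counter()
    for p in partitions:
        for broker in set(p["replicas"]) - set(p["isr"]):
            out_of_sync[broker] += 1
    # A broker out of ISR on more than one partition points to a
    # broker-level issue rather than a partition-specific one.
    return [b for b, n in out_of_sync.most_common() if n > 1]

partitions = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "orders", "partition": 1, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "events", "partition": 0, "replicas": [2, 3, 4], "isr": [3, 4]},
]
print(find_suspect_brokers(partitions))  # → [2]: broker 2 is the common culprit
```

Here broker 2 is flagged because it has dropped out of ISR on all three partitions, while no other broker recurs.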
2. Check ISR shrink/expand patterns for oscillation
Look at `kafka.replication.isr_shrinks_rate` and `kafka.replication.isr_expands_rate` to distinguish between stable degradation and oscillating instability. If both rates are non-zero and oscillating (shrinks >1/sec sustained with corresponding expands), you have intermittent broker or network issues causing replicas to repeatedly fall behind and catch up. If you only see shrinks without expands, the problem is stable degradation and getting worse — this is more urgent as it indicates sustained failure to replicate.
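The shrink/expand decision rule above can be expressed as a classifier over rate samples. A minimal sketch, assuming you have time series for `kafka.replication.isr_shrinks_rate` and `kafka.replication.isr_expands_rate` as plain lists; the function name and the exact thresholds are illustrative.

```python
def classify_isr_pattern(shrinks_per_sec, expands_per_sec):
    """Distinguish oscillating instability from stable degradation."""
    avg_shrinks = sum(shrinks_per_sec) / len(shrinks_per_sec)
    avg_expands = sum(expands_per_sec) / len(expands_per_sec)
    if avg_shrinks > 1 and avg_expands > 0:
        # Replicas repeatedly fall behind and catch up:
        # intermittent broker or network issues.
        return "oscillating"
    if avg_shrinks > 0 and avg_expands == 0:
        # Shrinks with no matching expands: sustained failure to
        # replicate, which is the more urgent pattern.
        return "stable degradation"
    return "healthy"

print(classify_isr_pattern([2.0, 1.5, 2.2], [1.8, 1.4, 2.0]))  # → oscillating
print(classify_isr_pattern([0.5, 0.7, 0.6], [0.0, 0.0, 0.0]))  # → stable degradation
```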
3. Investigate broker resource utilization
On the brokers you identified in step 1, check CPU, memory, and especially disk I/O metrics. Disk saturation is the most common cause of replication lag — if disk I/O wait times are high or disk utilization is near 100%, the broker can't keep up with replication traffic. Memory pressure causing excessive GC pauses (check JVM GC metrics) can also cause brokers to fall out of ISR. If one broker consistently shows resource exhaustion while others don't, that broker is your problem.
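The "one broker saturated while its peers are fine" check can be automated against host-level metrics. A sketch under assumed inputs: the per-broker stats dict and the threshold values are hypothetical, not Kafka defaults.

```python
def saturated_brokers(stats, io_wait_pct=20.0, disk_util_pct=95.0):
    """Flag brokers whose disk metrics suggest they cannot keep up
    with replication traffic. Thresholds here are illustrative."""
    flagged = []
    for broker_id, s in stats.items():
        if s["io_wait_pct"] >= io_wait_pct or s["disk_util_pct"] >= disk_util_pct:
            flagged.append(broker_id)
    return flagged

stats = {
    1: {"io_wait_pct": 2.0, "disk_util_pct": 40.0},
    2: {"io_wait_pct": 35.0, "disk_util_pct": 99.0},  # disk-saturated broker
    3: {"io_wait_pct": 3.0, "disk_util_pct": 55.0},
}
print(saturated_brokers(stats))  # → [2]
```

If the flagged broker matches the one repeatedly dropping out of ISR in step 1, you have converging evidence that it is the root cause.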
4. Check replication lag and replica fetch performance
Examine `kafka.replication.max_lag` to see how far follower replicas lag behind the leader. High replication lag (measured in messages or time) indicates the followers cannot keep up with the leader's write rate. Cross-reference this with the broker resource metrics from step 3: if lag is high but resources look fine, the issue is likely network-related or configuration-related (e.g., `replica.lag.time.max.ms` set too aggressively low for your network latency).
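The triage order described here (tolerance, then resources, then config, then network) can be captured as a small decision function. This is a hedged sketch: the function, its inputs, and the 10×RTT rule of thumb are assumptions for illustration, not Kafka behavior.

```python
def diagnose_lag(max_lag_ms, follower_saturated, replica_lag_time_max_ms, rtt_ms):
    """Rough triage for a high kafka.replication.max_lag reading."""
    if max_lag_ms < replica_lag_time_max_ms:
        # Followers are behind but still within the ISR eviction window.
        return "within tolerance"
    if follower_saturated:
        # Step 3 found resource exhaustion on the lagging follower.
        return "broker saturation"
    if replica_lag_time_max_ms < 10 * rtt_ms:
        # Illustrative heuristic: the eviction window leaves little
        # headroom over observed round-trip latency.
        return "config too aggressive"
    return "network"

print(diagnose_lag(45000, True, 30000, 5))       # → broker saturation
print(diagnose_lag(45000, False, 30000, 4000))   # → config too aggressive
```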
5. Verify network connectivity between brokers
Test network connectivity and latency between the leader brokers and the follower brokers that are falling out of ISR. Network partitions, packet loss, or high latency can prevent replicas from staying in sync even when broker resources are healthy. If you see intermittent connectivity issues or latency spikes correlating with ISR shrink events, network infrastructure is your root cause. Check for misconfigured security groups, network congestion, or cross-AZ latency if running in the cloud.
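Correlating latency spikes with ISR shrink events, as suggested above, can be done with a simple timestamp-window join. A minimal sketch assuming you have exported both event streams as lists of epoch timestamps; the helper name and window size are illustrative.

```python
def correlated_shrinks(shrink_events, latency_spikes, window_s=5):
    """Count ISR shrink events that occur within `window_s` seconds
    of a network latency spike; both inputs are epoch timestamps."""
    hits = 0
    for t in shrink_events:
        if any(abs(t - s) <= window_s for s in latency_spikes):
            hits += 1
    return hits

shrinks = [100, 220, 340]
spikes = [98, 223, 500]
print(correlated_shrinks(shrinks, spikes))  # → 2 of 3 shrinks follow a spike
```

A high hit ratio points at network infrastructure as the root cause; a low one pushes the investigation back toward broker resources.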
6. Check for recent operational changes
Review recent cluster operations — specifically broker removals, scaling events, or upgrades. If you're running Confluent Platform pre-7.0 and recently removed brokers, this can cause under-replicated partitions because the removal process doesn't gracefully migrate data first. Also check if the number of brokers (`kafka.brokers`) has decreased unexpectedly, indicating a broker crash or removal that wasn't properly handled. In these cases, you may need to manually reassign partitions to restore replication.
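When manual reassignment is needed, `kafka-reassign-partitions.sh` consumes a JSON plan. The sketch below builds such a plan by swapping a removed broker for a replacement in each affected replica list; the `reassignment_plan` helper and the example topic are hypothetical, but the `{"version": 1, "partitions": [...]}` shape matches the tool's input format.

```python
import json

def reassignment_plan(partitions, removed_broker, replacement_broker):
    """Build a reassignment JSON for kafka-reassign-partitions.sh that
    replaces `removed_broker` with `replacement_broker` everywhere."""
    plan = {"version": 1, "partitions": []}
    for p in partitions:
        replicas = [replacement_broker if b == removed_broker else b
                    for b in p["replicas"]]
        plan["partitions"].append(
            {"topic": p["topic"], "partition": p["partition"], "replicas": replicas})
    return json.dumps(plan)

print(reassignment_plan(
    [{"topic": "orders", "partition": 0, "replicas": [1, 2, 3]}],
    removed_broker=2, replacement_broker=4))
```

Always dry-run the resulting file (the tool's `--generate`/`--verify` workflow) before executing it against a degraded cluster.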
Related Insights
Under-Replicated Partitions Signal Broker Degradation (critical)
When ISR count drops below the configured replication factor, partitions become under-replicated, indicating broker health issues, network problems, or disk saturation that compromise data durability.

Under-replicated partitions indicate data loss risk on broker failure (critical)

ISR Shrink/Expand Oscillation Reveals Replica Instability (warning)
Frequent ISR shrink and expand events indicate replicas repeatedly falling behind and catching up, suggesting intermittent broker issues, network instability, or insufficient resources for the replication workload.

ISR shrinkage progresses silently until data loss occurs (critical)

Broker removal in pre-7.0 versions causes under-replicated partitions (critical)
Relevant Metrics
kafka.brokers
kafka.replication.under_replicated_partitions
kafka.server.ReplicaManager.UnderReplicatedPartitions
kafka.partition.replicas_in_sync
kafka.partition.replicas
kafka.partition.under_replicated
kafka.replication.isr_shrinks_rate
kafka.server.ReplicaManager.IsrShrinksPerSec
kafka.replication.max_lag
kafka.replication.isr_expands.rate
kafka.replication.isr_expands_rate

Monitoring Interfaces
Kafka Native