Under-Replicated Partitions Appearing
Critical · Incident Response
Some partition replicas are not in sync with the leader, indicating replication lag and reduced data durability guarantees.
Prompt: “I'm seeing under-replicated partitions in my Kafka cluster monitoring. ISR count is dropping on several partitions. What's causing this and how urgent is it to fix?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating under-replicated partitions in Kafka, start by confirming the scope of the problem and identifying which brokers are consistently falling out of ISR. Then check for ISR oscillation patterns to distinguish between stable degradation and intermittent issues. Finally, investigate broker health, resource saturation, and network connectivity to identify the root cause — focusing on the brokers that are repeatedly removed from ISR across multiple partitions.
1. Confirm under-replication and identify affected brokers
Check `kafka.replication.under_replicated_partitions` and `kafka.server.ReplicaManager.UnderReplicatedPartitions` to confirm the problem exists (any value > 0 is concerning). Compare `kafka.partition.replicas_in_sync` against `kafka.partition.replicas` across partitions to see how many replicas have dropped out of ISR. Identify which specific brokers are consistently falling out of ISR across multiple partitions — this points to a broker-level issue rather than a partition-specific problem.
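The broker-identification logic in this step can be sketched as a small script. This is an illustrative sketch, not a real Kafka API: the `find_suspect_brokers` helper and the input shape (dicts mirroring `kafka.partition.replicas` and `kafka.partition.replicas_in_sync`) are assumptions for demonstration.

```python
from collections import Counter

def find_suspect_brokers(partitions):
    """Count how often each broker is missing from ISR across partitions.

    Each entry carries the partition's full replica set ("replicas") and
    its current in-sync subset ("isr"); both field names are hypothetical.
    """
    out_of_sync = Counter()
    for p in partitions:
        for broker in set(p["replicas"]) - set(p["isr"]):
            out_of_sync[broker] += 1
    # A broker out of ISR on more than one partition points to a
    # broker-level issue rather than a partition-specific one.
    return [b for b, n in out_of_sync.most_common() if n > 1]

partitions = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "orders", "partition": 1, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "events", "partition": 0, "replicas": [2, 3, 4], "isr": [3, 4]},
]
print(find_suspect_brokers(partitions))  # → [2]: broker 2 is the common culprit
```

Here broker 2 is flagged because it has dropped out of ISR on all three partitions, while no other broker recurs.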
2. Check ISR shrink/expand patterns for oscillation
Look at `kafka.replication.isr_shrinks_rate` and `kafka.replication.isr_expands_rate` to distinguish between stable degradation and oscillating instability. If both rates are non-zero and oscillating (shrinks >1/sec sustained with corresponding expands), you have intermittent broker or network issues causing replicas to repeatedly fall behind and catch up. If you only see shrinks without expands, the problem is stable degradation and getting worse — this is more urgent as it indicates sustained failure to replicate.
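The shrink/expand decision rule above can be expressed as a classifier over rate samples. A minimal sketch, assuming you have time series for `kafka.replication.isr_shrinks_rate` and `kafka.replication.isr_expands_rate` as plain lists; the function name and the exact thresholds are illustrative.

```python
def classify_isr_pattern(shrinks_per_sec, expands_per_sec):
    """Distinguish oscillating instability from stable degradation."""
    avg_shrinks = sum(shrinks_per_sec) / len(shrinks_per_sec)
    avg_expands = sum(expands_per_sec) / len(expands_per_sec)
    if avg_shrinks > 1 and avg_expands > 0:
        # Replicas repeatedly fall behind and catch up:
        # intermittent broker or network issues.
        return "oscillating"
    if avg_shrinks > 0 and avg_expands == 0:
        # Shrinks with no matching expands: sustained failure to
        # replicate, which is the more urgent pattern.
        return "stable degradation"
    return "healthy"

print(classify_isr_pattern([2.0, 1.5, 2.2], [1.8, 1.4, 2.0]))  # → oscillating
print(classify_isr_pattern([0.5, 0.7, 0.6], [0.0, 0.0, 0.0]))  # → stable degradation
```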
3. Investigate broker resource utilization
On the brokers you identified in step 1, check CPU, memory, and especially disk I/O metrics. Disk saturation is the most common cause of replication lag — if disk I/O wait times are high or disk utilization is near 100%, the broker can't keep up with replication traffic. Memory pressure causing excessive GC pauses (check JVM GC metrics) can also cause brokers to fall out of ISR. If one broker consistently shows resource exhaustion while others don't, that broker is your problem.
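The "one broker saturated while its peers are fine" check can be automated against host-level metrics. A sketch under assumed inputs: the per-broker stats dict and the threshold values are hypothetical, not Kafka defaults.

```python
def saturated_brokers(stats, io_wait_pct=20.0, disk_util_pct=95.0):
    """Flag brokers whose disk metrics suggest they cannot keep up
    with replication traffic. Thresholds here are illustrative."""
    flagged = []
    for broker_id, s in stats.items():
        if s["io_wait_pct"] >= io_wait_pct or s["disk_util_pct"] >= disk_util_pct:
            flagged.append(broker_id)
    return flagged

stats = {
    1: {"io_wait_pct": 2.0, "disk_util_pct": 40.0},
    2: {"io_wait_pct": 35.0, "disk_util_pct": 99.0},  # disk-saturated broker
    3: {"io_wait_pct": 3.0, "disk_util_pct": 55.0},
}
print(saturated_brokers(stats))  # → [2]
```

If the flagged broker matches the one repeatedly dropping out of ISR in step 1, you have converging evidence that it is the root cause.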
4. Check replication lag and replica fetch performance
Examine `kafka.replication.max_lag` to see how far follower replicas lag behind the leader. High replication lag (measured in messages or time) indicates the followers cannot keep up with the leader's write rate. Cross-reference this with the broker resource metrics from step 3: if lag is high but resources look fine, the issue is likely network-related or configuration-related (e.g., `replica.lag.time.max.ms` set too aggressively low for your network latency).
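The triage order described here (tolerance, then resources, then config, then network) can be captured as a small decision function. This is a hedged sketch: the function, its inputs, and the 10×RTT rule of thumb are assumptions for illustration, not Kafka behavior.

```python
def diagnose_lag(max_lag_ms, follower_saturated, replica_lag_time_max_ms, rtt_ms):
    """Rough triage for a high kafka.replication.max_lag reading."""
    if max_lag_ms < replica_lag_time_max_ms:
        # Followers are behind but still within the ISR eviction window.
        return "within tolerance"
    if follower_saturated:
        # Step 3 found resource exhaustion on the lagging follower.
        return "broker saturation"
    if replica_lag_time_max_ms < 10 * rtt_ms:
        # Illustrative heuristic: the eviction window leaves little
        # headroom over observed round-trip latency.
        return "config too aggressive"
    return "network"

print(diagnose_lag(45000, True, 30000, 5))       # → broker saturation
print(diagnose_lag(45000, False, 30000, 4000))   # → config too aggressive
```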
5. Verify network connectivity between brokers
Test network connectivity and latency between the leader brokers and the follower brokers that are falling out of ISR. Network partitions, packet loss, or high latency can prevent replicas from staying in sync even when broker resources are healthy. If you see intermittent connectivity issues or latency spikes correlating with ISR shrink events, network infrastructure is your root cause. Check for misconfigured security groups, network congestion, or cross-AZ latency if running in the cloud.
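Correlating latency spikes with ISR shrink events, as suggested above, can be done with a simple timestamp-window join. A minimal sketch assuming you have exported both event streams as lists of epoch timestamps; the helper name and window size are illustrative.

```python
def correlated_shrinks(shrink_events, latency_spikes, window_s=5):
    """Count ISR shrink events that occur within `window_s` seconds
    of a network latency spike; both inputs are epoch timestamps."""
    hits = 0
    for t in shrink_events:
        if any(abs(t - s) <= window_s for s in latency_spikes):
            hits += 1
    return hits

shrinks = [100, 220, 340]
spikes = [98, 223, 500]
print(correlated_shrinks(shrinks, spikes))  # → 2 of 3 shrinks follow a spike
```

A high hit ratio points at network infrastructure as the root cause; a low one pushes the investigation back toward broker resources.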
6. Check for recent operational changes
Review recent cluster operations — specifically broker removals, scaling events, or upgrades. If you're running Confluent Platform pre-7.0 and recently removed brokers, this can cause under-replicated partitions because the removal process doesn't gracefully migrate data first. Also check if the number of brokers (`kafka.brokers`) has decreased unexpectedly, indicating a broker crash or removal that wasn't properly handled. In these cases, you may need to manually reassign partitions to restore replication.
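When manual reassignment is needed, `kafka-reassign-partitions.sh` consumes a JSON plan. The sketch below builds such a plan by swapping a removed broker for a replacement in each affected replica list; the `reassignment_plan` helper and the example topic are hypothetical, but the `{"version": 1, "partitions": [...]}` shape matches the tool's input format.

```python
import json

def reassignment_plan(partitions, removed_broker, replacement_broker):
    """Build a reassignment JSON for kafka-reassign-partitions.sh that
    replaces `removed_broker` with `replacement_broker` everywhere."""
    plan = {"version": 1, "partitions": []}
    for p in partitions:
        replicas = [replacement_broker if b == removed_broker else b
                    for b in p["replicas"]]
        plan["partitions"].append(
            {"topic": p["topic"], "partition": p["partition"], "replicas": replicas})
    return json.dumps(plan)

print(reassignment_plan(
    [{"topic": "orders", "partition": 0, "replicas": [1, 2, 3]}],
    removed_broker=2, replacement_broker=4))
```

Always dry-run the resulting file (the tool's `--generate`/`--verify` workflow) before executing it against a degraded cluster.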
Related Insights
Under-Replicated Partitions Signal Broker Degradation (critical)
When ISR count drops below the configured replication factor, partitions become under-replicated, indicating broker health issues, network problems, or disk saturation that compromise data durability.

Under-replicated partitions indicate data loss risk on broker failure (critical)

ISR Shrink/Expand Oscillation Reveals Replica Instability (warning)
Frequent ISR shrink and expand events indicate replicas repeatedly falling behind and catching up, suggesting intermittent broker issues, network instability, or insufficient resources for the replication workload.

ISR shrinkage progresses silently until data loss occurs (critical)

Broker removal in pre-7.0 versions causes under-replicated partitions (critical)
Relevant Metrics
kafka.brokers
kafka.replication.under_replicated_partitions
kafka.server.ReplicaManager.UnderReplicatedPartitions
kafka.partition.replicas_in_sync
kafka.partition.replicas
kafka.partition.under_replicated
kafka.replication.isr_shrinks_rate
kafka.server.ReplicaManager.IsrShrinksPerSec
kafka.replication.max_lag
kafka.replication.isr_expands.rate
kafka.replication.isr_expands_rate

Monitoring Interfaces
Kafka Native