Right-Sizing Kafka Cluster on AWS MSK

Capacity Planning

Determining whether to scale a Kafka cluster up, down, or horizontally based on current workload patterns and resource utilization.

Prompt: My AWS MSK cluster is running on m5.large brokers with 60% CPU and 40% disk usage. Based on my current throughput and partition count, should I scale up to larger instances, add more brokers, or am I over-provisioned?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When right-sizing a Kafka cluster, start by checking if your brokers are actually working hard or sitting idle, then verify load is evenly distributed before making scaling decisions. The key is distinguishing between true capacity constraints and configuration issues like imbalanced partitions or leadership, which scaling won't fix.

1. Check if brokers are actually busy or idle
Start with `kafka.request.handler_idle_percent` and `kafka.network.processor_idle_percent` to see if your brokers are truly working. If these metrics show >60-70% idle time while you're at 60% CPU, your brokers are spending most of their time waiting for work, which strongly suggests over-provisioning. Healthy busy clusters typically show <40% idle on these metrics during normal operation.
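As a sketch, the idle-time heuristic above can be expressed as a simple check (the function name and exact thresholds are illustrative, not part of any Kafka API):

```python
def looks_over_provisioned(handler_idle_pct: float, processor_idle_pct: float) -> bool:
    """Flag a broker as a scale-down candidate when its request-handler and
    network-processor threads spend most of their time waiting for work.
    Per the playbook: >60% idle on both metrics suggests over-provisioning."""
    return handler_idle_pct > 60 and processor_idle_pct > 60

# A broker at 75% handler idle / 80% processor idle is mostly waiting:
print(looks_over_provisioned(75, 80))   # → True
# A busy broker (<40% idle on these metrics) is not:
print(looks_over_provisioned(35, 30))   # → False
```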
2. Verify load is evenly distributed across brokers
Compare `kafka.net.bytes_in.rate` and `kafka.net.bytes_out.rate` across all brokers to detect imbalance. The `broker-bytes-imbalance-suggests-uneven-load` insight warns that if any broker exceeds the cluster average by >50% sustained over 15 minutes, you have a distribution problem, not a capacity problem. Scaling won't help until you rebalance partitions - you'll just add more underutilized brokers.
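A minimal sketch of the 50%-over-average rule, assuming you have already pulled sustained per-broker byte rates from your monitoring system (broker names and sample rates are hypothetical):

```python
def imbalanced_brokers(bytes_rate_by_broker: dict[str, float],
                       threshold: float = 0.5) -> list[str]:
    """Return brokers whose sustained byte rate exceeds the cluster
    average by more than `threshold` (50%, per the insight above)."""
    avg = sum(bytes_rate_by_broker.values()) / len(bytes_rate_by_broker)
    return [b for b, rate in bytes_rate_by_broker.items()
            if rate > avg * (1 + threshold)]

rates_mb_s = {"broker-1": 120, "broker-2": 110, "broker-3": 400}  # MB/s, illustrative
print(imbalanced_brokers(rates_mb_s))  # → ['broker-3']
```

A non-empty result points to a partition-distribution problem; rebalance before you scale.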
3. Check partition and leadership distribution balance
Compare `kafka.broker.partition_count` and `kafka.broker.leader_count` across all brokers - they should be within 10-15% of each other. The `preferred-leader-imbalance-reduces-efficiency` insight notes that if >20% of partitions aren't using their preferred leader, you're wasting capacity and creating hotspots. Run preferred leader election and enable auto-rebalance before considering scaling.
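The "within 10-15% of each other" check can be sketched as a deviation-from-mean test over per-broker counts (applies equally to partition counts and leader counts; numbers below are illustrative):

```python
def within_balance(counts: list[int], tolerance: float = 0.15) -> bool:
    """True when every broker's partition (or leader) count is within
    `tolerance` (15%) of the cluster mean."""
    mean = sum(counts) / len(counts)
    return all(abs(c - mean) / mean <= tolerance for c in counts)

print(within_balance([300, 310, 305]))  # evenly spread leaders → True
print(within_balance([100, 100, 400]))  # one hot broker → False
```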
4. Analyze throughput relative to broker network capacity
m5.large instances provide up to 10 Gbps of burst network bandwidth (~1,250 MB/s); sustained baseline bandwidth is lower, so treat these as ceiling figures. Sum your peak `kafka.net.bytes_in.rate` and `kafka.net.bytes_out.rate` per broker. If you're consistently above 750-875 MB/s (60-70% of burst capacity), you need to scale out or up. Below 500 MB/s (<40%) suggests over-provisioning, especially combined with low CPU.
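The utilization bands above can be sketched as a small verdict function (the NIC capacity constant and sample rates are illustrative assumptions for m5.large burst bandwidth):

```python
NIC_CAPACITY_MB_S = 1250  # ~10 Gbps burst for m5.large (assumed ceiling)

def network_verdict(bytes_in_mb_s: float, bytes_out_mb_s: float) -> str:
    """Classify a broker by combined in+out throughput vs. NIC capacity."""
    util = (bytes_in_mb_s + bytes_out_mb_s) / NIC_CAPACITY_MB_S
    if util >= 0.6:          # 60-70%+ sustained: capacity constrained
        return "scale out or up"
    if util < 0.4:           # <40%: likely over-provisioned
        return "possibly over-provisioned"
    return "healthy"

print(network_verdict(200, 150))  # 350/1250 = 28% → 'possibly over-provisioned'
```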
5. Calculate partition density per broker
Calculate total partition replicas per broker: (sum of all `kafka.topic.partitions` × `kafka.partition.replicas`) / `kafka.broker.count`. Best practice is <4000 partition replicas per broker for stability. High density (>3000) limits horizontal scaling and suggests you need larger broker instances. Very low density (<1000) with low CPU and high idle percentages confirms over-provisioning.
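The density formula above, as a worked example (topic names, partition counts, and replication factor are hypothetical):

```python
def partition_density(partitions_per_topic: dict[str, int],
                      replication_factor: int,
                      broker_count: int) -> float:
    """Total partition replicas per broker:
    (sum of topic partitions × replication factor) / broker count."""
    total_replicas = sum(partitions_per_topic.values()) * replication_factor
    return total_replicas / broker_count

topics = {"orders": 50, "payments": 30, "logs": 120}  # illustrative
print(partition_density(topics, replication_factor=3, broker_count=3))
# (200 × 3) / 3 = 200.0 replicas per broker — well under the 4000 ceiling
```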
6. Look for resource exhaustion patterns during peak hours
The `broker-resource-exhaustion` insight triggers at CPU >80%, Memory >90%, or Disk I/O wait >20%. At 60% CPU and 40% disk during normal operation, you're not hitting exhaustion thresholds. Check your peak hour patterns - if you rarely exceed 70-75% on any resource even during bursts, you're likely over-provisioned for your current workload and could scale down.
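A sketch of the exhaustion thresholds quoted above (the function is illustrative; the thresholds mirror the insight's stated trigger points):

```python
def exhausted(cpu_pct: float, mem_pct: float, io_wait_pct: float) -> bool:
    """Trigger when any resource crosses the insight's thresholds:
    CPU >80%, memory >90%, or disk I/O wait >20%."""
    return cpu_pct > 80 or mem_pct > 90 or io_wait_pct > 20

print(exhausted(60, 70, 5))   # the cluster in the prompt → False
print(exhausted(85, 70, 5))   # CPU over threshold → True
```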
7. Review message rate patterns for growth trajectory
Check `kafka.messages_in.rate` trends over the past 30-90 days. If your message rate is flat or declining while resources remain at 60% CPU, you're likely over-provisioned. If message rate is growing 20%+ month-over-month and you're already at 60% CPU with low idle time, you should maintain current sizing for headroom or plan horizontal scaling by adding brokers to distribute the growing load.
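The month-over-month growth check can be sketched as follows (the sampled message rates are illustrative, not real data):

```python
def monthly_growth(rates: list[float]) -> float:
    """Average month-over-month growth of message rate, as a fraction."""
    changes = [(b - a) / a for a, b in zip(rates, rates[1:])]
    return sum(changes) / len(changes)

# msgs/s sampled monthly over 4 months (illustrative)
growth = monthly_growth([10_000, 12_100, 14_500, 17_500])
print(f"{growth:.0%}")  # → 21% — at 20%+ growth, keep headroom or plan scale-out
```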


Monitoring Interfaces

Kafka Datadog
Kafka Prometheus
Kafka Native