Right-Sizing Kafka Cluster on AWS MSK

Capacity Planning

Determining whether to scale a Kafka cluster up, down, or horizontally based on current workload patterns and resource utilization.

Prompt: My AWS MSK cluster is running on m5.large brokers with 60% CPU and 40% disk usage. Based on my current throughput and partition count, should I scale up to larger instances, add more brokers, or am I over-provisioned?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When right-sizing a Kafka cluster, start by checking if your brokers are actually working hard or sitting idle, then verify load is evenly distributed before making scaling decisions. The key is distinguishing between true capacity constraints and configuration issues like imbalanced partitions or leadership, which scaling won't fix.

1. Check if brokers are actually busy or idle
Start with `kafka.request.handler_idle_percent` and `kafka.network.processor_idle_percent` to see if your brokers are truly working. If these metrics show >60-70% idle time while you're at 60% CPU, your brokers are spending most of their time waiting for work, which strongly suggests over-provisioning. Healthy busy clusters typically show <40% idle on these metrics during normal operation.
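As a sketch, the idle-time heuristic above can be expressed as a simple check (the function name and exact thresholds are illustrative, not part of any Kafka API):

```python
def looks_over_provisioned(handler_idle_pct: float, processor_idle_pct: float) -> bool:
    """Flag a broker as a scale-down candidate when its request-handler and
    network-processor threads spend most of their time waiting for work.
    Per the playbook: >60% idle on both metrics suggests over-provisioning."""
    return handler_idle_pct > 60 and processor_idle_pct > 60

# A broker at 75% handler idle / 80% processor idle is mostly waiting:
print(looks_over_provisioned(75, 80))   # → True
# A busy broker (<40% idle on these metrics) is not:
print(looks_over_provisioned(35, 30))   # → False
```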
2. Verify load is evenly distributed across brokers
Compare `kafka.net.bytes_in.rate` and `kafka.net.bytes_out.rate` across all brokers to detect imbalance. The `broker-bytes-imbalance-suggests-uneven-load` insight warns that if any broker exceeds the cluster average by >50% sustained over 15 minutes, you have a distribution problem, not a capacity problem. Scaling won't help until you rebalance partitions - you'll just add more underutilized brokers.
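A minimal sketch of the 50%-over-average rule, assuming you have already pulled sustained per-broker byte rates from your monitoring system (broker names and sample rates are hypothetical):

```python
def imbalanced_brokers(bytes_rate_by_broker: dict[str, float],
                       threshold: float = 0.5) -> list[str]:
    """Return brokers whose sustained byte rate exceeds the cluster
    average by more than `threshold` (50%, per the insight above)."""
    avg = sum(bytes_rate_by_broker.values()) / len(bytes_rate_by_broker)
    return [b for b, rate in bytes_rate_by_broker.items()
            if rate > avg * (1 + threshold)]

rates_mb_s = {"broker-1": 120, "broker-2": 110, "broker-3": 400}  # MB/s, illustrative
print(imbalanced_brokers(rates_mb_s))  # → ['broker-3']
```

A non-empty result points to a partition-distribution problem; rebalance before you scale.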
3. Check partition and leadership distribution balance
Compare `kafka.broker.partition_count` and `kafka.broker.leader_count` across all brokers - they should be within 10-15% of each other. The `preferred-leader-imbalance-reduces-efficiency` insight notes that if >20% of partitions aren't using their preferred leader, you're wasting capacity and creating hotspots. Run preferred leader election and enable auto-rebalance before considering scaling.
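The "within 10-15% of each other" check can be sketched as a deviation-from-mean test over per-broker counts (applies equally to partition counts and leader counts; numbers below are illustrative):

```python
def within_balance(counts: list[int], tolerance: float = 0.15) -> bool:
    """True when every broker's partition (or leader) count is within
    `tolerance` (15%) of the cluster mean."""
    mean = sum(counts) / len(counts)
    return all(abs(c - mean) / mean <= tolerance for c in counts)

print(within_balance([300, 310, 305]))  # evenly spread leaders → True
print(within_balance([100, 100, 400]))  # one hot broker → False
```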
4. Analyze throughput relative to broker network capacity
m5.large instances provide up to 10 Gbps of burst network bandwidth (~1,250 MB/s); sustained baseline bandwidth is lower, so treat these as ceiling figures. Sum your peak `kafka.net.bytes_in.rate` and `kafka.net.bytes_out.rate` per broker. If you're consistently above 750-875 MB/s (60-70% of burst capacity), you need to scale out or up. Below 500 MB/s (<40%) suggests over-provisioning, especially combined with low CPU.
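The utilization bands above can be sketched as a small verdict function (the NIC capacity constant and sample rates are illustrative assumptions for m5.large burst bandwidth):

```python
NIC_CAPACITY_MB_S = 1250  # ~10 Gbps burst for m5.large (assumed ceiling)

def network_verdict(bytes_in_mb_s: float, bytes_out_mb_s: float) -> str:
    """Classify a broker by combined in+out throughput vs. NIC capacity."""
    util = (bytes_in_mb_s + bytes_out_mb_s) / NIC_CAPACITY_MB_S
    if util >= 0.6:          # 60-70%+ sustained: capacity constrained
        return "scale out or up"
    if util < 0.4:           # <40%: likely over-provisioned
        return "possibly over-provisioned"
    return "healthy"

print(network_verdict(200, 150))  # 350/1250 = 28% → 'possibly over-provisioned'
```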
5. Calculate partition density per broker
Calculate total partition replicas per broker: (sum of all `kafka.topic.partitions` × `kafka.partition.replicas`) / `kafka.broker.count`. Best practice is <4000 partition replicas per broker for stability. High density (>3000) limits horizontal scaling and suggests you need larger broker instances. Very low density (<1000) with low CPU and high idle percentages confirms over-provisioning.
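The density formula above, as a worked example (topic names, partition counts, and replication factor are hypothetical):

```python
def partition_density(partitions_per_topic: dict[str, int],
                      replication_factor: int,
                      broker_count: int) -> float:
    """Total partition replicas per broker:
    (sum of topic partitions × replication factor) / broker count."""
    total_replicas = sum(partitions_per_topic.values()) * replication_factor
    return total_replicas / broker_count

topics = {"orders": 50, "payments": 30, "logs": 120}  # illustrative
print(partition_density(topics, replication_factor=3, broker_count=3))
# (200 × 3) / 3 = 200.0 replicas per broker — well under the 4000 ceiling
```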
6. Look for resource exhaustion patterns during peak hours
The `broker-resource-exhaustion` insight triggers at CPU >80%, Memory >90%, or Disk I/O wait >20%. At 60% CPU and 40% disk during normal operation, you're not hitting exhaustion thresholds. Check your peak hour patterns - if you rarely exceed 70-75% on any resource even during bursts, you're likely over-provisioned for your current workload and could scale down.
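A sketch of the exhaustion thresholds quoted above (the function is illustrative; the thresholds mirror the insight's stated trigger points):

```python
def exhausted(cpu_pct: float, mem_pct: float, io_wait_pct: float) -> bool:
    """Trigger when any resource crosses the insight's thresholds:
    CPU >80%, memory >90%, or disk I/O wait >20%."""
    return cpu_pct > 80 or mem_pct > 90 or io_wait_pct > 20

print(exhausted(60, 70, 5))   # the cluster in the prompt → False
print(exhausted(85, 70, 5))   # CPU over threshold → True
```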
7. Review message rate patterns for growth trajectory
Check `kafka.messages_in.rate` trends over the past 30-90 days. If your message rate is flat or declining while resources remain at 60% CPU, you're likely over-provisioned. If message rate is growing 20%+ month-over-month and you're already at 60% CPU with low idle time, you should maintain current sizing for headroom or plan horizontal scaling by adding brokers to distribute the growing load.
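The month-over-month growth check can be sketched as follows (the sampled message rates are illustrative, not real data):

```python
def monthly_growth(rates: list[float]) -> float:
    """Average month-over-month growth of message rate, as a fraction."""
    changes = [(b - a) / a for a, b in zip(rates, rates[1:])]
    return sum(changes) / len(changes)

# msgs/s sampled monthly over 4 months (illustrative)
growth = monthly_growth([10_000, 12_100, 14_500, 17_500])
print(f"{growth:.0%}")  # → 21% — at 20%+ growth, keep headroom or plan scale-out
```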


Monitoring Interfaces

Kafka Datadog
Kafka Prometheus
Kafka Native