Partition Count Optimization for Throughput
Proactive Health
Determining optimal partition count to balance parallelism, consumer scaling, and operational overhead.
Prompt: “I have a Kafka topic with 6 partitions but need to scale to handle 5x more throughput. Should I increase partition count to 30 or 60? What's the impact on broker performance and recovery time?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When scaling Kafka partitions for throughput, start by validating your consumer group can actually scale, then measure current per-partition throughput to calculate the optimal target. Check for existing partition skew and broker capacity constraints before deciding between 30 or 60 partitions, as the wrong choice can hurt recovery time without gaining throughput.
1. Verify consumer group can scale with more partitions
Check `kafka.consumer_group.members` against your current 6 partitions. Within a consumer group, each partition is assigned to at most one consumer, so parallelism is capped at min(partitions, consumers). If you only have 2-3 consumers and can't scale them out, adding partitions won't help; to get the full benefit of 30 partitions you need to be able to run up to 30 consumers, otherwise you're just creating overhead. This is the most common mistake I see when scaling partitions.
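The parallelism cap can be sketched in a couple of lines (the member counts here are hypothetical; in practice, read `kafka.consumer_group.members` from your monitoring system):

```python
def effective_parallelism(partitions: int, consumers: int) -> int:
    """Each partition is consumed by at most one member of a group,
    so consume parallelism is capped by whichever is smaller."""
    return min(partitions, consumers)

# Hypothetical: 30 partitions but only 3 consumers in the group.
print(effective_parallelism(30, 3))   # 3 -- extra partitions sit idle
print(effective_parallelism(30, 30))  # 30 -- full benefit needs 30 consumers
```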
2. Calculate current per-partition throughput baseline
Measure `kafka.topic.message_rate` and `kafka.topic.bytes_in_rate` for your topic, then divide by 6 to get per-partition throughput. For example, if you're doing 30K msg/sec total, that's 5K msg/sec per partition. To handle 5x load (150K msg/sec), you'd need 30 partitions if per-partition throughput stays constant. This math tells you whether 30 or 60 is the right target.
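The sizing math is a one-liner; a minimal sketch using the example's hypothetical rates (30K msg/sec today, 5x target), assuming per-partition throughput stays constant as you scale:

```python
import math

def target_partitions(current_rate: float, current_partitions: int,
                      target_rate: float) -> int:
    """Scale partition count linearly from the measured per-partition rate."""
    per_partition = current_rate / current_partitions
    return math.ceil(target_rate / per_partition)

# 30K msg/sec over 6 partitions = 5K msg/sec per partition;
# 150K msg/sec target therefore needs 30 partitions.
print(target_partitions(30_000, 6, 150_000))  # 30
```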
3. Check for partition key skew that would waste new partitions
Before adding partitions, look for the `hot-partition-creates-uneven-consumer-lag` pattern by comparing lag across your current 6 partitions. If one partition consistently shows 2x or more the lag of the others, you have key skew; adding more partitions won't help, because the hot key will still bottleneck a single partition. Fix your partition key distribution first, or you'll waste the new partitions.
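A simple skew check compares each partition's lag against the topic median; the lag snapshot below is hypothetical, and the 2x factor matches the rule of thumb above:

```python
from statistics import median

def skewed_partitions(lag_by_partition: dict, factor: float = 2.0) -> list:
    """Flag partitions whose lag is >= factor x the median lag for the topic."""
    med = median(lag_by_partition.values())
    return [p for p, lag in lag_by_partition.items()
            if med > 0 and lag >= factor * med]

# Hypothetical lag snapshot across the 6 existing partitions.
lags = {0: 1200, 1: 1100, 2: 9800, 3: 1300, 4: 1000, 5: 1150}
print(skewed_partitions(lags))  # [2] -- a hot key is pinning partition 2
```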
4. Assess broker capacity for total partition overhead
Check `kafka.broker.partition_count` across all brokers in your cluster. Each partition replica costs roughly 1 MB of broker memory plus open file handles. If you go to 60 partitions with replication factor 3, that's 180 partition replicas spread across your cluster. Brokers typically handle 2K-4K partitions comfortably; beyond that you see degraded performance and longer recovery times.
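The replica overhead per broker is worth computing explicitly before you commit; a minimal sketch assuming a hypothetical 3-broker cluster and even replica placement:

```python
def replicas_per_broker(partitions: int, replication_factor: int,
                        brokers: int) -> float:
    """Total partition replicas, spread (ideally evenly) across the brokers."""
    return partitions * replication_factor / brokers

# 60 partitions x RF 3 = 180 replicas; on 3 brokers that's 60 each,
# which this topic adds on top of every other topic's replicas.
print(replicas_per_broker(60, 3, 3))  # 60.0
print(replicas_per_broker(30, 3, 3))  # 30.0
```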
5. Calculate recovery time impact before choosing 30 vs 60
Recovery time during broker failure is roughly linear with partition count—60 partitions will take 2x longer to replicate than 30. Check your current `kafka.partition.replicas` (replication factor) and multiply by your target partition count. If you have 3x replication and choose 60 partitions, that's 180 replicas that need rebalancing during failure. If your SLA requires <5min recovery, this might push you toward 30 partitions instead.
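Under the linear assumption above, you can put rough numbers on the 30-vs-60 trade-off; the per-replica cost here is a hypothetical placeholder you'd calibrate from an observed broker-failure drill:

```python
def estimated_recovery_s(partitions: int, replication_factor: int,
                         seconds_per_replica: float) -> float:
    """Assume recovery scales roughly linearly with the replica count
    that must be rebalanced after a broker failure."""
    return partitions * replication_factor * seconds_per_replica

# Hypothetical: ~1s of rebalancing work per replica (measure your own baseline).
print(estimated_recovery_s(30, 3, 1.0))  # 90.0
print(estimated_recovery_s(60, 3, 1.0))  # 180.0 -- 2x the 30-partition case
```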
6. Verify current cluster balance before scaling
Check for `broker-bytes-imbalance-suggests-uneven-load` by comparing `kafka.network.bytes_in_rate` across brokers—if any broker exceeds average by >50%, you have a distribution problem. Also check `preferred-leader-imbalance-reduces-efficiency` using `kafka.broker.leader_count`. Fix these imbalances first (run preferred leader election, rebalance partitions) or your new partitions will land unevenly and create new bottlenecks.
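The >50% bytes-in check can be sketched as follows; the per-broker rates are hypothetical stand-ins for a `kafka.network.bytes_in_rate` snapshot:

```python
def overloaded_brokers(bytes_in_by_broker: dict,
                       threshold: float = 0.5) -> list:
    """Flag brokers whose bytes-in rate exceeds the cluster average
    by more than the given fraction (default 50%)."""
    avg = sum(bytes_in_by_broker.values()) / len(bytes_in_by_broker)
    return [b for b, rate in bytes_in_by_broker.items()
            if rate > avg * (1 + threshold)]

# Hypothetical per-broker bytes-in snapshot (MB/s).
rates = {"broker-1": 40.0, "broker-2": 42.0, "broker-3": 130.0}
print(overloaded_brokers(rates))  # ['broker-3']
```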
7. Add 20-30% headroom to calculated partition count
Based on your throughput math from step 2, round up your partition count to allow headroom. If math says you need 30 partitions for 5x load, I'd go with 36-40 to handle traffic spikes and uneven load. Don't over-provision to 60 just because it sounds safe—you pay the cost in recovery time, memory overhead, and operational complexity. You can always add more partitions later, but you can't easily reduce them.
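The headroom rounding from this step, applied to the 30-partition result of step 2:

```python
import math

def provisioned_partitions(calculated: int, headroom: float = 0.25) -> int:
    """Round the calculated partition count up with 20-30% headroom
    for traffic spikes and uneven load."""
    return math.ceil(calculated * (1 + headroom))

# 30 partitions from the throughput math, with 20% and 30% headroom.
print(provisioned_partitions(30, 0.20))  # 36
print(provisioned_partitions(30, 0.30))  # 39
```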
Related Insights
Hot Partition Creates Uneven Consumer Lag (warning)
When one partition consistently shows higher lag than others, it indicates uneven key distribution or specific message types requiring more processing time, creating a processing bottleneck.

Partition key skew causes individual partition lag (info)

Preferred Leader Imbalance Reduces Cluster Efficiency (info)
When partitions do not use their preferred leader, cluster load becomes unbalanced, reducing throughput and increasing latency as some brokers handle disproportionate leadership.

Broker Bytes In/Out Imbalance Suggests Uneven Load Distribution (warning)
When some brokers show significantly higher network traffic than others, it indicates uneven partition distribution or leadership imbalance, causing inefficient resource utilization.
Relevant Metrics
kafka.topic.partitions, kafka.broker.partition_count, kafka.broker.config.num_partitions, kafka.topic.message_rate, kafka.topic.bytes_in_rate, kafka_consumergroup_members, kafka.partition.replicas, kafka.messages_in.rate, kafka.net.bytes_in.rate, kafka.consumer_group.members, kafka.broker.leader_count, kafka.replication.leader_count
Monitoring Interfaces
Kafka Native