Producer Throughput Optimization with Batching

Proactive Health

Producer throughput is lower than expected, and SRE needs to tune batching, compression, and other producer settings for better performance.

Prompt: My Kafka producers are only achieving 10,000 messages/sec but we need 50,000 messages/sec. I see low batch sizes in the metrics. Should I tune batch.size, linger.ms, or enable compression to improve throughput?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When producer throughput is below target with low batch sizes, start by examining actual batch size metrics and linger.ms configuration to ensure proper batching, then evaluate compression effectiveness for reducing network overhead, and finally investigate buffer memory exhaustion or in-flight request limits that could be bottlenecking your pipeline.

1. Examine batch size metrics and linger.ms configuration
Check `kafka-producer-batch-size-max` and `kafka-connector-task-batch-size-avg` to see how much data you're actually batching per request. If batch sizes are consistently small (e.g., <16KB when records are piling up), the producer isn't waiting long enough to accumulate records before sending. Start by increasing linger.ms from the default 0ms to 10-20ms to allow more records to batch together, which dramatically reduces request overhead and increases throughput. Compare `kafka-producer-record-send-rate` before and after the change to measure impact.
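The effect of better batching can be estimated with simple arithmetic. A minimal sketch (the function, message size, and batch sizes are illustrative, not Kafka APIs) showing how filling larger batches cuts the number of produce requests needed to hit a target rate:

```python
# Hypothetical before/after settings, using Kafka producer config names.
base = {"linger.ms": 0, "batch.size": 16384}    # defaults: send immediately
tuned = {"linger.ms": 15, "batch.size": 65536}  # wait briefly, batch more records

def requests_per_sec(msgs_per_sec, avg_msg_bytes, batch_bytes):
    """Rough request rate if batches actually fill to batch_bytes."""
    msgs_per_batch = max(1, batch_bytes // avg_msg_bytes)
    return msgs_per_sec / msgs_per_batch

# At the 50,000 msgs/sec target with 1 KB records:
print(requests_per_sec(50_000, 1024, base["batch.size"]))   # 16 records/batch -> 3125 req/s
print(requests_per_sec(50_000, 1024, tuned["batch.size"]))  # 64 records/batch -> 781.25 req/s
```

Fewer, fuller requests mean less per-request overhead on both producer and broker, which is why raising linger.ms and batch.size together is usually the first lever.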
2. Evaluate compression effectiveness to reduce network overhead
Look at `kafka-producer-compression-rate` and `kafka-producer-bytes-out` to understand how much data you're actually sending over the network. If compression isn't enabled or the compression rate is near 1.0 (no compression), you're sending far more bytes than necessary and saturating network bandwidth. Enable compression (snappy or lz4 for low CPU overhead) to reduce `kafka-producer-bytes-out` by 50-70% for typical text payloads, which directly translates to higher message throughput on the same network link.
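To see why compression translates directly into message throughput on a fixed link, here is a small illustrative calculation (the link bandwidth and compression ratio are assumed example values, not measurements):

```python
def effective_throughput(link_bytes_per_sec, avg_msg_bytes, compression_ratio):
    """Messages/sec a fixed network link can carry.
    compression_ratio = compressed_size / uncompressed_size (1.0 = no compression)."""
    wire_bytes_per_msg = avg_msg_bytes * compression_ratio
    return link_bytes_per_sec / wire_bytes_per_msg

link = 25 * 1024 * 1024  # assume an effective ~25 MB/s producer-to-broker link

print(effective_throughput(link, 1024, 1.0))  # uncompressed ceiling
print(effective_throughput(link, 1024, 0.4))  # e.g. snappy achieving ~60% reduction
```

With a 0.4 ratio the same link carries 2.5x the message rate, which is why a `kafka-producer-compression-rate` near 1.0 is worth fixing before buying bandwidth.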
3. Check for buffer memory exhaustion blocking sends
Monitor `kafka-producer-available-buffer-bytes` and `kafka-producer-record-queue-time-max` to detect whether your producer is running out of buffer space. As described in the `producer-buffer-exhaustion-causing-request-blocking` insight, when buffer memory approaches zero, send calls block waiting for space, causing application threads to stall. If available buffer bytes are consistently low (e.g., <10% of total) or queue times spike above 100ms, increase buffer.memory from the default 32MB to 64MB or 128MB to prevent blocking.
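The two thresholds above can be combined into a simple alert predicate. A sketch of that check (the function name and threshold defaults are illustrative; plug in the metric values from your monitoring system):

```python
def buffer_pressure(available_bytes, total_bytes, queue_time_max_ms,
                    min_free_frac=0.10, max_queue_ms=100):
    """Flag buffer exhaustion: free space under 10% of buffer.memory,
    or max record queue time spiking past 100ms."""
    low_space = available_bytes / total_bytes < min_free_frac
    slow_queue = queue_time_max_ms > max_queue_ms
    return low_space or slow_queue

total = 32 * 1024 * 1024  # default buffer.memory (32 MB)

print(buffer_pressure(2 * 1024 * 1024, total, 40))   # ~6% free -> pressure
print(buffer_pressure(16 * 1024 * 1024, total, 40))  # 50% free, fast queue -> healthy
```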
4. Verify in-flight requests aren't limiting parallelism
Check `kafka-producer-requests-in-flight` and `kafka-producer-request-rate` to see if you're hitting the concurrency ceiling. If requests in flight are consistently at the max.in.flight.requests.per.connection limit (default 5), you're throttling your own throughput by not allowing enough pipelining. Increasing max.in.flight.requests.per.connection to 10-20 allows more parallel requests, which improves throughput especially when batch.size is properly tuned and network latency is non-trivial. Note, however, that when enable.idempotence is true (the default since Kafka 3.0), this setting must stay at 5 or below, so verify your idempotence configuration before raising it.
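The ceiling this limit imposes follows from Little's law: with N requests in flight per connection and a round-trip time of R, a single connection cannot exceed N/R requests per second. A back-of-the-envelope sketch (the RTT value is an assumed example):

```python
def max_request_rate(in_flight_limit, rtt_ms):
    """Little's law upper bound on per-connection request rate:
    concurrent requests divided by round-trip time."""
    return in_flight_limit * 1000.0 / rtt_ms

rtt = 10  # assume a 10 ms producer-to-broker round trip

print(max_request_rate(5, rtt))   # default limit -> 500 req/s ceiling per connection
print(max_request_rate(15, rtt))  # raised limit -> 1500 req/s ceiling
```

If your required request rate (target messages/sec divided by records per request) exceeds this ceiling, either raise the in-flight limit or pack more records per request.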
5. Analyze record queue times for batching behavior
Compare `kafka-producer-record-queue-time-avg` to `kafka-producer-record-queue-time-max` to understand batching consistency. If the average is very low (<5ms) but max is high, records are sometimes batching well but often being sent immediately, indicating inconsistent application send patterns. High variance combined with low `kafka-producer-batch-size-max` suggests you need to either smooth out your application's send rate or increase linger.ms further to ensure consistent batching even during low-traffic periods.
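The "low average, high max" pattern described above can be expressed as a simple heuristic. A sketch (the function name, 5ms floor, and 10x ratio are illustrative thresholds, not Kafka-defined values):

```python
def batching_is_inconsistent(queue_time_avg_ms, queue_time_max_ms,
                             low_avg_ms=5, ratio_threshold=10):
    """Heuristic: a very low average queue time combined with a much
    higher max suggests bursty sends that sometimes bypass batching."""
    return (queue_time_avg_ms < low_avg_ms
            and queue_time_max_ms > ratio_threshold * max(queue_time_avg_ms, 1))

print(batching_is_inconsistent(2, 80))   # records often sent immediately -> True
print(batching_is_inconsistent(12, 25))  # consistently waiting to batch -> False
```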
6. Review overall request efficiency and resource contention
Check `kafka-producer-request-rate` against `kafka-producer-record-send-rate` to calculate records per request. If you're averaging fewer than 100 records per request when you need 50K messages/sec, you're generating too many small requests which creates overhead. As noted in the `high-request-rate-latency-resource-contention` insight, high request rates can indicate resource contention. The fix is usually to increase batch.size (try 256KB or 512KB) along with linger.ms to pack more records into each request, reducing both network overhead and broker load.
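The records-per-request calculation is just the ratio of the two metrics. A quick sketch using the numbers from this scenario (the specific request rates are assumed example values):

```python
def records_per_request(record_send_rate, request_rate):
    """Average records packed into each produce request."""
    return record_send_rate / request_rate

# Symptomatic: 10k msgs/sec spread across 1k requests/sec = 10 records each,
# far below the ~100 records/request efficiency target mentioned above.
print(records_per_request(10_000, 1_000))  # 10.0

# After raising batch.size and linger.ms: 50k msgs/sec over 400 requests/sec.
print(records_per_request(50_000, 400))    # 125.0
```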

Monitoring Interfaces

Kafka Datadog
Kafka Prometheus
Kafka Native