Network Saturation Between Brokers and Clients
warning
Incident Response
Network bandwidth is becoming a bottleneck, affecting throughput and causing potential message delays or broker performance degradation.
Prompt: “My Kafka broker network usage is at 85% of capacity and I'm seeing slower producer acknowledgments. Is my network saturated? Should I enable compression or scale up my broker instance types?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating apparent network saturation in Kafka, first distinguish between network thread saturation and actual bandwidth limits, then check for uneven load distribution across brokers. Only after ruling out processing bottlenecks and load imbalance should you consider compression tuning or vertical scaling.
1. Distinguish network thread saturation from bandwidth saturation
Check `kafka.network.SocketServer.NetworkProcessorAvgIdlePercent` immediately - if this drops below 30%, your network processor threads are overwhelmed handling I/O, which is very different from hitting actual NIC bandwidth limits. This is often the real culprit when 'network usage' appears high but you're not actually saturating the physical link. If idle percentage is healthy (>50%), then you likely have true bandwidth constraints. The network-processor-idle-drop-precedes-request-timeouts insight shows this saturation leads to request timeouts before requests even reach handler threads.
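The triage above can be sketched as a small helper. This is an illustrative sketch, not product code: the function name is hypothetical, the 30%/50% cutoffs come from the heuristics in this step, and it assumes the idle metric is reported as a 0–1 fraction (as in the raw JMX bean; some agents rescale it to 0–100).

```python
def classify_network_processor(idle_percent: float) -> str:
    """Classify a NetworkProcessorAvgIdlePercent reading (0-1 scale).
    Cutoffs mirror the heuristics above; they are not Kafka defaults."""
    if idle_percent < 0.30:
        return "thread-saturated"  # network processor threads overwhelmed by I/O
    if idle_percent > 0.50:
        return "healthy"           # suspect true NIC bandwidth limits instead
    return "degrading"             # trending toward saturation; watch closely

print(classify_network_processor(0.22))  # thread-saturated
```

If the reading classifies as thread-saturated, raising `num.network.threads` is usually more productive than adding bandwidth.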
2. Check for uneven load distribution across brokers
Compare `kafka.net.bytes_in.rate` and `kafka.net.bytes_out.rate` across all brokers in your cluster - if any single broker exceeds the cluster average by more than 50%, you have a partition or leadership imbalance problem, not cluster-wide network saturation. The hot broker will show 85% network usage while others sit at 40%. This is critical because scaling up instance types won't help if only one broker is saturated due to uneven partition distribution.
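The 50%-above-average heuristic is easy to apply mechanically once you have per-broker rates. A minimal sketch, assuming you have already collected `kafka.net.bytes_in.rate` (or bytes out) per broker; the function name and sample values are hypothetical.

```python
def find_hot_brokers(bytes_rates: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Return brokers whose traffic exceeds the cluster average by more than
    `tolerance` (50% by default), per the imbalance heuristic above."""
    avg = sum(bytes_rates.values()) / len(bytes_rates)
    return [b for b, rate in bytes_rates.items() if rate > avg * (1 + tolerance)]

# Hypothetical per-broker bytes-in rates (MB/s): one hot broker, two idle ones.
rates = {"broker-1": 850.0, "broker-2": 400.0, "broker-3": 410.0}
print(find_hot_brokers(rates))  # ['broker-1']
```

A non-empty result points at partition reassignment or preferred-leader election, not vertical scaling.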
3. Evaluate current compression effectiveness
Check `kafka.producer.compression_rate` to see whether compression is enabled and effective. In the Kafka producer this metric (`compression-rate-avg`) is the ratio of compressed to uncompressed batch size, so a value at or near 1.0 means compression is disabled or ineffective, while lower values mean better compression. If compression is off, enabling producer-side compression (snappy or lz4) can reduce network traffic by 50-70% for typical message payloads, directly addressing your bandwidth concerns. If batches are already compressing well (ratio well below 1.0, e.g. 0.3-0.5), the bottleneck is elsewhere and you won't gain much from compression tuning. This answers whether enabling compression would help before you invest in larger instances.
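If compression turns out to be disabled, it is a producer-side change only; brokers store and forward the compressed batches as-is. A hedged producer config sketch (the property names are standard Kafka producer settings, but the values shown are illustrative starting points, not recommendations for every workload):

```properties
# producer.properties - illustrative values; tune for your payloads
compression.type=lz4   # or snappy; both favor speed over compression ratio
batch.size=65536       # larger batches generally compress better
linger.ms=10           # brief wait to fill batches before sending
```

Larger batches and a small linger improve the compression ratio because the codec sees more redundant data per batch.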
4. Identify message format conversion overhead
Look for message format conversion happening due to legacy clients - the message-conversion-overhead-from-legacy-clients insight shows this adds significant CPU overhead and can masquerade as network saturation. When brokers convert message formats for old clients, it increases both processing load and apparent network utilization while adding latency. If conversion is occurring, upgrading legacy clients to current message formats is more effective than scaling brokers.
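One way to quantify this is the fraction of incoming messages that require conversion, using the broker JMX beans `ProduceMessageConversionsPerSec` and `MessagesInPerSec` (both under `BrokerTopicMetrics`). A sketch with a hypothetical helper name; what counts as "significant" overhead depends on your CPU headroom.

```python
def conversion_overhead(conversions_per_sec: float, messages_in_per_sec: float) -> float:
    """Fraction of incoming messages requiring format conversion.
    Inputs are the broker's ProduceMessageConversionsPerSec and
    MessagesInPerSec rates over the same window."""
    if messages_in_per_sec == 0:
        return 0.0  # no traffic, nothing to convert
    return conversions_per_sec / messages_in_per_sec

frac = conversion_overhead(1200.0, 8000.0)  # hypothetical sample rates
print(f"{frac:.0%} of messages converted")  # 15% of messages converted
```

A non-trivial fraction here argues for upgrading the legacy clients rather than scaling brokers, since the conversion cost disappears entirely once clients speak the current message format.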
5. Check request handler thread capacity
Even if network processor threads are healthy, saturated I/O request handlers cause producer requests to queue up, manifesting as increased latency and apparent network issues. The request-handler-saturation-cascades-to-producer-latency insight indicates that handler saturation (low idle percentage) directly impacts producer acknowledgment times. Cross-reference with `kafka.request.produce_time_99p` - if P99 latency exceeds 500ms while network processors are idle, you need more I/O threads (`num.io.threads`), not more network capacity.
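The cross-reference in this step amounts to a three-input decision. A sketch with a hypothetical function name; the 500 ms and idle-percentage cutoffs mirror the heuristics above and are illustrative, not Kafka defaults.

```python
def diagnose_produce_latency(p99_ms: float, network_idle: float, handler_idle: float) -> str:
    """Combine P99 produce latency with network-processor and request-handler
    idle fractions (0-1 scale) to pick the likely bottleneck."""
    if p99_ms > 500 and network_idle > 0.5 and handler_idle < 0.3:
        return "increase num.io.threads"      # handlers saturated, network fine
    if p99_ms > 500 and network_idle < 0.3:
        return "network processors saturated"  # revisit network thread capacity
    return "latency within bounds or cause elsewhere"

print(diagnose_produce_latency(p99_ms=720, network_idle=0.65, handler_idle=0.18))
# increase num.io.threads
```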
6. Correlate with producer-side throughput and latency
Look at the producer perspective using `kafka.request.produce_time_99p`, `kafka.producer.request_rate`, and `kafka.producer.bytes_out` together. High P99 produce latency (>500ms) combined with high request rates and bytes out confirms broker-side performance degradation is impacting clients. Compare `kafka.producer.bytes_out` trends with broker-side `kafka.net.bytes_in.rate` to verify if the bottleneck is truly on the broker network or if client-side batching/throttling is reducing effective throughput.
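Comparing the two trends can be done with a plain correlation over samples from the same window. A minimal sketch (hypothetical helper, stdlib-only Pearson correlation): rates that move in lockstep while latency climbs point at the broker network, while a broker-ingest trend that stays flat as producer output drops suggests client-side batching or throttling is capping throughput first.

```python
def correlate_rates(producer_out: list[float], broker_in: list[float]) -> float:
    """Pearson correlation between kafka.producer.bytes_out and
    kafka.net.bytes_in.rate samples taken over the same window."""
    n = len(producer_out)
    mx = sum(producer_out) / n
    my = sum(broker_in) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(producer_out, broker_in))
    sx = sum((x - mx) ** 2 for x in producer_out) ** 0.5
    sy = sum((y - my) ** 2 for y in broker_in) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0  # 0.0 when a series is flat
```

For example, `correlate_rates([10.0, 20.0, 30.0], [9.8, 19.5, 29.9])` returns a value near 1.0, consistent with producer traffic actually reaching the broker NIC.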
Related Insights
Network Processor Idle Drop Precedes Request Timeouts
critical
When network processor average idle percentage drops significantly, it indicates the broker is overwhelmed processing network I/O, leading to client-side request timeouts before requests reach handler threads.
Broker Bytes In/Out Imbalance Suggests Uneven Load Distribution
warning
When some brokers show significantly higher network traffic than others, it indicates uneven partition distribution or leadership imbalance, causing inefficient resource utilization.
Message Conversion Overhead from Legacy Clients
warning
When brokers perform message format conversion for legacy clients, it adds significant CPU overhead and latency, reducing overall throughput.
Request Handler Saturation Cascades to Producer Latency
warning
When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
High P99 request latency indicates broker performance issues
warning
Relevant Metrics
kafka.net.bytes_in.rate
kafka.net.bytes_out.rate
kafka.connect.outgoing_byte_rate
kafka.network.bytes_in_rate
kafka.network.bytes_out_rate
kafka.request.update_metadata_time_avg
kafka.producer.compression_rate
kafka.network.SocketServer.NetworkProcessorAvgIdlePercent
kafka.producer.request_rate
kafka.producer.bytes_out
kafka.request.produce_time_99p
Monitoring Interfaces
Kafka Native