Network Saturation Between Brokers and Clients
warning
Incident Response
Network bandwidth is becoming a bottleneck, affecting throughput and causing potential message delays or broker performance degradation.
Prompt: “My Kafka broker network usage is at 85% of capacity and I'm seeing slower producer acknowledgments. Is my network saturated? Should I enable compression or scale up my broker instance types?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating apparent network saturation in Kafka, first distinguish between network thread saturation and actual bandwidth limits, then check for uneven load distribution across brokers. Only after ruling out processing bottlenecks and load imbalance should you consider compression tuning or vertical scaling.
1. Distinguish network thread saturation from bandwidth saturation
Check `kafka.network.SocketServer.NetworkProcessorAvgIdlePercent` immediately - if this drops below 30%, your network processor threads are overwhelmed handling I/O, which is very different from hitting actual NIC bandwidth limits. This is often the real culprit when 'network usage' appears high but you're not actually saturating the physical link. If idle percentage is healthy (>50%), then you likely have true bandwidth constraints. The network-processor-idle-drop-precedes-request-timeouts insight shows this saturation leads to request timeouts before requests even reach handler threads.
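The triage above can be sketched as a small helper. This is an illustrative sketch, not product code: the function name is hypothetical, the 30%/50% cutoffs come from the heuristics in this step, and it assumes the idle metric is reported as a 0–1 fraction (as in the raw JMX bean; some agents rescale it to 0–100).

```python
def classify_network_processor(idle_percent: float) -> str:
    """Classify a NetworkProcessorAvgIdlePercent reading (0-1 scale).
    Cutoffs mirror the heuristics above; they are not Kafka defaults."""
    if idle_percent < 0.30:
        return "thread-saturated"  # network processor threads overwhelmed by I/O
    if idle_percent > 0.50:
        return "healthy"           # suspect true NIC bandwidth limits instead
    return "degrading"             # trending toward saturation; watch closely

print(classify_network_processor(0.22))  # thread-saturated
```

If the reading classifies as thread-saturated, raising `num.network.threads` is usually more productive than adding bandwidth.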
2. Check for uneven load distribution across brokers
Compare `kafka.net.bytes_in.rate` and `kafka.net.bytes_out.rate` across all brokers in your cluster - if any single broker exceeds the cluster average by more than 50%, you have a partition or leadership imbalance problem, not cluster-wide network saturation. The hot broker will show 85% network usage while others sit at 40%. This is critical because scaling up instance types won't help if only one broker is saturated due to uneven partition distribution.
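The 50%-above-average heuristic is easy to apply mechanically once you have per-broker rates. A minimal sketch, assuming you have already collected `kafka.net.bytes_in.rate` (or bytes out) per broker; the function name and sample values are hypothetical.

```python
def find_hot_brokers(bytes_rates: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Return brokers whose traffic exceeds the cluster average by more than
    `tolerance` (50% by default), per the imbalance heuristic above."""
    avg = sum(bytes_rates.values()) / len(bytes_rates)
    return [b for b, rate in bytes_rates.items() if rate > avg * (1 + tolerance)]

# Hypothetical per-broker bytes-in rates (MB/s): one hot broker, two idle ones.
rates = {"broker-1": 850.0, "broker-2": 400.0, "broker-3": 410.0}
print(find_hot_brokers(rates))  # ['broker-1']
```

A non-empty result points at partition reassignment or preferred-leader election, not vertical scaling.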
3. Evaluate current compression effectiveness
Check `kafka.producer.compression_rate` to see whether compression is enabled and effective. In the Kafka producer this metric (`compression-rate-avg`) is the ratio of compressed to uncompressed batch size, so a value at or near 1.0 means compression is disabled or ineffective, while lower values mean better compression. If compression is off, enabling producer-side compression (snappy or lz4) can reduce network traffic by 50-70% for typical message payloads, directly addressing your bandwidth concerns. If batches are already compressing well (ratio well below 1.0, e.g. 0.3-0.5), the bottleneck is elsewhere and you won't gain much from compression tuning. This answers whether enabling compression would help before you invest in larger instances.
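If compression turns out to be disabled, it is a producer-side change only; brokers store and forward the compressed batches as-is. A hedged producer config sketch (the property names are standard Kafka producer settings, but the values shown are illustrative starting points, not recommendations for every workload):

```properties
# producer.properties - illustrative values; tune for your payloads
compression.type=lz4   # or snappy; both favor speed over compression ratio
batch.size=65536       # larger batches generally compress better
linger.ms=10           # brief wait to fill batches before sending
```

Larger batches and a small linger improve the compression ratio because the codec sees more redundant data per batch.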
4. Identify message format conversion overhead
Look for message format conversion happening due to legacy clients - the message-conversion-overhead-from-legacy-clients insight shows this adds significant CPU overhead and can masquerade as network saturation. When brokers convert message formats for old clients, it increases both processing load and apparent network utilization while adding latency. If conversion is occurring, upgrading legacy clients to current message formats is more effective than scaling brokers.
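One way to quantify this is the fraction of incoming messages that require conversion, using the broker JMX beans `ProduceMessageConversionsPerSec` and `MessagesInPerSec` (both under `BrokerTopicMetrics`). A sketch with a hypothetical helper name; what counts as "significant" overhead depends on your CPU headroom.

```python
def conversion_overhead(conversions_per_sec: float, messages_in_per_sec: float) -> float:
    """Fraction of incoming messages requiring format conversion.
    Inputs are the broker's ProduceMessageConversionsPerSec and
    MessagesInPerSec rates over the same window."""
    if messages_in_per_sec == 0:
        return 0.0  # no traffic, nothing to convert
    return conversions_per_sec / messages_in_per_sec

frac = conversion_overhead(1200.0, 8000.0)  # hypothetical sample rates
print(f"{frac:.0%} of messages converted")  # 15% of messages converted
```

A non-trivial fraction here argues for upgrading the legacy clients rather than scaling brokers, since the conversion cost disappears entirely once clients speak the current message format.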
5. Check request handler thread capacity
Even if network processor threads are healthy, saturated I/O request handlers cause producer requests to queue up, manifesting as increased latency and apparent network issues. The request-handler-saturation-cascades-to-producer-latency insight indicates that handler saturation (low idle percentage) directly impacts producer acknowledgment times. Cross-reference with `kafka.request.produce_time_99p` - if P99 latency exceeds 500ms while network processors are idle, you need more I/O threads (`num.io.threads`), not more network capacity.
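The cross-reference in this step amounts to a three-input decision. A sketch with a hypothetical function name; the 500 ms and idle-percentage cutoffs mirror the heuristics above and are illustrative, not Kafka defaults.

```python
def diagnose_produce_latency(p99_ms: float, network_idle: float, handler_idle: float) -> str:
    """Combine P99 produce latency with network-processor and request-handler
    idle fractions (0-1 scale) to pick the likely bottleneck."""
    if p99_ms > 500 and network_idle > 0.5 and handler_idle < 0.3:
        return "increase num.io.threads"      # handlers saturated, network fine
    if p99_ms > 500 and network_idle < 0.3:
        return "network processors saturated"  # revisit network thread capacity
    return "latency within bounds or cause elsewhere"

print(diagnose_produce_latency(p99_ms=720, network_idle=0.65, handler_idle=0.18))
# increase num.io.threads
```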
6. Correlate with producer-side throughput and latency
Look at the producer perspective using `kafka.request.produce_time_99p`, `kafka.producer.request_rate`, and `kafka.producer.bytes_out` together. High P99 produce latency (>500ms) combined with high request rates and bytes out confirms broker-side performance degradation is impacting clients. Compare `kafka.producer.bytes_out` trends with broker-side `kafka.net.bytes_in.rate` to verify if the bottleneck is truly on the broker network or if client-side batching/throttling is reducing effective throughput.
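Comparing the two trends can be done with a plain correlation over samples from the same window. A minimal sketch (hypothetical helper, stdlib-only Pearson correlation): rates that move in lockstep while latency climbs point at the broker network, while a broker-ingest trend that stays flat as producer output drops suggests client-side batching or throttling is capping throughput first.

```python
def correlate_rates(producer_out: list[float], broker_in: list[float]) -> float:
    """Pearson correlation between kafka.producer.bytes_out and
    kafka.net.bytes_in.rate samples taken over the same window."""
    n = len(producer_out)
    mx = sum(producer_out) / n
    my = sum(broker_in) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(producer_out, broker_in))
    sx = sum((x - mx) ** 2 for x in producer_out) ** 0.5
    sy = sum((y - my) ** 2 for y in broker_in) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0  # 0.0 when a series is flat
```

For example, `correlate_rates([10.0, 20.0, 30.0], [9.8, 19.5, 29.9])` returns a value near 1.0, consistent with producer traffic actually reaching the broker NIC.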
Related Insights
Network Processor Idle Drop Precedes Request Timeouts
critical
When network processor average idle percentage drops significantly, it indicates the broker is overwhelmed processing network I/O, leading to client-side request timeouts before requests reach handler threads.
Broker Bytes In/Out Imbalance Suggests Uneven Load Distribution
warning
When some brokers show significantly higher network traffic than others, it indicates uneven partition distribution or leadership imbalance, causing inefficient resource utilization.
Message Conversion Overhead from Legacy Clients
warning
When brokers perform message format conversion for legacy clients, it adds significant CPU overhead and latency, reducing overall throughput.
Request Handler Saturation Cascades to Producer Latency
warning
When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
High P99 request latency indicates broker performance issues
warning
Relevant Metrics
kafka.net.bytes_in.rate
kafka.net.bytes_out.rate
kafka.connect.outgoing_byte_rate
kafka.network.bytes_in_rate
kafka.network.bytes_out_rate
kafka.request.update_metadata_time_avg
kafka.producer.compression_rate
kafka.network.SocketServer.NetworkProcessorAvgIdlePercent
kafka.producer.request_rate
kafka.producer.bytes_out
kafka.request.produce_time_99p
Monitoring Interfaces
Kafka Native