P99 Latency Degradation Investigation
Incident Response (warning)
99th percentile latency for produce or fetch requests is increasing, indicating performance issues affecting a subset of requests.
Prompt: “My Kafka cluster's P99 produce latency just jumped from 50ms to 300ms, but P50 is still around 20ms. What's causing the tail latency spike and how do I troubleshoot it?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When P99 latency spikes but P50 remains normal, you're dealing with tail latency affecting a subset of requests. Start by checking request handler saturation and queue buildup, then investigate disk I/O and log flush latency. Message conversion overhead from legacy clients and network processor saturation are also common culprits.
1. Check request handler saturation
First, look at `kafka.request.handler_idle_percent` — if it's below 20% (0.2), your request handlers are saturated and requests are queuing up. Cross-reference with `kafka.request.produce_time_99p` to confirm the P99 spike correlates with low handler idle time. When handlers are busy, the tail of requests waits in queue while most requests (P50) still process quickly, creating exactly the pattern you're seeing.
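The check above can be sketched as a small helper. The metric names mirror those in the text; the `metrics` dict is a hypothetical stand-in for values sampled from your monitoring system, and the 100ms P99 trigger is an illustrative choice, not a documented threshold:

```python
def handler_saturated(metrics):
    """Flag request-handler saturation: idle time below 20% while the
    produce P99 is elevated (idle threshold from the playbook; the
    100ms P99 trigger is a hypothetical example value)."""
    idle = metrics["kafka.request.handler_idle_percent"]  # fraction, 0.0-1.0
    p99_ms = metrics["kafka.request.produce_time_99p"]    # milliseconds
    return idle < 0.2 and p99_ms > 100

# Example: 8% handler idle with a 300ms produce P99 indicates saturation
print(handler_saturated({
    "kafka.request.handler_idle_percent": 0.08,
    "kafka.request.produce_time_99p": 300,
}))  # True
```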
2. Verify request queue buildup
Check `kafka.request.channel.queue.size` — if it's consistently above 100 and trending upward, requests are backing up faster than handlers can process them. This confirms broker overload and explains the P99 degradation. A growing queue means the unlucky requests at the tail are waiting longer while most requests still get through quickly.
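A minimal sketch of the "consistently above 100 and trending upward" condition, using a window of queue-size samples (the simple first-versus-last trend test is an assumption for illustration):

```python
def queue_backing_up(samples, threshold=100):
    """True when kafka.request.channel.queue.size samples all sit above
    the threshold AND the window trends upward (first-vs-last sample)."""
    sustained = all(s > threshold for s in samples)
    rising = samples[-1] > samples[0]
    return sustained and rising

print(queue_backing_up([120, 140, 180, 250]))  # True: sustained and growing
print(queue_backing_up([150, 130, 110, 105]))  # False: queue is draining
```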
3. Investigate log flush latency and disk I/O
Look at `kafka.log.LogFlushStats.LogFlushRateAndTimeMs.Percentile95th` — if it's consistently above 100ms, slow disk writes are causing produce requests to stall. Also check if `kafka.log.flush_rate` is dropping while P99 latency increases; this indicates the disk can't keep up with flush operations. Disk I/O spikes often affect tail latency disproportionately because only the requests that hit a flush operation pay the penalty.
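Combining the two signals above into one check might look like the sketch below; the 100ms flush-time threshold comes from the text, while the 20% rate-drop cutoff is a hypothetical choice:

```python
def flush_stalling(flush_p95_ms, flush_rate_now, flush_rate_baseline):
    """Disk-bound flush path: P95 flush time above 100ms while
    kafka.log.flush_rate drops relative to its baseline."""
    slow_flush = flush_p95_ms > 100
    rate_dropping = flush_rate_now < 0.8 * flush_rate_baseline
    return slow_flush and rate_dropping

# 240ms P95 flushes with the flush rate down 40% from baseline
print(flush_stalling(flush_p95_ms=240, flush_rate_now=30, flush_rate_baseline=50))  # True
```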
4. Check for message conversion overhead from legacy clients
If `kafka.network.produce_message_conversions_rate` is greater than 0, the broker is converting message formats for legacy clients, adding significant CPU overhead. This conversion work can cause request handler threads to be busy with conversion instead of processing new requests, increasing P99 latency. Identify which producers are using old message formats and upgrade them to eliminate this overhead.
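To gauge how much of the produce traffic pays the conversion penalty, you can compare the conversion rate against the overall produce rate. This ratio helper is a sketch, not part of Kafka itself:

```python
def conversion_fraction(conversion_rate, produce_rate):
    """Fraction of produce requests incurring message-format conversion.
    Any nonzero kafka.network.produce_message_conversions_rate means
    legacy clients are forcing conversions on the broker."""
    if conversion_rate <= 0:
        return 0.0  # no conversion overhead
    return conversion_rate / produce_rate

# 150 conversions/sec out of 1000 produces/sec: 15% of requests affected
print(conversion_fraction(conversion_rate=150, produce_rate=1000))  # 0.15
```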
5. Assess network processor capacity
Check `kafka.network.processor_idle_percent` — if it's low (below 20%), your network layer is saturated and requests are queuing before they even reach the request handlers. This is less common than handler saturation but can happen under very high request rates. Low network processor idle combined with high P99 latency suggests you need to scale broker capacity or reduce request frequency from clients.
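Since both layers use the same 20% idle threshold, a small triage helper can tell them apart; the returned advice strings are illustrative:

```python
def saturation_layer(processor_idle, handler_idle):
    """Distinguish network-layer saturation from handler saturation:
    low processor idle means requests queue before they even reach
    the request handlers."""
    if processor_idle < 0.2:
        return "network processors saturated: scale brokers or throttle clients"
    if handler_idle < 0.2:
        return "request handlers saturated: see step 1"
    return "neither layer saturated"

print(saturation_layer(processor_idle=0.12, handler_idle=0.55))
```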
6. Review broader broker health and correlate with P99 thresholds
Finally, check if `kafka.request.produce_time_99p` has crossed the 500ms warning threshold defined in the high-p99-request-latency insight. Also look at fetch consumer times (`kafka.request.fetch_consumer_time_avg`) and follower fetch P99 (`kafka.request.fetch_follower_time_99p`) to determine if the issue is isolated to produce requests or affecting the entire broker. This broader view helps identify whether you're dealing with a produce-specific bottleneck or general broker degradation.
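The correlation step can be sketched as a classifier. The 500ms produce threshold matches the high-p99-request-latency insight described above; the fetch-side cutoffs are hypothetical illustration values:

```python
def classify_degradation(produce_p99, fetch_consumer_avg, fetch_follower_p99):
    """Separate a produce-specific bottleneck from broker-wide degradation
    by checking produce and fetch latencies together (fetch thresholds
    are assumed example values, not documented defaults)."""
    produce_bad = produce_p99 > 500
    fetch_bad = fetch_consumer_avg > 200 or fetch_follower_p99 > 500
    if produce_bad and fetch_bad:
        return "broker-wide degradation"
    if produce_bad:
        return "produce-specific bottleneck"
    return "within thresholds"

# High produce P99 but healthy fetch paths points at the produce pipeline
print(classify_degradation(produce_p99=650, fetch_consumer_avg=40,
                           fetch_follower_p99=120))  # produce-specific bottleneck
```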
Related Insights

- High P99 request latency indicates broker performance issues (warning)
- Log Flush Latency Spikes Causing Write Stalls (warning): When log flush operations take excessive time, produce requests are delayed as Kafka waits for data to be flushed to disk, impacting producer latency and throughput.
- Request Handler Saturation Cascades to Producer Latency (warning): When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
- High Request Queue Size Indicates Broker Overload (warning): When request queue size grows, it indicates the broker cannot process incoming requests fast enough, leading to increased latency and potential client timeouts.
- Message Conversion Overhead from Legacy Clients (warning): When brokers perform message format conversion for legacy clients, it adds significant CPU overhead and latency, reducing overall throughput.
Relevant Metrics
- kafka.request.produce_time_99p
- kafka.request.update_metadata_time_avg
- kafka.request.fetch_follower_time_99p
- kafka.log.LogFlushStats.LogFlushRateAndTimeMs.Percentile95th
- kafka.request.handler_idle_percent
- kafka.request.fetch_consumer_time_avg
- kafka.log.flush_rate
- kafka.network.processor_idle_percent
- kafka.request.channel.queue.size

Monitoring Interfaces
Kafka Native