P99 Latency Degradation Investigation
Incident Response (warning)
99th percentile latency for produce or fetch requests is increasing, indicating performance issues affecting a subset of requests.
Prompt: “My Kafka cluster's P99 produce latency just jumped from 50ms to 300ms, but P50 is still around 20ms. What's causing the tail latency spike and how do I troubleshoot it?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When P99 latency spikes but P50 remains normal, you're dealing with tail latency affecting a subset of requests. Start by checking request handler saturation and queue buildup, then investigate disk I/O and log flush latency. Message conversion overhead from legacy clients and network processor saturation are also common culprits.
1. Check request handler saturation
First, look at `kafka.request.handler_idle_percent` — if it's below 20% (0.2), your request handlers are saturated and requests are queuing up. Cross-reference with `kafka.request.produce_time_99p` to confirm the P99 spike correlates with low handler idle time. When handlers are busy, the tail of requests waits in queue while most requests (P50) still process quickly, creating exactly the pattern you're seeing.
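The check above can be sketched as a small helper. The metric names mirror those in the text; the `metrics` dict is a hypothetical stand-in for values sampled from your monitoring system, and the 100ms P99 trigger is an illustrative choice, not a documented threshold:

```python
def handler_saturated(metrics):
    """Flag request-handler saturation: idle time below 20% while the
    produce P99 is elevated (idle threshold from the playbook; the
    100ms P99 trigger is a hypothetical example value)."""
    idle = metrics["kafka.request.handler_idle_percent"]  # fraction, 0.0-1.0
    p99_ms = metrics["kafka.request.produce_time_99p"]    # milliseconds
    return idle < 0.2 and p99_ms > 100

# Example: 8% handler idle with a 300ms produce P99 indicates saturation
print(handler_saturated({
    "kafka.request.handler_idle_percent": 0.08,
    "kafka.request.produce_time_99p": 300,
}))  # True
```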
2. Verify request queue buildup
Check `kafka.request.channel.queue.size` — if it's consistently above 100 and trending upward, requests are backing up faster than handlers can process them. This confirms broker overload and explains the P99 degradation. A growing queue means the unlucky requests at the tail are waiting longer while most requests still get through quickly.
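A minimal sketch of the "consistently above 100 and trending upward" condition, using a window of queue-size samples (the simple first-versus-last trend test is an assumption for illustration):

```python
def queue_backing_up(samples, threshold=100):
    """True when kafka.request.channel.queue.size samples all sit above
    the threshold AND the window trends upward (first-vs-last sample)."""
    sustained = all(s > threshold for s in samples)
    rising = samples[-1] > samples[0]
    return sustained and rising

print(queue_backing_up([120, 140, 180, 250]))  # True: sustained and growing
print(queue_backing_up([150, 130, 110, 105]))  # False: queue is draining
```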
3. Investigate log flush latency and disk I/O
Look at `kafka.log.LogFlushStats.LogFlushRateAndTimeMs.Percentile95th` — if it's consistently above 100ms, slow disk writes are causing produce requests to stall. Also check if `kafka.log.flush_rate` is dropping while P99 latency increases; this indicates the disk can't keep up with flush operations. Disk I/O spikes often affect tail latency disproportionately because only the requests that hit a flush operation pay the penalty.
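Combining the two signals above into one check might look like the sketch below; the 100ms flush-time threshold comes from the text, while the 20% rate-drop cutoff is a hypothetical choice:

```python
def flush_stalling(flush_p95_ms, flush_rate_now, flush_rate_baseline):
    """Disk-bound flush path: P95 flush time above 100ms while
    kafka.log.flush_rate drops relative to its baseline."""
    slow_flush = flush_p95_ms > 100
    rate_dropping = flush_rate_now < 0.8 * flush_rate_baseline
    return slow_flush and rate_dropping

# 240ms P95 flushes with the flush rate down 40% from baseline
print(flush_stalling(flush_p95_ms=240, flush_rate_now=30, flush_rate_baseline=50))  # True
```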
4. Check for message conversion overhead from legacy clients
If `kafka.network.produce_message_conversions_rate` is greater than 0, the broker is converting message formats for legacy clients, adding significant CPU overhead. This conversion work can cause request handler threads to be busy with conversion instead of processing new requests, increasing P99 latency. Identify which producers are using old message formats and upgrade them to eliminate this overhead.
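To gauge how much of the produce traffic pays the conversion penalty, you can compare the conversion rate against the overall produce rate. This ratio helper is a sketch, not part of Kafka itself:

```python
def conversion_fraction(conversion_rate, produce_rate):
    """Fraction of produce requests incurring message-format conversion.
    Any nonzero kafka.network.produce_message_conversions_rate means
    legacy clients are forcing conversions on the broker."""
    if conversion_rate <= 0:
        return 0.0  # no conversion overhead
    return conversion_rate / produce_rate

# 150 conversions/sec out of 1000 produces/sec: 15% of requests affected
print(conversion_fraction(conversion_rate=150, produce_rate=1000))  # 0.15
```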
5. Assess network processor capacity
Check `kafka.network.processor_idle_percent` — if it's low (below 20%), your network layer is saturated and requests are queuing before they even reach the request handlers. This is less common than handler saturation but can happen under very high request rates. Low network processor idle combined with high P99 latency suggests you need to scale broker capacity or reduce request frequency from clients.
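Since both layers use the same 20% idle threshold, a small triage helper can tell them apart; the returned advice strings are illustrative:

```python
def saturation_layer(processor_idle, handler_idle):
    """Distinguish network-layer saturation from handler saturation:
    low processor idle means requests queue before they even reach
    the request handlers."""
    if processor_idle < 0.2:
        return "network processors saturated: scale brokers or throttle clients"
    if handler_idle < 0.2:
        return "request handlers saturated: see step 1"
    return "neither layer saturated"

print(saturation_layer(processor_idle=0.12, handler_idle=0.55))
```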
6. Review broader broker health and correlate with P99 thresholds
Finally, check if `kafka.request.produce_time_99p` has crossed the 500ms warning threshold defined in the high-p99-request-latency insight. Also look at fetch consumer times (`kafka.request.fetch_consumer_time_avg`) and follower fetch P99 (`kafka.request.fetch_follower_time_99p`) to determine if the issue is isolated to produce requests or affecting the entire broker. This broader view helps identify whether you're dealing with a produce-specific bottleneck or general broker degradation.
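The correlation step can be sketched as a classifier. The 500ms produce threshold matches the high-p99-request-latency insight described above; the fetch-side cutoffs are hypothetical illustration values:

```python
def classify_degradation(produce_p99, fetch_consumer_avg, fetch_follower_p99):
    """Separate a produce-specific bottleneck from broker-wide degradation
    by checking produce and fetch latencies together (fetch thresholds
    are assumed example values, not documented defaults)."""
    produce_bad = produce_p99 > 500
    fetch_bad = fetch_consumer_avg > 200 or fetch_follower_p99 > 500
    if produce_bad and fetch_bad:
        return "broker-wide degradation"
    if produce_bad:
        return "produce-specific bottleneck"
    return "within thresholds"

# High produce P99 but healthy fetch paths points at the produce pipeline
print(classify_degradation(produce_p99=650, fetch_consumer_avg=40,
                           fetch_follower_p99=120))  # produce-specific bottleneck
```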
Related Insights

- High P99 request latency indicates broker performance issues (warning)
- Log Flush Latency Spikes Causing Write Stalls (warning): When log flush operations take excessive time, produce requests are delayed as Kafka waits for data to be flushed to disk, impacting producer latency and throughput.
- Request Handler Saturation Cascades to Producer Latency (warning): When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
- High Request Queue Size Indicates Broker Overload (warning): When request queue size grows, it indicates the broker cannot process incoming requests fast enough, leading to increased latency and potential client timeouts.
- Message Conversion Overhead from Legacy Clients (warning): When brokers perform message format conversion for legacy clients, it adds significant CPU overhead and latency, reducing overall throughput.
Relevant Metrics
- kafka.request.produce_time_99p
- kafka.request.update_metadata_time_avg
- kafka.request.fetch_follower_time_99p
- kafka.log.LogFlushStats.LogFlushRateAndTimeMs.Percentile95th
- kafka.request.handler_idle_percent
- kafka.request.fetch_consumer_time_avg
- kafka.log.flush_rate
- kafka.network.processor_idle_percent
- kafka.request.channel.queue.size

Monitoring Interfaces
Kafka Native