Heap Memory and Garbage Collection Tuning
Proactive Health
Kafka broker JVM is experiencing frequent or long garbage collection pauses, impacting request processing and causing latency spikes.
Prompt: “My Kafka brokers are experiencing GC pauses of 200-500ms every few minutes, causing request timeouts. Current heap is set to 8GB with default GC settings. How should I tune the JVM for better performance?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing Kafka JVM GC issues, start by confirming that GC pauses correlate with request latency spikes, then assess the downstream impact on client timeouts and broker capacity. Check for contributing factors like log flush pressure before tuning heap size and switching to G1GC with appropriate pause time targets.
1. Correlate GC pause timing with request latency spikes
First, confirm GC is the root cause by checking if your 200-500ms GC pauses align with spikes in `kafka-request-produce-time-99percentile` and `kafka-request-fetch-consumer-time-99percentile`. If P99 latency jumps from ~50ms to 200-500ms at the same intervals as GC pauses, you've confirmed the connection. The `jvm-gc-pause-cascading-to-kafka-consumer-lag` insight shows this creates cascading failures where even short pauses cause measurable consumer lag.
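To make this correlation concrete, you can pull pause durations straight out of the broker's GC log. A minimal sketch, assuming JDK 9+ unified GC logging (enabled with `-Xlog:gc*:file=gc.log`) where pause-summary lines end in a millisecond value; the log path and the 200ms threshold are illustrative, not Kafka defaults:

```shell
# Hedged sketch: print GC pause lines longer than a threshold from a
# JDK 9+ unified GC log. Assumes pause-summary lines end in "<n>ms",
# e.g. "... 4521M->2103M(8192M) 312.456ms".
long_gc_pauses() {
  log="$1"; limit_ms="$2"
  awk -v limit="$limit_ms" '/Pause/ {
    ms = $NF                 # e.g. "312.456ms"
    sub(/ms$/, "", ms)
    if (ms + 0 > limit) print
  }' "$log"
}
# usage: long_gc_pauses /var/log/kafka/gc.log 200
```

Line up the timestamps this prints against the spikes in the P99 request-time metrics; a consistent match confirms GC as the cause.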
2. Measure client-side impact through request expirations
Check `kafka-producer-expires-per-second` and `kafka-consumer-expires-per-second` to quantify how many requests are timing out during GC pauses. If you see spikes of 10+ expirations/second correlating with GC events, clients are hitting their timeout thresholds (typically 30-60s for producers, shorter for consumers). Also monitor `kafka-expires-sec` for the overall expiration rate — any non-zero value indicates GC pauses are severe enough to cause client failures.
3. Check if request handler saturation is amplifying GC impact
Look at `kafka-request-handler-avg-idle-pct-rate` — if it's below 0.2 (20% idle), your brokers are already CPU-bound and GC pauses hit a system with no headroom. This creates a cascading effect where requests queue up in `kafka-request-channel-queue-size` during GC, taking even longer to drain afterward. The `request-handler-saturation-cascades-to-producer-latency` insight shows this pattern: saturated handlers + GC pauses = exponentially worse latency.
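The two signals in this step combine into a quick saturation check; a minimal sketch where the 0.2 idle threshold and the 100-request queue threshold are illustrative assumptions, not Kafka defaults:

```shell
# Hedged sketch: succeed (exit 0) when handler idle pct is below the
# threshold AND the request queue is backed up, i.e. GC pauses will hit
# a broker with no headroom. Thresholds (0.2 idle, 100 queued) are
# illustrative only.
handlers_saturated() {
  idle_pct="$1"; queue_size="$2"
  awk -v i="$idle_pct" -v q="$queue_size" \
    'BEGIN { exit !(i < 0.2 && q > 100) }'
}
# usage: handlers_saturated 0.15 250 && echo "saturated: GC will compound"
```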
4. Examine log flush patterns for memory pressure
Monitor `kafka-log-flush-rate-rate` to see if aggressive flushing is contributing to heap churn. If you're flushing more than once per second per topic-partition, you're creating memory allocation pressure that triggers more frequent GC. The `log-flush-latency-causing-write-stalls` insight explains how excessive flushing not only causes direct I/O waits but also increases GC frequency by churning through buffer memory faster.
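If explicit flush intervals turn out to be set, the usual remedy is to remove them and let the OS page cache absorb writes; Kafka's stock defaults leave both intervals effectively unset, deferring flushing to the OS. A hedged `server.properties` sketch, where the low values shown are examples of settings that would cause the churn described above, not defaults:

```properties
# Aggressive explicit flush settings like these force frequent fsyncs
# and buffer churn (example values, not defaults):
#   log.flush.interval.messages=10000
#   log.flush.interval.ms=1000
# Kafka's defaults leave both effectively unset
# (log.flush.interval.messages defaults to Long.MAX_VALUE), so removing
# the overrides defers flushing to the OS page cache.
```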
5. Analyze heap allocation patterns and GC type
With an 8GB heap and default GC settings on JDK 8, you're likely running ParallelGC, whose major collections stop all application threads. Check your GC logs for heap occupancy before each collection — if you're consistently hitting 80%+ utilization before major GC events, your heap is undersized for your workload. In my experience, Kafka brokers need heap sized to keep utilization under 70% during normal operation to avoid frequent stop-the-world collections.
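To check occupancy without full GC logs, `jstat -gc` output can be reduced to a single utilization figure. A minimal sketch; the column layout assumed here (S0C S1C S0U S1U EC EU OC OU ...) is the JDK 8/11 default and varies by version, so verify against your `jstat` header first:

```shell
# Hedged sketch: compute overall heap utilization (%) from `jstat -gc`
# output on stdin, using survivor + eden + old gen used vs. capacity.
# Column positions assume the JDK 8/11 layout.
heap_util_pct() {
  awk 'NR > 1 {
    used = $3 + $4 + $6 + $8   # S0U + S1U + EU + OU (KB)
    cap  = $1 + $2 + $5 + $7   # S0C + S1C + EC + OC (KB)
    printf "%.1f\n", 100 * used / cap
  }'
}
# usage: jstat -gc "$KAFKA_BROKER_PID" | heap_util_pct
```

Sustained readings above 70 here support resizing the heap as described in the next step.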
6. Tune JVM for G1GC with appropriate pause targets
Switch to G1GC with `-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:InitiatingHeapOccupancyPercent=35` to get more predictable, shorter pauses. For your 8GB heap with 200-500ms pauses, I'd increase to 12-16GB (set both `-Xms` and `-Xmx` to the same value) and target 100ms max pauses. G1GC will do more frequent but shorter collections, reducing the impact on request processing. Monitor the results against `kafka-request-produce-time-99percentile` and `kafka-request-fetch-consumer-time-99percentile` to verify P99 latency drops below 100ms consistently.
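Applied to the broker start scripts, the flags above can be set through the environment variables that `kafka-server-start.sh` (via `kafka-run-class.sh`) honors; the 12g heap follows the sizing suggestion above and should be adjusted to your workload:

```shell
# Hedged sketch: JVM settings for a Kafka broker, passed via the env
# vars the stock Kafka scripts read. Heap size (12g) is the example from
# the text; ExplicitGCInvokesConcurrent keeps any System.gc() calls from
# forcing a full stop-the-world collection under G1.
export KAFKA_HEAP_OPTS="-Xms12g -Xmx12g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=100 \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent"
```

Setting `-Xms` equal to `-Xmx` keeps the heap from resizing at runtime, which avoids resize-triggered full collections.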
Related Insights
High P99 request latency indicates broker performance issues
Request Handler Saturation Cascades to Producer Latency
When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
Log Flush Latency Spikes Causing Write Stalls
When log flush operations take excessive time, produce requests are delayed as Kafka waits for data to be flushed to disk, impacting producer latency and throughput.
JVM GC Pause Cascading to Kafka Consumer Lag
Prolonged JVM garbage collection pauses in DataHub consumers cause Kafka consumer lag to spike. This creates a cascading failure where metadata ingestion stalls, leading to stale catalog data and failed data quality checks.
Relevant Metrics
kafka.request.produce_time_99p
kafka.request.fetch_consumer_time_avg
kafka.request.fetch_follower_time_99p
kafka.request.update_metadata_time_avg
kafka.producer.request_expiration_rate
kafka.consumer.request_expiration_rate
kafka.expires_sec
kafka.request.metadata_time_99p
kafka.network.processor_idle_percent
kafka.request.handler_idle_percent
kafka.request.channel.queue.size
kafka.log.flush_rate
Monitoring Interfaces
Kafka Native