Heap Memory and Garbage Collection Tuning
Proactive Health
Kafka broker JVM is experiencing frequent or long garbage collection pauses, impacting request processing and causing latency spikes.
Prompt: “My Kafka brokers are experiencing GC pauses of 200-500ms every few minutes, causing request timeouts. Current heap is set to 8GB with default GC settings. How should I tune the JVM for better performance?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing Kafka JVM GC issues, start by confirming that GC pauses correlate with request latency spikes, then assess the downstream impact on client timeouts and broker capacity. Check for contributing factors like log flush pressure before tuning heap size and switching to G1GC with appropriate pause time targets.
1. Correlate GC pause timing with request latency spikes
First, confirm GC is the root cause by checking if your 200-500ms GC pauses align with spikes in `kafka-request-produce-time-99percentile` and `kafka-request-fetch-consumer-time-99percentile`. If P99 latency jumps from ~50ms to 200-500ms at the same intervals as GC pauses, you've confirmed the connection. The `jvm-gc-pause-cascading-to-kafka-consumer-lag` insight shows this creates cascading failures where even short pauses cause measurable consumer lag.
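To make this correlation concrete, you can pull pause durations straight out of the broker's GC log. A minimal sketch, assuming JDK 9+ unified GC logging (enabled with `-Xlog:gc*:file=gc.log`) where pause-summary lines end in a millisecond value; the log path and the 200ms threshold are illustrative, not Kafka defaults:

```shell
# Hedged sketch: print GC pause lines longer than a threshold from a
# JDK 9+ unified GC log. Assumes pause-summary lines end in "<n>ms",
# e.g. "... 4521M->2103M(8192M) 312.456ms".
long_gc_pauses() {
  log="$1"; limit_ms="$2"
  awk -v limit="$limit_ms" '/Pause/ {
    ms = $NF                 # e.g. "312.456ms"
    sub(/ms$/, "", ms)
    if (ms + 0 > limit) print
  }' "$log"
}
# usage: long_gc_pauses /var/log/kafka/gc.log 200
```

Line up the timestamps this prints against the spikes in the P99 request-time metrics; a consistent match confirms GC as the cause.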
2. Measure client-side impact through request expirations
Check `kafka-producer-expires-per-second` and `kafka-consumer-expires-per-second` to quantify how many requests are timing out during GC pauses. If you see spikes of 10+ expirations/second correlating with GC events, clients are hitting their timeout thresholds (typically 30-60s for producers, shorter for consumers). Also monitor `kafka-expires-sec` for the overall expiration rate — any non-zero value indicates GC pauses are severe enough to cause client failures.
3. Check if request handler saturation is amplifying GC impact
Look at `kafka-request-handler-avg-idle-pct-rate` — if it's below 0.2 (20% idle), your brokers are already CPU-bound and GC pauses hit a system with no headroom. This creates a cascading effect where requests queue up in `kafka-request-channel-queue-size` during GC, taking even longer to drain afterward. The `request-handler-saturation-cascades-to-producer-latency` insight shows this pattern: saturated handlers + GC pauses = exponentially worse latency.
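The two signals in this step combine into a quick saturation check; a minimal sketch where the 0.2 idle threshold and the 100-request queue threshold are illustrative assumptions, not Kafka defaults:

```shell
# Hedged sketch: succeed (exit 0) when handler idle pct is below the
# threshold AND the request queue is backed up, i.e. GC pauses will hit
# a broker with no headroom. Thresholds (0.2 idle, 100 queued) are
# illustrative only.
handlers_saturated() {
  idle_pct="$1"; queue_size="$2"
  awk -v i="$idle_pct" -v q="$queue_size" \
    'BEGIN { exit !(i < 0.2 && q > 100) }'
}
# usage: handlers_saturated 0.15 250 && echo "saturated: GC will compound"
```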
4. Examine log flush patterns for memory pressure
Monitor `kafka-log-flush-rate-rate` to see if aggressive flushing is contributing to heap churn. If you're flushing more than once per second per topic-partition, you're creating memory allocation pressure that triggers more frequent GC. The `log-flush-latency-causing-write-stalls` insight explains how excessive flushing not only causes direct I/O waits but also increases GC frequency by churning through buffer memory faster.
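If explicit flush intervals turn out to be set, the usual remedy is to remove them and let the OS page cache absorb writes; Kafka's stock defaults leave both intervals effectively unset, deferring flushing to the OS. A hedged `server.properties` sketch, where the low values shown are examples of settings that would cause the churn described above, not defaults:

```properties
# Aggressive explicit flush settings like these force frequent fsyncs
# and buffer churn (example values, not defaults):
#   log.flush.interval.messages=10000
#   log.flush.interval.ms=1000
# Kafka's defaults leave both effectively unset
# (log.flush.interval.messages defaults to Long.MAX_VALUE), so removing
# the overrides defers flushing to the OS page cache.
```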
5. Analyze heap allocation patterns and GC type
With an 8GB heap and default GC settings on JDK 8, you're likely running ParallelGC, whose major collections stop all application threads. Check your GC logs for heap occupancy before each collection — if you're consistently hitting 80%+ utilization before major GC events, your heap is undersized for your workload. In my experience, Kafka brokers need heap sized to keep utilization under 70% during normal operation to avoid frequent stop-the-world collections.
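To check occupancy without full GC logs, `jstat -gc` output can be reduced to a single utilization figure. A minimal sketch; the column layout assumed here (S0C S1C S0U S1U EC EU OC OU ...) is the JDK 8/11 default and varies by version, so verify against your `jstat` header first:

```shell
# Hedged sketch: compute overall heap utilization (%) from `jstat -gc`
# output on stdin, using survivor + eden + old gen used vs. capacity.
# Column positions assume the JDK 8/11 layout.
heap_util_pct() {
  awk 'NR > 1 {
    used = $3 + $4 + $6 + $8   # S0U + S1U + EU + OU (KB)
    cap  = $1 + $2 + $5 + $7   # S0C + S1C + EC + OC (KB)
    printf "%.1f\n", 100 * used / cap
  }'
}
# usage: jstat -gc "$KAFKA_BROKER_PID" | heap_util_pct
```

Sustained readings above 70 here support resizing the heap as described in the next step.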
6. Tune JVM for G1GC with appropriate pause targets
Switch to G1GC with `-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:InitiatingHeapOccupancyPercent=35` to get more predictable, shorter pauses. For your 8GB heap with 200-500ms pauses, I'd increase to 12-16GB (set both `-Xms` and `-Xmx` to the same value) and target 100ms max pauses. G1GC will do more frequent but shorter collections, reducing the impact on request processing. Monitor the results against `kafka-request-produce-time-99percentile` and `kafka-request-fetch-consumer-time-99percentile` to verify P99 latency drops below 100ms consistently.
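Applied to the broker start scripts, the flags above can be set through the environment variables that `kafka-server-start.sh` (via `kafka-run-class.sh`) honors; the 12g heap follows the sizing suggestion above and should be adjusted to your workload:

```shell
# Hedged sketch: JVM settings for a Kafka broker, passed via the env
# vars the stock Kafka scripts read. Heap size (12g) is the example from
# the text; ExplicitGCInvokesConcurrent keeps any System.gc() calls from
# forcing a full stop-the-world collection under G1.
export KAFKA_HEAP_OPTS="-Xms12g -Xmx12g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=100 \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent"
```

Setting `-Xms` equal to `-Xmx` keeps the heap from resizing at runtime, which avoids resize-triggered full collections.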
Related Insights
High P99 request latency indicates broker performance issues
Request Handler Saturation Cascades to Producer Latency
When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
Log Flush Latency Spikes Causing Write Stalls
When log flush operations take excessive time, produce requests are delayed as Kafka waits for data to be flushed to disk, impacting producer latency and throughput.
JVM GC Pause Cascading to Kafka Consumer Lag
Prolonged JVM garbage collection pauses in DataHub consumers cause Kafka consumer lag to spike. This creates a cascading failure where metadata ingestion stalls, leading to stale catalog data and failed data quality checks.
Relevant Metrics
kafka.request.produce_time_99p
kafka.request.fetch_consumer_time_avg
kafka.request.fetch_follower_time_99p
kafka.request.update_metadata_time_avg
kafka.producer.request_expiration_rate
kafka.consumer.request_expiration_rate
kafka.expires_sec
kafka.request.metadata_time_99p
kafka.network.processor_idle_percent
kafka.request.handler_idle_percent
kafka.request.channel.queue.size
kafka.log.flush_rate
Monitoring Interfaces
Kafka Native