JVM GC Pause Cascading to Kafka Consumer Lag
warning | reliability | Updated Oct 8, 2025
Prolonged JVM garbage collection (GC) pauses in DataHub consumers halt message processing for the duration of each pause, so Kafka consumer lag spikes. The failure then cascades: metadata ingestion stalls, catalog data goes stale, and data quality checks fail.
How to detect:
Monitor the duration and frequency of jvm_gc_pause. When a GC pause exceeds 1s, check kafka_consumer_lag for a spike in the same time window. Two further signals support the diagnosis: jvm_memory_used approaching the heap limit, and rising messaging_process_time latency.
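The correlation step can be sketched in Python. This is a minimal illustration, not DataHub tooling: the `Sample` type, the `find_cascades` function, the 60-second window, and the "lag doubles" spike rule are all assumptions made for the example. Only the 1s pause threshold comes from the text above. It expects lag samples sorted by timestamp, already exported from your metrics backend.

```python
from dataclasses import dataclass

GC_PAUSE_THRESHOLD_S = 1.0   # from the detection rule: pauses above 1s
CORRELATION_WINDOW_S = 60.0  # assumption: how long after a pause to look for a spike
LAG_SPIKE_RATIO = 2.0        # assumption: lag at least doubling counts as a spike


@dataclass
class Sample:
    ts: float     # timestamp in seconds
    value: float  # jvm_gc_pause seconds, or kafka_consumer_lag in messages


def find_cascades(gc_pauses, lag_samples):
    """Return timestamps of long GC pauses that are followed by a lag spike.

    lag_samples must be sorted by ts (hypothetical input shape).
    """
    cascades = []
    for pause in gc_pauses:
        if pause.value <= GC_PAUSE_THRESHOLD_S:
            continue  # short pause, not of interest
        before = [s.value for s in lag_samples if s.ts <= pause.ts]
        after = [
            s.value
            for s in lag_samples
            if pause.ts < s.ts <= pause.ts + CORRELATION_WINDOW_S
        ]
        if not before or not after:
            continue  # not enough lag data around this pause
        # Baseline is the last lag reading at or before the pause;
        # the floor of 1.0 avoids flagging noise when lag is near zero.
        baseline = max(before[-1], 1.0)
        if max(after) >= baseline * LAG_SPIKE_RATIO:
            cascades.append(pause.ts)
    return cascades
```

For example, a 1.8s pause at t=200 followed by lag jumping from 120 to 900 is flagged, while a 2.5s pause with flat lag afterwards is not. The window and ratio would need tuning against real traffic.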
Recommended action:
Tune the JVM heap settings (for example JAVA_OPTS: '-Xms4g -Xmx6g -XX:+UseG1GC'). Keep the maximum heap (-Xmx) at about 50% of the container memory limit; with -Xmx6g that implies a container limit of about 12 GiB. Scale consumer replicas to distribute load. Enable ENTITY_SERVICE_ENABLE_CACHE to reduce memory pressure. If heap pressure persists, consider raising the container memory limit.
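The 50% guideline can be checked with a small Python sketch. The helper names `max_heap_mib` and `heap_fraction` are hypothetical, and the parser only handles -Xmx values with a k, m, or g suffix; it is an illustration of the arithmetic, not a full JVM option parser.

```python
import re

# MiB per unit suffix accepted after -Xmx (case-insensitive).
_UNITS_MIB = {"k": 1 / 1024, "m": 1, "g": 1024}


def max_heap_mib(java_opts: str) -> float:
    """Extract the -Xmx value from a JAVA_OPTS string, in MiB."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx flag with a k/m/g suffix found")
    return int(match.group(1)) * _UNITS_MIB[match.group(2).lower()]


def heap_fraction(java_opts: str, container_memory_mib: float) -> float:
    """Return max heap as a fraction of the container memory limit."""
    if container_memory_mib <= 0:
        raise ValueError("container_memory_mib must be positive")
    return max_heap_mib(java_opts) / container_memory_mib


if __name__ == "__main__":
    opts = "-Xms4g -Xmx6g -XX:+UseG1GC"
    # A 12 GiB container (12288 MiB) puts a 6 GiB heap at the 50% target.
    print(heap_fraction(opts, 12 * 1024))
```

With the same options in an 8 GiB container the fraction rises to 0.75, which leaves less room for off-heap memory than the guideline intends.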