Apache Flink
Kubernetes

JVM GC Death Spiral

critical
Resource Contention
Updated Jan 22, 2025

Heavy garbage collection pressure on TaskManagers slows processing, which creates backpressure and lets state grow. The larger state puts still more pressure on the heap, and the cycle eventually ends in full GC pauses lasting minutes.

How to detect:

Monitor flink_taskmanager_status_jvm_memory_heap_used as it approaches flink_taskmanager_status_jvm_memory_heap_max. Correlate heap usage with rising GC time and falling throughput (flink_operator_recordsoutpersec). When heap usage exceeds 85% of the maximum while GC time grows and throughput drops, a GC spiral is imminent. This is especially critical for jobs using the HashMap state backend, which stores state on the JVM heap.
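The detection rule above can be sketched as a small check over a window of metric samples. This is a minimal illustration, not a production alert: the `Sample` class, its field names, and the `gc_spiral_imminent` function are hypothetical, the GC time is assumed to be a cumulative counter in milliseconds, and the comparison of first and last values in the window is a deliberately crude stand-in for a proper trend calculation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of a single TaskManager's metrics (hypothetical container type)."""
    heap_used: float            # bytes, flink_taskmanager_status_jvm_memory_heap_used
    heap_max: float             # bytes, flink_taskmanager_status_jvm_memory_heap_max
    gc_time_ms: float           # cumulative GC time in ms (assumed to be a counter)
    records_out_per_sec: float  # flink_operator_recordsoutpersec


def gc_spiral_imminent(samples: Sequence[Sample], heap_threshold: float = 0.85) -> bool:
    """Return True when all three warning signs from the text hold at once.

    Needs at least three samples, ordered oldest to newest, so that two
    GC-time deltas can be compared.
    """
    if len(samples) < 3:
        return False

    latest = samples[-1]
    if latest.heap_max <= 0:
        return False  # no usable heap maximum reported

    # Sign 1: heap usage above the threshold (85% by default).
    heap_high = latest.heap_used / latest.heap_max > heap_threshold

    # Sign 2: GC time spent per scrape interval is growing across the window.
    gc_deltas = [b.gc_time_ms - a.gc_time_ms for a, b in zip(samples, samples[1:])]
    gc_growing = gc_deltas[-1] > gc_deltas[0]

    # Sign 3: throughput is lower at the end of the window than at the start.
    throughput_dropping = latest.records_out_per_sec < samples[0].records_out_per_sec

    return heap_high and gc_growing and throughput_dropping
```

Requiring all three conditions together keeps a merely full heap, which can be normal for a healthy JVM just before a collection, from triggering the check on its own.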

Recommended action:

For jobs on a heap-based state backend, switch to RocksDB to move state off-heap. Review state retention policies to prevent unbounded growth. Check user code for memory leaks, such as threads or objects that are never released. If the state growth is legitimate, increase the TaskManager memory allocation. Monitor each TaskManager's memory via container_memory_working_set_bytes so that pressure is visible before JVM limits are hit.
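As a rough illustration of the container-level monitoring step, the sketch below computes how much headroom a TaskManager container has left. It assumes the working set comes from container_memory_working_set_bytes and that the container's memory limit is available separately (for example from the cAdvisor metric container_spec_memory_limit_bytes). The function names and the 10% minimum headroom are hypothetical choices, not values from the source.

```python
def memory_headroom(working_set_bytes: float, limit_bytes: float) -> float:
    """Fraction of the container memory limit that is still unused.

    A value of 0.25 means 25% of the limit is free. Raises ValueError when
    no positive limit is supplied, since headroom is undefined in that case.
    """
    if limit_bytes <= 0:
        raise ValueError("container memory limit must be positive")
    return 1.0 - working_set_bytes / limit_bytes


def needs_attention(
    working_set_bytes: float,
    limit_bytes: float,
    min_headroom: float = 0.10,  # assumed threshold; tune per workload
) -> bool:
    """Return True when remaining headroom falls below min_headroom."""
    return memory_headroom(working_set_bytes, limit_bytes) < min_headroom
```

Evaluating this per TaskManager pod rather than on a cluster-wide average matters, because a single skewed TaskManager can run out of memory while the aggregate still looks healthy.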