JVM GC Death Spiral
Severity: critical
High garbage collection pressure on TaskManagers causes processing slowdowns that create backpressure, increased state size, and eventually full GC pauses lasting minutes.
Watch for flink_taskmanager_status_jvm_memory_heap_used approaching flink_taskmanager_status_jvm_memory_heap_max, and correlate it with rising GC time and falling throughput (flink_operator_recordsoutpersec). When heap usage exceeds 85% of the maximum while GC time grows and throughput drops, a GC spiral is imminent. This is especially critical for jobs using the HashMap state backend, which stores state on the JVM heap.
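The detection rule can be sketched as a small check over a window of metric samples. This is a minimal illustration, not an official Flink or Prometheus API: the Sample type, its field names, and the gc_spiral_imminent function are hypothetical, and comparing only the first and last sample of the window is a simplifying assumption about what counts as "growing" or "dropping".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of the relevant metrics (hypothetical container type)."""
    heap_used: float            # bytes, e.g. from the heap_used metric
    heap_max: float             # bytes, e.g. from the heap_max metric
    gc_time_ms: float           # GC time spent during the sampling interval
    records_out_per_sec: float  # operator throughput


def gc_spiral_imminent(samples: Sequence[Sample],
                       heap_threshold: float = 0.85) -> bool:
    """Return True when heap is above the threshold while GC time has
    risen and throughput has fallen across the window.

    Assumption: the trend is judged by comparing the oldest and newest
    sample only; a production alert would likely smooth over the window.
    """
    if len(samples) < 2:
        return False  # a single sample cannot show a trend
    first, last = samples[0], samples[-1]
    if last.heap_max <= 0:
        return False  # heap maximum unknown, ratio is undefined
    heap_high = last.heap_used / last.heap_max > heap_threshold
    gc_growing = last.gc_time_ms > first.gc_time_ms
    throughput_dropping = last.records_out_per_sec < first.records_out_per_sec
    return heap_high and gc_growing and throughput_dropping
```

Requiring all three conditions together avoids alerting on a heap that is simply full but healthy, or on a throughput drop that has nothing to do with GC.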
For heap-based state backends, switch to RocksDB to move state off-heap. Review state retention policies to prevent unbounded state growth. Check user code for memory leaks, such as threads or objects that are never released. If the state growth is legitimate, increase the TaskManager memory allocation. Monitor each TaskManager's memory via container_memory_working_set_bytes so that pressure is caught before JVM limits are hit.
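To act on container memory before limits are hit, one option is to extrapolate how long the working set has until it reaches a configured limit. The sketch below assumes linear growth between two observations of container_memory_working_set_bytes; the function name seconds_until_limit and its parameters are hypothetical, and the limit value is whatever memory limit applies to the TaskManager container.

```python
from typing import Optional


def seconds_until_limit(t0: float, bytes0: float,
                        t1: float, bytes1: float,
                        limit_bytes: float) -> Optional[float]:
    """Estimate seconds until the working set reaches limit_bytes.

    t0/t1 are observation times in seconds, bytes0/bytes1 the working
    set at those times. Assumes growth stays linear, which is only a
    rough approximation of real memory behaviour.
    Returns None when memory is flat or shrinking.
    """
    if t1 <= t0:
        raise ValueError("t1 must be later than t0")
    if bytes1 >= limit_bytes:
        return 0.0  # already at or over the limit
    rate = (bytes1 - bytes0) / (t1 - t0)  # bytes per second
    if rate <= 0:
        return None  # no growth, so no projected time to the limit
    return (limit_bytes - bytes1) / rate
```

A short projected time suggests a leak or unbounded state, while steady, slow growth is more consistent with legitimate state growth that calls for a larger allocation.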