JVM GC Death Spiral

critical

Resource Contention

Updated Jan 22, 2025

High garbage collection pressure on TaskManagers slows processing, which creates backpressure and lets state grow. The larger heap footprint in turn adds GC pressure, and the cycle eventually ends in full GC pauses lasting minutes.

Sources

Troubleshooting Apache Flink Applications: Identifying Bottlenecks (medium.com)

Troubleshoot performance issues - Managed Service for Apache Flink (docs.aws.amazon.com)

Monitoring Large-Scale Apache Flink Applications, Part 2 (www.ververica.com)

Technologies:

Apache Flink: The root cause of this issue originates in Apache Flink.

Kubernetes: Kubernetes metrics correlate with this issue and help confirm the diagnosis.

How to detect:

Monitor flink_taskmanager_status_jvm_memory_heap_used as it approaches flink_taskmanager_status_jvm_memory_heap_max, and correlate it with rising GC time and falling throughput (flink_operator_recordsoutpersec). When heap usage exceeds 85% of the maximum while GC time grows and throughput drops, a GC spiral is imminent. This is especially critical for jobs using the HashMap state backend, which stores state on the JVM heap.
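
The detection rule above can be sketched as a small check over a window of metric samples. This is a minimal illustration, not a Flink or Prometheus API: the Sample type, its field names, and the gc_spiral_imminent function are hypothetical, and GC time is assumed to be already converted to a per-second rate. Only the 85% heap threshold comes from the text above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of the metrics named above (field names are hypothetical)."""
    heap_used_bytes: float      # flink_taskmanager_status_jvm_memory_heap_used
    heap_max_bytes: float       # flink_taskmanager_status_jvm_memory_heap_max
    gc_ms_per_sec: float        # GC time as a rate (assumed pre-computed)
    records_out_per_sec: float  # flink_operator_recordsoutpersec


def gc_spiral_imminent(window: Sequence[Sample], heap_threshold: float = 0.85) -> bool:
    """Return True when all three signals from the text hold over the window.

    Compares the newest sample with the oldest one: heap usage above the
    threshold, GC time growing, and throughput dropping.
    """
    if len(window) < 2:
        return False  # a trend needs at least two samples
    oldest, newest = window[0], window[-1]
    if newest.heap_max_bytes <= 0:
        return False  # heap max unknown or undefined; cannot compute a ratio
    heap_hot = newest.heap_used_bytes / newest.heap_max_bytes > heap_threshold
    gc_growing = newest.gc_ms_per_sec > oldest.gc_ms_per_sec
    throughput_dropping = newest.records_out_per_sec < oldest.records_out_per_sec
    return heap_hot and gc_growing and throughput_dropping
```

Requiring all three signals together keeps a large but stable heap from triggering the alert on its own. A production rule would more likely smooth each series over several scrapes rather than compare two endpoints.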

Recommended action:

For heap-based state backends, switch to RocksDB to move state off-heap. Review state retention policies, such as state TTL, to prevent unbounded growth. Check user code for memory leaks, for example threads or objects that are never released. If the state growth is legitimate, increase the TaskManager memory allocation. Monitor each TaskManager's memory via container_memory_working_set_bytes so that pressure is visible before JVM limits are hit.
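
The per-TaskManager memory check in the last recommendation could look roughly like the sketch below. It assumes the working set (container_memory_working_set_bytes) and the container memory limit have already been fetched for each pod. The function names, the input shape, and the 10% headroom default are illustrative assumptions, not values from the sources.

```python
from typing import List, Mapping, Tuple


def memory_headroom(working_set_bytes: float, limit_bytes: float) -> float:
    """Fraction of the container memory limit that is still unused."""
    if limit_bytes <= 0:
        raise ValueError("container memory limit must be positive")
    return 1.0 - working_set_bytes / limit_bytes


def taskmanagers_near_limit(
    usage: Mapping[str, Tuple[float, float]],
    min_headroom: float = 0.10,  # assumed alerting threshold, tune per deployment
) -> List[str]:
    """Return pod names whose headroom is below min_headroom, sorted by name.

    usage maps a pod name to (working_set_bytes, limit_bytes).
    """
    return sorted(
        pod
        for pod, (working_set, limit) in usage.items()
        if memory_headroom(working_set, limit) < min_headroom
    )
```

Checking each pod separately matters because state skew can push a single TaskManager toward its limit while the cluster-wide average still looks healthy.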