RocksDB Write Stall Cascade
Warning: Slow RocksDB flushes cause write stalls that propagate upstream as backpressure, degrading throughput and increasing checkpoint durations.
Monitor the RocksDB metrics actual-delayed-write-rate, num-running-flushes, and mem-table-flush-pending. A non-zero actual-delayed-write-rate means RocksDB is throttling writes. If it stays non-zero while num-running-flushes remains low relative to pending flushes, the flushes are not keeping up and disk I/O is likely insufficient. Confirm by correlating with rising flink_task_checkpointalignmenttime and falling flink_operator_recordsoutpersec.
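The detection rule above can be sketched in Python. This is a minimal illustration, not a reference implementation: the `RocksDBSample` type, the function names, the `max_running_flushes` threshold, and the first-versus-last trend check are all assumptions introduced here, and the samples are assumed to have been scraped from your metrics system over one observation window.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RocksDBSample:
    """One scrape of the three RocksDB metrics (hypothetical container)."""
    actual_delayed_write_rate: int
    num_running_flushes: int
    mem_table_flush_pending: int


def flush_stalled(samples: Sequence[RocksDBSample],
                  max_running_flushes: int = 1) -> bool:
    """True if every sample shows throttled writes with a flush pending
    while few flushes are actually running. The threshold is an assumption."""
    if not samples:
        return False
    return all(
        s.actual_delayed_write_rate > 0
        and s.mem_table_flush_pending > 0
        and s.num_running_flushes <= max_running_flushes
        for s in samples
    )


def is_rising(values: Sequence[float]) -> bool:
    """Crude trend check: the last value is higher than the first."""
    return len(values) >= 2 and values[-1] > values[0]


def stall_cascade_suspected(samples: Sequence[RocksDBSample],
                            alignment_time_ms: Sequence[float],
                            records_out_per_sec: Sequence[float]) -> bool:
    """Combine the RocksDB signal with the two Flink-side symptoms:
    alignment time going up and output rate going down."""
    falling_output = is_rising(list(reversed(records_out_per_sec)))
    return (flush_stalled(samples)
            and is_rising(alignment_time_ms)
            and falling_output)
```

A real alert would likely use a rate or regression over the window instead of comparing the first and last points; the simple comparison is kept here only to make the correlation explicit.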
Increase RocksDB background thread concurrency via state.backend.rocksdb.thread.num so that flushes and compactions can saturate the available disk I/O. Verify that disk throughput is adequate for the workload. Consider enabling incremental checkpointing to reduce checkpoint pressure. Review RocksDB tuning for write-heavy workloads, balancing write, read, and space amplification.
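As a rough sketch of the first and third remediation steps, the helper below renders the two settings as `key: value` lines suitable for a Flink configuration file. The function name and the example thread count of 4 are hypothetical. The incremental checkpointing key `state.backend.incremental` is an assumption about your Flink version; newer releases may prefer a different key, so check the documentation for the version you run.

```python
def render_rocksdb_overrides(thread_num: int, incremental: bool = True) -> str:
    """Render Flink config overrides as 'key: value' lines.

    thread_num is illustrative; pick it based on measured disk headroom.
    """
    if thread_num < 1:
        raise ValueError("thread_num must be at least 1")
    overrides = {
        # Key named in the remediation text above.
        "state.backend.rocksdb.thread.num": str(thread_num),
        # Assumed key for incremental checkpointing; verify for your version.
        "state.backend.incremental": "true" if incremental else "false",
    }
    return "\n".join(f"{key}: {value}" for key, value in overrides.items())


if __name__ == "__main__":
    print(render_rocksdb_overrides(4))
```

Raise the thread count in small steps and re-check actual-delayed-write-rate after each change, since more background threads only help while the disk still has spare throughput.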