Log Flush Latency Spikes Causing Write Stalls
Severity: warning
When log flush operations take too long, produce requests are delayed while Kafka waits for data to be flushed to disk, which increases producer latency and reduces throughput.
Watch for kafka.log.flush_rate dropping while kafka.request.produce_time_99p increases. Also check whether kafka.log.LogFlushStats.LogFlushRateAndTimeMs.Percentile95th consistently exceeds 100 ms.
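As a rough illustration of the detection rule above, the following Python sketch evaluates both signals over a window of samples. The function name, the way a trend is measured (second half of the window versus the first), and the reading of "consistently" as "at least 80% of samples" are assumptions made for this example, not part of any Kafka or monitoring-tool API. The metric values are expected to come from whatever system already collects them.

```python
from statistics import mean

FLUSH_P95_LIMIT_MS = 100.0  # threshold taken from the detection rule above


def _trend(values):
    """Mean of the second half of the window minus mean of the first half."""
    half = len(values) // 2
    return mean(values[half:]) - mean(values[:half])


def flush_stall_signals(flush_rate, produce_p99_ms, flush_p95_ms,
                        min_breach_ratio=0.8):
    """Evaluate the two detection signals over equally spaced samples.

    min_breach_ratio is a hypothetical definition of "consistently":
    the fraction of samples that must exceed the 100 ms limit.
    """
    if min(len(flush_rate), len(produce_p99_ms), len(flush_p95_ms)) < 2:
        raise ValueError("need at least two samples per metric")

    # Signal 1: flush rate falling while produce p99 latency rises.
    rate_drop_with_latency_rise = (
        _trend(flush_rate) < 0 and _trend(produce_p99_ms) > 0
    )

    # Signal 2: p95 flush time above the limit for most of the window.
    breaches = sum(1 for v in flush_p95_ms if v > FLUSH_P95_LIMIT_MS)
    flush_p95_over_limit = breaches / len(flush_p95_ms) >= min_breach_ratio

    return {
        "rate_drop_with_latency_rise": rate_drop_with_latency_rise,
        "flush_p95_over_limit": flush_p95_over_limit,
    }
```

The two signals are returned separately rather than combined, since whether either one alone should raise the warning depends on how noisy the metrics are in a given cluster.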
1. Check disk I/O: Monitor disk write latency and throughput on the volumes that hold the log directories.
2. Tune log.flush.interval.ms: Increase the interval to reduce flush frequency. This trades durability for performance.
3. Use faster storage: Consider SSD or NVMe for log directories.
4. Review RAID configuration: Ensure the RAID controller has write cache enabled.
5. Monitor filesystem: Check for filesystem issues or fragmentation.
6. Adjust OS I/O scheduler: Use the deadline or noop scheduler (mq-deadline or none on kernels that use the multi-queue block layer) for better performance.
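Steps 1 and 6 above can be checked on a Linux broker host with a short script. The sketch below is illustrative only: it derives average write latency from two samples of a device's line in /proc/diskstats (the same counters iostat uses for write await) and reads the active I/O scheduler from sysfs. The device name "sda" and the 5-second sampling interval are placeholders; substitute the device that backs the Kafka log directories.

```python
import time


def avg_write_latency_ms(before, after):
    """Average milliseconds per completed write between two samples of the
    same device's line in /proc/diskstats."""
    b, a = before.split(), after.split()
    if b[2] != a[2]:
        raise ValueError("samples are for different devices")
    writes = int(a[7]) - int(b[7])        # writes completed
    ms_writing = int(a[10]) - int(b[10])  # milliseconds spent writing
    return ms_writing / writes if writes > 0 else 0.0


def active_scheduler(sysfs_text):
    """Return the bracketed entry from /sys/block/<dev>/queue/scheduler,
    for example 'mq-deadline' from 'none [mq-deadline] kyber'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_diskstats_line(device):
    """Return the /proc/diskstats line for one device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 2 and fields[2] == device:
                return line
    raise LookupError(f"device {device!r} not found in /proc/diskstats")


if __name__ == "__main__":
    device = "sda"  # placeholder: use the device backing the log directories
    first = read_diskstats_line(device)
    time.sleep(5)   # arbitrary sampling interval for this sketch
    second = read_diskstats_line(device)
    print("avg write latency (ms):", avg_write_latency_ms(first, second))
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        print("active scheduler:", active_scheduler(f.read()))
```

A sustained average write latency in the tens of milliseconds on the log volume would be consistent with the flush-time symptom described above, though the acceptable range depends on the storage in use.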