Apache Pulsar

Storage Write Latency Spikes Indicate BookKeeper Bottleneck

critical
latency
Updated Dec 16, 2024

Storage write latency above 1s signals that BookKeeper cannot persist messages to its ledgers fast enough. The resulting bottleneck cascades into higher producer publish latency and lower overall throughput.

How to detect:

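Track the storage write latency buckets (pulsar_storage_write_latency_le_* in the broker's Prometheus output) alongside pulsar_storage_write_rate, paying particular attention to pulsar_storage_write_latency_overflow, which counts messages taking >1s to persist. Correlate with pulsar_bookie_write_size and pulsar_bookie_flush to determine whether the BookKeeper storage layer is saturated. Check pulsar_bookkeeper_server_add_entry_count to see whether write throughput has plateaued at a limit.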
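The bucket check above can be sketched as a small script over the broker's Prometheus text output. This is a minimal illustration, not an official tool. It assumes the bucket series are named pulsar_storage_write_latency_le_* plus pulsar_storage_write_latency_overflow, and that each bucket holds its own count rather than a cumulative one. The function names and the sample values are hypothetical.

```python
import re
from collections import defaultdict

# Matches samples such as:
#   pulsar_storage_write_latency_le_5{cluster="c1",namespace="t/ns"} 42
# Comment lines (# HELP / # TYPE) do not match because they start with '#'.
SAMPLE = re.compile(
    r'^(pulsar_storage_write_latency_(?:le_[0-9_]+|overflow))(?:\{[^}]*\})?\s+(\S+)'
)

def write_latency_buckets(metrics_text):
    """Sum each storage write latency bucket across all label sets."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)

def slow_write_fraction(totals):
    """Fraction of writes in the overflow (>1s) bucket.

    Assumes buckets are non-cumulative, so their sum is the total.
    """
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return totals.get("pulsar_storage_write_latency_overflow", 0.0) / total

if __name__ == "__main__":
    # Hypothetical scrape output, for illustration only.
    scrape = "\n".join([
        '# TYPE pulsar_storage_write_latency_le_5 gauge',
        'pulsar_storage_write_latency_le_5{namespace="public/default"} 90',
        'pulsar_storage_write_latency_le_1000{namespace="public/default"} 6',
        'pulsar_storage_write_latency_overflow{namespace="public/default"} 4',
    ])
    print(slow_write_fraction(write_latency_buckets(scrape)))
```

A sustained non-zero fraction is the signal to correlate with the bookie-side metrics. The alert threshold is a judgment call for each cluster.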

Recommended action:

Tune the BookKeeper journal settings (journalBufferedWritesThreshold, journalMaxGroupWaitMSec, journalWriteBufferSizeKB) to balance write batching against latency and durability. Spread ledger storage across multiple disks (ledgerDirectories) to distribute I/O load. Scale the BookKeeper cluster horizontally by adding bookies. Monitor bookie disk I/O utilization and consider moving to a faster storage tier if the disks stay saturated.
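To build intuition for the journal batching knobs, here is a simplified model of group commit. It is a sketch under stated assumptions, not BookKeeper's actual implementation. It assumes a batch is flushed either when the oldest buffered entry has waited journalMaxGroupWaitMSec or when the buffered bytes reach journalBufferedWritesThreshold (treated here as a byte count). The class and function names are hypothetical, and journalWriteBufferSizeKB is not modeled.

```python
from dataclasses import dataclass

@dataclass
class JournalGroupCommit:
    """Simplified group-commit policy; not BookKeeper's real code."""
    max_group_wait_ms: float        # stands in for journalMaxGroupWaitMSec
    buffered_writes_threshold: int  # stands in for journalBufferedWritesThreshold (bytes, assumed)

def simulate(policy, arrivals):
    """Return a list of (flush_time_ms, entries_in_batch).

    arrivals is a list of (arrival_ms, entry_bytes) sorted by arrival time.
    """
    flushes = []
    batch_start = None
    count = 0
    size = 0
    for arrival_ms, entry_bytes in arrivals:
        # Time trigger: the open batch expired before this entry arrived.
        if count and arrival_ms - batch_start >= policy.max_group_wait_ms:
            flushes.append((batch_start + policy.max_group_wait_ms, count))
            count, size = 0, 0
        if count == 0:
            batch_start = arrival_ms
        count += 1
        size += entry_bytes
        # Size trigger: enough bytes are buffered to flush immediately.
        if size >= policy.buffered_writes_threshold:
            flushes.append((arrival_ms, count))
            count, size = 0, 0
    if count:
        # Remaining entries go out when their wait window ends.
        flushes.append((batch_start + policy.max_group_wait_ms, count))
    return flushes

if __name__ == "__main__":
    # Hypothetical workload: 1 KiB entries arriving over a few milliseconds.
    arrivals = [(0.0, 1024), (0.5, 1024), (1.0, 1024), (1.5, 1024), (5.0, 1024)]
    print(simulate(JournalGroupCommit(2.0, 4096), arrivals))
    print(simulate(JournalGroupCommit(1.0, 4096), arrivals))
```

In this model, a longer wait or a larger threshold produces fewer, larger flushes at the cost of entries sitting longer before they are persisted. Changes to the real settings should be validated against measured journal latency on the actual hardware.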