Apache ZooKeeper insights
Open Source | Versions: [current] | 14 metrics
Apache Kafka (pre-KRaft) stores critical metadata in ZooKeeper including broker registrations, topic configurations, and controller election state. High ZooKeeper latency or unavailability directly impacts Kafka broker operations and can cause broker flapping or topic unavailability.
ZooKeeper ensemble health depends on maintaining a quorum of synchronized members. The leader counts toward quorum, so when the leader plus its synced_followers equals exactly the quorum size (a majority of the ensemble), the cluster is one failure away from losing write availability.
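The quorum arithmetic can be sketched as below. This is a minimal illustration, not an official formula from the ZooKeeper project: the function names are hypothetical, and it assumes that every ensemble member is a voting member and that the leader itself counts toward quorum.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the voting members."""
    return ensemble_size // 2 + 1


def quorum_margin(ensemble_size: int, synced_followers: int) -> int:
    """Number of in-sync members that can still be lost while keeping quorum.

    Assumption: the leader counts toward quorum, so the number of
    in-sync members is synced_followers + 1.
    A margin of 0 means the next failure costs write availability.
    """
    in_sync_members = synced_followers + 1
    return in_sync_members - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Five-node ensemble where only two followers are still in sync.
    print(quorum_margin(ensemble_size=5, synced_followers=2))
```

A margin of 0 is the "one failure away" condition described above; a negative margin would mean quorum is already lost.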
When outstanding_requests stays consistently above zero, ZooKeeper cannot keep up with incoming request load. This leads to increased latency and potential client session timeouts.
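One way to separate a sustained backlog from a momentary blip is to require several consecutive non-zero samples. The sketch below assumes the metric is polled at a fixed interval; the function name and the default of three samples are illustrative choices, not values from ZooKeeper itself.

```python
def backlog_is_sustained(samples, min_consecutive=3):
    """Return True if the most recent `min_consecutive` samples of
    outstanding_requests are all above zero.

    `samples` is ordered oldest to newest. The default of 3 is an
    assumed value; tune it to the polling interval in use.
    """
    if len(samples) < min_consecutive:
        return False
    return all(value > 0 for value in samples[-min_consecutive:])


if __name__ == "__main__":
    print(backlog_is_sustained([0, 5, 0, 2, 3, 1]))  # last three are non-zero
    print(backlog_is_sustained([4, 0, 2, 3]))        # a zero falls inside the window
```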
ZooKeeper's heartbeat mechanism means high server latency directly causes client disconnects. When average latency exceeds 50ms or max latency spikes above 100ms, clients may fail to send heartbeats within their session timeout window.
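The latency thresholds above can be checked against the output of ZooKeeper's `mntr` four-letter command, which reports tab-separated key/value pairs such as `zk_avg_latency` and `zk_max_latency`. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the sample text is made up for illustration rather than captured from a real server, and fetching the text over the network is left out.

```python
def parse_mntr(text):
    """Parse tab-separated `mntr` output into a dict of strings."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip lines that are not key<TAB>value pairs
        metrics[key.strip()] = value.strip()
    return metrics


def latency_alerts(metrics, avg_limit_ms=50.0, max_limit_ms=100.0):
    """Return a message for each latency metric above its limit.

    The defaults mirror the 50 ms average / 100 ms max thresholds
    quoted in the text; missing metrics are skipped.
    """
    limits = {"zk_avg_latency": avg_limit_ms, "zk_max_latency": max_limit_ms}
    alerts = []
    for key, limit in limits.items():
        if key in metrics and float(metrics[key]) > limit:
            alerts.append(f"{key} {metrics[key]}ms > {limit}ms")
    return alerts


# Made-up sample for illustration only, not real server output.
SAMPLE_MNTR = "zk_avg_latency\t62.5\nzk_max_latency\t80\nzk_outstanding_requests\t0\n"

if __name__ == "__main__":
    for alert in latency_alerts(parse_mntr(SAMPLE_MNTR)):
        print(alert)
```

Values are parsed with `float` because the average latency may be reported as either an integer or a decimal depending on the ZooKeeper version.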
A very high watch count (tens of thousands) consumes significant memory and causes latency spikes when many watches fire simultaneously. This is often caused by clients creating watches without proper cleanup.
When client connections are not evenly distributed across ensemble members, some nodes become overloaded while others remain underutilized, leading to performance degradation on heavily loaded nodes.
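Connection imbalance can be quantified as the ratio of the busiest node's connection count to the ensemble mean. The sketch below assumes per-node connection counts have already been collected (for example from each server's alive-connection metric); the function name and the host names in the example are hypothetical.

```python
def connection_skew(connections_by_node):
    """Ratio of the busiest node's connection count to the ensemble mean.

    1.0 means perfectly even distribution; larger values mean one
    node carries a disproportionate share. Returns 0.0 when there
    are no nodes or no connections at all.
    """
    counts = list(connections_by_node.values())
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return max(counts) / mean


if __name__ == "__main__":
    # Hypothetical three-node ensemble with uneven client placement.
    print(connection_skew({"zk1": 300, "zk2": 100, "zk3": 200}))
```

What skew value counts as a problem depends on node capacity, so any alert threshold built on this ratio would be a local tuning decision.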
ZooKeeper performance degrades when the total number of znodes approaches 1 million. High znode counts increase memory usage and can slow down operations, especially startup and snapshot creation.
ZooKeeper writes transaction logs synchronously to disk using fsync. High fsync times (>100ms average) directly block request processing and cause cascading latency issues across the entire ensemble.
Frequent or prolonged leader elections indicate network connectivity issues between ensemble members. During elections, the cluster cannot process write requests, causing application-level outages.
ZooKeeper opens file descriptors for client connections, log files, and internal operations. Approaching the OS file descriptor limit causes connection failures and operational issues.
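File descriptor headroom is easiest to track as a utilization ratio. The sketch below assumes the open and maximum descriptor counts are already available (on Unix-like systems `mntr` reports them as `zk_open_file_descriptor_count` and `zk_max_file_descriptor_count`); the 80% alert threshold is an assumed value, not one taken from the text.

```python
def fd_utilization(open_fds: int, max_fds: int) -> float:
    """Fraction of the file descriptor limit currently in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds


def fd_alert(open_fds: int, max_fds: int, threshold: float = 0.8) -> bool:
    """True when utilization reaches the threshold.

    The 0.8 default is an illustrative assumption.
    """
    return fd_utilization(open_fds, max_fds) >= threshold


if __name__ == "__main__":
    print(fd_utilization(512, 4096))
    print(fd_alert(3500, 4096))
```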
Long GC pauses halt all ZooKeeper threads including heartbeat processing, causing client sessions to expire. Even well-tuned JVMs can experience occasional long pauses that exceed typical session timeout windows.
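A simple way to relate GC pauses to session expiry risk is to flag pauses that consume a large share of the session timeout. This sketch assumes pause durations have already been extracted from JVM GC logs by some other means; the function name and the 50% safety fraction are hypothetical choices.

```python
def risky_pauses(pause_ms_list, session_timeout_ms, safety_fraction=0.5):
    """Return the pauses (in ms) that use up at least `safety_fraction`
    of the session timeout.

    A pause at or beyond the full timeout can expire sessions outright;
    shorter pauses are flagged because network delay and request
    queueing also eat into the same window. The 0.5 default is an
    assumed value.
    """
    limit = session_timeout_ms * safety_fraction
    return [pause for pause in pause_ms_list if pause >= limit]


if __name__ == "__main__":
    # Illustrative pause durations against a 6-second session timeout.
    print(risky_pauses([120, 3500, 800, 6100], session_timeout_ms=6000))
```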