ZooKeeper to KRaft Migration Planning

Severity: warning · Category: Migration

Migrating a ZooKeeper-based Kafka cluster to KRaft mode before ZooKeeper support is removed in Kafka 4.0, which requires careful planning and execution.

Prompt: We're still running Kafka with ZooKeeper and need to migrate to KRaft before upgrading to Kafka 4.0. What's involved in this migration and can we do it without downtime?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When planning a ZooKeeper to KRaft migration, start by assessing your current ZooKeeper health and stability — any existing issues will complicate the migration. Then baseline your controller stability and metadata operations to set performance expectations for KRaft. Finally, prepare KRaft-specific monitoring and test the migration procedure thoroughly in non-production before attempting it on production clusters.

1. Assess current ZooKeeper health and stability
Before migrating, check if your existing ZooKeeper setup is healthy. Monitor `kafka.zookeeper.expire_rate` and `kafka.zookeeper.disconnect_rate` — any non-zero values indicate instability that will complicate migration. The `kafka-dependency-on-zookeeper-latency-cascade` insight shows that ZooKeeper latency above 50ms correlates with broker state changes and controller elections. If you're seeing frequent session expirations or high latency, fix these issues first or be prepared for them to surface during migration.
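One quick way to baseline ZooKeeper health is its four-letter-word admin interface (`mntr`). The sketch below parses a captured sample of `mntr` output so the logic is self-contained; in a real run you would capture it live from your ZooKeeper hosts (hostnames and the 50ms threshold are illustrative):

```shell
# Sketch: check ZooKeeper latency from 'mntr' output.
# Live capture would be something like:
#   MNTR=$(echo mntr | nc zk-host 2181)
# Captured sample used here so the parsing is self-contained:
MNTR='zk_avg_latency 3
zk_max_latency 87
zk_outstanding_requests 0'

avg=$(printf '%s\n' "$MNTR" | awk '$1=="zk_avg_latency" {print $2}')
max=$(printf '%s\n' "$MNTR" | awk '$1=="zk_max_latency" {print $2}')

# Flag latency above the 50ms threshold that correlates with controller churn
if [ "$max" -gt 50 ]; then
  echo "WARN: zk_max_latency ${max}ms exceeds 50ms threshold"
else
  echo "OK: ZooKeeper latency within bounds"
fi
```

Run this against every ensemble member, not just one: a single slow follower can still stall quorum writes.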
2. Verify controller stability before migration
Check `kafka.replication.active_controller_count` — it should always be exactly 1. If you're seeing controller flapping (frequent changes in `kafka.cluster.controller_id`), this indicates underlying issues that KRaft won't magically fix. Controller instability in ZooKeeper mode often stems from network issues or resource constraints that will affect KRaft controllers too. Stabilize your current setup first to ensure a clean migration baseline.
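In ZooKeeper mode the active controller is recorded in the `/controller` znode, so you can cross-check your metrics against ZooKeeper directly. This sketch parses a sample of that znode's JSON payload; the live command and host names are assumptions for your deployment:

```shell
# Sketch: identify the active controller from ZooKeeper (ZK-mode clusters).
# Live command (adjust path and host):
#   kafka/bin/zookeeper-shell.sh zk-host:2181 get /controller
# Sample payload so the parsing below is self-contained:
CONTROLLER_JSON='{"version":1,"brokerid":3,"timestamp":"1700000000000"}'

# Extract the broker id without a jq dependency
controller_id=$(printf '%s' "$CONTROLLER_JSON" | sed -n 's/.*"brokerid":\([0-9]*\).*/\1/p')
echo "Active controller: broker ${controller_id}"
```

Sampling this periodically and diffing the broker id over time is a cheap way to detect controller flapping before you migrate.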
3. Plan KRaft controller deployment and resource allocation
KRaft requires dedicated controller nodes (typically 3 or 5 for quorum). Plan for fast local disk storage for the metadata log and sufficient CPU capacity. The `kraft-controller-metrics-not-configured` insight warns that controllers must be configured to export metrics from day one — you can't troubleshoot what you can't see. Controllers handle all metadata operations that ZooKeeper previously managed, so under-provisioning them will create a new bottleneck.
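As a reference point, a dedicated KRaft controller's `server.properties` might look like the sketch below. The node ids, hostnames, and paths are illustrative, not prescriptive:

```properties
# Sketch: dedicated KRaft controller configuration (illustrative values).
process.roles=controller
node.id=3001
controller.quorum.voters=3001@controller-1:9093,3002@controller-2:9093,3003@controller-3:9093
listeners=CONTROLLER://:9093
controller.listener.names=CONTROLLER
# Keep the metadata log on fast local disk, separate from broker data dirs
metadata.log.dir=/var/lib/kafka/kraft-metadata
```

The quorum voter list must match across all controllers, and every broker will also need it to locate the quorum.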
4. Set up KRaft-specific monitoring before migration
Configure monitoring for KRaft operational metrics before you migrate. Watch `kafka.raft.commit_latency_avg` (should stay below 1 second) and `kafka.raft.metadata_apply_error_count` (should always be zero). The `raft-commit-latency-spike-delays-metadata` and `raft-metadata-apply-errors-indicate-controller-issues` insights show these are your primary health signals for KRaft controllers. Having these dashboards ready before migration lets you quickly spot issues rather than scrambling to add monitoring mid-incident.
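If you alert through Prometheus, rules for the two primary signals might be sketched as follows. The metric names assume a JMX-exporter mapping and will differ per setup, so substitute whatever names your exporter emits:

```yaml
# Sketch: Prometheus alerting rules for KRaft controller health.
# Metric names are assumptions -- adjust to your exporter's naming.
groups:
  - name: kraft-controller-health
    rules:
      - alert: KRaftCommitLatencyHigh
        expr: kafka_raft_commit_latency_avg > 1000   # milliseconds; ~1s threshold
        for: 5m
        labels:
          severity: warning
      - alert: KRaftMetadataApplyErrors
        expr: kafka_raft_metadata_apply_error_count > 0   # should always be zero
        labels:
          severity: critical
```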
5. Test migration procedure in non-production environment
Execute the full migration procedure on a non-production cluster that mirrors your production topology. Time how long each phase takes and document any issues encountered. Pay special attention to metadata migration timing — large clusters with thousands of topics and partitions will take longer. This test run surfaces operational quirks such as the one described by the `kraft-unclean-leader-election-delay` insight: enabling unclean leader election in KRaft requires a manual trigger to avoid a 5-minute wait.
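For the dry run itself, the KIP-866 bridge-release migration is driven by configuration on the KRaft controllers. A minimal sketch of the migration-mode settings, with illustrative hosts and ids, might look like:

```properties
# Sketch: KRaft controller settings that enable ZooKeeper-to-KRaft
# migration mode on a 3.x bridge release (per KIP-866). Values illustrative.
process.roles=controller
node.id=3001
controller.quorum.voters=3001@controller-1:9093,3002@controller-2:9093,3003@controller-3:9093
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181
```

Brokers also need migration mode enabled and the controller quorum configured during the dual-write phase; rehearse that full sequence, not just the controller side.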
6. Validate rollback and contingency procedures
Before migrating production, document and test your rollback procedure. While KRaft migrations are generally one-way, you need a plan for what happens if you discover critical issues post-migration. Understand at what point in the migration process you can still roll back versus when you're committed to moving forward. Have runbooks ready for common KRaft issues like metadata apply errors or commit latency spikes so you can respond quickly if they occur.
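One concrete go/no-go check before finalizing the migration is quorum replication health via `kafka-metadata-quorum.sh`. The sketch below parses a captured sample of its `describe --status` output so the check is self-contained; the live command path and bootstrap server are assumptions:

```shell
# Sketch: verify KRaft quorum followers are caught up before committing.
# Live command (adjust path and bootstrap server):
#   kafka/bin/kafka-metadata-quorum.sh --bootstrap-server broker-1:9092 describe --status
# Sample output so the check below is self-contained:
STATUS='LeaderId:              3001
LeaderEpoch:           12
HighWatermark:         4096
MaxFollowerLag:        0'

lag=$(printf '%s\n' "$STATUS" | awk '$1=="MaxFollowerLag:" {print $2}')
if [ "$lag" -eq 0 ]; then
  echo "Quorum followers fully caught up"
else
  echo "WARN: follower lag of ${lag} records"
fi
```

A nonzero follower lag that never drains is exactly the kind of signal that should pause finalization while rollback is still possible.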


Related Insights

ZooKeeper Session Expiration Causing Broker Instability
critical
When ZooKeeper sessions expire, brokers re-register causing controller changes, partition leadership changes, and temporary unavailability. Frequent expirations indicate ZooKeeper or network issues.
Kafka Dependency on ZooKeeper Latency Cascade
critical
Apache Kafka (pre-KRaft) stores critical metadata in ZooKeeper including broker registrations, topic configurations, and controller election state. High ZooKeeper latency or unavailability directly impacts Kafka broker operations and can cause broker flapping or topic unavailability.
Broker/Partition counts missing in KRaft mode without controller metrics
warning
Raft Commit Latency Spike Delays Metadata Propagation
warning
In KRaft mode, high Raft commit latency delays metadata changes from propagating through the cluster, causing stale metadata and operational delays.
Raft Metadata Apply Errors Indicate Controller Issues
critical
In KRaft mode, metadata apply errors indicate the controller is failing to apply metadata changes, potentially causing inconsistent cluster state.
Unclean leader election delayed 5 minutes in KRaft mode without manual trigger
warning


Monitoring Interfaces

Kafka Prometheus
Kafka Datadog
Kafka Native