ZooKeeper to KRaft Migration Planning

Severity: warning · Category: Migration

Migrating a ZooKeeper-based Kafka cluster to KRaft mode before ZooKeeper support is removed in Kafka 4.0, which requires careful planning and execution.

Prompt: We're still running Kafka with ZooKeeper and need to migrate to KRaft before upgrading to Kafka 4.0. What's involved in this migration and can we do it without downtime?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When planning a ZooKeeper to KRaft migration, start by assessing your current ZooKeeper health and stability — any existing issues will complicate the migration. Then baseline your controller stability and metadata operations to set performance expectations for KRaft. Finally, prepare KRaft-specific monitoring and test the migration procedure thoroughly in non-production before attempting it on production clusters.

1. Assess current ZooKeeper health and stability
Before migrating, check if your existing ZooKeeper setup is healthy. Monitor `kafka.zookeeper.expire_rate` and `kafka.zookeeper.disconnect_rate` — any non-zero values indicate instability that will complicate migration. The `kafka-dependency-on-zookeeper-latency-cascade` insight shows that ZooKeeper latency above 50ms correlates with broker state changes and controller elections. If you're seeing frequent session expirations or high latency, fix these issues first or be prepared for them to surface during migration.
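One quick way to baseline ZooKeeper health is its four-letter-word admin interface (`mntr`). The sketch below parses a captured sample of `mntr` output so the logic is self-contained; in a real run you would capture it live from your ZooKeeper hosts (hostnames and the 50ms threshold are illustrative):

```shell
# Sketch: check ZooKeeper latency from 'mntr' output.
# Live capture would be something like:
#   MNTR=$(echo mntr | nc zk-host 2181)
# Captured sample used here so the parsing is self-contained:
MNTR='zk_avg_latency 3
zk_max_latency 87
zk_outstanding_requests 0'

avg=$(printf '%s\n' "$MNTR" | awk '$1=="zk_avg_latency" {print $2}')
max=$(printf '%s\n' "$MNTR" | awk '$1=="zk_max_latency" {print $2}')

# Flag latency above the 50ms threshold that correlates with controller churn
if [ "$max" -gt 50 ]; then
  echo "WARN: zk_max_latency ${max}ms exceeds 50ms threshold"
else
  echo "OK: ZooKeeper latency within bounds"
fi
```

Run this against every ensemble member, not just one: a single slow follower can still stall quorum writes.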
2. Verify controller stability before migration
Check `kafka.replication.active_controller_count` — it should always be exactly 1. If you're seeing controller flapping (frequent changes in `kafka.cluster.controller_id`), this indicates underlying issues that KRaft won't magically fix. Controller instability in ZooKeeper mode often stems from network issues or resource constraints that will affect KRaft controllers too. Stabilize your current setup first to ensure a clean migration baseline.
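In ZooKeeper mode the active controller is recorded in the `/controller` znode, so you can cross-check your metrics against ZooKeeper directly. This sketch parses a sample of that znode's JSON payload; the live command and host names are assumptions for your deployment:

```shell
# Sketch: identify the active controller from ZooKeeper (ZK-mode clusters).
# Live command (adjust path and host):
#   kafka/bin/zookeeper-shell.sh zk-host:2181 get /controller
# Sample payload so the parsing below is self-contained:
CONTROLLER_JSON='{"version":1,"brokerid":3,"timestamp":"1700000000000"}'

# Extract the broker id without a jq dependency
controller_id=$(printf '%s' "$CONTROLLER_JSON" | sed -n 's/.*"brokerid":\([0-9]*\).*/\1/p')
echo "Active controller: broker ${controller_id}"
```

Sampling this periodically and diffing the broker id over time is a cheap way to detect controller flapping before you migrate.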
3. Plan KRaft controller deployment and resource allocation
KRaft requires dedicated controller nodes (typically 3 or 5 for quorum). Plan for fast local disk storage for the metadata log and sufficient CPU capacity. The `kraft-controller-metrics-not-configured` insight warns that controllers must be configured to export metrics from day one — you can't troubleshoot what you can't see. Controllers handle all metadata operations that ZooKeeper previously managed, so under-provisioning them will create a new bottleneck.
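As a reference point, a dedicated KRaft controller's `server.properties` might look like the sketch below. The node ids, hostnames, and paths are illustrative, not prescriptive:

```properties
# Sketch: dedicated KRaft controller configuration (illustrative values).
process.roles=controller
node.id=3001
controller.quorum.voters=3001@controller-1:9093,3002@controller-2:9093,3003@controller-3:9093
listeners=CONTROLLER://:9093
controller.listener.names=CONTROLLER
# Keep the metadata log on fast local disk, separate from broker data dirs
metadata.log.dir=/var/lib/kafka/kraft-metadata
```

The quorum voter list must match across all controllers, and every broker will also need it to locate the quorum.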
4. Set up KRaft-specific monitoring before migration
Configure monitoring for KRaft operational metrics before you migrate. Watch `kafka.raft.commit_latency_avg` (should stay below 1 second) and `kafka.raft.metadata_apply_error_count` (should always be zero). The `raft-commit-latency-spike-delays-metadata` and `raft-metadata-apply-errors-indicate-controller-issues` insights show these are your primary health signals for KRaft controllers. Having these dashboards ready before migration lets you quickly spot issues rather than scrambling to add monitoring mid-incident.
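If you alert through Prometheus, rules for the two primary signals might be sketched as follows. The metric names assume a JMX-exporter mapping and will differ per setup, so substitute whatever names your exporter emits:

```yaml
# Sketch: Prometheus alerting rules for KRaft controller health.
# Metric names are assumptions -- adjust to your exporter's naming.
groups:
  - name: kraft-controller-health
    rules:
      - alert: KRaftCommitLatencyHigh
        expr: kafka_raft_commit_latency_avg > 1000   # milliseconds; ~1s threshold
        for: 5m
        labels:
          severity: warning
      - alert: KRaftMetadataApplyErrors
        expr: kafka_raft_metadata_apply_error_count > 0   # should always be zero
        labels:
          severity: critical
```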
5. Test migration procedure in non-production environment
Execute the full migration procedure on a non-production cluster that mirrors your production topology. Time how long each phase takes and document any issues encountered. Pay special attention to metadata migration timing — large clusters with thousands of topics and partitions will take longer. This test run surfaces operational quirks such as the one described by the `kraft-unclean-leader-election-delay` insight: enabling unclean leader election in KRaft requires a manual trigger to avoid a 5-minute wait.
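For the dry run itself, the KIP-866 bridge-release migration is driven by configuration on the KRaft controllers. A minimal sketch of the migration-mode settings, with illustrative hosts and ids, might look like:

```properties
# Sketch: KRaft controller settings that enable ZooKeeper-to-KRaft
# migration mode on a 3.x bridge release (per KIP-866). Values illustrative.
process.roles=controller
node.id=3001
controller.quorum.voters=3001@controller-1:9093,3002@controller-2:9093,3003@controller-3:9093
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181
```

Brokers also need migration mode enabled and the controller quorum configured during the dual-write phase; rehearse that full sequence, not just the controller side.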
6. Validate rollback and contingency procedures
Before migrating production, document and test your rollback procedure. While KRaft migrations are generally one-way, you need a plan for what happens if you discover critical issues post-migration. Understand at what point in the migration process you can still roll back versus when you're committed to moving forward. Have runbooks ready for common KRaft issues like metadata apply errors or commit latency spikes so you can respond quickly if they occur.
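One concrete go/no-go check before finalizing the migration is quorum replication health via `kafka-metadata-quorum.sh`. The sketch below parses a captured sample of its `describe --status` output so the check is self-contained; the live command path and bootstrap server are assumptions:

```shell
# Sketch: verify KRaft quorum followers are caught up before committing.
# Live command (adjust path and bootstrap server):
#   kafka/bin/kafka-metadata-quorum.sh --bootstrap-server broker-1:9092 describe --status
# Sample output so the check below is self-contained:
STATUS='LeaderId:              3001
LeaderEpoch:           12
HighWatermark:         4096
MaxFollowerLag:        0'

lag=$(printf '%s\n' "$STATUS" | awk '$1=="MaxFollowerLag:" {print $2}')
if [ "$lag" -eq 0 ]; then
  echo "Quorum followers fully caught up"
else
  echo "WARN: follower lag of ${lag} records"
fi
```

A nonzero follower lag that never drains is exactly the kind of signal that should pause finalization while rollback is still possible.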


Related Insights

ZooKeeper Session Expiration Causing Broker Instability
critical
When ZooKeeper sessions expire, brokers re-register causing controller changes, partition leadership changes, and temporary unavailability. Frequent expirations indicate ZooKeeper or network issues.
Kafka Dependency on ZooKeeper Latency Cascade
critical
Apache Kafka (pre-KRaft) stores critical metadata in ZooKeeper including broker registrations, topic configurations, and controller election state. High ZooKeeper latency or unavailability directly impacts Kafka broker operations and can cause broker flapping or topic unavailability.
Broker/Partition counts missing in KRaft mode without controller metrics
warning
Raft Commit Latency Spike Delays Metadata Propagation
warning
In KRaft mode, high Raft commit latency delays metadata changes from propagating through the cluster, causing stale metadata and operational delays.
Raft Metadata Apply Errors Indicate Controller Issues
critical
In KRaft mode, metadata apply errors indicate the controller is failing to apply metadata changes, potentially causing inconsistent cluster state.
Unclean leader election delayed 5 minutes in KRaft mode without manual trigger
warning


Monitoring Interfaces

Kafka Prometheus
Kafka Datadog
Kafka Native