Migration from Self-Hosted to Managed Kafka


Planning and executing a migration from self-managed Kafka to AWS MSK or Confluent Cloud while maintaining availability and data consistency.

Prompt: We're planning to migrate our self-hosted Kafka cluster to AWS MSK. What's the best approach to migrate without downtime? Should we use MirrorMaker 2 or Cluster Linking?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When planning a migration from self-hosted Kafka to AWS MSK, start by validating your source cluster health — check for under-replicated partitions and ISR stability, as you can't afford instability during migration. Then baseline your throughput and replication configuration to properly size the target cluster. Only after confirming source cluster stability should you choose between MirrorMaker 2 (more flexible, works across any Kafka) and Cluster Linking (Confluent-specific, lower latency, but requires Confluent Platform or Cloud).

1. Verify source cluster replication health
Check `kafka.replication.under_replicated_partitions` — it should be zero or very close to zero before starting a migration. The `under-replicated-partitions-data-loss-risk` insight warns that under-replicated partitions create data loss risk, and you absolutely cannot afford that during a migration when you're running dual clusters. Also verify `kafka.partition.replicas` matches your `kafka.broker.config.default_replication_factor` across all critical topics.
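As a sketch, this gate can be expressed as a single check over exported metric values. The `metrics` dict shape, function name, and per-topic replica map below are illustrative assumptions, not a Schema API:

```python
# Pre-migration replication health gate. Metric names mirror the ones
# referenced above; the dict shape and thresholds are illustrative.

def replication_health_ok(metrics: dict) -> tuple[bool, list[str]]:
    """Return (ok, reasons) for the source-cluster replication checks."""
    reasons = []
    # Any under-replicated partition is a data loss risk during migration.
    if metrics["kafka.replication.under_replicated_partitions"] > 0:
        reasons.append("under-replicated partitions present: data loss risk")
    # Critical topics should carry the default replication factor.
    rf = metrics["kafka.broker.config.default_replication_factor"]
    for topic, replicas in metrics["kafka.partition.replicas"].items():
        if replicas < rf:
            reasons.append(f"{topic}: replicas={replicas} below default RF={rf}")
    return (not reasons, reasons)

ok, reasons = replication_health_ok({
    "kafka.replication.under_replicated_partitions": 0,
    "kafka.broker.config.default_replication_factor": 3,
    "kafka.partition.replicas": {"orders": 3, "payments": 3},
})
```

A failing gate should block the migration runbook rather than just warn, since dual-cluster replication only amplifies existing instability.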
2. Check for ISR oscillation indicating replica instability
Look for the pattern described in `isr-shrink-expand-oscillation-reveals-replica-instability` — if your replicas are constantly falling behind and catching up, adding MirrorMaker 2 or Cluster Linking will only amplify the problem. ISR shrink/expand rates should be near-zero. If you see oscillation, resolve broker resource contention, network issues, or tune `replica.lag.time.max.ms` before proceeding with migration planning.
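A minimal oscillation detector over sampled shrink/expand rates might look like this. The 0.1 events/sec threshold and the sampling shape are assumptions to tune against your own baseline:

```python
def isr_oscillating(shrink_rates: list[float],
                    expand_rates: list[float],
                    threshold: float = 0.1) -> bool:
    """Flag ISR churn: sustained shrink *and* expand activity together.

    Rates are per-second samples of the ISR shrink/expand counters; a
    healthy cluster shows both averages near zero. The threshold is an
    illustrative assumption, not a Kafka default.
    """
    avg_shrink = sum(shrink_rates) / len(shrink_rates)
    avg_expand = sum(expand_rates) / len(expand_rates)
    # Oscillation means replicas repeatedly fall out of the ISR and
    # rejoin, so both directions are elevated at the same time.
    return avg_shrink > threshold and avg_expand > threshold
```

If this fires, fix the underlying broker or network problem first; neither MirrorMaker 2 nor Cluster Linking tolerates an unstable ISR well.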
3. Baseline current throughput for target cluster sizing
Measure `kafka.messages_in.rate`, `kafka.net.bytes_in.rate`, and `kafka.net.bytes_out.rate` over at least a week including peak periods. MSK broker sizing depends heavily on these metrics — undersizing your target cluster means MirrorMaker 2 won't keep up, creating lag that defeats the purpose of zero-downtime migration. Also note `kafka.broker.partition_count` to ensure MSK instance types can handle the partition load.
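A rough sizing calculation from those baselines could be sketched as follows. The per-broker sustainable throughput and the 50% headroom (reserved for the MirrorMaker 2 catch-up phase) are placeholder assumptions; check the real limits of your chosen MSK instance type:

```python
import math

def msk_broker_estimate(peak_bytes_in: float,
                        peak_bytes_out: float,
                        replication_factor: int = 3,
                        per_broker_mb_s: float = 60.0,
                        headroom: float = 0.5) -> int:
    """Estimate MSK broker count from peak source-cluster throughput.

    Ingress is multiplied by the replication factor because every write
    is replicated inside the target cluster; per_broker_mb_s and
    headroom are illustrative assumptions, not AWS-published figures.
    """
    total_mb_s = (peak_bytes_in * replication_factor + peak_bytes_out) / 1_000_000
    usable_mb_s = per_broker_mb_s * (1 - headroom)
    # Never go below 3 brokers: one per AZ for a multi-AZ deployment.
    return max(3, math.ceil(total_mb_s / usable_mb_s))
```

For example, 30 MB/s in and 60 MB/s out at RF 3 yields 150 MB/s of broker work, suggesting 5 brokers under these assumptions.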
4. Validate current replication lag is minimal
Confirm `kafka.replication.max_lag` is consistently low (ideally under 1000 messages) across all partitions. High replication lag in your source cluster indicates brokers struggling to keep replicas in sync, and MirrorMaker 2 adds another layer of replication overhead. If max lag is high, investigate broker resource utilization and network before adding cross-cluster replication.
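This check reduces to scanning per-partition max-lag readings against the threshold. The 1000-message default mirrors the guidance above; the input shape is an assumption about how you export `kafka.replication.max_lag`:

```python
def source_lag_acceptable(max_lag_by_partition: dict[str, int],
                          threshold: int = 1000) -> tuple[bool, dict[str, int]]:
    """Return (ok, offenders) given per-partition max replica lag.

    Offenders maps partition -> lag for every partition over threshold,
    so the runbook can point directly at the brokers to investigate.
    """
    offenders = {p: lag for p, lag in max_lag_by_partition.items()
                 if lag > threshold}
    return (not offenders, offenders)
```

Run it over a window of samples, not a single reading, so a transient spike does not block the migration unnecessarily.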
5. Assess topic and partition topology
Document your `kafka.topic.partitions` counts and how they're distributed across `kafka.brokers`. MirrorMaker 2 creates matching topics in the target cluster, but MSK has limits on partitions per broker (1000 for kafka.m5.large, 2000 for larger instance types). If you have hot topics with many partitions or uneven distribution, plan to rebalance post-migration or choose MSK instance types accordingly.
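The per-broker limits translate into a quick capacity check. The limit table uses the figures from this step; mapping "larger" to `kafka.m5.xlarge` is an illustrative assumption:

```python
import math

# Per-broker partition maximums cited in the step above; treating
# kafka.m5.xlarge as "larger" is an assumption for illustration.
MSK_PARTITION_LIMITS = {
    "kafka.m5.large": 1000,
    "kafka.m5.xlarge": 2000,
}

def brokers_needed_for_partitions(total_partitions: int,
                                  replication_factor: int,
                                  instance_type: str) -> int:
    """Minimum broker count so partition replicas fit under the limit.

    Each partition contributes replication_factor replicas spread
    across the brokers of the cluster.
    """
    total_replicas = total_partitions * replication_factor
    return math.ceil(total_replicas / MSK_PARTITION_LIMITS[instance_type])
```

Compare the result against the throughput-based broker estimate and size for whichever constraint is larger.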
6. Plan Schema Registry migration if applicable
If you're using Schema Registry, be aware of the `schema-registry-replication-factor-mismatch` issue — Schema Registry's internal topics need proper replication factor. For MSK, you'll likely use AWS Glue Schema Registry instead, requiring schema export/import. For Confluent Cloud migration, ensure Schema Registry is deployed only after the target Kafka cluster is fully stable with all brokers running.
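For the export/import step, one approach is to build an explicit manifest of the latest version of every subject before importing into the target registry. The input mirrors the shape of Schema Registry's `GET /subjects/{subject}/versions/latest` response, but the fetch itself is left out and the manifest format is an assumption:

```python
def export_manifest(subjects_latest: dict[str, dict]) -> list[dict]:
    """Build a deterministic export manifest from latest schema versions.

    subjects_latest maps subject name -> its latest-version payload
    (at minimum "version" and "schema" fields). Sorting makes the
    manifest diffable across repeated export runs.
    """
    return [
        {"subject": s, "version": v["version"], "schema": v["schema"]}
        for s, v in sorted(subjects_latest.items())
    ]
```

Keeping the manifest in version control gives you an auditable record of exactly which schemas crossed over during the migration.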
7. Choose replication strategy based on requirements
Use MirrorMaker 2 if migrating to AWS MSK (it's open-source and works with any Kafka), or if you need granular control over topic filtering and transformation. Use Cluster Linking if migrating to Confluent Cloud or Confluent Platform — it offers lower latency, automatic offset translation, and less operational overhead, but it's Confluent-specific. Both support zero-downtime migration, but MirrorMaker 2 requires more monitoring of `kafka.replication.max_lag` to ensure mirror topics stay current during the cutover window.
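During the cutover window, a guard over recent `kafka.replication.max_lag` samples on the mirror topics can decide when it is safe to switch producers. The 1000-message ceiling matches the earlier guidance; the five-sample stability window is an illustrative assumption:

```python
def cutover_ready(lag_samples: list[int],
                  max_lag: int = 1000,
                  stable_window: int = 5) -> bool:
    """True when the last stable_window mirror-lag samples are all
    under max_lag, i.e. the mirror has caught up and stayed caught up.
    """
    recent = lag_samples[-stable_window:]
    # Require a full window: too few samples means we can't yet judge.
    return len(recent) == stable_window and all(l <= max_lag for l in recent)
```

For example, a lag series that drains from 5000 down to a few hundred and holds there passes; a single spike back over the ceiling resets the wait.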

Monitoring Interfaces

Kafka Datadog
Kafka Prometheus
Kafka Native