Redis Cluster Resharding for Horizontal Scaling

Capacity Planning

Adding or removing nodes in Redis Cluster requires resharding hash slots, which raises concerns about performance impact and data migration timing.

Prompt: “We need to add 3 more nodes to our Redis cluster to handle growth, but I'm worried about the resharding process causing latency spikes or downtime - help me understand how long slot migration will take, whether it can run during business hours, and if clients need any changes to handle the resharding.”

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When planning Redis Cluster resharding, start by establishing performance baselines and estimating migration time from key count and throughput. Then monitor latency and operations metrics closely during the resharding itself to catch client-side issues or unexpected performance degradation early.

1. Establish baseline cluster performance metrics

Before touching anything, capture baseline values for `redis-stats-instantaneous-ops-per-sec`, `redis-info-latency-ms`, and `redis-memory-used` across all nodes during your typical peak traffic period. You need to know what 'normal' looks like so you can immediately spot anomalies during resharding. If your baseline latency is already high (>10ms) or memory is near maxmemory limits (>80%), address those issues before adding the complexity of resharding.

redis.stats.instantaneous_ops_per_sec, redis.info.latency_ms, redis.memory.used
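The baseline check described in this step can be sketched in a few lines of Python. This is only an illustration: the sample dictionaries and their keys (`ops_per_sec`, `latency_ms`, `used_memory`, `maxmemory`) are hypothetical names for values you would collect from your own monitoring, and the thresholds are the 10 ms and 80% figures mentioned above.

```python
from statistics import mean

# Thresholds taken from the step above; tune them for your workload.
MAX_BASELINE_LATENCY_MS = 10.0
MAX_MEMORY_FRACTION = 0.80


def summarize_baseline(samples):
    """Summarize samples collected for one node during a peak-traffic window.

    Each sample is a dict with hypothetical keys: ops_per_sec, latency_ms,
    used_memory (bytes) and maxmemory (bytes, 0 meaning "no limit").
    """
    return {
        "ops_per_sec": mean(s["ops_per_sec"] for s in samples),
        "latency_ms": mean(s["latency_ms"] for s in samples),
        # Use the worst-case memory reading, not the average.
        "used_memory": max(s["used_memory"] for s in samples),
        "maxmemory": samples[-1]["maxmemory"],
    }


def baseline_blockers(baseline):
    """Return reasons to postpone resharding; an empty list means proceed."""
    blockers = []
    if baseline["latency_ms"] > MAX_BASELINE_LATENCY_MS:
        blockers.append("baseline latency above 10 ms")
    limit = baseline["maxmemory"]
    if limit > 0 and baseline["used_memory"] / limit > MAX_MEMORY_FRACTION:
        blockers.append("memory above 80% of maxmemory")
    return blockers
```

Running `baseline_blockers(summarize_baseline(samples))` once per node gives a simple go/no-go list before any slots are moved.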

2. Calculate migration time and throughput impact

Use `redis-db-keys` to get the total key count and `redis-net-output` to estimate current network throughput. Slot migration moves keys one at a time or in small pipelines (redis-cli batches 10 keys per MIGRATE call by default, adjustable with `--cluster-pipeline`), so a node with 10M keys might take 30-60 minutes per 4096 slots at typical speeds. Compare this to your current `redis-stats-instantaneous-ops-per-sec`: if you're doing 50K ops/sec and migration will add 5-10K ops/sec of overhead, you can probably run during business hours. If you're already at 80%+ capacity, wait for off-peak.

redis.db.keys, redis.net.output, redis.stats.instantaneous_ops_per_sec
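The arithmetic behind this estimate can be written down as a rough sketch. It assumes keys are spread evenly across the 16384 hash slots and that you have measured a sustained migration rate yourself; `keys_per_sec` and `capacity_ops` are hypothetical inputs for illustration, not values Redis reports directly.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def slots_to_move(current_masters, new_masters):
    """Slots that must migrate to reach an even split after adding masters."""
    if new_masters <= current_masters:
        raise ValueError("new_masters must exceed current_masters")
    return TOTAL_SLOTS * (new_masters - current_masters) // new_masters


def estimate_migration_minutes(total_keys, current_masters, new_masters, keys_per_sec):
    """Rough duration, assuming keys are evenly distributed over slots.

    keys_per_sec is a migration rate you measured (hypothetical input).
    """
    moved_fraction = slots_to_move(current_masters, new_masters) / TOTAL_SLOTS
    keys_moved = total_keys * moved_fraction
    return keys_moved / keys_per_sec / 60


def projected_utilization(current_ops, migration_ops, capacity_ops):
    """Fraction of capacity in use once migration overhead is added.

    capacity_ops is the throughput you consider the cluster's ceiling
    (hypothetical input, e.g. from a load test).
    """
    return (current_ops + migration_ops) / capacity_ops
```

For example, growing from 3 to 6 masters moves 8192 slots, and with 30M keys at a measured 5000 keys/sec the estimate comes to about 50 minutes. Large values or uneven key distribution will push the real figure away from this estimate.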

3. Monitor latency and slowlog during active resharding

Once resharding starts, watch `redis-info-latency-ms` and `redis-slowlog-length` in real time on both source and target nodes. Latency spikes above 2-3x your baseline or a growing slowlog indicate that the migration is consuming too many resources. The MIGRATE command blocks the source and target nodes while each batch of keys is transferred, so large keys lengthen the pause; if you see consistent spikes, you may need to pause and resume during lower-traffic periods or reduce the pipeline size.

redis.info.latency_ms, redis.slowlog.length
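One way to turn the "2-3x baseline or growing slowlog" rule into a pause signal is sketched below. The function name, the `min_spikes` parameter, and the shape of the inputs are assumptions made for illustration; the samples would come from whatever collector you already run.

```python
def should_pause_migration(baseline_latency_ms, latency_samples_ms, slowlog_lengths,
                           spike_factor=2.0, min_spikes=3):
    """Return True when latency spikes are consistent or the slowlog keeps growing.

    latency_samples_ms and slowlog_lengths are recent readings, oldest first.
    spike_factor=2.0 is the lower bound of the 2-3x range in the step above;
    min_spikes is a hypothetical knob for what counts as "consistent".
    """
    threshold = baseline_latency_ms * spike_factor
    spikes = sum(1 for value in latency_samples_ms if value > threshold)
    # "Growing" here means every reading is larger than the one before it.
    slowlog_growing = len(slowlog_lengths) >= 2 and all(
        later > earlier
        for earlier, later in zip(slowlog_lengths, slowlog_lengths[1:])
    )
    return spikes >= min_spikes or slowlog_growing
```

Note that the slowlog is a bounded list (its size is set by `slowlog-max-len`), so its length stops growing once it is full; a full slowlog should be treated as a warning in its own right.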

4. Check for unexpected CPU or memory spikes during migration

Keep an eye on CPU and memory usage patterns during slot migration. The `redis-cpu-spike-cache-deletions` insight shows that bulk key operations can spike CPU to 83%+ because Redis must scan the entire keyspace. If you have TTL-based expiration or any deletion patterns running concurrently with migration, this could compound and cause severe performance degradation. Monitor `redis-memory-used` too — fragmentation can temporarily increase during migration.

Redis CPU utilization spikes during cache key deletions, redis.memory.used

5. Verify client handling of MOVED and ASK redirects

During resharding, clients will receive MOVED redirects when accessing migrated slots and ASK redirects for slots currently being migrated. Check your application logs and error rates — proper Redis Cluster clients should handle these transparently, but older or misconfigured clients might surface these as errors to users. Monitor `redis-net-commands` for unexpected connection churn or failed commands. If you see elevated error rates, clients may need library updates or configuration changes.

redis.net.commands
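To make the difference between the two redirects concrete, here is a minimal sketch of how a client might interpret them. The error text follows the documented `MOVED <slot> <host>:<port>` and `ASK <slot> <host>:<port>` shape; the `Redirect` type and the helper names are hypothetical, and a real cluster-aware client library already does this internally.

```python
import re
from typing import NamedTuple, Optional

# Matches redirect errors such as "MOVED 3999 127.0.0.1:6381".
_REDIRECT = re.compile(r"^(MOVED|ASK) (\d+) (\S+):(\d+)$")


class Redirect(NamedTuple):
    kind: str
    slot: int
    host: str
    port: int


def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse a redirect error; return None for any other error text."""
    match = _REDIRECT.match(error_message.strip())
    if match is None:
        return None
    kind, slot, host, port = match.groups()
    return Redirect(kind, int(slot), host, int(port))


def next_action(redirect: Redirect) -> str:
    """Describe what a well-behaved client does with each redirect kind."""
    if redirect.kind == "MOVED":
        # The slot has a new permanent owner: refresh the slot map.
        return "update slot map, then retry on target"
    # ASK applies to this one request only: the slot is mid-migration.
    return "send ASKING to target, then retry once without updating slot map"
```

If application logs show raw `MOVED` or `ASK` strings reaching your error handlers, that is a sign the client is not running in cluster mode or predates redirect support.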

6. Track cache hit ratio degradation during migration

Monitor the cache hit ratio, `redis-keyspace-hits` / (`redis-keyspace-hits` + `redis-keyspace-misses`), during resharding. You'll likely see a temporary decrease in hit ratio as keys are moved and some cache warming is lost, but it should recover quickly after migration completes. If the hit ratio drops more than 10-15% and doesn't recover, it might indicate client connection issues or routing problems where requests aren't finding their keys on the new nodes.

redis.keyspace.hits, redis.keyspace.misses
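Because the hit and miss counters are cumulative, the hit ratio during resharding is best computed from the difference between two snapshots. The sketch below assumes snapshots are plain dicts with hypothetical keys `hits` and `misses`, and it reads the 10-15% threshold as a drop in percentage points, which is one possible interpretation.

```python
def hit_ratio(hits, misses):
    """Hits divided by total lookups; None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def interval_hit_ratio(before, after):
    """Hit ratio over an interval, from two snapshots of cumulative counters."""
    return hit_ratio(after["hits"] - before["hits"],
                     after["misses"] - before["misses"])


def hit_ratio_alert(baseline_ratio, current_ratio, max_drop=0.10):
    """True when the ratio fell by more than max_drop (absolute, 0.10 = 10 points)."""
    return (baseline_ratio - current_ratio) > max_drop
```

Comparing `interval_hit_ratio` for a window before resharding against windows during and after it shows whether the dip is temporary or persistent.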

Technologies

Redis

Related Insights

Redis CPU utilization spikes during cache key deletions

critical

Relevant Metrics

redis.memory.used, redis.stats.instantaneous_ops_per_sec, redis.net.input, redis.net.output, redis.keyspace.hits, redis.keyspace.misses, redis.info.latency_ms, redis.db.keys, redis.net.commands, redis.slowlog.length

Monitoring Interfaces

Redis Prometheus

Redis Native Metrics

Redis Datadog