Cost Optimization with Tiered Storage Strategy
Cost Optimization
Balancing Kafka storage costs by leveraging tiered storage for older data while maintaining performance for active workloads.
Prompt: “My AWS MSK cluster storage costs are high due to 90-day retention requirements. Should I enable tiered storage to move older data to S3? What's the cost-performance trade-off?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When evaluating tiered storage for cost optimization, start by understanding your actual data access patterns and storage growth rate. Most clusters have significant cost savings potential if consumers primarily read recent data and rarely need historical replay. The key is confirming that your workload can tolerate the added latency of S3 reads for older segments.
1. Analyze actual data access patterns across consumer groups
Check `kafka-partition-current-offset` against `kafka-partition-oldest-offset` for each topic to see how much of your retained data is actually being consumed. If most consumer groups are within a few days of the current offset but you're retaining 90 days, you're paying for storage that's rarely accessed—prime candidate for tiered storage. Look for patterns where consumers stay near the head of the log; this indicates older data can safely move to S3 without performance impact.
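The offset comparison above can be sketched as a simple fraction: how much of the retained log sits between the slowest consumer group's committed offset and the log head. A minimal sketch in Python; the offset values are illustrative, and in practice they come from the `kafka-partition-current-offset` and `kafka-partition-oldest-offset` metrics and your consumer groups' committed offsets.

```python
# Sketch: estimate what fraction of retained data is actively read.
# Offset values are illustrative placeholders, not real metrics.

def active_fraction(oldest_offset: int, current_offset: int,
                    min_committed: int) -> float:
    """Fraction of the retained log between the slowest consumer
    group's committed offset and the log head."""
    retained = current_offset - oldest_offset
    unread = current_offset - min_committed
    return unread / retained if retained else 0.0

# Example: 90 days of messages retained, slowest consumer ~2 days behind.
frac = active_fraction(oldest_offset=0, current_offset=90_000_000,
                       min_committed=88_000_000)
print(f"{frac:.1%} of retained data is actively read")  # → 2.2%
```

A value this low means almost all retained data is cold, which is exactly the profile where tiered storage pays off.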
2. Calculate current storage costs and growth rate
Use `kafka-topic-net-bytes-in-rate` to determine your daily data ingestion rate, then multiply by your retention period from `kafka-topic-config-retention-ms` to estimate total storage required. For a cluster ingesting 100 GB/day with 90-day retention, you're storing ~9 TB. With AWS MSK, local storage costs roughly $0.10/GB-month while S3 is $0.023/GB-month: moving 80% of that data to tiered storage saves roughly $550/month for this cluster, or about $60/month per TB retained.
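The estimate above works out as follows; the per-GB prices are the document's assumed rates and should be checked against current AWS pricing before relying on them.

```python
# Sketch of the storage-cost estimate; prices are assumed rates.
GB_PER_DAY = 100
RETENTION_DAYS = 90
LOCAL_PER_GB_MONTH = 0.10   # MSK broker-local (EBS) storage
S3_PER_GB_MONTH = 0.023     # S3 standard
TIERED_FRACTION = 0.80      # share of retained data moved to S3

total_gb = GB_PER_DAY * RETENTION_DAYS                 # 9,000 GB ≈ 9 TB
local_only = total_gb * LOCAL_PER_GB_MONTH             # $900/month
tiered = (total_gb * TIERED_FRACTION * S3_PER_GB_MONTH
          + total_gb * (1 - TIERED_FRACTION) * LOCAL_PER_GB_MONTH)

print(f"All-local: ${local_only:,.0f}/mo, tiered: ${tiered:,.0f}/mo, "
      f"savings: ${local_only - tiered:,.0f}/mo")
```

At these rates the tiered configuration costs about $346/month versus $900/month all-local, a saving of roughly $554/month.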
3. Verify consumers can tolerate increased replay latency
Tiered storage adds 100-300ms latency when reading from S3 versus <10ms for local disk. Review the `kafka-retention-recovery-window` insight to understand your recovery scenarios—if disaster recovery is your main reason for long retention and you can tolerate slower replay during incidents, tiered storage is ideal. If you have operational playbooks that frequently replay historical data for backfills or analysis, factor in the 10-30x latency increase.
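To see how that per-read latency translates into replay time, here is a deliberately pessimistic single-consumer model: one fetch at a time, no prefetching. Real clients pipeline fetches, so treat this as a worst-case bound; the fetch size and latencies are assumptions, not measurements from any specific cluster.

```python
# Worst-case sequential replay model: one fetch completes before the
# next starts. Fetch size and latencies are illustrative assumptions.
FETCH_MB = 1.0           # data returned per fetch request
LOCAL_LATENCY_S = 0.005  # <10ms local disk read
S3_LATENCY_S = 0.200     # 100-300ms tiered (S3) read

local_mb_s = FETCH_MB / LOCAL_LATENCY_S  # 200 MB/s ceiling
s3_mb_s = FETCH_MB / S3_LATENCY_S        # 5 MB/s ceiling

backfill_gb = 500
print(f"Replaying {backfill_gb} GB: "
      f"{backfill_gb * 1024 / local_mb_s / 3600:.1f}h local vs "
      f"{backfill_gb * 1024 / s3_mb_s / 3600:.1f}h tiered (single consumer)")
```

Even as a worst case, the gap (under an hour versus over a day for a 500 GB backfill) shows why frequent historical replays deserve scrutiny before enabling tiered storage.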
4. Check for slow consumers approaching retention limits
Review the `topic-retention-approaching-with-slow-consumption` insight to identify any consumer groups with growing lag. If consumers are already struggling to keep up, enabling tiered storage won't cause data loss (old segments move to S3, not deleted), but it will make catch-up slower due to S3 read latency. Address consumer throughput issues before enabling tiered storage to avoid exacerbating lag problems.
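A quick way to judge whether a lagging group is recoverable is to compare its consumption rate against the production rate; if it never out-reads the producers, lag grows without bound. A minimal sketch, with illustrative rates that would in practice come from lag and throughput metrics:

```python
# Sketch: will a lagging consumer group ever catch up?
# Rates are illustrative; derive them from lag and throughput metrics.

def catchup_hours(lag_msgs: float, consume_rate: float,
                  produce_rate: float):
    """Hours to drain the lag, or None if lag grows indefinitely."""
    net = consume_rate - produce_rate  # messages/s reclaimed
    return None if net <= 0 else lag_msgs / net / 3600

# Consumer reads 12k msg/s against 10k msg/s produced: lag shrinks.
print(catchup_hours(lag_msgs=50_000_000,
                    consume_rate=12_000,
                    produce_rate=10_000))  # → ~6.9 hours
```

If `catchup_hours` returns None, fix consumer throughput first; tiered storage will only slow the eventual catch-up further.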
5. Audit per-topic retention configurations for optimization opportunities
Before implementing tiered storage cluster-wide, examine `kafka-topic-config-retention-ms` and `kafka-topic-config-retention-bytes` for each topic. Some topics might not need 90-day retention—internal metrics or debugging topics could use 7 days. Reducing retention on low-value topics is simpler than tiered storage and eliminates storage costs entirely rather than just reducing them. Focus tiered storage on topics where 90-day retention is truly required.
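The audit can be framed as a per-topic cost table: ingest rate times retention times the local storage rate. The topic names and rates below are made up for illustration; the retention values would come from `kafka-topic-config-retention-ms`.

```python
# Sketch of a per-topic retention audit. Topic names and ingest
# rates are illustrative placeholders.
LOCAL_PER_GB_MONTH = 0.10

topics = {                      # topic: (GB/day ingested, retention days)
    "orders": (40, 90),
    "metrics.internal": (30, 90),
    "debug.traces": (30, 90),
}

for name, (gb_day, days) in topics.items():
    retained_gb = gb_day * days
    cost = retained_gb * LOCAL_PER_GB_MONTH
    print(f"{name:>18}: {retained_gb:>5} GB retained, ${cost:,.0f}/mo")

# Cutting the two internal topics from 90 to 7 days frees
# (30 + 30) * (90 - 7) = 4,980 GB, about $498/mo at local rates.
```

In this sketch, trimming retention on the two low-value topics recovers more than half the cluster's storage cost with a one-line config change per topic.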
6. Consider partition count impact on tiered storage efficiency
Check `kafka-broker-partition-count` and `kafka-topic-partitions` to understand your partition distribution. Tiered storage works best with larger segment sizes (1GB+), but high partition counts lead to many small segments. If you have 1000 partitions with 100MB segments, you'll have frequent small uploads to S3 with higher API costs. Clusters with <500 partitions and segment sizes >512MB get better cost efficiency from tiered storage.
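The small-segment effect follows from how segments roll: a segment closes when it reaches `segment.bytes` or when `segment.ms` elapses, whichever comes first, so low per-partition throughput forces time-based rolls at small sizes. A sketch of that calculation; the parameter values are illustrative, including the 1-day `segment.ms` used in the first example.

```python
# Sketch: why high partition counts produce small segments.
# A segment rolls at segment.bytes OR segment.ms, whichever is first.
# All parameter values below are illustrative.

def expected_segment_mb(cluster_gb_day: float, partitions: int,
                        segment_bytes_mb: float,
                        segment_ms_days: float) -> float:
    per_partition_mb_day = cluster_gb_day * 1024 / partitions
    return min(segment_bytes_mb, per_partition_mb_day * segment_ms_days)

# 100 GB/day across 1000 partitions, 1 GB segment.bytes, 1-day roll:
print(expected_segment_mb(100, 1000, 1024, 1))  # → 102.4 (rolls by time)

# Same ingest across 500 partitions with a 7-day roll:
print(expected_segment_mb(100, 500, 1024, 7))   # → 1024 (rolls by size)
```

The first case yields ~100 MB segments and many small S3 uploads; the second fills each segment before the time limit, which is the efficient regime for tiered storage.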