Confluent Platform · Apache Kafka · Prometheus

Per-partition monitoring generates excessive cardinality

warning · performance · Updated Dec 19, 2025 (via Exa)
How to detect:

Monitoring consumer lag for every partition generates far more time series than most Prometheus deployments can absorb. Example: 100 topics × 10 partitions × 5 consumer groups × 3 metrics = 15,000 time series. At larger scale, 50,000 partitions × 3 metrics = 150,000 time series; scraped every 15 seconds, that is 864 million data points per day, roughly 10 GB/day (300 GB/month) of additional storage.
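The arithmetic above can be sketched as a small back-of-the-envelope calculator; the figures and the 15-second scrape interval are the illustrative numbers from the text, not measurements:

```python
# Back-of-the-envelope cardinality estimate for per-partition lag metrics.
# All inputs are illustrative figures from the text above.

def series_count(topics: int, partitions_per_topic: int,
                 consumer_groups: int, metrics_per_series: int) -> int:
    """Time series created when every (topic, partition, group)
    combination is exported for each metric."""
    return topics * partitions_per_topic * consumer_groups * metrics_per_series

def samples_per_day(series: int, scrape_interval_s: int = 15) -> int:
    """Data points ingested per day at a fixed scrape interval."""
    return series * (86_400 // scrape_interval_s)

if __name__ == "__main__":
    small = series_count(topics=100, partitions_per_topic=10,
                         consumer_groups=5, metrics_per_series=3)
    print(small)                    # 15000 time series

    large = 50_000 * 3              # 50k partitions x 3 metrics
    print(large)                    # 150000 time series
    print(samples_per_day(large))   # 864000000 samples/day
```

Running the same arithmetic against your own topic and group counts gives a quick check on whether per-partition scraping is viable before enabling it.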

Recommended action:

Track the maximum lag per consumer group instead of per partition. Drill into partition-level metrics only when actively troubleshooting, and assign per-partition metrics P4 priority (collected on demand only).
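One way to apply this is a Prometheus recording rule that collapses per-partition lag into a single series per consumer group. A minimal sketch, assuming the `kafka_consumergroup_lag{consumergroup, topic, partition}` metric exposed by kafka_exporter; substitute your exporter's metric and label names:

```yaml
# Recording rule: one max-lag series per consumer group,
# replacing thousands of per-partition series on dashboards and alerts.
groups:
  - name: kafka-consumer-lag
    rules:
      - record: kafka:consumergroup_lag:max
        expr: max by (consumergroup) (kafka_consumergroup_lag)
```

Dashboards and alerts then query `kafka:consumergroup_lag:max`, while the raw per-partition series can be dropped or queried ad hoc only during troubleshooting.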