Broker Disk Space Approaching Capacity

Critical · Incident Response

Kafka broker disk utilization is critically high, risking broker crashes and cluster unavailability if disk fills completely.

Prompt: My Kafka broker disk usage just hit 92% and climbing. What should I do to prevent the broker from crashing? Should I delete old segments or increase retention settings?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When Kafka broker disk hits 92%, the first priority is understanding what's consuming space and how fast it's growing. Check retention configurations across all topics, identify the heaviest consumers of disk, and verify consumer lag before making any retention changes — reducing retention while consumers are lagging will cause data loss. Only after understanding the full picture should you decide between temporary retention reduction, partition rebalancing, or adding capacity.

1. Check retention configurations and identify misconfigured topics
Start by examining `kafka.broker.config.log_retention_bytes` and `kafka.broker.config.log_retention_ms` for broker-level defaults, then drill into individual topics using `kafka.topic.config.retention_bytes` and `kafka.topic.config.retention_ms`. Look for topics with unlimited retention (-1) or excessively long retention windows (>7 days for high-throughput topics). A single misconfigured topic with unlimited retention can fill a disk in hours. This is the fastest way to identify low-hanging fruit before taking more invasive action.
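The retention scan above can be sketched as a small helper. The input shape is an assumption: a dict of topic name to raw config strings, as you might collect from a broker config lookup. Note that `retention.bytes=-1` is the default and is only truly unlimited when `retention.ms` is also `-1`.

```python
UNLIMITED = -1
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

def flag_risky_retention(topic_configs):
    """Return (topic, reason) pairs for topics with unlimited or >7-day retention.

    topic_configs: hypothetical dict of topic -> {"retention.ms": str, "retention.bytes": str}.
    """
    risky = []
    for topic, cfg in topic_configs.items():
        retention_ms = int(cfg.get("retention.ms", SEVEN_DAYS_MS))
        retention_bytes = int(cfg.get("retention.bytes", UNLIMITED))
        # Unlimited only when BOTH time and size limits are disabled.
        if retention_ms == UNLIMITED and retention_bytes == UNLIMITED:
            risky.append((topic, "unlimited retention"))
        elif retention_ms > SEVEN_DAYS_MS:
            risky.append((topic, f"retention.ms={retention_ms} (> 7 days)"))
    return risky
```

The threshold of 7 days mirrors the guidance in the step; tighten it for high-throughput clusters.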
2. Identify which topics are consuming the most disk space
Estimate per-topic disk usage from offset ranges: subtract `kafka.partition.oldest_offset` from `kafka.partition.current_offset` for each partition to get the retained message count, then aggregate by topic. Topics with millions of messages per partition are your prime candidates for retention tuning. Multiply the aggregate message count by average message size to estimate total topic footprint. Focus your mitigation efforts on the top 3-5 space consumers — optimizing a topic consuming 60GB is far more impactful than tweaking one using 2GB.
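A minimal sketch of that estimate, assuming you have already pulled (oldest, current) offset pairs per partition and an average message size per topic from your metrics source:

```python
def estimate_topic_bytes(partitions, avg_msg_bytes):
    """Sum retained message counts (current - oldest) across partitions,
    then multiply by average message size to approximate on-disk footprint."""
    msg_count = sum(current - oldest for oldest, current in partitions)
    return msg_count * avg_msg_bytes

def top_consumers(topics, n=5):
    """topics: dict of name -> (partition offset pairs, avg_msg_bytes).
    Return the top n topics by estimated bytes, largest first."""
    sized = {t: estimate_topic_bytes(p, s) for t, (p, s) in topics.items()}
    return sorted(sized.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

This ignores compression and index overhead, so treat the result as a ranking tool, not an exact size.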
3. Verify consumer lag before reducing retention
This is critical: check consumer group lag for all topics before touching retention settings. The `topic-retention-approaching-with-slow-consumption` insight warns that reducing retention while consumers are lagging will cause data loss — messages will age out before consumers read them. If your consumers are already struggling to keep up (lag increasing over time), you need to fix consumption throughput first, not reduce retention. Only reduce retention on topics where consumers are caught up or where data loss is acceptable.
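The "is lag increasing over time" check can be made concrete with a least-squares slope over lag samples. The sample format and the caught-up threshold are assumptions for illustration:

```python
def lag_slope(samples):
    """Least-squares slope of (t_seconds, lag) samples: lag growth per second.
    Positive slope means consumers are falling further behind."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    num = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

def safe_to_reduce_retention(samples, caught_up_threshold=1000):
    """Only reduce retention when lag is flat or shrinking AND near zero.
    caught_up_threshold is a hypothetical cutoff; tune it per topic throughput."""
    return lag_slope(samples) <= 0 and samples[-1][1] <= caught_up_threshold
```

If this returns False, fix consumption throughput first, exactly as the step warns.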
4. Check log segment size and active segment rollover
Review `kafka.broker.config.log_segment_bytes` — if segments are configured too large (e.g., 1GB+), Kafka won't delete old data until the entire segment ages out, even if retention time has passed. The active segment is never deleted regardless of age, so if you have low-throughput topics with large segment sizes, old data sticks around indefinitely. For topics with bursty writes, consider reducing segment size to 256-512MB to allow more frequent cleanup.
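The rollover problem above is easy to quantify: if a segment takes longer to fill than the retention window, data sits in the never-deleted active segment past its intended lifetime. A sketch, with the write rate as an assumed input from your metrics:

```python
def segment_roll_hours(segment_bytes, write_bytes_per_sec):
    """Hours for the active segment to fill and roll at the observed write rate."""
    return segment_bytes / write_bytes_per_sec / 3600

def stale_data_risk(segment_bytes, write_bytes_per_sec, retention_hours):
    """True when a segment rolls slower than the retention window, meaning
    old data lingers in the active segment beyond its retention time."""
    return segment_roll_hours(segment_bytes, write_bytes_per_sec) > retention_hours
```

For example, a 1GiB segment on a topic writing 10KiB/s takes about 29 hours to roll, so a 24-hour retention setting cannot actually be enforced until the segment closes.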
5. Review partition distribution and rebalancing opportunities
Check `kafka.broker.partition_count` across all brokers in the cluster. If this broker has significantly more partitions than others (say 800 vs cluster average of 500), uneven partition distribution is your root cause. The `broker-resource-exhaustion` insight highlights that improper partition distribution degrades performance and resource utilization. Use partition reassignment to rebalance load across brokers — this provides immediate relief without data loss or retention changes.
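The skew check can be sketched as a comparison against the cluster mean. The 20% tolerance is an assumption; pick a threshold that matches your cluster's normal variance:

```python
def partition_skew(broker_counts, tolerance=0.2):
    """Return brokers whose partition count exceeds the cluster mean
    by more than the given tolerance (default 20%)."""
    mean = sum(broker_counts.values()) / len(broker_counts)
    return {b: c for b, c in broker_counts.items() if c > mean * (1 + tolerance)}
```

With the example from the step (800 partitions vs a 500-partition peer group), the overloaded broker is flagged immediately and becomes the target for reassignment.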
6. Implement immediate mitigation based on findings
Based on steps 1-5, choose your action: if you found misconfigured topics with no consumer lag, reduce their retention temporarily (e.g., from 7d to 3d). If partition distribution is skewed, initiate reassignment. If consumers are lagging, scale consumer groups immediately — adding capacity won't help if consumption can't keep up. Consider the `kafka-retention-recovery-window` insight when adjusting retention: reducing from 7d to 2d saves space but limits your ability to recover from downstream failures. For production systems, never go below your incident detection + resolution time window.
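The "never go below your incident detection + resolution window" rule can be encoded as a floor on any proposed retention change. The safety factor of 2x is a hypothetical default, not a Kafka setting:

```python
def min_safe_retention_hours(detect_hours, resolve_hours, safety_factor=2.0):
    """Floor for retention: time to detect plus time to resolve an incident,
    with headroom (safety_factor is an assumed 2x default)."""
    return (detect_hours + resolve_hours) * safety_factor

def clamp_retention(proposed_hours, detect_hours, resolve_hours):
    """Never let a disk-pressure retention cut drop below the recovery window."""
    return max(proposed_hours, min_safe_retention_hours(detect_hours, resolve_hours))
```

So a team that typically detects incidents in 4 hours and resolves them in 8 should not cut retention below 24 hours, no matter how full the disk is.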

Monitoring Interfaces

Kafka Prometheus
Kafka Datadog
Kafka Native
Kafka OpenTelemetry