Broker Failure and Excessive Producer Instantiation
Incident Response (severity: critical)
An application bug that instantiates producers excessively leads to broker heap exhaustion and cascading failures.
Prompt: “Our Kafka brokers just crashed with OutOfMemoryError. I see millions of producer connections being tracked. Did something in our application code go wrong?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating broker OOM with excessive producer connections, start by confirming the connection spike and correlating it with heap exhaustion. Then examine producer-side symptoms like buffer starvation and waiting threads to confirm multiple producer instances. Finally, check request handler saturation and compare connection growth to actual throughput — if connections explode but throughput stays flat, you've confirmed an application bug creating producers without closing them.
1. Confirm the connection explosion and compare it to broker limits
First, check `kafka.network.connection_count` to quantify the connection spike — you mentioned millions, which is abnormal for most clusters. Compare this to your configured `max.connections.per.broker` limit (typically 2000-5000). The `connection-count-approaching-limit` insight warns when you hit 80% of max, but if you're seeing OOM, you've likely exceeded it or the sheer number of connection objects exhausted heap before hitting the limit. This confirms whether it's truly a connection problem versus other heap issues.
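The 80%-of-max warning described above can be sketched as a simple classifier. This is an illustrative helper, not a real API; the function name and labels are assumptions, and only the 80% threshold comes from the insight described here.

```python
def connection_limit_status(connection_count: int, max_connections: int) -> str:
    """Classify broker connection pressure against the configured maximum.

    Illustrative sketch: "approaching" mirrors the 80%-of-max warning
    threshold of the connection-count-approaching-limit insight.
    """
    ratio = connection_count / max_connections
    if ratio >= 1.0:
        return "exceeded"     # new client connections will be rejected
    if ratio >= 0.8:
        return "approaching"  # the warning insight fires in this band
    return "ok"

# Millions of tracked producers blow far past a typical 5000 cap:
print(connection_limit_status(2_000_000, 5000))  # exceeded
```

A "millions of connections" reading lands in `exceeded` immediately, which is why heap can be gone before the limit check ever helps.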
2. Verify broker heap exhaustion and resource saturation
Check JVM heap usage on the affected brokers — you should see memory utilization >90% before the OOM occurred, and likely heavy GC activity. The `broker-resource-exhaustion` insight flags memory >90% as critical. Each producer connection consumes heap for buffers, metadata, and connection state, so thousands of producers can easily exhaust a 4-8GB heap. Also check if CPU was saturated (>80%), which compounds the problem as GC thrashing kicks in.
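The two thresholds above (memory >90% critical, CPU >80% compounding) can be combined into one check. A minimal sketch; the function name and flag strings are hypothetical, only the thresholds come from the text:

```python
def broker_resource_flags(heap_used_pct: float, cpu_used_pct: float) -> list[str]:
    """Flag broker resource exhaustion using the thresholds cited above."""
    flags = []
    if heap_used_pct > 90:
        flags.append("memory-critical")  # broker-resource-exhaustion threshold
    if cpu_used_pct > 80:
        flags.append("cpu-saturated")    # GC thrashing compounds heap pressure
    return flags

print(broker_resource_flags(95, 85))  # ['memory-critical', 'cpu-saturated']
```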
3. Identify producer-side buffer starvation and waiting threads
On the producer side, check `kafka.producer.waiting_threads` and `kafka.producer.available_buffer_bytes` across your application instances. If you're creating hundreds or thousands of producer instances (the bug), each gets its own buffer pool (`buffer.memory`, default 32 MB). You'll see many threads waiting for buffer space and `kafka.producer.available_buffer_bytes` dropping to zero across many producer instances. This is the smoking gun: legitimate applications reuse a single producer instance instead of creating thousands.
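The per-instance buffer pool makes the client-side cost of the bug easy to estimate. A quick arithmetic sketch using the 32 MB `buffer.memory` default; the helper name is illustrative:

```python
DEFAULT_BUFFER_MEMORY = 32 * 1024 * 1024  # producer buffer.memory default, 32 MiB

def aggregate_buffer_demand_gib(producer_instances: int) -> float:
    """Client-side memory claimed by per-instance producer buffer pools."""
    return producer_instances * DEFAULT_BUFFER_MEMORY / 2**30

# 1,000 leaked producers already claim 31.25 GiB of client heap,
# so most instances end up starved, with threads blocked on buffer space.
print(aggregate_buffer_demand_gib(1000))  # 31.25
```

This is why `available_buffer_bytes` hits zero almost everywhere: the aggregate demand is far beyond any realistic client heap, even though each individual pool is modest.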
4. Check request handler saturation and queue buildup
Look at `kafka.request.channel.queue.size` — if it's >100 and growing, brokers can't keep up with the request flood from thousands of producers. Cross-reference with the `request-handler-saturation-cascades-to-producer-latency` and `high-request-queue-size-indicates-broker-overload` insights. Each producer instance sends metadata requests, heartbeats, and produce requests, so excessive instantiation creates request storms that overwhelm broker I/O threads. This explains the cascading failure after heap pressure starts.
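The ">100 and growing" condition above implies looking at a short window of queue samples, not a single reading. A minimal sketch with an illustrative function name; the 100 threshold comes from the text:

```python
def request_queue_overloaded(queue_sizes: list[int], threshold: int = 100) -> bool:
    """True when the latest request queue sample is above threshold AND the
    queue is trending upward over the sampled window (sustained buildup)."""
    if len(queue_sizes) < 2:
        return False
    above_threshold = queue_sizes[-1] > threshold
    growing = queue_sizes[-1] > queue_sizes[0]
    return above_threshold and growing

print(request_queue_overloaded([40, 90, 150, 240]))  # True
```

A briefly elevated but draining queue (e.g. `[150, 120, 90]`) should not fire, which is why the trend check matters alongside the absolute threshold.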
5. Analyze the ratio of connections to actual throughput
Compare the growth in `kafka.network.connection_count` to `kafka.net.bytes_in.rate` or `kafka.network.bytes_in_rate` over the incident window. If connections spiked 10x but throughput stayed flat or only increased modestly, that's definitive proof of an application bug — you're creating producers without closing them, not handling legitimate load growth. In a healthy system, connection count should be relatively stable and proportional to the number of application instances, not proportional to requests or time.
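The "connections spiked 10x but throughput stayed flat" test can be expressed as a ratio comparison over the incident window. The growth cutoffs here are illustrative assumptions, not values from the source:

```python
def likely_producer_leak(conn_before: float, conn_after: float,
                         bytes_rate_before: float, bytes_rate_after: float,
                         conn_growth_min: float = 10.0,
                         throughput_growth_max: float = 1.5) -> bool:
    """Flag a producer leak when connection count explodes while
    bytes-in throughput stays roughly flat (assumed cutoffs: 10x / 1.5x)."""
    conn_growth = conn_after / conn_before
    throughput_growth = bytes_rate_after / bytes_rate_before
    return conn_growth >= conn_growth_min and throughput_growth <= throughput_growth_max

# 3k -> 2M connections while bytes-in barely moves: classic leak signature.
print(likely_producer_leak(3_000, 2_000_000, 50e6, 55e6))  # True
```

Legitimate load growth fails this test because connections and throughput scale together; the leak signature is growth in one axis only.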
6. Examine producer request patterns and in-flight request distribution
Check `kafka.producer.request_rate` and `kafka.producer.requests_in_flight` across your application instances. With proper producer reuse, you'd see high request rates from a small number of producers with reasonable in-flight requests (1-5 per producer). With the bug, you'll see thousands of producer instances each with low request rates and low in-flight counts — classic sign of producer proliferation. Also compare `kafka.producer.request_rate` to `kafka.request.produce_rate` on the broker to see the aggregate impact.
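The two shapes described above (few busy producers vs. many nearly idle ones) can be told apart from the per-instance request-rate distribution. The instance-count and rate cutoffs below are illustrative assumptions:

```python
def producer_pattern(instance_request_rates: list[float]) -> str:
    """Classify per-producer request rates as healthy reuse or proliferation.

    Assumed cutoffs: >100 instances each averaging <1 req/s suggests
    producer proliferation; a few high-rate instances suggests reuse.
    """
    if not instance_request_rates:
        return "no-data"
    count = len(instance_request_rates)
    mean_rate = sum(instance_request_rates) / count
    if count > 100 and mean_rate < 1.0:
        return "proliferation"  # many instances, each barely used
    return "healthy-reuse"      # few instances carrying the load

print(producer_pattern([0.2] * 500))      # proliferation
print(producer_pattern([800.0, 750.0]))   # healthy-reuse
```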
Related Insights

- Connection Count Approaching Limit (warning): When the active connection count approaches the configured maximum, new clients will be rejected, causing connection failures and application errors.
- Broker Resource Exhaustion Degrades Cluster Performance (warning)
- Request Handler Saturation Cascades to Producer Latency (warning): When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
- High Request Queue Size Indicates Broker Overload (warning): When the request queue grows, the broker cannot process incoming requests fast enough, leading to increased latency and potential client timeouts.
Monitoring Interfaces: Kafka Native