Broker Failure and Excessive Producer Instantiation
Incident Response (severity: critical)
An application bug that instantiates producers excessively leads to broker heap exhaustion and cascading failures.
Prompt: “Our Kafka brokers just crashed with OutOfMemoryError. I see millions of producer connections being tracked. Did something in our application code go wrong?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating broker OOM with excessive producer connections, start by confirming the connection spike and correlating it with heap exhaustion. Then examine producer-side symptoms like buffer starvation and waiting threads to confirm multiple producer instances. Finally, check request handler saturation and compare connection growth to actual throughput — if connections explode but throughput stays flat, you've confirmed an application bug creating producers without closing them.
1. Confirm the connection explosion and compare it to broker limits
First, check `kafka.network.connection_count` to quantify the connection spike — you mentioned millions, which is abnormal for most clusters. Compare this to your configured `max.connections.per.broker` limit (typically 2000-5000). The `connection-count-approaching-limit` insight warns when you hit 80% of max, but if you're seeing OOM, you've likely exceeded it or the sheer number of connection objects exhausted heap before hitting the limit. This confirms whether it's truly a connection problem versus other heap issues.
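The 80%-of-max warning described above can be sketched as a simple classifier. This is an illustrative helper, not a real API; the function name and labels are assumptions, and only the 80% threshold comes from the insight described here.

```python
def connection_limit_status(connection_count: int, max_connections: int) -> str:
    """Classify broker connection pressure against the configured maximum.

    Illustrative sketch: "approaching" mirrors the 80%-of-max warning
    threshold of the connection-count-approaching-limit insight.
    """
    ratio = connection_count / max_connections
    if ratio >= 1.0:
        return "exceeded"     # new client connections will be rejected
    if ratio >= 0.8:
        return "approaching"  # the warning insight fires in this band
    return "ok"

# Millions of tracked producers blow far past a typical 5000 cap:
print(connection_limit_status(2_000_000, 5000))  # exceeded
```

A "millions of connections" reading lands in `exceeded` immediately, which is why heap can be gone before the limit check ever helps.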
2. Verify broker heap exhaustion and resource saturation
Check JVM heap usage on the affected brokers — you should see memory utilization >90% before the OOM occurred, and likely heavy GC activity. The `broker-resource-exhaustion` insight flags memory >90% as critical. Each producer connection consumes heap for buffers, metadata, and connection state, so thousands of producers can easily exhaust a 4-8GB heap. Also check if CPU was saturated (>80%), which compounds the problem as GC thrashing kicks in.
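The two thresholds above (memory >90% critical, CPU >80% compounding) can be combined into one check. A minimal sketch; the function name and flag strings are hypothetical, only the thresholds come from the text:

```python
def broker_resource_flags(heap_used_pct: float, cpu_used_pct: float) -> list[str]:
    """Flag broker resource exhaustion using the thresholds cited above."""
    flags = []
    if heap_used_pct > 90:
        flags.append("memory-critical")  # broker-resource-exhaustion threshold
    if cpu_used_pct > 80:
        flags.append("cpu-saturated")    # GC thrashing compounds heap pressure
    return flags

print(broker_resource_flags(95, 85))  # ['memory-critical', 'cpu-saturated']
```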
3. Identify producer-side buffer starvation and waiting threads
On the producer side, check `kafka.producer.waiting_threads` and `kafka.producer.available_buffer_bytes` across your application instances. If you're creating hundreds or thousands of producer instances (the bug), each gets its own buffer pool (`buffer.memory`, default 32 MB). You'll see many threads waiting for buffer space and `kafka.producer.available_buffer_bytes` dropping to zero across many producer instances. This is the smoking gun: legitimate applications reuse a single producer instance instead of creating thousands.
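The per-instance buffer pool makes the client-side cost of the bug easy to estimate. A quick arithmetic sketch using the 32 MB `buffer.memory` default; the helper name is illustrative:

```python
DEFAULT_BUFFER_MEMORY = 32 * 1024 * 1024  # producer buffer.memory default, 32 MiB

def aggregate_buffer_demand_gib(producer_instances: int) -> float:
    """Client-side memory claimed by per-instance producer buffer pools."""
    return producer_instances * DEFAULT_BUFFER_MEMORY / 2**30

# 1,000 leaked producers already claim 31.25 GiB of client heap,
# so most instances end up starved, with threads blocked on buffer space.
print(aggregate_buffer_demand_gib(1000))  # 31.25
```

This is why `available_buffer_bytes` hits zero almost everywhere: the aggregate demand is far beyond any realistic client heap, even though each individual pool is modest.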
4. Check request handler saturation and queue buildup
Look at `kafka.request.channel.queue.size` — if it's >100 and growing, brokers can't keep up with the request flood from thousands of producers. Cross-reference with the `request-handler-saturation-cascades-to-producer-latency` and `high-request-queue-size-indicates-broker-overload` insights. Each producer instance sends metadata requests, heartbeats, and produce requests, so excessive instantiation creates request storms that overwhelm broker I/O threads. This explains the cascading failure after heap pressure starts.
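The ">100 and growing" condition above implies looking at a short window of queue samples, not a single reading. A minimal sketch with an illustrative function name; the 100 threshold comes from the text:

```python
def request_queue_overloaded(queue_sizes: list[int], threshold: int = 100) -> bool:
    """True when the latest request queue sample is above threshold AND the
    queue is trending upward over the sampled window (sustained buildup)."""
    if len(queue_sizes) < 2:
        return False
    above_threshold = queue_sizes[-1] > threshold
    growing = queue_sizes[-1] > queue_sizes[0]
    return above_threshold and growing

print(request_queue_overloaded([40, 90, 150, 240]))  # True
```

A briefly elevated but draining queue (e.g. `[150, 120, 90]`) should not fire, which is why the trend check matters alongside the absolute threshold.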
5. Analyze the ratio of connections to actual throughput
Compare the growth in `kafka.network.connection_count` to `kafka.net.bytes_in.rate` or `kafka.network.bytes_in_rate` over the incident window. If connections spiked 10x but throughput stayed flat or only increased modestly, that's definitive proof of an application bug — you're creating producers without closing them, not handling legitimate load growth. In a healthy system, connection count should be relatively stable and proportional to the number of application instances, not proportional to requests or time.
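The "connections spiked 10x but throughput stayed flat" test can be expressed as a ratio comparison over the incident window. The growth cutoffs here are illustrative assumptions, not values from the source:

```python
def likely_producer_leak(conn_before: float, conn_after: float,
                         bytes_rate_before: float, bytes_rate_after: float,
                         conn_growth_min: float = 10.0,
                         throughput_growth_max: float = 1.5) -> bool:
    """Flag a producer leak when connection count explodes while
    bytes-in throughput stays roughly flat (assumed cutoffs: 10x / 1.5x)."""
    conn_growth = conn_after / conn_before
    throughput_growth = bytes_rate_after / bytes_rate_before
    return conn_growth >= conn_growth_min and throughput_growth <= throughput_growth_max

# 3k -> 2M connections while bytes-in barely moves: classic leak signature.
print(likely_producer_leak(3_000, 2_000_000, 50e6, 55e6))  # True
```

Legitimate load growth fails this test because connections and throughput scale together; the leak signature is growth in one axis only.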
6. Examine producer request patterns and in-flight request distribution
Check `kafka.producer.request_rate` and `kafka.producer.requests_in_flight` across your application instances. With proper producer reuse, you'd see high request rates from a small number of producers with reasonable in-flight requests (1-5 per producer). With the bug, you'll see thousands of producer instances each with low request rates and low in-flight counts — classic sign of producer proliferation. Also compare `kafka.producer.request_rate` to `kafka.request.produce_rate` on the broker to see the aggregate impact.
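The two shapes described above (few busy producers vs. many nearly idle ones) can be told apart from the per-instance request-rate distribution. The instance-count and rate cutoffs below are illustrative assumptions:

```python
def producer_pattern(instance_request_rates: list[float]) -> str:
    """Classify per-producer request rates as healthy reuse or proliferation.

    Assumed cutoffs: >100 instances each averaging <1 req/s suggests
    producer proliferation; a few high-rate instances suggests reuse.
    """
    if not instance_request_rates:
        return "no-data"
    count = len(instance_request_rates)
    mean_rate = sum(instance_request_rates) / count
    if count > 100 and mean_rate < 1.0:
        return "proliferation"  # many instances, each barely used
    return "healthy-reuse"      # few instances carrying the load

print(producer_pattern([0.2] * 500))      # proliferation
print(producer_pattern([800.0, 750.0]))   # healthy-reuse
```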
Related Insights

- Connection Count Approaching Limit (warning): When the active connection count approaches the configured maximum, new clients will be rejected, causing connection failures and application errors.
- Broker Resource Exhaustion Degrades Cluster Performance (warning)
- Request Handler Saturation Cascades to Producer Latency (warning): When Kafka request handlers are saturated (low idle percentage), producer requests queue up, increasing end-to-end latency and potentially triggering producer-side timeouts.
- High Request Queue Size Indicates Broker Overload (warning): When the request queue grows, the broker cannot process incoming requests fast enough, leading to increased latency and potential client timeouts.
Monitoring Interfaces: Kafka Native