Apache Kafka insights

Rebalance spiral livelock prevents partition processing when max.poll.interval.ms is exceeded

kafka.consumer_group.lag

Static membership holds partitions hostage when consumer fails repeatedly

kafka.consumer_group.lag

Cooperative rebalancing masks partition starvation in aggregate metrics

kafka.consumer_group.lag kafka_server_brokertopicmetrics_messagesinpersec

Heartbeat thread stability masks application thread processing failures

Poison pill messages cause deterministic partition processing failure

Async processing in worker threads breaks offset commit guarantees

Consumer lag escalates rapidly while broker health metrics remain normal

kafka.consumer_group.lag kafka.server.ReplicaManager.UnderReplicatedPartitions kafka.server.ReplicaManager.IsrShrinksPerSec

medium.com

Broker Bytes In/Out Imbalance Suggests Uneven Load Distribution [warning]

When some brokers show significantly higher network traffic than others, it indicates uneven partition distribution or leadership imbalance, causing inefficient resource utilization.

kafka.network.bytes_in_rate kafka.network.bytes_out_rate kafka.broker.partition_count +1 more
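
One way to put a number on this imbalance is to compare the busiest broker against the cluster average. The sketch below is illustrative only: the function name, the sample rates, and the max/mean ratio as a measure are assumptions, not something the insight itself prescribes.

```python
from statistics import mean


def broker_imbalance_ratio(bytes_in_by_broker: dict[int, float]) -> float:
    """Return max/mean of per-broker bytes-in rates (1.0 means perfectly even)."""
    rates = list(bytes_in_by_broker.values())
    if not rates:
        return 1.0
    avg = mean(rates)
    if avg == 0:
        return 1.0
    return max(rates) / avg


# Hypothetical sample: broker id -> bytes in per second
sample = {1: 40e6, 2: 10e6, 3: 10e6}
print(f"imbalance ratio: {broker_imbalance_ratio(sample):.2f}")  # imbalance ratio: 2.00
```

A ratio near 1.0 suggests even traffic; what counts as "significantly higher" is a judgment call for each cluster.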

Schema Registry Subject Count Growth Without Cleanup [info]

Unbounded growth of schema subjects and versions can lead to registry performance degradation, increased memory usage, and slower schema lookups.

kafka.schema_registry.subjects kafka.schema_registry.versions

Consumer Group Member Instability from Frequent Rebalances [warning]

Frequent changes in consumer group membership trigger rebalances, causing processing pauses, increased latency, and temporary unavailability.

kafka_consumergroup_members kafka.consumer.records_lag_max

Raft Commit Latency Spike Delays Metadata Propagation [warning]

In KRaft mode, high Raft commit latency delays metadata changes from propagating through the cluster, causing stale metadata and operational delays.

kafka.raft.commit_latency_avg kafka.raft.commit_latency_max kafka.raft.append_records_rate

Raft Metadata Apply Errors Indicate Controller Issues [critical]

In KRaft mode, metadata apply errors indicate the controller is failing to apply metadata changes, potentially causing inconsistent cluster state.

kafka.raft.metadata_apply_error_count kafka.raft.metadata_load_error_count

Connection Count Approaching Limit [warning]

When the active connection count approaches the configured maximum, the broker is close to rejecting new clients. Once the limit is reached, new connections fail and applications see connection errors.

kafka.network.connection_count
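
A minimal sketch of the check described above, assuming the configured maximum is known to the monitoring code. The function name, the 80% warning ratio, and the sample numbers are hypothetical.

```python
def connection_utilization(active: int, max_connections: int,
                           warn_ratio: float = 0.8) -> tuple[float, bool]:
    """Return (utilization, should_warn) for a broker's connection count."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active / max_connections
    return utilization, utilization >= warn_ratio


# Hypothetical values: 850 active connections against a limit of 1000
utilization, warn = connection_utilization(850, 1000)
print(f"{utilization:.0%} used, warn={warn}")  # 85% used, warn=True
```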

Message Conversion Overhead from Legacy Clients [warning]

When brokers perform message format conversion for legacy clients, it adds significant CPU overhead and latency, reducing overall throughput.

kafka.network.produce_message_conversions_rate kafka.request.produce_time_99p

Producer Request Expiration Indicates Timeout Issues [warning]

When producer requests expire before the broker responds, it indicates broker overload, network issues, or an overly aggressive producer timeout configuration.

kafka.producer.request_expiration_rate kafka.expires_sec kafka.request.produce_time_99p

Consumer Fetch Latency Spike from Broker Overload [warning]

When consumer fetch latency increases significantly, it indicates the broker is slow to respond to fetch requests, often due to disk I/O, CPU saturation, or competing produce traffic.

kafka.consumer.fetch_latency_avg kafka.request.fetch_consumer_time_avg kafka.request.handler_idle_percent +1 more

Preferred Leader Imbalance Reduces Cluster Efficiency [info]

When partitions do not use their preferred leader, cluster load becomes unbalanced, reducing throughput and increasing latency as some brokers handle disproportionate leadership.

kafka.partition.leader_is_preferred kafka.broker.leader_count
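
The sketch below shows one way to measure this, relying on the fact that a partition's preferred leader is the first broker in its replica list. The input shape, function name, and sample assignment are assumptions made for illustration.

```python
from collections import Counter


def preferred_leader_stats(partitions):
    """partitions: list of (current_leader, replica_list) tuples.

    Returns the fraction of partitions not led by their preferred leader
    and the number of partitions each broker currently leads.
    """
    total = len(partitions)
    non_preferred = sum(1 for leader, replicas in partitions
                        if leader != replicas[0])
    leader_counts = Counter(leader for leader, _ in partitions)
    fraction = non_preferred / total if total else 0.0
    return fraction, dict(leader_counts)


# Hypothetical assignment: broker 1 leads three of four partitions
sample = [
    (1, [1, 2, 3]),
    (1, [2, 3, 1]),
    (1, [3, 1, 2]),
    (2, [2, 1, 3]),
]
fraction, leaders = preferred_leader_stats(sample)
print(fraction)  # 0.5
print(leaders)   # {1: 3, 2: 1}
```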

Under Min ISR Partitions Block Producer Writes [critical]

When a partition's in-sync replica (ISR) count drops below min.insync.replicas, producers configured with acks=all fail with NotEnoughReplicasException. The broker rejects these writes to preserve durability guarantees.

kafka.replication.under_min_isr_partitions_count kafka.request.produce_failed_rate kafka.replication.under_replicated_partitions
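
The rule above reduces to a single comparison. This is a sketch of the decision only, not broker code; the function name and the replication settings in the example are hypothetical.

```python
def acks_all_write_accepted(isr_size: int, min_insync_replicas: int) -> bool:
    """True if an acks=all produce request can be accepted for the partition."""
    return isr_size >= min_insync_replicas


# Hypothetical topic: replication factor 3, min.insync.replicas=2
for isr_size in (3, 2, 1):
    ok = acks_all_write_accepted(isr_size, min_insync_replicas=2)
    print(isr_size, "accepted" if ok else "rejected (NotEnoughReplicasException)")
```

With these settings the partition tolerates one replica falling out of the ISR, and a second loss blocks acks=all writes.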

Offline Partitions Indicate Total Partition Unavailability [critical]

When partitions go offline (all replicas unavailable), they cannot serve produce or consume requests, causing complete unavailability for affected data.

kafka.replication.offline_partitions_count kafka.partition.offline

Fetch Session Cache Thrashing from Consumer Pattern [warning]

When fetch sessions are evicted frequently, it indicates consumers are not maintaining stable fetch patterns, causing the broker to rebuild fetch contexts repeatedly and increasing CPU overhead.

kafka.session.fetch_eviction_rate kafka.session.fetch_count

ZooKeeper Session Expiration Causing Broker Instability [critical]

When ZooKeeper sessions expire, brokers re-register, causing controller changes, partition leadership changes, and temporary unavailability. Frequent expirations indicate ZooKeeper or network issues.

kafka.zookeeper.expire_rate kafka.server.SessionExpireListener.ZooKeeperExpiresPerSec.OneMinuteRate kafka.zookeeper.disconnect_rate +1 more

Log Flush Latency Spikes Causing Write Stalls [warning]

When log flush operations take excessive time, produce requests are delayed as Kafka waits for data to be flushed to disk, impacting producer latency and throughput.

kafka.log.flush_rate kafka.request.produce_time_99p kafka.log.LogFlushStats.LogFlushRateAndTimeMs.Percentile95th

High Request Queue Size Indicates Broker Overload [warning]

When request queue size grows, it indicates the broker cannot process incoming requests fast enough, leading to increased latency and potential client timeouts.

kafka.request.queue_size kafka.response.queue_size kafka.request.handler_idle_percent

Producer Buffer Exhaustion Causing Request Blocking [warning]

When producer buffer memory is exhausted, send calls block waiting for space, causing application threads to stall and reducing overall throughput.

kafka.producer.buffer_pool_wait_time kafka.producer.available_buffer_bytes kafka.producer.buffer_total_bytes +1 more

Topic Retention Approaching With Insufficient Consumer Throughput [critical]

When oldest message age approaches retention limit and consumer lag is high, messages will be deleted before consumption, causing data loss.

kafka.partition.oldest_offset kafka.consumer_group.offset kafka.topic.config.retention_ms +1 more
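
A rough worked example of the race between retention and consumption, under simplifying assumptions that are not part of the insight itself: constant produce and consume rates, purely time-based retention, and a known "time lag" (the age of the next message the consumer will read). Real deletion happens per log segment, so treat the result as an estimate. All names and numbers are hypothetical.

```python
import math


def seconds_until_data_loss(time_lag_s: float, retention_s: float,
                            produce_rate: float, consume_rate: float) -> float:
    """Estimate seconds until the consumer's next message ages past retention.

    With constant rates, the time lag grows by (1 - consume/produce) seconds
    for every second of wall-clock time when the consumer is slower.
    """
    headroom_s = retention_s - time_lag_s
    if headroom_s <= 0:
        return 0.0        # already at or past the retention limit
    if consume_rate >= produce_rate:
        return math.inf   # the consumer keeps up, so the lag does not grow
    growth_per_s = 1.0 - consume_rate / produce_rate
    return headroom_s / growth_per_s


# Hypothetical: 24 h retention, consumer is 20 h behind,
# producing 1000 msg/s and consuming 800 msg/s
eta = seconds_until_data_loss(20 * 3600, 24 * 3600, 1000, 800)
print(f"{eta / 3600:.1f} h until messages expire unread")  # 20.0 h until messages expire unread
```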

High or growing consumer lag indicates processing bottleneck

kafka.consumer_group.lag

docs.confluent.io

Kafka Partition Imbalance in Lambda Event Processing [critical]

Lambda limits MaximumPollers to the number of Kafka topic partitions to maintain ordered processing within partitions. When a topic has few partitions relative to message volume, Lambda cannot scale event pollers sufficiently, creating a throughput ceiling regardless of provisioned capacity.

kafka.topic.partitions kafka.topic.messages_in.rate kafka.consumer.records_lag_max

Kafka Prometheus

Lambda CloudWatch

docs.aws.amazon.com
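
The throughput ceiling described above can be illustrated with simple arithmetic. The per-poller processing rate is an assumed figure for the sake of the example, not a documented Lambda number, and the function name is hypothetical.

```python
def poller_throughput_ceiling(partitions: int, configured_max_pollers: int,
                              per_poller_rate: float) -> float:
    """Upper bound on messages/s when pollers are capped at the partition count."""
    effective_pollers = min(partitions, configured_max_pollers)
    return effective_pollers * per_poller_rate


# Hypothetical: 4 partitions, up to 20 pollers requested, 500 msg/s per poller
ceiling = poller_throughput_ceiling(4, 20, 500)
incoming = 5000
print(ceiling)                      # 2000
print(max(0, incoming - ceiling))   # 3000
```

In this example only four pollers can run, so raising provisioned capacity does not help; adding partitions does.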

Kafka Event Poller Autoscaling Lag Indicator [warning]

Lambda's on-demand Kafka event pollers scale based on offset lag evaluation every minute, but the autoscaling process takes up to three minutes to complete. High offset lag combined with low event poller counts indicates insufficient polling capacity before autoscaling can respond.

kafka.consumer.records_lag_max kafka.consumer.records_consumed_rate

Kafka Native

Lambda CloudWatch

docs.aws.amazon.com
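
To see why the scaling delay matters, the sketch below estimates how much lag builds up while autoscaling is still in progress. It assumes constant rates and uses the three-minute upper bound mentioned above; the function name and sample values are hypothetical.

```python
def lag_after_scale_delay(current_lag: int, in_rate: int, consume_rate: int,
                          scale_delay_s: int = 180) -> int:
    """Projected offset lag once a scale-up taking scale_delay_s completes."""
    growth_per_s = max(0, in_rate - consume_rate)
    return current_lag + growth_per_s * scale_delay_s


# Hypothetical: 10,000 messages behind, 1200 msg/s in, 1000 msg/s consumed
print(lag_after_scale_delay(10_000, 1200, 1000))  # 46000
```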

DataHub

Kafka Consumer Lag Ingestion Backlog