kafka.consumer.records_lag_max
Maximum consumer lag
Interface Metrics (2)
Related Insights (13)
When consumer processing rate falls below production rate, lag grows continuously until retention limits are reached, risking data loss. Lag trends reveal impending failures before customers notice.
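The race between lag growth and the retention window can be sketched with simple arithmetic. This is a minimal model with hypothetical rates, not a measured formula; in practice the inputs come from your metrics store.

```python
# Sketch: estimate hours until the oldest unconsumed message ages past
# retention. Rates and the linear-growth model are simplifying assumptions.

def hours_until_data_loss(produce_rate, consume_rate,
                          current_lag, retention_hours):
    """Model: producer offset = current_lag + produce_rate * t,
    consumer offset = consume_rate * t (messages/hour). The age of the
    oldest unconsumed message grows at (1 - consume_rate/produce_rate)
    hours per hour, starting from current_lag / produce_rate. Loss
    begins when that age reaches the retention window."""
    if consume_rate >= produce_rate and current_lag == 0:
        return float("inf")                       # lag is not growing
    age_growth = 1 - consume_rate / produce_rate  # hours of age per hour
    initial_age = current_lag / produce_rate      # backlog already aged
    if age_growth <= 0:
        # Consumer is catching up; loss only if backlog is already too old.
        return float("inf") if initial_age < retention_hours else 0.0
    return max(0.0, (retention_hours - initial_age) / age_growth)
```

For example, producing 1,000 msgs/hour while consuming only 500 against a 24-hour retention window leaves roughly 48 hours before deletions outpace the consumer.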
When one partition consistently shows higher lag than others, it indicates uneven key distribution or specific message types requiring more processing time, creating a processing bottleneck.
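A skewed partition can be surfaced by comparing per-partition lag against the group's median. The 2x-median threshold below is an assumption to tune, not a standard.

```python
# Sketch: flag partitions whose lag is far above the rest, a sign of
# uneven key distribution or expensive message types on that partition.
from statistics import median

def hot_partitions(lag_by_partition, factor=2.0):
    """Return partition ids whose lag exceeds `factor` times the
    median lag across the consumer group."""
    med = median(lag_by_partition.values())
    return sorted(p for p, lag in lag_by_partition.items()
                  if med > 0 and lag > factor * med)
```

A group reporting lags {0: 100, 1: 120, 2: 5000} would flag partition 2, pointing at a hot key or a message type that needs cheaper processing.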
When Lambda function timeout is increased without considering batch processing dynamics, functions may process fewer batches per unit time, paradoxically increasing overall lag despite having more time per invocation.
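The paradox is simple throughput arithmetic: a poller pipeline processes batches sequentially, so longer invocations mean fewer batches per unit time. The durations below are illustrative, not measured.

```python
# Sketch: why a longer timeout can reduce per-poller throughput.
# Numbers are hypothetical examples.

def batches_per_hour(batch_duration_s):
    """Batches a single sequential poller pipeline completes per hour."""
    return 3600 / batch_duration_s

# If a raised timeout lets each invocation linger for 300 s (e.g. on
# retries or slow downstream calls) instead of completing in 60 s:
slow = batches_per_hour(300)  # 12 batches/hour
fast = batches_per_hour(60)   # 60 batches/hour
```

Same records per batch, one fifth the drain rate: lag grows even though each invocation has more time.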
When minimum event pollers are set too high relative to actual throughput requirements, Lambda ESM provisioned mode incurs unnecessary costs. Each event poller handles up to 5 MB/sec or 5 concurrent invocations for Kafka.
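Given the 5 MB/sec-per-poller figure above, the minimum poller count can be sized from measured peak throughput rather than guessed. The workload numbers here are hypothetical.

```python
import math

# Sketch: size minimum event pollers from peak throughput. The
# 5 MB/s-per-poller capacity is from the insight above; the example
# workload is an assumption.

def min_pollers_needed(peak_mb_per_s, mb_per_poller=5.0):
    """Smallest poller count that covers peak throughput."""
    return max(1, math.ceil(peak_mb_per_s / mb_per_poller))
```

A topic peaking at 12 MB/sec needs 3 pollers; a default of, say, 10 would bill for 7 idle ones.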
When Lambda functions consuming from Kafka (MSK or self-managed) experience throttling due to concurrency limits, Kafka offset lag increases, creating a feedback loop where backed-up messages cause further Lambda invocations that hit throttle limits.
Lambda's on-demand Kafka event pollers scale based on offset lag evaluation every minute, but the autoscaling process takes up to three minutes to complete. High offset lag combined with low event poller counts indicates insufficient polling capacity before autoscaling can respond.
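The lag accumulated inside that autoscaling window is easy to bound. The three-minute delay is from the insight above; the message rates are hypothetical.

```python
# Sketch: records added to offset lag while on-demand pollers scale up.
# Rates are illustrative; the 3-minute delay is the worst case cited above.

def lag_during_scale_up(produce_rate_per_min, consume_rate_per_min,
                        scale_delay_min=3):
    """Lag accumulated before added pollers take effect."""
    deficit = max(0, produce_rate_per_min - consume_rate_per_min)
    return deficit * scale_delay_min
```

At 10,000 msgs/min in and 4,000 msgs/min out, a 3-minute scale-up window adds 18,000 records of lag before capacity catches up, which is why sustained spikes favor provisioned mode.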
When Lambda timeout is increased from default to maximum (15 minutes) for Kafka event source mappings, execution duration increases but offset lag continues to grow, indicating the timeout increase is masking an underlying processing bottleneck rather than solving throughput issues.
Lambda limits MaximumPollers to the number of Kafka topic partitions to maintain ordered processing within partitions. When a topic has few partitions relative to message volume, Lambda cannot scale event pollers sufficiently, creating a throughput ceiling regardless of provisioned capacity.
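Combining the partition cap with the per-poller capacity gives a hard throughput ceiling. The 5 MB/sec figure is carried over from the earlier insight; the partition counts are examples.

```python
# Sketch: partition count caps poller count, which caps throughput.
# Per-poller capacity is from the insight above; topic sizes are examples.

def max_throughput_mb_s(partition_count, mb_per_poller=5.0):
    """Throughput ceiling when MaximumPollers is capped at the
    partition count, regardless of provisioned capacity."""
    return partition_count * mb_per_poller
```

A 4-partition topic tops out around 20 MB/sec; sustaining 30 MB/sec requires repartitioning the topic, not more provisioned pollers.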
For Lambda functions with asynchronous invocation patterns consuming Kafka events, AsyncEventAge increasing alongside Kafka offset lag indicates events are queuing in Lambda's internal queue before invocation, creating double-buffering that delays processing and increases the risk of event loss once retries are exhausted.

When Kafka under-replicated partitions increase (kafka.replication.under_replicated_partitions) or the ISR shrinks (kafka.server.ReplicaManager.IsrShrinksPerSec rises), Lambda event source mappings may experience increased fetch latency and offset lag as the cluster struggles to maintain replication consistency.
Frequent Kafka consumer group rebalances (detected via kafka_consumergroup_members changes) can trigger Lambda function restarts (fullRestarts metric), causing processing interruptions, increased cold starts (InitDuration), and temporary offset lag spikes as Lambda event source mappings rejoin the consumer group.
When oldest message age approaches retention limit and consumer lag is high, messages will be deleted before consumption, causing data loss.
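A simple alert condition captures this race between retention and consumption. The metric inputs and the 80% safety margin below are assumptions to adapt to your alerting setup.

```python
# Sketch: alert when the oldest message has burned through most of the
# retention window while a backlog still exists. The 0.8 margin is an
# assumed safety threshold, not a standard.

def at_risk_of_data_loss(oldest_message_age_s, retention_s,
                         consumer_lag, margin=0.8):
    """True when unconsumed messages are close to being deleted."""
    return consumer_lag > 0 and oldest_message_age_s >= margin * retention_s
```

With 24-hour retention (86,400 s), an oldest-message age past ~19.2 hours with any remaining lag should page before deletion starts, not after.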
Frequent changes in consumer group membership trigger rebalances, causing processing pauses, increased latency, and temporary unavailability.