Apache Flink Metric

kafka.consumer_group.lag

Consumer group lag for a topic partition

Dimensions: None
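As a rough illustration of what this metric measures: per-partition lag is conventionally the partition's log end offset minus the consumer group's committed offset. The sketch below assumes both offsets have already been fetched into plain dictionaries. The helper name `partition_lag`, the topic name `orders`, and the offsets are made up for the example and are not part of any Kafka client.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition) pair, illustrative alias


def partition_lag(
    log_end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return lag per partition as log end offset minus committed offset.

    A partition with no committed offset is treated as lagging from
    offset 0; that is an assumption of this sketch.
    """
    lag = {}
    for tp, end in log_end_offsets.items():
        committed = committed_offsets.get(tp, 0)
        lag[tp] = max(end - committed, 0)  # clamp so lag is never negative
    return lag


# Made-up offsets for a hypothetical topic "orders" with two partitions.
ends = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1450, ("orders", 1): 900}
lags = partition_lag(ends, committed)
print(lags)
```

Keeping the result keyed by partition, rather than summing it, makes a single lagging partition visible.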

Technical Annotations (63)

Configuration Parameters (13)

fetch.min.bytes

Consumer setting controlling the minimum amount of data the broker returns per fetch request

max.poll.records

Consumer setting capping the number of records returned by a single poll() call

enable.auto.commit (recommended: false)

Prevents offset commits before message processing completes during rebalancing

max.poll.interval.ms (recommended: 900000)

Upper bound (15 minutes) on the time between poll() calls; size it for worst-case processing logic, including GC pauses and retries

session.timeout.ms (recommended: 45000)

Detects death of the heartbeat thread (for example, a JVM crash); should remain relatively low

heartbeat.interval.ms (recommended: 15000)

Should be no more than 1/3 of session.timeout.ms so that a few missed heartbeats do not expire the session

group.instance.id (recommended: change to a unique value or remove)

A persistent ID causes the Group Coordinator to keep partitions reserved for a failing consumer

partition.assignment.strategy (recommended: CooperativeSticky)

Enables incremental cooperative rebalancing to reduce disruption

follower.replication.throttled.rate (recommended: > 1 MB/s)

Minimum for accurate throttling behavior during recovery

leader.replication.throttled.rate (recommended: > 1 MB/s)

Minimum for accurate throttling behavior during recovery

delete.retention.ms (recommended: 86400000)

Default of 1 day; how long tombstone markers are retained on compacted topics

cleanup.policy (recommended: compact)

Setting applies only to compacted topics

retention.ms (recommended: 259200000)

3 days (example for windowed processing with late arrivals and recovery)
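The consumer-side recommendations above can be collected into one configuration and sanity-checked before deployment. The following is a minimal sketch: `check_timeouts` and `ms_to_days` are hypothetical helpers written for this example, not part of any Kafka client, and the values are simply the ones listed above.

```python
# Values copied from the recommendations above; treat them as a starting
# point for tuning, not as universal defaults.
consumer_config = {
    "enable.auto.commit": False,
    "max.poll.interval.ms": 900_000,   # 15 minutes
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 15_000,
}


def check_timeouts(cfg):
    """Return a list of human-readable problems found in the settings.

    Hypothetical helper for illustration only.
    """
    problems = []
    session = cfg["session.timeout.ms"]
    heartbeat = cfg["heartbeat.interval.ms"]
    # Heartbeats should fit at least three times into one session timeout.
    if heartbeat * 3 > session:
        problems.append("heartbeat.interval.ms exceeds 1/3 of session.timeout.ms")
    # Auto-commit defaults to true when the key is absent.
    if cfg.get("enable.auto.commit", True):
        problems.append("enable.auto.commit is on; offsets may commit before processing")
    return problems


def ms_to_days(ms):
    """Convert a millisecond setting such as retention.ms to days."""
    return ms / 86_400_000


print(check_timeouts(consumer_config))
print(ms_to_days(259_200_000))
```

The same conversion confirms that the recommended delete.retention.ms of 86400000 is one day.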

Error Signatures (3)

Schema Mismatch (exception)

poll timeout expired (log pattern)

left group (log pattern)
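The two log patterns above can be counted with a simple scan over consumer logs. This is a sketch under the assumption that log lines are available as plain strings; the sample lines are invented for illustration, and the exact wording of real client log messages may differ by client and version.

```python
import re
from collections import Counter

# Patterns copied from the error signatures listed above.
SIGNATURES = {
    "poll_timeout": re.compile(r"poll timeout expired", re.IGNORECASE),
    "left_group": re.compile(r"left group", re.IGNORECASE),
}


def count_signatures(lines):
    """Count how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Invented sample lines, not real client output.
sample_lines = [
    "WARN consumer poll timeout expired, member will leave",
    "INFO Member consumer-1 left group my-consumer-group",
    "INFO Committed offset 42 for partition orders-0",
]
counts = count_signatures(sample_lines)
print(dict(counts))
```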

CLI Commands (5)

diagnostic: kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group

remediation: consumer.subscribe(['topic'], on_assign=on_assign, on_revoke=on_revoke)

remediation: producer.produce(topic, key=str(user_id), value=value, partition=hash(user_id) % 4)

diagnostic: admin_client.list_consumer_group_offsets('my_group')

monitoring: kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups
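The `--describe` output can be post-processed to flag individual partitions rather than eyeballing the table. The sketch below assumes the whitespace-separated column layout printed by recent Kafka versions (including `TOPIC`, `PARTITION`, and `LAG` columns) and locates columns by header name in case the layout differs. The helpers `parse_describe_output` and `lagging_partitions`, and the sample text, are invented for illustration.

```python
def parse_describe_output(text):
    """Parse whitespace-separated --describe output into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "TOPIC" in fields and "LAG" in fields:
            header = fields  # header row names the columns
            continue
        if header and len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def lagging_partitions(rows, threshold):
    """Return (topic, partition, lag) for rows whose lag exceeds threshold."""
    result = []
    for row in rows:
        # The tool prints "-" when a value is unknown; skip those rows.
        if row["LAG"].isdigit() and int(row["LAG"]) > threshold:
            result.append((row["TOPIC"], int(row["PARTITION"]), int(row["LAG"])))
    return result


# Invented sample output, shaped like the tool's table.
sample = """
GROUP             TOPIC  PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG  CONSUMER-ID HOST      CLIENT-ID
my-consumer-group orders 0         1450           1500           50   consumer-1  /10.0.0.1 client-1
my-consumer-group orders 1         900            5900           5000 consumer-2  /10.0.0.2 client-2
"""
rows = parse_describe_output(sample)
flagged = lagging_partitions(rows, threshold=1000)
print(flagged)
```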

Technical References (42)

EventLoop (component)
RSS (concept)
collectDefaultMetrics (component)
consumer lag (concept)
offset commits (concept)
indexing latency (concept)
bulk requests (concept)
Elasticsearch (component)
CooperativeStickyAssignor (component)
KIP-62 (concept)
Application Thread (component)
Heartbeat Thread (component)
Group Coordinator (component)
Static Membership (concept)
group.instance.id (component)
Zombie Partition (concept)
Latent Partition Starvation (concept)
Cooperative Rebalancing (protocol)
consumer group (component)
ISR churn (concept)
page cache hit ratio (concept)
rebalance (concept)
partition (concept)
partition assignment (concept)
offset (concept)
ISR (concept)
Confluent Control Center (component)
partition key hashing (concept)
custom partitioner (component)
KafkaAdminClient (component)
JMX metrics (concept)
follower.replication.throttled.replicas (component)
leader.replication.throttled.replicas (component)
log compaction (concept)
tombstone markers (concept)
assign() (component)
subscribe() (component)
consumer group coordinator (concept)
Single Message Transforms (component)
SMT (component)
Kafka Streams (component)
ksqlDB (component)

Related Insights (32)

EventLoop lag blocks Kafka consumer causing consumer group lag (critical)

Service restart temporarily masks memory leak by resetting Kafka consumer lag (warning)

Suboptimal consumer fetch settings increase latency (info)

Poison pill message causes consumer infinite retry loop on single partition (critical)

Aggregated metrics hide single-partition consumer failures (warning)

Consumer group Stable state misleading when partition stuck (warning)

Partition shows continuous writes but zero reads indicates stuck consumer (critical)

Consumer lag prevents real-time message processing (warning)

Consumer offset stops advancing indicating stuck consumer (critical)

Consumer lag increases steadily due to slow processing (warning)

Downstream Elasticsearch saturation manifests as Kafka consumer lag spike (critical)

Consumer lag alerts without downstream latency monitoring cause misdiagnosis (warning)

Auto-commit enabled causes silent data loss during consumer rebalancing (critical)

Rebalance spiral livelock prevents partition processing when max.poll.interval.ms exceeded (critical)

Static membership holds partitions hostage when consumer fails repeatedly (critical)

Cooperative rebalancing masks partition starvation in aggregate metrics (warning)

Consumer lag escalates rapidly while broker health metrics remain normal (critical)

Consumer poll timeout causes rebalance loops (warning)

Consumer lag spike without root cause visibility (warning)

Gradual lag accumulation goes undetected (warning)

Frequent consumer rebalances cause consumer lag (warning)

Missing monitoring of ISR shrinks and consumer lag delays incident response (warning)

Consumer group misconfiguration causes lag and duplication (warning)

Poor partition key selection creates hot partitions (warning)

Silent failures occur without monitoring and alerting (warning)

High or growing consumer lag indicates processing bottleneck (warning)

Insufficient replication throttle causes cluster overload during recovery (warning)

Consumer reads from compacted topics fail when scan time exceeds tombstone retention (warning)

Consumer lag exceeds retention window causing data loss (warning)

Consumer lag metrics missing for consumer groups (warning)

SMTs introduce latency under heavy traffic with complex enrichment (warning)

Sink connector tasks accumulating lag (warning)
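Several of the insights above describe the same symptom: a partition keeps receiving writes while its committed offset stops advancing. One way to check for it, sketched here under the assumption that two offset snapshots taken some time apart are available as dictionaries, is to compare them per partition. The helper `find_stuck_partitions` and the sample offsets are invented for illustration.

```python
def find_stuck_partitions(before, after):
    """Return partitions with new writes but no committed-offset progress.

    Both arguments map (topic, partition) -> (committed_offset, log_end_offset).
    """
    stuck = []
    for tp, (committed_now, end_now) in after.items():
        if tp not in before:
            continue  # no earlier snapshot to compare against
        committed_prev, end_prev = before[tp]
        writes_continue = end_now > end_prev
        reads_stalled = committed_now == committed_prev
        if writes_continue and reads_stalled:
            stuck.append(tp)
    return sorted(stuck)


# Made-up snapshots: partition 0 keeps up, partition 1 has stalled.
before = {("orders", 0): (100, 120), ("orders", 1): (200, 220)}
after = {("orders", 0): (160, 180), ("orders", 1): (200, 400)}
stuck = find_stuck_partitions(before, after)
print(stuck)
```

Because the comparison is per partition, a stalled partition is reported even when the group as a whole still appears to make progress.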