DataHub, Apache Kafka, Elasticsearch

Kafka Consumer Lag Masking Ingestion Failures

critical
reliability
Updated Jan 1, 2025

DataHub's asynchronous write architecture can hide processing failures, because metadata events are accepted onto Kafka before they are persisted. High Kafka consumer lag combined with ingestion warnings or failures indicates that events are being queued but are not successfully persisting to primary or search storage.

How to detect:

Watch for kafka_consumer_lag trending upward while the ingestion_failure or ingestion_warning counters increase. Cross-reference this with latency spikes in metadata_change_proposal_process_time, and check the Trace API for write failures in primary or search storage.
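The detection rule above can be sketched as a small check over metric samples. This is a minimal illustration, not DataHub's own alerting logic: the function names, the least-squares slope test, and the `min_lag_slope` threshold are all assumptions, and the samples are expected to come from whatever monitoring system already scrapes kafka_consumer_lag and the ingestion counters.

```python
from typing import Sequence


def slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def lag_masks_failures(
    lag_samples: Sequence[float],
    failure_counts: Sequence[float],
    min_lag_slope: float = 0.0,  # hypothetical threshold; tune per deployment
) -> bool:
    """True when lag trends upward AND the failure/warning counter grew.

    lag_samples: successive kafka_consumer_lag readings.
    failure_counts: successive readings of a cumulative counter such as
    ingestion_failure or ingestion_warning over the same window.
    """
    failures_grew = len(failure_counts) >= 2 and failure_counts[-1] > failure_counts[0]
    return slope(lag_samples) > min_lag_slope and failures_grew


if __name__ == "__main__":
    # Lag is climbing while failures accumulate: worth checking the Trace API.
    print(lag_masks_failures([100, 200, 400, 800], [3, 3, 5, 9]))
```

Requiring both conditions keeps the check from firing on lag alone (a burst of healthy traffic) or on failures alone (errors that are surfaced while the queue drains normally).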

Recommended action:

Use DataHub's Trace API with the trace IDs from systemMetadata to identify which URNs and aspects are failing. For writes in an ERROR or UNKNOWN state, check for aspect conflicts (e.g., between upstreamLineage and siblings), Elasticsearch mapping issues, or database version conflicts. If PENDING states dominate instead, the consumers are falling behind rather than failing, so scale the Kafka consumers.
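The triage step above can be sketched as follows, assuming the Trace API results have already been fetched and flattened into one row per URN and aspect. The `AspectWrite` type, the `triage` function, and the "more than half" reading of "dominate" are assumptions made for illustration, and the URNs are placeholders. The sketch does not call the Trace API or reproduce its response format.

```python
from typing import Iterable, List, NamedTuple, Tuple


class AspectWrite(NamedTuple):
    """One flattened trace result (hypothetical shape, not the API's schema)."""
    urn: str
    aspect: str
    primary_status: str  # write state reported for primary storage
    search_status: str   # write state reported for search storage


FAILED_STATES = {"ERROR", "UNKNOWN"}


def triage(writes: Iterable[AspectWrite]) -> Tuple[List[AspectWrite], bool]:
    """Return (failing writes, whether PENDING dominates).

    A write is failing if either store reports ERROR or UNKNOWN.
    PENDING "dominates" when more than half of all writes are pending
    without having failed (an assumed threshold).
    """
    rows = list(writes)
    failing: List[AspectWrite] = []
    pending_count = 0
    for w in rows:
        states = {w.primary_status, w.search_status}
        if states & FAILED_STATES:
            failing.append(w)
        elif "PENDING" in states:
            pending_count += 1
    return failing, pending_count * 2 > len(rows)


if __name__ == "__main__":
    results = [
        AspectWrite("urn:li:dataset:a", "upstreamLineage", "ERROR", "PENDING"),
        AspectWrite("urn:li:dataset:b", "siblings", "PENDING", "PENDING"),
        AspectWrite("urn:li:dataset:c", "status", "PENDING", "PENDING"),
    ]
    failing, scale_consumers = triage(results)
    for w in failing:
        print(f"investigate {w.urn} / {w.aspect}")
    if scale_consumers:
        print("PENDING dominates: consider scaling Kafka consumers")
```

Returning the failing rows separately from the scaling signal reflects that the two remedies are independent: a backlog can be mostly PENDING and still contain a few writes that will never succeed until the conflict or mapping issue is fixed.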