Missing metadata prevents reproduction of pipeline failures
Severity: critical
Category: availability
Updated: Feb 22, 2024
Technologies: Apache Spark, workflow orchestrators
How to detect:
Data pipeline failures cannot be debugged or reproduced because metadata about inputs, execution context, and data state is not captured. Engineers cannot determine which inputs caused a failure or in what state the data was when it occurred.
Recommended action:
Implement metadata logging for three categories:
(1) Pipeline state: run state and time taken per task, captured via the orchestrator and the Spark UI/History Server.
(2) What/how/where/when: function inputs, retries, execution engine, and execution time, captured with a log_metadata decorator.
(3) Data state: column statistics (min/max/mean/median), schema, natural key, and storage location, captured in dataclass structures.
Ensure every metadata record carries a unique run ID so a failure can be traced across systems.
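A minimal sketch of categories (2) and (3) in Python. The names here (log_metadata, DataState, the S3 path, the column names) are illustrative assumptions, not an established API; in practice the print calls would feed a real logger or metadata store:

```python
import functools
import statistics
import time
import uuid
from dataclasses import dataclass

# Unique run ID shared by every metadata record in this pipeline run,
# so records can be correlated across systems.
RUN_ID = str(uuid.uuid4())

def log_metadata(func):
    """Capture what/how/where/when for a task: inputs, duration, outcome."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        record = {
            "run_id": RUN_ID,
            "task": func.__name__,
            "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
        }
        try:
            result = func(*args, **kwargs)
            record["status"] = "success"
            return result
        except Exception as exc:
            record["status"] = f"failed: {exc!r}"
            raise
        finally:
            record["duration_s"] = round(time.time() - start, 3)
            print(record)  # stand-in for a real metadata sink
    return wrapper

@dataclass
class DataState:
    """State-of-data metadata for one dataset at one pipeline stage."""
    run_id: str
    natural_key: str
    storage_location: str
    schema: dict
    row_count: int
    stats: dict  # per-column min/max/mean/median

def profile_column(values):
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

@log_metadata
def transform(prices):
    # Hypothetical task: apply a 10% markup.
    return [p * 1.1 for p in prices]

out = transform([10.0, 20.0, 30.0])
state = DataState(
    run_id=RUN_ID,
    natural_key="order_id",                 # assumed key
    storage_location="s3://bucket/orders/",  # hypothetical path
    schema={"price": "float"},
    row_count=len(out),
    stats={"price": profile_column(out)},
)
print(state)
```

Because the decorator logs in a finally block, the same record (with the same run ID) is emitted whether the task succeeds or raises, which is what makes a failed run reproducible after the fact.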