Missing metadata prevents reproduction of pipeline failures
Severity: critical
Category: availability
Updated: Feb 22, 2024
Technologies: Apache Spark, workflow orchestrators
How to detect:
Data pipeline failures cannot be debugged or reproduced because metadata about inputs, execution context, and data state is not captured. Engineers cannot determine which inputs caused a failure or in what state the data was when it occurred.
Recommended action:
Implement metadata logging for three categories:
(1) Pipeline state: run state and time taken per task, captured via the orchestrator and the Spark UI/History Server.
(2) What/how/where/when: function inputs, retries, execution engine, and execution time, captured with a log_metadata decorator.
(3) Data state: column statistics (min/max/mean/median), schema, natural key, and storage location, captured in dataclass structures.
Ensure every metadata record carries a unique run ID so a failure can be traced across systems.
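A minimal sketch of categories (2) and (3) in Python. The names here (log_metadata, DataState, the S3 path, the column names) are illustrative assumptions, not an established API; in practice the print calls would feed a real logger or metadata store:

```python
import functools
import statistics
import time
import uuid
from dataclasses import dataclass

# Unique run ID shared by every metadata record in this pipeline run,
# so records can be correlated across systems.
RUN_ID = str(uuid.uuid4())

def log_metadata(func):
    """Capture what/how/where/when for a task: inputs, duration, outcome."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        record = {
            "run_id": RUN_ID,
            "task": func.__name__,
            "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
        }
        try:
            result = func(*args, **kwargs)
            record["status"] = "success"
            return result
        except Exception as exc:
            record["status"] = f"failed: {exc!r}"
            raise
        finally:
            record["duration_s"] = round(time.time() - start, 3)
            print(record)  # stand-in for a real metadata sink
    return wrapper

@dataclass
class DataState:
    """State-of-data metadata for one dataset at one pipeline stage."""
    run_id: str
    natural_key: str
    storage_location: str
    schema: dict
    row_count: int
    stats: dict  # per-column min/max/mean/median

def profile_column(values):
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

@log_metadata
def transform(prices):
    # Hypothetical task: apply a 10% markup.
    return [p * 1.1 for p in prices]

out = transform([10.0, 20.0, 30.0])
state = DataState(
    run_id=RUN_ID,
    natural_key="order_id",                 # assumed key
    storage_location="s3://bucket/orders/",  # hypothetical path
    schema={"price": "float"},
    row_count=len(out),
    stats={"price": profile_column(out)},
)
print(state)
```

Because the decorator logs in a finally block, the same record (with the same run ID) is emitted whether the task succeeds or raises, which is what makes a failed run reproducible after the fact.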