Resource exhaustion undetected without metric tracking

warning

Resource Contention

Updated Feb 22, 2024 (via Exa)

Sources

Data Engineering Best Practices - #2. Metadata & Logging (www.startdataengineering.com)

Technologies:

Apache DataFusion (subject)

Apache Spark: Apache Spark metrics correlate with this issue and help confirm the diagnosis

How to detect:

Systems like Spark or Polars exhaust disk, memory, or CPU resources because of buggy code, unoptimized queries, or unexpectedly large datasets, yet no alerts fire because system metrics are not tracked or alert thresholds are not configured.

Recommended action:

Implement system metric collection for data processing engines. For Spark, expose metrics via a Prometheus endpoint. Configure Prometheus or DataDog to scrape metrics (disk usage, memory usage, CPU usage) at regular intervals (e.g., every 5 seconds). Set up alerts that fire when usage exceeds specific thresholds. Use Grafana or a similar tool to visualize metrics over time for capacity planning and scaling decisions.
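
The threshold-alert idea can be illustrated with a minimal Python sketch using only the standard library. It is not how Prometheus or DataDog evaluate alert rules; it only shows the logic of comparing a sampled metric against a configured limit. The metric names, the `Alert` type, the helper functions, and the threshold values are all hypothetical and should be adapted to your environment.

```python
from __future__ import annotations

import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """A single threshold violation (hypothetical type for this sketch)."""
    metric: str
    value: float
    threshold: float


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate_thresholds(
    sample: dict[str, float], thresholds: dict[str, float]
) -> list[Alert]:
    """Return one Alert per metric whose sampled value exceeds its threshold.

    Metrics without a configured threshold are ignored, which mirrors the
    failure mode described above: no threshold, no alert.
    """
    return [
        Alert(name, value, thresholds[name])
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    ]


# Assumed thresholds, in percent; these numbers are illustrative only.
THRESHOLDS = {"disk_percent": 85.0, "memory_percent": 90.0, "cpu_percent": 95.0}

if __name__ == "__main__":
    # Only disk usage is sampled here; memory and CPU would come from
    # whatever collector you already run.
    sample = {"disk_percent": disk_usage_percent("/")}
    for alert in evaluate_thresholds(sample, THRESHOLDS):
        print(f"ALERT {alert.metric}: {alert.value:.1f}% > {alert.threshold:.1f}%")
```

In practice the sampling and alerting would be handled by the monitoring stack itself; a script like this is mainly useful for sanity-checking threshold values before encoding them as alert rules.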