Resource exhaustion undetected without metric tracking
warning | Resource Contention | Updated Feb 22, 2024 (via Exa)
Technologies:
How to detect:
Data processing engines such as Spark or Polars exhaust disk, memory, or CPU because of buggy code, unoptimized queries, or unexpectedly large datasets, yet no alert fires because system metrics are not collected or alert thresholds are not configured.
Recommended action:
Collect system metrics from every data processing engine. For Spark, expose metrics through a Prometheus endpoint (Spark 3.0 and later include a built-in PrometheusServlet sink). Configure Prometheus or Datadog to scrape disk, memory, and CPU usage at regular intervals (e.g., every 5 seconds). Define alerts that fire when usage exceeds a set threshold for each resource. Use Grafana or a similar tool to visualize the metrics over time and to inform capacity planning and scaling decisions.
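The threshold step can be illustrated with a minimal Python sketch. It is not a substitute for Prometheus or Datadog alert rules; it only shows the comparison those rules perform. The names `Alert`, `check_thresholds`, `disk_usage_percent`, and `THRESHOLDS`, as well as the threshold values, are hypothetical and chosen for illustration. Only the standard library is used, so memory and CPU samples are assumed to come from whatever collector is already in place.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One threshold violation for a single metric."""
    metric: str
    value: float
    threshold: float


def disk_usage_percent(path="."):
    """Return the percentage of space used on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_thresholds(samples, thresholds):
    """Return one Alert per metric whose sampled value exceeds its threshold.

    Metrics missing from `samples` are skipped rather than treated as zero.
    """
    alerts = []
    for metric, limit in thresholds.items():
        value = samples.get(metric)
        if value is not None and value > limit:
            alerts.append(Alert(metric, value, limit))
    return alerts


# Hypothetical thresholds in percent; tune them to the workload.
THRESHOLDS = {
    "disk_percent": 85.0,
    "memory_percent": 90.0,
    "cpu_percent": 95.0,
}

if __name__ == "__main__":
    # Only disk usage is sampled here; other metrics would be added by a collector.
    samples = {"disk_percent": disk_usage_percent(".")}
    for alert in check_thresholds(samples, THRESHOLDS):
        print(f"ALERT {alert.metric}: {alert.value:.1f}% > {alert.threshold:.1f}%")
```

In a real deployment the same comparison would normally live in the monitoring system's alert rules, so that alerts keep firing even when the processing job itself is stalled or out of resources.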