Apache Spark

Autoscaling Lag During Load Spikes

warning · scaling
Updated Feb 16, 2026

Databricks autoscaling takes 3-5 minutes to provision new VMs during demand spikes; while pending tasks exceed available executor cores, work queues up and performance degrades.

How to detect:

Track the gap between required executors (inferred from spark_executor_count_tasks divided by cores per executor) and running executors. Watch for spark_stage_count_active_tasks accumulating while the spark_executor count remains static. Alert when the provisioning delay exceeds an acceptable threshold.
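The detection logic above can be sketched as follows. This is an illustrative example, not Databricks-provided code: the cores-per-executor value, sample interval, and threshold are assumptions you would tune to your cluster, and the metric values are whatever your monitoring stack exports.

```python
# Sketch of executor-gap detection. Assumes you already scrape
# active-task and executor counts from your metrics pipeline.
CORES_PER_EXECUTOR = 4  # assumed executor size; match your cluster config


def executor_gap(active_tasks: int, running_executors: int,
                 cores_per_executor: int = CORES_PER_EXECUTOR) -> int:
    """Executors still needed: ceil(active_tasks / cores) minus running."""
    required = -(-active_tasks // cores_per_executor)  # ceiling division
    return max(0, required - running_executors)


def provisioning_lagging(gap_samples: list[int],
                         threshold_s: int = 120,
                         sample_interval_s: int = 15) -> bool:
    """Flag when a positive executor gap persists past the threshold."""
    consecutive = 0
    for gap in gap_samples:
        consecutive = consecutive + 1 if gap > 0 else 0
        if consecutive * sample_interval_s >= threshold_s:
            return True
    return False
```

For example, 40 active tasks against 5 running 4-core executors yields a gap of 5; if that gap stays positive across enough consecutive samples, the alert fires.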

Recommended action:

Implement cluster pools with min_idle_instances set to 2-4 pre-warmed VMs; this cuts scale-up time from 3-5 minutes to seconds. Set autoscale min_workers to your baseline demand and max_workers to a cap that prevents runaway costs. For predictable workloads, raise min_workers to avoid cold starts entirely.
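A cluster spec combining these settings might look like the sketch below. The pool ID, cluster name, runtime version, and worker counts are placeholders, not values from the original text; the field names follow the shape of the Databricks Clusters API payload.

```python
# Illustrative cluster spec: attach the cluster to a pre-warmed pool
# and bound autoscaling. All identifiers here are placeholders.
cluster_spec = {
    "cluster_name": "etl-autoscaling",     # hypothetical name
    "spark_version": "15.4.x-scala2.12",   # example runtime version
    "instance_pool_id": "pool-1234",       # pre-warmed pool (placeholder ID)
    "autoscale": {
        "min_workers": 2,   # baseline capacity, avoids cold starts
        "max_workers": 12,  # cap to prevent runaway costs
    },
}
```

The pool itself is created separately, with min_idle_instances set to the 2-4 pre-warmed VMs recommended above; clusters referencing the pool via instance_pool_id draw from those idle instances instead of provisioning fresh VMs.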