Databricks autoscaling takes 3-5 minutes to provision new VMs during demand spikes; while pending tasks exceed the cores the current executors provide, tasks queue and job performance degrades.
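A minimal sketch of the backlog condition during the provisioning window. The function name and numbers are illustrative, not Databricks' actual autoscaling policy:

```python
# Illustrative backlog check: tasks queue whenever pending work exceeds
# the cores the currently running executors can offer.
def backlog_tasks(pending_tasks: int, executors: int, cores_per_executor: int) -> int:
    """Tasks that must wait because no executor core is free."""
    available_cores = executors * cores_per_executor
    return max(0, pending_tasks - available_cores)

# A spike of 500 tasks on 10 executors x 8 cores leaves 420 tasks queued
# for the 3-5 minutes new VMs take to come online.
print(backlog_tasks(pending_tasks=500, executors=10, cores_per_executor=8))  # 420
```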
Spark executors fail with OOM errors when processing partitions significantly larger than 200-500MB, exhausting executor heap memory and causing cascading failures across the cluster.
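The usual remedy is repartitioning so each partition lands inside that band. A sketch of the sizing arithmetic, with a 256MB target chosen here purely for illustration:

```python
# Compute how many partitions keep each one near a target size.
def target_partition_count(total_bytes: int, target_partition_mb: int = 256) -> int:
    target_bytes = target_partition_mb * 1024 * 1024
    return max(1, -(-total_bytes // target_bytes))  # ceiling division

# A 100GB dataset at ~256MB per partition needs 400 partitions; a PySpark
# caller would then apply df.repartition(target_partition_count(total_bytes)).
print(target_partition_count(100 * 1024**3))  # 400
```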
Excessive JVM garbage collection time relative to task execution indicates memory leaks, inefficient data structures, or insufficient executor heap allocation, severely degrading Spark performance.
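The Spark UI's Executors tab exposes both "GC Time" and "Task Time", so the check reduces to a ratio. The ~10% threshold below is a common rule of thumb, not an official cutoff:

```python
# Flag executors spending an outsized share of task time in garbage collection.
def gc_pressure(gc_time_ms: float, task_time_ms: float, threshold: float = 0.10) -> bool:
    """True when GC time exceeds the given fraction of total task time."""
    return task_time_ms > 0 and gc_time_ms / task_time_ms > threshold

print(gc_pressure(gc_time_ms=15_000, task_time_ms=60_000))  # True: 25% in GC
print(gc_pressure(gc_time_ms=2_000, task_time_ms=60_000))   # False: ~3%
```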
Uneven data distribution during shuffle operations causes specific executors to process disproportionate data volumes, leading to straggler tasks and prolonged job durations.
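One standard mitigation is key salting: appending a random suffix so a hot key hashes to several shuffle partitions instead of one. A pure-Python sketch of the idea (the key name and salt count are illustrative):

```python
import random
from collections import Counter

def salt_key(key: str, num_salts: int = 8) -> str:
    """Spread a hot key across num_salts shuffle buckets via a random suffix."""
    return f"{key}_{random.randrange(num_salts)}"

random.seed(0)  # deterministic for the demo
salted = Counter(salt_key("hot_customer") for _ in range(80_000))
# The 80k records for one hot key now land in 8 roughly equal buckets
# (~10k each) instead of on a single straggler executor.
print(sorted(salted.values()))
```

After the salted aggregation, a second pass that strips the suffix and re-aggregates recovers the per-key totals.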
Indiscriminate caching of large datasets crowds execution out of the unified memory pool, driving high cache eviction rates and forcing re-computation of expensive operations.
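Spark's unified memory manager governs this contention: `spark.memory.fraction` (default 0.6) sizes the shared pool, and `spark.memory.storageFraction` (default 0.5) marks the portion of it where cached blocks are protected from eviction by execution. A sketch of the arithmetic, assuming the defaults:

```python
# Spark reserves ~300MB of heap; the unified pool is a fraction of the rest.
def memory_regions(heap_mb: float, memory_fraction: float = 0.6,
                   storage_fraction: float = 0.5) -> dict:
    unified = (heap_mb - 300) * memory_fraction   # shared by execution + storage
    storage = unified * storage_fraction          # eviction-protected cache region
    return {"unified_mb": unified, "protected_storage_mb": storage}

# On an 8GB executor heap, only ~2.3GB of cache is safe from eviction;
# cached blocks beyond that can be dropped when execution claims memory.
print(memory_regions(heap_mb=8192))
```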