Shuffle Operation Data Skew
Warning: Uneven data distribution during shuffle operations causes specific executors to process disproportionate data volumes, leading to straggler tasks and prolonged job durations.
Monitor spark_stage_shuffle_read_size and spark_stage_shuffle_write across executors, and flag outlier executors whose shuffle metrics are 2-3x the median. Cross-reference these with spark_executor_id_count_time, looking for disproportionate task duration, and with spikes in spark_executor_disk_used on specific nodes.
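The median comparison described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `find_skewed_executors`, the executor IDs, and the byte counts are hypothetical, and the 2x threshold is simply the lower end of the 2-3x range mentioned in the text. In practice the input would come from whatever system exports the shuffle metrics.

```python
from statistics import median


def find_skewed_executors(shuffle_bytes, threshold=2.0):
    """Return executor IDs whose shuffle volume is at least threshold x the median."""
    if not shuffle_bytes:
        return []
    med = median(shuffle_bytes.values())
    if med == 0:
        # With a zero median, any executor that moved data at all stands out.
        return sorted(e for e, b in shuffle_bytes.items() if b > 0)
    return sorted(e for e, b in shuffle_bytes.items() if b / med >= threshold)


# Hypothetical per-executor shuffle read sizes in bytes.
sample = {"exec-1": 1_000, "exec-2": 1_100, "exec-3": 950, "exec-4": 3_200}
skewed = find_skewed_executors(sample)
print(skewed)
```

The median is used rather than the mean so that a single heavily loaded executor does not pull the baseline up and hide itself.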
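Enable Adaptive Query Execution (spark.sql.adaptive.enabled) for automatic skew handling; its skew-join optimization is controlled by spark.sql.adaptive.skewJoin.enabled. For persistent skew, implement manual salting: add a random prefix to the skewed keys on one side of the join and replicate the other side once per salt value so that every salted key still finds its match. Monitor partition distribution in the Spark UI and repartition by the skewed columns with a higher partition count.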
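The salting idea can be illustrated without a cluster. The sketch below uses plain Python lists rather than the Spark API, and every name in it (`NUM_SALTS`, `salt_large_side`, `replicate_small_side`, the sample rows) is hypothetical. It shows the two halves of the technique: the skewed side gets a random salt prefix on each key, and the other side is copied once per salt value so the join result is unchanged.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed salt range; tune to the degree of skew


def salt_large_side(rows, num_salts=NUM_SALTS, seed=0):
    """Prefix each key on the skewed side with a random salt in [0, num_salts)."""
    rng = random.Random(seed)
    return [(f"{rng.randrange(num_salts)}_{key}", value) for key, value in rows]


def replicate_small_side(rows, num_salts=NUM_SALTS):
    """Copy each row once per salt so every salted key finds its match."""
    return [(f"{salt}_{key}", value) for key, value in rows for salt in range(num_salts)]


def join(left, right):
    """Plain inner hash join on the first tuple element."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


# Hypothetical data: the key "hot" dominates the large side.
large = [("hot", i) for i in range(1000)] + [("cold", 1000)]
small = [("hot", "H"), ("cold", "C")]

salted = join(salt_large_side(large), replicate_small_side(small))
# Strip the salt prefix to recover the original key.
result = [(key.split("_", 1)[1], lv, rv) for key, lv, rv in salted]
rows_per_salted_key = Counter(key for key, _, _ in salted)
```

After salting, the rows for "hot" are spread over `NUM_SALTS` distinct join keys instead of one, which is what lets a shuffle distribute them across several tasks. The cost is that the replicated side grows by a factor of `NUM_SALTS`, so salting is usually applied only when that side is small or only to the keys known to be skewed.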