Apache Spark

Shuffle Operation Data Skew

Resource Contention
Updated Jan 5, 2026

Uneven data distribution during shuffle operations causes specific executors to process disproportionate data volumes, leading to straggler tasks and prolonged job durations.

How to detect:

Monitor spark_stage_shuffle_read_size and spark_stage_shuffle_write across executors. Flag outlier executors whose shuffle metrics run 2-3x above the median. Cross-reference with spark_executor_id_count_time to confirm disproportionate task durations, and watch for spark_executor_disk_used spikes on the same nodes.
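The outlier check above can be sketched as a small script over scraped per-executor values. This is a minimal illustration, not part of Spark itself: the executor ids, byte counts, and the 2x factor are assumptions for the example.

```python
from statistics import median

def find_skewed_executors(shuffle_read_bytes, factor=2.0):
    """Flag executors whose shuffle-read size exceeds `factor` x the median.

    `shuffle_read_bytes` maps executor id -> shuffle read size in bytes
    (e.g. values scraped from spark_stage_shuffle_read_size).
    """
    med = median(shuffle_read_bytes.values())
    return {eid: size for eid, size in shuffle_read_bytes.items()
            if size > factor * med}

# Hypothetical per-executor shuffle-read sizes in bytes:
reads = {"1": 210_000_000, "2": 190_000_000, "3": 205_000_000, "4": 950_000_000}
print(find_skewed_executors(reads))  # → {'4': 950000000}
```

Executor 4 reads roughly 4.6x the median here, so it is the straggler candidate to inspect first in the Spark UI.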

Recommended action:

Enable Adaptive Query Execution (spark.sql.adaptive.enabled=true, together with spark.sql.adaptive.skewJoin.enabled=true) so Spark can split oversized shuffle partitions automatically at join time. For skew that AQE does not resolve, implement manual salting: append a random prefix to the skewed join keys and replicate the matching keys on the other side of the join. Monitor partition distribution in the Spark UI and repartition on the skewed columns with a higher partition count.
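The AQE settings can go in spark-defaults.conf (or be passed via --conf). The two threshold values below are the Spark 3.x defaults, shown for orientation rather than as tuned recommendations:

```properties
spark.sql.adaptive.enabled                                    true
spark.sql.adaptive.skewJoin.enabled                           true
# A partition is treated as skewed when it is larger than this factor times
# the median partition size AND larger than the byte threshold below.
spark.sql.adaptive.skewJoin.skewedPartitionFactor             5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes   256m
```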
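The salting technique can be sketched in plain Python to show the mechanics; in a real job this logic would be expressed as PySpark or Scala DataFrame transformations, and the bucket count and key names here are illustrative assumptions.

```python
import random

SALT_BUCKETS = 4  # illustrative; tune to the observed skew

def salt_large_side(rows):
    # rows: (key, value) pairs on the skewed (large) side.
    # Appending a random bucket id spreads one hot key over SALT_BUCKETS partitions.
    return [((key, random.randrange(SALT_BUCKETS)), value) for key, value in rows]

def explode_small_side(rows):
    # Replicate each small-side row once per bucket so every salted key finds a match.
    return [((key, b), value) for key, value in rows for b in range(SALT_BUCKETS)]

def salted_join(large, small):
    lookup = {}
    for salted_key, value in explode_small_side(small):
        lookup.setdefault(salted_key, []).append(value)
    # Join on the (key, bucket) pair, then drop the salt from the output.
    return [(key, lv, sv)
            for (key, b), lv in salt_large_side(large)
            for sv in lookup.get((key, b), [])]

large = [("user_1", i) for i in range(6)] + [("user_2", 99)]  # user_1 is the hot key
small = [("user_1", "US"), ("user_2", "DE")]
print(sorted(salted_join(large, small)))
```

The result is identical to a plain join on the key, but the rows for the hot key no longer hash to a single partition. The cost is replicating the small side SALT_BUCKETS times, which is why salting pairs best with a small dimension side.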