High per-task input standard deviation indicates data skew
warning | performance | Updated Sep 15, 2024
Technologies:
How to detect:
Workers process uneven amounts of input data, so some workers are overloaded while others sit idle. The signal is a high standard deviation in per-task input metrics relative to the average, often accompanied by explicit skew warnings. Example: a per-task average of 4.00 with a standard deviation of 2.00 gives a coefficient of variation of 2.00 / 4.00 = 50%.
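The check can be reproduced by hand from per-task input figures. The following is a minimal sketch, not the detector's actual logic: the function names, the 0.5 threshold, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(per_task_input):
    """Return std.dev / average for a list of per-task input sizes."""
    avg = mean(per_task_input)
    if avg == 0:
        return 0.0  # no input at all, so nothing can be skewed
    # Population std.dev is an assumption; a tool may use the sample form.
    return pstdev(per_task_input) / avg


def is_skewed(per_task_input, threshold=0.5):
    """Flag skew when the coefficient of variation reaches the threshold.

    The 0.5 default is a hypothetical cut-off, not a documented one.
    """
    return coefficient_of_variation(per_task_input) >= threshold


# Four tasks averaging 4.00 units of input with a std.dev of 2.00
sample = [2, 2, 6, 6]
print(f"CV = {coefficient_of_variation(sample):.0%}")
```

Using the ratio rather than the raw standard deviation keeps the check comparable across jobs with very different input volumes.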
Recommended action:
Review the partitioning keys and confirm that they distribute rows uniformly across workers. Use table statistics to analyze the data distribution in the source tables, and identify any hot partition keys that account for a disproportionate share of rows. To spread the load, consider a different hash function, adding a salt to skewed partition keys, or increasing parallelism.
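The effect of finding hot keys and salting them can be illustrated with a small simulation. This is a sketch under stated assumptions: the worker count, salt bucket count, 10% hot-key share, CRC32-based partitioner, and all function names are hypothetical and do not correspond to any particular engine's API.

```python
import random
import zlib
from collections import Counter

NUM_WORKERS = 8    # assumed degree of parallelism
SALT_BUCKETS = 8   # assumed number of salt values per hot key


def find_hot_keys(keys, share=0.10):
    """Return keys that account for more than `share` of all rows."""
    counts = Counter(keys)
    total = len(keys)
    return {k for k, n in counts.items() if n / total > share}


def partition(key, num_workers=NUM_WORKERS):
    """Assign a key to a worker with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def salted_key(key, hot_keys, rng, buckets=SALT_BUCKETS):
    """Append a random salt suffix to hot keys; leave other keys unchanged."""
    if key in hot_keys:
        return f"{key}#{rng.randrange(buckets)}"
    return key


def rows_per_worker(keys, num_workers=NUM_WORKERS):
    """Count how many rows each worker would receive."""
    counts = Counter(partition(k, num_workers) for k in keys)
    return [counts.get(w, 0) for w in range(num_workers)]


# Synthetic input: one key carries 80% of the rows.
keys = ["hot"] * 8000 + [f"k{i}" for i in range(2000)]
hot = find_hot_keys(keys)
rng = random.Random(0)  # seeded so the run is repeatable

before = rows_per_worker(keys)
after = rows_per_worker([salted_key(k, hot, rng) for k in keys])
print("max rows on one worker before salting:", max(before))
print("max rows on one worker after salting: ", max(after))
```

Salting changes the key, so downstream logic has to account for it: an aggregation typically needs a second stage that strips the salt and combines the partial results, and a join needs the matching rows on the other side replicated once per salt value.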