Apache DataFusion

Hash join optimizer selects non-partitioned (CollectLeft) mode, causing up to 52x slower query execution

critical
performance
Updated Mar 10, 2026 (via Exa)
How to detect:

The DataFusion optimizer defaults to CollectLeft hash joins in cases where Partitioned hash joins would be dramatically faster. In the observed case, TPC-DS query 99 executes in 5303 ms with CollectLeft but in only 103 ms when forced to Partitioned mode, a speedup of roughly 52x. Multiple other TPC-DS queries are affected as well, running 2-27x slower under CollectLeft.
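As a rough aid for spotting affected queries, the sketch below scans the text of a physical plan (for example, the output of EXPLAIN) for HashJoinExec nodes and reports which join mode each one uses. The helper names and the sample plan are hypothetical and only for illustration; the sketch assumes the plan prints joins as `HashJoinExec: mode=...`.

```python
import re

# Assumption: HashJoinExec nodes appear in the plan text as
# "HashJoinExec: mode=<Mode>, ...".
_JOIN_MODE = re.compile(r"HashJoinExec: mode=(\w+)")


def find_join_modes(plan_text: str) -> list:
    """Return (line_number, mode) for every HashJoinExec line in the plan text."""
    found = []
    for lineno, line in enumerate(plan_text.splitlines(), start=1):
        match = _JOIN_MODE.search(line)
        if match:
            found.append((lineno, match.group(1)))
    return found


def collect_left_lines(plan_text: str) -> list:
    """Return the plan line numbers whose hash join runs in CollectLeft mode."""
    return [n for n, mode in find_join_modes(plan_text) if mode == "CollectLeft"]


# Hypothetical plan excerpt, not taken from a real query.
SAMPLE_PLAN = "\n".join([
    "ProjectionExec: expr=[a@0 as a]",
    "  HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(a@0, a@0)]",
    "    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(b@1, b@0)]",
])

if __name__ == "__main__":
    for lineno in collect_left_lines(SAMPLE_PLAN):
        print(f"CollectLeft join at plan line {lineno}")
```

A CollectLeft hit is only a candidate: it matters when the build side of that join is large, so the flagged lines still need to be checked against the actual input sizes.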

Recommended action:

Force partitioned joins by setting these environment variables:

DATAFUSION_OPTIMIZER_REPARTITION_JOINS=true
DATAFUSION_OPTIMIZER_HASH_JOIN_SINGLE_PARTITION_THRESHOLD=0
DATAFUSION_OPTIMIZER_HASH_JOIN_SINGLE_PARTITION_THRESHOLD_ROWS=0

For production, monitor datafusion.join.build_input_rows and consider switching join modes dynamically when the build side exceeds 1M rows. Review query plans for CollectLeft joins on large datasets.
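The environment variables above can be assembled programmatically before launching a DataFusion-based process. The following is a minimal sketch. It assumes the variables correspond to dotted configuration keys via the usual convention of upper-casing the key and replacing dots with underscores; the dotted key names and the helper functions are illustrative assumptions, not something stated in this report.

```python
# Assumed dotted config keys matching the environment variables listed above.
FORCED_PARTITIONED = {
    "datafusion.optimizer.repartition_joins": "true",
    "datafusion.optimizer.hash_join_single_partition_threshold": "0",
    "datafusion.optimizer.hash_join_single_partition_threshold_rows": "0",
}


def env_var_name(config_key: str) -> str:
    """Map a dotted config key to its environment-variable form."""
    return config_key.upper().replace(".", "_")


def partitioned_join_env(base_env=None) -> dict:
    """Return a copy of base_env with the partitioned-join overrides applied."""
    env = dict(base_env or {})
    for key, value in FORCED_PARTITIONED.items():
        env[env_var_name(key)] = value
    return env


if __name__ == "__main__":
    # Print the overrides in KEY=VALUE form, e.g. for a shell or service file.
    for name, value in sorted(partitioned_join_env().items()):
        print(f"{name}={value}")
```

The returned dictionary could, for instance, be merged with `os.environ` and passed as the `env` argument of `subprocess.run` when starting the query process, which keeps the override scoped to that process rather than the whole host.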