Apache DataFusion

Unnest with GROUP BY causes unbounded memory growth despite streaming

critical
Resource Contention
Updated Mar 7, 2026 (via Exa)
How to detect:

When a query over Parquet files with array columns applies unnest followed by GROUP BY, memory usage grows in proportion to input_rows × array_size. For example, a 341 MB file with 20,000 rows and 2,000-element arrays expands to 40 million unnested rows and consumes more than 53 GB of RAM. The GROUP BY operator must buffer all unnested rows before it can emit results, which prevents streaming execution even when the query is configured for streaming mode.
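The figures above can be turned into a rough per-row cost. The following is a minimal back-of-the-envelope sketch in plain Python; the function name `unnest_footprint` is hypothetical, and it assumes decimal units (1 MB = 10^6 bytes) and treats the reported 53 GB as a lower bound on peak memory.

```python
def unnest_footprint(rows, array_len, file_bytes, peak_bytes):
    """Estimate row fan-out and memory cost from the reported figures."""
    # Each input row produces one output row per array element.
    unnested_rows = rows * array_len
    return {
        "unnested_rows": unnested_rows,
        # Average memory held per unnested row at peak.
        "bytes_per_unnested_row": peak_bytes / unnested_rows,
        # How many times larger peak memory is than the file on disk.
        "amplification": peak_bytes / file_bytes,
    }


report = unnest_footprint(
    rows=20_000,
    array_len=2_000,
    file_bytes=341 * 10**6,  # 341 MB, decimal units assumed
    peak_bytes=53 * 10**9,   # 53 GB, taken as a lower bound
)
print(report)
```

Under these assumptions the query holds roughly 1.3 KB per unnested row, and peak memory is about 155 times the size of the file on disk. The same function can be used to project peak memory for other row counts and array lengths, provided the per-row cost stays similar.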

Recommended action:

Monitor the datafusion.memory_pool.used and datafusion.operator.memory_used metrics while unnest operations run, and set an explicit memory limit with datafusion.memory_pool.limit so that a runaway query fails instead of exhausting the host. Where possible, rewrite the query to avoid grouping unnested data, or split the input into smaller batches and process them separately. For bioinformatics VCF workloads and similar array-heavy datasets, pre-aggregate before unnesting, or use materialized views to cache intermediate results.
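The batching and pre-aggregation advice can be illustrated with a small sketch. This is plain Python using only the standard library, not DataFusion API; the helper names `batched` and `count_elements` are hypothetical. It assumes the query is the equivalent of counting occurrences of each array element (unnest, then GROUP BY element with COUNT), which is a decomposable aggregate: partial counts per batch can be merged without keeping the unnested rows.

```python
from collections import Counter
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_elements(rows, batch_size=1_000):
    """Count array elements across all rows, one batch at a time."""
    totals = Counter()
    for batch in batched(rows, batch_size):
        partial = Counter()
        for values in batch:
            # Count elements directly; no per-element row is materialized.
            partial.update(values)
        # Merging two Counters adds their counts per key.
        totals.update(partial)
    return totals


# Three input rows, each holding one array column.
rows = [["A", "B"], ["A"], ["B", "B", "C"]]
result = count_elements(rows, batch_size=2)
print(result)
```

With this shape, memory is bounded by one batch of input rows plus one counter entry per distinct group, rather than by input_rows × array_size. The same approach works for other decomposable aggregates such as SUM, MIN, and MAX; aggregates that need every value at once (an exact median, for instance) do not split this way.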