Unnest with GROUP BY causes unbounded memory growth despite streaming
Severity: critical. When processing Parquet files whose array columns are expanded with unnest and then aggregated with GROUP BY, memory usage grows in proportion to input_rows × array_size. A 341 MB file with 20,000 rows and 2,000-element arrays consumes 53+ GB of RAM. The GROUP BY operator must buffer every unnested row before it can emit results, which defeats streaming execution even when the query is configured for streaming mode.
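The scaling above can be sanity-checked with back-of-envelope arithmetic. The per-row cost used below is an illustrative assumption (array values plus hash-table state), not a measured figure from the report:

```python
# Estimate of what the GROUP BY must buffer after unnest.
input_rows = 20_000
array_size = 2_000
unnested_rows = input_rows * array_size  # rows buffered by GROUP BY

# Assumed ~1.4 KB per unnested row (values + aggregation state);
# illustrative only, not a DataFusion measurement.
bytes_per_row = 1_400
estimated_bytes = unnested_rows * bytes_per_row

print(f"{unnested_rows:,} unnested rows")
print(f"~{estimated_bytes / 2**30:.1f} GiB buffered")
```

At these assumed per-row costs the estimate lands around 52 GiB, in line with the 53+ GB observed.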
Monitor the datafusion.memory_pool.used and datafusion.operator.memory_used metrics during unnest operations, and set an explicit cap via datafusion.memory_pool.limit. Where possible, restructure queries to avoid grouping unnested data, or break processing into smaller batches. For bioinformatics VCF workloads and similar array-heavy datasets, pre-aggregate before unnesting, or use materialized views to cache intermediate results.
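The batching mitigation works because aggregation is mergeable: each chunk can be unnested and partially aggregated on its own, so only one chunk's unnested rows are alive at a time. A minimal pure-Python sketch of that shape (the row layout, chunk size, and sum aggregate are illustrative assumptions, not DataFusion APIs):

```python
from collections import defaultdict
from itertools import islice

def rows():
    # Illustrative stand-in for a Parquet scan:
    # each row is (group_key, array_column).
    for i in range(10_000):
        yield (i % 7, [1, 2, 3])

def chunked(it, size):
    it = iter(it)
    while chunk := list(islice(it, size)):
        yield chunk

def grouped_sum(row_iter, chunk_size=1_000):
    totals = defaultdict(int)
    for chunk in chunked(row_iter, chunk_size):
        # Unnest and aggregate one chunk at a time; peak memory is
        # bounded by chunk_size * array_size, not input_rows * array_size.
        for key, arr in chunk:
            for v in arr:          # unnest
                totals[key] += v   # partial aggregate, merged in place
    return dict(totals)

result = grouped_sum(rows())
```

The same partial-aggregate-then-merge structure applies whether batching is done by Parquet row group, by file, or by an outer driver loop issuing bounded queries.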