datafusion.aggregate.groups

Distinct aggregation groups

Dimensions: None

Technical Annotations (50)

Configuration Parameters (6)

datafusion.execution.skip_partial_aggregation_probe_ratio_threshold (recommended: 0.8)

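Default is 0.8. This is the ratio of distinct groups to input rows at which partial aggregation is skipped; lower it so that partial aggregation is skipped more readily on high-cardinality data.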

datafusion.execution.skip_partial_aggregation_probe_rows_threshold (recommended: 100000)

Default is 100,000 input rows before the aggregation ratio is checked; lower it to detect inefficient partial aggregation sooner.
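
The interaction between the two probe thresholds can be sketched as follows. This is a minimal illustration, not DataFusion's actual code: `SkipProbe` and `should_skip` are hypothetical names, and the comparison is assumed to be "ratio of groups to input rows at or above the threshold, checked only once enough rows have been seen".

```rust
/// Hypothetical helper (not a DataFusion type): models the decision to stop
/// partial aggregation and pass rows through to the final aggregation.
#[derive(Debug, Clone, Copy)]
pub struct SkipProbe {
    /// Mirrors skip_partial_aggregation_probe_rows_threshold.
    pub rows_threshold: usize,
    /// Mirrors skip_partial_aggregation_probe_ratio_threshold.
    pub ratio_threshold: f64,
}

impl SkipProbe {
    /// Returns true when enough rows have been observed and the share of
    /// distinct groups among them is at or above the ratio threshold.
    pub fn should_skip(&self, input_rows: usize, num_groups: usize) -> bool {
        // Too few rows seen so far: the ratio is not checked yet.
        if input_rows < self.rows_threshold {
            return false;
        }
        // Assumed definition: distinct groups divided by input rows.
        let ratio = num_groups as f64 / input_rows as f64;
        ratio >= self.ratio_threshold
    }
}
```

With the defaults (100,000 rows, 0.8), a partition that has seen 100,000 rows and produced 90,000 groups would be flagged for skipping, while one that produced 10,000 groups would keep aggregating.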

datafusion.execution.target_partitions (recommended: 1-4)

Lower values reduce memory duplication in high-cardinality GROUP BY queries, because each partition keeps its own copy of the group state.
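
As a rough back-of-the-envelope model of why fewer partitions help, assume the worst case in which every partition ends up holding state for nearly every distinct group. The function name, the per-group byte figure, and the model itself are assumptions made for illustration, not measurements of DataFusion.

```rust
/// Hypothetical worst-case estimate: every partition holds state for all
/// distinct groups, so total memory grows linearly with the partition count.
pub fn estimate_group_state_bytes(
    target_partitions: usize,
    distinct_groups: usize,
    bytes_per_group: usize, // assumed average size of key plus accumulator state
) -> usize {
    // saturating_mul avoids overflow panics for very large inputs.
    target_partitions
        .saturating_mul(distinct_groups)
        .saturating_mul(bytes_per_group)
}
```

Under this model, 1,000,000 groups at an assumed 64 bytes each cost about 64 MB with one partition and about 256 MB with four.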

memory_limit (recommended: set 500MB below actual available memory)

Leave headroom for Vec exponential growth overhead not tracked by MemoryPool
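
The headroom matters because a growing `Vec` reserves capacity beyond its length, and that reserved slack is what the note above says the pool does not account for. The sketch below uses only the Rust standard library; `HEADROOM_BYTES`, `memory_limit_with_headroom`, and `len_and_capacity` are hypothetical names, and 500 MB is interpreted here as 500 * 1024 * 1024 bytes.

```rust
/// Assumed interpretation of the "500MB" headroom recommendation.
pub const HEADROOM_BYTES: usize = 500 * 1024 * 1024;

/// Hypothetical helper: the limit to configure, given the memory actually
/// available. Returns 0 if less than the headroom is available.
pub fn memory_limit_with_headroom(available_bytes: usize) -> usize {
    available_bytes.saturating_sub(HEADROOM_BYTES)
}

/// Pushes `n` bytes one at a time and reports (length, capacity).
/// The capacity is never smaller than the length; the exact growth
/// strategy is an implementation detail of the standard library.
pub fn len_and_capacity(n: usize) -> (usize, usize) {
    let mut v: Vec<u8> = Vec::new();
    for _ in 0..n {
        v.push(0);
    }
    (v.len(), v.capacity())
}
```

The gap between the two numbers returned by `len_and_capacity` is memory the process has allocated but the data does not yet occupy, which is why the configured limit should sit below what the machine can really provide.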

MEMORY_FRACTION (recommended: 1.0)

Memory fraction used in RuntimeConfig, shown in reproducer setup

batch_size (recommended: reduce from the default, e.g. from 8192 to a lower value)

Smaller batches enable more fine-grained memory accounting and reduce per-batch allocation spikes

Error Signatures (3)

memory allocation of 25690112 bytes failed (exception)

DatafusionError/ResourcesExhausted: Failed to allocate additional (exception)

Aborted (core dumped) (exit code)

CLI Commands (4)

set datafusion.execution.target_partitions = 1; (diagnostic)

explain SELECT "WatchID", "ClientIP", COUNT(*) AS c FROM hits GROUP BY "WatchID", "ClientIP"; (diagnostic)

ulimit -v 1152000 (diagnostic)

writer.write(&batch) (diagnostic)

Technical References (37)

AggregateMode::Partial (component)
AggregateMode::Final (component)
RepartitionExec (component)
hash value reuse (concept)
SinglePartitioned (component)
arrow::Row (component)
RowConverter (component)
cardinality (concept)
partial aggregate skipping (concept)
AggregateExec: mode=Partial (component)
AggregateExec: mode=FinalPartitioned (component)
Top K optimization (concept)
SortPreservingMergeExec (component)
GlobalLimitExec (component)
GroupedHashAggregateStream (component)
arrow_row::variable::encode (component)
hash seed (concept)
ClickBench (component)
bucket distribution (concept)
cache locality (concept)
GroupValuesColumn (component)
vectorized_intern (component)
GroupOrdering::Full (component)
MemoryPool (component)
group_aggregate_batch() (component)
Vec::grow_amortized() (component)
datafusion/physical-plan/src/aggregates/row_hash.rs (file path)
IPCWriter (component)
emit (component)
AggregateExec (component)
total_byte_size (concept)
join_selection.rs (file path)
physical-plan/aggregates/mod.rs (file path)
physical-optimizer/join_selection.rs (file path)
hash_join/exec.rs (file path)
Final (component)
FinalPartitioned (component)

Related Insights (13)

Partial aggregation continues despite low aggregation ratio, wasting resources (warning)

High-cardinality aggregations incur triple hashing overhead in multi-phase repartition plans (warning)

Single-mode aggregation outperforms partial/final for high cardinality by avoiding row conversions (warning)

Low-cardinality aggregates benefit from partial/final mode, while high-cardinality aggregates suffer (info)

High-cardinality aggregations cause memory usage to scale linearly with partition count (critical)

GROUP BY with ORDER BY and LIMIT still allocates memory for all groups (warning)

RowConverter consumes 75% of aggregation time on high-cardinality GROUP BY operations (warning)

Hash seed reuse prevents rehashing during aggregation merge phase (info)

Spillable aggregation produces duplicate group keys due to internal state mismatch (critical)

GroupedHashAggregateStream OOM from Vec exponential growth during GROUP BY with large strings (critical)

Large single-batch spill files cause merge failures (warning)

Missing byte-size statistics after aggregation cause incorrect join build-side selection (warning)

Aggregation operations under-partition, causing multi-fold performance degradation (warning)