
datafusion.aggregate.groups

Distinct aggregation groups
Dimensions: None

Technical Annotations (50)

Configuration Parameters (6)
datafusion.execution.skip_partial_aggregation_probe_ratio_threshold [recommended: 0.8]
Default 0.8; lower to skip partial aggregation earlier for high-cardinality data
datafusion.execution.skip_partial_aggregation_probe_rows_threshold [recommended: 100000]
Default 100k rows before checking aggregation ratio; lower to detect inefficiency sooner
datafusion.execution.target_partitions [recommended: 1-4]
Lower values reduce memory duplication in high cardinality GROUP BY queries
memory_limit [recommended: set 500MB below actual available memory]
Leave headroom for Vec exponential growth overhead not tracked by MemoryPool
MEMORY_FRACTION [recommended: 1.0]
Memory fraction used in RuntimeConfig, shown in reproducer setup
batch_size [recommended: reduce from default (e.g., from 8192 to lower value)]
Smaller batches enable more fine-grained memory accounting and reduce per-batch allocation spikes
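
The two skip_partial_aggregation_probe_* parameters above work as a pair: a rows threshold decides when to check, and a ratio threshold decides what counts as unproductive. The following is a minimal sketch of that check, not DataFusion's actual code. The SkipProbe type, its field names, and the use of >= at the boundary are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the probe settings; not a DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SkipProbe {
    /// Mirrors skip_partial_aggregation_probe_rows_threshold.
    pub rows_threshold: usize,
    /// Mirrors skip_partial_aggregation_probe_ratio_threshold.
    pub ratio_threshold: f64,
}

impl SkipProbe {
    /// Returns true when partial aggregation looks unproductive:
    /// enough rows have been seen, and the ratio of distinct groups
    /// to input rows has reached the threshold.
    /// Assumption: the boundary comparison is inclusive (>=).
    pub fn should_skip(&self, input_rows: usize, num_groups: usize) -> bool {
        // Too little data to judge, so keep aggregating.
        if input_rows == 0 || input_rows < self.rows_threshold {
            return false;
        }
        let ratio = num_groups as f64 / input_rows as f64;
        ratio >= self.ratio_threshold
    }
}
```

With the defaults (100000 rows, 0.8), a partition that has produced 95,000 groups from 100,000 rows would be flagged, while one that has produced 10,000 groups would not. Lowering either value makes the check fire sooner, which matches the tuning notes above.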
Error Signatures (3)
memory allocation of 25690112 bytes failed [exception]
DatafusionError/ResourcesExhausted: Failed to allocate additional [exception]
Aborted (core dumped) [exit code]
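
An allocation failure like the first signature can occur even under a memory limit, because a growing Vec reserves capacity ahead of its length. The sketch below uses only the Rust standard library to show the gap between bytes in use and bytes reserved. The helper names are hypothetical, and the exact growth steps are an implementation detail of Vec that the standard library does not guarantee.

```rust
use std::mem::size_of;

/// Bytes occupied by the elements actually stored (len based).
pub fn tracked_bytes<T>(v: &Vec<T>) -> usize {
    v.len() * size_of::<T>()
}

/// Bytes reserved by the allocation (capacity based).
pub fn allocated_bytes<T>(v: &Vec<T>) -> usize {
    v.capacity() * size_of::<T>()
}

/// Pushes `n` values one at a time and records every new capacity,
/// so the reallocation steps can be inspected.
pub fn capacity_steps(n: usize) -> Vec<usize> {
    let mut v: Vec<u64> = Vec::new();
    let mut steps = Vec::new();
    let mut last = v.capacity();
    for i in 0..n {
        v.push(i as u64);
        if v.capacity() != last {
            last = v.capacity();
            steps.push(last);
        }
    }
    steps
}
```

Printing capacity_steps(10_000) on a typical toolchain shows a short list of increasing capacities rather than one step per push. Accounting that only looks at the length-based figure would under-report the reserved memory, which is the reason the memory_limit note suggests leaving headroom.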
CLI Commands (4)
set datafusion.execution.target_partitions = 1; [diagnostic]
explain SELECT "WatchID", "ClientIP", COUNT(*) AS c FROM hits GROUP BY "WatchID", "ClientIP"; [diagnostic]
ulimit -v 1152000 [diagnostic]
writer.write(&batch) [diagnostic]
Technical References (37)
AggregateMode::Partial [component]
AggregateMode::Final [component]
RepartitionExec [component]
hash value reuse [concept]
SinglePartitioned [component]
arrow::Row [component]
RowConverter [component]
cardinality [concept]
partial aggregate skipping [concept]
AggregateExec: mode=Partial [component]
AggregateExec: mode=FinalPartitioned [component]
Top K optimization [concept]
SortPreservingMergeExec [component]
GlobalLimitExec [component]
GroupedHashAggregateStream [component]
arrow_row::variable::encode [component]
hash seed [concept]
ClickBench [component]
bucket distribution [concept]
cache locality [concept]
GroupValuesColumn [component]
vectorized_intern [component]
GroupOrdering::Full [component]
MemoryPool [component]
group_aggregate_batch() [component]
Vec::grow_amortized() [component]
datafusion/physical-plan/src/aggregates/row_hash.rs [file path]
IPCWriter [component]
emit [component]
AggregateExec [component]
total_byte_size [concept]
join_selection.rs [file path]
physical-plan/aggregates/mod.rs [file path]
physical-optimizer/join_selection.rs [file path]
hash_join/exec.rs [file path]
Final [component]
FinalPartitioned [component]
Related Insights (13)
Partial aggregation continues despite low aggregation ratio, wasting resources [warning]
High cardinality aggregations incur triple hashing overhead in multi-phase repartition plans [warning]
Single-mode aggregation outperforms partial/final for high cardinality by avoiding row conversions [warning]
Low cardinality aggregates benefit from partial/final mode while high cardinality suffers [info]
High cardinality aggregations cause memory usage to scale linearly with partition count [critical]
GROUP BY with ORDER BY and LIMIT still allocates memory for all groups [warning]
RowConverter consumes 75% of aggregation time on high-cardinality GROUP BY operations [warning]
Hash seed reuse prevents rehashing during aggregation merge phase [info]
Spillable aggregation produces duplicate group keys due to internal state mismatch [critical]
GroupedHashAggregateStream OOM from Vec exponential growth during GROUP BY with large strings [critical]
Large single-batch spill files cause merge failures [warning]
Missing byte-size statistics after aggregation cause incorrect join build-side selection [warning]
Under-partitioned aggregation operations cause multi-fold performance degradation [warning]
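
The insight that memory scales linearly with partition count can be turned into a rough back-of-envelope estimate. The sketch below assumes the worst case, in which every partition's partial aggregation sees every distinct group. The function names and the bytes_per_group figure are hypothetical, and real usage depends on key types, accumulator state, and allocator behaviour.

```rust
/// Worst-case bytes held by partial aggregation across all partitions,
/// assuming each partition ends up holding every distinct group.
pub fn worst_case_partial_agg_bytes(
    target_partitions: usize,
    distinct_groups: usize,
    bytes_per_group: usize,
) -> usize {
    target_partitions
        .saturating_mul(distinct_groups)
        .saturating_mul(bytes_per_group)
}

/// Largest partition count whose worst-case estimate fits in `budget_bytes`.
/// Returns 0 when even a single partition would exceed the budget.
pub fn max_partitions_within(
    budget_bytes: usize,
    distinct_groups: usize,
    bytes_per_group: usize,
) -> usize {
    let per_partition = distinct_groups.saturating_mul(bytes_per_group);
    if per_partition == 0 {
        return usize::MAX; // nothing to store, so no constraint
    }
    budget_bytes / per_partition
}
```

For one million groups at an assumed 64 bytes each, one partition is estimated at 64 MB and four partitions at 256 MB, which is the kind of multiplication that a lower datafusion.execution.target_partitions value is meant to avoid.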