Apache DataFusion

Spillable aggregation produces duplicate group keys due to internal state mismatch

critical
storage
Updated Mar 5, 2026 (via Exa)
How to detect:

When a Final aggregation spills to disk and switches to sorted streaming mode, group_values is left as GroupValuesColumn<Streaming=false>, which interns keys with vectorized_intern. The spill transition resets group_ordering to GroupOrdering::Full but does not rebuild group_values as the streaming implementation, so the two pieces of state disagree and query results can contain duplicate grouping keys.
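One way to confirm the symptom is to scan a query's output for repeated group keys. The following is a minimal sketch, assuming the GROUP BY column values of each output row have already been collected as strings. The function name duplicate_group_keys is hypothetical and is not part of DataFusion.

```rust
use std::collections::HashMap;

/// Returns every group key that appears more than once in `rows`,
/// together with how many times it appears.
/// Each row holds the GROUP BY column values rendered as strings.
/// (Hypothetical helper for illustration; not a DataFusion API.)
pub fn duplicate_group_keys(rows: &[Vec<String>]) -> Vec<(Vec<String>, usize)> {
    // Count occurrences of each distinct key.
    let mut counts: HashMap<&Vec<String>, usize> = HashMap::new();
    for key in rows {
        *counts.entry(key).or_insert(0) += 1;
    }

    // Keep only the keys seen more than once.
    let mut dups: Vec<(Vec<String>, usize)> = counts
        .into_iter()
        .filter(|(_, n)| *n > 1)
        .map(|(k, n)| (k.clone(), n))
        .collect();

    // Sort so the report is deterministic.
    dups.sort();
    dups
}
```

A correct aggregation yields an empty result here, since each group key should appear in exactly one output row.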

Recommended action:

Reset group_values whenever the spill transition mutates group_ordering: self.group_ordering = GroupOrdering::Full(GroupOrderingFull::new()); self.group_values = new_group_values(group_schema, &self.group_ordering)?; Rebuilding group_values from the new ordering ensures the streaming implementation is used after the spill. The bug manifests only when an aggregation spills, so monitor datafusion.operator.spill_count to identify affected queries.
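The invariant behind the fix can be illustrated with a toy model. This is only a sketch under stated assumptions: the types below borrow names from the report but are simplified stand-ins, not DataFusion's actual types or signatures. AggState, GroupValuesKind, and is_consistent are hypothetical.

```rust
// Toy model only: simplified stand-ins, NOT DataFusion's real API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupOrdering {
    None,
    Full,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupValuesKind {
    Vectorized, // stands in for GroupValuesColumn<Streaming=false>
    Streaming,  // stands in for the streaming implementation
}

/// Hypothetical stand-in for `new_group_values`: the implementation
/// is chosen from the current ordering.
pub fn new_group_values(ordering: GroupOrdering) -> GroupValuesKind {
    match ordering {
        GroupOrdering::None => GroupValuesKind::Vectorized,
        GroupOrdering::Full => GroupValuesKind::Streaming,
    }
}

pub struct AggState {
    pub group_ordering: GroupOrdering,
    pub group_values: GroupValuesKind,
}

impl AggState {
    pub fn new() -> Self {
        let group_ordering = GroupOrdering::None;
        AggState {
            group_ordering,
            group_values: new_group_values(group_ordering),
        }
    }

    /// Buggy transition: mutates the ordering but leaves group_values stale.
    pub fn spill_transition_buggy(&mut self) {
        self.group_ordering = GroupOrdering::Full;
    }

    /// Fixed transition: rebuilds group_values from the new ordering.
    pub fn spill_transition_fixed(&mut self) {
        self.group_ordering = GroupOrdering::Full;
        self.group_values = new_group_values(self.group_ordering);
    }

    /// True when group_values matches what the ordering calls for.
    pub fn is_consistent(&self) -> bool {
        self.group_values == new_group_values(self.group_ordering)
    }
}
```

In this model the buggy transition leaves the state inconsistent (Full ordering paired with the vectorized implementation), while the fixed transition keeps the two fields in agreement.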