Trino

Outlier values poison auto-collected column statistics

warning
configuration
Updated Jul 20, 2025
Technologies:
How to detect:

When hive.collect-column-statistics-on-write is enabled and upstream data contains extreme outliers (e.g., a duration range expanding from a capped ±8e8 to between -6.19e11 and 1.75e12), the collected min/max statistics misrepresent the actual data distribution. The optimizer assumes values are spread uniformly across the entire range, so a filter appears highly selective when it actually matches 99.991% of rows.
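The arithmetic behind this misestimate can be sketched in a few lines of Python. This is a simplified model of a uniform-distribution range estimate, not Trino's actual cost-estimation code. The function name and the filter bounds (a hypothetical predicate covering the previously capped ±8e8 range) are assumptions for illustration. Only the min/max values come from the example above.

```python
def uniform_range_selectivity(col_min, col_max, lo, hi):
    """Estimate the fraction of rows matching lo <= x <= hi, assuming
    values are spread uniformly between col_min and col_max."""
    if col_max <= col_min:
        # Degenerate range: treat the filter as matching everything.
        return 1.0
    # Width of the part of the filter range that overlaps the column range.
    overlap = max(0.0, min(hi, col_max) - max(lo, col_min))
    return overlap / (col_max - col_min)


# Hypothetical filter covering the previously capped range (assumption).
FILTER_LO, FILTER_HI = -8e8, 8e8

# Statistics before the outliers arrived: the filter covers the whole range.
healthy = uniform_range_selectivity(-8e8, 8e8, FILTER_LO, FILTER_HI)

# Statistics after the outliers widened min/max.
poisoned = uniform_range_selectivity(-6.19e11, 1.75e12, FILTER_LO, FILTER_HI)

print(f"estimate with healthy stats:  {healthy:.4%}")
print(f"estimate with poisoned stats: {poisoned:.4%}")
```

Under these assumptions the estimate drops from 100% to well under 0.1% of rows, while the real match rate stays near 99.991%. A gap of that size can push the planner toward join orders and distribution choices suited to a much smaller intermediate result.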

Recommended action:

Monitor column statistics for anomalous min/max values after each data load. Implement upstream data validation and capping logic to prevent extreme outliers from entering tables. After fixing data quality issues upstream, refresh table statistics (for example, by running ANALYZE on the affected table). For columns used in critical queries, consider validating that the statistics match the actual data distribution.
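A minimal sketch of the monitoring step, assuming the statistics have already been fetched (for instance from the output of SHOW STATS FOR, which reports low_value and high_value per column) and loaded into plain dictionaries. The Bounds class, the find_anomalous_columns helper, the expected bounds, and the sample rows are all hypothetical. Fetching the rows from Trino is left out.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Expected value range for a column (hypothetical, set per table)."""
    low: float
    high: float


def find_anomalous_columns(stats_rows, expected):
    """Return (column, low, high) for every column whose collected
    min/max falls outside its expected bounds."""
    anomalies = []
    for row in stats_rows:
        name = row.get("column_name")
        low_raw, high_raw = row.get("low_value"), row.get("high_value")
        # Skip columns without expectations or without min/max statistics.
        if name not in expected or low_raw is None or high_raw is None:
            continue
        low, high = float(low_raw), float(high_raw)
        bounds = expected[name]
        if low < bounds.low or high > bounds.high:
            anomalies.append((name, low, high))
    return anomalies


# Sample rows shaped like statistics output (values are illustrative).
stats_rows = [
    {"column_name": "duration", "low_value": "-6.19e11", "high_value": "1.75e12"},
    {"column_name": "retries", "low_value": "0", "high_value": "5"},
    {"column_name": "comment", "low_value": None, "high_value": None},
]
expected = {"duration": Bounds(-8e8, 8e8), "retries": Bounds(0, 10)}

for name, low, high in find_anomalous_columns(stats_rows, expected):
    print(f"anomalous stats for {name}: low={low:g}, high={high:g}")
```

A check like this could run after each load and raise an alert before the widened range reaches production query plans. The expected bounds would need to come from whatever capping rules the upstream pipeline enforces.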