Hi team
I’m using the excellent DuckDB datasketches extension for large-scale analytics use cases. A common requirement in our datasets is computing the mode() (most frequent item) per group. However, DuckDB's built-in exact mode() function causes high memory usage, or even out-of-memory (OOM) failures, when applied to large, high-cardinality datasets.
Feature Request
Please consider adding support for approximate mode estimation using FrequentItemsSketch from Apache DataSketches.
Why is this useful?
- mode() is commonly needed in aggregations over grouped data, e.g.:
  SELECT x, y, mode(z) FROM my_table GROUP BY x, y;
- On large datasets (e.g., 30M+ rows, 1K+ groups), the exact mode() exhausts available memory.
- An approximate mode with a bounded error would be a good trade-off and fits the sketch philosophy well.
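
To illustrate what "approximate mode with bounded error" could mean, here is a minimal Python sketch of the classic Misra-Gries frequent-items summary. It is not the extension's or Apache DataSketches' implementation; the function name `approx_mode` and the parameter `k` are hypothetical and chosen only for this example. The summary keeps at most k-1 counters, and each counter undercounts the true frequency by at most n/k for a stream of n items, so the true mode is returned whenever it leads every other value by more than n/k occurrences.

```python
def approx_mode(stream, k=64):
    """Approximate the most frequent item using a Misra-Gries summary.

    Hypothetical helper for illustration only. Memory is bounded by
    k - 1 counters regardless of how many distinct values appear.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: count it.
            counters[item] += 1
        elif len(counters) < k - 1:
            # Free slot available: start tracking this item.
            counters[item] = 1
        else:
            # Summary is full: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    if not counters:
        return None
    # The surviving item with the largest (under)count is the estimate.
    return max(counters, key=counters.get)
```

In a grouped aggregation, one such summary would be kept per (x, y) group, which caps memory per group at k-1 counters instead of one counter per distinct value of z.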
References