
feat(arrow-row): add MSD radix sort kernel for row-encoded keys #9683

Open

mbutrovich wants to merge 8 commits into apache:main from mbutrovich:radix_row_sort

Conversation

@mbutrovich
Contributor

Which issue does this PR close?

N/A.

Rationale for this change

The existing lexsort_to_indices uses comparison sort on columnar arrays, which is O(n log n × comparison_cost) where comparison cost scales with the number of columns. The Arrow row format (RowConverter) produces memcmp-comparable byte sequences, making it a natural fit for radix sort — O(n × key_width) — which can overcome the encoding overhead by eliminating per-comparison column traversal.

Inspired by DuckDB's sorting redesign, which uses MSD radix sort on normalized key prefixes with a comparison sort fallback, this PR adds a radix_sort_to_indices kernel that operates directly on row-encoded keys. Like DuckDB, we limit radix depth to 8 bytes before falling back to comparison sort, balancing radix efficiency against diminishing returns on deep recursion.

What changes are included in this PR?

  • arrow-row/src/radix.rs (new): MSD radix sort on Rows with:
    • 256-bucket histogram + out-of-place scatter per byte position
    • Comparison sort fallback for small buckets (≤64 elements) or after 8 bytes of radix depth
    • Single pre-allocated temp buffer reused across recursion levels
  • arrow-row/src/lib.rs: Exposes pub mod radix
  • arrow/benches/lexsort.rs: Adds lexsort_radix benchmark variant alongside existing lexsort_to_indices and lexsort_rows, and removes a duplicate benchmark case
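The recursion the bullets above describe can be sketched in self-contained Rust. This is an illustrative sketch, not the crate's implementation: the names `radix_sort_indices` and `msd`, the constant values, and the extra "exhausted" bucket for keys that run out of bytes are assumptions for the sketch (the real kernel operates on `Rows`, whose encoding pads variable-length values, and also honors sort options, which this sketch omits):

```rust
/// Illustrative thresholds mirroring the PR description; the crate's
/// actual constants may differ.
const FALLBACK_THRESHOLD: usize = 64; // comparison-sort buckets at or below this size
const MAX_RADIX_DEPTH: usize = 8; // bytes of radix before falling back

/// Sort indices into `keys` by byte (memcmp) order, mirroring the
/// kernel's `Vec<u32>` index output.
fn radix_sort_indices(keys: &[&[u8]]) -> Vec<u32> {
    let mut idx: Vec<u32> = (0..keys.len() as u32).collect();
    // Single temp buffer, allocated once and reused across recursion levels.
    let mut tmp = vec![0u32; keys.len()];
    msd(keys, &mut idx, &mut tmp, 0);
    idx
}

fn msd(keys: &[&[u8]], idx: &mut [u32], tmp: &mut [u32], depth: usize) {
    if idx.len() <= FALLBACK_THRESHOLD || depth >= MAX_RADIX_DEPTH {
        // Fallback: keys are memcmp-comparable, and within this bucket the
        // first `depth` bytes are equal, so comparing suffixes suffices.
        idx.sort_unstable_by(|&a, &b| {
            let (ka, kb) = (keys[a as usize], keys[b as usize]);
            ka[depth.min(ka.len())..].cmp(&kb[depth.min(kb.len())..])
        });
        return;
    }
    // Bucket 0 holds keys exhausted before `depth` (a prefix sorts first);
    // byte value b goes to bucket b + 1, hence 257 buckets in this sketch.
    let bucket = |k: &[u8]| if depth < k.len() { k[depth] as usize + 1 } else { 0 };
    // Histogram of the byte at `depth`.
    let mut counts = [0usize; 257];
    for &i in idx.iter() {
        counts[bucket(keys[i as usize])] += 1;
    }
    // Prefix sums give each bucket's start offset for the scatter.
    let mut offsets = [0usize; 257];
    let mut sum = 0;
    for b in 0..257 {
        offsets[b] = sum;
        sum += counts[b];
    }
    // Out-of-place scatter into the temp buffer, then copy back.
    let tmp = &mut tmp[..idx.len()];
    for &i in idx.iter() {
        let b = bucket(keys[i as usize]);
        tmp[offsets[b]] = i;
        offsets[b] += 1;
    }
    idx.copy_from_slice(tmp);
    // Recurse into each non-trivial bucket; the "exhausted" bucket is done.
    let mut start = counts[0];
    for b in 1..257 {
        let end = start + counts[b];
        if counts[b] > 1 {
            msd(keys, &mut idx[start..end], tmp, depth + 1);
        }
        start = end;
    }
}
```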

Benchmark results

All three variants include the full pipeline (encoding + sort), so the comparison against lexsort_to_indices (which doesn't encode) is apples-to-apples. Here is a subset of results from cargo bench --bench lexsort. I'll post the full table in a follow-up comment.

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 86.211 µs | 117.00 µs | 71.987 µs |
| [i32, i32_opt] | 32768 | 866.91 µs | 1.2274 ms | 359.07 µs |
| [i32, str_opt(16)] | 32768 | 860.45 µs | 1.8661 ms | 510.93 µs |
| [str_opt(16), str(16)] | 32768 | 2.4401 ms | 1.6789 ms | 946.89 µs |
| [3x str] | 32768 | 2.4590 ms | 1.9281 ms | 1.2148 ms |
| [i32_opt, dict] | 32768 | 1.1873 ms | 1.3615 ms | 627.81 µs |
| [dict, dict] | 32768 | 499.32 µs | 732.40 µs | 722.44 µs |
| [3x dict, str(16)] | 32768 | 4.1305 ms | 2.0389 ms | 1.5303 ms |
| [i32_opt, i32_list] | 32768 | 1.5171 ms | 2.8395 ms | 1.3393 ms |
| [i32, i32_list, str(16)] | 32768 | 862.46 µs | 2.2412 ms | 1.1633 ms |

Radix sort is the fastest in the majority of cases. The main exception is pure low-cardinality dictionary columns where lexsort_to_indices avoids encoding overhead entirely. The module documentation provides guidance on when to use each approach.

Are these changes tested?

17 tests including:

  • Deterministic tests for integers, strings, multi-column, nulls, all-equal, empty, single-element
  • All 4 sort option combinations (ascending/descending × nulls first/last)
  • Float64 with NaN/Infinity, booleans
  • Threshold boundary tests (sizes 1–1000 around the fallback threshold)
  • Fuzz test: 100 iterations × 1–4 random columns × random types × random sort options × 5–500 rows
  • Cross-validation: verifies radix output matches comparison sort on the same Rows

Are there any user-facing changes?

New public API: arrow_row::radix::radix_sort_to_indices(&Rows) -> Vec<u32>.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2026
@mbutrovich
Contributor Author

Full cargo bench --bench lexsort results on M3 Max:

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 86.211 µs | 117.00 µs | 71.987 µs |
| [i32, i32_opt] | 32768 | 866.91 µs | 1.2274 ms | 359.07 µs |
| [i32, str_opt(16)] | 4096 | 85.640 µs | 156.87 µs | 88.564 µs |
| [i32, str_opt(16)] | 32768 | 860.45 µs | 1.8661 ms | 510.93 µs |
| [i32, str(16)] | 4096 | 84.783 µs | 136.31 µs | 85.021 µs |
| [i32, str(16)] | 32768 | 859.64 µs | 1.4682 ms | 457.38 µs |
| [str_opt(16), str(16)] | 4096 | 230.48 µs | 159.14 µs | 123.71 µs |
| [str_opt(16), str(16)] | 32768 | 2.4401 ms | 1.6789 ms | 946.89 µs |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 233.39 µs | 192.74 µs | 155.40 µs |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 2.4590 ms | 1.9281 ms | 1.2148 ms |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 232.35 µs | 226.56 µs | 188.98 µs |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 2.4572 ms | 2.2451 ms | 1.5222 ms |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 109.14 µs | 137.92 µs | 100.16 µs |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 1.1873 ms | 1.3615 ms | 627.81 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 50.649 µs | 88.038 µs | 90.044 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 499.32 µs | 732.40 µs | 722.44 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 4096 | 375.75 µs | 193.73 µs | 179.50 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 32768 | 4.1305 ms | 2.0389 ms | 1.5303 ms |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 4096 | 380.06 µs | 214.73 µs | 192.86 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 32768 | 4.2629 ms | 2.2013 ms | 1.7082 ms |
| [i32_opt, i32_list] | 4096 | 138.11 µs | 244.78 µs | 157.57 µs |
| [i32_opt, i32_list] | 32768 | 1.5171 ms | 2.8395 ms | 1.3393 ms |
| [i32_opt, i32_list_opt] | 4096 | 149.05 µs | 232.17 µs | 161.53 µs |
| [i32_opt, i32_list_opt] | 32768 | 1.6480 ms | 2.8980 ms | 1.3731 ms |
| [i32_list_opt, i32_opt] | 4096 | 251.74 µs | 216.05 µs | 173.67 µs |
| [i32_list_opt, i32_opt] | 32768 | 2.7722 ms | 2.5670 ms | 1.5963 ms |
| [i32, str_list(4)] | 4096 | 85.625 µs | 374.40 µs | 287.85 µs |
| [i32, str_list(4)] | 32768 | 858.96 µs | 4.7725 ms | 3.3054 ms |
| [str_list(4), i32] | 4096 | 297.00 µs | 390.12 µs | 297.47 µs |
| [str_list(4), i32] | 32768 | 3.3962 ms | 4.3830 ms | 3.8276 ms |
| [i32, str_list_opt(4)] | 4096 | 84.844 µs | 374.89 µs | 261.91 µs |
| [i32, str_list_opt(4)] | 32768 | 872.22 µs | 4.7227 ms | 3.1478 ms |
| [str_list_opt(4), i32] | 4096 | 354.73 µs | 327.17 µs | 300.98 µs |
| [str_list_opt(4), i32] | 32768 | 4.1949 ms | 4.3219 ms | 3.7982 ms |
| [i32, i32_list, str(16)] | 4096 | 85.755 µs | 220.73 µs | 170.03 µs |
| [i32, i32_list, str(16)] | 32768 | 862.46 µs | 2.2412 ms | 1.1633 ms |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 156.44 µs | 240.29 µs | 195.77 µs |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.7604 ms | 2.4651 ms | 1.6350 ms |

@mbutrovich mbutrovich changed the title Radix row sort feat(arrow-row): add MSD radix sort kernel for row-encoded keys Apr 9, 2026
@mbutrovich
Contributor Author

mbutrovich commented Apr 10, 2026

Thanks for the feedback @Dandandan!

I tried the 4 optimizations you suggested, and here are the results:

| Schema | N | Before | After | Change |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 71.987 µs | 70.200 µs | -2.5% |
| [i32, i32_opt] | 32768 | 359.07 µs | 341.22 µs | -5.0% |
| [i32, str_opt(16)] | 4096 | 88.564 µs | 88.874 µs | +0.4% |
| [i32, str_opt(16)] | 32768 | 510.93 µs | 493.01 µs | -3.5% |
| [i32, str(16)] | 4096 | 85.021 µs | 92.958 µs | +9.3% |
| [i32, str(16)] | 32768 | 457.38 µs | 425.87 µs | -6.9% |
| [str_opt(16), str(16)] | 4096 | 123.71 µs | 133.51 µs | +7.9% |
| [str_opt(16), str(16)] | 32768 | 946.89 µs | 993.63 µs | +4.9% |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 155.40 µs | 165.69 µs | +6.6% |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 1.2148 ms | 1.2654 ms | +4.2% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 188.98 µs | 199.13 µs | +5.4% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 1.5222 ms | 1.5486 ms | +1.7% |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 100.16 µs | 101.17 µs | +1.0% |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 627.81 µs | 686.16 µs | +9.3% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 90.044 µs | 89.670 µs | -0.4% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 722.44 µs | 741.11 µs | +2.6% |
| [dict(100,str_opt(50))x3, str(16)] | 4096 | 179.50 µs | 187.09 µs | +4.2% |
| [dict(100,str_opt(50))x3, str(16)] | 32768 | 1.5303 ms | 1.5696 ms | +2.6% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 4096 | 192.86 µs | 200.81 µs | +4.1% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 32768 | 1.7082 ms | 1.7309 ms | +1.3% |
| [i32_opt, i32_list] | 4096 | 157.57 µs | 160.09 µs | +1.6% |
| [i32_opt, i32_list] | 32768 | 1.3393 ms | 1.4233 ms | +6.3% |
| [i32_opt, i32_list_opt] | 4096 | 161.53 µs | 164.17 µs | +1.6% |
| [i32_opt, i32_list_opt] | 32768 | 1.3731 ms | 1.4992 ms | +9.2% |
| [i32_list_opt, i32_opt] | 4096 | 173.67 µs | 179.41 µs | +3.3% |
| [i32_list_opt, i32_opt] | 32768 | 1.5963 ms | 1.5744 ms | -1.4% |
| [i32, str_list(4)] | 4096 | 287.85 µs | 291.31 µs | +1.2% |
| [i32, str_list(4)] | 32768 | 3.3054 ms | 3.2674 ms | -1.1% |
| [str_list(4), i32] | 4096 | 297.47 µs | 360.26 µs | +21.1% |
| [str_list(4), i32] | 32768 | 3.8276 ms | 3.8494 ms | +0.6% |
| [i32, str_list_opt(4)] | 4096 | 261.91 µs | 325.47 µs | +24.3% |
| [i32, str_list_opt(4)] | 32768 | 3.1478 ms | 3.1075 ms | -1.3% |
| [str_list_opt(4), i32] | 4096 | 300.98 µs | 334.20 µs | +11.0% |
| [str_list_opt(4), i32] | 32768 | 3.7982 ms | 3.8360 ms | +1.0% |
| [i32, i32_list, str(16)] | 4096 | 170.03 µs | 175.74 µs | +3.4% |
| [i32, i32_list, str(16)] | 32768 | 1.1633 ms | 1.1742 ms | +0.9% |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 195.77 µs | 198.17 µs | +1.2% |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.6350 ms | 1.7419 ms | +6.5% |

Claude's attempt at interpreting the results:

Integer schemas at large N improved ~3-5%. Most other schemas regressed, especially schemas with list columns at small N (+21-24%).

Likely causes:

  1. Ping-pong fallback overhead: The old code did one bulk copy_from_slice per level. Ping-pong eliminates that but replaces it with many small per-bucket copies in the fallback path. For schemas that produce many small buckets (strings, lists, dicts), many scattered copies are less efficient than one contiguous memcpy.
  2. data_from comparison overhead: The old fallback used ra.cmp(&rb) — a direct memcmp. The new data_from(byte_pos).cmp(...) adds a bounds check + subslice per row per comparison. For schemas hitting the fallback frequently, this outweighs the benefit of skipping the shared prefix.

Integer schemas benefit because they have fewer, larger buckets that stay in the radix path longer where the eliminated per-level memcpy pays off.
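To make point 2 concrete, here is a minimal sketch (function names illustrative, not the crate's API): within a radix bucket every key shares its first `depth` bytes, so a suffix compare, which is what skipping the shared prefix via `data_from(byte_pos)` amounts to, orders the bucket identically to a full compare, but each comparison now constructs a subslice and pays its bounds check.

```rust
/// Sort indices by comparing only the key bytes from `depth` onward.
/// Valid whenever all keys agree on their first `depth` bytes, as keys
/// within a radix bucket do; each subslice adds a bounds check.
fn sort_by_suffix(idx: &mut [u32], keys: &[&[u8]], depth: usize) {
    idx.sort_unstable_by(|&a, &b| keys[a as usize][depth..].cmp(&keys[b as usize][depth..]));
}

/// Sort indices by a full memcmp-style compare, as the old fallback did.
fn sort_full(idx: &mut [u32], keys: &[&[u8]]) {
    idx.sort_unstable_by(|&a, &b| keys[a as usize].cmp(&keys[b as usize]));
}
```

The two produce identical orderings on a shared-prefix bucket; the question the benchmarks answer is whether skipping the prefix pays for the extra subslice work.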

Do you want the commit anyway, just to iterate on?

@mbutrovich
Contributor Author

"Byte Buffer" is the current commit:

| Schema | N | Original PR | Feedback | Byte Buffer | Buffer vs Original |
|---|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 71.987 µs | 70.200 µs | 74.722 µs | +3.8% |
| [i32, i32_opt] | 32768 | 359.07 µs | 341.22 µs | 320.76 µs | -10.7% |
| [i32, str_opt(16)] | 4096 | 88.564 µs | 88.874 µs | 87.527 µs | -1.2% |
| [i32, str_opt(16)] | 32768 | 510.93 µs | 493.01 µs | 485.17 µs | -5.0% |
| [i32, str(16)] | 4096 | 85.021 µs | 92.958 µs | 88.301 µs | +3.9% |
| [i32, str(16)] | 32768 | 457.38 µs | 425.87 µs | 413.71 µs | -9.5% |
| [str_opt(16), str(16)] | 4096 | 123.71 µs | 133.51 µs | 132.79 µs | +7.3% |
| [str_opt(16), str(16)] | 32768 | 946.89 µs | 993.63 µs | 948.77 µs | +0.2% |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 155.40 µs | 165.69 µs | 164.27 µs | +5.7% |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 1.2148 ms | 1.2654 ms | 1.2207 ms | +0.5% |
| [str_opt(16), str(16), x5] | 4096 | 188.98 µs | 199.13 µs | 199.64 µs | +5.6% |
| [str_opt(16), str(16), x5] | 32768 | 1.5222 ms | 1.5486 ms | 1.5407 ms | +1.2% |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 100.16 µs | 101.17 µs | 102.22 µs | +2.1% |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 627.81 µs | 686.16 µs | 709.65 µs | +13.0% |
| [dict x2] | 4096 | 90.044 µs | 89.670 µs | 93.701 µs | +4.1% |
| [dict x2] | 32768 | 722.44 µs | 741.11 µs | 760.66 µs | +5.3% |
| [dict x3, str(16)] | 4096 | 179.50 µs | 187.09 µs | 182.74 µs | +1.8% |
| [dict x3, str(16)] | 32768 | 1.5303 ms | 1.5696 ms | 1.5158 ms | -0.9% |
| [dict x3, str_opt(50)] | 4096 | 192.86 µs | 200.81 µs | 197.92 µs | +2.6% |
| [dict x3, str_opt(50)] | 32768 | 1.7082 ms | 1.7309 ms | 1.6815 ms | -1.6% |
| [i32_opt, i32_list] | 4096 | 157.57 µs | 160.09 µs | 158.90 µs | +0.8% |
| [i32_opt, i32_list] | 32768 | 1.3393 ms | 1.4233 ms | 1.4378 ms | +7.4% |
| [i32_opt, i32_list_opt] | 4096 | 161.53 µs | 164.17 µs | 171.16 µs | +6.0% |
| [i32_opt, i32_list_opt] | 32768 | 1.3731 ms | 1.4992 ms | 1.4596 ms | +6.3% |
| [i32_list_opt, i32_opt] | 4096 | 173.67 µs | 179.41 µs | 184.39 µs | +6.2% |
| [i32_list_opt, i32_opt] | 32768 | 1.5963 ms | 1.5744 ms | 1.5886 ms | -0.5% |
| [i32, str_list(4)] | 4096 | 287.85 µs | 291.31 µs | 302.80 µs | +5.2% |
| [i32, str_list(4)] | 32768 | 3.3054 ms | 3.2674 ms | 3.2501 ms | -1.7% |
| [str_list(4), i32] | 4096 | 297.47 µs | 360.26 µs | 324.41 µs | +9.1% |
| [str_list(4), i32] | 32768 | 3.8276 ms | 3.8494 ms | 3.8520 ms | +0.6% |
| [i32, str_list_opt(4)] | 4096 | 261.91 µs | 325.47 µs | 287.95 µs | +9.9% |
| [i32, str_list_opt(4)] | 32768 | 3.1478 ms | 3.1075 ms | 3.0864 ms | -2.0% |
| [str_list_opt(4), i32] | 4096 | 300.98 µs | 334.20 µs | 317.49 µs | +5.5% |
| [str_list_opt(4), i32] | 32768 | 3.7982 ms | 3.8360 ms | 3.7977 ms | 0.0% |
| [i32, i32_list, str(16)] | 4096 | 170.03 µs | 175.74 µs | 171.79 µs | +1.0% |
| [i32, i32_list, str(16)] | 32768 | 1.1633 ms | 1.1742 ms | 1.1638 ms | 0.0% |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 195.77 µs | 198.17 µs | 195.62 µs | -0.1% |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.6350 ms | 1.7419 ms | 1.7033 ms | +4.2% |

Here it is compared to the other sorts:

| Schema | N | Best Other Sort | Radix (Byte Buffer) | Radix Wins? |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 86.1 µs | 74.7 µs | Yes |
| [i32, i32_opt] | 32768 | 859.9 µs | 320.8 µs | Yes |
| [i32, str_opt(16)] | 4096 | 87.4 µs | 87.5 µs | Tie |
| [i32, str_opt(16)] | 32768 | 873.2 µs | 485.2 µs | Yes |
| [i32, str(16)] | 4096 | 85.7 µs | 88.3 µs | No |
| [i32, str(16)] | 32768 | 877.2 µs | 413.7 µs | Yes |
| [str_opt(16), str(16)] | 4096 | 159.2 µs | 132.8 µs | Yes |
| [str_opt(16), str(16)] | 32768 | 1.6544 ms | 948.8 µs | Yes |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 194.0 µs | 164.3 µs | Yes |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 2.0054 ms | 1.2207 ms | Yes |
| [str x5] | 4096 | 229.2 µs | 199.6 µs | Yes |
| [str x5] | 32768 | 2.2503 ms | 1.5407 ms | Yes |
| [i32_opt, dict] | 4096 | 110.0 µs | 102.2 µs | Yes |
| [i32_opt, dict] | 32768 | 1.2056 ms | 709.7 µs | Yes |
| [dict x2] | 4096 | 50.7 µs | 93.7 µs | No |
| [dict x2] | 32768 | 503.1 µs | 760.7 µs | No |
| [dict x3, str(16)] | 4096 | 192.1 µs | 182.7 µs | Yes |
| [dict x3, str(16)] | 32768 | 2.0349 ms | 1.5158 ms | Yes |
| [dict x3, str_opt(50)] | 4096 | 213.7 µs | 197.9 µs | Yes |
| [dict x3, str_opt(50)] | 32768 | 2.1790 ms | 1.6815 ms | Yes |
| [i32_opt, i32_list] | 4096 | 139.1 µs | 158.9 µs | No |
| [i32_opt, i32_list] | 32768 | 1.4647 ms | 1.4378 ms | Yes |
| [i32_opt, i32_list_opt] | 4096 | 148.1 µs | 171.2 µs | No |
| [i32_opt, i32_list_opt] | 32768 | 1.6325 ms | 1.4596 ms | Yes |
| [i32_list_opt, i32_opt] | 4096 | 225.0 µs | 184.4 µs | Yes |
| [i32_list_opt, i32_opt] | 32768 | 2.5477 ms | 1.5886 ms | Yes |
| [i32, str_list(4)] | 4096 | 85.0 µs | 302.8 µs | No |
| [i32, str_list(4)] | 32768 | 874.5 µs | 3.2501 ms | No |
| [str_list(4), i32] | 4096 | 295.8 µs | 324.4 µs | No |
| [str_list(4), i32] | 32768 | 3.4113 ms | 3.8520 ms | No |
| [i32, str_list_opt(4)] | 4096 | 87.0 µs | 288.0 µs | No |
| [i32, str_list_opt(4)] | 32768 | 873.5 µs | 3.0864 ms | No |
| [str_list_opt(4), i32] | 4096 | 354.1 µs | 317.5 µs | Yes |
| [str_list_opt(4), i32] | 32768 | 3.7977 ms | 3.7977 ms | Tie |
| [i32, i32_list, str(16)] | 4096 | 85.0 µs | 171.8 µs | No |
| [i32, i32_list, str(16)] | 32768 | 868.2 µs | 1.1638 ms | No |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 157.4 µs | 195.6 µs | No |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.7421 ms | 1.7033 ms | Yes |
