
feat(arrow-row): add MSD radix sort kernel for row-encoded keys #9683

Open

mbutrovich wants to merge 8 commits into apache:main from mbutrovich:radix_row_sort

Conversation

@mbutrovich
Contributor

Which issue does this PR close?

N/A.

Rationale for this change

The existing lexsort_to_indices uses comparison sort on columnar arrays, which is O(n log n × comparison_cost) where comparison cost scales with the number of columns. The Arrow row format (RowConverter) produces memcmp-comparable byte sequences, making it a natural fit for radix sort — O(n × key_width) — which can overcome the encoding overhead by eliminating per-comparison column traversal.

Inspired by DuckDB's sorting redesign, which uses MSD radix sort on normalized key prefixes with a comparison sort fallback, this PR adds a radix_sort_to_indices kernel that operates directly on row-encoded keys. Like DuckDB, we limit radix depth to 8 bytes before falling back to comparison sort, balancing radix efficiency against diminishing returns on deep recursion.

What changes are included in this PR?

  • arrow-row/src/radix.rs (new): MSD radix sort on Rows with:
    • 256-bucket histogram + out-of-place scatter per byte position
    • Comparison sort fallback for small buckets (≤64 elements) or after 8 bytes of radix depth
    • Single pre-allocated temp buffer reused across recursion levels
  • arrow-row/src/lib.rs: Exposes pub mod radix
  • arrow/benches/lexsort.rs: Adds lexsort_radix benchmark variant alongside existing lexsort_to_indices and lexsort_rows, and removes a duplicate benchmark case
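The recursion the bullets above describe can be sketched in self-contained Rust. This is an illustrative sketch, not the crate's implementation: the names `radix_sort_indices` and `msd`, the constant values, and the extra "exhausted" bucket for keys that run out of bytes are assumptions for the sketch (the real kernel operates on `Rows`, whose encoding pads variable-length values, and also honors sort options, which this sketch omits):

```rust
/// Illustrative thresholds mirroring the PR description; the crate's
/// actual constants may differ.
const FALLBACK_THRESHOLD: usize = 64; // comparison-sort buckets at or below this size
const MAX_RADIX_DEPTH: usize = 8; // bytes of radix before falling back

/// Sort indices into `keys` by byte (memcmp) order, mirroring the
/// kernel's `Vec<u32>` index output.
fn radix_sort_indices(keys: &[&[u8]]) -> Vec<u32> {
    let mut idx: Vec<u32> = (0..keys.len() as u32).collect();
    // Single temp buffer, allocated once and reused across recursion levels.
    let mut tmp = vec![0u32; keys.len()];
    msd(keys, &mut idx, &mut tmp, 0);
    idx
}

fn msd(keys: &[&[u8]], idx: &mut [u32], tmp: &mut [u32], depth: usize) {
    if idx.len() <= FALLBACK_THRESHOLD || depth >= MAX_RADIX_DEPTH {
        // Fallback: keys are memcmp-comparable, and within this bucket the
        // first `depth` bytes are equal, so comparing suffixes suffices.
        idx.sort_unstable_by(|&a, &b| {
            let (ka, kb) = (keys[a as usize], keys[b as usize]);
            ka[depth.min(ka.len())..].cmp(&kb[depth.min(kb.len())..])
        });
        return;
    }
    // Bucket 0 holds keys exhausted before `depth` (a prefix sorts first);
    // byte value b goes to bucket b + 1, hence 257 buckets in this sketch.
    let bucket = |k: &[u8]| if depth < k.len() { k[depth] as usize + 1 } else { 0 };
    // Histogram of the byte at `depth`.
    let mut counts = [0usize; 257];
    for &i in idx.iter() {
        counts[bucket(keys[i as usize])] += 1;
    }
    // Prefix sums give each bucket's start offset for the scatter.
    let mut offsets = [0usize; 257];
    let mut sum = 0;
    for b in 0..257 {
        offsets[b] = sum;
        sum += counts[b];
    }
    // Out-of-place scatter into the temp buffer, then copy back.
    let tmp = &mut tmp[..idx.len()];
    for &i in idx.iter() {
        let b = bucket(keys[i as usize]);
        tmp[offsets[b]] = i;
        offsets[b] += 1;
    }
    idx.copy_from_slice(tmp);
    // Recurse into each non-trivial bucket; the "exhausted" bucket is done.
    let mut start = counts[0];
    for b in 1..257 {
        let end = start + counts[b];
        if counts[b] > 1 {
            msd(keys, &mut idx[start..end], tmp, depth + 1);
        }
        start = end;
    }
}
```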

Benchmark results

All three variants include the full pipeline (encoding + sort), so the comparison against lexsort_to_indices (which doesn't encode) is apples-to-apples. Here is a subset of results from cargo bench --bench lexsort. I'll post the full table in a follow-up comment.

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 86.211 µs | 117.00 µs | 71.987 µs |
| [i32, i32_opt] | 32768 | 866.91 µs | 1.2274 ms | 359.07 µs |
| [i32, str_opt(16)] | 32768 | 860.45 µs | 1.8661 ms | 510.93 µs |
| [str_opt(16), str(16)] | 32768 | 2.4401 ms | 1.6789 ms | 946.89 µs |
| [3x str] | 32768 | 2.4590 ms | 1.9281 ms | 1.2148 ms |
| [i32_opt, dict] | 32768 | 1.1873 ms | 1.3615 ms | 627.81 µs |
| [dict, dict] | 32768 | 499.32 µs | 732.40 µs | 722.44 µs |
| [3x dict, str(16)] | 32768 | 4.1305 ms | 2.0389 ms | 1.5303 ms |
| [i32_opt, i32_list] | 32768 | 1.5171 ms | 2.8395 ms | 1.3393 ms |
| [i32, i32_list, str(16)] | 32768 | 862.46 µs | 2.2412 ms | 1.1633 ms |

Radix sort is the fastest in the majority of cases. The main exception is pure low-cardinality dictionary columns where lexsort_to_indices avoids encoding overhead entirely. The module documentation provides guidance on when to use each approach.

Are these changes tested?

17 tests including:

  • Deterministic tests for integers, strings, multi-column, nulls, all-equal, empty, single-element
  • All 4 sort option combinations (ascending/descending × nulls first/last)
  • Float64 with NaN/Infinity, booleans
  • Threshold boundary tests (sizes 1–1000 around the fallback threshold)
  • Fuzz test: 100 iterations × 1–4 random columns × random types × random sort options × 5–500 rows
  • Cross-validation: verifies radix output matches comparison sort on the same Rows

Are there any user-facing changes?

New public API: arrow_row::radix::radix_sort_to_indices(&Rows) -> Vec<u32>.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2026
@mbutrovich
Contributor Author

Full cargo bench --bench lexsort results on M3 Max:

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 86.211 µs | 117.00 µs | 71.987 µs |
| [i32, i32_opt] | 32768 | 866.91 µs | 1.2274 ms | 359.07 µs |
| [i32, str_opt(16)] | 4096 | 85.640 µs | 156.87 µs | 88.564 µs |
| [i32, str_opt(16)] | 32768 | 860.45 µs | 1.8661 ms | 510.93 µs |
| [i32, str(16)] | 4096 | 84.783 µs | 136.31 µs | 85.021 µs |
| [i32, str(16)] | 32768 | 859.64 µs | 1.4682 ms | 457.38 µs |
| [str_opt(16), str(16)] | 4096 | 230.48 µs | 159.14 µs | 123.71 µs |
| [str_opt(16), str(16)] | 32768 | 2.4401 ms | 1.6789 ms | 946.89 µs |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 233.39 µs | 192.74 µs | 155.40 µs |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 2.4590 ms | 1.9281 ms | 1.2148 ms |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 232.35 µs | 226.56 µs | 188.98 µs |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 2.4572 ms | 2.2451 ms | 1.5222 ms |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 109.14 µs | 137.92 µs | 100.16 µs |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 1.1873 ms | 1.3615 ms | 627.81 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 50.649 µs | 88.038 µs | 90.044 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 499.32 µs | 732.40 µs | 722.44 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 4096 | 375.75 µs | 193.73 µs | 179.50 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 32768 | 4.1305 ms | 2.0389 ms | 1.5303 ms |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 4096 | 380.06 µs | 214.73 µs | 192.86 µs |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 32768 | 4.2629 ms | 2.2013 ms | 1.7082 ms |
| [i32_opt, i32_list] | 4096 | 138.11 µs | 244.78 µs | 157.57 µs |
| [i32_opt, i32_list] | 32768 | 1.5171 ms | 2.8395 ms | 1.3393 ms |
| [i32_opt, i32_list_opt] | 4096 | 149.05 µs | 232.17 µs | 161.53 µs |
| [i32_opt, i32_list_opt] | 32768 | 1.6480 ms | 2.8980 ms | 1.3731 ms |
| [i32_list_opt, i32_opt] | 4096 | 251.74 µs | 216.05 µs | 173.67 µs |
| [i32_list_opt, i32_opt] | 32768 | 2.7722 ms | 2.5670 ms | 1.5963 ms |
| [i32, str_list(4)] | 4096 | 85.625 µs | 374.40 µs | 287.85 µs |
| [i32, str_list(4)] | 32768 | 858.96 µs | 4.7725 ms | 3.3054 ms |
| [str_list(4), i32] | 4096 | 297.00 µs | 390.12 µs | 297.47 µs |
| [str_list(4), i32] | 32768 | 3.3962 ms | 4.3830 ms | 3.8276 ms |
| [i32, str_list_opt(4)] | 4096 | 84.844 µs | 374.89 µs | 261.91 µs |
| [i32, str_list_opt(4)] | 32768 | 872.22 µs | 4.7227 ms | 3.1478 ms |
| [str_list_opt(4), i32] | 4096 | 354.73 µs | 327.17 µs | 300.98 µs |
| [str_list_opt(4), i32] | 32768 | 4.1949 ms | 4.3219 ms | 3.7982 ms |
| [i32, i32_list, str(16)] | 4096 | 85.755 µs | 220.73 µs | 170.03 µs |
| [i32, i32_list, str(16)] | 32768 | 862.46 µs | 2.2412 ms | 1.1633 ms |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 156.44 µs | 240.29 µs | 195.77 µs |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.7604 ms | 2.4651 ms | 1.6350 ms |

@mbutrovich mbutrovich changed the title Radix row sort feat(arrow-row): add MSD radix sort kernel for row-encoded keys Apr 9, 2026
@mbutrovich
Contributor Author

mbutrovich commented Apr 10, 2026

Thanks for the feedback @Dandandan!

I tried the 4 optimizations you suggested, and here are the results:

| Schema | N | Before | After | Change |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 71.987 µs | 70.200 µs | -2.5% |
| [i32, i32_opt] | 32768 | 359.07 µs | 341.22 µs | -5.0% |
| [i32, str_opt(16)] | 4096 | 88.564 µs | 88.874 µs | +0.4% |
| [i32, str_opt(16)] | 32768 | 510.93 µs | 493.01 µs | -3.5% |
| [i32, str(16)] | 4096 | 85.021 µs | 92.958 µs | +9.3% |
| [i32, str(16)] | 32768 | 457.38 µs | 425.87 µs | -6.9% |
| [str_opt(16), str(16)] | 4096 | 123.71 µs | 133.51 µs | +7.9% |
| [str_opt(16), str(16)] | 32768 | 946.89 µs | 993.63 µs | +4.9% |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 155.40 µs | 165.69 µs | +6.6% |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 1.2148 ms | 1.2654 ms | +4.2% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 188.98 µs | 199.13 µs | +5.4% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 1.5222 ms | 1.5486 ms | +1.7% |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 100.16 µs | 101.17 µs | +1.0% |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 627.81 µs | 686.16 µs | +9.3% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 90.044 µs | 89.670 µs | -0.4% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 722.44 µs | 741.11 µs | +2.6% |
| [dict(100,str_opt(50))x3, str(16)] | 4096 | 179.50 µs | 187.09 µs | +4.2% |
| [dict(100,str_opt(50))x3, str(16)] | 32768 | 1.5303 ms | 1.5696 ms | +2.6% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 4096 | 192.86 µs | 200.81 µs | +4.1% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 32768 | 1.7082 ms | 1.7309 ms | +1.3% |
| [i32_opt, i32_list] | 4096 | 157.57 µs | 160.09 µs | +1.6% |
| [i32_opt, i32_list] | 32768 | 1.3393 ms | 1.4233 ms | +6.3% |
| [i32_opt, i32_list_opt] | 4096 | 161.53 µs | 164.17 µs | +1.6% |
| [i32_opt, i32_list_opt] | 32768 | 1.3731 ms | 1.4992 ms | +9.2% |
| [i32_list_opt, i32_opt] | 4096 | 173.67 µs | 179.41 µs | +3.3% |
| [i32_list_opt, i32_opt] | 32768 | 1.5963 ms | 1.5744 ms | -1.4% |
| [i32, str_list(4)] | 4096 | 287.85 µs | 291.31 µs | +1.2% |
| [i32, str_list(4)] | 32768 | 3.3054 ms | 3.2674 ms | -1.1% |
| [str_list(4), i32] | 4096 | 297.47 µs | 360.26 µs | +21.1% |
| [str_list(4), i32] | 32768 | 3.8276 ms | 3.8494 ms | +0.6% |
| [i32, str_list_opt(4)] | 4096 | 261.91 µs | 325.47 µs | +24.3% |
| [i32, str_list_opt(4)] | 32768 | 3.1478 ms | 3.1075 ms | -1.3% |
| [str_list_opt(4), i32] | 4096 | 300.98 µs | 334.20 µs | +11.0% |
| [str_list_opt(4), i32] | 32768 | 3.7982 ms | 3.8360 ms | +1.0% |
| [i32, i32_list, str(16)] | 4096 | 170.03 µs | 175.74 µs | +3.4% |
| [i32, i32_list, str(16)] | 32768 | 1.1633 ms | 1.1742 ms | +0.9% |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 195.77 µs | 198.17 µs | +1.2% |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.6350 ms | 1.7419 ms | +6.5% |

Claude's attempt at interpreting the results:

Integer schemas at large N improved ~3-5%. Most other schemas regressed, especially schemas with list columns at small N (+21-24%).

Likely causes:

  1. Ping-pong fallback overhead: The old code did one bulk copy_from_slice per level. Ping-pong eliminates that but replaces it with many small per-bucket copies in the fallback path. For schemas that produce many small buckets (strings, lists, dicts), many scattered copies are less efficient than one contiguous memcpy.
  2. data_from comparison overhead: The old fallback used ra.cmp(&rb) — a direct memcmp. The new data_from(byte_pos).cmp(...) adds a bounds check + subslice per row per comparison. For schemas hitting the fallback frequently, this outweighs the benefit of skipping the shared prefix.

Integer schemas benefit because they have fewer, larger buckets that stay in the radix path longer where the eliminated per-level memcpy pays off.
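To make point 2 concrete, here is a minimal sketch (function names illustrative, not the crate's API): within a radix bucket every key shares its first `depth` bytes, so a suffix compare, which is what skipping the shared prefix via `data_from(byte_pos)` amounts to, orders the bucket identically to a full compare, but each comparison now constructs a subslice and pays its bounds check.

```rust
/// Sort indices by comparing only the key bytes from `depth` onward.
/// Valid whenever all keys agree on their first `depth` bytes, as keys
/// within a radix bucket do; each subslice adds a bounds check.
fn sort_by_suffix(idx: &mut [u32], keys: &[&[u8]], depth: usize) {
    idx.sort_unstable_by(|&a, &b| keys[a as usize][depth..].cmp(&keys[b as usize][depth..]));
}

/// Sort indices by a full memcmp-style compare, as the old fallback did.
fn sort_full(idx: &mut [u32], keys: &[&[u8]]) {
    idx.sort_unstable_by(|&a, &b| keys[a as usize].cmp(&keys[b as usize]));
}
```

The two produce identical orderings on a shared-prefix bucket; the question the benchmarks answer is whether skipping the prefix pays for the extra subslice work.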

Do you want the commit anyway, just to iterate on?

@mbutrovich
Contributor Author

"Byte Buffer" is the current commit:

| Schema | N | Original PR | Feedback | Byte Buffer | Buffer vs Original |
|---|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 71.987 µs | 70.200 µs | 74.722 µs | +3.8% |
| [i32, i32_opt] | 32768 | 359.07 µs | 341.22 µs | 320.76 µs | -10.7% |
| [i32, str_opt(16)] | 4096 | 88.564 µs | 88.874 µs | 87.527 µs | -1.2% |
| [i32, str_opt(16)] | 32768 | 510.93 µs | 493.01 µs | 485.17 µs | -5.0% |
| [i32, str(16)] | 4096 | 85.021 µs | 92.958 µs | 88.301 µs | +3.9% |
| [i32, str(16)] | 32768 | 457.38 µs | 425.87 µs | 413.71 µs | -9.5% |
| [str_opt(16), str(16)] | 4096 | 123.71 µs | 133.51 µs | 132.79 µs | +7.3% |
| [str_opt(16), str(16)] | 32768 | 946.89 µs | 993.63 µs | 948.77 µs | +0.2% |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 155.40 µs | 165.69 µs | 164.27 µs | +5.7% |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 1.2148 ms | 1.2654 ms | 1.2207 ms | +0.5% |
| [str_opt(16), str(16), x5] | 4096 | 188.98 µs | 199.13 µs | 199.64 µs | +5.6% |
| [str_opt(16), str(16), x5] | 32768 | 1.5222 ms | 1.5486 ms | 1.5407 ms | +1.2% |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 100.16 µs | 101.17 µs | 102.22 µs | +2.1% |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 627.81 µs | 686.16 µs | 709.65 µs | +13.0% |
| [dict x2] | 4096 | 90.044 µs | 89.670 µs | 93.701 µs | +4.1% |
| [dict x2] | 32768 | 722.44 µs | 741.11 µs | 760.66 µs | +5.3% |
| [dict x3, str(16)] | 4096 | 179.50 µs | 187.09 µs | 182.74 µs | +1.8% |
| [dict x3, str(16)] | 32768 | 1.5303 ms | 1.5696 ms | 1.5158 ms | -0.9% |
| [dict x3, str_opt(50)] | 4096 | 192.86 µs | 200.81 µs | 197.92 µs | +2.6% |
| [dict x3, str_opt(50)] | 32768 | 1.7082 ms | 1.7309 ms | 1.6815 ms | -1.6% |
| [i32_opt, i32_list] | 4096 | 157.57 µs | 160.09 µs | 158.90 µs | +0.8% |
| [i32_opt, i32_list] | 32768 | 1.3393 ms | 1.4233 ms | 1.4378 ms | +7.4% |
| [i32_opt, i32_list_opt] | 4096 | 161.53 µs | 164.17 µs | 171.16 µs | +6.0% |
| [i32_opt, i32_list_opt] | 32768 | 1.3731 ms | 1.4992 ms | 1.4596 ms | +6.3% |
| [i32_list_opt, i32_opt] | 4096 | 173.67 µs | 179.41 µs | 184.39 µs | +6.2% |
| [i32_list_opt, i32_opt] | 32768 | 1.5963 ms | 1.5744 ms | 1.5886 ms | -0.5% |
| [i32, str_list(4)] | 4096 | 287.85 µs | 291.31 µs | 302.80 µs | +5.2% |
| [i32, str_list(4)] | 32768 | 3.3054 ms | 3.2674 ms | 3.2501 ms | -1.7% |
| [str_list(4), i32] | 4096 | 297.47 µs | 360.26 µs | 324.41 µs | +9.1% |
| [str_list(4), i32] | 32768 | 3.8276 ms | 3.8494 ms | 3.8520 ms | +0.6% |
| [i32, str_list_opt(4)] | 4096 | 261.91 µs | 325.47 µs | 287.95 µs | +9.9% |
| [i32, str_list_opt(4)] | 32768 | 3.1478 ms | 3.1075 ms | 3.0864 ms | -2.0% |
| [str_list_opt(4), i32] | 4096 | 300.98 µs | 334.20 µs | 317.49 µs | +5.5% |
| [str_list_opt(4), i32] | 32768 | 3.7982 ms | 3.8360 ms | 3.7977 ms | 0.0% |
| [i32, i32_list, str(16)] | 4096 | 170.03 µs | 175.74 µs | 171.79 µs | +1.0% |
| [i32, i32_list, str(16)] | 32768 | 1.1633 ms | 1.1742 ms | 1.1638 ms | 0.0% |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 195.77 µs | 198.17 µs | 195.62 µs | -0.1% |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.6350 ms | 1.7419 ms | 1.7033 ms | +4.2% |

Here it is compared to the other sorts:

| Schema | N | Best Other Sort | Radix (Byte Buffer) | Radix Wins? |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 86.1 µs | 74.7 µs | Yes |
| [i32, i32_opt] | 32768 | 859.9 µs | 320.8 µs | Yes |
| [i32, str_opt(16)] | 4096 | 87.4 µs | 87.5 µs | Tie |
| [i32, str_opt(16)] | 32768 | 873.2 µs | 485.2 µs | Yes |
| [i32, str(16)] | 4096 | 85.7 µs | 88.3 µs | No |
| [i32, str(16)] | 32768 | 877.2 µs | 413.7 µs | Yes |
| [str_opt(16), str(16)] | 4096 | 159.2 µs | 132.8 µs | Yes |
| [str_opt(16), str(16)] | 32768 | 1.6544 ms | 948.8 µs | Yes |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 194.0 µs | 164.3 µs | Yes |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 2.0054 ms | 1.2207 ms | Yes |
| [str x5] | 4096 | 229.2 µs | 199.6 µs | Yes |
| [str x5] | 32768 | 2.2503 ms | 1.5407 ms | Yes |
| [i32_opt, dict] | 4096 | 110.0 µs | 102.2 µs | Yes |
| [i32_opt, dict] | 32768 | 1.2056 ms | 709.7 µs | Yes |
| [dict x2] | 4096 | 50.7 µs | 93.7 µs | No |
| [dict x2] | 32768 | 503.1 µs | 760.7 µs | No |
| [dict x3, str(16)] | 4096 | 192.1 µs | 182.7 µs | Yes |
| [dict x3, str(16)] | 32768 | 2.0349 ms | 1.5158 ms | Yes |
| [dict x3, str_opt(50)] | 4096 | 213.7 µs | 197.9 µs | Yes |
| [dict x3, str_opt(50)] | 32768 | 2.1790 ms | 1.6815 ms | Yes |
| [i32_opt, i32_list] | 4096 | 139.1 µs | 158.9 µs | No |
| [i32_opt, i32_list] | 32768 | 1.4647 ms | 1.4378 ms | Yes |
| [i32_opt, i32_list_opt] | 4096 | 148.1 µs | 171.2 µs | No |
| [i32_opt, i32_list_opt] | 32768 | 1.6325 ms | 1.4596 ms | Yes |
| [i32_list_opt, i32_opt] | 4096 | 225.0 µs | 184.4 µs | Yes |
| [i32_list_opt, i32_opt] | 32768 | 2.5477 ms | 1.5886 ms | Yes |
| [i32, str_list(4)] | 4096 | 85.0 µs | 302.8 µs | No |
| [i32, str_list(4)] | 32768 | 874.5 µs | 3.2501 ms | No |
| [str_list(4), i32] | 4096 | 295.8 µs | 324.4 µs | No |
| [str_list(4), i32] | 32768 | 3.4113 ms | 3.8520 ms | No |
| [i32, str_list_opt(4)] | 4096 | 87.0 µs | 288.0 µs | No |
| [i32, str_list_opt(4)] | 32768 | 873.5 µs | 3.0864 ms | No |
| [str_list_opt(4), i32] | 4096 | 354.1 µs | 317.5 µs | Yes |
| [str_list_opt(4), i32] | 32768 | 3.7977 ms | 3.7977 ms | Tie |
| [i32, i32_list, str(16)] | 4096 | 85.0 µs | 171.8 µs | No |
| [i32, i32_list, str(16)] | 32768 | 868.2 µs | 1.1638 ms | No |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 157.4 µs | 195.6 µs | No |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.7421 ms | 1.7033 ms | Yes |
