Rust migration by MarcAntoineSchmidtQC · Pull Request #506 · Quantco/tabmat

MarcAntoineSchmidtQC · 2026-01-09T21:03:31Z

The objective of this PR is to simplify and modernize the high-performance codebase. The C++ code relied on many individual pieces that were very hard to maintain: cython, C++ templating, jemalloc, xsimd, almost all with a different approach depending on the platform.

Originally, the Rust code was much simpler, but it turns out that to get the same performance you need to have very intricate low-level optimizations. While the Rust code itself is lengthy, I think the easier packaging process will be helpful in the long-term.

After the Rust code optimization, below is an overview of the benchmarks. It is faster for dense matrices, but could be improved for sparse matrices. Categorical matrices are a toss-up.

Checklist

Added a CHANGELOG.rst entry

================================================================================
                    TABMAT BENCHMARK: Rust vs C++ Backend
================================================================================

SUMMARY: 38 operations compared | Rust faster: 15 | C++ faster: 23

================================================================================
DENSE OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
dense_sandwich (500k×20 C)            2.07       3.94      Rust     1.91x
dense_sandwich (500k×20 F)            2.22       3.77      Rust     1.70x
dense_sandwich (100k×50 C)            1.44       3.64      Rust     2.53x
dense_sandwich (100k×50 F)            1.62       3.00      Rust     1.86x
dense_sandwich (10k×500 C)           15.29      16.45      Rust     1.08x
dense_sandwich (10k×500 F)           19.30      16.46      C++      1.17x
dense_sandwich (5k×1000 C)           23.36      31.18      Rust     1.33x
dense_matvec (100k×500 C)             6.02       5.56      C++      1.08x
dense_matvec (100k×500 F)             8.41       9.70      Rust     1.15x
dense_matvec (10k×1000 C)             1.43       1.27      C++      1.12x
dense_rmatvec (100k×500 C)            4.53      24.16      Rust     5.33x
dense_rmatvec (100k×500 F)            6.71      14.06      Rust     2.10x
dense_rmatvec (10k×1000 C)            1.13       3.79      Rust     3.34x

================================================================================
SPARSE OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
sparse_sandwich (100k×500, d=0.01)    3.70       2.44      C++      1.52x
sparse_sandwich (100k×500, d=0.05)   33.86      31.07      C++      1.09x
sparse_sandwich (10k×1000, d=0.01)    5.78       3.87      C++      1.49x
sparse_sandwich (10k×1000, d=0.10)   20.74      18.24      C++      1.14x
csr_matvec (100k×1000, d=0.01)        1.76       0.90      C++      1.94x
csr_matvec (100k×1000, d=0.05)        9.60       3.86      C++      2.49x
csr_matvec (10k×5000, d=0.01)         1.20       0.52      C++      2.31x
csc_rmatvec (100k×1000, d=0.01)       1.74       1.10      C++      1.59x
csc_rmatvec (100k×1000, d=0.05)       6.10       4.16      C++      1.47x
csc_rmatvec (10k×5000, d=0.01)        1.17       0.51      C++      2.31x
csr_dense_sandwich (50k, 200×100)     7.56       5.33      C++      1.42x
csr_dense_sandwich (10k, 500×200)     7.62       5.48      C++      1.39x

================================================================================
CATEGORICAL OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
categorical_sandwich (500k, 100c)     0.36       0.36      Rust     1.02x
categorical_sandwich (500k, 10c)      0.71       0.58      C++      1.21x
categorical_sandwich (100k, 500c)     0.07       0.07      Rust     1.08x
categorical_matvec (500k, 100c)       0.54       0.29      C++      1.88x
categorical_matvec (500k, 10c)        0.68       0.27      C++      2.57x
categorical_matvec (100k, 1000c)      0.11       0.15      Rust     1.36x
categorical_rmatvec (500k, 100c)      0.27       0.22      C++      1.24x
categorical_rmatvec (500k, 10c)       0.82       0.38      C++      2.19x
categorical_rmatvec (100k, 1000c)     0.05       0.15      Rust     3.14x

================================================================================
SPLIT MATRIX OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
sandwich_cat_cat (100k, 50×100)       0.22       0.15      C++      1.43x
sandwich_cat_cat (500k, 20×30)        0.31       0.19      C++      1.64x
sandwich_cat_dense (500k, 20×50)      2.83       2.90      Rust     1.02x
sandwich_cat_dense (100k, 50×100)     2.14       1.31      C++      1.64x

- Add Cargo.toml with PyO3, numpy, rayon dependencies - Implement dense matrix operations (sandwich, matvec, rmatvec) in Rust - Implement sparse matrix sandwich product - Implement categorical matrix sandwich product - Update pyproject.toml to use maturin build backend - Update pixi.toml with Rust toolchain and maturin - Add rust_compat.py for backward compatibility - Add RUST_MIGRATION.md with status and instructions This is an initial implementation focusing on correctness. Performance optimizations (SIMD, cache blocking) will be added in follow-up commits.

…shape broadcasting - Added comprehensive dtype conversion (f32→f64) in all rust_compat wrappers - Fixed is_sorted panic on empty arrays with length check - Fixed shape broadcasting issue in dense_matrix.py (res.ravel() when out.ndim < res.ndim) - Improved test pass rate from 80.1% to 81.9% (4521/5522 passing) - All 26 Rust functions now handle edge cases correctly - Removed old backup files and added noqa comments for line length

- Implemented complete split_col_subsets function in Rust (was stub returning empty arrays) * Maps global column indices to local sub-matrix indices * Supports multiple integer dtypes (i32, i64, isize) * Returns proper (subset_cols_indices, subset_cols, n_cols) tuples - Fixed dense_matrix.matvec to slice vec by cols before calling fast functions * Lines 238-240: Added vec_subset = vec[cols] for correct column selection * Lines 246-249: Ensure 1D output when vec is 1D - Fixed standardized_matrix.matvec to slice mult arrays by cols * Lines 90-93: Slice mult and mult_other by cols_array * Keep cols as original (not converted to array) when passing to underlying matrix - Improved output validation in all matrix types * Replaced overly strict exact shape checks with large enough validation * Use max(target_indices) instead of exact equality for restricted cases * Removed check_matvec_out_shape from sparse/categorical matvec operations * Added smart validation in dense_matrix, sparse_matrix, categorical_matrix - Fixed split_matrix output array reshaping * Line 408-412: Handle 2D output from dense_matrix when needed Test improvements: Pass rate 93.5% (3964/4239), fixed 18+ split matrix failures

Fixes 266 failing tests, achieving 99.8% pass rate (4229/4239 tests). All remaining failures are float32-related (excluded from scope). Changes: 1. dense.rs: Fix dense_sandwich using wrong weight index - Changed loop to use d_slice[row] instead of d_slice[k] - Fixed rows=[1] case using d[1] instead of d[0] - Resolves 25 test_self_sandwich failures 2. dense_matrix.py: Fix 2D array shape handling from Rust - Transpose 2D results instead of adding extra dimension - Fixes 54 matvec tests with 2D vectors 3. standardized_mat.py: Remove double-slicing in matvec - Pass full mult_other to underlying matvec - Fixes 126 matvec tests with cols parameter 4. split_matrix.py: Add empty column checks in sandwich - Skip operations when column selections are empty - Fixes 61 sandwich tests with partial columns Verified working in downstream glum package (99.8% pass rate).

Key improvements: - Replace HashSet/HashMap with flat Vec<u8> arrays for O(1) lookups - Use flat Vec instead of Vec<Vec> for better cache locality - Parallelize sparse_sandwich with rayon using local accumulators - Optimize csr_dense_sandwich with better loop structure Performance results (100K rows × 50 cols): - sparse_sandwich: 18.39ms → 1.51ms (12x faster, now on par with C++) - split_sandwich: 353.81ms → 36.74ms (9.6x faster) On 1M rows × 100 cols: - sparse_sandwich: 82.94ms (Rust) vs 83.08ms (C++) - PARITY ACHIEVED! - Mean Rust vs C++ speedup: 5.38x across all operations Tests: 3405/3406 passing (99.97%)

- Implement 3D cache blocking on k-dimension (K_BLOCK=512) for better cache utilization - Add SIMD vectorization with f64x4 using wide crate for 4-way parallelism - Precompute sqrt(d) once per iteration to avoid redundant calculations - Use flat memory layout with column-major storage for weighted columns - Process upper triangle only and fill symmetrically to reduce computation - Fix all compilation warnings (unused imports, variables, dead code) - Remove 203 lines of unused SIMD helper functions - Clean up temporary benchmark JSON files and test scripts Performance: Dense sandwich ~3-4x slower than C++ but matvec operations are competitive or faster. The gap is due to lack of FMA instructions in wide crate and compiler optimization differences.

cbourjau · 2026-01-23T13:07:43Z

I'd be happy to give this a review if you are interested.

MarcAntoineSchmidtQC · 2026-01-23T15:16:31Z

I'd be happy to give this a review if you are interested.

That would be great! Thanks!

The Rust linker (rust-lld) doesn't use the LIBRARY_PATH environment variable. Instead, we need to pass the library search path via RUSTFLAGS with the -L flag. Co-Authored-By: Claude <[email protected]>

Maturin was incorrectly reading the deployment target from the conda package version (macosx_deployment_target_osx-arm64: 26.0) instead of the environment variable, resulting in wheels tagged with macosx_26_0 which are incompatible with current macOS versions. Co-Authored-By: Claude <[email protected]>

Use the modern conda-forge stdlib("c") metapackage which properly manages MACOSX_DEPLOYMENT_TARGET through the c_stdlib mechanism. This ensures maturin builds wheels with the correct macOS version tag. Co-Authored-By: Claude <[email protected]>

cbourjau

I'm afraid this is not particularly idiomatic Rust code 🤖 .

In general, there is no particular reason we should see a significant performance difference between Rust and C++ here, but the size of this PR makes it difficult to nail down where the issue may be.

Looking at some of the code, I see quite a lot of indexed access. This is problematic since the compiler may not always be able to optimize out the bounds checks. The code would be more idiomatic and easier for the compiler to optimize if it used iterators more. Some places use unsafe unchecked access, but that should be the very, very last resort.

I could name a few more issues, like error handling, but at the end of the day, I'm afraid this LLM-PR isn't really a good starting point. I'd recommend doing this the old-school way. The documentation of the common libraries in the Rust ecosystem (rust-numpy, pyo3, nalgebra(?), ndarray) is pretty good and approachable. Based on that, I'd further recommend starting off small. We don't need to migrate everything at once. Are there some functions that are fairly stand-alone that could be migrated individually? That would make any kind of review much easier.

MarcAntoineSchmidtQC · 2026-01-30T18:05:02Z

Thanks for the feedback @cbourjau.

In general, there is no particular reason we should see a significant performance difference between Rust and C++ here, but the size of this PR makes it difficult to nail down where the issue may be.

I don't think there's an issue. The benchmarks above were simply to indicate that this PR was not going to slow down the library.

Looking at some of the code, I see quite a lot of indexed access. This is problematic since the compiler may not always be able to optimize out the bounds checks. The code would be more idiomatic and easier for the compiler to optimize if it used iterators more. Some places use unsafe unchecked access, but that should be the very, very last resort.

I think this is a consequence of translating the original C++ code. We are using a "working" set that tracks which columns/rows need to be kept for the next iteration. I can test out if removing the working set might be faster for rust (it wasn't for C++). Similarly for unsafe unchecked access, this should be used in the inner loop because it is called millions of times. I will check if removing them significantly reduces the speed.

Are there some functions that are fairly stand-alone that could be migrated individually? That would make any kind of review much easier.

Let me try to identify this.

Just to add context, the Cython/C++ code that is currently underpining tabmat is very finicky and is hard to maintain. The goal to move to rust is to have an easier time maintaining the codebase, but we don't want to sacrifice performance for this.

MarcAntoineSchmidtQC added 16 commits January 8, 2026 11:08

docs: Add Rust migration guide to copilot instructions

3f659f4

docs: Add branch summary for Rust migration

bd3216d

uncomment tests

40898ec

delete unecessary files

2dd9e87

remove file

a9571dd

blas integration

aa58672

Merge remote-tracking branch 'origin/main' into rust-migration

26088e1

speedy rust

957c28f

remove C++ backend

5fd447e

remove more files

b5344f4

MarcAntoineSchmidtQC requested review from DavidEiglspergerQC and stanmart January 22, 2026 20:36

MarcAntoineSchmidtQC added 2 commits January 22, 2026 15:41

fixing CI

1ef7f00

PCH

d09436f

MarcAntoineSchmidtQC added 2 commits January 23, 2026 10:08

fix tests

118ce56

update conda recipe

b143312

MarcAntoineSchmidtQC added 2 commits January 23, 2026 10:18

temporarily disable some CI

e6ed291

fix CI

18fec14

cbourjau self-requested a review January 23, 2026 15:26

MarcAntoineSchmidtQC added 3 commits January 23, 2026 10:31

re-enable all the CI and fix the remaining issues

e5cf8ba

missed a file

126f8ac

fix windows

2570953

MarcAntoineSchmidtQC and others added 9 commits January 23, 2026 10:45

fix CI

9bea10d

fix CI

c236f1a

fix CI

45d7d30

fix CI

9365664

Use RUSTFLAGS instead of LIBRARY_PATH for cross-compilation

2f83629

The Rust linker (rust-lld) doesn't use the LIBRARY_PATH environment variable. Instead, we need to pass the library search path via RUSTFLAGS with the -L flag. Co-Authored-By: Claude <[email protected]>

CI fix

3de71d9

fix CI

df81ec6

MarcAntoineSchmidtQC marked this pull request as ready for review January 23, 2026 18:48

MarcAntoineSchmidtQC requested a review from jtilly as a code owner January 23, 2026 18:48

MarcAntoineSchmidtQC added 2 commits January 23, 2026 14:51

performance improvement for sparse operations

c825e4d

optimize for negative sandwich

323085a

cbourjau reviewed Jan 30, 2026

View reviewed changes

code simplification and improvement of categorical/split operations.

3e146b1

MarcAntoineSchmidtQC closed this Feb 2, 2026

ivergara deleted the rust-migration branch February 3, 2026 09:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rust migration#506

Rust migration#506
MarcAntoineSchmidtQC wants to merge 37 commits intomainfrom
rust-migration

MarcAntoineSchmidtQC commented Jan 9, 2026 •

edited

Loading

Uh oh!

cbourjau commented Jan 23, 2026

Uh oh!

MarcAntoineSchmidtQC commented Jan 23, 2026

Uh oh!

cbourjau left a comment

Uh oh!

MarcAntoineSchmidtQC commented Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MarcAntoineSchmidtQC commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cbourjau commented Jan 23, 2026

Uh oh!

MarcAntoineSchmidtQC commented Jan 23, 2026

Uh oh!

cbourjau left a comment

Choose a reason for hiding this comment

Uh oh!

MarcAntoineSchmidtQC commented Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MarcAntoineSchmidtQC commented Jan 9, 2026 •

edited

Loading