Skip to content

Conversation

@MarcAntoineSchmidtQC
Copy link
Member

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC commented Jan 9, 2026

The objective of this PR is to simplify and modernize the high-performance codebase. The C++ code relied on many individual pieces that were very hard to maintain: cython, C++ templating, jemalloc, xsimd, almost all with a different approach depending on the platform.

Originally, the Rust code was much simpler, but it turns out that to get the same performance you need to have very intricate low-level optimizations. While the Rust code itself is lengthy, I think the easier packaging process will be helpful in the long-term.

After the Rust code optimization, below is an overview of the benchmarks. It is faster for dense matrices, but could be improved for sparse matrices. Categorical matrices are a toss-up.

Checklist

  • Added a CHANGELOG.rst entry
================================================================================
                    TABMAT BENCHMARK: Rust vs C++ Backend
================================================================================

SUMMARY: 38 operations compared | Rust faster: 15 | C++ faster: 23

================================================================================
DENSE OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
dense_sandwich (500k×20 C)            2.07       3.94      Rust     1.91x
dense_sandwich (500k×20 F)            2.22       3.77      Rust     1.70x
dense_sandwich (100k×50 C)            1.44       3.64      Rust     2.53x
dense_sandwich (100k×50 F)            1.62       3.00      Rust     1.86x
dense_sandwich (10k×500 C)           15.29      16.45      Rust     1.08x
dense_sandwich (10k×500 F)           19.30      16.46      C++      1.17x
dense_sandwich (5k×1000 C)           23.36      31.18      Rust     1.33x
dense_matvec (100k×500 C)             6.02       5.56      C++      1.08x
dense_matvec (100k×500 F)             8.41       9.70      Rust     1.15x
dense_matvec (10k×1000 C)             1.43       1.27      C++      1.12x
dense_rmatvec (100k×500 C)            4.53      24.16      Rust     5.33x
dense_rmatvec (100k×500 F)            6.71      14.06      Rust     2.10x
dense_rmatvec (10k×1000 C)            1.13       3.79      Rust     3.34x

================================================================================
SPARSE OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
sparse_sandwich (100k×500, d=0.01)    3.70       2.44      C++      1.52x
sparse_sandwich (100k×500, d=0.05)   33.86      31.07      C++      1.09x
sparse_sandwich (10k×1000, d=0.01)    5.78       3.87      C++      1.49x
sparse_sandwich (10k×1000, d=0.10)   20.74      18.24      C++      1.14x
csr_matvec (100k×1000, d=0.01)        1.76       0.90      C++      1.94x
csr_matvec (100k×1000, d=0.05)        9.60       3.86      C++      2.49x
csr_matvec (10k×5000, d=0.01)         1.20       0.52      C++      2.31x
csc_rmatvec (100k×1000, d=0.01)       1.74       1.10      C++      1.59x
csc_rmatvec (100k×1000, d=0.05)       6.10       4.16      C++      1.47x
csc_rmatvec (10k×5000, d=0.01)        1.17       0.51      C++      2.31x
csr_dense_sandwich (50k, 200×100)     7.56       5.33      C++      1.42x
csr_dense_sandwich (10k, 500×200)     7.62       5.48      C++      1.39x

================================================================================
CATEGORICAL OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
categorical_sandwich (500k, 100c)     0.36       0.36      Rust     1.02x
categorical_sandwich (500k, 10c)      0.71       0.58      C++      1.21x
categorical_sandwich (100k, 500c)     0.07       0.07      Rust     1.08x
categorical_matvec (500k, 100c)       0.54       0.29      C++      1.88x
categorical_matvec (500k, 10c)        0.68       0.27      C++      2.57x
categorical_matvec (100k, 1000c)      0.11       0.15      Rust     1.36x
categorical_rmatvec (500k, 100c)      0.27       0.22      C++      1.24x
categorical_rmatvec (500k, 10c)       0.82       0.38      C++      2.19x
categorical_rmatvec (100k, 1000c)     0.05       0.15      Rust     3.14x

================================================================================
SPLIT MATRIX OPERATIONS
================================================================================
Operation                          Rust (ms)   C++ (ms)   Winner   Speedup
--------------------------------------------------------------------------------
sandwich_cat_cat (100k, 50×100)       0.22       0.15      C++      1.43x
sandwich_cat_cat (500k, 20×30)        0.31       0.19      C++      1.64x
sandwich_cat_dense (500k, 20×50)      2.83       2.90      Rust     1.02x
sandwich_cat_dense (100k, 50×100)     2.14       1.31      C++      1.64x

- Add Cargo.toml with PyO3, numpy, rayon dependencies
- Implement dense matrix operations (sandwich, matvec, rmatvec) in Rust
- Implement sparse matrix sandwich product
- Implement categorical matrix sandwich product
- Update pyproject.toml to use maturin build backend
- Update pixi.toml with Rust toolchain and maturin
- Add rust_compat.py for backward compatibility
- Add RUST_MIGRATION.md with status and instructions

This is an initial implementation focusing on correctness. Performance
optimizations (SIMD, cache blocking) will be added in follow-up commits.
…shape broadcasting

- Added comprehensive dtype conversion (f32→f64) in all rust_compat wrappers
- Fixed is_sorted panic on empty arrays with length check
- Fixed shape broadcasting issue in dense_matrix.py (res.ravel() when out.ndim < res.ndim)
- Improved test pass rate from 80.1% to 81.9% (4521/5522 passing)
- All 26 Rust functions now handle edge cases correctly
- Removed old backup files and added noqa comments for line length
- Implemented complete split_col_subsets function in Rust (was stub returning empty arrays)
  * Maps global column indices to local sub-matrix indices
  * Supports multiple integer dtypes (i32, i64, isize)
  * Returns proper (subset_cols_indices, subset_cols, n_cols) tuples

- Fixed dense_matrix.matvec to slice vec by cols before calling fast functions
  * Lines 238-240: Added vec_subset = vec[cols] for correct column selection
  * Lines 246-249: Ensure 1D output when vec is 1D

- Fixed standardized_matrix.matvec to slice mult arrays by cols
  * Lines 90-93: Slice mult and mult_other by cols_array
  * Keep cols as original (not converted to array) when passing to underlying matrix

- Improved output validation in all matrix types
  * Replaced overly strict exact shape checks with large enough validation
  * Use max(target_indices) instead of exact equality for restricted cases
  * Removed check_matvec_out_shape from sparse/categorical matvec operations
  * Added smart validation in dense_matrix, sparse_matrix, categorical_matrix

- Fixed split_matrix output array reshaping
  * Line 408-412: Handle 2D output from dense_matrix when needed

Test improvements: Pass rate 93.5% (3964/4239), fixed 18+ split matrix failures
Fixes 266 failing tests, achieving 99.8% pass rate (4229/4239 tests).
All remaining failures are float32-related (excluded from scope).

Changes:
1. dense.rs: Fix dense_sandwich using wrong weight index
   - Changed loop to use d_slice[row] instead of d_slice[k]
   - Fixed rows=[1] case using d[1] instead of d[0]
   - Resolves 25 test_self_sandwich failures

2. dense_matrix.py: Fix 2D array shape handling from Rust
   - Transpose 2D results instead of adding extra dimension
   - Fixes 54 matvec tests with 2D vectors

3. standardized_mat.py: Remove double-slicing in matvec
   - Pass full mult_other to underlying matvec
   - Fixes 126 matvec tests with cols parameter

4. split_matrix.py: Add empty column checks in sandwich
   - Skip operations when column selections are empty
   - Fixes 61 sandwich tests with partial columns

Verified working in downstream glum package (99.8% pass rate).
Key improvements:
- Replace HashSet/HashMap with flat Vec<u8> arrays for O(1) lookups
- Use flat Vec instead of Vec<Vec> for better cache locality
- Parallelize sparse_sandwich with rayon using local accumulators
- Optimize csr_dense_sandwich with better loop structure

Performance results (100K rows × 50 cols):
- sparse_sandwich: 18.39ms → 1.51ms (12x faster, now on par with C++)
- split_sandwich: 353.81ms → 36.74ms (9.6x faster)

On 1M rows × 100 cols:
- sparse_sandwich: 82.94ms (Rust) vs 83.08ms (C++) - PARITY ACHIEVED!
- Mean Rust vs C++ speedup: 5.38x across all operations

Tests: 3405/3406 passing (99.97%)
- Implement 3D cache blocking on k-dimension (K_BLOCK=512) for better cache utilization
- Add SIMD vectorization with f64x4 using wide crate for 4-way parallelism
- Precompute sqrt(d) once per iteration to avoid redundant calculations
- Use flat memory layout with column-major storage for weighted columns
- Process upper triangle only and fill symmetrically to reduce computation
- Fix all compilation warnings (unused imports, variables, dead code)
- Remove 203 lines of unused SIMD helper functions
- Clean up temporary benchmark JSON files and test scripts

Performance: Dense sandwich ~3-4x slower than C++ but matvec operations
are competitive or faster. The gap is due to lack of FMA instructions in
wide crate and compiler optimization differences.
@cbourjau
Copy link

I'd be happy to give this a review if you are interested.

@MarcAntoineSchmidtQC
Copy link
Member Author

I'd be happy to give this a review if you are interested.

That would be great! Thanks!

@cbourjau cbourjau self-requested a review January 23, 2026 15:26
MarcAntoineSchmidtQC and others added 9 commits January 23, 2026 10:45
The Rust linker (rust-lld) doesn't use the LIBRARY_PATH environment
variable. Instead, we need to pass the library search path via RUSTFLAGS
with the -L flag.

Co-Authored-By: Claude <[email protected]>
Maturin was incorrectly reading the deployment target from the
conda package version (macosx_deployment_target_osx-arm64: 26.0)
instead of the environment variable, resulting in wheels tagged
with macosx_26_0 which are incompatible with current macOS versions.

Co-Authored-By: Claude <[email protected]>
Use the modern conda-forge stdlib("c") metapackage which properly
manages MACOSX_DEPLOYMENT_TARGET through the c_stdlib mechanism.
This ensures maturin builds wheels with the correct macOS version tag.

Co-Authored-By: Claude <[email protected]>
@MarcAntoineSchmidtQC MarcAntoineSchmidtQC marked this pull request as ready for review January 23, 2026 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants