[CI] Add GitHub workflow for building and releasing fat wheels #91
…83) Replace the monolithic `cula.cudac` extension with per-arch extensions (`cula._cudac_sm90`, `cula._cudac_sm100`) so that SM90 and SM100/SM103 kernels are compiled independently with their own `-gencode` flags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time.

Key changes:
- Split pybind.cu into per-file PYBIND11_MODULE definitions
- Add `cula/cudac.py` proxy module for backwards-compatible imports
- Add `CULA_BUILD_ALL_ARCHS=1` env var to enable all SM targets
- Add `--fat` flag to build_wheel.sh for CI fat-binary builds
- Pin dependency versions and use `no-local-version` scheme for reproducible wheel filenames
- Use setuptools_scm for dynamic `__version__`
- Document pre-built wheel installation in README
Code Review
This pull request restructures the build system and CUDA extension packaging for cuLA to support separate per-architecture extensions (_cudac_sm100 and _cudac_sm90) and fat-binary builds. It introduces a lazy-loading proxy module (cula.cudac) to dynamically expose the compiled extension functions, and updates the versioning to use setuptools_scm. Feedback on these changes highlights a thread-safety vulnerability in the lazy-loading proxy that could cause race conditions during concurrent imports, and advises against strict version pinning of runtime dependencies in pyproject.toml to prevent dependency conflicts for downstream users.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Pull request overview
This PR refactors cuLA’s CUDA packaging so architecture-specific kernels are built as separate extensions (SM90 vs SM100/SM103) and adds CI automation to build and publish “fat” wheels (multi-arch) via GitHub Releases, while keeping `import cula.cudac` working via a Python proxy module.
Changes:
- Split the monolithic CUDA extension into per-architecture `CUDAExtension`s and move `PYBIND11_MODULE` bindings into per-arch `.cu` entrypoints.
- Add `CULA_BUILD_ALL_ARCHS=1` support and a `--fat` option in the wheel build script for CI fat-binary builds.
- Introduce a GitHub Actions workflow to build cu129/cu130 wheels for x86_64/aarch64 and attach them to GitHub Releases; update versioning + README install docs accordingly.
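The interaction between `CULA_BUILD_ALL_ARCHS=1` and the per-arch extensions can be sketched roughly as follows. This is an illustration only, not the project's actual `setup.py`: the helper names `gencode_flags` and `extensions_to_build`, the arch table, and the exact arch strings (including the `a` suffix) are assumptions.

```python
import os

# Hypothetical mapping from extension name to the SM targets it is compiled for.
# The real setup.py may use different arch strings.
ARCHS_BY_EXTENSION = {
    "cula._cudac_sm90": ["90a"],
    "cula._cudac_sm100": ["100a", "103a"],
}


def gencode_flags(archs):
    """Return one nvcc -gencode flag per SM target."""
    return [f"-gencode=arch=compute_{a},code=sm_{a}" for a in archs]


def extensions_to_build(detected_arch=None, env=None):
    """Pick which per-arch extensions to build.

    With CULA_BUILD_ALL_ARCHS=1 every extension is built (fat wheel), so no
    GPU has to be present. Otherwise only the extension whose arch list
    contains the detected GPU arch is built.
    """
    env = os.environ if env is None else env
    if env.get("CULA_BUILD_ALL_ARCHS") == "1":
        return dict(ARCHS_BY_EXTENSION)
    return {
        name: archs
        for name, archs in ARCHS_BY_EXTENSION.items()
        if detected_arch in archs
    }
```

Because each extension carries its own flag list, an SM90 translation unit is never compiled with SM100 flags, which is the point of splitting the monolithic module.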
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/build-release.yml | New CI workflow to build and upload CUDA-versioned wheel artifacts and draft a GitHub Release. |
| setup.py | Builds per-arch CUDA extensions (cula._cudac_sm90, cula._cudac_sm100) and adds CULA_BUILD_ALL_ARCHS behavior. |
| csrc/api/pybind.cu | Removes the monolithic binding module entrypoint. |
| csrc/api/kda_sm90.cu | Adds SM90-specific PYBIND11_MODULE bindings. |
| csrc/api/kda_sm100.cu | Adds SM100/SM103-specific PYBIND11_MODULE bindings. |
| cula/cudac.py | New proxy module to preserve import cula.cudac API across split extensions. |
| cula/__init__.py | Switches to setuptools_scm-generated runtime version when available. |
| scripts/build_wheel.sh | Adds --fat flag to set CULA_BUILD_ALL_ARCHS=1 during wheel builds. |
| README.md | Documents installing pre-built wheels from GitHub Releases. |
| pyproject.toml | Uses setuptools_scm no-local-version scheme and keeps writing cula/_version.py. |
| tests/conftest.py | Minor formatting adjustment in test collection marker logic. |
Add double-checked locking to _CudacProxy._load() to prevent race conditions in multi-threaded environments. Raise a descriptive ImportError when no CUDA extensions can be loaded instead of silently producing AttributeError later.
The blanket `except ImportError: pass` swallowed the actual failure reason, making it impossible to diagnose missing shared libraries or build issues. Collect each extension's ImportError and include them in the raised message.
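The two commits above (double-checked locking plus collected import errors) can be sketched together as below. This is a minimal illustration under assumptions, not the code in `cula/cudac.py`: the class name `LazyExtensionProxy` and its attributes are hypothetical.

```python
import importlib
import threading


class LazyExtensionProxy:
    """Import a set of extension modules once, on first attribute access."""

    def __init__(self, module_names):
        self._module_names = tuple(module_names)
        self._lock = threading.Lock()
        self._modules = None  # stays None until a load has fully succeeded

    def _load(self):
        # First check without the lock: the common, already-loaded path.
        if self._modules is not None:
            return self._modules
        with self._lock:
            # Second check under the lock: another thread may have loaded
            # the modules while this one was waiting.
            if self._modules is None:
                loaded, errors = [], []
                for name in self._module_names:
                    try:
                        loaded.append(importlib.import_module(name))
                    except ImportError as exc:
                        # Keep the reason instead of swallowing it.
                        errors.append(f"{name}: {exc}")
                if not loaded:
                    raise ImportError(
                        "no CUDA extension could be loaded:\n  "
                        + "\n  ".join(errors)
                    )
                # Publish only after the list is complete.
                self._modules = loaded
        return self._modules

    def __getattr__(self, attr):
        # Only reached for names not found on the proxy itself.
        for module in self._load():
            if hasattr(module, attr):
                return getattr(module, attr)
        raise AttributeError(attr)
```

Assigning `self._modules` last means a thread that skips the lock can never observe a partially built list.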
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
/gemini review again
Code Review
This pull request refactors the build system and CUDA extension loading mechanism of cuLA by splitting the monolithic cula.cudac extension into two separate per-architecture extensions (cula._cudac_sm100 and cula._cudac_sm90) and introducing a lazy-loading proxy module. It also adds support for building fat binary wheels. The review feedback points out that the CULA_SM100_ENABLED and CULA_SM103_ENABLED preprocessor macros were omitted in the new setup.py, which could lead to compilation failures. Additionally, it is recommended to simplify the proxy implementation using standard PEP 562 module-level functions and to broaden exception handling during dynamic imports to catch RuntimeError and OSError.
Catch (ImportError, AttributeError, OSError) when scanning per-arch extensions: pybind11 modules commonly surface missing-symbol / ABI / libcudart failures as AttributeError or OSError rather than ImportError, so the prior narrow catch silently dropped one extension's failure when another succeeded, leaving its kernels missing without diagnostic. Emit a UserWarning naming each failing extension on partial failure (all-fail still raises ImportError), preserving the c955d47 intent of surfacing per-extension errors. Also document the load-once-per-process semantics in the module docstring.
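The partial-failure behaviour described in this commit might look roughly like the following. It is a sketch under assumptions: the function name `load_extensions` and the message wording are hypothetical, while the caught exception types and the warn-versus-raise split follow the commit message.

```python
import importlib
import warnings

# Exception types the commit message says a failing extension may raise.
_LOAD_ERRORS = (ImportError, AttributeError, OSError)


def load_extensions(module_names):
    """Import every extension that loads; warn about the ones that do not."""
    loaded, failures = {}, {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except _LOAD_ERRORS as exc:
            failures[name] = f"{type(exc).__name__}: {exc}"

    if not loaded:
        # All-fail case: raise, listing every per-extension error.
        details = "\n".join(f"  {n}: {msg}" for n, msg in failures.items())
        raise ImportError(f"no CUDA extension could be loaded:\n{details}")

    # Partial failure: keep going, but name each extension that is missing.
    for name, msg in failures.items():
        warnings.warn(
            f"failed to load {name} ({msg}); its kernels are unavailable",
            UserWarning,
            stacklevel=2,
        )
    return loaded
```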
Select the per-architecture CUDA extension from the active device compute capability instead of scanning every built extension. SM100/SM103 now load the SM100 extension, while SM90 loads the SM90 extension. This avoids exposing kernels from mismatched GPU architectures and reports clearer errors when the matching extension is missing or unsupported.
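Selecting the extension from the device's compute capability can be sketched as a small lookup. The function and table names below are hypothetical; the mapping itself follows the commit message (SM90 is compute capability 9.0, SM100 is 10.0, SM103 is 10.3).

```python
# Compute capability (major, minor) -> extension module name from this PR.
_EXTENSION_BY_CAPABILITY = {
    (9, 0): "cula._cudac_sm90",
    (10, 0): "cula._cudac_sm100",
    (10, 3): "cula._cudac_sm100",  # SM103 shares the SM100 extension
}


def extension_for_capability(major, minor):
    """Return the extension module name for a device, or raise if unsupported."""
    try:
        return _EXTENSION_BY_CAPABILITY[(major, minor)]
    except KeyError:
        supported = ", ".join(
            f"sm_{a}{b}" for a, b in sorted(_EXTENSION_BY_CAPABILITY)
        )
        raise RuntimeError(
            f"unsupported compute capability sm_{major}{minor}; "
            f"supported: {supported}"
        ) from None
```

In a PyTorch environment the `(major, minor)` pair would typically come from `torch.cuda.get_device_capability()`; whether the PR queries the device that way is an assumption here.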
📌 Description
Replace the monolithic `cula.cudac` extension with per-arch extensions (`cula._cudac_sm90`, `cula._cudac_sm100`) so that SM90 and SM100/SM103 kernels are compiled independently with their own `-gencode` flags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time.

Key changes:
- Split pybind.cu into per-file PYBIND11_MODULE definitions
- Add `cula/cudac.py` proxy module for backwards-compatible imports
- Add `CULA_BUILD_ALL_ARCHS=1` env var to enable all SM targets
- Add `--fat` flag to build_wheel.sh for CI fat-binary builds
- Pin dependency versions and use `no-local-version` scheme for reproducible wheel filenames
- Use setuptools_scm for dynamic `__version__`
- Document pre-built wheel installation in README

🔍 Related Issues
Fix #83
🚀 Pull Request Checklist
Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
⚡ Performance
Reviewer Notes