CI: allow specifying custom driver versions in test matrix by leofang · Pull Request #2176 · NVIDIA/cuda-python

leofang · 2026-06-07T01:49:28Z

Description

closes #293
closes #1265

Extends the DRIVER field in ci/test-matrix.yml to accept an explicit driver version string (e.g. 580.65.06) in addition to the existing latest / earliest.

- Linux: new ci/tools/install_gpu_driver.sh swaps the driver in-job. It is adapted nearly verbatim from the runner team's nvgha-driver CLI (copied so we don't depend on its rollout schedule). The script uses nsenter to enter the host for the install, then refreshes the toolkit bind mounts back inside the test container. Because our script lives in the GH workspace mount (container-only), the host-side re-exec reads the script from stdin via bash -s < "$0" rather than running "$0" directly (the relative path doesn't resolve after the mount-namespace flip).
- Windows: ci/tools/install_gpu_driver.ps1 is split into two scripts: install_gpu_driver.ps1 (now install-only, reads DRIVER from env, errors on latest/earliest) and a new ci/tools/configure_driver_mode.ps1 (driver-mode + pnputil device cycle, runs on every job). This also fixes a long-standing wart: the previous script unconditionally installed a hardcoded 581.15 even when the matrix row used a latest/earliest runner that already carried the right driver.

Matrix wiring (in both test-wheel-linux.yml and test-wheel-windows.yml):

- compute-matrix adds a new RUNNER_DRIVER field per row: equal to DRIVER for latest / earliest, otherwise latest. runs-on: is keyed on RUNNER_DRIVER so custom-DRIVER rows land on the most recent pre-installed runner image (the install scripts perform the actual swap).
- On Linux, container.options only adds --privileged --pid=host for custom-DRIVER rows (required by the nsenter dance).
- On Linux, custom DRIVER combined with FLAVOR=wsl is rejected eagerly in compute-matrix, because the in-container swap doesn't work under WSL.
- The "Ensure GPU is working" step (nvidia-smi) now runs after the install / configure step in every workflow, so it validates the post-install driver state on custom-DRIVER rows.
- coverage.yml (Windows path, hardcoded DRIVER: latest) was updated alongside since it was the other caller of the old combined script.
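
The routing rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the actual compute-matrix step (which lives in the workflow YAML); `Routing` and `route_row` are hypothetical names, and the real container options may include flags beyond the two named here.

```python
from dataclasses import dataclass

# DRIVER values that map directly onto a pre-installed runner image.
RUNNER_TAGS = ("latest", "earliest")


@dataclass(frozen=True)
class Routing:
    runner_driver: str      # value used to pick the runner image (runs-on)
    needs_install: bool     # whether the in-job install script has to run
    extra_container_options: str  # extra Linux container flags for this row


def route_row(driver: str, flavor: str = "") -> Routing:
    """Derive routing for one matrix row (hypothetical helper)."""
    if driver in RUNNER_TAGS:
        # latest / earliest rows keep their runner and need no privileges.
        return Routing(driver, False, "")
    if flavor == "wsl":
        # Rejected eagerly: the in-container swap does not work under WSL.
        raise ValueError(f"custom DRIVER {driver!r} cannot be combined with FLAVOR=wsl")
    # Custom version: land on the 'latest' image and swap the driver in-job.
    return Routing("latest", True, "--privileged --pid=host")
```

For example, `route_row("580.65.06")` routes to the `latest` runner with the install step and privileged options enabled, while `route_row("earliest")` leaves the row untouched.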

Matrix rows flipped to exercise the new code path:

PR matrix (Linux, every PR):
- amd64 / 3.13 / 13.3.0 / local-CTK / rtxpro6000 → DRIVER: '610.43.02'
- amd64 / 3.14 / 13.3.0 / local-CTK / l4 → DRIVER: '610.43.02'
Nightly numba-cuda:
- Linux amd64 / 3.12 / 13.3.0 / l4 → DRIVER: '580.65.06'
- Windows amd64 / 3.12 / 13.3.0 / l4 → DRIVER: '596.36'

Also enables workflow_dispatch: on ci.yml so the main CI pipeline can be re-run manually from the Actions UI (no inputs — the workflow already builds every wheel it tests, and the existing should-skip / detect-changes gates handle non-PR events correctly).

All other matrix rows continue to use DRIVER: latest and are unaffected (same runners, no install step, no privileged container).

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

Extends the DRIVER field in ci/test-matrix.yml beyond 'latest'/'earliest' to accept an explicit version string (e.g. '580.65.06').

For Linux, ci/tools/install_gpu_driver.sh (adapted from nv-gha-runners/vm-images PR NVIDIA#256) swaps the driver in-job via nsenter when the row uses a custom version; for Windows, ci/tools/install_gpu_driver.ps1 is split into install + configure_driver_mode, with the install step gated on the DRIVER value and the mode step always running.

The matrix row is routed to a 'latest' runner image when the DRIVER is a custom version (the install scripts perform the swap themselves). Container privileges on Linux (--privileged --pid=host) are added only on rows with a custom DRIVER. Custom DRIVER + FLAVOR=wsl is rejected eagerly in the compute-matrix step.

Two existing nightly-numba-cuda rows exercise the new path:
- Linux amd64 / 13.3.0 / l4 -> 580.65.06
- Windows amd64 / 13.3.0 / l4 -> 610.47

Closes NVIDIA#293
Closes NVIDIA#1265

copy-pr-bot · 2026-06-07T01:49:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang · 2026-06-07T01:56:12Z

/ok to test b1b6070

github-actions · 2026-06-07T02:18:00Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2176/
https://nvidia.github.io/cuda-python/pr-preview/pr-2176/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2176/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2176/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

….yml dispatch

- install_gpu_driver.sh: pipe the script body to the host-side bash via stdin (bash -s < "$0") instead of re-execing "$0". The script lives in the GH workspace mount (container-only), so the relative path doesn't resolve after nsenter switches the mount namespace. The < "$0" fd is opened before nsenter and survives the flip.
- test-matrix.yml: Windows nightly-numba-cuda row 610.47 -> 596.36 (610.47 isn't published on the CDN; install hit 404).
- ci.yml: add workflow_dispatch: trigger so the pipeline can be re-run manually. The existing should-skip / detect-changes gates already handle non-PR events.
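
The stdin trick can be illustrated outside of CI. The sketch below is an analogy, not the script's code: a changed working directory stands in for the mount-namespace flip, and `run_script_via_stdin` is a hypothetical helper. The point it shows is that a descriptor opened before the switch keeps working even when the path would no longer resolve for the child.

```python
import shutil
import subprocess
from pathlib import Path


def run_script_via_stdin(script: Path, cwd: Path) -> subprocess.CompletedProcess:
    """Run `script` through `bash -s`, feeding its body on stdin."""
    shell = shutil.which("bash") or "sh"  # fall back to sh if bash is absent
    # Open the script first: the descriptor stays valid for the child even if
    # the child runs somewhere the original path cannot be resolved.
    with open(script, "rb") as body:
        return subprocess.run(
            [shell, "-s"],
            stdin=body,
            cwd=cwd,
            capture_output=True,
            text=True,
            check=False,
        )
```

The child never sees the script's path at all; it only reads the already-open stream, which mirrors why `bash -s < "$0"` survives nsenter.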

leofang · 2026-06-07T03:13:14Z

/ok to test 3e016b5

leofang · 2026-06-07T03:24:03Z

    # nightly-numba-cuda
    - { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC' }
-    - { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC' }
+    - { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: '596.36',  DRIVER_MODE: 'TCC' }


xref: https://github.com/NVIDIA/cuda-python/actions/runs/27081260952/job/79927472454

leofang · 2026-06-07T03:24:58Z

    # nightly-numba-cuda
    - { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
-    - { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
+    - { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: '580.65.06' }


xref https://github.com/NVIDIA/cuda-python/actions/runs/27081260952/job/79927472925

So nvidia-smi validates the post-install driver state on custom-DRIVER rows. Windows test-wheel + coverage already use Install -> Configure -> Ensure; this brings the Linux test-wheel job into line.

Exercises the custom-driver install path on every PR (not just nightly). Both rows are amd64 / 13.3.0 / local-CTK, on l4 and rtxpro6000 -- both in the 'open' kernel-module flavor (only Volta needs 'legacy').

leofang · 2026-06-07T03:31:32Z

/ok to test 4a23b23

Linux: After install_gpu_driver.sh stops nvidia-persistenced and the apt purge removes the package, the .run installer reinstalls the systemd service but leaves it stopped. cuda.core's test_persistence_mode_enabled fails with NVML_ERROR_UNKNOWN on driver 610.43.02 when the daemon is not running; explicitly start it again at the end of host_install().

Windows: configure_driver_mode.ps1's trailing 'Start-Sleep -Seconds 5' is not enough on multi-GPU rows that are slow to come back up (observed: 2x H100 MCDM). Replace it with a poll-until-success loop on nvidia-smi with a 60s deadline, matching the runner-team nvgha-driver.ps1 pattern. This was previously masked because every Windows row used to run the full install pipeline; with custom-DRIVER plumbing, latest/earliest rows skip the install and the cycle is no longer preceded by warm-up time.
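
The poll-until-success pattern can be sketched in Python as follows. The actual change is in PowerShell; this only shows the shape of the loop. The 60s deadline comes from the commit message, while the 2s interval, the function names, and the injectable `clock`/`sleep` parameters are assumptions made for illustration and testability.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 60.0,
    interval_s: float = 2.0,  # assumed polling interval, not from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= end:
            return False
        sleep(interval_s)


def nvidia_smi_ok() -> bool:
    """Probe: True when nvidia-smi is present and exits with status 0."""
    try:
        return subprocess.run(["nvidia-smi"], capture_output=True, check=False).returncode == 0
    except FileNotFoundError:
        return False
```

A caller would use `wait_until_ready(nvidia_smi_ok)` and fail the step when it returns `False`, instead of sleeping for a fixed time and hoping the devices are back.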

leofang · 2026-06-07T14:18:30Z

/ok to test d33a928

Runner-latest L4 images come up with Persistence-M=On (set somewhere in the runner team's image setup, not in cuda-python). Our .run install leaves it Off, which breaks cuda.core's test_persistence_mode_enabled on driver 610.43.02 -- the test calls device.is_persistence_mode_enabled = False on a device that already reports False, and 610.43.02 returns NVML_ERROR_UNKNOWN for that no-op set. Restore the runner baseline by calling `nvidia-smi -pm 1` at the end of host_install() (sets the kernel persistence flag directly via NVML). Also daemon-reload + start nvidia-persistenced.service best-effort so tools that look for the daemon find it; `set -x` around this trailing block so the next run's log confirms which lines fired.

leofang · 2026-06-07T15:19:05Z

/ok to test 00896dc

refresh_container_libs() used 'cp -f --remove-destination' (verbatim from the runner team's nvgha-driver), which without -p/--preserve strips the SUID/SGID bits on the destination. /usr/bin/nvidia-modprobe ships 4755 and NVML's state-changing calls (e.g. nvmlDeviceSetPersistenceMode) route through it; once SUID is gone the container-side call returns NVML_ERROR_UNKNOWN, which is what cuda.core's test_persistence_mode_enabled was hitting. Add a stat diagnostic line at the end of refresh_container_libs() so the next CI log records nvidia-modprobe's post-refresh mode.
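
A diagnostic along the same lines could look like the sketch below, written in Python rather than the script's `stat` call. `dropped_special_bits` and `dropped_between` are hypothetical helpers; they only report which of the SUID/SGID bits present on the source are missing on the destination.

```python
import os
import stat


def dropped_special_bits(src_mode: int, dst_mode: int) -> list[str]:
    """Names of SUID/SGID bits set in src_mode but missing in dst_mode."""
    names = {stat.S_ISUID: "SUID", stat.S_ISGID: "SGID"}
    return [name for bit, name in names.items() if src_mode & bit and not dst_mode & bit]


def dropped_between(src_path: str, dst_path: str) -> list[str]:
    """Same check, reading the permission bits from two files on disk."""
    src = stat.S_IMODE(os.stat(src_path).st_mode)
    dst = stat.S_IMODE(os.stat(dst_path).st_mode)
    return dropped_special_bits(src, dst)
```

With the modes from the commit message, `dropped_special_bits(0o4755, 0o755)` reports the lost SUID bit, which is the condition the diagnostic line is meant to make visible in the log.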

leofang · 2026-06-07T16:55:59Z

/ok to test 0d5f0e9

The `--silent --no-questions` .run installer drops /usr/bin/nvidia-persistenced but does not reliably install a usable systemd unit, so `systemctl start nvidia-persistenced.service` was a no-op (verified in CI logs: `+ true` after the start). With the daemon down, the /run/nvidia-persistenced/socket bind-mounted into the test container is stale, and NVML state-changing calls (e.g. nvmlDeviceSetPersistenceMode) made by root inside the container return NVML_ERROR_UNKNOWN -- which is what cuda.core's test_persistence_mode_enabled has been failing on.

Verified on ComputeLab with the same driver (610.43.02), same GPU arch (Ada L40S), root in container: with the daemon up, the SET call returns NVML_SUCCESS; with the daemon down it returns UnknownError.

Fix: exec /usr/bin/nvidia-persistenced directly. The binary self-daemonizes and creates the socket on its own. (Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver; will flag upstream.)

leofang · 2026-06-07T18:52:33Z

/ok to test 3dfaa84

nvidia-persistenced defaults to `--user nvidia-persistenced`, which our apt-purge of `nvidia-compute-utils-*` removed. Without that user the daemon's setuid(3) post-fork fails and the process exits silently -- the `nvidia-smi -pm 1` right after sees Persistence-M briefly On (daemon held it), then it flips back to Off (daemon gone), and the test container's NVML SET call later returns NVML_ERROR_UNKNOWN. Pass --user root so the daemon doesn't depend on a user account that the purge deleted. Also add a `pgrep nvidia-persistenced` + `ls -la /run/nvidia-persistenced/` diagnostic so the next CI log proves the daemon is alive when the test starts.

Allocates one L4 GPU + privileged container, runs install_gpu_driver.sh with DRIVER=610.43.02, then drives nvmlDeviceSetPersistenceMode via raw ctypes -- the exact NVML call that cuda.core's test_persistence_mode_enabled exercises. Exits 1 on NVML_ERROR_UNKNOWN so the smoke test fails loudly when the install path leaves the daemon dead. Total runtime ~5 min vs ~30 min for the full test matrix. Triggered by workflow_dispatch only -- this is an opt-in debugging job, not regular PR or nightly traffic.
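
For readers unfamiliar with calling NVML through ctypes, a minimal sketch of such a probe follows. It is not the job's actual script: the function names, the default library name `libnvidia-ml.so.1`, and the choice to request the disabled state are assumptions. On a real machine the call needs root and changes device state, so treat it as illustrative.

```python
import ctypes
from typing import Optional

NVML_SUCCESS = 0
NVML_ERROR_UNKNOWN = 999
NVML_FEATURE_DISABLED = 0  # nvmlEnableState_t value for "off"


def set_persistence_mode(
    index: int = 0,
    mode: int = NVML_FEATURE_DISABLED,
    lib_name: str = "libnvidia-ml.so.1",  # assumed soname of the NVML library
) -> Optional[int]:
    """Return the NVML status of the set call, or None if NVML cannot be loaded."""
    try:
        nvml = ctypes.CDLL(lib_name)
    except OSError:
        return None
    rc = nvml.nvmlInit_v2()
    if rc != NVML_SUCCESS:
        return rc
    try:
        handle = ctypes.c_void_p()  # opaque nvmlDevice_t
        rc = nvml.nvmlDeviceGetHandleByIndex_v2(ctypes.c_uint(index), ctypes.byref(handle))
        if rc != NVML_SUCCESS:
            return rc
        return nvml.nvmlDeviceSetPersistenceMode(handle, ctypes.c_uint(mode))
    finally:
        nvml.nvmlShutdown()


def exit_code_for(rc: Optional[int]) -> int:
    """Exit 1 only on NVML_ERROR_UNKNOWN, the failure this probe looks for."""
    return 1 if rc == NVML_ERROR_UNKNOWN else 0
```

A smoke test would end with `sys.exit(exit_code_for(set_persistence_mode()))`, so a dead daemon shows up as a failed job rather than a line buried in the log.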

…ery PR

leofang · 2026-06-07T19:24:42Z

/ok to test c5fef92

refresh_container_libs() walks /proc/self/mountinfo for entries containing 'nvidia' or 'libcuda'. /run/nvidia-persistenced/socket matches that pattern and was being umount'd + cp'd over -- which breaks the container's view of the daemon's IPC socket (the container ends up with a 0-link unlinked socket inode instead of the live host one). Without a working socket, NVML state-changing calls inside the container return NVML_ERROR_UNKNOWN -- which is exactly what cuda.core's test_persistence_mode_enabled was hitting. Restrict the refresh to /usr/(bin|lib) so it only touches the actual binaries + shared libraries that change version with the driver swap. /dev/nvidia*, /proc/driver/nvidia, /run/nvidia-*, /tmp/nvidia-mps are all left as the toolkit set them up. Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver; their CUDA-runtime validation workload never queries the daemon socket so they haven't surfaced it.
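
The effect of the narrower filter can be shown on sample mountinfo lines. The Python sketch below is an illustration only (the real logic is in the bash script); the helper names are hypothetical, and treating `/usr/(bin|lib)` as a prefix match on the mount point is an assumption about how the script applies it.

```python
import re

# Assumed interpretation: the mount point must start with /usr/bin or /usr/lib.
REFRESH_RE = re.compile(r"^/usr/(bin|lib)")


def mount_points(mountinfo_text: str) -> list[str]:
    """Extract mount points (the fifth field) from /proc/self/mountinfo text."""
    points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            points.append(fields[4])
    return points


def refresh_targets(mountinfo_text: str) -> list[str]:
    """Driver-related mounts that the refresh should touch."""
    return [
        p
        for p in mount_points(mountinfo_text)
        if ("nvidia" in p or "libcuda" in p) and REFRESH_RE.match(p)
    ]
```

With this filter, a mount at /run/nvidia-persistenced/socket or /dev/nvidiactl still matches the 'nvidia' substring but is skipped, while /usr/bin/nvidia-smi and the libcuda shared library are refreshed.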

leofang · 2026-06-07T19:36:58Z

/ok to test f17dd7f

The packaged nvidia-persistenced.service has `RuntimeDirectory=nvidia-persistenced`, which makes systemd `unlink()` /run/nvidia-persistenced/ when the unit stops. The container has that directory bind-mounted from the host as of container-start time. When systemd removes the inode and our subsequent `/usr/bin/nvidia-persistenced --user root` call re-creates it, the container's bind mount is stranded on the deleted inode -- its /run/nvidia-persistenced/socket shows up with link count 0 and NVML state-changing calls return NVML_ERROR_UNKNOWN. `pkill -TERM nvidia-persistenced` sends SIGTERM directly to the daemon, which exits cleanly without involving systemd's RuntimeDirectory cleanup. The host dir keeps its inode across the swap; the container's bind mount stays valid; the new daemon's socket is visible to in-container NVML clients.

leofang · 2026-06-07T19:46:09Z

/ok to test 6412f4f

The container's bind mount of /run/nvidia-persistenced/ is taken at container-start time and pinned to the host directory's then-current inode. Across the install the host directory gets recreated under a fresh inode (the daemon's shutdown + restart cycle replaces it), and the container is stranded on the deleted inode -- socket file shows up with link count 0 inside the container, NVML state-changing calls return NVML_ERROR_UNKNOWN.

After refresh_container_libs, umount the stale bind, mkdir the local mount point if missing, and re-bind from /proc/1/root/run/nvidia-persistenced (the host's current view via the privileged container's host-pid-ns access). CAP_SYS_ADMIN required, which custom-DRIVER rows already grant via --privileged --pid=host.
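
The "link count 0" symptom can be reproduced without mounts at all. In the sketch below, an open file descriptor stands in for the container's bind mount (an analogy, not the real mechanism), and `stranded_link_counts` is a hypothetical helper: the old handle keeps pointing at the deleted inode while the path already names a new one.

```python
import os
import tempfile


def stranded_link_counts() -> tuple[int, int]:
    """Return (link count via the old handle, link count of the recreated path)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "socket")
        open(path, "w").close()
        # The open descriptor plays the role of the bind mount taken at start.
        fd = os.open(path, os.O_RDONLY)
        try:
            os.unlink(path)          # the original entry is removed...
            open(path, "w").close()  # ...and a fresh one appears under the same name
            return os.fstat(fd).st_nlink, os.stat(path).st_nlink
        finally:
            os.close(fd)
```

On a typical Linux filesystem the old handle reports zero links while the path reports one, which is the same split view the container had until the re-bind pointed it at the host's current directory.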

leofang · 2026-06-07T19:55:01Z

/ok to test 2b34f1f

…earing

- Revert `pkill -TERM nvidia-persistenced` to `systemctl stop`; pkill alone didn't prevent the host dir's inode from flipping, the re-bind of /run/nvidia-persistenced/ is what restores the container's view.
- Drop `nvidia-smi -pm 1`; the test exercises NVML's set call, which succeeds once the daemon socket is reachable regardless of current Persistence-M state.
- Trim `set -x` blocks and `pgrep`/`ls -la`/`stat` diagnostics that served their purpose during debugging.

Keeps the load-bearing changes (nsenter bash -s, /usr/(bin|lib) refresh filter, exec nvidia-persistenced --user root, the /run/nvidia-persistenced re-bind, cp --preserve=mode) and brings the diff against Justin's nvgha-driver back down to the strict minimum.

leofang · 2026-06-07T20:40:12Z

/ok to test 8d8a9ef

Added in a3f1573 for fast iteration on install_gpu_driver.sh; no longer needed now that the script has stabilized.

leofang · 2026-06-07T21:49:07Z

/ok to test d2c25eb

leofang · 2026-06-08T14:40:47Z

    - { ARCH: 'amd64', PY_VER: '3.13',  CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'v100',       GPU_COUNT: '1', DRIVER: 'latest' }
    - { ARCH: 'amd64', PY_VER: '3.13',  CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'rtxpro6000', GPU_COUNT: '1', DRIVER: 'latest' }
-    - { ARCH: 'amd64', PY_VER: '3.13',  CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'rtxpro6000', GPU_COUNT: '1', DRIVER: 'latest' }
+    - { ARCH: 'amd64', PY_VER: '3.13',  CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'rtxpro6000', GPU_COUNT: '1', DRIVER: '610.43.02' }


xref https://github.com/NVIDIA/cuda-python/actions/runs/27105815636/job/79997044862?pr=2176

leofang · 2026-06-08T14:41:43Z

    - { ARCH: 'amd64', PY_VER: '3.14',  CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 't4',         GPU_COUNT: '1', DRIVER: 'latest' }
    - { ARCH: 'amd64', PY_VER: '3.14',  CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'l4',         GPU_COUNT: '1', DRIVER: 'latest' }
-    - { ARCH: 'amd64', PY_VER: '3.14',  CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'l4',         GPU_COUNT: '1', DRIVER: 'latest' }
+    - { ARCH: 'amd64', PY_VER: '3.14',  CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'l4',         GPU_COUNT: '1', DRIVER: '610.43.02' }


xref https://github.com/NVIDIA/cuda-python/actions/runs/27105815636/job/79997044897?pr=2176

- ci.yml: `workflow_dispatch:` -> `workflow_dispatch: {}` so the empty mapping reads as intentional rather than ambiguous YAML.
- test-wheel-linux.yml: declare `util-linux` in `Install dependencies` instead of running a second apt-get inline; util-linux ships in ubuntu:22.04 by default so this is mostly belt-and-suspenders, but it removes the redundant apt-get call.
- install_gpu_driver.sh: drop `2>/dev/null` on `systemctl stop` so real errors surface (`|| true` keeps the script non-fatal). The redirect was inherited verbatim from nv-gha-runners/vm-images PR 256 with no specific need.

leofang · 2026-06-08T21:50:05Z

/ok to test fa7940a

github-actions Bot added the CI/CD CI/CD infrastructure label Jun 7, 2026

leofang self-assigned this Jun 7, 2026

leofang added the P0 High priority - Must do! label Jun 7, 2026

leofang added this to the cuda.core v1.1.0 milestone Jun 7, 2026

leofang commented Jun 7, 2026

leofang added 2 commits June 7, 2026 03:27

CI: move 'Ensure GPU is working' after 'Install GPU driver' on Linux

c0ca869

CI: flip two PR-matrix Linux rows to DRIVER=610.43.02

4a23b23

rwgk reviewed Jun 7, 2026

Comment thread ci/tools/configure_driver_mode.ps1

leofang added the feature New feature or request label Jun 7, 2026

leofang added 3 commits June 7, 2026 19:17

CI: drop workflow_dispatch gate on probe-driver-swap so it runs on ev…

c5fef92

…ery PR

Revert: remove the probe-driver-swap fast-feedback job

d2c25eb

leofang marked this pull request as ready for review June 7, 2026 22:15

leofang requested review from kkraus14, mdboom and rwgk June 8, 2026 13:43

leofang commented Jun 8, 2026

rwgk added the PR review get-together Mark PRs you'd like the team to review at the weekly PR review get-together. label Jun 8, 2026

mdboom requested changes Jun 8, 2026

Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/workflows/test-wheel-linux.yml

Comment thread .github/workflows/test-wheel-linux.yml Outdated

Comment thread ci/tools/install_gpu_driver.sh Outdated
