Avoid CUDA context initialization during op compatibility checks at import#8078
Achyuthan-S wants to merge 3 commits into
…ai#7918) import deepspeed eagerly calls is_compatible() for all ops; eight builders probed get_device_properties(0), which lazy-inits CUDA and breaks fork()-based multiprocessing. Gate the probe on is_initialized() via a shared CUDAOpBuilder.cuda_capability_major() helper, and clarify that pytest --forked is safe now that import no longer initializes a CUDA context. Fixes deepspeedai#7918 Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
This PR addresses a fork-safety issue where import deepspeed could initialize a CUDA context (via import-time op compatibility checks), breaking fork()-based multiprocessing. It introduces a fork-safe CUDA capability probe and updates CUDA op builders to avoid context creation during import.
Changes:
- Add `CUDAOpBuilder.cuda_capability_major()` to safely query compute capability only when CUDA is already initialized and not in a bad-fork state.
- Update affected CUDA op builders' `is_compatible()` logic to use the helper and skip capability gating when probing would be unsafe.
- Add unit/regression tests and update contributing docs to reflect that `--forked` is now safe with DeepSpeed imports.
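A minimal sketch of what such a guarded probe could look like, assuming the helper only needs `is_initialized()`, `_is_in_bad_fork()`, and `get_device_properties()` from `torch.cuda`. The function name `capability_major_if_safe` and the explicit `cuda` parameter are hypothetical, chosen so the logic can be exercised with a stand-in object; this is not the PR's actual implementation.

```python
from types import SimpleNamespace
from typing import Any, Optional


def capability_major_if_safe(cuda: Any) -> Optional[int]:
    """Return the compute-capability major version, or None if probing is unsafe.

    `cuda` is expected to look like `torch.cuda`; it is passed in explicitly
    (an assumption of this sketch) so the logic can be tested without a GPU.
    """
    # Never probe unless a CUDA context already exists: get_device_properties()
    # would otherwise lazy-initialize CUDA in this process.
    if not cuda.is_initialized():
        return None
    # In a forked child of a CUDA-initialized parent, touching CUDA raises.
    if cuda._is_in_bad_fork():
        return None
    return cuda.get_device_properties(0).major


# Stand-in for torch.cuda in a process where CUDA has not been initialized.
fake_uninitialized = SimpleNamespace(
    is_initialized=lambda: False,
    _is_in_bad_fork=lambda: False,
    get_device_properties=lambda idx: SimpleNamespace(major=8),
)
print(capability_major_if_safe(fake_uninitialized))  # None
```

Returning `None` rather than a default number keeps "unknown" distinct from "too old", so callers have to decide explicitly what to do when the probe was skipped.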
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/ops/test_op_builder.py | Adds unit tests for the new helper and a subprocess regression test to ensure import deepspeed doesn’t initialize CUDA. |
| op_builder/builder.py | Introduces CUDAOpBuilder.cuda_capability_major() with guards to avoid CUDA context initialization. |
| op_builder/transformer_inference.py | Switches capability checks to the fork-safe helper and gates comparisons on None. |
| op_builder/spatial_inference.py | Switches Ampere gating to the fork-safe helper and guards on None. |
| op_builder/ragged_utils.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/ragged_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_cutlass_builder.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_core_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/fp_quantizer.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/evoformer_attn.py | Switches capability checks to the fork-safe helper and guards on None. |
| docs/contributing.md | Updates contributing guidance to clarify that --forked is safe now that imports don’t initialize CUDA. |
| CONTRIBUTING.md | Mirrors the contributing guidance update from docs/contributing.md. |
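The "guards on None" pattern listed for each builder can be sketched as follows. This assumes a skipped probe is treated as "do not reject the op"; the function name `passes_capability_gate` and the default minimum of 8 (the Ampere major version) are illustrative choices, not taken from the builders themselves.

```python
from typing import Optional


def passes_capability_gate(major: Optional[int], minimum_major: int = 8) -> bool:
    """Decide whether a builder's capability requirement is met.

    `major` is the value returned by a guarded probe: an int when the probe
    ran, or None when probing would have been unsafe.
    """
    if major is None:
        # Probe was skipped; do not reject the op at import time.
        return True
    return major >= minimum_major


for major in (None, 7, 8):
    print(major, passes_capability_gate(major))
```

The design consequence is that an import-time compatibility check can report an op as compatible on hardware that later turns out to be too old, so the real capability check has to happen again once CUDA is initialized.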
    check = (
        "import torch, deepspeed; "
        "assert not torch.cuda.is_initialized(), "  #ignore-cuda
        "'import deepspeed initialized a CUDA context (issue #7918)'")
    result = subprocess.run([sys.executable, "-c", check], capture_output=True, text=True)
    if "ModuleNotFoundError" in result.stderr:
        pytest.skip("deepspeed/torch not importable in a subprocess in this environment")
    assert result.returncode == 0, result.stderr
Replaced this test — the new one sets explicit PYTHONPATH, a timeout, and only skips on No module named 'deepspeed'/'torch' or a genuinely absent GPU.
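A rough sketch of the subprocess pattern that reply describes: an explicit `PYTHONPATH`, a timeout, and a skip that triggers only on specific missing-module errors. The helper names `run_check` and `should_skip`, the `repo_root` parameter, and the 120-second default are assumptions for illustration; the GPU-absence skip mentioned in the reply is not covered here.

```python
import os
import subprocess
import sys
from typing import Optional, Tuple

# Only these missing modules justify a skip in this sketch.
SKIPPABLE = ("No module named 'deepspeed'", "No module named 'torch'")


def run_check(code: str, repo_root: Optional[str] = None,
              timeout: float = 120.0) -> Tuple[int, str]:
    """Run `code` in a fresh interpreter and return (returncode, stderr)."""
    env = dict(os.environ)
    if repo_root is not None:
        # Prepend the checkout so the child imports the code under test.
        existing = env.get("PYTHONPATH")
        env["PYTHONPATH"] = repo_root if not existing else repo_root + os.pathsep + existing
    # subprocess.run raises subprocess.TimeoutExpired if the child hangs.
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stderr


def should_skip(stderr: str) -> bool:
    """True only when the failure is one of the known missing-module errors."""
    return any(marker in stderr for marker in SKIPPABLE)
```

Matching on the full "No module named ..." message, rather than on `ModuleNotFoundError` alone, avoids silently skipping when some unrelated dependency is missing.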
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02b1c335cd
    "import torch, deepspeed; "
    "assert not torch.cuda.is_initialized(), "  #ignore-cuda
Verify fork safety, not just CUDA context state
This regression check can pass while the fork failure still exists: import deepspeed still runs op compatibility checks that call torch.cuda.is_available(), and PyTorch only documents that call as non-poisoning when PYTORCH_NVML_BASED_CUDA_CHECK=1 is set (https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html). Since is_available() can poison fork without making torch.cuda.is_initialized() true, CUDA-enabled environments can still fail in a forked child even though this assertion succeeds; the test should actually fork after import and touch CUDA, or the import path must avoid/use the NVML-safe availability check.
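The "fork after import and touch CUDA" check suggested here could be structured roughly as below. This is a POSIX-only sketch with a hypothetical helper name, `run_in_forked_child`; the callable is left generic so the mechanics can be tried without a GPU.

```python
import os
from typing import Callable


def run_in_forked_child(work: Callable[[], None]) -> int:
    """Fork, run `work` in the child, and return the child's exit code."""
    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: report success or failure purely through the exit code.
        code = 0
        try:
            work()
        except BaseException:
            code = 1
        finally:
            # os._exit skips atexit handlers and test-runner teardown
            # inherited from the parent.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    # -1 signals that the child was killed by a signal rather than exiting.
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

In a real regression test the parent would run `import deepspeed` first and pass something like `lambda: torch.ones(1, device="cuda")` as `work`, then assert that the returned exit code is 0.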
Addressed: is_available() poisons via cuInit by default without setting is_initialized(), so the old assertion was a false green. `deepspeed/__init__.py` now sets PYTORCH_NVML_BASED_CUDA_CHECK=1, and the test forks after import and has the child use CUDA.
Hi @Achyuthan-S, thank you for opening this PR! Could you add a regression test for that fork-after-import behavior and update the fix until that test passes?
torch.cuda.is_available() runs cudaGetDeviceCount/cuInit by default, creating a CUDA context at 'import deepspeed' (accelerator auto-detect and every op builder's is_compatible() call it). That poisons fork()-based multiprocessing without ever setting torch.cuda.is_initialized(), so the previous import-time assertion passed while the fork still failed. Opt into PyTorch's NVML-based availability check (PYTORCH_NVML_BASED_CUDA_CHECK=1) so is_available() no longer initializes a CUDA context; combined with the existing get_device_properties guard, importing DeepSpeed leaves CUDA uninitialized. Replace the weak import-time check with a fork-after-import regression test that a forked child can use CUDA. Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
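The opt-in described in this commit message amounts to setting an environment variable before the first availability check runs. A small sketch, assuming `setdefault` is used so that an explicit user setting wins; the helper name `opt_into_nvml_check` and the injectable `env` mapping are hypothetical and exist only to make the behavior easy to demonstrate.

```python
import os
from typing import MutableMapping

NVML_CHECK_VAR = "PYTORCH_NVML_BASED_CUDA_CHECK"


def opt_into_nvml_check(env: MutableMapping[str, str] = os.environ) -> str:
    """Enable the NVML-based check unless the user already chose a value."""
    # setdefault only writes when the key is absent, so an explicit
    # user setting such as "0" is left untouched.
    return env.setdefault(NVML_CHECK_VAR, "1")


print(opt_into_nvml_check({}))                     # 1
print(opt_into_nvml_check({NVML_CHECK_VAR: "0"}))  # 0
```

Using `setdefault` instead of plain assignment means a user who deliberately exported `PYTORCH_NVML_BASED_CUDA_CHECK=0` keeps the default availability check.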
Hi @tohtana, thank you for the review. You're correct: the earlier change only guarded get_device_properties, while torch.cuda.is_available() (called during accelerator detection and in every builder's is_compatible()) was the real poison, since it runs cuInit by default without flipping is_initialized(). I pushed a fix that opts into PYTORCH_NVML_BASED_CUDA_CHECK=1 at import and adds a fork-after-import regression test (the child uses CUDA in a forked subprocess).
Thanks for the update. I tested the current head in a CUDA environment, and I think the original issue is still not fixed: after
ds_transformer gated its triton-kernel import on is_triton_supported(), which reads the GPU compute capability and thereby creates a CUDA context at 'import deepspeed' (issue deepspeedai#7918) — the real initializer on a GPU node with triton installed. Import the triton kernels whenever triton is available; actual triton use remains gated at runtime by config.use_triton, where CUDA is already initialized. With this, importing DeepSpeed no longer initializes CUDA and a forked child can use CUDA. Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Good catch @tohtana: the real initializer was is_triton_supported(), called at import from ds_transformer.py (it reads the compute capability, which creates a CUDA context). I've dropped the import-time capability probe (triton kernels are imported on HAS_TRITON; their use stays gated by config.use_triton at runtime). Verified on a GPU node: after import deepspeed, is_initialized() is False and a forked child uses CUDA successfully.
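The split between an import-time decision and a runtime decision can be illustrated as below. This sketch assumes package availability is determined without touching the GPU; `module_available` and `should_use_triton` are hypothetical names, and the `HAS_TRITON` constant here is a stand-in for DeepSpeed's own flag, not its actual definition.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Check whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Import-time decision: depends only on whether the package is installed,
# never on GPU properties.
HAS_TRITON = module_available("triton")


def should_use_triton(use_triton: bool, has_triton: bool = HAS_TRITON) -> bool:
    """Runtime decision: honor the config flag only if triton is importable."""
    return use_triton and has_triton
```

Any hardware capability check would then belong inside the runtime path, where a CUDA context already exists and probing it is harmless.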
Hello @tohtana, could you please verify and let me know? I think it should work now.
Summary
`import deepspeed` initialized a CUDA context in the parent process, which permanently breaks `fork()`-based multiprocessing (Cannot re-initialize CUDA in forked subprocess). This PR makes importing DeepSpeed fork-safe.
Fixes #7918.
Root cause
On a GPU box, `import deepspeed` reached three distinct calls that create a CUDA context, each gated differently (which is why a single patch kept missing one):
- `torch.cuda.is_available()`: called during accelerator auto-detection (`real_accelerator.py`) and in every CUDA op builder's `is_compatible()`. By default it runs `cudaGetDeviceCount` → `cuInit`, creating a context. Per the PyTorch docs this is only avoided with `PYTORCH_NVML_BASED_CUDA_CHECK=1`. Note that it does not set `torch.cuda.is_initialized()`, so an import-time `assert not is_initialized()` is a false green.
- `torch.cuda.get_device_properties(0)`: in the eight builders' `is_compatible()` (run at import by `git_version_info.py`); triggers `torch.cuda._lazy_init()`.
- `is_triton_supported()` → `torch.cuda.get_device_capability()`: called at module import in `ds_transformer.py`, gated on `deepspeed.HAS_TRITON`. This only fires when triton is installed, so it was invisible in triton-less environments, but it was the first initializer on a real GPU node.
Fix
- `deepspeed/__init__.py` sets `os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1")` as the very first statement, so `torch.cuda.is_available()` uses the NVML-based check and never initializes a context. `setdefault()` preserves an explicit user setting.
- `CUDAOpBuilder.cuda_capability_major()` (in `op_builder/builder.py`) reads compute capability only when a context already exists (`is_initialized()`) and we are not in a forked child (`_is_in_bad_fork()`, mirroring "Avoid CUDA reinit error in CI tests" #7977); otherwise it returns `None`. All eight builders route through it and skip the capability gate when probing is unsafe.
- `ds_transformer.py` imports the triton kernels whenever triton is installed (`if deepspeed.HAS_TRITON:`) instead of also gating on `is_triton_supported()`. The capability probe is removed from import; actual triton use stays gated at runtime by `config.use_triton`, where CUDA is already initialized.
Behavior / tradeoff
- … `cudaGetDeviceCount` if NVML is unavailable (documented PyTorch behavior); a non-issue on standard NVIDIA boxes.
- … `config.use_triton`.
Tests
- `cuda_capability_major()`'s decision tree (not-initialized → skip, initialized → probe, bad-fork → skip), mocked `torch.cuda`, no GPU required.
- `test_forked_child_can_use_cuda_after_importing_deepspeed`: forks after `import deepspeed`, the child runs a real CUDA op, and the parent asserts success.
Validation
Verified on a CUDA GPU node (NVIDIA, torch 2.4.1+cu121). After `import deepspeed`:
- `torch.cuda.is_initialized()` → `False`
- … `torch.ones(1, device="cuda")` successfully (exit 0)
- … `torch.cuda._lazy_init` shows 0 distinct import-time CUDA-touch sites (down from the `ds_transformer.py:17` initializer + its downstream builder probe).
Docs
Updated `CONTRIBUTING.md` and `docs/contributing.md`: `--forked` is safe now that `import deepspeed` no longer initializes CUDA.
cc @tjruwase @loadams @tohtana
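The description does not show how the import-time CUDA-touch sites under Validation were counted. One possible approach, sketched here under that caveat, is to wrap the function of interest and record each caller's location; `record_call_sites` is a hypothetical helper, demonstrated on a stand-in object rather than on `torch.cuda`.

```python
import traceback
from types import SimpleNamespace
from typing import Any, List


def record_call_sites(owner: Any, name: str) -> List[str]:
    """Wrap `owner.name` so each call records its caller as 'file:line'."""
    original = getattr(owner, name)
    sites: List[str] = []

    def wrapper(*args, **kwargs):
        # extract_stack()[-1] is this wrapper; [-2] is the frame that called it.
        caller = traceback.extract_stack()[-2]
        sites.append(f"{caller.filename}:{caller.lineno}")
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)
    return sites


# Stand-in for a module exposing an initializer function.
fake_module = SimpleNamespace(init=lambda: "initialized")
sites = record_call_sites(fake_module, "init")
fake_module.init()
print(len(set(sites)))  # 1
```

Applied to the real case, the wrapper would be installed on `torch.cuda` for `_lazy_init` before `import deepspeed`, and the number of distinct recorded sites inspected afterwards. A caveat of this approach is that code holding a direct reference to the original function, obtained before the wrapper was installed, is not intercepted.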