Avoid CUDA context initialization during op compatibility checks at import by Achyuthan-S · Pull Request #8078 · deepspeedai/DeepSpeed

Achyuthan-S · 2026-06-19T11:05:15Z

Summary

import deepspeed initialized a CUDA context in the parent process, which permanently breaks fork()-based multiprocessing ("Cannot re-initialize CUDA in forked subprocess"). This PR makes importing DeepSpeed fork-safe.

Fixes #7918.

Root cause

On a GPU box, import deepspeed reached three distinct calls that create a CUDA context, each gated differently (which is why a single patch kept missing one):

torch.cuda.is_available(): called during accelerator auto-detection (real_accelerator.py) and in every CUDA op builder's is_compatible(). By default it runs cudaGetDeviceCount → cuInit, which initializes the CUDA driver in the parent process (the state that later poisons fork()). Per the PyTorch docs this is only avoided with PYTORCH_NVML_BASED_CUDA_CHECK=1. Note that it does not set torch.cuda.is_initialized(), so an import-time assert not is_initialized() is a false green.
torch.cuda.get_device_properties(0): called in the eight builders' is_compatible() (run at import by git_version_info.py); triggers torch.cuda._lazy_init().
is_triton_supported() → torch.cuda.get_device_capability(): called at module import in ds_transformer.py, gated on deepspeed.HAS_TRITON. This only fires when triton is installed, so it was invisible in triton-less environments, but it was the first initializer on a real GPU node.

Fix

deepspeed/__init__.py sets os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1") as the very first statement, so torch.cuda.is_available() uses the NVML-based check and never initializes a context. setdefault() preserves an explicit user setting.
CUDAOpBuilder.cuda_capability_major() (in op_builder/builder.py) reads compute capability only when a context already exists (is_initialized()) and we are not in a forked child (_is_in_bad_fork(), mirroring #7977, "Avoid CUDA reinit error in CI tests"); otherwise it returns None. All eight builders route through it and skip the capability gate when probing is unsafe.
ds_transformer.py imports the triton kernels whenever triton is installed (if deepspeed.HAS_TRITON:) instead of also gating on is_triton_supported(). The capability probe is removed from import; actual triton use stays gated at runtime by config.use_triton, where CUDA is already initialized.
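
The guard order described for cuda_capability_major() can be sketched as below. This is a minimal illustration rather than the code in op_builder/builder.py: the torch_mod parameter and the passes_capability_gate() wrapper are hypothetical and exist only so the decision tree can be exercised with a fake torch object and no GPU.

```python
from typing import Any, Optional


def cuda_capability_major(torch_mod: Any) -> Optional[int]:
    """Return the major compute capability, or None when probing is unsafe.

    torch_mod is injected (an assumption of this sketch) so the function can
    be tested with a stand-in object instead of a real torch install.
    """
    cuda = torch_mod.cuda
    # A forked child that inherited CUDA state must not touch CUDA again.
    if cuda._is_in_bad_fork():
        return None
    # No context yet: probing would create one, so skip instead.
    if not cuda.is_initialized():
        return None
    return cuda.get_device_properties(0).major


def passes_capability_gate(torch_mod: Any, minimum_major: int) -> bool:
    """Hypothetical builder-side check: skip the gate when probing is unsafe."""
    major = cuda_capability_major(torch_mod)
    if major is None:
        return True
    return major >= minimum_major
```

Returning None rather than a default number keeps "unknown" distinct from "too old", so each builder decides explicitly what to do when the capability cannot be read safely.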

Behavior / tradeoff

NVML-based availability is a slightly weaker assessment than the default runtime check and falls back to cudaGetDeviceCount if NVML is unavailable (documented PyTorch behavior); a non-issue on standard NVIDIA boxes.
Dropping the import-time capability gate means triton kernel modules are imported whenever triton is installed (even on pre-Ampere). Importing them has no CUDA side effects; their use is still gated by config.use_triton.
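
For reference, the setdefault() behavior the fix relies on looks like this. The helper name opt_into_nvml_check() and its environ parameter are hypothetical, added only so the snippet can run against a plain dict instead of the real process environment.

```python
import os

ENV_VAR = "PYTORCH_NVML_BASED_CUDA_CHECK"


def opt_into_nvml_check(environ=os.environ) -> str:
    """Default the variable to "1" unless it is already set; return the final value."""
    return environ.setdefault(ENV_VAR, "1")


# Variable unset: the default is applied.
unset_env = {}
opt_into_nvml_check(unset_env)

# Variable set explicitly by the user: the existing value is left untouched.
user_env = {ENV_VAR: "0"}
opt_into_nvml_check(user_env)
```

Because the value is only defaulted, a user who exports the variable before importing DeepSpeed keeps whatever they chose.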

Tests

Three unit tests for cuda_capability_major()'s decision tree (not-initialized → skip, initialized → probe, bad-fork → skip), mocked torch.cuda, no GPU required.
test_forked_child_can_use_cuda_after_importing_deepspeed — forks after import deepspeed, the child runs a real CUDA op, parent asserts success.
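
The fork-after-import pattern can be sketched roughly as follows. This is not the test in tests/unit/ops/test_op_builder.py; run_in_forked_child() is a hypothetical helper, it is POSIX-only, and os.waitstatus_to_exitcode() needs Python 3.9 or newer.

```python
import os
import traceback
from typing import Callable


def run_in_forked_child(action: Callable[[], None]) -> int:
    """Fork, run `action` in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: report the outcome only through the exit code. os._exit()
        # skips the parent's atexit handlers and buffered output.
        code = 0
        try:
            action()
        except BaseException:
            traceback.print_exc()
            code = 1
        finally:
            os._exit(code)
    # Parent: wait for the child and translate the wait status.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


# Intended use on a GPU machine (left commented out because it needs torch,
# deepspeed and a CUDA device):
#
#   import torch, deepspeed
#   assert run_in_forked_child(lambda: torch.ones(1, device="cuda")) == 0
```

The import has to happen in the parent before the fork; importing inside the child would not exercise the failure described in #7918.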

Validation

Verified on a CUDA GPU node (NVIDIA, torch 2.4.1+cu121). After import deepspeed:

torch.cuda.is_initialized() → False
a forked child runs torch.ones(1, device="cuda") successfully (exit 0)
instrumenting torch.cuda._lazy_init shows 0 distinct import-time CUDA-touch sites (down from the ds_transformer.py:17 initializer + its downstream builder probe).
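
One way to do this kind of instrumentation is sketched below. It is an assumption about the approach, not the script used for the numbers above: record_call_sites() and skip_prefixes are hypothetical names, and sys._getframe() is CPython-specific.

```python
import functools
import sys
from typing import Any, Set, Tuple


def record_call_sites(owner: Any, name: str, skip_prefixes: Tuple[str, ...] = ()) -> Set[Tuple[str, int]]:
    """Replace owner.<name> with a wrapper that records who called it.

    Each call adds (filename, lineno) of the nearest caller whose file does not
    start with one of `skip_prefixes`, then forwards to the original function.
    """
    original = getattr(owner, name)
    sites: Set[Tuple[str, int]] = set()

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        frame = sys._getframe(1)  # the immediate caller of the wrapper
        # Walk outward past library-internal frames, if any were requested.
        while frame.f_back is not None and frame.f_code.co_filename.startswith(skip_prefixes):
            frame = frame.f_back
        sites.add((frame.f_code.co_filename, frame.f_lineno))
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)
    return sites


# Intended use (commented out because it needs torch and deepspeed):
#
#   import os, torch
#   sites = record_call_sites(torch.cuda, "_lazy_init",
#                             skip_prefixes=(os.path.dirname(torch.__file__),))
#   import deepspeed
#   print(sorted(sites))   # an empty list means no import-time CUDA touch
```

Patching the module attribute only intercepts callers that look up _lazy_init through torch.cuda at call time; code that bound the function earlier via from-import would bypass the wrapper.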

Docs

Updated CONTRIBUTING.md and docs/contributing.md: --forked is safe now that import deepspeed no longer initializes CUDA.

cc @tjruwase @loadams @tohtana

…ai#7918)

import deepspeed eagerly calls is_compatible() for all ops; eight builders probed get_device_properties(0), which lazy-inits CUDA and breaks fork()-based multiprocessing. Gate the probe on is_initialized() via a shared CUDAOpBuilder.cuda_capability_major() helper, and clarify that pytest --forked is safe now that import no longer initializes a CUDA context.

Fixes deepspeedai#7918

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

This PR addresses a fork-safety issue where import deepspeed could initialize a CUDA context (via import-time op compatibility checks), breaking fork()-based multiprocessing. It introduces a fork-safe CUDA capability probe and updates CUDA op builders to avoid context creation during import.

Changes:

Add CUDAOpBuilder.cuda_capability_major() to safely query compute capability only when CUDA is already initialized and not in a bad-fork state.
Update affected CUDA op builders’ is_compatible() logic to use the helper and skip capability gating when probing would be unsafe.
Add unit/regression tests and update contributing docs to reflect that --forked is now safe with DeepSpeed imports.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/unit/ops/test_op_builder.py	Adds unit tests for the new helper and a subprocess regression test to ensure `import deepspeed` doesn’t initialize CUDA.
op_builder/builder.py	Introduces `CUDAOpBuilder.cuda_capability_major()` with guards to avoid CUDA context initialization.
op_builder/transformer_inference.py	Switches capability checks to the fork-safe helper and gates comparisons on `None`.
op_builder/spatial_inference.py	Switches Ampere gating to the fork-safe helper and guards on `None`.
op_builder/ragged_utils.py	Switches capability checks to the fork-safe helper and guards on `None`.
op_builder/ragged_ops.py	Switches capability checks to the fork-safe helper and guards on `None`.
op_builder/inference_cutlass_builder.py	Switches capability checks to the fork-safe helper and guards on `None`.
op_builder/inference_core_ops.py	Switches capability checks to the fork-safe helper and guards on `None`.
op_builder/fp_quantizer.py	Switches capability checks to the fork-safe helper and guards on `None`.
op_builder/evoformer_attn.py	Switches capability checks to the fork-safe helper and guards on `None`.
docs/contributing.md	Updates contributing guidance to clarify that `--forked` is safe now that imports don’t initialize CUDA.
CONTRIBUTING.md	Mirrors the contributing guidance update from `docs/contributing.md`.


Achyuthan-S · 2026-06-23T05:49:57Z

+    check = (
+        "import torch, deepspeed; "
+        "assert not torch.cuda.is_initialized(), "  #ignore-cuda
+        "'import deepspeed initialized a CUDA context (issue #7918)'")
+    result = subprocess.run([sys.executable, "-c", check], capture_output=True, text=True)
+    if "ModuleNotFoundError" in result.stderr:
+        pytest.skip("deepspeed/torch not importable in a subprocess in this environment")
+    assert result.returncode == 0, result.stderr


Replaced this test — the new one sets explicit PYTHONPATH, a timeout, and only skips on No module named 'deepspeed'/'torch' or a genuinely absent GPU.
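
The subprocess handling described here (explicit PYTHONPATH, a timeout, narrow skip conditions) might look roughly like the sketch below. The names run_check(), should_skip() and SKIP_MARKERS are hypothetical, and the "genuinely absent GPU" skip is not modeled.

```python
import os
import subprocess
import sys
from typing import Optional, Tuple

# Only these two import failures count as "environment cannot run the test".
SKIP_MARKERS = ("No module named 'deepspeed'", "No module named 'torch'")


def run_check(code: str, repo_root: str, timeout: float = 120.0) -> Tuple[Optional[int], str]:
    """Run `code` in a fresh interpreter with repo_root first on PYTHONPATH.

    Returns (returncode, stderr); returncode is None if the child timed out.
    """
    env = dict(os.environ)
    parts = [repo_root, env.get("PYTHONPATH", "")]
    env["PYTHONPATH"] = os.pathsep.join(p for p in parts if p)
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return result.returncode, result.stderr


def should_skip(stderr: str) -> bool:
    """True only for the specific missing-module messages listed above."""
    return any(marker in stderr for marker in SKIP_MARKERS)
```

Matching on the full "No module named ..." message instead of a bare "ModuleNotFoundError" avoids skipping when some unrelated import is broken, which would otherwise hide a real failure.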

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02b1c335cd

chatgpt-codex-connector · 2026-06-19T11:09:42Z

+        "import torch, deepspeed; "
+        "assert not torch.cuda.is_initialized(), "  #ignore-cuda


Verify fork safety, not just CUDA context state

This regression check can pass while the fork failure still exists: import deepspeed still runs op compatibility checks that call torch.cuda.is_available(), and PyTorch only documents that call as non-poisoning when PYTORCH_NVML_BASED_CUDA_CHECK=1 is set (https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html). Since is_available() can poison fork without making torch.cuda.is_initialized() true, CUDA-enabled environments can still fail in a forked child even though this assertion succeeds; the test should actually fork after import and touch CUDA, or the import path must avoid/use the NVML-safe availability check.

Addressed: is_available() poisons fork via cuInit by default without setting is_initialized(), so the old assertion was a false green. deepspeed/__init__.py now sets PYTORCH_NVML_BASED_CUDA_CHECK=1, and the test forks after import and has the child use CUDA.

tohtana · 2026-06-23T05:09:19Z

Hi @Achyuthan-S, thank you for opening this PR!
I think the key regression should verify the actual issue contract: after the parent process runs import deepspeed, a forked child should still be able to use CUDA.
I tried it on the current PR head by importing DeepSpeed in the parent, forking, and calling CUDA from the child, and it still failed with Cannot re-initialize CUDA in forked subprocess.

Could you add a regression test for that fork-after-import behavior and update the fix until that test passes?

torch.cuda.is_available() runs cudaGetDeviceCount/cuInit by default, creating a CUDA context at 'import deepspeed' (accelerator auto-detect and every op builder's is_compatible() call it). That poisons fork()-based multiprocessing without ever setting torch.cuda.is_initialized(), so the previous import-time assertion passed while the fork still failed.

Opt into PyTorch's NVML-based availability check (PYTORCH_NVML_BASED_CUDA_CHECK=1) so is_available() no longer initializes a CUDA context; combined with the existing get_device_properties guard, importing DeepSpeed leaves CUDA uninitialized. Replace the weak import-time check with a fork-after-import regression test that a forked child can use CUDA.

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>

Achyuthan-S · 2026-06-23T05:45:32Z

Hi @tohtana, thank you for the review. You're correct: the earlier change only guarded get_device_properties, while torch.cuda.is_available() (called during accelerator detection and in every builder's is_compatible()) was the real poison, since it runs cuInit by default without flipping is_initialized(). I pushed a fix that opts into PYTORCH_NVML_BASED_CUDA_CHECK=1 at import and adds a fork-after-import regression test (the child uses CUDA in a forked subprocess).

tohtana · 2026-06-23T07:01:18Z

Thanks for the update. I tested the current head in a CUDA environment, and I think the original issue is still not fixed: after import deepspeed in the parent process, CUDA is already initialized, so a forked child still fails when it first touches CUDA with Cannot re-initialize CUDA in forked subprocess.

ds_transformer gated its triton-kernel import on is_triton_supported(), which reads the GPU compute capability and thereby creates a CUDA context at 'import deepspeed' (issue deepspeedai#7918); this was the real initializer on a GPU node with triton installed.

Import the triton kernels whenever triton is available; actual triton use remains gated at runtime by config.use_triton, where CUDA is already initialized. With this, importing DeepSpeed no longer initializes CUDA and a forked child can use CUDA.

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>

Achyuthan-S · 2026-06-23T08:44:32Z

Good catch @tohtana — the real initializer was is_triton_supported() called at import from ds_transformer.py (reads compute capability → creates a CUDA context). I've dropped the import-time capability probe (triton kernels are imported on HAS_TRITON; their use stays gated by config.use_triton at runtime). Verified on a GPU node: after import deepspeed, is_initialized() is False and a forked child uses CUDA successfully.
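
The split between an import-time check and a runtime gate can be illustrated as below. Everything here is a stand-in: InferenceConfig and select_kernel_backend() are hypothetical, and detecting triton with importlib.util.find_spec() is an assumption of this sketch, not necessarily how deepspeed.HAS_TRITON is computed.

```python
import importlib.util
from dataclasses import dataclass

# Import time: only ask whether the package is installed. No GPU is queried.
HAS_TRITON = importlib.util.find_spec("triton") is not None


@dataclass
class InferenceConfig:
    """Hypothetical stand-in for the config object that carries use_triton."""
    use_triton: bool = False


def select_kernel_backend(config: InferenceConfig, has_triton: bool = HAS_TRITON) -> str:
    """Runtime gate: decide per call, after the user has opted in via config."""
    if config.use_triton and not has_triton:
        raise RuntimeError("use_triton=True but triton is not installed")
    return "triton" if config.use_triton else "cuda"
```

Any capability check that needs the device can then live next to the runtime gate, where CUDA is already initialized and a probe no longer affects fork safety.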

Achyuthan-S · 2026-06-25T09:08:44Z

Hello @tohtana, could you please verify and let me know? I think it should work now.

Copilot AI review requested due to automatic review settings June 19, 2026 11:05

Achyuthan-S requested review from loadams, tjruwase and tohtana as code owners June 19, 2026 11:05

Copilot started reviewing on behalf of Achyuthan-S June 19, 2026 11:05

Achyuthan-S mentioned this pull request Jun 19, 2026

Fork safety #7918

Open

Achyuthan-S wants to merge 3 commits into deepspeedai:master from Achyuthan-S:fix/import-fork-safety
