
Avoid CUDA context initialization during op compatibility checks at import #8078

Open
Achyuthan-S wants to merge 3 commits into deepspeedai:master from Achyuthan-S:fix/import-fork-safety

Achyuthan-S commented Jun 19, 2026

Summary

Previously, import deepspeed initialized a CUDA context in the parent process, which permanently broke fork()-based multiprocessing (Cannot re-initialize CUDA in forked subprocess). This PR makes importing DeepSpeed fork-safe.

Fixes #7918.

Root cause

On a GPU box, import deepspeed reached three distinct calls that create a CUDA context, each gated differently (which is why a single patch kept missing one):

  1. torch.cuda.is_available() — called during accelerator auto-detection (real_accelerator.py) and in every CUDA op builder's is_compatible(). By default it runs cudaGetDeviceCount → cuInit, creating a context. Per the PyTorch docs this is only avoided with PYTORCH_NVML_BASED_CUDA_CHECK=1. Note it does not set torch.cuda.is_initialized(), so an import-time assert not is_initialized() is a false-green.
  2. torch.cuda.get_device_properties(0) — in the eight builders' is_compatible() (run at import by git_version_info.py); triggers torch.cuda._lazy_init().
  3. is_triton_supported(), which calls torch.cuda.get_device_capability(): called at module import in ds_transformer.py, gated on deepspeed.HAS_TRITON. This only fires when triton is installed, so it was invisible in triton-less environments, but it was the first initializer on a real GPU node.

Fix

  1. deepspeed/__init__.py sets os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1") as the very first statement, so torch.cuda.is_available() uses the NVML-based check and never initializes a context. setdefault() preserves an explicit user setting.
  2. CUDAOpBuilder.cuda_capability_major() (in op_builder/builder.py) reads compute capability only when a context already exists (is_initialized()) and we are not in a forked child (_is_in_bad_fork(), mirroring #7977, "Avoid CUDA reinit error in CI tests"); otherwise it returns None. All eight builders route through it and skip the capability gate when probing is unsafe.
  3. ds_transformer.py imports the triton kernels whenever triton is installed (if deepspeed.HAS_TRITON:) instead of also gating on is_triton_supported(). The capability probe is removed from import; actual triton use stays gated at runtime by config.use_triton, where CUDA is already initialized.
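
For illustration, items 1 and 2 above can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the function names opt_into_nvml_check and the FakeCuda stand-in are hypothetical, and the helper takes the CUDA module as a parameter (assumed to expose is_initialized(), _is_in_bad_fork() and get_device_capability(), as torch.cuda does) so that it runs without a GPU.

```python
import os


def opt_into_nvml_check(environ=os.environ):
    # setdefault() only writes the key when it is absent, so an explicit
    # user setting (for example "0") is left untouched.
    return environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1")


def cuda_capability_major(cuda):
    """Return the compute-capability major version, or None if probing is unsafe.

    `cuda` is assumed to behave like the torch.cuda module.
    """
    if not cuda.is_initialized():
        return None  # probing now would create a CUDA context
    if cuda._is_in_bad_fork():
        return None  # forked child of a CUDA-initialized parent
    major, _minor = cuda.get_device_capability()
    return major


class FakeCuda:
    """Hypothetical stand-in for torch.cuda, used only to exercise the logic."""

    def __init__(self, initialized, bad_fork, capability=(8, 0)):
        self._initialized = initialized
        self._bad_fork = bad_fork
        self._capability = capability

    def is_initialized(self):
        return self._initialized

    def _is_in_bad_fork(self):
        return self._bad_fork

    def get_device_capability(self):
        return self._capability
```

With real PyTorch the call would be cuda_capability_major(torch.cuda); passing the module in keeps the decision tree testable with a fake.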

Behavior / tradeoff

  • NVML-based availability is a slightly weaker assessment than the default runtime check and falls back to cudaGetDeviceCount if NVML is unavailable (documented PyTorch behavior); a non-issue on standard NVIDIA boxes.
  • Dropping the import-time capability gate means triton kernel modules are imported whenever triton is installed (even on pre-Ampere). Importing them has no CUDA side effects; their use is still gated by config.use_triton.

Tests

  • Three unit tests for cuda_capability_major()'s decision tree (not-initialized → skip, initialized → probe, bad-fork → skip), mocked torch.cuda, no GPU required.
  • test_forked_child_can_use_cuda_after_importing_deepspeed — forks after import deepspeed, the child runs a real CUDA op, parent asserts success.
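
The shape of the fork-after-import check can be sketched as below. This is a simplified illustration rather than the test added in this PR: run_in_forked_child is a hypothetical helper, it relies on os.fork() (POSIX only), and the CUDA-touching function is defined but not called so the sketch runs on a machine without torch or a GPU.

```python
import os
import sys
import traceback


def run_in_forked_child(fn):
    """Fork, run fn() in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:  # child process
        code = 0
        try:
            fn()
        except BaseException:
            traceback.print_exc()
            sys.stderr.flush()
            code = 1
        finally:
            # Exit immediately so the child never returns into the parent's code.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


def touch_cuda():
    # Assumed child workload: any real CUDA op fails here if the parent
    # already created a CUDA context before forking.
    import torch
    torch.ones(1, device="cuda")


# On a GPU node the regression check would look like:
#   import deepspeed
#   assert run_in_forked_child(touch_cuda) == 0
```

The parent asserts on the exit code only, which matches the "child runs a real CUDA op, parent asserts success" contract described above.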

Validation

Verified on a CUDA GPU node (NVIDIA, torch 2.4.1+cu121). After import deepspeed:

  • torch.cuda.is_initialized() returns False
  • a forked child runs torch.ones(1, device="cuda") successfully (exit 0)
  • instrumenting torch.cuda._lazy_init shows 0 distinct import-time CUDA-touch sites (down from the ds_transformer.py:17 initializer + its downstream builder probe).
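
One way to do this kind of instrumentation is sketched below. It is an assumption about the approach, not the script used for the validation above: record_call_stacks is a hypothetical helper, and a SimpleNamespace stands in for torch.cuda so the sketch runs anywhere.

```python
import functools
import traceback
from types import SimpleNamespace


def record_call_stacks(owner, name, stacks):
    """Wrap owner.<name> so every call adds its call stack to `stacks`."""
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # Drop the last frame (this wrapper) and keep (file, line) pairs.
        frames = traceback.extract_stack()[:-1]
        stacks.add(tuple((f.filename, f.lineno) for f in frames))
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)
    return original  # lets the caller restore the attribute later


# Stand-in target. On a GPU node the target would be torch.cuda and
# "_lazy_init", patched before `import deepspeed`.
fake_cuda = SimpleNamespace(_lazy_init=lambda: None)
seen = set()
record_call_stacks(fake_cuda, "_lazy_init", seen)
for _ in range(3):
    fake_cuda._lazy_init()  # the same call site, hit three times
print(len(seen))  # number of distinct call stacks recorded
```

Because stacks are stored in a set, repeated calls from one site count once. Code that imported the original function by name before patching would bypass the wrapper, so a count of zero is only meaningful if the patch is applied first.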

Docs

Updated CONTRIBUTING.md and docs/contributing.md: --forked is safe now that import deepspeed no longer initializes CUDA.

cc @tjruwase @loadams @tohtana

…ai#7918)

import deepspeed eagerly calls is_compatible() for all ops; eight builders
probed get_device_properties(0), which lazy-inits CUDA and breaks fork()-based
multiprocessing. Gate the probe on is_initialized() via a shared
CUDAOpBuilder.cuda_capability_major() helper, and clarify that pytest --forked
is safe now that import no longer initializes a CUDA context.

Fixes deepspeedai#7918

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings June 19, 2026 11:05
@Achyuthan-S Achyuthan-S mentioned this pull request Jun 19, 2026

Copilot AI left a comment


Pull request overview

This PR addresses a fork-safety issue where import deepspeed could initialize a CUDA context (via import-time op compatibility checks), breaking fork()-based multiprocessing. It introduces a fork-safe CUDA capability probe and updates CUDA op builders to avoid context creation during import.

Changes:

  • Add CUDAOpBuilder.cuda_capability_major() to safely query compute capability only when CUDA is already initialized and not in a bad-fork state.
  • Update affected CUDA op builders’ is_compatible() logic to use the helper and skip capability gating when probing would be unsafe.
  • Add unit/regression tests and update contributing docs to reflect that --forked is now safe with DeepSpeed imports.
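
The "skip capability gating when probing would be unsafe" behavior amounts to treating None as "unknown, do not reject". A minimal sketch, assuming a hypothetical passes_capability_gate function and an Ampere-style minimum major version of 8 (the actual per-builder thresholds are not shown in this PR page):

```python
def passes_capability_gate(major, minimum_major=8):
    """Return False only when a known capability is below the minimum."""
    if major is None:
        # Capability could not be probed safely (for example at import time),
        # so the op is not rejected on this basis.
        return True
    return major >= minimum_major
```

Under this rule an op may be reported as compatible at import even on older hardware, which is the tradeoff the PR description accepts.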

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/ops/test_op_builder.py | Adds unit tests for the new helper and a subprocess regression test to ensure import deepspeed doesn’t initialize CUDA. |
| op_builder/builder.py | Introduces CUDAOpBuilder.cuda_capability_major() with guards to avoid CUDA context initialization. |
| op_builder/transformer_inference.py | Switches capability checks to the fork-safe helper and gates comparisons on None. |
| op_builder/spatial_inference.py | Switches Ampere gating to the fork-safe helper and guards on None. |
| op_builder/ragged_utils.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/ragged_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_cutlass_builder.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/inference_core_ops.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/fp_quantizer.py | Switches capability checks to the fork-safe helper and guards on None. |
| op_builder/evoformer_attn.py | Switches capability checks to the fork-safe helper and guards on None. |
| docs/contributing.md | Updates contributing guidance to clarify that --forked is safe now that imports don’t initialize CUDA. |
| CONTRIBUTING.md | Mirrors the contributing guidance update from docs/contributing.md. |


Comment thread tests/unit/ops/test_op_builder.py Outdated
Comment on lines +270 to +277
check = (
    "import torch, deepspeed; "
    "assert not torch.cuda.is_initialized(), "  #ignore-cuda
    "'import deepspeed initialized a CUDA context (issue #7918)'")
result = subprocess.run([sys.executable, "-c", check], capture_output=True, text=True)
if "ModuleNotFoundError" in result.stderr:
    pytest.skip("deepspeed/torch not importable in a subprocess in this environment")
assert result.returncode == 0, result.stderr

Author


Replaced this test — the new one sets explicit PYTHONPATH, a timeout, and only skips on No module named 'deepspeed'/'torch' or a genuinely absent GPU.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02b1c335cd


Comment thread tests/unit/ops/test_op_builder.py Outdated
Comment on lines +271 to +272
"import torch, deepspeed; "
"assert not torch.cuda.is_initialized(), " #ignore-cuda


P1: Verify fork safety, not just CUDA context state

This regression check can pass while the fork failure still exists: import deepspeed still runs op compatibility checks that call torch.cuda.is_available(), and PyTorch only documents that call as non-poisoning when PYTORCH_NVML_BASED_CUDA_CHECK=1 is set (https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html). Since is_available() can poison fork without making torch.cuda.is_initialized() true, CUDA-enabled environments can still fail in a forked child even though this assertion succeeds; the test should actually fork after import and touch CUDA, or the import path must avoid/use the NVML-safe availability check.


Author


Addressed: is_available() poisons fork via cuInit by default without setting is_initialized(), so the old assertion was a false green. deepspeed/__init__.py now sets PYTORCH_NVML_BASED_CUDA_CHECK=1, and the test forks after import and has the child use CUDA.


tohtana commented Jun 23, 2026

Collaborator

Hi @Achyuthan-S, thank you for opening this PR!
I think the key regression should verify the actual issue contract: after the parent process runs import deepspeed, a forked child should still be able to use CUDA.
I tried it on the current PR head by importing DeepSpeed in the parent, forking, and calling CUDA from the child, and it still failed with Cannot re-initialize CUDA in forked subprocess.

Could you add a regression test for that fork-after-import behavior and update the fix until that test passes?

torch.cuda.is_available() runs cudaGetDeviceCount/cuInit by default, creating a
CUDA context at 'import deepspeed' (accelerator auto-detect and every op builder's
is_compatible() call it). That poisons fork()-based multiprocessing without ever
setting torch.cuda.is_initialized(), so the previous import-time assertion passed
while the fork still failed.

Opt into PyTorch's NVML-based availability check
(PYTORCH_NVML_BASED_CUDA_CHECK=1) so is_available() no longer initializes a CUDA
context; combined with the existing get_device_properties guard, importing
DeepSpeed leaves CUDA uninitialized. Replace the weak import-time check with a
fork-after-import regression test that a forked child can use CUDA.

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
@Achyuthan-S

Author

Hi @tohtana, thank you for the review. You're correct: the earlier change only guarded get_device_properties, while torch.cuda.is_available() (called during accelerator detection and in every builder's is_compatible()) was the real poison, since it runs cuInit by default without flipping is_initialized(). I pushed a fix that opts into PYTORCH_NVML_BASED_CUDA_CHECK=1 at import and adds a fork-after-import regression test (the child uses CUDA in a forked subprocess).


tohtana commented Jun 23, 2026

Collaborator

Thanks for the update. I tested the current head in a CUDA environment, and I think the original issue is still not fixed: after import deepspeed in the parent process, CUDA is already initialized, so a forked child still fails when it first touches CUDA with Cannot re-initialize CUDA in forked subprocess.

ds_transformer gated its triton-kernel import on is_triton_supported(), which
reads the GPU compute capability and thereby creates a CUDA context at
'import deepspeed' (issue deepspeedai#7918) — the real initializer on a GPU node with triton
installed. Import the triton kernels whenever triton is available; actual triton
use remains gated at runtime by config.use_triton, where CUDA is already
initialized. With this, importing DeepSpeed no longer initializes CUDA and a
forked child can use CUDA.

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
@Achyuthan-S

Author

Good catch @tohtana — the real initializer was is_triton_supported() called at import from ds_transformer.py (reads compute capability → creates a CUDA context). I've dropped the import-time capability probe (triton kernels are imported on HAS_TRITON; their use stays gated by config.use_triton at runtime). Verified on a GPU node: after import deepspeed, is_initialized() is False and a forked child uses CUDA successfully.

@Achyuthan-S

Author

Hello @tohtana, could you please verify and let me know? I think it should work now.
