[deps] split ci_docgpu CPU/GPU depsets#62596

Open
ans9868 wants to merge 4 commits into ray-project:master from ans9868:fix/docgpu-split-cpu-gpu-locks

@ans9868 ans9868 commented Apr 14, 2026

Summary

Fixes the torch-spline-conv conflict in the docgpu depset by splitting the CPU and GPU variants into separate depsets, each with its own PyTorch wheel index.

What Changed

  • ci/raydepsets/configs/ci_docgpu.depsets.yaml: Split single ci_docgpu_depset into two:

    • ci_docgpu_cpu_depset_${PYTHON_SHORT}: CPU-only with --index https://download.pytorch.org/whl/cpu
    • ci_docgpu_gpu_depset_${PYTHON_SHORT}: GPU-only with --index https://download.pytorch.org/whl/cu128
  • ci/docker/docgpu.build.wanda.yaml (line 5): Updated lock reference to GPU variant (docgpu_gpu_depset_py$PYTHON.lock)

  • ci/docker/docgpu.build.Dockerfile (line 7): Updated lock reference to GPU variant (docgpu_gpu_depset_py$PYTHON.lock)
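
For reference, the naming scheme described above can be sketched in Python. This is only an illustration: the `INDEX_URLS` mapping and the `lock_name` helper are hypothetical and not part of raydepsets; the index URLs and the lock-file name pattern are the ones listed in this description.

```python
# Hypothetical helper for illustration only; not part of raydepsets.
# Index URLs are the ones named in the PR description.
INDEX_URLS = {
    "cpu": "https://download.pytorch.org/whl/cpu",
    "gpu": "https://download.pytorch.org/whl/cu128",
}


def lock_name(variant: str, python_version: str) -> str:
    """Return the lock file name for a docgpu variant, e.g. ('gpu', '3.12')."""
    if variant not in INDEX_URLS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"docgpu_{variant}_depset_py{python_version}.lock"


# The Docker build files reference only the GPU variant.
print(lock_name("gpu", "3.12"))
```

Keeping the variant as an explicit key makes it harder to pair a CPU lock with the cu128 index by accident.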

Why

PR #62485 introduced a single depset combining both CPU and GPU requirements, creating an unsolvable torch-spline-conv conflict:

Because you require torch-spline-conv==1.2.2+pt27cu128 and
torch-spline-conv==1.2.2+pt27cpu, your requirements are unsatisfiable.

Splitting into separate depsets with explicit indices (following the ci_ml pattern) resolves this.
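
To make the failure mode concrete, here is a minimal sketch of why the combined input cannot resolve: the same package is pinned with `==` to two different versions. The `conflicting_pins` function is a hypothetical, simplified parser (it ignores hashes, extras, and anything other than `==` pins) and is not how the resolver itself works.

```python
from collections import defaultdict


def conflicting_pins(requirements):
    """Return {package: versions} for packages pinned to more than one version."""
    pins = defaultdict(set)
    for line in requirements:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        version = version.split(";", 1)[0].strip()  # drop environment markers
        pins[name.strip().lower()].add(version)
    return {name: versions for name, versions in pins.items() if len(versions) > 1}


# The two pins quoted in the resolver error above.
combined = [
    "torch-spline-conv==1.2.2+pt27cu128",
    "torch-spline-conv==1.2.2+pt27cpu",
]
print(conflicting_pins(combined))
```

With the inputs split per variant, each list contains a single pin for the package and the function returns an empty dict.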

Note

This PR includes the configuration and Docker file changes. Lock files will
be regenerated and committed in a separate follow-up PR because:

  1. Lock file generation requires running bazel run //ci/raydepsets:raydepsets -- build,
    which compiles all dependencies and exposes a pre-existing etils version conflict
    (etils==1.5.2 in dl-cpu-requirements.txt vs etils==1.14.0 in the constraint file).

  2. The architectural fix (config split) is complete and correct regardless of
    lock file state. It can merge immediately while the etils conflict is resolved
    separately.

  3. This keeps the PR focused: architectural changes now, lock regeneration later
    once etils is fixed.

Related

Closes #62595
Related to #62485
Found via: buildkite/precheck/dependencies/build: raydepsets: compile all dependencies [g8_s1] #60522

@ans9868 ans9868 requested review from a team, matthewdeng and richardliaw as code owners April 14, 2026 04:45
@ray-gardener ray-gardener bot added the devprod and community-contribution labels Apr 14, 2026
@ans9868 ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from 2e76459 to 8eff66b Compare April 14, 2026 15:30

ans9868 commented Apr 14, 2026

It looks like someone has already fixed this on master, so I'm going to close this PR.

@ans9868 ans9868 closed this Apr 14, 2026

ans9868 commented Apr 14, 2026

Ah I got confused; there isn't a solution to this on the master branch.

@ans9868 ans9868 reopened this Apr 14, 2026
Fixes the torch-spline-conv conflict introduced in PR ray-project#62485 by splitting
the single ci_docgpu_depset into separate CPU and GPU variants:

- ci_docgpu_cpu_depset: CPU-only with --index https://download.pytorch.org/whl/cpu
- ci_docgpu_gpu_depset: GPU-only with --index https://download.pytorch.org/whl/cu128

Update Docker build files to reference the GPU lock only (docgpu_gpu_depset_py$PYTHON.lock).

This follows the proven raydepsets pattern used by ci_ml_build_depset (CPU)
and ci_ml_gpubuild_depset (GPU).

Note: Lock file regeneration is blocked by a pre-existing etils version
conflict (separate issue). Lock files will be committed once that is resolved.

Closes ray-project#62595

Signed-off-by: Adel Nour <[email protected]>
@ans9868 ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from 4d69759 to 57dbd9a Compare April 14, 2026 15:57
Collaborator

@aslonnie aslonnie left a comment

@elliot-barn could you help review this?

@aslonnie aslonnie requested a review from elliot-barn April 14, 2026 16:00
@ans9868 ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from b2db55a to ca1881f Compare April 14, 2026 16:07

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit ca1881f.

- py310
- py312
pre_hooks:
- ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13

CPU depset generates lock files nothing consumes

Low Severity

The ci_docgpu_cpu_depset_${PYTHON_SHORT} depset generates docgpu_cpu_depset_py${PYTHON_VERSION}.lock files, but no Dockerfile or wanda config references them — only the GPU variant is consumed by docgpu.build.Dockerfile and docgpu.build.wanda.yaml. This differs from the ci_ml pattern being followed, where both CPU and GPU lock files are consumed via BUILD_VARIANT. The CPU depset will cost CI time to compile in the follow-up lock generation without being used.
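
A check along these lines could flag lock files that nothing consumes. This is a rough sketch under stated assumptions: `unconsumed_lock_prefixes` is a hypothetical helper, matching is done on the lock-name prefix because the consumers reference the lock through a `$PYTHON` variable, and the consumer strings below stand in for the real file contents, which are not shown on this page.

```python
def unconsumed_lock_prefixes(lock_prefixes, consumer_texts):
    """Return lock-name prefixes that no consumer text mentions."""
    return sorted(
        prefix
        for prefix in lock_prefixes
        if not any(prefix in text for text in consumer_texts)
    )


# Prefixes of the lock files the two depsets generate.
prefixes = ["docgpu_cpu_depset_py", "docgpu_gpu_depset_py"]

# Stand-ins for docgpu.build.Dockerfile and docgpu.build.wanda.yaml;
# only the lock reference named in the PR description is reproduced.
consumers = [
    "docgpu_gpu_depset_py$PYTHON.lock",
    "docgpu_gpu_depset_py$PYTHON.lock",
]

print(unconsumed_lock_prefixes(prefixes, consumers))
```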

Adds missing CPU-only packages (jax, torchmetrics, torchtext, etils, etc.)
to GPU depset via new docgpu_gpu_additions.txt file.

This avoids merging full dl-cpu with dl-gpu in one compile, preventing
torch-spline-conv conflict while ensuring GPU image has all needed packages.

Updates ci_docgpu.depsets.yaml GPU variant to reference both:
- python/requirements/ml/py313/dl-gpu-requirements.txt (GPU PyTorch/PyG)
- python/requirements/ml/py313/docgpu_gpu_additions.txt (missing CPU-only packages)

INCOMPLETE: GPU lock files (docgpu_gpu_depset_py3.10.lock, py3.12.lock) not yet
generated. Docker build will fail until locks are committed in follow-up PR.
Lock generation blocked by pre-existing etils version conflict (separate issue).

Signed-off-by: Adel Nour <[email protected]>

ans9868 commented Apr 14, 2026

docgpu Split Fix — Work in Progress

Problem

PR #62485 merged the CPU and GPU requirements into one depset, causing a torch-spline-conv conflict: the plain CPU version and the GPU cu128 variant cannot both be resolved in a single compile run.

Initial Approach

Split into two depsets with separate indices. However, a GPU depset built only from dl-gpu was missing jax, torchmetrics, and torchtext, which come from dl-cpu.

Current Solution

Added docgpu_gpu_additions.txt to pull the non-conflicting CPU packages into the GPU depset. This avoids recreating the conflict that merging both full files would cause, while ensuring the GPU image has all the packages it needs.
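
One way to find candidates for such an additions file is a set difference over package names. The sketch below is illustrative only: `package_names` and `missing_from_gpu` are hypothetical helpers, and the sample lines are made up for the example rather than copied from the real dl-cpu and dl-gpu requirement files.

```python
import re


def package_names(requirement_lines):
    """Collect normalized package names from requirements-style lines."""
    names = set()
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):  # skip blanks and option lines
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Normalize the name: collapse runs of -, _ and . and lowercase.
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def missing_from_gpu(cpu_lines, gpu_lines):
    """Return packages present in the CPU list but absent from the GPU list."""
    return sorted(package_names(cpu_lines) - package_names(gpu_lines))


# Illustrative inputs, not the real file contents.
cpu_lines = [
    "jax==0.4.28",
    "torchmetrics",
    "torchtext",
    "torch-spline-conv==1.2.2+pt27cpu",
]
gpu_lines = [
    "--index https://download.pytorch.org/whl/cu128",
    "torch-spline-conv==1.2.2+pt27cu128",
]
print(missing_from_gpu(cpu_lines, gpu_lines))
```

Comparing names rather than full pins is deliberate: torch-spline-conv appears in both lists with different builds and must not be copied across.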

Status

The jax versions are now correct (jax==0.4.28 matches the jaxlib==0.4.28+cuda backend), but several unknowns remain:

  • etils version conflict (pre-existing): py3.10 builds may fail on constraint mismatch
  • JAX/jaxlib pairing untested until lock generation
  • docgpu_gpu_additions compilation with cu128 index unvalidated
  • GPU lock files not yet generated (blocked by etils)
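
The JAX/jaxlib pairing could be checked mechanically once locks exist. A minimal sketch, assuming "paired" means the two packages share the same base version once the local label after `+` is stripped; `base_version` and `versions_paired` are hypothetical helpers.

```python
def base_version(version: str) -> str:
    """Strip a local version label ('+...') from a version string."""
    return version.split("+", 1)[0]


def versions_paired(jax_version: str, jaxlib_version: str) -> bool:
    """True when jax and jaxlib share the same base version."""
    return base_version(jax_version) == base_version(jaxlib_version)


# Versions as quoted in this thread.
print(versions_paired("0.4.28", "0.4.28+cuda12.cudnn89"))
```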

P.S. Lock file generation is a separate issue. I will handle it once the etils conflict is resolved.

Depending on my schedule, I think I could fix this in 3-7 days. Feedback on this approach is welcome. This is trickier than I initially expected. More information about the full bug is in the issue: #62595

Generate four lock files completing the ci_docgpu depset split:

- docgpu_cpu_depset_py3.{10,12}.lock: CPU PyTorch wheels
- docgpu_gpu_depset_py3.{10,12}.lock: GPU cu128 wheels

Remove old undivided docgpu_depset_py3.{10,12}.lock.

Also improve docgpu_gpu_additions.txt: add python_version < '3.13'
guard to torchtext (no cp313 wheel exists for 0.18.0) and clarify
comments.

Validated:
- torch-spline-conv: +pt27cpu in CPU locks, +pt27cu128 in GPU locks
- etils: 1.5.2 on py3.10, 1.14.0 on py3.12
- jaxlib: 0.4.28+cuda12.cudnn89 in both GPU locks
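
Checks like the ones above can be scripted against the generated locks. This is a sketch under assumptions: it treats a lock as pip requirements-style text with `name==version` pins and `--hash` continuation lines, `pinned_versions` is a hypothetical helper, and the sample text (including the placeholder hashes) is made up rather than copied from the real lock files.

```python
def pinned_versions(lock_text):
    """Parse 'name==version' pins from requirements-style lock text."""
    pins = {}
    for raw in lock_text.splitlines():
        # Drop the trailing line-continuation backslash, if any.
        line = raw.strip().rstrip("\\").strip()
        if line.startswith(("#", "-")) or "==" not in line:
            continue  # comments, --hash lines, other options
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.split(";", 1)[0].strip()
    return pins


# Made-up sample in the assumed lock format; hashes are placeholders.
sample_gpu_lock = """\
torch-spline-conv==1.2.2+pt27cu128 \\
    --hash=sha256:0000
    # via -r dl-gpu-requirements.txt
jaxlib==0.4.28+cuda12.cudnn89 \\
    --hash=sha256:0000
"""

pins = pinned_versions(sample_gpu_lock)
assert pins["torch-spline-conv"].endswith("+pt27cu128")
assert pins["jaxlib"] == "0.4.28+cuda12.cudnn89"
```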

Fixes ray-project#62595

Signed-off-by: Adel Nour <[email protected]>

Development

Successfully merging this pull request may close these issues.

deps: split ci_docgpu CPU/GPU depsets and regenerate locks

2 participants