[deps] split ci_docgpu CPU/GPU depsets by ans9868 · Pull Request #62596 · ray-project/ray

ans9868 · 2026-04-14T04:45:16Z

Summary

Fixes the torch-spline-conv conflict in docgpu depset by splitting CPU and GPU variants into separate depsets with their respective PyTorch wheel indices.

What Changed

- ci/raydepsets/configs/ci_docgpu.depsets.yaml: Split the single ci_docgpu_depset into two:
  - ci_docgpu_cpu_depset_${PYTHON_SHORT}: CPU-only, with --index https://download.pytorch.org/whl/cpu
  - ci_docgpu_gpu_depset_${PYTHON_SHORT}: GPU-only, with --index https://download.pytorch.org/whl/cu128
- ci/docker/docgpu.build.wanda.yaml (line 5): Updated the lock reference to the GPU variant (docgpu_gpu_depset_py$PYTHON.lock)
- ci/docker/docgpu.build.Dockerfile (line 7): Updated the lock reference to the GPU variant (docgpu_gpu_depset_py$PYTHON.lock)
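As a rough illustration of the split, the sketch below maps each variant to its PyTorch wheel index and lock file name. The helper and dictionary names are hypothetical and are not part of raydepsets; the index URLs and the docgpu_<variant>_depset_py<version>.lock naming are taken from this PR.

```python
# Hypothetical helper for illustration only; raydepsets does not expose this.
PYTORCH_INDEX = {
    "cpu": "https://download.pytorch.org/whl/cpu",
    "gpu": "https://download.pytorch.org/whl/cu128",
}


def depset_target(variant: str, python_version: str) -> dict:
    """Describe one docgpu depset variant: its wheel index and lock file name."""
    if variant not in PYTORCH_INDEX:
        raise ValueError(f"unknown variant: {variant!r}")
    return {
        "depset_prefix": f"ci_docgpu_{variant}_depset",
        "index": PYTORCH_INDEX[variant],
        "lock": f"docgpu_{variant}_depset_py{python_version}.lock",
    }


gpu_310 = depset_target("gpu", "3.10")
cpu_312 = depset_target("cpu", "3.12")
print(gpu_310["lock"], gpu_310["index"])
print(cpu_312["lock"], cpu_312["index"])
```

Because each variant resolves against exactly one index, the CPU and GPU builds of the same package are never candidates in the same compile.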

Why

PR #62485 introduced a single depset combining both CPU and GPU requirements, creating an unsolvable torch-spline-conv conflict:

Because you require torch-spline-conv==1.2.2+pt27cu128 and
torch-spline-conv==1.2.2+pt27cpu, your requirements are unsatisfiable.

Splitting into separate depsets with explicit indices (following the ci_ml pattern) resolves this.
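The conflict can be reproduced in miniature. The following is a minimal sketch, assuming a simplified "name==version" requirement format; find_pin_conflicts is a hypothetical helper and does not reflect how the actual resolver works. It only shows why two exact pins with different local version labels cannot coexist in one requirement set.

```python
from collections import defaultdict


def find_pin_conflicts(requirements):
    """Return {package: sorted versions} for packages pinned to more than one exact version.

    Simplification: environment markers are stripped, so pins that differ only
    by marker would be reported as conflicts too.
    """
    pins = defaultdict(set)
    for line in requirements:
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()].add(version.strip())
    return {name: sorted(versions) for name, versions in pins.items() if len(versions) > 1}


gpu_pins = ["torch-spline-conv==1.2.2+pt27cu128"]
cpu_pins = ["torch-spline-conv==1.2.2+pt27cpu"]

combined_conflicts = find_pin_conflicts(gpu_pins + cpu_pins)  # single combined depset
split_conflicts = [find_pin_conflicts(gpu_pins), find_pin_conflicts(cpu_pins)]  # split depsets
print(combined_conflicts)
print(split_conflicts)
```

Compiling the two lists together reports torch-spline-conv as pinned twice, while compiling them separately reports nothing.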

Note

This PR includes the configuration and Docker file changes. Lock files will be regenerated and committed in a separate follow-up PR because:

- Lock file generation requires running bazel run //ci/raydepsets:raydepsets -- build, which compiles all dependencies and exposes a pre-existing etils version conflict (etils==1.5.2 in dl-cpu-requirements.txt vs etils==1.14.0 in the constraint file).
- The architectural fix (config split) is complete and correct regardless of lock file state. It can merge immediately while the etils conflict is resolved separately.
- This keeps the PR focused: architectural changes now, lock regeneration later once etils is fixed.

Fixes the torch-spline-conv conflict introduced in PR ray-project#62485 by splitting the single ci_docgpu_depset into separate CPU and GPU variants:
- ci_docgpu_cpu_depset: CPU-only with --index https://download.pytorch.org/whl/cpu
- ci_docgpu_gpu_depset: GPU-only with --index https://download.pytorch.org/whl/cu128

Update Docker build files to reference the GPU lock only (docgpu_gpu_depset_py.lock).

This follows the proven raydepsets pattern used by ci_ml_build_depset (CPU) and ci_ml_gpubuild_depset (GPU).

Note: Lock file regeneration is blocked by a pre-existing etils version conflict (separate issue). Lock files will be committed once that is resolved.

Closes ray-project#62595

Signed-off-by: Adel Nour <[email protected]>

aslonnie

@elliot-barn could you help review this?

ci/raydepsets/configs/ci_docgpu.depsets.yaml

Signed-off-by: Nour999 <[email protected]>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit ca1881f.

cursor · 2026-04-14T16:27:47Z

ci/raydepsets/configs/ci_docgpu.depsets.yaml

+      - py310
+      - py312
+    pre_hooks:
+      - ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13


CPU depset generates lock files nothing consumes

Low Severity

The ci_docgpu_cpu_depset_${PYTHON_SHORT} depset generates docgpu_cpu_depset_py${PYTHON_VERSION}.lock files, but no Dockerfile or wanda config references them — only the GPU variant is consumed by docgpu.build.Dockerfile and docgpu.build.wanda.yaml. This differs from the ci_ml pattern being followed, where both CPU and GPU lock files are consumed via BUILD_VARIANT. The CPU depset will cost CI time to compile in the follow-up lock generation without being used.
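A check along these lines could flag such unconsumed lock files. This is a minimal sketch, assuming consumers refer to lock files by a fixed name prefix followed by a version variable; the function name and the consumer file contents below are illustrative stand-ins, not the real Dockerfile or wanda config.

```python
def unconsumed_lock_prefixes(lock_prefixes, consumer_texts):
    """Return the lock file prefixes that no consumer file mentions."""
    return [
        prefix
        for prefix in lock_prefixes
        if not any(prefix in text for text in consumer_texts.values())
    ]


# Illustrative stand-ins for the consumer files; the real contents are not shown in this PR page.
consumers = {
    "ci/docker/docgpu.build.Dockerfile": "COPY python/deplocks/docgpu_gpu_depset_py$PYTHON.lock ./",
    "ci/docker/docgpu.build.wanda.yaml": "srcs:\n  - python/deplocks/docgpu_gpu_depset_py$PYTHON.lock\n",
}
prefixes = ["docgpu_cpu_depset_py", "docgpu_gpu_depset_py"]

unused = unconsumed_lock_prefixes(prefixes, consumers)
print(unused)
```

Matching on the prefix rather than the full file name keeps the check valid when the consumer interpolates the Python version through a build argument.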


ci/raydepsets/configs/ci_docgpu.depsets.yaml

Adds missing CPU-only packages (jax, torchmetrics, torchtext, etils, etc.) to GPU depset via new docgpu_gpu_additions.txt file. This avoids merging full dl-cpu with dl-gpu in one compile, preventing torch-spline-conv conflict while ensuring GPU image has all needed packages.

Updates ci_docgpu.depsets.yaml GPU variant to reference both:
- python/requirements/ml/py313/dl-gpu-requirements.txt (GPU PyTorch/PyG)
- python/requirements/ml/py313/docgpu_gpu_additions.txt (missing CPU-only packages)

INCOMPLETE: GPU lock files (docgpu_gpu_depset_py3.10.lock, py3.12.lock) not yet generated. Docker build will fail until locks are committed in follow-up PR. Lock generation blocked by pre-existing etils version conflict (separate issue).

Signed-off-by: Adel Nour <[email protected]>

ans9868 · 2026-04-14T17:31:53Z

docgpu Split Fix — Work in Progress

Problem

PR #62485 merged the CPU and GPU requirements into one depset, causing a torch-spline-conv conflict: the CPU pin (+pt27cpu) and the GPU pin (+pt27cu128) cannot both be satisfied in a single compile run.

Initial Approach

Split into two depsets with separate indices. However, a GPU depset built only from dl-gpu is missing jax, torchmetrics, and torchtext, which come from dl-cpu.

Current Solution

Added docgpu_gpu_additions.txt to pull the non-conflicting CPU packages into the GPU depset. This avoids recreating the conflict that arises when both full files are merged, while ensuring the GPU image has all packages.
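One way to find candidates for such an additions file is a name-level set difference between the two requirement lists. The sketch below assumes this approach for illustration; it is not necessarily how docgpu_gpu_additions.txt was assembled, the helper names are hypothetical, and the sample requirement lines are abbreviated stand-ins rather than the real dl-cpu and dl-gpu files.

```python
import re


def package_names(requirement_lines):
    """Extract lowercase package names, ignoring comments, options, and blank lines."""
    names = set()
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        # The name ends at the first version operator, marker, extra, or space.
        names.add(re.split(r"[=<>!~;\[\s]", line, maxsplit=1)[0].lower())
    return names


def missing_from_gpu(cpu_lines, gpu_lines):
    """Packages listed for CPU but absent from the GPU list, sorted by name."""
    return sorted(package_names(cpu_lines) - package_names(gpu_lines))


# Abbreviated stand-ins for the requirement files.
dl_cpu = [
    "jax==0.4.28",
    "torchmetrics",
    "torchtext==0.18.0",
    "torch-spline-conv==1.2.2+pt27cpu",
]
dl_gpu = [
    "--extra-index-url https://download.pytorch.org/whl/cu128",
    "torch-spline-conv==1.2.2+pt27cu128",
]

additions = missing_from_gpu(dl_cpu, dl_gpu)
print(additions)
```

Comparing by name rather than by full pin is what keeps torch-spline-conv out of the additions: it exists in both lists, just with different local versions.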

Status

The jax versions are now correct (jax==0.4.28 and jaxlib==0.4.28+cuda backend match), but several unknowns remain:

- etils version conflict (pre-existing): py3.10 builds may fail on the constraint mismatch
- JAX/jaxlib pairing is untested until lock generation
- docgpu_gpu_additions compilation with the cu128 index is unvalidated
- GPU lock files are not yet generated (blocked by etils)

P.S. Lock file generation is a separate issue. I will handle it once etils is resolved.

Depending on my schedule, I think I could fix this in 3-7 days. Feedback on this approach is welcome. This is trickier than I initially expected. More information about the full bug is in the issue: #62595

Generate four lock files completing the ci_docgpu depset split:
- docgpu_cpu_depset_py3.{10,12}.lock: CPU PyTorch wheels
- docgpu_gpu_depset_py3.{10,12}.lock: GPU cu128 wheels

Remove old undivided docgpu_depset_py3.{10,12}.lock.

Also improve docgpu_gpu_additions.txt: add python_version < '3.13' guard to torchtext (no cp313 wheel exists for 0.18.0) and clarify comments.

Validated:
- torch-spline-conv: +pt27cpu in CPU locks, +pt27cu128 in GPU locks
- etils: 1.5.2 on py3.10, 1.14.0 on py3.12
- jaxlib: 0.4.28+cuda12.cudnn89 in both GPU locks

Fixes ray-project#62595

Signed-off-by: Adel Nour <[email protected]>
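The validation listed in this commit message could be scripted roughly as follows. This is a minimal sketch, assuming lock files contain one "name==version" pin per package; the helper names are hypothetical and the lock text is an abbreviated stand-in, not the contents of the generated files.

```python
def pinned_version(lock_text, package):
    """Return the exact version pinned for `package` in a lock file, or None if absent."""
    for line in lock_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "==" not in line:
            continue
        name, rest = line.split("==", 1)
        if name.strip().lower() == package.lower():
            # Drop any environment marker, hash option, or line continuation.
            tokens = rest.split(";", 1)[0].split()
            return tokens[0] if tokens else None
    return None


def has_local_suffix(lock_text, package, suffix):
    """True if the package is pinned and its version ends with the given local label."""
    version = pinned_version(lock_text, package)
    return version is not None and version.endswith(suffix)


# Abbreviated stand-in for a CPU lock on Python 3.10.
CPU_LOCK_PY310 = """\
# This file was autogenerated
etils==1.5.2
    # via -r dl-cpu-requirements.txt
torch-spline-conv==1.2.2+pt27cpu
"""

print(pinned_version(CPU_LOCK_PY310, "etils"))
print(has_local_suffix(CPU_LOCK_PY310, "torch-spline-conv", "+pt27cpu"))
```

Running the same check with "+pt27cu128" against the GPU locks would cover the other half of the torch-spline-conv validation.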

ans9868 requested review from a team, matthewdeng and richardliaw as code owners April 14, 2026 04:45

ray-gardener bot added the devprod and community-contribution labels Apr 14, 2026

ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from 2e76459 to 8eff66b Compare April 14, 2026 15:30

ans9868 closed this Apr 14, 2026

cursor bot reviewed Apr 14, 2026


ans9868 reopened this Apr 14, 2026

ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from 4d69759 to 57dbd9a Compare April 14, 2026 15:57

aslonnie reviewed Apr 14, 2026


aslonnie requested a review from elliot-barn April 14, 2026 16:00

cursor bot reviewed Apr 14, 2026


Merge branch 'master' into fix/docgpu-split-cpu-gpu-locks

ca1881f

Signed-off-by: Nour999 <[email protected]>

ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from b2db55a to ca1881f Compare April 14, 2026 16:07

cursor bot reviewed Apr 14, 2026

ans9868 wants to merge 4 commits into ray-project:master from ans9868:fix/docgpu-split-cpu-gpu-locks
