161 changes: 161 additions & 0 deletions .github/workflows/test_uccl_wheels.yml
@@ -0,0 +1,161 @@
# Copyright Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT

name: Test UCCL Wheels

on:
  workflow_dispatch:
    inputs:
      amdgpu_family:
        description: GPU family to test
        required: true
        type: string
        default: "gfx94X-dcgpu"
      test_runs_on:
        description: Runner label to use (must have multiple GPUs for intranode tests)
        required: true
        type: string
        default: "linux-gfx942-8gpu-ossci-rocm"
      package_index_url:
        description: Base Python package index URL (without GPU family subdir)
        required: true
        type: string
        default: "https://rocm.prereleases.amd.com/whl"
      python_version:
        required: true
        type: string
        default: "3.12"
      torch_version:
        description: "torch version to install (e.g. '2.7.1+rocm7.10.0a20251120'). Leave empty for latest."
        required: false
        type: string
        default: ""
      uccl_git_ref:
        description: UCCL ref to checkout test sources from (e.g. "main")
        type: string
        default: "main"

  workflow_call:
    inputs:
      amdgpu_family:
        required: true
        type: string
      test_runs_on:
        required: true
        type: string
      package_index_url:
        required: true
        type: string
      python_version:
        required: true
        type: string
      torch_version:
        required: false
        type: string
        default: ""
      uccl_git_ref:
        type: string
        default: "main"
      repository:
        description: "Repository to checkout. Otherwise, defaults to `github.repository`."
        type: string
      ref:
        description: "Branch, tag or SHA to checkout. Defaults to the reference or SHA that triggered the workflow."
        type: string

permissions:
  contents: read

run-name: Test UCCL (${{ inputs.amdgpu_family }}, ${{ inputs.uccl_git_ref }}, ${{ inputs.test_runs_on }})

jobs:
  test_wheels:
    name: Test UCCL | ${{ inputs.amdgpu_family }}
    runs-on: ${{ inputs.test_runs_on }}
    container:
      image: ${{ contains(inputs.test_runs_on, 'linux') && 'ghcr.io/rocm/no_rocm_image_ubuntu24_04@sha256:405945a40deaff9db90b9839c0f41d4cba4a383c1a7459b28627047bf6302a26' || null }}
      options: --ipc host
               --group-add video
               --device /dev/kfd
               --device /dev/dri
               --group-add 992
               --group-add 110
               --env-file /etc/podinfo/gha-gpu-isolation-settings
               --user 0:0
    defaults:
      run:
        shell: bash
    env:
      VENV_DIR: ${{ github.workspace }}/.venv
      AMDGPU_FAMILY: ${{ inputs.amdgpu_family }}

    steps:
      - name: Checkout TheRock
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          repository: ${{ inputs.repository || github.repository }}
          ref: ${{ inputs.ref || '' }}

      - name: Set up Python
        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
        with:
          python-version: ${{ inputs.python_version }}

      - name: Set git options
        run: |
          git config --global core.longpaths true

      - name: Checkout UCCL source
        run: |
          python external-builds/uccl/uccl_repo.py checkout \
            --repo-hashtag ${{ inputs.uccl_git_ref }}

      - name: Build UCCL wheel from source
        run: |
          python external-builds/uccl/build_prod_wheels.py \
            --output-dir ${{ github.workspace }}/uccl-wheels \
            --python-version ${{ inputs.python_version }} \
            --index-url ${{ inputs.package_index_url }}/${{ inputs.amdgpu_family }}

      - name: Set up virtual environment
        run: |
          TORCH_PKG="torch"
          if [ -n "${{ inputs.torch_version }}" ]; then
            TORCH_PKG="torch==${{ inputs.torch_version }}"
          fi
          python build_tools/setup_venv.py ${VENV_DIR} \
            --packages \
            "${TORCH_PKG}" \
            --index-url=${{ inputs.package_index_url }} \
            --index-subdir=${{ inputs.amdgpu_family }} \
            --activate-in-future-github-actions-steps

      - name: Install UCCL wheel
        run: |
          pip install --extra-index-url ${{ inputs.package_index_url }}/${{ inputs.amdgpu_family }} \
            "$(ls ${{ github.workspace }}/uccl-wheels/uccl-*.whl)[rocm]"

      - name: Install test requirements
        run: |
          python -m pip install -r external-builds/uccl/requirements-test.txt
          pip freeze

      - name: Runner health status
        run: |
          ./build_tools/health_status.py

      - name: Run rocm-sdk sanity tests
        run: |
          rocm-sdk test

      - name: Run UCCL smoke tests
        run: |
          python ./external-builds/uccl/run_uccl_smoke_tests.py -- \
            --log-cli-level=INFO \
            -v

      - name: Run UCCL intranode EP tests
        timeout-minutes: 30
        run: |
          python ./external-builds/uccl/run_uccl_tests.py \
            --uccl-dir external-builds/uccl/uccl
42 changes: 42 additions & 0 deletions external-builds/uccl/README.md
@@ -71,3 +71,45 @@ pip install --extra-index-url https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu
Note the use of `--extra-index-url` instead of `--index-url` so that
UCCL's non-ROCm dependencies can be resolved from the default PyPI
index.

## Testing UCCL

Tests are structured in two tiers, following the same pattern as
`external-builds/pytorch/`.

### Smoke tests

Quick sanity checks that verify the UCCL wheel is installed correctly,
GPU hardware is accessible, and the UCCL Python API is importable.

```bash
python run_uccl_smoke_tests.py -- --log-cli-level=INFO -v
```
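
As a rough illustration of what the "importable" part of this tier amounts to, the sketch below uses only the standard library. The helper name `is_importable` is hypothetical, the top-level module name `uccl` is assumed from the wheel name, and the actual tests under `smoke-tests/` may be structured quite differently.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for dotted names with a missing parent
        # package, or for modules whose __spec__ is unset.
        return False


if __name__ == "__main__":
    # "uccl" is an assumed top-level module name for the installed wheel.
    print("uccl importable:", is_importable("uccl"))
```

Locating the module spec rather than importing it keeps the check cheap and avoids initializing any GPU runtime as a side effect.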

### Intranode EP tests

Runs the upstream `test_intranode.py` test via `torchrun` in standalone
mode. This exercises Expert Parallelism dispatch, combine, and tuning
kernels on a single node with multiple GPUs. Requires a UCCL source
checkout for the test files.

```bash
# Checkout UCCL sources first
python uccl_repo.py checkout

# Run with all available GPUs (auto-detected)
python run_uccl_tests.py

# Or specify GPU count
python run_uccl_tests.py --nproc-per-node 4

# Dry-run to see the command without executing
python run_uccl_tests.py --dry-run
```
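
For orientation, a standalone single-node `torchrun` launch could be assembled roughly as follows. This is a minimal sketch, not the contents of `run_uccl_tests.py`: the function name `build_torchrun_cmd` and the placeholder test path are hypothetical, and the real runner may detect GPUs and pass additional arguments.

```python
import shlex
from pathlib import Path


def build_torchrun_cmd(test_file: Path, nproc_per_node: int) -> list[str]:
    """Build a single-node torchrun command line (hypothetical helper)."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return [
        "torchrun",
        "--standalone",  # single node, no external rendezvous endpoint
        f"--nproc-per-node={nproc_per_node}",  # one worker process per GPU
        str(test_file),
    ]


if __name__ == "__main__":
    # Placeholder path; the location inside the UCCL checkout is not assumed here.
    cmd = build_torchrun_cmd(Path("path/to/test_intranode.py"), 4)
    print(shlex.join(cmd))
```

Printing the joined command instead of executing it mirrors what a `--dry-run` option is typically used for.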

### CI workflow

The `test_uccl_wheels.yml` workflow runs both test tiers. It can be
triggered manually via `workflow_dispatch` or called from other
workflows via `workflow_call`. See the workflow file for the full list
of inputs.
2 changes: 2 additions & 0 deletions external-builds/uccl/requirements-test.txt
@@ -0,0 +1,2 @@
pytest==8.3.5
numpy
86 changes: 86 additions & 0 deletions external-builds/uccl/run_uccl_smoke_tests.py
@@ -0,0 +1,86 @@
#!/usr/bin/env python3
# Copyright Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT

"""UCCL ROCm Smoke Tests Runner.

Runs lightweight smoke tests to verify that the UCCL wheel is installed
correctly, GPU hardware is accessible, and the basic UCCL Python API
is importable.

Usage Examples
--------------
Basic usage (auto-detect GPU):
    $ python run_uccl_smoke_tests.py

Specify GPU family:
    $ python run_uccl_smoke_tests.py --amdgpu-family gfx942

Pass additional pytest arguments after "--":
    $ python run_uccl_smoke_tests.py -- --tb=short -x
"""

import argparse
import os
import subprocess
import sys
from pathlib import Path

THIS_SCRIPT_DIR = Path(__file__).resolve().parent


def cmd_arguments(argv: list[str]) -> tuple[argparse.Namespace, list[str]]:
    try:
        rest_pos = argv.index("--")
    except ValueError:
        passthrough_pytest_args = []
    else:
        passthrough_pytest_args = argv[rest_pos + 1 :]
        argv = argv[:rest_pos]

    parser = argparse.ArgumentParser(
        description="Runs UCCL smoke-tests for AMD GPUs. "
        'All arguments after "--" are passed directly to pytest.'
    )

    parser.add_argument(
        "--amdgpu-family",
        type=str,
        default=os.getenv("AMDGPU_FAMILY", ""),
        help='AMDGPU family (e.g. "gfx942"). Defaults to the AMDGPU_FAMILY '
        "environment variable. Note: currently parsed but not yet used to "
        "set HIP_VISIBLE_DEVICES.",
    )

    args = parser.parse_args(argv)
    return args, passthrough_pytest_args


def main() -> int:
    args, passthrough_pytest_args = cmd_arguments(sys.argv[1:])

    smoke_tests_dir = THIS_SCRIPT_DIR / "smoke-tests"
    if not smoke_tests_dir.exists():
        print(f"ERROR: Smoke test directory '{smoke_tests_dir}' does not exist.")
        return 1

    # Build pytest command. We invoke pytest as a subprocess rather than
    # via pytest.main() so that HIP_VISIBLE_DEVICES (if set externally)
    # takes effect before torch is imported.
    pytest_cmd = [
        sys.executable,
        "-m",
        "pytest",
        str(smoke_tests_dir),
    ]
    pytest_cmd.extend(passthrough_pytest_args)

    print(f"Running UCCL smoke tests from {smoke_tests_dir}")
    print(f"Command: {' '.join(pytest_cmd)}")

    result = subprocess.run(pytest_cmd)
    print(f"Smoke tests finished with return code: {result.returncode}")
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())