Add UCCL test infrastructure and CI workflow #4437

Draft
adityakankariya wants to merge 4 commits into main from users/adityakankariya/uccl-tests

adityakankariya commented Apr 9, 2026

Motivation

UCCL currently has no test infrastructure within TheRock's CI. The upstream UCCL project has its own CI workflow that runs multi-node tests via SSH, but TheRock needs a single-node, container-based test suite that follows the existing patterns established by the PyTorch external build (external-builds/pytorch/).

Technical Details

Adds a two-tier test suite for UCCL, mirroring the structure in external-builds/pytorch/:

New files in external-builds/uccl/:

  • requirements-test.txt — test dependencies (pytest, numpy)
  • smoke-tests/uccl_smoke_test.py — pytest smoke tests: ROCm availability, UCCL import verification, basic GPU tensor operations
  • run_uccl_smoke_tests.py — smoke test runner with --amdgpu-family support and pytest passthrough
  • run_uccl_tests.py — wraps upstream test_intranode.py via torchrun --standalone, exercising EP dispatch/combine kernels on a single node with auto-detected GPU count
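As an illustration of the "--amdgpu-family support and pytest passthrough" behavior described for run_uccl_smoke_tests.py, here is a minimal sketch using `argparse.parse_known_args`. It is not the script from this PR: the function name `build_pytest_command`, the flag's default, and the way the family value is consumed are assumptions made for the example.

```python
import argparse
import sys
from pathlib import Path

# Directory name taken from the file list above; resolving it relative to the
# current working directory is an assumption of this sketch.
SMOKE_TEST_DIR = Path("smoke-tests")


def build_pytest_command(argv):
    """Split runner flags from pytest flags and return (args, command)."""
    parser = argparse.ArgumentParser(description="Run UCCL smoke tests (sketch)")
    # The accepted values and default for this flag are not stated in the PR.
    parser.add_argument(
        "--amdgpu-family",
        default=None,
        help="target GPU family (how the value is used is hypothetical here)",
    )
    # parse_known_args returns unrecognized options separately, so they can be
    # forwarded to pytest unchanged.
    args, passthrough = parser.parse_known_args(argv)
    command = [sys.executable, "-m", "pytest", str(SMOKE_TEST_DIR), *passthrough]
    return args, command


if __name__ == "__main__":
    args, command = build_pytest_command(sys.argv[1:])
    print("amdgpu family:", args.amdgpu_family)
    print("would run:", " ".join(command))
```

With this shape, an invocation such as `--amdgpu-family <family> -k import -v` keeps the family value on the runner side and hands `-k import -v` to pytest.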

New CI workflow:

  • .github/workflows/test_uccl_wheels.yml — reusable workflow (workflow_dispatch + workflow_call) that runs smoke tests then intranode EP tests inside a container with GPU device passthrough, using the same container image and runner patterns as test_pytorch_wheels.yml

Updated:

  • external-builds/uccl/README.md — added testing documentation

Test Plan

  • uccl_repo.py checkout successfully clones UCCL sources with test_intranode.py at the expected path
  • run_uccl_tests.py --dry-run --nproc-per-node 8 produces the correct torchrun command
  • Workflow YAML passes syntax validation
  • End-to-end GPU execution (requires CI runner with GPU access and a published UCCL wheel)
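The dry-run item in the plan above can be pictured with a short sketch of how a wrapper might assemble a single-node torchrun command. This is an assumption-laden illustration, not the code in run_uccl_tests.py: the helper names, the GPU detection via `torch.cuda.device_count()`, and the placeholder script path are all hypothetical. Only the `--standalone` and `--nproc-per-node` torchrun options come from the PR text.

```python
import shlex


def detect_gpu_count():
    """Return the number of visible GPUs, or 0 if torch is not installed."""
    try:
        import torch  # ROCm builds of PyTorch expose GPUs through torch.cuda
    except ImportError:
        return 0
    return torch.cuda.device_count()


def build_torchrun_command(test_script, nproc_per_node=None, extra_args=()):
    """Build a single-node torchrun command as a list of arguments."""
    nproc = nproc_per_node if nproc_per_node is not None else detect_gpu_count()
    if nproc < 1:
        raise ValueError("no GPUs detected; pass nproc_per_node explicitly")
    return [
        "torchrun",
        "--standalone",  # single-node rendezvous, no external endpoint needed
        f"--nproc-per-node={nproc}",
        str(test_script),
        *extra_args,
    ]


if __name__ == "__main__":
    # Placeholder path: the real location of test_intranode.py in the UCCL
    # checkout is not given in the PR description.
    command = build_torchrun_command("path/to/test_intranode.py", nproc_per_node=8)
    # Dry run: print the command instead of executing it.
    print(shlex.join(command))
```

Returning the command as a list keeps the dry-run path and the real execution path identical up to the final step, where the list would be passed to `subprocess.run` instead of printed.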

Test Result

Local validation passed on all non-GPU checks. The dry-run output confirms the correct torchrun invocation:

Submission Checklist

…er, and CI workflow

Signed-off-by: Aditya Kankariya <adkankar@amd.com>
Signed-off-by: Aditya Kankariya <adkankar@amd.com>
adityakankariya changed the title from "Users/adityakankariya/uccl tests" to "Add UCCL test infrastructure and CI workflow" Apr 9, 2026
adityakankariya marked this pull request as draft April 9, 2026 19:38
…est nightly

Signed-off-by: Aditya Kankariya <adkankar@amd.com>
Signed-off-by: Aditya Kankariya <adkankar@amd.com>