BUILD: optimizations by ikryukov · Pull Request #1283 · openucx/ucc

ikryukov · 2026-03-11T10:49:58Z

What

Reduce make -j wall-clock time by increasing build parallelism: split monolithic EC reduce translation units, start CUDA kernel compilation early, and build component/test/tool subdirectories concurrently.

Server: 96 cores, CUDA 13.1
Configure:

--with-ucx=$HPCX_UCX_DIR --with-cuda=$CUDA_HOME --with-mpi=$HPCX_MPI_DIR
--with-nvcc-gencode="-gencode=arch=compute_90,code=sm_90"

The wall-clock time for a clean build drops from ~72s to ~12s, a ~6x speedup.

Why?

NVCC is slow and the previous build serialized CUDA kernel compilation behind libucc.la. Large, fused .c/.cu reduce files also limited how many cores make -j could keep busy.

How?

Split ec_cpu_reduce.c into per-type TUs (int8/16/32/64, float, complex).
Split CUDA executor reduce and ec_cuda_reduce.cu into per-type .cu files.
Start CUDA kernel directories (ec/cuda/kernel, tl/cuda/kernels) before libucc.la finishes — NVCC only needs headers, not the linked library.
Replace automake SUBDIRS with explicit COMPONENT_DIRS and PARALLEL_SUBDIRS targets so GNU make's job server schedules components, tests, and tools concurrently.
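
As a rough illustration of the per-type split described above, the sketch below shows a thin dispatcher that only switches on the datatype and delegates to per-type functions. All names here (`sketch_status_t`, `sketch_dt_t`, `sketch_reduce_sum_*`) are hypothetical stand-ins and are not taken from the UCC sources. For brevity everything sits in one file; in a split layout each per-type function would live in its own .c file so that make -j can compile them concurrently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes and datatype tags (assumptions, not UCC's). */
typedef enum {
    SKETCH_OK                = 0,
    SKETCH_ERR_NOT_SUPPORTED = -1
} sketch_status_t;

typedef enum {
    SKETCH_DT_INT32,
    SKETCH_DT_FLOAT32,
    SKETCH_DT_UNKNOWN
} sketch_dt_t;

/* Would live in its own translation unit, e.g. a per-type int32 file. */
sketch_status_t sketch_reduce_sum_int32(const void *src1, const void *src2,
                                        void *dst, size_t count)
{
    const int32_t *a = src1;
    const int32_t *b = src2;
    int32_t       *d = dst;
    size_t         i;

    for (i = 0; i < count; i++) {
        d[i] = a[i] + b[i];
    }
    return SKETCH_OK;
}

/* Would live in a separate per-type float file. */
sketch_status_t sketch_reduce_sum_float(const void *src1, const void *src2,
                                        void *dst, size_t count)
{
    const float *a = src1;
    const float *b = src2;
    float       *d = dst;
    size_t       i;

    for (i = 0; i < count; i++) {
        d[i] = a[i] + b[i];
    }
    return SKETCH_OK;
}

/* Thin dispatcher: no reduction logic, every case returns directly. */
sketch_status_t sketch_reduce_sum(sketch_dt_t dt, const void *src1,
                                  const void *src2, void *dst, size_t count)
{
    switch (dt) {
    case SKETCH_DT_INT32:
        return sketch_reduce_sum_int32(src1, src2, dst, count);
    case SKETCH_DT_FLOAT32:
        return sketch_reduce_sum_float(src1, src2, dst, count);
    default:
        return SKETCH_ERR_NOT_SUPPORTED;
    }
}
```

Because the dispatcher needs only the declarations of the per-type functions, editing one type's implementation would recompile a single small file rather than the whole reduce unit.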

greptile-apps · 2026-03-11T11:17:29Z

Greptile Summary

This PR delivers a significant build-time improvement (reported ~6× speedup on a 96-core machine) by parallelising the UCC build system at three levels: splitting large monolithic reduce translation units so more compiler processes can run simultaneously, launching CUDA kernel directories early so NVCC overlaps with libucc.la compilation, and replacing serial automake SUBDIRS with explicit parallel pattern rules.

Key changes:

Top-level Makefile.am: SUBDIRS replaced with PARALLEL_SUBDIRS plus hand-rolled all-par/%, install-par/%, check-par/%, clean-par/%, uninstall-par/%, and distclean-par/% pattern rules driven by FORCE, preserving standard automake target behaviour including make check.
src/Makefile.am: COMPONENT_DIRS pattern rules with explicit libucc.la ordering; early_cuda_kernel_dirs targets overlap NVCC invocations with the core library build; mutex dependency prevents two make processes from entering the same kernel directory.
CPU EC reduce split: ec_cpu_reduce.c is reduced to a thin dispatcher; all macros move to ec_cpu_reduce.h; six per-type TUs (int8, int16, int32, int64, float, complex) are introduced.
CUDA executor reduce split: old two-file split (_reduce_int.cu, _reduce_fp.cu) replaced by twelve single-type TUs; TRY_REDUCE macro in ec_cuda_executor.cu correctly propagates non-UCC_ERR_NOT_SUPPORTED errors immediately.
ec_cuda_reduce.cu split: integer/float/complex reduction kernels moved to separate TUs; dispatcher now chains calls through ucc_ec_cuda_reduce_int/float/complex.
One minor concern: LAUNCH_REDUCE_A and LAUNCH_REDUCE macros are duplicated verbatim across ec_cuda_reduce_int.cu, ec_cuda_reduce_float.cu, and ec_cuda_reduce_complex.cu; extracting them to a shared private header would eliminate the triple-maintenance burden.
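
The error-propagation behaviour attributed to TRY_REDUCE above can be sketched in plain C as follows. This is a guess at the general shape, not the macro from ec_cuda_executor.cu: the `demo_*` names, the status codes, and the integer datatype tags are all hypothetical. The assumption is that each per-type handler returns "not supported" when the datatype is not its own, and that the chain stops at the first handler returning anything else, whether success or a real error.

```c
/* Hypothetical status codes (assumptions, not UCC's actual values). */
typedef enum {
    DEMO_OK                = 0,
    DEMO_ERR_NOT_SUPPORTED = -1,
    DEMO_ERR_INVALID_PARAM = -2
} demo_status_t;

/* Handler for hypothetical datatype tag 0; rejects a negative count. */
demo_status_t demo_try_int(int dt, int count)
{
    if (dt != 0) {
        return DEMO_ERR_NOT_SUPPORTED;
    }
    return (count >= 0) ? DEMO_OK : DEMO_ERR_INVALID_PARAM;
}

/* Handler for hypothetical datatype tag 1. */
demo_status_t demo_try_float(int dt, int count)
{
    (void)count;
    return (dt == 1) ? DEMO_OK : DEMO_ERR_NOT_SUPPORTED;
}

/* Call one handler; return from the enclosing function on any status
 * other than "not supported", so real errors are never swallowed. */
#define DEMO_TRY_REDUCE(_fn, _dt, _count)                                      \
    do {                                                                       \
        demo_status_t _st = _fn(_dt, _count);                                  \
        if (_st != DEMO_ERR_NOT_SUPPORTED) {                                   \
            return _st;                                                        \
        }                                                                      \
    } while (0)

/* Chain of per-type attempts; falls through only while handlers decline. */
demo_status_t demo_reduce(int dt, int count)
{
    DEMO_TRY_REDUCE(demo_try_int, dt, count);
    DEMO_TRY_REDUCE(demo_try_float, dt, count);
    return DEMO_ERR_NOT_SUPPORTED;
}
```

In this sketch, an invalid-parameter error from the first handler is returned immediately instead of letting later handlers overwrite it with "not supported".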

Confidence Score: 4/5

Safe to merge; the build-system refactor is logically sound and all standard automake targets are preserved. The one actionable concern (macro duplication) is a maintenance issue, not a correctness bug.
The parallelism logic is carefully constructed — libucc.la ordering, CUDA kernel early-start, and the mutex guard against concurrent make processes in the same directory are all handled correctly. The CPU and CUDA reduce TU splits preserve existing behaviour (same macros, same dispatch logic, same SIZEOF_* guards). make check, make dist, make install, make clean, and make distclean all have correct hooks in both Makefiles. The only issue identified is the copy-paste of two macros across three CUDA files, which is a maintainability concern rather than a correctness problem.
The three ec_cuda_reduce_*.cu files (ec_cuda_reduce_int.cu, ec_cuda_reduce_float.cu, ec_cuda_reduce_complex.cu) each carry identical copies of LAUNCH_REDUCE_A/LAUNCH_REDUCE that should be consolidated.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Makefile.am | Replaces serial `SUBDIRS` with `PARALLEL_SUBDIRS` and hand-rolled pattern rules; correctly adds `check-local`, `distclean-local`, and `uninstall-local` hooks so standard automake targets still work for the parallel directories. |
| autogen.sh | Updates generated TL plugin lines from `SUBDIRS +=` to `COMPONENT_DIRS +=` and also appends `DIST_SUBDIRS +=` so `make dist` correctly includes dynamically discovered TL plugin directories. |
| src/Makefile.am | Core build parallelism changes: components moved to `COMPONENT_DIRS` with explicit ordering rules (`all-component/%: libucc.la`), early CUDA kernel compilation via `compile-kernels/%`, and an explicit serialization guard preventing two make processes from entering the same kernel directory simultaneously. `DIST_SUBDIRS` is manually maintained but TL plugin dirs are handled via the generated `components/tl/makefile.am`. |
| src/components/ec/cpu/ec_cpu_reduce.c | Reduced to a thin dispatcher that delegates to per-type TU functions; each datatype group correctly maps to its implementation. The removal of `return UCC_OK` after the switch is sound since every case now returns directly. |
| src/components/ec/cpu/ec_cpu_reduce.h | New shared header centralising all reduce macros (`DO_DT_REDUCE_INT`, `DO_DT_REDUCE_FLOAT`, etc.) and forward-declaring the per-type functions, enabling the TU-split while keeping all logic in one header. |
| src/components/ec/cuda/kernel/Makefile.am | Old two-file `ec_cuda_executor_reduce_int.cu` / `ec_cuda_executor_reduce_fp.cu` replaced by twelve per-type TUs; the `ec_cuda_executor_dlink.lo` device-link step and `libucc_ec_cuda_kernels_la_SOURCES` updated accordingly. The `ec_cuda_reduce.cu` split (int/float/complex) is also reflected. |
| src/components/ec/cuda/kernel/ec_cuda_executor.cu | The `TRY_REDUCE` macro chain (12 per-type device functions) correctly propagates non-`UCC_ERR_NOT_SUPPORTED` errors immediately, an improvement over the original which silently fell through on non-OK / non-NOT_SUPPORTED statuses. Rest of changes are whitespace/formatting only. |
| src/components/ec/cuda/kernel/ec_cuda_reduce.cu | Slimmed to a dispatch wrapper delegating to `ucc_ec_cuda_reduce_int`, `_float`, and `_complex`. The `CUDA_CHECK` / error-logging path for truly unsupported types is correctly preserved. The `#include "utils/ucc_math.h"` added inside the `extern "C"` block has no visible users in the remaining code and may be a leftover from a refactoring step. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_float.cu | Contains a verbatim copy of `LAUNCH_REDUCE_A` / `LAUNCH_REDUCE` macros also present in `ec_cuda_reduce_int.cu` and `ec_cuda_reduce_complex.cu`; should be extracted to a shared header to avoid triple-maintenance. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_int.cu | Contains a verbatim copy of `LAUNCH_REDUCE_A` / `LAUNCH_REDUCE` macros also present in `ec_cuda_reduce_float.cu` and `ec_cuda_reduce_complex.cu`; should be extracted to a shared header. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_complex.cu | Correctly handles `UCC_DT_FLOAT32_COMPLEX` and `UCC_DT_FLOAT64_COMPLEX` with compile-time size guards; contains the same duplicated `LAUNCH_REDUCE_A`/`LAUNCH_REDUCE` macros as the other two split files. |

Last reviewed commit: "BUILD: split TUs for..."

Split monolithic ec_cpu_reduce.c into per-type translation units (int8/16/32/64, float, complex) and ec_cuda_executor_reduce into per-type .cu files so that the compiler/NVCC can build them in parallel under make -j. Split ec_cuda_reduce.cu into ec_cuda_reduce_{int,float,complex}.cu for the same reason.

Replace top-level and src/ serial SUBDIRS with explicit parallel pattern rules so that all components build concurrently after libucc.la is ready. Start CUDA kernel directories early (before libucc.la) since NVCC only needs headers to compile .cu files.

Also add FORCE prerequisite to all pattern-rule targets so GNU make always re-evaluates them regardless of filesystem state, and extend DIST_SUBDIRS / autogen.sh to cover kernel subdirectories and any future TL plugins added dynamically.

Signed-off-by: Ilya Kryukov <[email protected]>

ikryukov · 2026-03-20T17:03:00Z

/build

ikryukov changed the title ~~Build optimizations~~ BUILD: optimizations Mar 11, 2026

ikryukov force-pushed the build_optimizations branch from dce383d to 2e22c77 Compare March 11, 2026 11:09

ikryukov self-assigned this Mar 11, 2026

ikryukov added the Ready-for-Review label Mar 11, 2026

ikryukov requested a review from Sergei-Lebedev March 11, 2026 11:13

greptile-apps bot reviewed Mar 11, 2026

ikryukov force-pushed the build_optimizations branch from 2e22c77 to 2b54d28 Compare March 11, 2026 13:41

greptile-apps bot reviewed Mar 11, 2026

janjust force-pushed the build_optimizations branch 2 times, most recently from 3200742 to 4537de0 Compare March 19, 2026 19:38

ikryukov force-pushed the build_optimizations branch 2 times, most recently from b3c5951 to 894925c Compare March 20, 2026 16:53

ikryukov force-pushed the build_optimizations branch from 894925c to 888614c Compare March 20, 2026 17:02

ikryukov wants to merge 1 commit into openucx:master from ikryukov:build_optimizations
