BUILD: optimizations #1283

Open
ikryukov wants to merge 1 commit into openucx:master from ikryukov:build_optimizations

@ikryukov
Collaborator

@ikryukov ikryukov commented Mar 11, 2026

What

Reduce make -j wall-clock time by increasing build parallelism: split monolithic EC reduce translation units, start CUDA kernel compilation early, and build component/test/tool subdirectories concurrently.

Server: 96 cores, CUDA 13.1
Configure:

--with-ucx=$HPCX_UCX_DIR --with-cuda=$CUDA_HOME --with-mpi=$HPCX_MPI_DIR
--with-nvcc-gencode="-gencode=arch=compute_90,code=sm_90"

The wall-clock time for a clean build drops from ~72s to ~12s, a roughly 6x speedup.

Why?

NVCC is slow and the previous build serialized CUDA kernel compilation behind libucc.la. Large, fused .c/.cu reduce files also limited how many cores make -j could keep busy.

How?

  • Split ec_cpu_reduce.c into per-type TUs (int8/16/32/64, float, complex).
  • Split CUDA executor reduce and ec_cuda_reduce.cu into per-type .cu files.
  • Start CUDA kernel directories (ec/cuda/kernel, tl/cuda/kernels) before libucc.la finishes — NVCC only needs headers, not the linked library.
  • Replace automake SUBDIRS with explicit COMPONENT_DIRS and PARALLEL_SUBDIRS targets so GNU make's job server schedules components, tests, and tools concurrently.
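The per-type split described above can be pictured with a minimal C sketch. It is not the PR's code: every `demo_*` and `DEMO_*` name is a hypothetical stand-in for the UCC types and functions, and the two reduction functions are shown in one file only so the example compiles on its own. In a split build each would live in its own translation unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the UCC status and datatype enums. */
typedef enum { DEMO_OK = 0, DEMO_ERR_NOT_SUPPORTED = -1 } demo_status_t;
typedef enum { DEMO_DT_INT32, DEMO_DT_FLOAT32, DEMO_DT_UNKNOWN } demo_dt_t;

/* In a split build each function below would sit in its own .c file,
 * declared in a shared header, so make -j can compile them in parallel. */
demo_status_t demo_reduce_sum_int32(const void *a, const void *b, void *dst,
                                    size_t n)
{
    const int32_t *x = a;
    const int32_t *y = b;
    int32_t       *d = dst;

    for (size_t i = 0; i < n; i++) {
        d[i] = x[i] + y[i];
    }
    return DEMO_OK;
}

demo_status_t demo_reduce_sum_float32(const void *a, const void *b, void *dst,
                                      size_t n)
{
    const float *x = a;
    const float *y = b;
    float       *d = dst;

    for (size_t i = 0; i < n; i++) {
        d[i] = x[i] + y[i];
    }
    return DEMO_OK;
}

/* Thin dispatcher: every case returns directly, so nothing follows the
 * switch. It only needs the declarations of the per-type functions. */
demo_status_t demo_reduce_sum(demo_dt_t dt, const void *a, const void *b,
                              void *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    switch (dt) {
    case DEMO_DT_INT32:
        return demo_reduce_sum_int32(a, b, dst, n);
    case DEMO_DT_FLOAT32:
        return demo_reduce_sum_float32(a, b, dst, n);
    default:
        return DEMO_ERR_NOT_SUPPORTED;
    }
}
```

Because the dispatcher depends only on function declarations, the expensive per-type bodies can compile concurrently and the dispatcher itself stays cheap to rebuild.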

@ikryukov ikryukov changed the title from "Build optimizations" to "BUILD: optimizations" Mar 11, 2026
@ikryukov ikryukov force-pushed the build_optimizations branch from dce383d to 2e22c77 March 11, 2026 11:09
@ikryukov ikryukov self-assigned this Mar 11, 2026
@ikryukov ikryukov requested a review from Sergei-Lebedev March 11, 2026 11:13
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR delivers a significant build-time improvement (reported ~6× speedup on a 96-core machine) by parallelising the UCC build system at three levels: splitting large monolithic reduce translation units so more compiler processes can run simultaneously, launching CUDA kernel directories early so NVCC overlaps with libucc.la compilation, and replacing serial automake SUBDIRS with explicit parallel pattern rules.

Key changes:

  • Top-level Makefile.am: SUBDIRS replaced with PARALLEL_SUBDIRS plus hand-rolled all-par/%, install-par/%, check-par/%, clean-par/%, uninstall-par/%, and distclean-par/% pattern rules driven by FORCE, preserving standard automake target behaviour including make check.
  • src/Makefile.am: COMPONENT_DIRS pattern rules with explicit libucc.la ordering; early_cuda_kernel_dirs targets overlap NVCC invocations with the core library build; mutex dependency prevents two make processes from entering the same kernel directory.
  • CPU EC reduce split: ec_cpu_reduce.c is reduced to a thin dispatcher; all macros move to ec_cpu_reduce.h; six per-type TUs (int8, int16, int32, int64, float, complex) are introduced.
  • CUDA executor reduce split: old two-file split (_reduce_int.cu, _reduce_fp.cu) replaced by twelve single-type TUs; TRY_REDUCE macro in ec_cuda_executor.cu correctly propagates non-UCC_ERR_NOT_SUPPORTED errors immediately.
  • ec_cuda_reduce.cu split: integer/float/complex reduction kernels moved to separate TUs; dispatcher now chains calls through ucc_ec_cuda_reduce_int/float/complex.
  • One minor concern: LAUNCH_REDUCE_A and LAUNCH_REDUCE macros are duplicated verbatim across ec_cuda_reduce_int.cu, ec_cuda_reduce_float.cu, and ec_cuda_reduce_complex.cu; extracting them to a shared private header would eliminate the triple-maintenance burden.
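The error-propagation behaviour attributed to TRY_REDUCE above can be sketched in plain host C. This is an illustration under assumptions, not the macro from ec_cuda_executor.cu: the real chain calls CUDA device functions, while the `demo_*` helpers, the `DEMO_*` status codes, and the "count must be non-zero" failure condition here are all invented for the example.

```c
#include <stddef.h>

/* Hypothetical status codes and datatypes standing in for the UCC ones. */
typedef enum {
    DEMO_OK                = 0,
    DEMO_ERR_NOT_SUPPORTED = -1,
    DEMO_ERR_INVALID_PARAM = -2
} demo_status_t;

typedef enum { DEMO_DT_INT32, DEMO_DT_FLOAT32, DEMO_DT_OTHER } demo_dt_t;

/* Each per-type handler reports NOT_SUPPORTED for datatypes it does not
 * own, and a real status (success or failure) for the one it does. */
static demo_status_t demo_try_int32(demo_dt_t dt, size_t count)
{
    if (dt != DEMO_DT_INT32) {
        return DEMO_ERR_NOT_SUPPORTED;
    }
    return (count > 0) ? DEMO_OK : DEMO_ERR_INVALID_PARAM;
}

static demo_status_t demo_try_float32(demo_dt_t dt, size_t count)
{
    if (dt != DEMO_DT_FLOAT32) {
        return DEMO_ERR_NOT_SUPPORTED;
    }
    return (count > 0) ? DEMO_OK : DEMO_ERR_INVALID_PARAM;
}

/* Try one handler; keep going only when it says "not my datatype".
 * Any other status, success or error, is returned to the caller at once. */
#define DEMO_TRY_REDUCE(_fn, _dt, _count)                  \
    do {                                                   \
        demo_status_t _st = _fn((_dt), (_count));          \
        if (_st != DEMO_ERR_NOT_SUPPORTED) {               \
            return _st;                                    \
        }                                                  \
    } while (0)

demo_status_t demo_executor_reduce(demo_dt_t dt, size_t count)
{
    DEMO_TRY_REDUCE(demo_try_int32, dt, count);
    DEMO_TRY_REDUCE(demo_try_float32, dt, count);
    return DEMO_ERR_NOT_SUPPORTED; /* no handler claimed the datatype */
}
```

The point of the pattern is that a genuine failure from the owning handler is not masked by later handlers returning NOT_SUPPORTED.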

Confidence Score: 4/5

  • Safe to merge; the build-system refactor is logically sound and all standard automake targets are preserved. The one actionable concern (macro duplication) is a maintenance issue, not a correctness bug.
  • The parallelism logic is carefully constructed — libucc.la ordering, CUDA kernel early-start, and the mutex guard against concurrent make processes in the same directory are all handled correctly. The CPU and CUDA reduce TU splits preserve existing behaviour (same macros, same dispatch logic, same SIZEOF_* guards). make check, make dist, make install, make clean, and make distclean all have correct hooks in both Makefiles. The only issue identified is the copy-paste of two macros across three CUDA files, which is a maintainability concern rather than a correctness problem.
  • The three ec_cuda_reduce_*.cu files (ec_cuda_reduce_int.cu, ec_cuda_reduce_float.cu, ec_cuda_reduce_complex.cu) each carry identical copies of LAUNCH_REDUCE_A/LAUNCH_REDUCE that should be consolidated.
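One way to read the consolidation suggestion is sketched below in plain C. The bodies of the real LAUNCH_REDUCE_A / LAUNCH_REDUCE macros are not shown on this page, so `DEMO_LAUNCH_REDUCE` is a hypothetical stand-in that expands to an elementwise loop rather than a CUDA kernel launch, and the header name in the comment is likewise invented. The sketch only shows the structure: one guarded definition, several per-type users.

```c
/* ---- would live in one private header, e.g. a hypothetical
 * ---- demo_reduce_launch.h, included by every per-type file ---- */
#ifndef DEMO_REDUCE_LAUNCH_H_
#define DEMO_REDUCE_LAUNCH_H_

#include <stddef.h>

/* Stand-in for a launch macro: applies a binary operator elementwise.
 * A real CUDA version would launch a kernel here instead of looping. */
#define DEMO_LAUNCH_REDUCE(_type, _op, _dst, _src1, _src2, _count)  \
    do {                                                            \
        _type       *_d = (_type *)(_dst);                          \
        const _type *_a = (const _type *)(_src1);                   \
        const _type *_b = (const _type *)(_src2);                   \
        for (size_t _i = 0; _i < (size_t)(_count); _i++) {          \
            _d[_i] = _a[_i] _op _b[_i];                             \
        }                                                           \
    } while (0)

#endif /* DEMO_REDUCE_LAUNCH_H_ */

/* ---- per-type files then use the macro without redefining it ---- */

/* Would be the integer translation unit. */
void demo_reduce_sum_int(int *dst, const int *a, const int *b, size_t n)
{
    DEMO_LAUNCH_REDUCE(int, +, dst, a, b, n);
}

/* Would be the floating-point translation unit. */
void demo_reduce_prod_float(float *dst, const float *a, const float *b,
                            size_t n)
{
    DEMO_LAUNCH_REDUCE(float, *, dst, a, b, n);
}
```

With a single definition, a fix to the launch logic lands in one place while the per-type files still compile independently.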

Important Files Changed

| Filename | Overview |
|---|---|
| Makefile.am | Replaces serial SUBDIRS with PARALLEL_SUBDIRS and hand-rolled pattern rules; correctly adds check-local, distclean-local, and uninstall-local hooks so standard automake targets still work for the parallel directories. |
| autogen.sh | Updates generated TL plugin lines from SUBDIRS += to COMPONENT_DIRS += and also appends DIST_SUBDIRS += so make dist correctly includes dynamically discovered TL plugin directories. |
| src/Makefile.am | Core build parallelism changes: components moved to COMPONENT_DIRS with explicit ordering rules (all-component/%: libucc.la), early CUDA kernel compilation via compile-kernels/%, and an explicit serialization guard preventing two make processes from entering the same kernel directory simultaneously. DIST_SUBDIRS is manually maintained but TL plugin dirs are handled via the generated components/tl/makefile.am. |
| src/components/ec/cpu/ec_cpu_reduce.c | Reduced to a thin dispatcher that delegates to per-type TU functions; each datatype group correctly maps to its implementation. The removal of return UCC_OK after the switch is sound since every case now returns directly. |
| src/components/ec/cpu/ec_cpu_reduce.h | New shared header centralising all reduce macros (DO_DT_REDUCE_INT, DO_DT_REDUCE_FLOAT, etc.) and forward-declaring the per-type functions, enabling the TU-split while keeping all logic in one header. |
| src/components/ec/cuda/kernel/Makefile.am | Old two-file ec_cuda_executor_reduce_int.cu / ec_cuda_executor_reduce_fp.cu replaced by twelve per-type TUs; the ec_cuda_executor_dlink.lo device-link step and libucc_ec_cuda_kernels_la_SOURCES updated accordingly. The ec_cuda_reduce.cu split (int/float/complex) is also reflected. |
| src/components/ec/cuda/kernel/ec_cuda_executor.cu | The TRY_REDUCE macro chain (12 per-type device functions) correctly propagates non-UCC_ERR_NOT_SUPPORTED errors immediately, an improvement over the original which silently fell through on non-OK / non-NOT_SUPPORTED statuses. Rest of changes are whitespace/formatting only. |
| src/components/ec/cuda/kernel/ec_cuda_reduce.cu | Slimmed to a dispatch wrapper delegating to ucc_ec_cuda_reduce_int, _float, and _complex. The CUDA_CHECK / error-logging path for truly unsupported types is correctly preserved. The #include "utils/ucc_math.h" added inside the extern "C" block has no visible users in the remaining code and may be a leftover from a refactoring step. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_float.cu | Contains a verbatim copy of LAUNCH_REDUCE_A / LAUNCH_REDUCE macros also present in ec_cuda_reduce_int.cu and ec_cuda_reduce_complex.cu; should be extracted to a shared header to avoid triple-maintenance. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_int.cu | Contains a verbatim copy of LAUNCH_REDUCE_A / LAUNCH_REDUCE macros also present in ec_cuda_reduce_float.cu and ec_cuda_reduce_complex.cu; should be extracted to a shared header. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_complex.cu | Correctly handles UCC_DT_FLOAT32_COMPLEX and UCC_DT_FLOAT64_COMPLEX with compile-time size guards; contains the same duplicated LAUNCH_REDUCE_A/LAUNCH_REDUCE macros as the other two split files. |

Last reviewed commit: "BUILD: split TUs for..."

@ikryukov ikryukov force-pushed the build_optimizations branch from 2e22c77 to 2b54d28 March 11, 2026 13:41
@janjust janjust force-pushed the build_optimizations branch 2 times, most recently from 3200742 to 4537de0 March 19, 2026 19:38
@ikryukov ikryukov force-pushed the build_optimizations branch 2 times, most recently from b3c5951 to 894925c March 20, 2026 16:53
Split monolithic ec_cpu_reduce.c into per-type translation units
(int8/16/32/64, float, complex) and ec_cuda_executor_reduce into
per-type .cu files so that the compiler/NVCC can build them in
parallel under make -j.

Split ec_cuda_reduce.cu into ec_cuda_reduce_{int,float,complex}.cu
for the same reason.

Replace top-level and src/ serial SUBDIRS with explicit parallel
pattern rules so that all components build concurrently after
libucc.la is ready. Start CUDA kernel directories early (before
libucc.la) since NVCC only needs headers to compile .cu files.

Also add FORCE prerequisite to all pattern-rule targets so GNU make
always re-evaluates them regardless of filesystem state, and extend
DIST_SUBDIRS / autogen.sh to cover kernel subdirectories and any
future TL plugins added dynamically.

Signed-off-by: Ilya Kryukov <[email protected]>
@ikryukov ikryukov force-pushed the build_optimizations branch from 894925c to 888614c March 20, 2026 17:02
@ikryukov
Collaborator Author

/build
