| Filename | Overview |
|---|---|
| Makefile.am | Replaces serial SUBDIRS with PARALLEL_SUBDIRS and hand-rolled pattern rules; correctly adds check-local, distclean-local, and uninstall-local hooks so standard automake targets still work for the parallel directories. |
| autogen.sh | Updates generated TL plugin lines from SUBDIRS += to COMPONENT_DIRS += and also appends DIST_SUBDIRS += so make dist correctly includes dynamically discovered TL plugin directories. |
| src/Makefile.am | Core build parallelism changes: components moved to COMPONENT_DIRS with explicit ordering rules (all-component/%: libucc.la), early CUDA kernel compilation via compile-kernels/%, and an explicit serialization guard preventing two make processes from entering the same kernel directory simultaneously. DIST_SUBDIRS is manually maintained but TL plugin dirs are handled via the generated components/tl/makefile.am. |
| src/components/ec/cpu/ec_cpu_reduce.c | Reduced to a thin dispatcher that delegates to per-type TU functions; each datatype group correctly maps to its implementation. The removal of return UCC_OK after the switch is sound since every case now returns directly. |
| src/components/ec/cpu/ec_cpu_reduce.h | New shared header centralising all reduce macros (DO_DT_REDUCE_INT, DO_DT_REDUCE_FLOAT, etc.) and forward-declaring the per-type functions, enabling the TU-split while keeping all logic in one header. |
| src/components/ec/cuda/kernel/Makefile.am | Old two-file ec_cuda_executor_reduce_int.cu / ec_cuda_executor_reduce_fp.cu replaced by twelve per-type TUs; the ec_cuda_executor_dlink.lo device-link step and libucc_ec_cuda_kernels_la_SOURCES updated accordingly. The ec_cuda_reduce.cu split (int/float/complex) is also reflected. |
| src/components/ec/cuda/kernel/ec_cuda_executor.cu | The TRY_REDUCE macro chain (12 per-type device functions) correctly propagates non-UCC_ERR_NOT_SUPPORTED errors immediately, an improvement over the original which silently fell through on non-OK / non-NOT_SUPPORTED statuses. Rest of changes are whitespace/formatting only. |
| src/components/ec/cuda/kernel/ec_cuda_reduce.cu | Slimmed to a dispatch wrapper delegating to ucc_ec_cuda_reduce_int, _float, and _complex. The CUDA_CHECK / error-logging path for truly unsupported types is correctly preserved. The #include "utils/ucc_math.h" added inside the extern "C" block has no visible users in the remaining code and may be a leftover from a refactoring step. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_float.cu | Contains a verbatim copy of LAUNCH_REDUCE_A / LAUNCH_REDUCE macros also present in ec_cuda_reduce_int.cu and ec_cuda_reduce_complex.cu; should be extracted to a shared header to avoid triple-maintenance. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_int.cu | Contains a verbatim copy of LAUNCH_REDUCE_A / LAUNCH_REDUCE macros also present in ec_cuda_reduce_float.cu and ec_cuda_reduce_complex.cu; should be extracted to a shared header. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_complex.cu | Correctly handles UCC_DT_FLOAT32_COMPLEX and UCC_DT_FLOAT64_COMPLEX with compile-time size guards; contains the same duplicated LAUNCH_REDUCE_A/LAUNCH_REDUCE macros as the other two split files. |
Last reviewed commit: "BUILD: split TUs for..."
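The error-propagation behaviour noted for the `TRY_REDUCE` chain in `ec_cuda_executor.cu` can be illustrated with a minimal host-side sketch in plain C. It is not the macro from the PR: the `sketch_*` names, the status enum, and the two per-type functions are hypothetical stand-ins, and the real chain covers twelve per-type device functions. The point shown is only the control flow: a "not supported" result falls through to the next candidate, while any other status (success or a real error) is returned immediately.

```c
/* Hypothetical stand-ins for the status codes and datatypes. */
typedef enum {
    SKETCH_OK                = 0,
    SKETCH_ERR_NOT_SUPPORTED = -1,
    SKETCH_ERR_INVALID_PARAM = -2
} sketch_status_t;

typedef enum {
    SKETCH_DT_INT32,
    SKETCH_DT_FLOAT32,
    SKETCH_DT_UNKNOWN
} sketch_dt_t;

/* Each per-type function handles only its own datatype and reports
 * NOT_SUPPORTED for everything else. */
sketch_status_t sketch_reduce_int32(sketch_dt_t dt, int count)
{
    if (dt != SKETCH_DT_INT32) {
        return SKETCH_ERR_NOT_SUPPORTED;
    }
    return (count < 0) ? SKETCH_ERR_INVALID_PARAM : SKETCH_OK;
}

sketch_status_t sketch_reduce_float32(sketch_dt_t dt, int count)
{
    if (dt != SKETCH_DT_FLOAT32) {
        return SKETCH_ERR_NOT_SUPPORTED;
    }
    return (count < 0) ? SKETCH_ERR_INVALID_PARAM : SKETCH_OK;
}

/* Try one candidate: return right away unless it said NOT_SUPPORTED,
 * so genuine errors are never swallowed by later candidates. */
#define SKETCH_TRY_REDUCE(_fn, _dt, _count)                 \
    do {                                                    \
        sketch_status_t _st = _fn((_dt), (_count));         \
        if (_st != SKETCH_ERR_NOT_SUPPORTED) {              \
            return _st;                                     \
        }                                                   \
    } while (0)

sketch_status_t sketch_reduce(sketch_dt_t dt, int count)
{
    SKETCH_TRY_REDUCE(sketch_reduce_int32, dt, count);
    SKETCH_TRY_REDUCE(sketch_reduce_float32, dt, count);
    /* No candidate claimed the datatype. */
    return SKETCH_ERR_NOT_SUPPORTED;
}
```

Calling `sketch_reduce(SKETCH_DT_INT32, -1)` returns `SKETCH_ERR_INVALID_PARAM` from the first candidate without consulting the second, which is the behaviour the review describes as the improvement.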
Split monolithic ec_cpu_reduce.c into per-type translation units
(int8/16/32/64, float, complex) and ec_cuda_executor_reduce into
per-type .cu files so that the compiler/NVCC can build them in
parallel under make -j.
Split ec_cuda_reduce.cu into ec_cuda_reduce_{int,float,complex}.cu
for the same reason.
Replace top-level and src/ serial SUBDIRS with explicit parallel
pattern rules so that all components build concurrently after
libucc.la is ready. Start CUDA kernel directories early (before
libucc.la) since NVCC only needs headers to compile .cu files.
Also add FORCE prerequisite to all pattern-rule targets so GNU make
always re-evaluates them regardless of filesystem state, and extend
DIST_SUBDIRS / autogen.sh to cover kernel subdirectories and any
future TL plugins added dynamically.
Signed-off-by: Ilya Kryukov <[email protected]>
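The translation-unit split described in the commit message can be sketched as follows. This is an illustrative single-file model, not the code from the PR: the `sk_*` names, the reduced datatype list, and the sum-only operation are assumptions made for the example. In an actual split, the prototypes would sit in a shared header and each per-type function in its own `.c` file, so `make -j` can compile them concurrently; the dispatcher only routes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes and datatypes for the sketch. */
typedef enum { SK_OK = 0, SK_ERR_NOT_SUPPORTED = -1 } sk_status_t;
typedef enum { SK_DT_INT32, SK_DT_INT64, SK_DT_FLOAT32 } sk_dt_t;

/* What a shared header would forward-declare. */
sk_status_t sk_reduce_sum_int32(void *dst, const void *src, size_t n);
sk_status_t sk_reduce_sum_float32(void *dst, const void *src, size_t n);

/* Would live in its own translation unit, e.g. one file per type. */
sk_status_t sk_reduce_sum_int32(void *dst, const void *src, size_t n)
{
    int32_t       *d = dst;
    const int32_t *s = src;
    for (size_t i = 0; i < n; i++) {
        d[i] += s[i];
    }
    return SK_OK;
}

/* Would live in a second, independently compiled translation unit. */
sk_status_t sk_reduce_sum_float32(void *dst, const void *src, size_t n)
{
    float       *d = dst;
    const float *s = src;
    for (size_t i = 0; i < n; i++) {
        d[i] += s[i];
    }
    return SK_OK;
}

/* Thin dispatcher: every case returns directly, so no trailing
 * return is needed after the switch. */
sk_status_t sk_reduce_sum(sk_dt_t dt, void *dst, const void *src, size_t n)
{
    switch (dt) {
    case SK_DT_INT32:
        return sk_reduce_sum_int32(dst, src, n);
    case SK_DT_FLOAT32:
        return sk_reduce_sum_float32(dst, src, n);
    default:
        /* SK_DT_INT64 is deliberately left unimplemented here. */
        return SK_ERR_NOT_SUPPORTED;
    }
}
```

Because the dispatcher contains no per-type loops, it compiles quickly, and the expensive per-type bodies no longer share a single compiler invocation.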
/build
What
Reduce make -j wall-clock time by increasing build parallelism: split monolithic EC reduce translation units, start CUDA kernel compilation early, and build component/test/tool subdirectories concurrently.
Server: 96 cores, CUDA 13.1
Configure:
Wall-clock time for a clean build drops from ~72s to ~12s, a 6x speedup.
Why?
NVCC is slow and the previous build serialized CUDA kernel compilation behind libucc.la. Large, fused .c/.cu reduce files also limited how many cores make -j could keep busy.
How?