-
Notifications
You must be signed in to change notification settings - Fork 126
Multi-rank profiling with MPI Communication #3013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
183d90e to
7c80599
Compare
c16b11f to
c953de3
Compare
d9ac823 to
643c0c2
Compare
643c0c2 to
05e66ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Adds multi-rank (MPI) awareness to ROCm Compute Profiler outputs and tests, including a new MPI-enabled Laplace sample workload and a new --output-directory mechanism with placeholder substitution.
Changes:
- Introduces
--output-directory(with%hostname%,%gpumodel%,%rank%,%env{...}%) and MPI-rank-based output directory isolation. - Adds MPI-enabled Laplace sample build/test path plus new multi-rank pytest coverage.
- Updates docs/changelog to describe the new output directory behavior and MPI multi-rank profiling.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rocprofiler-compute/src/rocprof_compute_base.py | Implements output-directory placeholder replacement, rank detection integration, and refactors machine spec generation/load flow. |
| projects/rocprofiler-compute/src/utils/utils.py | Adds rank/env placeholder helpers (get_rank, replace_rank, replace_env). |
| projects/rocprofiler-compute/src/argparser.py | Adds --output-directory CLI option and deprecates --path/--subpath in help text. |
| projects/rocprofiler-compute/src/roofline.py | Adds a setter to update roofline workload directory at post-processing time. |
| projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py | Updates roofline post-processing to use the (potentially rewritten) workload directory. |
| projects/rocprofiler-compute/sample/laplace_eqn.hip | Adds optional MPI support and halo-exchange scaffolding for an MPI-aware sample workload. |
| projects/rocprofiler-compute/tests/conftest.py | Updates test harness to use --output-directory and introduces knobs for skipping --name and using default output dir behavior. |
| projects/rocprofiler-compute/tests/test_profile_general.py | Adds output-directory placeholder tests and multi-rank MPI tests (including MPI-aware Laplace case). |
| projects/rocprofiler-compute/tests/CMakeLists.txt | Builds mpi_aware_laplace_eqn when MPI is available. |
| projects/rocprofiler-compute/CMakeLists.txt | Adds an MPI-driven ctest for multi-rank pytest execution. |
| projects/rocprofiler-compute/requirements-test.txt | Adds mpi4py and pytest-mpi as test dependencies. |
| projects/rocprofiler-compute/pyproject.toml | Adds multi_rank pytest marker. |
| projects/rocprofiler-compute/docs/how-to/use.rst | Documents new MPI-rank default output behavior and deprecation of --path/--subpath. |
| projects/rocprofiler-compute/docs/how-to/profile/mode.rst | Adds detailed output-directory configuration and multi-rank profiling documentation. |
| projects/rocprofiler-compute/CHANGELOG.md | Records new --output-directory, MPI rank detection, and deprecations. |
Comments suppressed due to low confidence (1)
projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py:738
- Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too many arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
Overridden method signature does not match call, where it is passed too few arguments. Overriding method function demarcate.wrap_function matches the call.
def analysis_setup(self, roofline_parameters: Optional[dict[str, Any]]) -> None:
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if(MPI_CXX_COMPILER) | ||
| message(STATUS "MPI CXX compiler found: ${MPI_CXX_COMPILER}") | ||
| else() | ||
| message(FATAL_ERROR "MPI CXX compiler not found. MPI tests will be skipped.") | ||
| endif() |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake now hard-fails configuration when MPI isn’t found, but the message says "MPI tests will be skipped". This makes all builds require MPI even when tests aren’t being built/run. Suggest making MPI optional (e.g., find_package(MPI QUIET ...) and only adding MPI tests/targets when found) and/or guarding this behind the project’s test option so non-MPI environments can still configure/build.
| # Check with hostname | ||
| hostname = "test_node" | ||
| with patch("socket.gethostname", return_value=hostname): | ||
| workload_base_dir = test_utils.get_output_dir(param_id="hostname") | ||
| workload_dir = os.path.join(workload_base_dir, "%hostname%") |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test’s hostname/gpumodel cases assume no MPI-rank env vars are set; however on HPC schedulers SLURM_PROCID/etc can be present even for non-MPI pytest runs, which would make the output-directory logic append /<rank> and break these assertions. Consider clearing the rank-related env vars at the start of the test (before the hostname/gpumodel checks) to make it environment-independent.
| subdirectory named by its rank to avoid output collisions. | ||
|
|
||
| Example Usage | ||
| ----- |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reStructuredText, the underline length should be at least as long as the section title. Here Example Usage is underlined with only 5 dashes (-----), which can trigger Sphinx/RST formatting warnings/errors. Use an underline at least the length of the title (e.g., -------------) or switch to a consistent subsection adornment style used elsewhere in the doc.
| ----- | |
| ------------- |
| print(f"Path: {self.__args.path}") | ||
| print(f"Name: {self.__args.name}") | ||
| print(f"Output_directory: {self.__args.output_directory}") |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The print(f"Path/Name/Output_directory...") lines look like leftover debug statements and can pollute CLI output and test logs. Prefer console_debug(...) (or remove entirely) so user-facing output stays clean, especially when followed by console_error/console_warning.
| print(f"Path: {self.__args.path}") | |
| print(f"Name: {self.__args.name}") | |
| print(f"Output_directory: {self.__args.output_directory}") | |
| console_debug(f"Path: {self.__args.path}") | |
| console_debug(f"Name: {self.__args.name}") | |
| console_debug(f"Output_directory: {self.__args.output_directory}") |
| for (int j = 0; j < N; ++j) { | ||
| float x = 2.0f * (float)i / (float)N - 1.0f; | ||
| float y = 2.0f * (float)j / (float)N - 1.0f; | ||
| #endif // USE_MPI | ||
| float x = 2.0f * (float)i / (float)N - 1.0f; | ||
| U[i * N + j] = 1.0f / (1.0f + x * x + y * y); // Initial guess |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In MPI mode you allocate arrays with an extra halo (ny = N + 2) and initialize() loops j=1..N, but indexing still uses U[i * N + j] (stride N). This doesn’t match the allocated leading dimension and will overlap rows / misplace the intended halo. Use a leading-dimension variable (e.g., ld = USE_MPI ? (N + 2) : N) and index as U[i * ld + j] (and initialize halo cells intentionally).
| self.generate_machine_specs() | ||
|
|
||
| self.sanitize() |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__init__() now calls self.generate_machine_specs() unconditionally. generate_machine_specs() executes ROCm/amd-smi probing when sysinfo is not provided, which makes offline analyze workflows (analyzing an existing workload on a non-ROCm host) fail early. Consider deferring machine-spec generation until profile mode, or using workload sysinfo.csv during analyze mode instead of probing local hardware.
| HIP_CHECK(hipMemcpy(h_send_left[i], &d_U_olds[i][n_sizes[i]], | ||
| n_sizes[i] * sizeof(float), hipMemcpyDeviceToHost)); | ||
| HIP_CHECK(hipMemcpy(h_send_right[i], &d_U_olds[i][n_sizes[i] * (n_sizes[i] + 1)], | ||
| n_sizes[i] * sizeof(float), hipMemcpyDeviceToHost)); | ||
| } |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The halo exchange copies contiguous buffers starting at offsets like &d_U_olds[i][n_sizes[i]] and &d_U_olds[i][n_sizes[i] * (n_sizes[i] + 1)]. For a 2D domain partitioned in one dimension, true halo boundaries are typically strided and require packing/unpacking (or hipMemcpy2D) using the correct leading dimension. As written, the exchanged data does not correspond to the stated neighboring-halo regions.
| const int ny_small = n_small + 2; | ||
| const int ny_mid = n_mid + 2; | ||
| const int ny_large = n_large + 2; | ||
| #else | ||
| const int ny_small = n_small; |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In MPI mode you increase ny_* by 2 (halo) and later allocate/copy nx * ny elements, but the kernel launches still pass N and the kernel indexes an N x N domain. Unless the kernel is updated to use the halo/leading dimension, these extra halo elements (and the MPI halo exchanges) won’t influence the stencil computation.
| "iteration_multiplexing_stochastic", | ||
| "noise_clamp", | ||
| "torch_ops", | ||
| "multi_rank", | ||
| ] |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new tests use @pytest.mark.mpi, but mpi is not listed in tool.pytest.ini_options.markers. This will raise PytestUnknownMarkWarning (or fail under strict marker settings). Add "mpi" to the markers list alongside multi_rank.
|
|
||
| @demarcate | ||
| def load_soc_specs(self, sysinfo: Optional[dict] = None) -> None: | ||
| """Load OmniSoC instance for RocProfCompute run""" |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_soc_specs(self, sysinfo=...) no longer uses the sysinfo argument and instead relies on self.__mspec generated from the current host. In analyze mode, run_analysis() passes a sysinfo dict from sysinfo.csv (rocprof_compute_base.py:575), so this can load the wrong SoC (or fail on a machine without ROCm). Restore behavior by regenerating self.__mspec from sysinfo when provided (e.g., generate_machine_specs(self.__args, sysinfo)) before selecting the SoC module.
| """Load OmniSoC instance for RocProfCompute run""" | |
| """Load OmniSoC instance for RocProfCompute run | |
| If sysinfo is provided (e.g., in analyze mode from sysinfo.csv), | |
| regenerate the MachineSpecs from that data instead of using the | |
| current host's characteristics. | |
| """ | |
| if sysinfo is not None: | |
| # Regenerate machine specs based on the provided sysinfo rather than | |
| # the current host. This is important for analyze mode where we may | |
| # be running on a different machine than the one that produced the data. | |
| self.__mspec = generate_machine_specs(self.__args, sysinfo) |
Motivation
Technical Details
JIRA ID
AIPROFCOMP-132
Test Plan
Test Result
Submission Checklist