Skip to content

Conversation

@abchoudh-amd
Copy link
Contributor

@abchoudh-amd abchoudh-amd commented Feb 2, 2026

Motivation

Technical Details

  • Added MPI-aware laplace equation
  • Added test case for MPI aware workload

JIRA ID

AIPROFCOMP-132

Test Plan

  • Run ctests
  • [ X] Run MPI aware workload

Test Result

  • [X ] Ctests passes
  • [X ] MPI aware workload can be profiled

Submission Checklist

@github-actions github-actions bot added documentation Improvements or additions to documentation project: rocprofiler-compute labels Feb 2, 2026
@abchoudh-amd abchoudh-amd force-pushed the users/abchoudh/mpi-communication-aware-bug branch from 183d90e to 7c80599 Compare February 3, 2026 08:12
@abchoudh-amd abchoudh-amd force-pushed the users/abchoudh/mpi-communication-aware-bug branch from c16b11f to c953de3 Compare February 3, 2026 12:57
@abchoudh-amd abchoudh-amd marked this pull request as ready for review February 3, 2026 16:56
@abchoudh-amd abchoudh-amd requested review from a team and prbasyal-amd as code owners February 3, 2026 16:56
Copilot AI review requested due to automatic review settings February 3, 2026 16:56
@abchoudh-amd abchoudh-amd force-pushed the users/abchoudh/mpi-communication-aware-bug branch from d9ac823 to 643c0c2 Compare February 3, 2026 17:02
@abchoudh-amd abchoudh-amd force-pushed the users/abchoudh/mpi-communication-aware-bug branch from 643c0c2 to 05e66ca Compare February 3, 2026 17:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds multi-rank (MPI) awareness to ROCm Compute Profiler outputs and tests, including a new MPI-enabled Laplace sample workload and a new --output-directory mechanism with placeholder substitution.

Changes:

  • Introduces --output-directory (with %hostname%, %gpumodel%, %rank%, %env{...}%) and MPI-rank-based output directory isolation.
  • Adds MPI-enabled Laplace sample build/test path plus new multi-rank pytest coverage.
  • Updates docs/changelog to describe the new output directory behavior and MPI multi-rank profiling.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
projects/rocprofiler-compute/src/rocprof_compute_base.py Implements output-directory placeholder replacement, rank detection integration, and refactors machine spec generation/load flow.
projects/rocprofiler-compute/src/utils/utils.py Adds rank/env placeholder helpers (get_rank, replace_rank, replace_env).
projects/rocprofiler-compute/src/argparser.py Adds --output-directory CLI option and deprecates --path/--subpath in help text.
projects/rocprofiler-compute/src/roofline.py Adds a setter to update roofline workload directory at post-processing time.
projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py Updates roofline post-processing to use the (potentially rewritten) workload directory.
projects/rocprofiler-compute/sample/laplace_eqn.hip Adds optional MPI support and halo-exchange scaffolding for an MPI-aware sample workload.
projects/rocprofiler-compute/tests/conftest.py Updates test harness to use --output-directory and introduces knobs for skipping --name and using default output dir behavior.
projects/rocprofiler-compute/tests/test_profile_general.py Adds output-directory placeholder tests and multi-rank MPI tests (including MPI-aware Laplace case).
projects/rocprofiler-compute/tests/CMakeLists.txt Builds mpi_aware_laplace_eqn when MPI is available.
projects/rocprofiler-compute/CMakeLists.txt Adds an MPI-driven ctest for multi-rank pytest execution.
projects/rocprofiler-compute/requirements-test.txt Adds mpi4py and pytest-mpi as test dependencies.
projects/rocprofiler-compute/pyproject.toml Adds multi_rank pytest marker.
projects/rocprofiler-compute/docs/how-to/use.rst Documents new MPI-rank default output behavior and deprecation of --path/--subpath.
projects/rocprofiler-compute/docs/how-to/profile/mode.rst Adds detailed output-directory configuration and multi-rank profiling documentation.
projects/rocprofiler-compute/CHANGELOG.md Records new --output-directory, MPI rank detection, and deprecations.
Comments suppressed due to low confidence (1)

projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py:738

    def analysis_setup(self, roofline_parameters: Optional[dict[str, Any]]) -> None:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +254 to +258
if(MPI_CXX_COMPILER)
message(STATUS "MPI CXX compiler found: ${MPI_CXX_COMPILER}")
else()
message(FATAL_ERROR "MPI CXX compiler not found. MPI tests will be skipped.")
endif()
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake now hard-fails configuration when MPI isn’t found, but the message says "MPI tests will be skipped". This makes all builds require MPI even when tests aren’t being built/run. Suggest making MPI optional (e.g., find_package(MPI QUIET ...) and only adding MPI tests/targets when found) and/or guarding this behind the project’s test option so non-MPI environments can still configure/build.

Copilot uses AI. Check for mistakes.
Comment on lines +790 to +794
# Check with hostname
hostname = "test_node"
with patch("socket.gethostname", return_value=hostname):
workload_base_dir = test_utils.get_output_dir(param_id="hostname")
workload_dir = os.path.join(workload_base_dir, "%hostname%")
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test’s hostname/gpumodel cases assume no MPI-rank env vars are set; however on HPC schedulers SLURM_PROCID/etc can be present even for non-MPI pytest runs, which would make the output-directory logic append /<rank> and break these assertions. Consider clearing the rank-related env vars at the start of the test (before the hostname/gpumodel checks) to make it environment-independent.

Copilot uses AI. Check for mistakes.
subdirectory named by its rank to avoid output collisions.

Example Usage
-----
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In reStructuredText, the underline length should be at least as long as the section title. Here Example Usage is underlined with only 5 dashes (-----), which can trigger Sphinx/RST formatting warnings/errors. Use an underline at least the length of the title (e.g., -------------) or switch to a consistent subsection adornment style used elsewhere in the doc.

Suggested change
-----
-------------

Copilot uses AI. Check for mistakes.
Comment on lines +185 to +187
print(f"Path: {self.__args.path}")
print(f"Name: {self.__args.name}")
print(f"Output_directory: {self.__args.output_directory}")
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The print(f"Path/Name/Output_directory...") lines look like leftover debug statements and can pollute CLI output and test logs. Prefer console_debug(...) (or remove entirely) so user-facing output stays clean, especially when followed by console_error/console_warning.

Suggested change
print(f"Path: {self.__args.path}")
print(f"Name: {self.__args.name}")
print(f"Output_directory: {self.__args.output_directory}")
console_debug(f"Path: {self.__args.path}")
console_debug(f"Name: {self.__args.name}")
console_debug(f"Output_directory: {self.__args.output_directory}")

Copilot uses AI. Check for mistakes.
Comment on lines 83 to 87
for (int j = 0; j < N; ++j) {
float x = 2.0f * (float)i / (float)N - 1.0f;
float y = 2.0f * (float)j / (float)N - 1.0f;
#endif // USE_MPI
float x = 2.0f * (float)i / (float)N - 1.0f;
U[i * N + j] = 1.0f / (1.0f + x * x + y * y); // Initial guess
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In MPI mode you allocate arrays with an extra halo (ny = N + 2) and initialize() loops j=1..N, but indexing still uses U[i * N + j] (stride N). This doesn’t match the allocated leading dimension and will overlap rows / misplace the intended halo. Use a leading-dimension variable (e.g., ld = USE_MPI ? (N + 2) : N) and index as U[i * ld + j] (and initialize halo cells intentionally).

Copilot uses AI. Check for mistakes.
Comment on lines +87 to 89
self.generate_machine_specs()

self.sanitize()
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__init__() now calls self.generate_machine_specs() unconditionally. generate_machine_specs() executes ROCm/amd-smi probing when sysinfo is not provided, which makes offline analyze workflows (analyzing an existing workload on a non-ROCm host) fail early. Consider deferring machine-spec generation until profile mode, or using workload sysinfo.csv during analyze mode instead of probing local hardware.

Copilot uses AI. Check for mistakes.
Comment on lines +319 to +323
HIP_CHECK(hipMemcpy(h_send_left[i], &d_U_olds[i][n_sizes[i]],
n_sizes[i] * sizeof(float), hipMemcpyDeviceToHost));
HIP_CHECK(hipMemcpy(h_send_right[i], &d_U_olds[i][n_sizes[i] * (n_sizes[i] + 1)],
n_sizes[i] * sizeof(float), hipMemcpyDeviceToHost));
}
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The halo exchange copies contiguous buffers starting at offsets like &d_U_olds[i][n_sizes[i]] and &d_U_olds[i][n_sizes[i] * (n_sizes[i] + 1)]. For a 2D domain partitioned in one dimension, true halo boundaries are typically strided and require packing/unpacking (or hipMemcpy2D) using the correct leading dimension. As written, the exchanged data does not correspond to the stated neighboring-halo regions.

Copilot uses AI. Check for mistakes.
Comment on lines +244 to +248
const int ny_small = n_small + 2;
const int ny_mid = n_mid + 2;
const int ny_large = n_large + 2;
#else
const int ny_small = n_small;
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In MPI mode you increase ny_* by 2 (halo) and later allocate/copy nx * ny elements, but the kernel launches still pass N and the kernel indexes an N x N domain. Unless the kernel is updated to use the halo/leading dimension, these extra halo elements (and the MPI halo exchanges) won’t influence the stencil computation.

Copilot uses AI. Check for mistakes.
Comment on lines 112 to 116
"iteration_multiplexing_stochastic",
"noise_clamp",
"torch_ops",
"multi_rank",
]
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests use @pytest.mark.mpi, but mpi is not listed in tool.pytest.ini_options.markers. This will raise PytestUnknownMarkWarning (or fail under strict marker settings). Add "mpi" to the markers list alongside multi_rank.

Copilot uses AI. Check for mistakes.

@demarcate
def load_soc_specs(self, sysinfo: Optional[dict] = None) -> None:
"""Load OmniSoC instance for RocProfCompute run"""
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_soc_specs(self, sysinfo=...) no longer uses the sysinfo argument and instead relies on self.__mspec generated from the current host. In analyze mode, run_analysis() passes a sysinfo dict from sysinfo.csv (rocprof_compute_base.py:575), so this can load the wrong SoC (or fail on a machine without ROCm). Restore behavior by regenerating self.__mspec from sysinfo when provided (e.g., generate_machine_specs(self.__args, sysinfo)) before selecting the SoC module.

Suggested change
"""Load OmniSoC instance for RocProfCompute run"""
"""Load OmniSoC instance for RocProfCompute run
If sysinfo is provided (e.g., in analyze mode from sysinfo.csv),
regenerate the MachineSpecs from that data instead of using the
current host's characteristics.
"""
if sysinfo is not None:
# Regenerate machine specs based on the provided sysinfo rather than
# the current host. This is important for analyze mode where we may
# be running on a different machine than the one that produced the data.
self.__mspec = generate_machine_specs(self.__args, sysinfo)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants