
Conversation

@amd-wsung102

@amd-wsung102 amd-wsung102 commented Dec 18, 2025

This PR optimizes the performance of index_select_scalar_cumsum_kernel on ROCm using the following strategies.

  • Implements vectorized gather and store, processing 4, 2, or 1 indices per thread depending on the indices length, improving memory coalescing and bandwidth.
  • Adds a fast path for single-block workloads: a simple block scan followed by an early return.
  • Uses a shared block prefix for the multi-block cumulative sum.
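The width-selection logic described in the first bullet can be sketched in host code. This is an illustrative policy only, assuming divisibility of the index count is the criterion; the function name and the exact rule are not from the PR:

```cpp
#include <cstdint>

// Pick how many indices each thread processes. Wider vectors improve memory
// coalescing and bandwidth, but the vector path only applies when the index
// count divides evenly by the width (on device, alignment also matters).
inline int pick_entries_per_thread(int64_t num_indices) {
  if (num_indices % 4 == 0) return 4;
  if (num_indices % 2 == 0) return 2;
  return 1;
}
```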

There are code changes in 2 files:

  • keyed_jagged_index_select_dim1.cu
  • inclusive_sum_scan.cuh

ROCm-specific changes are guarded with #ifdef USE_ROCM.
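The guard pattern keeps the existing CUDA path untouched. A minimal self-contained illustration (the macro is defined locally only so the sketch compiles; the function and return values are placeholders, not the PR's code):

```cpp
#define USE_ROCM  // defined here only so this sketch is self-contained

// Selects the ROCm-optimized path at compile time; the CUDA branch is the
// original, unchanged code.
int select_cumsum_path() {
#ifdef USE_ROCM
  // ROCm-specific path: vectorized gather/store plus block-prefix scan.
  return 1;
#else
  // Original CUDA path, left as-is.
  return 0;
#endif
}
```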

@meta-cla

meta-cla bot commented Dec 18, 2025

Hi @amd-wsung102!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla

meta-cla bot commented Dec 18, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the cla signed label Dec 18, 2025
@meta-codesync
Contributor

meta-codesync bot commented Dec 19, 2025

@q10 has imported this pull request. If you are a Meta employee, you can view this in D89525621.

PTA_B(indices, index_t, 1, 32),
num_batches,
batch_size,
grid_size == 0
Contributor

what case is grid size = 0? grid size cannot be 0 right?

Contributor

if we have no compute on GPU, we should do early return. grid size should not be 0.

Author

Hi @spcyppt, thank you for reviewing this PR. You are correct that grid size cannot be 0. This has been fixed in commit 8fe70ea, which also adds an early return, as you suggested.
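The early-return fix being discussed can be sketched as follows. This is a simplified host-side analogue with illustrative names and an assumed block capacity, not the PR's actual code:

```cpp
#include <cstdint>

constexpr int kEntriesPerBlock = 256;  // assumed per-block capacity for the sketch

// If there is nothing to compute, report a zero grid so the caller can return
// early instead of launching a kernel with grid size 0.
inline int compute_grid_size(int64_t num_entries) {
  if (num_entries == 0) {
    return 0;  // caller returns early; no kernel launch happens
  }
  return static_cast<int>(
      (num_entries + kEntriesPerBlock - 1) / kEntriesPerBlock);
}
```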

: nullptr);
});
});
auto dispatch_cumsum = [&](auto vec_tag, auto grid_calc) {
Contributor

This could just be a function or macro, passing just numbers. why do we need to pass functions here?

Author

Here dispatch_cumsum avoids duplicating the kernel launch code on ROCm. Because the kernel template needs compile-time values, three different instantiations are required (as seen in the switch cases); without dispatch_cumsum, the kernel launch code would have to be written out three times, once per instantiation.
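The pattern this reply describes, mapping a runtime width onto a compile-time template argument through a generic lambda so the launch code is written once, can be sketched as a simplified host-side analogue (run<N> stands in for one kernel instantiation; all names are illustrative):

```cpp
#include <type_traits>
#include <utility>

// Each instantiation of run<N> stands in for one kernel instantiation.
template <int N>
int run() { return N * 10; }

// dispatch converts the runtime width into a compile-time tag and invokes the
// callable with it, so the "launch" code exists in exactly one place.
template <typename F>
int dispatch(int width, F&& launch) {
  switch (width) {
    case 4: return std::forward<F>(launch)(std::integral_constant<int, 4>{});
    case 2: return std::forward<F>(launch)(std::integral_constant<int, 2>{});
    default: return std::forward<F>(launch)(std::integral_constant<int, 1>{});
  }
}

// The single launch site: the generic lambda recovers the compile-time value
// from the tag type, selecting the right instantiation.
inline int launch_for(int width) {
  return dispatch(width, [](auto tag) { return run<decltype(tag)::value>(); });
}
```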

#else
const int entries_per_thread = 1;
const int ENTRIES_PER_BLOCK = MAX_CUMSUM_ENTRIES_PER_BLOCK;
auto grid_size = cuda_calc_xblock_count(
Contributor

based on the changes below, where is this used?

Author

Commit 200de28 fixes this: the variables entries_per_thread and ENTRIES_PER_BLOCK are now used only in the ROCm path. The commit also gives a cleaner separation between the ROCm and CUDA paths.


#ifdef USE_ROCM
auto rocm_grid = [&](int entries_per_block) {
return (num_output_lengths + entries_per_block - 1) / entries_per_block;
Contributor

add proper assert, entries_per_block cannot be 0.

Author

Commit 200de28 has removed the code highlighted here.
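Although the highlighted code was removed, a guarded version of that ceiling division, along the lines the reviewer asked for, might look like this (illustrative names; not code from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for grid sizing. entries_per_block must be positive:
// 0 would divide by zero, and a negative value is meaningless here.
inline int64_t ceil_div_grid(int64_t num_output_lengths, int entries_per_block) {
  assert(entries_per_block > 0 && "entries_per_block cannot be 0");
  return (num_output_lengths + entries_per_block - 1) / entries_per_block;
}
```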

constexpr int ENTRIES_PER_THREAD = decltype(vec_tag)::value;
constexpr int ENTRIES_PER_BLOCK =
MAX_CUMSUM_ENTRIES_PER_BLOCK * ENTRIES_PER_THREAD;
const auto grid_size = grid_calc(ENTRIES_PER_BLOCK);
Contributor

can we change the variable name here to avoid confusion? line 276 already defines grid_size. Where does grid_size in line 276 get used?

Author

Commit af5ccbc renames grid_size to rocm_grid_size wherever it is used within the ROCm path. Now rocm_grid_size appears only in ROCm-specific code, while grid_size is used in code shared by both ROCm and CUDA. Hopefully this provides better clarity and reduces confusion.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Jan 28, 2026
…5263)

Summary:
X-link: facebookresearch/FBGEMM#2317

This PR optimizes the performance of `index_select_scalar_cumsum_kernel` on ROCm using the following strategies.

- Implements vectorized gather and store by processing 4, 2, or 1 indices per thread based on indices length, improving memory coalescing and bandwidth.
- Adds a fast path for single-block workloads: a simple block scan followed by an early return.
- Uses a shared block prefix for multi-block cumulative sum.

There are code changes in 2 files:
- `keyed_jagged_index_select_dim1.cu`
- `inclusive_sum_scan.cuh`

ROCm-specific changes are guarded with `#ifdef USE_ROCM`.


Test Plan:
```
tests=(
  "keyed-jagged-index-select-dim1-forward-bench"
  "sweep-output-batch-sizes"
  "sweep-num-output-lengths"
)

# Update as needed
platform=amd/mi350

cd ~/fbsource/fbcode/ai_codesign/nonprod/bensonma415/scripts/amd

for test in "${tests[@]}"; do
    bash run_benchmark.sh "${platform}" "${test}" 2>&1 | pastry -t "${platform}/${test}"
done
```

https://docs.google.com/document/d/1I1B8fEBVyn-DbbdBvjo9G7nm7TAxMR1XRYTPX8fE3b0/edit?tab=t.0

Reviewed By: echen4096

Differential Revision: D89525621

Pulled By: q10
q10 pushed a commit to q10/FBGEMM that referenced this pull request Jan 28, 2026
…5353)

Summary:
Pull Request resolved: pytorch#5353

X-link: https://github.com/facebookresearch/FBGEMM/pull/2317

Pull Request resolved: pytorch#5263

Reviewed By: echen4096

Differential Revision: D89525621

Pulled By: q10
@meta-codesync meta-codesync bot closed this in e350418 Jan 28, 2026
@meta-codesync
Contributor

meta-codesync bot commented Jan 28, 2026

@q10 merged this pull request in e350418.
