
Conversation

@amd-wsung102

@amd-wsung102 amd-wsung102 commented Dec 18, 2025

This PR optimizes the performance of index_select_scalar_cumsum_kernel on ROCm using the following strategies.

  • Implements vectorized gather and store, processing 4, 2, or 1 indices per thread depending on the indices length, improving memory coalescing and bandwidth.
  • Adds a fast path for single-block workloads: a simple block scan followed by an early return.
  • Uses a shared block prefix for the multi-block cumulative sum.
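The width-selection logic described in the first bullet can be sketched in host code. This is an illustrative policy only, assuming divisibility of the index count is the criterion; the function name and the exact rule are not from the PR:

```cpp
#include <cstdint>

// Pick how many indices each thread processes. Wider vectors improve memory
// coalescing and bandwidth, but the vector path only applies when the index
// count divides evenly by the width (on device, alignment also matters).
inline int pick_entries_per_thread(int64_t num_indices) {
  if (num_indices % 4 == 0) return 4;
  if (num_indices % 2 == 0) return 2;
  return 1;
}
```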

There are code changes in 2 files:

  • keyed_jagged_index_select_dim1.cu
  • inclusive_sum_scan.cuh

ROCm-specific changes are guarded with #ifdef USE_ROCM.
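The guard pattern keeps the existing CUDA path untouched. A minimal self-contained illustration (the macro is defined locally only so the sketch compiles; the function and return values are placeholders, not the PR's code):

```cpp
#define USE_ROCM  // defined here only so this sketch is self-contained

// Selects the ROCm-optimized path at compile time; the CUDA branch is the
// original, unchanged code.
int select_cumsum_path() {
#ifdef USE_ROCM
  // ROCm-specific path: vectorized gather/store plus block-prefix scan.
  return 1;
#else
  // Original CUDA path, left as-is.
  return 0;
#endif
}
```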

@meta-cla

meta-cla bot commented Dec 18, 2025

Hi @amd-wsung102!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla

meta-cla bot commented Dec 18, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the cla signed label Dec 18, 2025
@meta-codesync
Contributor

meta-codesync bot commented Dec 19, 2025

@q10 has imported this pull request. If you are a Meta employee, you can view this in D89525621.

PTA_B(indices, index_t, 1, 32),
num_batches,
batch_size,
grid_size == 0
Contributor

what case is grid size = 0? grid size cannot be 0 right?

Contributor

if we have no compute on GPU, we should do early return. grid size should not be 0.

Author

Hi @spcyppt, thank you for reviewing this PR. You are correct that grid size cannot be 0. This has been fixed in commit 8fe70ea, which also adds an early return, as you suggested.
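The early-return fix being discussed can be sketched as follows. This is a simplified host-side analogue with illustrative names and an assumed block capacity, not the PR's actual code:

```cpp
#include <cstdint>

constexpr int kEntriesPerBlock = 256;  // assumed per-block capacity for the sketch

// If there is nothing to compute, report a zero grid so the caller can return
// early instead of launching a kernel with grid size 0.
inline int compute_grid_size(int64_t num_entries) {
  if (num_entries == 0) {
    return 0;  // caller returns early; no kernel launch happens
  }
  return static_cast<int>(
      (num_entries + kEntriesPerBlock - 1) / kEntriesPerBlock);
}
```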

: nullptr);
});
});
auto dispatch_cumsum = [&](auto vec_tag, auto grid_calc) {
Contributor

This could just be a function or macro, passing just numbers. why do we need to pass functions here?

Author

Here dispatch_cumsum avoids duplicating the kernel launch code on ROCm. Because the kernel template needs compile-time values, three different instantiations are required (as seen in the switch cases); without dispatch_cumsum, the kernel launch code would have to be written out three times, once per instantiation.
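The pattern this reply describes, mapping a runtime width onto a compile-time template argument through a generic lambda so the launch code is written once, can be sketched as a simplified host-side analogue (run<N> stands in for one kernel instantiation; all names are illustrative):

```cpp
#include <type_traits>
#include <utility>

// Each instantiation of run<N> stands in for one kernel instantiation.
template <int N>
int run() { return N * 10; }

// dispatch converts the runtime width into a compile-time tag and invokes the
// callable with it, so the "launch" code exists in exactly one place.
template <typename F>
int dispatch(int width, F&& launch) {
  switch (width) {
    case 4: return std::forward<F>(launch)(std::integral_constant<int, 4>{});
    case 2: return std::forward<F>(launch)(std::integral_constant<int, 2>{});
    default: return std::forward<F>(launch)(std::integral_constant<int, 1>{});
  }
}

// The single launch site: the generic lambda recovers the compile-time value
// from the tag type, selecting the right instantiation.
inline int launch_for(int width) {
  return dispatch(width, [](auto tag) { return run<decltype(tag)::value>(); });
}
```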

#else
const int entries_per_thread = 1;
const int ENTRIES_PER_BLOCK = MAX_CUMSUM_ENTRIES_PER_BLOCK;
auto grid_size = cuda_calc_xblock_count(
Contributor

based on the changes below, where is this used?

Author

Commit 200de28 fixes this: the variables entries_per_thread and ENTRIES_PER_BLOCK are now used only in the ROCm path. The commit also gives a cleaner separation between the ROCm and CUDA paths.


#ifdef USE_ROCM
auto rocm_grid = [&](int entries_per_block) {
return (num_output_lengths + entries_per_block - 1) / entries_per_block;
Contributor

add proper assert, entries_per_block cannot be 0.

Author

Commit 200de28 has removed the code highlighted here.
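Although the highlighted code was removed, a guarded version of that ceiling division, along the lines the reviewer asked for, might look like this (illustrative names; not code from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for grid sizing. entries_per_block must be positive:
// 0 would divide by zero, and a negative value is meaningless here.
inline int64_t ceil_div_grid(int64_t num_output_lengths, int entries_per_block) {
  assert(entries_per_block > 0 && "entries_per_block cannot be 0");
  return (num_output_lengths + entries_per_block - 1) / entries_per_block;
}
```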

constexpr int ENTRIES_PER_THREAD = decltype(vec_tag)::value;
constexpr int ENTRIES_PER_BLOCK =
MAX_CUMSUM_ENTRIES_PER_BLOCK * ENTRIES_PER_THREAD;
const auto grid_size = grid_calc(ENTRIES_PER_BLOCK);
Contributor

can we change the variable name here to avoid confusion? line 276 already defines grid_size. Where does grid_size in line 276 get used?

Author

Commit af5ccbc renames grid_size to rocm_grid_size wherever it is used within the ROCm path. Now rocm_grid_size appears only in ROCm-specific code, while grid_size is used in code shared by both ROCm and CUDA. Hopefully this provides better clarity and reduces confusion.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Jan 28, 2026
…5263)

Summary:
X-link: facebookresearch/FBGEMM#2317

This PR optimizes the performance of `index_select_scalar_cumsum_kernel` on ROCm using the following strategies.

- Implements vectorized gather and store by processing 4, 2, or 1 indices per thread based on indices length, improving memory coalescing and bandwidth.
- Adds a fast path for single-block workloads: a simple block scan followed by an early return.
- Uses a shared block prefix for multi-block cumulative sum.

There are code changes in 2 files:
- `keyed_jagged_index_select_dim1.cu`
- `inclusive_sum_scan.cuh`

ROCm-specific changes are guarded with `#ifdef USE_ROCM`.


Test Plan:
```
tests=(
  "keyed-jagged-index-select-dim1-forward-bench"
  "sweep-output-batch-sizes"
  "sweep-num-output-lengths"
)

# Update as needed
platform=amd/mi350

cd ~/fbsource/fbcode/ai_codesign/nonprod/bensonma415/scripts/amd

for test in "${tests[@]}"; do
    bash run_benchmark.sh "${platform}" "${test}" 2>&1 | pastry -t "${platform}/${test}"
done
```

https://docs.google.com/document/d/1I1B8fEBVyn-DbbdBvjo9G7nm7TAxMR1XRYTPX8fE3b0/edit?tab=t.0

Reviewed By: echen4096

Differential Revision: D89525621

Pulled By: q10
q10 pushed a commit to q10/FBGEMM that referenced this pull request Jan 28, 2026
…5353)

Summary:
Pull Request resolved: pytorch#5353

X-link: https://github.com/facebookresearch/FBGEMM/pull/2317

Pull Request resolved: pytorch#5263

Reviewed By: echen4096

Differential Revision: D89525621

Pulled By: q10
@meta-codesync meta-codesync bot closed this in e350418 Jan 28, 2026
@meta-codesync
Contributor

meta-codesync bot commented Jan 28, 2026

@q10 merged this pull request in e350418.
