Optimizations for index_select_scalar_cumsum_kernel on ROCm #5263
Conversation
Hi @amd-wsung102! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
```
    PTA_B(indices, index_t, 1, 32),
    num_batches,
    batch_size,
    grid_size == 0
```
What case is grid size = 0? Grid size cannot be 0, right?
If there is no compute to do on the GPU, we should do an early return. Grid size should not be 0.
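A minimal sketch of that early-return pattern (hypothetical wrapper and names, not the actual diff):

```cuda
#include <cstdint>

// Hypothetical host-side wrapper: bail out before computing any grid
// size, so the kernel is never launched with grid size 0.
void launch_index_select_scalar_cumsum(std::int64_t num_output_lengths) {
  if (num_output_lengths == 0) {
    return;  // nothing to compute on the GPU: early return
  }
  // ... compute grid_size (guaranteed > 0 here) and launch the kernel ...
}
```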
```
        : nullptr);
    });
  });
  auto dispatch_cumsum = [&](auto vec_tag, auto grid_calc) {
```
This could just be a function or macro, passing just numbers. Why do we need to pass functions here?
Here dispatch_cumsum is used to avoid duplicating the kernel launch code for ROCm. Since the kernel template needs compile-time values, 3 different instantiations are needed (as seen in the switch-case statements). Without dispatch_cumsum, the kernel launch code would have to be duplicated 3 times, once per instantiation.
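As a rough illustration of that pattern (hypothetical names, simplified to a single tag argument rather than the actual launch code): the lambda holds the shared launch logic once, and the tag carries the compile-time entries-per-thread value into the template instantiation.

```cuda
#include <cstdio>
#include <type_traits>

// Stand-in for the kernel launch: the template parameter must be known
// at compile time.
template <int ENTRIES_PER_THREAD>
void launch_cumsum_kernel(int num_output_lengths) {
  std::printf("ENTRIES_PER_THREAD=%d, n=%d\n",
              ENTRIES_PER_THREAD, num_output_lengths);
}

void dispatch_example(int num_output_lengths, int indices_len) {
  // Single copy of the launch code; only the instantiation varies.
  auto dispatch_cumsum = [&](auto vec_tag) {
    constexpr int ENTRIES_PER_THREAD = decltype(vec_tag)::value;
    launch_cumsum_kernel<ENTRIES_PER_THREAD>(num_output_lengths);
  };

  // Runtime choice of vector width -> three distinct instantiations,
  // without duplicating the launch code three times.
  if (indices_len % 4 == 0) {
    dispatch_cumsum(std::integral_constant<int, 4>{});
  } else if (indices_len % 2 == 0) {
    dispatch_cumsum(std::integral_constant<int, 2>{});
  } else {
    dispatch_cumsum(std::integral_constant<int, 1>{});
  }
}
```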
```
#else
  const int entries_per_thread = 1;
  const int ENTRIES_PER_BLOCK = MAX_CUMSUM_ENTRIES_PER_BLOCK;
  auto grid_size = cuda_calc_xblock_count(
```
Based on the changes below, where is this used?
Commit 200de28 fixed this issue. The variables entries_per_thread and ENTRIES_PER_BLOCK are now used only in the ROCm path. The commit also gives a cleaner code separation between the ROCm and CUDA paths.
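A minimal sketch of that kind of separation (the constant value and the helper are assumptions for illustration, not the actual FBGEMM code):

```cuda
#include <cstdint>

// Assumed block size for illustration only.
constexpr std::int64_t MAX_CUMSUM_ENTRIES_PER_BLOCK = 256;

// Ceil-division helper standing in for the grid-size calculation.
inline std::int64_t xblock_count(std::int64_t n, std::int64_t per_block) {
  return (n + per_block - 1) / per_block;
}

std::int64_t compute_grid_size(std::int64_t num_output_lengths,
                               int entries_per_thread) {
#ifdef USE_ROCM
  // ROCm path: each thread covers entries_per_thread entries, so a
  // block covers proportionally more of the output.
  const std::int64_t entries_per_block =
      MAX_CUMSUM_ENTRIES_PER_BLOCK * entries_per_thread;
  return xblock_count(num_output_lengths, entries_per_block);
#else
  // CUDA path: unchanged, one entry per thread.
  (void)entries_per_thread;  // only used on ROCm
  return xblock_count(num_output_lengths, MAX_CUMSUM_ENTRIES_PER_BLOCK);
#endif
}
```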
```
#ifdef USE_ROCM
  auto rocm_grid = [&](int entries_per_block) {
    return (num_output_lengths + entries_per_block - 1) / entries_per_block;
```
Add a proper assert; entries_per_block cannot be 0.
Commit 200de28 removed the code you highlighted here.
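For reference, a guarded version of such a helper could look like the sketch below (illustrative only, since the lambda in question was removed):

```cuda
#include <cassert>
#include <cstdint>

// Ceil-division grid helper with the requested guard: a zero
// entries_per_block would make the division meaningless.
inline std::int64_t rocm_grid(std::int64_t num_output_lengths,
                              int entries_per_block) {
  assert(entries_per_block > 0 && "entries_per_block cannot be 0");
  return (num_output_lengths + entries_per_block - 1) / entries_per_block;
}
```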
```
  constexpr int ENTRIES_PER_THREAD = decltype(vec_tag)::value;
  constexpr int ENTRIES_PER_BLOCK =
      MAX_CUMSUM_ENTRIES_PER_BLOCK * ENTRIES_PER_THREAD;
  const auto grid_size = grid_calc(ENTRIES_PER_BLOCK);
```
Can we change the variable name here to avoid confusion? Line 276 already defines grid_size. Where does the grid_size from line 276 get used?
Commit af5ccbc renamed grid_size to rocm_grid_size for the variables used within the ROCm path. Now rocm_grid_size is used only in ROCm-specific code, and grid_size is used in places not specific to either ROCm or CUDA. Hopefully this provides better clarity and reduces confusion.
…5353)

Summary:
Pull Request resolved: pytorch#5353
X-link: https://github.com/facebookresearch/FBGEMM/pull/2317

This PR optimizes the performance of `index_select_scalar_cumsum_kernel` on ROCm using the following strategies.

- Implements vectorized gather and store by processing 4, 2, or 1 indices per thread based on indices length, improving memory coalescing and bandwidth.
- Uses a fast path for single-block workloads, which uses a simple block scan and early return.
- Uses a shared block prefix for multi-block cumulative sum.

There are code changes in 2 files:

- `keyed_jagged_index_select_dim1.cu`
- `inclusive_sum_scan.cuh`

ROCm-specific changes are guarded with `#ifdef USE_ROCM`.

Pull Request resolved: pytorch#5263

Test Plan:

```
tests=(
  "keyed-jagged-index-select-dim1-forward-bench"
  "sweep-output-batch-sizes"
  "sweep-num-output-lengths"
)  # Update as needed
platform=amd/mi350
cd ~/fbsource/fbcode/ai_codesign/nonprod/bensonma415/scripts/amd
for test in "${tests[@]}"; do
  bash run_benchmark.sh "${platform}" "${test}" 2>&1 | pastry -t "${platform}/${test}"
done
```

https://docs.google.com/document/d/1I1B8fEBVyn-DbbdBvjo9G7nm7TAxMR1XRYTPX8fE3b0/edit?tab=t.0

Reviewed By: echen4096
Differential Revision: D89525621
Pulled By: q10
This PR optimizes the performance of `index_select_scalar_cumsum_kernel` on ROCm using the following strategies.

- Implements vectorized gather and store by processing 4, 2, or 1 indices per thread based on indices length, improving memory coalescing and bandwidth.
- Uses a fast path for single-block workloads, which uses a simple block scan and early return.
- Uses a shared block prefix for multi-block cumulative sum.

There are code changes in 2 files:

- `keyed_jagged_index_select_dim1.cu`
- `inclusive_sum_scan.cuh`

ROCm-specific changes are guarded with `#ifdef USE_ROCM`.
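To make the vectorization strategy concrete, here is a simplified, hypothetical CUDA/HIP-style sketch (the kernel, launcher, and selection heuristic are illustrative assumptions, not the actual FBGEMM code): each thread gathers up to ENTRIES_PER_THREAD consecutive entries, and the host picks 4, 2, or 1 entries per thread based on the indices length.

```cuda
#include <cuda_runtime.h>  // <hip/hip_runtime.h> on ROCm

#include <cstdint>
#include <type_traits>

// Hypothetical gather kernel: each thread handles ENTRIES_PER_THREAD
// consecutive output positions, which keeps loads and stores coalesced.
template <typename index_t, typename length_t, int ENTRIES_PER_THREAD>
__global__ void gather_lengths_kernel(
    const index_t* __restrict__ indices,
    const length_t* __restrict__ lengths,
    length_t* __restrict__ gathered,
    std::int64_t num_output_lengths) {
  const std::int64_t base =
      (static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x) *
      ENTRIES_PER_THREAD;
#pragma unroll
  for (int i = 0; i < ENTRIES_PER_THREAD; ++i) {
    const std::int64_t pos = base + i;
    if (pos < num_output_lengths) {
      gathered[pos] = lengths[indices[pos]];
    }
  }
}

// Hypothetical host-side selection of the vector width, mirroring the
// "4, 2, or 1 indices per thread" idea; the real heuristic may differ.
template <typename index_t, typename length_t>
void launch_gather(const index_t* indices, const length_t* lengths,
                   length_t* gathered, std::int64_t n) {
  if (n == 0) {
    return;  // early return: never launch with grid size 0
  }
  constexpr int kThreads = 256;
  auto launch = [&](auto tag) {
    constexpr int E = decltype(tag)::value;
    const auto grid = static_cast<unsigned int>(
        (n + static_cast<std::int64_t>(kThreads) * E - 1) / (kThreads * E));
    gather_lengths_kernel<index_t, length_t, E>
        <<<grid, kThreads>>>(indices, lengths, gathered, n);
  };
  if (n % 4 == 0) {
    launch(std::integral_constant<int, 4>{});
  } else if (n % 2 == 0) {
    launch(std::integral_constant<int, 2>{});
  } else {
    launch(std::integral_constant<int, 1>{});
  }
}
```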