Optimize group_index_select_or_add_2d_kernel on ROCm by adding a separate codepath for small embedding dimensions #5233

aryaman-gupta · 2025-12-16T19:04:00Z

This PR optimizes the ROCm performance of group_index_select_or_add_2d_kernel on tables with small embedding dimensions (i.e., small num_cols).

For tables with small embedding dimensions, the code is refactored to process multiple rows within the same warp. Two files are changed:

1. fbgemm_gpu/src/sparse_ops/sparse_ops_gpu.cpp - The host-side calculation of warp_offsets is changed.
2. fbgemm_gpu/src/sparse_ops/sparse_group_index.cu - group_index_select_or_add_2d_kernel is modified to process multiple rows within a warp for small embedding dimensions.
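To illustrate why the host-side warp_offsets calculation has to change once a warp can hold several rows, here is a minimal host-only sketch. It is not the code from this PR: `TableShape`, `warps_for_table`, and `compute_warp_offsets` are hypothetical names, `cols_per_warp` stands in for whatever value the kernel actually uses, and the sketch ignores the unroll-factor condition discussed in the review below.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one table in the group (not an FBGEMM type).
struct TableShape {
  int64_t num_rows;  // rows processed for this table
  int64_t num_cols;  // embedding dimension
};

// Number of warps one table needs. cols_per_warp is an assumed parameter:
// the number of columns a single warp covers in one step.
inline int64_t warps_for_table(const TableShape& t, int64_t cols_per_warp) {
  assert(t.num_cols > 0 && cols_per_warp > 0);
  if (t.num_cols < cols_per_warp) {
    // Small-dimension path: pack several rows into each warp.
    const auto rows_per_warp = cols_per_warp / t.num_cols;
    return (t.num_rows + rows_per_warp - 1) / rows_per_warp;
  }
  // Default path: every row occupies one or more whole warps.
  const auto warps_per_row = (t.num_cols + cols_per_warp - 1) / cols_per_warp;
  return t.num_rows * warps_per_row;
}

// Exclusive prefix sum over the per-table warp counts: entry i is the first
// warp id assigned to table i, and the last entry is the total warp count.
inline std::vector<int64_t> compute_warp_offsets(
    const std::vector<TableShape>& tables, int64_t cols_per_warp) {
  std::vector<int64_t> offsets{0};
  for (const auto& t : tables) {
    offsets.push_back(offsets.back() + warps_for_table(t, cols_per_warp));
  }
  return offsets;
}
```

With an assumed cols_per_warp of 64, a table with 10 rows of 16 columns needs 3 warps in this sketch instead of 10, because 4 rows fit in each warp.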


meta-codesync · 2025-12-16T21:34:59Z

@q10 has imported this pull request. If you are a Meta employee, you can view this in D89316371.


q10 · 2026-01-14T21:48:24Z

fbgemm_gpu/src/sparse_ops/sparse_group_index.cu

  return GROUP_INDEX_SELECT_COLS_PER_WARP;
 }

+int get_group_index_select_unroll_factor() {


Can we make this constexpr?

aryaman-gupta · 2026-01-21

Not trivially, unfortunately. We would need to move the function definition to sparse_ops.h, where it is declared, and move the GROUP_INDEX_SELECT_UNROLL_FACTOR variable there as well.

Happy to do this refactor if you would like.
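For context, the refactor discussed above could look roughly like the following header-only sketch. It is an assumption about the shape of the change, not code from the PR: the `sketch` namespace and `UnrollTag` are hypothetical, and the value 1 is only a placeholder because the real value of GROUP_INDEX_SELECT_UNROLL_FACTOR is not shown in this thread.

```cpp
// Hypothetical header-only layout; the namespace exists only for this sketch.
namespace sketch {

// Placeholder value. The constant has to be visible in the header so that
// the function below can be evaluated at compile time.
constexpr int GROUP_INDEX_SELECT_UNROLL_FACTOR = 1;

// Defined in the header, so every translation unit sees the body and can
// use the result in constant expressions.
constexpr int get_group_index_select_unroll_factor() {
  return GROUP_INDEX_SELECT_UNROLL_FACTOR;
}

// Example of a compile-time use that a non-constexpr function would not allow.
template <int N>
struct UnrollTag {
  static constexpr int value = N;
};
using DefaultUnrollTag = UnrollTag<get_group_index_select_unroll_factor()>;

static_assert(DefaultUnrollTag::value >= 1, "unroll factor must be positive");

}  // namespace sketch
```

The trade-off is the one raised in the reply: both the constant and the function body move out of the .cu file and into the header.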

q10 · 2026-01-14T21:48:41Z

fbgemm_gpu/src/sparse_ops/sparse_group_index.cu

+      if (num_cols < COLS_PER_WARP && num_cols >= UNROLL_FACTOR) {
+        // Need to ensure that [member_id] and [member_warp_id] are calculated correctly
+        // for the small embedding dimension path below
+        int rows_per_warp = COLS_PER_WARP / num_cols;


Consider using const and auto.

aryaman-gupta · 2026-01-21

Done in bbdc17d.

I left input, output, indices, and idx as they were previously defined.
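The condition and the rows_per_warp computation in the diff above can be illustrated with a small host-side model of how lanes in a warp might map to (row, column) pairs. This is a sketch under stated assumptions, not the kernel's actual indexing: `LaneAssignment`, `use_small_dim_path`, and `assign_lane` are hypothetical names, and the mapping assumes each lane handles exactly one column (an unroll factor of 1).

```cpp
#include <cassert>

// Hypothetical result type for this sketch.
struct LaneAssignment {
  bool active;      // false when the lane falls past the last packed row
  int row_in_warp;  // which of the rows packed into this warp
  int col;          // column within that row
};

// Mirrors the condition in the diff: the small-dimension path applies only
// when a row is narrower than a warp's column span but at least as wide as
// the unroll factor.
inline bool use_small_dim_path(int num_cols, int cols_per_warp, int unroll_factor) {
  return num_cols < cols_per_warp && num_cols >= unroll_factor;
}

// Assumes one column per lane, so cols_per_warp equals the number of lanes.
inline LaneAssignment assign_lane(int lane_id, int num_cols, int cols_per_warp) {
  assert(use_small_dim_path(num_cols, cols_per_warp, /*unroll_factor=*/1));
  const auto rows_per_warp = cols_per_warp / num_cols;
  const auto row_in_warp = lane_id / num_cols;
  const auto col = lane_id % num_cols;
  // Lanes beyond rows_per_warp * num_cols have no complete row to work on.
  const bool active = row_in_warp < rows_per_warp;
  return LaneAssignment{active, row_in_warp, col};
}
```

With 64 lanes and num_cols of 24, two rows fit in a warp in this model: lanes 0 to 23 cover the first row, lanes 24 to 47 the second, and lanes 48 to 63 stay idle.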


…rate codepath for small embedding dimensions (pytorch#5233)

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2340

This PR optimizes the performance on ROCm of the `group_index_select_or_add_2d_kernel` kernel on tables with small embedding dimensions (i.e., `num_cols`).

For tables with small embedding dimensions, the code is refactored to process multiple rows within the same warp. Two files are changed:

1. `fbgemm_gpu/src/sparse_ops/sparse_ops_gpu.cpp` - The calculation of the `warp_offsets` is changed in the host-side code.
2. `fbgemm_gpu/src/sparse_ops/sparse_group_index.cu` - The `group_index_select_or_add_2d_kernel` kernel is modified to process multiple rows within a warp for small embedding dimensions.

Pull Request resolved: pytorch#5233

Test Plan:

```
cd ~/fbsource/fbcode/ai_codesign/nonprod/bensonma415/scripts/D89316371
bash run_benchmark.sh amd/mi300 2>&1 | pastry
```

https://docs.google.com/document/d/12ywiAQhA3eZqcIwUyc8_CQinwZJy_WDD8B4_43G8Gr8/edit?tab=t.0#heading=h.5row1qfol66k

Reviewed By: echen4096

Differential Revision: D89316371

Pulled By: q10

fbshipit-source-id: 2742965773c92ff96419fa5978c93ca6d23dbed4

aryaman-gupta added 3 commits December 12, 2025 15:09

adds optimized path for small dimension sizes to group_index_select_or_add_2d_kernel

85caa29

sparse_group_index.cu: edits some comments

ff1b9b6

adds USE_ROCM guards to subwarp optimizations for group_index_select_or_add_2d_kernel

439a51a

pytorch-bot bot added the module: rocm label Dec 16, 2025

meta-cla bot added the cla signed label Dec 16, 2025

aryaman-gupta added 3 commits December 18, 2025 10:11

sparse_group_index: handle UNROLL_FACTOR for small dimensions in group_index_select_or_add_2d_kernel

2a85d73

sparse_group_index: handle fixed-column-size case correctly in optimized small embedding dims path

2f54140

group_index_select_or_add_2d_kernel: when num_cols < UNROLL_FACTOR, disable optimized smallEmbD path

e0edc40

avbokovoy mentioned this pull request Jan 6, 2026

group_index_select_or_add fwd/bwd optimizations ROCm/FBGEMM#139

Closed

q10 reviewed Jan 20, 2026


sparse_group_index: use const auto where possible

bbdc17d
