[MXFP8 MoE] mx_block_rearrange_2d_M_groups_cuda fails for models with > 32 experts #4163

@yzhautouskay

Description

Problem

mx_block_rearrange_2d_M_groups_cuda hard-checks num_groups <= 32 (mxfp8_extension.cpp:200). In MoE, num_groups equals num_experts (the offsets tensor has one entry per expert), so models with more than 32 experts crash immediately:

 RuntimeError: num_groups must be <= 32

There used to be a use_cuda_kernel_for_blocked_layout flag that allowed falling back to the Triton kernel, but it was removed. In v0.16, KernelPreference.AUTO provides no per-op fallback (it's all-CUDA or all-emulated), so there's no way to selectively use the Triton kernel for this one op.
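To illustrate the failure mode, here is a minimal Python sketch of the guard (names simplified; the real check lives in C++ at mxfp8_extension.cpp:200). Since the offsets tensor has one entry per expert, num_groups tracks num_experts directly:

```python
# Sketch of the hard check in mx_block_rearrange_2d_M_groups_cuda
# (simplified; the real guard is in C++). One offset entry per expert,
# so num_groups == num_experts.
CUDA_MAX_GROUPS = 32

def check_num_groups(offsets):
    num_groups = len(offsets)
    if num_groups > CUDA_MAX_GROUPS:
        raise RuntimeError("num_groups must be <= 32")
    return num_groups
```

A 64-expert model produces 64 offset entries and trips this check before any work is done.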

Suggested fix

It would be great to auto-fall back to triton_mx_block_rearrange_2d_M_groups (which handles arbitrary group counts) when num_groups > 32, instead of hard-failing.
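The proposed behavior could look roughly like the wrapper below. This is a hypothetical sketch, not torchao's actual API: the kernel arguments are stand-ins, and the real dispatch would live inside the op itself.

```python
# Hypothetical fallback dispatch (not torchao's actual API): use the CUDA
# kernel when the group count fits its limit, otherwise fall back to the
# Triton kernel, which handles arbitrary group counts.
CUDA_MAX_GROUPS = 32

def rearrange_with_fallback(tensor, offsets, cuda_kernel, triton_kernel):
    num_groups = len(offsets)  # one offset entry per expert
    if num_groups <= CUDA_MAX_GROUPS:
        return cuda_kernel(tensor, offsets)
    return triton_kernel(tensor, offsets)
```

This keeps the fast CUDA path for the common <= 32 expert case while removing the hard failure for larger models.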

Environment

torchao 0.16.0+git3c1065c
PyTorch 2.10.0a0+a36e1d39eb.nv26.1
SM100 (GB200)
