Description
Training DeepSeek-V3 (671B) on a GB200 NVL72 cluster fails during the torch.compile precompilation phase. TorchInductor autotunes a fused RMSNorm forward/backward kernel whose shared-memory (SRAM) requirement exceeds the hardware limit of the Blackwell architecture, so no valid Triton config is found and compilation aborts.
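For triage, here is a minimal standalone probe (not the torchtitan code path) that compiles an RMSNorm forward/backward on a DeepSeek-V3-like shape. The hidden size 7168 and the tensor shape are illustrative assumptions, and this may not trigger the exact fused kernel from the full run:
# Standalone probe: compile RMSNorm forward + backward and see whether the
# generated Triton reduction kernel hits a shared-memory ceiling.
# Assumption: dim=7168 and the (batch, seq) shape are illustrative only.
import torch

dim = 7168  # assumed hidden size, for illustration only
norm = torch.nn.RMSNorm(dim, device="cuda", dtype=torch.bfloat16)

def step(x):
    # residual add around the norm, loosely mirroring the fused view/add pattern
    return (x + norm(x)).sum()

compiled = torch.compile(step)
x = torch.randn(4, 4096, dim, device="cuda", dtype=torch.bfloat16, requires_grad=True)
compiled(x).backward()
torch.cuda.synchronize()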
Environment
Hardware: GB200 NVL72 (4 GPUs per node)
PyTorch Version: 2.11.0.dev20260121+cu130
Python: 3.12
Model: DeepSeek-V3 (via torchtitan)
Reproduce
The error occurs when running the following training command:
python -m torchtitan.train \
--job.config_file ./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml \
--training.steps=50 \
--training.dataset_path=$DATASET_PATH \
--profiling.enable_profiling \
--comm.init_timeout_seconds=3000 \
--comm.train_timeout_seconds=2000 \
--profiling.save_traces_folder $PROFILE_DIR \
--parallelism.data_parallel_shard_degree=-1 \
--parallelism.expert_parallel_degree=64 \
--parallelism.pipeline_parallel_degree=1 \
--training.local_batch_size=4 \
--activation_checkpoint.mode=full \
--debug.moe_force_load_balance \
--compile.enable \
--compile.components=loss \
--compile.components=model \
--parallelism.expert_parallel_comm_backend=$MOE_BACKEND
Error Log
The traceback indicates a resource limit violation in torch._inductor:
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codecache.py", line 4480, in result
[rank152]: return self.result_fn()
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/async_compile.py", line 453, in get_result
[rank152]: kernel.precompile(
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 496, in precompile
[rank152]: self._make_launchers()
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 657, in _make_launchers
[rank152]: raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
[rank152]: torch._inductor.exc.InductorError: RuntimeError: No valid triton configs.
OutOfMemoryError: out of resource: triton_per_fused__fused_rms_norm__fused_rms_norm_backward__unsafe_view_add_view_21
Required: 262200
Hardware limit: 232448
Reducing block sizes or `num_stages` may help.
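For scale: the kernel requests 262200 bytes of shared memory against a 232448-byte (227 KiB) hardware limit, an overshoot of 29752 bytes. A small sketch to confirm the limit the Triton driver reports on this GPU (assuming the Triton 3.x runtime driver API; verify the helper names against the installed Triton):
# Sanity check: compare the reported hardware limit with what the Triton
# driver sees for device 0. Assumes Triton 3.x's runtime driver utilities.
from triton.runtime import driver

props = driver.active.utils.get_device_properties(0)
max_smem = props["max_shared_mem"]  # shared memory usable per block, in bytes
required = 262200                   # value taken from the error log above
print(f"hardware limit: {max_smem} B ({max_smem / 1024:.0f} KiB)")
print(f"kernel needs  : {required} B, overshoot: {required - max_smem} B")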
Additional Context
Setting compile.mode="max-autotune" and TORCHINDUCTOR_MAX_AUTOTUNE=1 does not resolve the issue.
The failure occurs specifically on the fused kernel: triton_per_fused__fused_rms_norm__fused_rms_norm_backward__unsafe_view_add_view_21
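One avenue not yet tried, following the error's own hint about reducing block sizes / num_stages, would be forcing non-persistent (looped) reductions so Inductor generates smaller-block kernels. This is an untested sketch; the config field name is an assumption and should be checked against torch/_inductor/config.py in this nightly:
# Untested mitigation sketch: ask Inductor to emit looped (non-persistent)
# reductions, which typically use smaller blocks and less shared memory.
# Assumption: this config field exists under this name in the installed nightly.
import torch._inductor.config as inductor_config

inductor_config.triton.persistent_reductions = False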
Versions
torch 2.11.0.dev20260121+cu130
torch_tensorrt 2.10.0a0
torchao 0.16.0+git14ce8f9c2
torchdata 0.11.0
torchprofile 0.0.4
torchtitan 0.2.0
torchvision 0.25.0a0+ca221243
tornado 6.5.4
nvidia-cublas 13.1.0.3
nvidia-cuda-cccl-cu12 12.9.27
nvidia-cuda-cupti 13.0.85
nvidia-cuda-nvrtc 13.0.88
nvidia-cuda-runtime 13.0.96
nvidia-cuda-runtime-cu13 0.0.0a0
nvidia-cudnn-cu13 9.15.1.9
nvidia-cudnn-frontend 1.16.0
nvidia-cufft 12.0.0.61
nvidia-cufile 1.15.1.6
nvidia-curand 10.4.0.35
nvidia-cusolver 12.0.4.66
nvidia-cusparse 12.6.3.3
nvidia-cusparselt-cu13 0.8.0
nvidia-dali-cuda130 1.52.0
nvidia-ml-py 13.590.44
nvidia-modelopt 0.39.0
nvidia-nccl-cu13 2.28.9
nvidia-nvcomp-cu13 5.0.0.6
nvidia-nvimgcodec-cu13 0.6.1.37
nvidia-nvjitlink 13.0.88
nvidia-nvjpeg 13.0.2.28
nvidia-nvjpeg2k-cu13 0.9.1.47
nvidia-nvshmem-cu13 3.4.5
nvidia-nvtiff-cu13 0.6.0.78
nvidia-nvtx 13.0.85
nvidia-resiliency-ext 0.5.0