Description
Training DeepSeek-V3 (671B) on a GB200 NVL72 cluster fails during the torch.compile precompilation phase. TorchInductor autotunes a fused RMSNorm forward/backward kernel whose shared-memory (SRAM) requirement exceeds the hardware limit of the Blackwell architecture, so no valid Triton config is found and compilation aborts.
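For triage, here is a minimal standalone probe (not the torchtitan code path) that compiles an RMSNorm forward/backward on a DeepSeek-V3-like shape. The hidden size 7168 and the tensor shape are illustrative assumptions, and this may not trigger the exact fused kernel from the full run:
# Standalone probe: compile RMSNorm forward + backward and see whether the
# generated Triton reduction kernel hits a shared-memory ceiling.
# Assumption: dim=7168 and the (batch, seq) shape are illustrative only.
import torch

dim = 7168  # assumed hidden size, for illustration only
norm = torch.nn.RMSNorm(dim, device="cuda", dtype=torch.bfloat16)

def step(x):
    # residual add around the norm, loosely mirroring the fused view/add pattern
    return (x + norm(x)).sum()

compiled = torch.compile(step)
x = torch.randn(4, 4096, dim, device="cuda", dtype=torch.bfloat16, requires_grad=True)
compiled(x).backward()
torch.cuda.synchronize()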
Environment
Hardware: GB200 NVL72 (4 GPUs per node)
PyTorch Version: 2.11.0.dev20260121+cu130
Python: 3.12
Model: DeepSeek-V3 (via torchtitan)
Reproduce
The error occurs when running the following training command:
python -m torchtitan.train \
--job.config_file ./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml \
--training.steps=50 \
--training.dataset_path=$DATASET_PATH \
--profiling.enable_profiling \
--comm.init_timeout_seconds=3000 \
--comm.train_timeout_seconds=2000 \
--profiling.save_traces_folder $PROFILE_DIR \
--parallelism.data_parallel_shard_degree=-1 \
--parallelism.expert_parallel_degree=64 \
--parallelism.pipeline_parallel_degree=1 \
--training.local_batch_size=4 \
--activation_checkpoint.mode=full \
--debug.moe_force_load_balance \
--compile.enable \
--compile.components=loss \
--compile.components=model \
--parallelism.expert_parallel_comm_backend=$MOE_BACKEND
Error Log
The traceback indicates a resource limit violation in torch._inductor:
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codecache.py", line 4480, in result
[rank152]: return self.result_fn()
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/async_compile.py", line 453, in get_result
[rank152]: kernel.precompile(
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 496, in precompile
[rank152]: self._make_launchers()
[rank152]: File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 657, in _make_launchers
[rank152]: raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
[rank152]: torch._inductor.exc.InductorError: RuntimeError: No valid triton configs.
OutOfMemoryError: out of resource: triton_per_fused__fused_rms_norm__fused_rms_norm_backward__unsafe_view_add_view_21
Required: 262200
Hardware limit: 232448
Reducing block sizes or `num_stages` may help.
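For scale: the kernel requests 262200 bytes of shared memory against a 232448-byte (227 KiB) hardware limit, an overshoot of 29752 bytes. A small sketch to confirm the limit the Triton driver reports on this GPU (assuming the Triton 3.x runtime driver API; verify the helper names against the installed Triton):
# Sanity check: compare the reported hardware limit with what the Triton
# driver sees for device 0. Assumes Triton 3.x's runtime driver utilities.
from triton.runtime import driver

props = driver.active.utils.get_device_properties(0)
max_smem = props["max_shared_mem"]  # shared memory usable per block, in bytes
required = 262200                   # value taken from the error log above
print(f"hardware limit: {max_smem} B ({max_smem / 1024:.0f} KiB)")
print(f"kernel needs  : {required} B, overshoot: {required - max_smem} B")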
Additional Context
Setting compile.mode="max-autotune" and TORCHINDUCTOR_MAX_AUTOTUNE=1 does not resolve the issue.
The failure occurs specifically on the fused kernel: triton_per_fused__fused_rms_norm__fused_rms_norm_backward__unsafe_view_add_view_21
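One avenue not yet tried, following the error's own hint about reducing block sizes / num_stages, would be forcing non-persistent (looped) reductions so Inductor generates smaller-block kernels. This is an untested sketch; the config field name is an assumption and should be checked against torch/_inductor/config.py in this nightly:
# Untested mitigation sketch: ask Inductor to emit looped (non-persistent)
# reductions, which typically use smaller blocks and less shared memory.
# Assumption: this config field exists under this name in the installed nightly.
import torch._inductor.config as inductor_config

inductor_config.triton.persistent_reductions = False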
Versions
torch 2.11.0.dev20260121+cu130
torch_tensorrt 2.10.0a0
torchao 0.16.0+git14ce8f9c2
torchdata 0.11.0
torchprofile 0.0.4
torchtitan 0.2.0
torchvision 0.25.0a0+ca221243
tornado 6.5.4
nvidia-cublas 13.1.0.3
nvidia-cuda-cccl-cu12 12.9.27
nvidia-cuda-cupti 13.0.85
nvidia-cuda-nvrtc 13.0.88
nvidia-cuda-runtime 13.0.96
nvidia-cuda-runtime-cu13 0.0.0a0
nvidia-cudnn-cu13 9.15.1.9
nvidia-cudnn-frontend 1.16.0
nvidia-cufft 12.0.0.61
nvidia-cufile 1.15.1.6
nvidia-curand 10.4.0.35
nvidia-cusolver 12.0.4.66
nvidia-cusparse 12.6.3.3
nvidia-cusparselt-cu13 0.8.0
nvidia-dali-cuda130 1.52.0
nvidia-ml-py 13.590.44
nvidia-modelopt 0.39.0
nvidia-nccl-cu13 2.28.9
nvidia-nvcomp-cu13 5.0.0.6
nvidia-nvimgcodec-cu13 0.6.1.37
nvidia-nvjitlink 13.0.88
nvidia-nvjpeg 13.0.2.28
nvidia-nvjpeg2k-cu13 0.9.1.47
nvidia-nvshmem-cu13 3.4.5
nvidia-nvtiff-cu13 0.6.0.78
nvidia-nvtx 13.0.85
nvidia-resiliency-ext 0.5.0