torch.compile InductorError on MoE model (dynamic slice with symbolic size) in PyTorch 2.10.0+cu128 #1876

@mansimov

Summary

torch.compile fails on the first forward pass of an MoE model with an internal TorchInductor assertion. The command that triggers it is: uv run sft @ configs/debug/moe/sft/train.toml

Error:
torch._inductor.exc.InductorError: AssertionError: failed OrderedSet([]) >= OrderedSet([u8]) (inductor >= fx)

FX node mentioned:
aten.slice.Tensor(..., 0, 0, %sym_sum)
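For readers unfamiliar with this class of failure, here is a minimal sketch of the kind of code that yields a slice whose end is a symbolic, data-dependent size. It is an assumption about the shape of the problem, not the actual model code: the function name and tensors are hypothetical, and this snippet has not been confirmed to reproduce the crash. The `%sym_sum` operand suggests the slice end is a sum of integers read out of a tensor, which is what `sum(counts.tolist())` models here.

```python
import torch


def slice_by_token_count(tokens: torch.Tensor, tokens_per_expert: torch.Tensor) -> torch.Tensor:
    """Keep only the first N rows, where N is read from tensor data.

    Hypothetical helper: in eager mode N is a plain Python int. Under
    torch.compile, values read out of a tensor can be traced as unbacked
    symbolic integers (names like u8), so the slice end is no longer static.
    """
    # Each per-expert count is read out of the tensor, then summed.
    num_routed = sum(tokens_per_expert.tolist())
    # Corresponds to an aten.slice.Tensor call along dim 0, from 0 to num_routed.
    return tokens[:num_routed]


# Eager-mode usage: 6 tokens with hidden size 2, of which 1 + 3 = 4 are routed.
tokens = torch.arange(12.0).reshape(6, 2)
tokens_per_expert = torch.tensor([1, 3])
routed = slice_by_token_count(tokens, tokens_per_expert)
```

To test the same function under the compiler, one option is to set `torch._dynamo.config.capture_scalar_outputs = True` and wrap it with `torch.compile(slice_by_token_count, fullgraph=True)`. Whether that hits the same assertion depends on the PyTorch version and on how the surrounding graph uses the result.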

Environment

  • PyTorch: 2.10.0+cu128
  • Python: 3.12.6
  • CUDA (driver/runtime from nvidia-smi): 12.8
  • NVIDIA driver: 570.86.15
  • GPU: NVIDIA A100-SXM4-40GB (reproduced on single GPU; machine has 8x A100)

What I was doing

I was running SFT on a debug GLM 0.5B MoE causal LM with torch.compile enabled (Inductor backend).
The failure occurs on the first forward pass after model initialization.

Disabling torch.compile makes the same run succeed.

Behavior

  • With compile enabled: crashes in Inductor during graph lowering/compile
  • With compile disabled: training runs successfully

Any help would be appreciated, thank you!
