[SimpleFSDP] add CI to guard compiler optimization passes #2072
ruisizhang123 wants to merge 2 commits into main
Conversation
python -m torchtitan.experiments.simple_fsdp.tests.frontend_integration_tests artifacts-to-be-uploaded --ngpu 8

# Run backend pass integration tests of SimpleFSDP
python -m torchtitan.experiments.simple_fsdp.tests.compiler_pass_integration_tests artifacts-to-be-uploaded --ngpu 8 --comm_mode local_tensor
I also tried FakeBackend mode, but the memory overhead is significantly higher @fegin 🤔 (~33GiB in local_tensor -> ~90GiB in FakeBackend). My suspicion is that FakeBackend initializes the whole model on one rank?
Actually, it should be the reverse: for FakeBackend mode, memory should be lower. On the other hand, local_tensor allocates all tensors in the same process, so it should be higher. cc @dzmitry-huba
Okay, not sure which part is wrong; here is an easy repro. The memory I reported in the previous message is from the compiler CI test.
NGPU=4 COMM_MODE="fake_backend" CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
[titan] 2025-11-25 18:27:58,478 - root - INFO - step: 1 loss: 12.2713 grad_norm: 0.0000 memory: 47.60GiB(50.11%) tps: 1,112 tflops: 64.41 mfu: 6.51%
NGPU=4 COMM_MODE="local_tensor" CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
Since there is no actual training, the peak memory is ~31GiB from nvidia-smi.
Uh, I see. This is because we currently skip the real training for LocalTensor, since LocalTensor doesn't support FSDP2. But it should work with SimpleFSDP.
Hmmm, it is also skipped in SimpleFSDP. Not sure why, but fake_backend gives a huge memory overhead in SimpleFSDP's compiler pass CI (~90GiB).
I think I can either verify with RealBackend (more memory overhead than LocalTensor, but less than the fake backend), or use LocalTensor mode (less memory overhead, but it doesn't execute actual training).
Curious: which parts does LocalTensor mode execute? If it executes compilation but skips the actual training, it looks sufficient for our case, since we are just verifying the compiler passes' integration.
Not sure if splitting into two tests will incur overhead.
@wwwjn does this incur any overhead to CI?
tests/integration_tests/run_tests.py
Outdated
for idx, override_arg in enumerate(test_flavor.override_args):
    cmd = f"CONFIG_FILE={full_path} NGPU={test_flavor.ngpu} LOG_RANK={all_ranks} ./run_train.sh"
    if comm_mode == "default":
comm_mode probably should be put in OverrideDefinitions?
No, fake_tensor & local_tensor modes use python -m instead of torchrun. We need to update how cmd launches ./run_train.sh here. We could add a COMM_MODE in front of ./run_train.sh, though.
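A minimal sketch of that idea, prepending the env var to the command string built in run_tests.py (the helper name and argument order are assumptions, not the actual run_tests.py code):

```python
def build_cmd(full_path: str, ngpu: int, all_ranks: str,
              comm_mode: str = "default") -> str:
    """Build the launch command, prepending COMM_MODE for non-default modes.

    Hypothetical helper; mirrors the cmd f-string from the diff above.
    """
    cmd = f"CONFIG_FILE={full_path} NGPU={ngpu} LOG_RANK={all_ranks} ./run_train.sh"
    if comm_mode != "default":
        # Prepend the env var so run_train.sh can pick it up.
        cmd = f"COMM_MODE={comm_mode} {cmd}"
    return cmd
```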
You can still use fake_tensor with torchrun. The reason I don't use torchrun is that for the dry-run case you don't actually need torchrun. Directly invoking the module saves us more time.
Hmmm, I guess you mean fake_backend? Sure, we can use fake_backend mode with torchrun. However, in that case all ranks would each issue a process and run things in parallel. What is the difference between using fake_backend mode and a real backend mode then?
In CI testing, I thought we used fake_backend because we wanted a dry run to verify things with fewer GPUs?
Yes, I'm just saying that you can use torchrun if you need it to verify something (e.g., MPMD). But for SimpleFSDP, I don't think that is required.
parser = argparse.ArgumentParser()
parser.add_argument("output_dir")
parser.add_argument(
    "--comm_mode",
I dislike this idea of creating another layer of comm_mode config. We are doing:
- pass comm_mode from the test config to run_train.sh's COMM_MODE
- pass COMM_MODE from run_train.sh to the actual training's comm.mode

The only reason we are doing this is to let COMM_MODE select the torchrun / python job starter.
If we have to differentiate between torchrun / python, what we can do is let run_train.sh select the starter by looking at the --comm.mode field passed in by the user / tests.
cc @fegin
I don't have a strong opinion on this. I rewrote the code so that run_train.sh can read in args and infer comm.mode from them.
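One way run_train.sh could infer the launcher from the flags, as a sketch (the function name and the exact launcher strings are assumptions, not the actual script):

```shell
# Hypothetical sketch: pick the job starter by scanning the args for
# --comm.mode instead of a separate COMM_MODE env-var layer.
select_launcher() {
    local comm_mode=""
    for arg in "$@"; do
        case "$arg" in
            --comm.mode=*) comm_mode="${arg#--comm.mode=}" ;;
        esac
    done
    if [ "$comm_mode" = "fake_backend" ] || [ "$comm_mode" = "local_tensor" ]; then
        # Single-process dry run / debug: plain python is enough.
        echo "python -m"
    else
        # Real training: multi-process launch via torchrun.
        echo "torchrun"
    fi
}
```

The script would then prefix its training command with `$(select_launcher "$@")`.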
CONFIG_COMM_MODE=${CONFIG_COMM_MODE:-"default"}
# COMM_MODE options: "fake_backend" (dry run), "local_tensor" (debug mode), or empty for normal training
COMM_MODE=${COMM_MODE:-""}
COMM_MODE=${COMM_MODE:-$CONFIG_COMM_MODE}
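For reference, the `${VAR:-default}` expansion substitutes when the variable is unset or empty (null), so the chained defaults above resolve like this (a standalone demonstration, not the script itself):

```shell
# ":-" substitutes when the variable is unset OR empty, so the chained
# defaults fall back to CONFIG_COMM_MODE only when COMM_MODE was not given.
CONFIG_COMM_MODE="local_tensor"
unset COMM_MODE
COMM_MODE=${COMM_MODE:-""}                  # still empty
COMM_MODE=${COMM_MODE:-$CONFIG_COMM_MODE}   # empty -> falls back
echo "$COMM_MODE"                           # prints local_tensor

COMM_MODE="fake_backend"                    # user-provided value wins
COMM_MODE=${COMM_MODE:-$CONFIG_COMM_MODE}   # non-empty -> kept as-is
echo "$COMM_MODE"                           # prints fake_backend
```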
I think it might be slightly cleaner to do this logic in run_tests.py https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/run_tests.py#L32
- we detect if override_arg has --comm.mode and set COMM_MODE over there correspondingly.

The reason is that:
- run_train.sh is user-facing, and I hope we don't make it too complicated with test-only logic
- here there is an undefined precedence between --comm.mode and COMM_MODE

Sorry for the back-and-forth, as I don't think either way is super clean.
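A sketch of that detection on the run_tests.py side (the helper name is an assumption; only the `--comm.mode=value` form is handled, matching how the test overrides are written):

```python
def comm_mode_prefix(override_args: list[str]) -> str:
    """Return a 'COMM_MODE=... ' env prefix if --comm.mode appears
    in the override args, else an empty string.

    Hypothetical helper for prepending to the ./run_train.sh command.
    """
    for arg in override_args:
        if arg.startswith("--comm.mode="):
            value = arg.split("=", 1)[1]
            if value:
                return f"COMM_MODE={value} "
    return ""
```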
As discussed, we will add an additional CI to guard compiler passes' composability.