[SimpleFSDP] add CI to guard compiler optimization passes #2072
ruisizhang123 wants to merge 2 commits into main
Conversation
python -m torchtitan.experiments.simple_fsdp.tests.frontend_integration_tests artifacts-to-be-uploaded --ngpu 8

# Run backend pass integration tests of SimpleFSDP
python -m torchtitan.experiments.simple_fsdp.tests.compiler_pass_integration_tests artifacts-to-be-uploaded --ngpu 8 --comm_mode local_tensor
I also tried FakeBackend mode, but the memory overhead is significantly higher @fegin 🤔 (~33GiB in local_tensor -> ~90GiB in FakeBackend). My suspicion is that FakeBackend initializes the whole model on one rank?
Actually, it should be the reverse: for FakeBackend mode, memory should be lower. On the other hand, local_tensor allocates all tensors in the same process, so it should be higher. cc @dzmitry-huba
Okay, not sure which part is wrong; here is an easy repro. The memory I reported in the previous message is from the compiler CI test.
NGPU=4 COMM_MODE="fake_backend" CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
[titan] 2025-11-25 18:27:58,478 - root - INFO - step: 1 loss: 12.2713 grad_norm: 0.0000 memory: 47.60GiB(50.11%) tps: 1,112 tflops: 64.41 mfu: 6.51%
NGPU=4 COMM_MODE="local_tensor" CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
Since there is no actual training, the peak memory is ~31GiB from nvidia-smi.
Uh, I see. This is because we currently skip the real training for LocalTensor, since LocalTensor doesn't support FSDP2. But it should work with SimpleFSDP.
Hmmm, it is also skipped in SimpleFSDP. Not sure why, but fake_backend gives a huge memory overhead in SimpleFSDP's compiler pass CI (~90GiB).
I think I can either verify with RealBackend (more memory overhead than LocalTensor, but less than the fake backend), or use LocalTensor mode (less memory overhead, but it doesn't execute actual training).
Curious: which parts does LocalTensor mode execute? If it executes compilation but skips the actual training, it looks sufficient for our case, since we are just verifying the compiler passes' integration.
Not sure if splitting into two tests will incur overhead.
@wwwjn does this incur any overhead to CI?
tests/integration_tests/run_tests.py
Outdated
for idx, override_arg in enumerate(test_flavor.override_args):
    cmd = f"CONFIG_FILE={full_path} NGPU={test_flavor.ngpu} LOG_RANK={all_ranks} ./run_train.sh"
    if comm_mode == "default":
comm_mode probably should be put in OverrideDefinitions?
No, fake_tensor & local_tensor modes use python -m instead of torchrun. We need to update how cmd launches ./run_train.sh here. We could add a COMM_MODE in front of ./run_train.sh, though.
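A minimal sketch of that idea, prepending the env var to the command string built in run_tests.py (the helper name and argument order are assumptions, not the actual run_tests.py code):

```python
def build_cmd(full_path: str, ngpu: int, all_ranks: str,
              comm_mode: str = "default") -> str:
    """Build the launch command, prepending COMM_MODE for non-default modes.

    Hypothetical helper; mirrors the cmd f-string from the diff above.
    """
    cmd = f"CONFIG_FILE={full_path} NGPU={ngpu} LOG_RANK={all_ranks} ./run_train.sh"
    if comm_mode != "default":
        # Prepend the env var so run_train.sh can pick it up.
        cmd = f"COMM_MODE={comm_mode} {cmd}"
    return cmd
```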
You can still use fake_tensor with torchrun. The reason I don't use torchrun is that for the dry-run case you don't actually need torchrun. Directly invoking the module saves us more time.
Hmmm, I guess you mean fake_backend? Sure, we can use fake_backend mode with torchrun. However, in that case all ranks would each issue a process and run things in parallel. What is the difference between using fake_backend mode and a real backend mode then?
In CI testing, I thought we used fake_backend because we wanted a dry run to verify things with fewer GPUs?
Yes, I'm just saying that you can use torchrun if you need it to verify something (e.g., MPMD). But for SimpleFSDP, I don't think that is required.
parser = argparse.ArgumentParser()
parser.add_argument("output_dir")
parser.add_argument(
    "--comm_mode",
I dislike this idea of creating another layer of comm_mode config. We are doing:
- pass comm_mode from the test config to run_train.sh's COMM_MODE
- pass COMM_MODE from run_train.sh to the actual training's comm.mode

The only reason we are doing this is to let COMM_MODE select the torchrun / python job starter.
If we have to differentiate between torchrun / python, what we can do is let run_train.sh select the starter by looking at the --comm.mode field passed in by the user / tests.
cc @fegin
I don't have a strong opinion on this. I rewrote the code so that run_train.sh can read in args and infer comm.mode from them.
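One way run_train.sh could infer the launcher from the flags, as a sketch (the function name and the exact launcher strings are assumptions, not the actual script):

```shell
# Hypothetical sketch: pick the job starter by scanning the args for
# --comm.mode instead of a separate COMM_MODE env-var layer.
select_launcher() {
    local comm_mode=""
    for arg in "$@"; do
        case "$arg" in
            --comm.mode=*) comm_mode="${arg#--comm.mode=}" ;;
        esac
    done
    if [ "$comm_mode" = "fake_backend" ] || [ "$comm_mode" = "local_tensor" ]; then
        # Single-process dry run / debug: plain python is enough.
        echo "python -m"
    else
        # Real training: multi-process launch via torchrun.
        echo "torchrun"
    fi
}
```

The script would then prefix its training command with `$(select_launcher "$@")`.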
CONFIG_COMM_MODE=${CONFIG_COMM_MODE:-"default"}
# COMM_MODE options: "fake_backend" (dry run), "local_tensor" (debug mode), or empty for normal training
COMM_MODE=${COMM_MODE:-""}
COMM_MODE=${COMM_MODE:-$CONFIG_COMM_MODE}
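For reference, the `${VAR:-default}` expansion substitutes when the variable is unset or empty (null), so the chained defaults above resolve like this (a standalone demonstration, not the script itself):

```shell
# ":-" substitutes when the variable is unset OR empty, so the chained
# defaults fall back to CONFIG_COMM_MODE only when COMM_MODE was not given.
CONFIG_COMM_MODE="local_tensor"
unset COMM_MODE
COMM_MODE=${COMM_MODE:-""}                  # still empty
COMM_MODE=${COMM_MODE:-$CONFIG_COMM_MODE}   # empty -> falls back
echo "$COMM_MODE"                           # prints local_tensor

COMM_MODE="fake_backend"                    # user-provided value wins
COMM_MODE=${COMM_MODE:-$CONFIG_COMM_MODE}   # non-empty -> kept as-is
echo "$COMM_MODE"                           # prints fake_backend
```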
I think it might be slightly cleaner to do this logic in run_tests.py https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/run_tests.py#L32
- we detect if override_arg has --comm.mode and set COMM_MODE over there correspondingly.

The reason is that:
- run_train.sh is user-facing, and I hope we don't make it too complicated with test-only logic
- here there is an undefined precedence between --comm.mode and COMM_MODE

Sorry for the back-and-forth, as I don't think either way is super clean.
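A sketch of that detection on the run_tests.py side (the helper name is an assumption; only the `--comm.mode=value` form is handled, matching how the test overrides are written):

```python
def comm_mode_prefix(override_args: list[str]) -> str:
    """Return a 'COMM_MODE=... ' env prefix if --comm.mode appears
    in the override args, else an empty string.

    Hypothetical helper for prepending to the ./run_train.sh command.
    """
    for arg in override_args:
        if arg.startswith("--comm.mode="):
            value = arg.split("=", 1)[1]
            if value:
                return f"COMM_MODE={value} "
    return ""
```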
As discussed, we will add an additional CI to guard compiler passes' composability.