
Add docs to explain COMM_MODE #2162

Merged

fegin merged 5 commits into main from gh/fegin/58/head on Dec 30, 2025
Conversation

Contributor

@fegin fegin commented Dec 18, 2025

Stack from ghstack (oldest at bottom):

As title

fegin added a commit that referenced this pull request Dec 18, 2025
As title


ghstack-source-id: 0131fe2
Pull-Request: #2162
meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 18, 2025
Contributor

@wwwjn wwwjn left a comment


LGTM! Please fix the lint error.

Contributor

@tianyu-l tianyu-l left a comment


debugging.md is becoming huge. We could either split it into multiple files under a docs/debugging/ folder, or create a table of contents in the single file.

- Simulates multi-GPU behavior on a single shared GPU
- Executes all collectives (all-reduce, all-gather, etc.) locally without network communication
- Maintains the same code paths as distributed training for accurate debugging
- Runs only one training step by default
Contributor


Why have this as the default?

Contributor Author


Because under fake or local tensor modes, running more than one step doesn't give any more information. We cannot use either mode to verify memory usage or performance, as those numbers are not accurate. So one step should give users enough information.
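The behavior the quoted doc describes (collectives executed locally on one device, no network involved) can be illustrated with a minimal, self-contained Python sketch. This is not torchtitan's actual implementation: the `FakeComm` class, its method names, and the list-based "tensors" are invented here purely to show the idea of simulating N ranks' collectives inside a single process.

```python
# Hypothetical sketch (not torchtitan's code): a "fake" communication
# layer that performs collectives in-process for N simulated ranks,
# mirroring how a fake/local mode can debug distributed logic on a
# single GPU without real network communication.

class FakeComm:
    """Holds the world size and runs collectives locally, in-process."""

    def __init__(self, world_size):
        self.world_size = world_size

    def all_reduce(self, per_rank_tensors):
        # Sum the contribution of every simulated rank elementwise, then
        # hand the same reduced result back to each rank -- the same
        # contract a real all-reduce fulfills over the network.
        reduced = [sum(vals) for vals in zip(*per_rank_tensors)]
        return [list(reduced) for _ in range(self.world_size)]

    def all_gather(self, per_rank_tensors):
        # Every simulated rank receives the full list of all ranks' tensors.
        gathered = [list(t) for t in per_rank_tensors]
        return [list(gathered) for _ in range(self.world_size)]


comm = FakeComm(world_size=4)
out = comm.all_reduce([[1, 2], [1, 2], [1, 2], [1, 2]])
# Every simulated rank sees the same reduced value [4, 8].
```

Because the collective runs through the same call sites a real backend would, code-path bugs (wrong reduce group, mismatched shapes) surface in one process and, as noted above, one step is enough — memory and performance numbers from such a mode would not be meaningful anyway.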

- Runs only one training step by default

**When to use it:**
- Debugging distributed training logic (FSDP, TP, PP, CP, EP) with data dependencies without multi-GPU setup
Contributor


add a note that fsdp doesn't work today?

fegin added a commit that referenced this pull request Dec 29, 2025
As title

ghstack-source-id: 1108849
Pull-Request: #2162
fegin added a commit that referenced this pull request Dec 29, 2025
As title

ghstack-source-id: e728a85
Pull-Request: #2162
@fegin fegin changed the base branch from gh/fegin/58/base to main December 30, 2025 01:34
@fegin fegin merged commit 7e4ab85 into main Dec 30, 2025
7 checks passed
@tianyu-l tianyu-l deleted the gh/fegin/58/head branch December 30, 2025 01:52
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 13, 2026