
Refactor spec modification/introspection to make references to Submodules typed #2834

Closed

nschank wants to merge 8 commits into NVIDIA:main from nschank:submodules

Conversation

@nschank
Contributor

@nschank nschank commented Jan 6, 2026

What does this PR do ?

In order to safely refactor Submodules classes, I want to make sure I can easily find everywhere those classes are being referenced. This updates every instance I could find where ModuleSpec submodules are being inspected or modified, and either uses cast or uses a typed helper method to ensure that searching for references/usages of a field will consistently find them.
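For illustration, the two approaches can be sketched as follows. The class and helper names here (`SelfAttentionSubmodules` as the spec payload, `get_self_attention_submodules` as the typed accessor) are stand-ins, not Megatron-Core's actual definitions:

```python
# Hypothetical sketch of making submodule access typed so that
# "find references" tooling can track every use site.
from dataclasses import dataclass
from typing import cast


@dataclass
class SelfAttentionSubmodules:
    linear_qkv: type


@dataclass
class ModuleSpec:
    submodules: object  # untyped payload, as in a spec holding Any


# Option 1: cast at the use site -- a type-checker hint only, erased at runtime.
def inspect_with_cast(spec: ModuleSpec) -> type:
    attn = cast(SelfAttentionSubmodules, spec.submodules)
    return attn.linear_qkv


# Option 2: a typed helper method that also validates at runtime.
def get_self_attention_submodules(spec: ModuleSpec) -> SelfAttentionSubmodules:
    assert isinstance(spec.submodules, SelfAttentionSubmodules)
    return spec.submodules
```

Either way, a search for references to `SelfAttentionSubmodules` now finds every inspection/modification site.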

Relevant design doc: https://docs.google.com/document/d/1shyv0iKEzRdevLOlouF_NktbdJazvWifqxUwPXFigQE/edit?tab=t.0#heading=h.uwes2zo47yg6

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @megatron-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either [email protected] or [email protected].

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@nschank nschank requested review from a team as code owners January 6, 2026 17:38
@copy-pr-bot

copy-pr-bot bot commented Jan 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nschank nschank requested a review from Skylion007 January 7, 2026 23:00
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@yashaswikarnati
Contributor

@chtruong814 @ko3n1g Mostly changes in tests - could you help take a look. thank you!

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 12, 2026
@nschank
Contributor Author

nschank commented Jan 15, 2026

Updated to fix conflicts.

Anyone mind taking a look? This should be pretty uncontroversial (except maybe the format of the gpt_layer_specs changes), but merge conflicts are going to increasingly be a pain haha

@Phlip79 Phlip79 requested review from a team and removed request for Phlip79 and Skylion007 January 16, 2026 03:07
@Phlip79
Member

Phlip79 commented Jan 16, 2026

@NVIDIA/mcore-oncall

@nschank
Contributor Author

nschank commented Jan 23, 2026

Fixed merge conflicts

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026
@nschank
Contributor Author

nschank commented Jan 29, 2026

I added a nice little helper inspired by @Skylion007 to make the gpt_layer_spec methods keep all their many arguments during type checking. It's optional tho, I can drop it if there are concerns about it!

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 29, 2026
@nschank
Contributor Author

nschank commented Jan 30, 2026

Just noting the other places I found where copy_signature will come in handy! nschank#2

if self.sequence_parallel_lm or self.context_parallel_lm > 1:
    if not language_model_type.startswith('nemotron5-hybrid'):
        attn_module = language_transformer_layer_spec.submodules.self_attention
        assert isinstance(
Contributor

@trintamaki @parthmannan do you have any concerns with these asserts here? Can we expect the language model to always use mcore specs, or do you also use HF models directly, as for the vision encoder?

Contributor Author

If there are concerns, I'm happy to switch back to cast instead (which only affects type checking and won't do anything at runtime), just let me know.
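The runtime difference being weighed here can be shown with a generic sketch (class names below are illustrative, not from the PR): `typing.cast` is erased at runtime, while `assert isinstance` actually checks and can raise.

```python
from typing import cast


class BaseSubmodules: ...
class AttentionSubmodules(BaseSubmodules): ...


def via_cast(sub: BaseSubmodules) -> AttentionSubmodules:
    # Erased at runtime: no check happens even if `sub` is the wrong type,
    # so this only affects what the type checker believes.
    return cast(AttentionSubmodules, sub)


def via_assert(sub: BaseSubmodules) -> AttentionSubmodules:
    # Checked at runtime: raises AssertionError on a mismatched spec
    # (unless Python is run with -O, which strips asserts).
    assert isinstance(sub, AttentionSubmodules)
    return sub
```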

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Feb 1, 2026
@yashaswikarnati yashaswikarnati added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Feb 2, 2026
@nschank
Contributor Author

nschank commented Feb 4, 2026

This has gotten no traction, so I'm going to split this up to make it easier to review. #3255 is the most interesting subset, will follow with a few one liners and finally do the tests as a last step.


Labels

community-request, complexity: high, Expert Review


6 participants