Refactor the optimizer override function so that users can swap in their own #2010

jstjohn · 2026-01-21T00:45:31Z

What does this PR do ?

Turns the private `_build_config_overrides` function in optimizer.py into a builder object that users can swap out in their config. This lets new models (e.g. evo2) supply their own custom override function without adding booleans to, or otherwise modifying, the core function in mbridge optimizer.py.
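A minimal sketch of the pattern, not the actual mbridge code: every class, method, and field name below is hypothetical and chosen for illustration, and the real `OptimizerConfigOverrideProvider` may have a different interface. The `no_weight_decay_embeddings` name is borrowed from the downstream Evo2 release note; how it is wired up here is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DefaultOverrideProvider:
    """Hypothetical stand-in for the swappable builder; not the real mbridge class."""

    def skip_weight_decay(self, name: str, ndim: int) -> bool:
        # Assumed default rule: no weight decay on biases and other 1-D parameters.
        return name.endswith(".bias") or ndim == 1

    def build_config_overrides(self, params: dict[str, int]) -> dict[str, dict[str, float]]:
        # Map each parameter name that needs special treatment to its overrides.
        # "wd_mult" is an illustrative key meaning "multiply weight decay by this".
        return {
            name: {"wd_mult": 0.0}
            for name, ndim in params.items()
            if self.skip_weight_decay(name, ndim)
        }


@dataclass
class EmbeddingSkippingProvider(DefaultOverrideProvider):
    """Model-specific variant that can also skip weight decay on embeddings."""

    no_weight_decay_embeddings: bool = True

    def skip_weight_decay(self, name: str, ndim: int) -> bool:
        if self.no_weight_decay_embeddings and "embedding" in name:
            return True
        return super().skip_weight_decay(name, ndim)


@dataclass
class ToyConfigContainer:
    """Toy container: the provider is just another config field users can replace."""

    override_provider: DefaultOverrideProvider = field(default_factory=DefaultOverrideProvider)


# Parameter name -> number of tensor dimensions (toy data).
params = {"embedding.weight": 2, "layer.linear.weight": 2, "layer.linear.bias": 1}

default_overrides = ToyConfigContainer().override_provider.build_config_overrides(params)
custom_overrides = ToyConfigContainer(
    override_provider=EmbeddingSkippingProvider()
).override_provider.build_config_overrides(params)
```

With this shape, a model that needs different behaviour subclasses the provider and passes it in through the config, and the core builder never grows a model-specific flag.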

Changelog

Adds `OptimizerConfigOverrideProvider` to the `ConfigContainer`, replacing the `_build_config_overrides` function.
Adds the missing handling of decoupled lr to the config override builder.
Adds a note about how we can simplify further once "[main] feat(moe): Support apply wd to qk layernorm for Qwen3-Next (4/4)" (NVIDIA/Megatron-LM#2753) is pulled into dev.
Builds off of "enable qwen wd" (#1935).
Depended on by "Refactor the way that we do weight decay skipping for hyena to follow ToT mbridge." (NVIDIA/bionemo-framework#1429).
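For the decoupled lr item, here is a rough sketch of what such handling could look like. It assumes "decoupled lr" means a separate learning rate for the input and output layers. The helper names, the name-based layer check, and the `max_lr`/`min_lr` keys are all hypothetical and are not taken from the mbridge implementation.

```python
from typing import Iterable, Optional


def is_input_or_output_layer(name: str) -> bool:
    # Hypothetical name-based check; a real model would likely flag these parameters itself.
    return "embedding" in name or "output_layer" in name


def decoupled_lr_overrides(
    param_names: Iterable[str],
    decoupled_lr: Optional[float],
    decoupled_min_lr: Optional[float] = None,
) -> dict[str, dict[str, float]]:
    """Build per-parameter lr overrides; returns nothing when decoupled lr is unset."""
    if decoupled_lr is None:
        return {}
    override = {"max_lr": decoupled_lr}
    if decoupled_min_lr is not None:
        override["min_lr"] = decoupled_min_lr
    # Copy the dict so each parameter gets an independent override entry.
    return {name: dict(override) for name in param_names if is_input_or_output_layer(name)}


names = ["embedding.weight", "decoder.linear.weight", "output_layer.weight"]
overrides = decoupled_lr_overrides(names, decoupled_lr=1e-4, decoupled_min_lr=1e-5)
```

Only the embedding and output-layer entries receive an override here; every other parameter keeps the optimizer's global learning-rate settings.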

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Related to # (issue)

FDecaYed

Agreed on making the config override not hard-coded, so that each model that needs to change something creates and passes its own. We probably also want to handle Qwen the same way, instead of hard-coding an if block in the default.

src/megatron/bridge/training/config.py

…eir own (NVIDIA-NeMo#2010) Signed-off-by: John St. John <jstjohn@nvidia.com> Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>

… ToT mbridge. (#1429)

### Description

* No change to the user-level API, but under the hood make use of the new mbridge API for defining custom weight decay skips. Depends on NVIDIA-NeMo/Megatron-Bridge#2010
* Update to tokenizer to support the new mbridge API for tokenizer init that no longer requires a path for path object for based inputs. Path objects no longer work with megatron using this path, so switching to strings in the recipe.
* Remove unused nemo2 code/files that were left over in the refactor.

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebooks execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.

For more details, see [CONTRIBUTING](CONTRIBUTING.md)

> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

#### Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

- @coderabbitai review - Triggers a standard review
- @coderabbitai full review - Triggers a comprehensive review

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added `no_weight_decay_embeddings` configuration parameter for Evo2 training recipes to control embedding weight decay behavior.
* **Chores**
  * Updated Megatron-related dependency versions.
* **Tests**
  * Improved test fixture scoping for better test isolation.

---------

Signed-off-by: John St. John <jstjohn@nvidia.com>

copy-pr-bot bot had a problem deploying to nemo-ci January 21, 2026 00:45 Error

copy-pr-bot bot had a problem deploying to test January 21, 2026 00:46 Error

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 00:50 Inactive

copy-pr-bot bot temporarily deployed to test January 21, 2026 00:50 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 01:50 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 01:57 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 02:06 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 21, 2026 02:06 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 02:06 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 22:15 Inactive

FDecaYed reviewed Jan 23, 2026

src/megatron/bridge/training/config.py (resolved)

ko3n1g approved these changes Jan 23, 2026

jstjohn merged commit a2c11f8 into main Jan 23, 2026
49 checks passed

jstjohn deleted the jstjohn/opt_config_override_provider branch January 23, 2026 17:21

3 participants
