Skip to content

Comments

[moe] brings batch/sequence-wise load balance loss#2061

Open
rakkit wants to merge 5 commits intopytorch:mainfrom
rakkit:moe_load_balance_loss
Open

[moe] brings batch/sequence-wise load balance loss#2061
rakkit wants to merge 5 commits intopytorch:mainfrom
rakkit:moe_load_balance_loss

Conversation

@rakkit
Copy link
Contributor

@rakkit rakkit commented Nov 19, 2025

This is a draft PR for:

  1. Make the moe's load_balance_coeff configurable
  2. add the batch and seq-wise aux loss for load balance. [ref: dpskv3 eqn. 17~20]

For now, it only applies to the DeepSeek model, but I can add it for all other moe models at the end.
(also, we dont log the aux loss, but i can add it in optimizer hook to do this if you want)

The main concern is that the aux loss does not work well with PP. From what I have tested, it works well only with 1F1B. And it is broken for ZBV or interleaved 1f1b.

To test it:
[sequence_wise, by default]
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --training.extra_losses.load_balance_loss_weight=0.001
image

[batch_wise, need to pick this in ModelArgs]
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --training.extra_losses.load_balance_loss_weight=0.001
image

(turn it off)
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh
image

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 19, 2025
job_config, parallel_dims=parallel_dims, ft_manager=self.ft_manager
)

self.loss_fn = functools.partial(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add a condition here to wrap loss or not for MoE. for now all models in torchtitan only return a single output so its ok for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If subsume this moe loss wrapper into build_loss_fn we can avoid adding the logic here.

Copy link
Contributor

@wwwjn wwwjn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! @shuhuayu is working on a more formal review, and I have some house-keeping comments



@dataclass
class ExtraLosses:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is specifically for MoE load balancing loss for now, do you foresee any other loss related params will be used in this section? If not, let's make the name for descriptive and specific

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup here. Should we merge these configs to the Model dataclass?

Copy link
Contributor

@shuhuayu shuhuayu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the pr @rakkit ! Made some comments here.

job_config, parallel_dims=parallel_dims, ft_manager=self.ft_manager
)

self.loss_fn = functools.partial(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If subsume this moe loss wrapper into build_loss_fn we can avoid adding the logic here.

@rakkit
Copy link
Contributor Author

rakkit commented Dec 9, 2025

Thanks a lot for the feedback, @wwwjn @shuhuayu (sorry for the late update)!

Summary of new changes:

  • Made the MoE loss a wrapper, so we can now do
    build_loss_fn = moe_loss_wrap(build_cross_entropy_loss)
    when defining the model in TrainSpec.

  • Moved ExtraLosses to the Training scope.
    The main purpose is to decouple this from the model definition.

  • Renamed load_balance_coeff to moe_aux_loss_free_bias_coeff — a bit longer, but clearer.

  • Now applied on moe models in models folder, (dpskv3, llama4, qwen)

  • Other refactors, thanks again @shuhuayu.

And be aware that the PP & aux-loss still does not work

self.load_balance_loss_weight,
)
else:
load_balance_loss = torch.tensor(0.0, device=out.device, dtype=out.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see out is not defined in this scope yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. thanks : )

@staticmethod
def sequence_wise_aux_loss(
scores: torch.Tensor,
indices: torch.Tensor,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will use the biased topk(scores + expert_bias) instead of the unbiased topk(scores) from DSv3 eq 18

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, thats top_scores

Copy link
Contributor

@lckr lckr Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, scores is the raw sigmoid output. But isn't indices (= selected_experts_indices) derived as topk(scores + expert_bias)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, good question. need to think about this.

Copy link
Contributor Author

@rakkit rakkit Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you might be right, eq 18 the topk dont have "bias"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I fixed this and rerun the two aux loss types and no aux loss in PR description.

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal and sorry for the late review.

It looks like this feature requires intrusive change and is worth some discussions on how to best support it.

loss = loss_fn(pred, labels)
# Add auxiliary loss to the computation graph for gradients in the backward pass,
# but cancel out its numeric value so the forward pass only logs language model task loss.
loss = loss + (load_balance_loss - load_balance_loss.detach())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks too hacky. Curious why we don't want to log the full loss?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if one needs to log load_balance_loss, a helpful way is to log it via moe's optimizer pre-step hook. (where we log everything about moe, e.g. bias, experts usage, entropy, lb loss etc).

And we dont need to hack the return of "loss" (which is a single value/tensor for dense/moe/diffusion model training).

and for people who want to run ablation study, the "loss" is the "clean" CE loss for training/validation



@dataclass
class ExtraLosses:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the "model definition", so shouldn't be configurable from run-to-run (given the current limitation of torchtitan config system). They should be embedded in ModelArgs or more specifically MoEArgs.

return loss_per_seq.mean()

@staticmethod
def batch_wise_aux_loss(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this batch one? For DSv3 it seems unnecessary. For Qwen3 it seems insufficient, which adopts a global batch load balancing but here it looks local / microbatch (see e.g. https://qwenlm.github.io/blog/global-load-balance/)

For simplicity let's start with seq_wise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, ideally we can make it (type of aux-loss) configurable

I am curious if people deal with a special case of document segmentation (for models trained with document mask attention, it's literally not "one" sequence on attention's side)

)

@staticmethod
def sequence_wise_aux_loss(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this returns a loss independent of DP, similar to the cross-entropy loss, in that each DP rank computes its own aux loss and do backward which eventually gets gradients reduced across DP ranks by FSDP.

In general, is there a way to verify the correctness of the implementation?

Comment on lines +444 to +446
h, accumulated_load_balance_loss = layer(
h, self.freqs_cis, accumulated_load_balance_loss, attention_masks
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, passing this per-layer loss along all the way to the final output sounds unnecessary. It sounds correct but is causing quite intrusive changes to the entire model code.

Putting PP aside, is it true that we can also achieve this via a buffer in each MoE module, similar to the expert bias? Specifically, putting the per-layer loss in a buffer, and in the loss function fetch the value and add them to the final loss.

Is this similar to the idea in #1979 when you say

Caching the per-layer aux_loss loss (which breaks compile, but not PP)

In what way it breaks compile? cc @bdhirsh @xmfan

With PP, I'm not sure what's the best way to do it. @H-Huang any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested in what about caching auxilliary loss breaks compile as well.

This is probably not representative, but I have a basic working example here of using torch.compile to compile each layer of a transformer model here where I also compute and cache an extra loss onto each layer: https://gist.github.com/bdhirsh/2b59611d3070354af3f6364d9becaa08

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx both.

. In what way it breaks compile?

If we register a buffer named "aux_loss", and use it saved aux loss value, and access it and sum it to the final loss. compiler will broken.

This is probably not representative, but I have a basic working example here of using torch.compile to compile each layer of a transformer model here where I also compute and cache an extra loss onto each layer: https://gist.github.com/bdhirsh/2b59611d3070354af3f6364d9becaa08

Here self.aux_loss is not pre-defined in the buffer. This seems to be a good solution. Let me run a few more tests to check.

Copy link
Contributor Author

@rakkit rakkit Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I mean, the purpose of passing (accumulated) aux-Loss in forward is to deal with PP.

w/o pp, we have lots of clean solutions, including those you proposed.

For pp, if we put each block's aux loss along with the block (via buffer or whatever methods). At the backward the last stage will not be able to capture the aux-loss-of-block-i. For that we need either add the backward hook, that we manually hacking the back ward gradinet (which does not work well with compile as i have tested). Or we need manually add a communcation to gather all stages's aux-loss into last stage. (and we also need to think about micrio-batch things that we need some queue for aux-loss-of-block-i).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this is non-trivial. Let me think about it and get back to you.

@tianyu-l tianyu-l linked an issue Dec 26, 2025 that may be closed by this pull request
@JavaZeroo
Copy link

Is there any update?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

question of PP x aux_loss for MoE

7 participants