
Update AFMoE architecture to use v5-style MoE impl #44063

Open
AutumnAurelium wants to merge 11 commits into huggingface:main from AutumnAurelium:main

Conversation

@AutumnAurelium

What does this PR do?

This brings the Arcee AFMoE architecture in line with other MoE models' implementation patterns since v5. It also adds integration testing using Trinity Nano.
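For context, the shared v5-style pattern routes each token through a token-choice router before dispatching to experts. The following is a minimal illustrative sketch, not the actual AFMoE code: the hidden size (1024) and the gate's 128 outputs mirror the Trinity Nano module printout later in this thread (assuming those 128 outputs correspond to 128 experts), while `top_k=8` is a placeholder value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenChoiceRouter(nn.Module):
    """Illustrative token-choice MoE router: each token independently
    selects its top-k experts from the gate's softmax distribution."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_tokens, hidden_size)
        logits = self.gate(hidden_states)            # (tokens, num_experts)
        weights = F.softmax(logits, dim=-1)
        topk_weights, topk_ids = torch.topk(weights, self.top_k, dim=-1)
        # Renormalize so each token's selected expert weights sum to 1.
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
        return topk_weights, topk_ids


router = TokenChoiceRouter(hidden_size=1024, num_experts=128, top_k=8)
weights, expert_ids = router(torch.randn(4, 1024))
```

Each returned row holds the mixing weights and indices of the 8 experts chosen for that token; the expert outputs are then combined with those weights.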

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez

Collaborator

@ArthurZucker left a comment

Sounds good, thanks for updating!

@AutumnAurelium
Author

@ArthurZucker @Cyrilvallez

Any update on getting this merged? I've fixed the problems mentioned above.

@winglian
Collaborator

winglian commented Mar 3, 2026

run-slow: afmoe

Collaborator

@winglian left a comment

lgtm!

@winglian winglian requested a review from ArthurZucker March 3, 2026 17:13
@winglian
Collaborator

winglian commented Mar 4, 2026

Confirmed that the model trains in axolotl and loads experts as expected:

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> model_id = "arcee-ai/Trinity-Nano-Base"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16, device_map="auto")
Loading weights: 100%|██████████| 1003/1003 [00:01<00:00, 948.02it/s, Materializing param=model.norm.weight]
>>> messages = [{"role":"user","content":"tell me about the culinary holy trinity"}]
>>> inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
>>> outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
user
tell me about the culinary holy trinity
assistant
The culinary holy trinity refers to a combination of three key ingredients that are essential in many dishes. The trinity varies depending on the cuisine, but a common trinity includes onions, bell peppers, and celery. This trinity is used in dishes like Cajun and Creole cuisine, where it forms the base for many sauces and stews.
>>> model.model.layers[2]
AfmoeDecoderLayer(
	(self_attn): AfmoeAttention(
		(q_proj): Linear(in_features=1024, out_features=1024, bias=False)
		(k_proj): Linear(in_features=1024, out_features=256, bias=False)
		(v_proj): Linear(in_features=1024, out_features=256, bias=False)
		(o_proj): Linear(in_features=1024, out_features=1024, bias=False)
		(q_norm): AfmoeRMSNorm((128,), eps=1e-05)
		(k_norm): AfmoeRMSNorm((128,), eps=1e-05)
		(gate_proj): Linear(in_features=1024, out_features=1024, bias=False)
		(rotary_fn): Func()
	)
	(input_layernorm): AfmoeRMSNorm((1024,), eps=1e-05)
	(post_attention_layernorm): AfmoeRMSNorm((1024,), eps=1e-05)
	(pre_mlp_layernorm): AfmoeRMSNorm((1024,), eps=1e-05)
	(post_mlp_layernorm): AfmoeRMSNorm((1024,), eps=1e-05)
	(mlp): AfmoeMoE(
		(router): AfmoeTokenChoiceRouter(
			(gate): Linear(in_features=1024, out_features=128, bias=False)
		)
		(shared_experts): AfmoeMLP(
			(gate_proj): Linear(in_features=1024, out_features=256, bias=False)
			(up_proj): Linear(in_features=1024, out_features=256, bias=False)
			(down_proj): Linear(in_features=256, out_features=1024, bias=False)
			(act_fn): SiLUActivation()
		)
		(experts): AfmoeExperts(
			(act_fn): SiLUActivation()
		)
	)
)
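Note that `AfmoeExperts` showing only `act_fn` in the printout above is expected with the v5-style layout: the expert weights live in stacked `nn.Parameter` tensors rather than per-expert `nn.Linear` submodules, so they don't appear in the module repr. A rough sketch of that layout (hypothetical names and toy sizes, not the actual implementation):

```python
import torch
import torch.nn as nn


class FusedExperts(nn.Module):
    """Illustrative fused-experts layout: weights for all experts are
    stacked into single Parameter tensors, so repr() lists only act_fn."""

    def __init__(self, num_experts: int, hidden_size: int, intermediate_size: int):
        super().__init__()
        # (num_experts, hidden, 2 * intermediate): gate and up projections fused.
        self.gate_up_proj = nn.Parameter(
            torch.randn(num_experts, hidden_size, 2 * intermediate_size) * 0.02
        )
        self.down_proj = nn.Parameter(
            torch.randn(num_experts, intermediate_size, hidden_size) * 0.02
        )
        self.act_fn = nn.SiLU()

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        # x: (tokens, hidden); run the given tokens through one expert.
        gate_up = x @ self.gate_up_proj[expert_idx]
        gate, up = gate_up.chunk(2, dim=-1)
        return (self.act_fn(gate) * up) @ self.down_proj[expert_idx]


experts = FusedExperts(num_experts=4, hidden_size=8, intermediate_size=16)
out = experts(torch.randn(3, 8), expert_idx=1)
```

Because `gate_up_proj` and `down_proj` are raw Parameters, `print(experts)` shows only the activation, just as in the Trinity Nano dump.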

Collaborator

@ArthurZucker left a comment

LGTM, just let's leverage modular; in that case, the MoE is standard and can be inherited!

return final_hidden_states


class AfmoeMoE(nn.Module):
Collaborator

Pretty sure you can now inherit this from another class! Can you try? 🤗
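The modular pattern suggested here boils down to plain class inheritance: a `modular_*.py` class subclasses an existing model's standard MoE block and overrides nothing (or only the pieces that differ), and the full modeling file is then generated from it. A dependency-free sketch of the shape of it, where `SparseMoeBlock` is a stand-in for a real block from another model, not an actual transformers class:

```python
class SparseMoeBlock:
    """Stand-in for an existing, standard MoE block from another model."""

    def __init__(self, num_experts: int, top_k: int):
        self.num_experts = num_experts
        self.top_k = top_k


class AfmoeSparseMoeBlock(SparseMoeBlock):
    """Inherits the standard behavior wholesale; only model-specific
    differences (if any) would be overridden here."""
    pass


block = AfmoeSparseMoeBlock(num_experts=128, top_k=8)
```

The benefit is that bug fixes in the shared block propagate to every model that inherits it instead of being re-implemented per model.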

Collaborator

@ArthurZucker Should this also be named AfmoeSparseMoeBlock for consistency?

Collaborator

Yep, perfect.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
