【NPU】Qwen3.5 & MindSpeed 0.16.0: GKD training raises AssertionError #9410

@llan-ml

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

Following PR #9382, we manually updated MindSpeed to 0.16.0 (https://gitcode.com/Ascend/MindSpeed/tree/core_r0.16.0) and ran GKD training with Megatron + MindSpeed.

With ms-swift 4.2.1, model construction fails with an AssertionError:

[rank0]: AssertionError
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/cli/_megatron/rlhf.py", line 7, in <module>
[rank3]:     megatron_rlhf_main()
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/pipelines/train/rlhf.py", line 73, in megatron_rlhf_main
[rank3]:     return MegatronRLHF(args).main()
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/pipelines/base.py", line 52, in main
[rank3]:     result = self.run()
[rank3]:              ^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/pipelines/train/sft.py", line 70, in run
[rank3]:     trainer = self.prepare_trainer()
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/pipelines/train/rlhf.py", line 34, in prepare_trainer
[rank3]:     return trainer_cls(args, self.template, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/trainers/gkd_trainer.py", line 65, in __init__
[rank3]:     super().__init__(args, template)
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/trainers/base.py", line 70, in __init__
[rank3]:     self.prepare_model()
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/trainers/gkd_trainer.py", line 86, in prepare_model
[rank3]:     super().prepare_model()
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/trainers/rlhf_mixin.py", line 30, in prepare_model
[rank3]:     super().prepare_model()
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/trainers/base.py", line 186, in prepare_model
[rank3]:     self.unwrapped_models = get_mcore_model(args, self.template.config)
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/ms-swift-v4.2.1/swift/megatron/model/utils.py", line 82, in get_mcore_model
[rank3]:     models = _get_mcore_model(config)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/mcore-bridge/src/mcore_bridge/model/register.py", line 178, in get_mcore_model
[rank3]:     model = loader.build_model(pre_process=pre_process, post_process=post_process)
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/ma-user/work/mcore-bridge/src/mcore_bridge/model/gpts/qwen3_next_gdn.py", line 141, in build_model
[rank3]:     assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma')
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: AssertionError
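
The assert at line 141 of qwen3_next_gdn.py carries no message, so the traceback does not say which norm class ended up as out_norm or what attributes it exposes. Below is a minimal sketch of a diagnostic helper that could be called right before such a check to report that information. The helper name `describe_missing_attr` and both norm classes are hypothetical stand-ins for illustration; they are not part of mcore-bridge, MindSpeed, or Megatron.

```python
# Hypothetical stand-ins for a norm layer; NOT real MindSpeed/Megatron classes.
class NormWithoutFlag:
    eps = 1e-6


class NormWithFlag:
    eps = 1e-6
    zero_centered_gamma = True


def describe_missing_attr(obj, attr):
    """Return None if obj has attr, else a message naming its class and public attributes."""
    if hasattr(obj, attr):
        return None
    cls = type(obj)
    public = sorted(name for name in dir(obj) if not name.startswith("_"))
    return (
        f"{cls.__module__}.{cls.__qualname__} has no attribute {attr!r}; "
        f"public attributes: {public}"
    )


if __name__ == "__main__":
    # Prints None for the class that has the flag, and a diagnostic for the one that lacks it.
    print(describe_missing_attr(NormWithFlag(), "zero_centered_gamma"))
    print(describe_missing_attr(NormWithoutFlag(), "zero_centered_gamma"))
```

Running something like this on `layer.self_attention.out_norm` in the failing environment would show which implementation of the norm is in use, which may help narrow down whether the MindSpeed 0.16.0 update changed it.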

How to Reproduce

Run Qwen3.5 GKD/OPSD training with Megatron + MindSpeed 0.16.0 on NPU.
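
Since the failure appears tied to a specific combination of versions, it may help to record what is actually installed in the failing environment. The sketch below uses the standard-library `importlib.metadata`. The distribution names passed in (`ms-swift`, `mindspeed`, `mcore-bridge`) are assumptions about how these packages are registered with pip; a source checkout that was never installed will be reported as not installed.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust them to match `pip list`.
    for name, ver in installed_versions(["ms-swift", "mindspeed", "mcore-bridge"]).items():
        print(f"{name}: {ver or 'not installed'}")
```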

Additional Information

No response

    Labels

    bug
