Releases: modelscope/twinkle
v0.4.0
Highlights
- Initial DeepSeek V4 support, covering Flash FSDP2 + EP training and DeepSeek V4 tool-call parsing and cleanup in #190 and #218
- Expand Qwen3.5 training with padding-free / packed-sequence support and Qwen3.5 MoE GatedDeltaNet sequence-parallel support in #186 and #222
- Add Gemma 4 multimodal training support in #199
- Strengthen LoRA training with rsLoRA for Multi-LoRA, FSDP2 support for Multi-LoRA SFT, and Expert Parallelism LoRA SFT examples for DeepSeek V4 and Qwen3.5 MoE in #187, #155, and #198
- Improve NPU acceleration and stability with fused operators, Qwen3.5 FLA patches, Group MatMul EP scoping, and sequence-parallel compatibility fixes in #194, #204, #205, #206, and #208
New Features
- Add padding-free and packed-sequence support for Qwen3.5 by @meichangsu1 in #186
- Add rsLoRA support to Multi-LoRA by @xichengpro in #187
- Add FSDP2 support for Multi-LoRA SFT by @kevssim in #155
- Add DeepSeek V4 Flash FSDP2 + EP training support by @meichangsu1 in #190
- Add NPU fused operators: RMSNorm, RoPE, SwiGLU, and SDPA by @ys2025-AI in #194
- Add multi-turn rollout support by @tastelikefeet in #193
- Add support for client-specified server-side checkpoint save paths by @vx120 in #196
- Add LoRA SFT support for Expert Parallelism, with DeepSeek V4 and Qwen3.5 MoE examples by @kevssim in #198
- Add Qwen3.5 NPU FLA and fused-operator patches by @ys2025-AI in #204
- Add LoRA capacity query support by @kevssim in #201
- Optimize Native FSDP memory_efficient_init weight loading for multi-node EP/FSDP jobs and add multi-node scripts by @meichangsu1 in #207
- Add Gemma 4 support by @EvineR666 in #199
- Add DeepSeek V4 tool-call parsing and cleanup support by @meichangsu1 in #218
- Add Gemma 4 12B cookbook by @EvineR666 in #219
- Add automatic device detection by @vx120 in #220
- Add Qwen3.5 MoE GatedDeltaNet sequence-parallel support by @meichangsu1 in #222
- Refactor server configuration and observability by @Yunnglin in #210
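For readers comparing the rsLoRA entry (#187) with standard LoRA, the difference is in how the low-rank update is scaled: standard LoRA multiplies it by alpha / r, while rank-stabilized LoRA uses alpha / sqrt(r). The sketch below is plain Python for illustration only. The helper name lora_scale and its arguments are hypothetical and are not part of the Twinkle API.

```python
import math


def lora_scale(alpha: float, rank: int, use_rslora: bool = False) -> float:
    """Return the multiplier applied to the low-rank update B @ A.

    Hypothetical helper for illustration; not a Twinkle function.
    """
    if rank <= 0:
        raise ValueError("rank must be positive")
    # rsLoRA divides by sqrt(rank) so the update does not shrink as fast
    # when the rank grows; standard LoRA divides by rank.
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


for r in (8, 64, 256):
    print(r, lora_scale(16, r), lora_scale(16, r, use_rslora=True))
```

With alpha fixed at 16, the standard scale at rank 64 is 0.25, while the rsLoRA scale is 2.0, which is why rsLoRA is usually paired with higher ranks.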
Bug Fixes
- Fix cache reset behavior for multimodal models by @hjh0119 in #189
- Fix Qwen3.5 GatedDeltaNet padding-free compatibility and create_causal_mask compatibility after cache_positions removal in transformers >5.3.0 by @meichangsu1 in #202
- Fix transformers 5.9 AttentionMask wrapper compatibility in sequence parallel by @ys2025-AI in #206
- Fix SP path overriding the NPU-patched chunk_gated_delta_rule by @ys2025-AI in #208
- Fix NPU Group MatMul patch scope so it only applies in EP scenarios by @0hujun in #205
- Fix adapter saving to use the MultiLora state dict by @meichangsu1 in #215
New Contributors
- @tpx818 made their first contribution in #65
- @wangxingjun778 made their first contribution in #68
- @hzher made their first contribution in #92
- @xichengpro made their first contribution in #123
- @vx120 made their first contribution in #118
- @0hujun made their first contribution in #183
- @a550580874 made their first contribution in #176
- @ys2025-AI made their first contribution in #194
- @EvineR666 made their first contribution in #199
Full Changelog: https://github.com/modelscope/twinkle/commits/v0.4.0
v0.3.0
New Features
- Full support for the padding_free parameter, which can be used in training types such as SFT, DPO, and GRPO. It takes effect when padding_free=True is passed to the InputProcessor constructor.
- Support for resume-from-checkpoint (added in #135).
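As background on what padding-free batching means, the sketch below packs variable-length sequences into one flat sequence instead of padding each to the longest length. It is a generic illustration in plain Python, not Twinkle's implementation. The function pack_sequences and its return layout are assumptions made for this example.

```python
from typing import List, Tuple


def pack_sequences(seqs: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Concatenate variable-length sequences without padding tokens.

    Hypothetical helper for illustration; not a Twinkle function.
    Returns (input_ids, position_ids, cu_seqlens). cu_seqlens holds the
    cumulative sequence boundaries, which an attention kernel can use to
    keep attention inside each original sequence.
    """
    input_ids: List[int] = []
    position_ids: List[int] = []
    cu_seqlens: List[int] = [0]
    for seq in seqs:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))  # positions restart per sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return input_ids, position_ids, cu_seqlens


batch = [[101, 7, 8, 102], [101, 9, 102]]
ids, pos, cu = pack_sequences(batch)
print(ids)  # [101, 7, 8, 102, 101, 9, 102]
print(pos)  # [0, 1, 2, 3, 0, 1, 2]
print(cu)   # [0, 4, 7]
```

A padded batch of these two sequences would hold 8 token slots, one of them wasted. The packed form holds exactly 7.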
Bug Fixes
- Fixed a training issue caused by mismatched LoRA dtype and model dtype.
- Fixed support for the NPU GEMM operator.
- Fixed an error where Megatron gather failed when FSDP was enabled on NPU.
What's Changed
- Update docker file by @tastelikefeet in #180
- fix: model dtype is not same as lora dtype in FSDP train by @0hujun in #183
- fix: when setting fsdp size unuse megatron for gather in npu by @0hujun in #185
- npu gemm patch by @a550580874 in #176
- Support dpo/grpo/gkd/sft padding_free by @tastelikefeet in #181
- [feat] Resume from ckpt by @kevssim in #135
New Contributors
- @0hujun made their first contribution in #183
- @a550580874 made their first contribution in #176
Full Changelog: v0.2.1...v0.3.0
v0.2.1
New Features
- Added Qwen/Qwen3.6-27B to the official ModelScope service. For details, see: https://www.modelscope.cn/organization/twinkle-kit
Bug Fixes
- Fixed an issue with incorrect expert weight synchronization.
- Fixed a training collapse issue with GRPO MoE in multi-LoRA scenarios.
- Fixed a sequence splitting issue for multimodal inputs.
- Fixed abnormal server behavior when pp > 1 and tp > 1.
- Fixed multiple incorrect remote_function implementations.
- Fixed a blocking issue caused by model upload and model training sharing the same pipeline on the server.
- Fixed several bugs in modules such as the Sampler.
What's Changed
- add base_layer suffix for expert weights by @hjh0119 in #159
- update cookbook and doc 0415 by @Yunnglin in #157
- Docs support Q3.6 by @tastelikefeet in #158
- Fix multi lora device by @tastelikefeet in #160
- Fix MoE multi-lora training by @tastelikefeet in #161
- Fix model id and upload to hub by @Yunnglin in #162
- Add notebooks by @tastelikefeet in #164
- Npu adapt megatron by @addsubmuldiv in #153
- Fix save by @tastelikefeet in #165
- A small refactor by @tastelikefeet in #166
- A small refactor, move 4d mask to processor by @tastelikefeet in #167
- Fix some potential bugs by @tastelikefeet in #168
- Fix some bugs by @tastelikefeet in #169
- fix mm tokentypeids splitting by @tastelikefeet in #170
- Fix model pp > 1 and tp > 1 errors by @Yunnglin in #171
- Fix moe weight sync by @tastelikefeet in #172
- update notebooks by @Yunnglin in #174
- Modify remote_function decorators in multi_lora_transformers by @xichengpro in #173
- support cp ,fix qwen3.5 gdn sp by @meichangsu1 in #138
- support qwen3.6 grpo & in-place add lora by @hjh0119 in #163
- Fix multi lora by @tastelikefeet in #177
- support q3.6-27b by @tastelikefeet in #178
- Fix sampler and grpo by @Yunnglin in #179
Full Changelog: v0.2.0...v0.2.1
v0.2.0
New Features
- Refactored the service layer; the multi-tenant service now supports both the tinker and twinkle client syntaxes.
- Added support for GKD and on-policy distillation; see the cookbook.
- Replaced the underlying Megatron backend with the mcore_bridge library, including support for the corresponding multimodal training.
- Added support for the DPO algorithm; see the cookbook.
- Added support for multimodal task training on the Qwen3.5 series.
- Added a server-side Dockerfile.
Bug Fixes
- Many bugs were fixed in 0.2.0; see the list of changes below.
What's Changed
- Fix tinker loss device mismatch by @addsubmuldiv in #115
- Refact server by @Yunnglin in #111
- [Fix] EP+FSDP checkpoint save for MoE expert parameters by @kevssim in #116
- Support GKD and on-policy distillation by @tastelikefeet in #112
- Support patcher on samplers by @tastelikefeet in #119
- fix vllmsampler client by @tastelikefeet in #122
- [Fix] Prevent client from importing ray via twinkle.server.common.serialize by @xichengpro in #123
- update Qwen3.5 grpo demo by @hjh0119 in #124
- Fix mm server by @tastelikefeet in #125
- fix tp get logps by @hjh0119 in #126
- fix serve_multiplexed_model_id and mm data process by @Yunnglin in #120
- [feat] fsdp2 memory_efficient_init by @kevssim in #117
- fix megatron weights sync by @hjh0119 in #128
- Support DPO by @tastelikefeet in #130
- Fix npu qwen3moe grpo by @vx120 in #118
- fix dpo with lazy_dataset by @tastelikefeet in #136
- support transformers multi-modal grpo by @hjh0119 in #131
- fix import by @kevssim in #137
- Refactor megatron to mcore_bridge by @tastelikefeet in #134
- Fix bugs by @tastelikefeet in #139
- Change online model to qwen3.5-27b by @tastelikefeet in #140
- Merge to main by @tastelikefeet in #141
- fix megatron multi-lora converter by @hjh0119 in #144
- Remerge release/0.2 to main by @tastelikefeet in #146
- support rl vit lora with vLLM by @hjh0119 in #147
- Add server metrics monitor and DPO client by @Yunnglin in #132
- Fix multi lora saving by @tastelikefeet in #148
- fix transformers model loading by @tastelikefeet in #150
- fix short math grpo cookbook by @Yunnglin in #149
- fix docker file by @Yunnglin in #151
- fix cookbook by @tastelikefeet in #152
- fix tensor collect by @Yunnglin in #154
- fix multi lora training by @tastelikefeet in #156
New Contributors
- @xichengpro made their first contribution in #123
- @vx120 made their first contribution in #118
Full Changelog: v0.1.3...v0.2.0
v0.1.3
New Features
- Added a shell installation script for client mode and improved the documentation.
- Added support for EP + FSDP sharding with transformers models.
Bug Fixes
- Fixed a failure when loading local datasets.
- Fixed http_options being passed to the model by mistake when starting in server mode.
What's Changed
- Fix loading local datasets by @tastelikefeet in #108
- [fix] http_options leaking to model init & NPU tensor serialization failure over HTTP by @kevssim in #109
- Fix docs and add new start scripts by @tastelikefeet in #113
- [feat]support ep_fsdp by @kevssim in #71
Full Changelog: v0.1.2...v0.1.3
v0.1.2
New Features
- Support multimodal training of Qwen3.5 series transformers models, including images and videos
- Support batched=True in dataset preprocessing to improve speed
Bug Fixes
- Fix a hang during weight synchronization on NPU
What's Changed
- Update cookbok to qwen35 by @tastelikefeet in #98
- Support Qwen3.5 mm by @tastelikefeet in #100
- Support batched preprocessing by @tastelikefeet in #101
- fix video mm by @tastelikefeet in #105
- Fix GRPO weight-sync hangs and HCCL resource exhaustion on NPU by @addsubmuldiv in #102
- add new cookbook with qwen3.5 by @tastelikefeet in #106
- fix cookbook by @tastelikefeet in #107
Full Changelog: v0.1.1...v0.1.2
v0.1.1
Twinkle 0.1.1 Release
- Support dense models in the Qwen3.5 series, from Qwen3.5-2B to Qwen3.5-9B
Full Changelog: v0.1...v0.1.1
v0.1
Twinkle Framework Version 0.1 Released!
New Features
- 🎉 Full support for components including Dataset, DataLoader, Loss, Transformers and Megatron models, Advantage, Sampler, and more
- 🎉 Support for multiple training stages such as PT, SFT, and RL, with various training modes including single-GPU, multi-node multi-GPU, Ray, and Client-Server
- 🎉 First version of multi-tenant shared training is now supported, with the server-side implementation fully open-sourced. Multi-replica scalable deployment is implemented using Ray Serve, with support for sticky routing
- 🎉 An online service is now available on the ModelScope official website, where users can train Qwen/Qwen3-30B-A3B-Instruct-2507 for free and push models to ModelHub
What's Changed
- Squash to main by @tastelikefeet in #46
- rename cmb by @tpx818 in #65
- docs: update README and remove ulysses_size from ep_fsdp_qwen3_moe.py by @meichangsu1 in #64
- add contrbutors by @yingdachen in #66
- fix lora fetch by @tastelikefeet in #67
- Update documentation links in README.md by @wangxingjun778 in #68
- Fix router by @tastelikefeet in #69
- Fix doc links and add tests by @tastelikefeet in #70
- Refactor code by @tastelikefeet in #72
- Fix compat tinker and update doc by @Yunnglin in #73
- [compat] gpt_bridge compat transformers_5 by @Jintao-Huang in #75
- Fix server state adapter limit by @Yunnglin in #74
- Fix some bugs by @tastelikefeet in #77
- [model] support Qwen3.5 series models by @hjh0119 in #76
- fix single gpu bug by @tastelikefeet in #78
- [bugfix] fix dense model get layer spec by @hjh0119 in #80
- fix grad norm bug by @tastelikefeet in #81
- Update readme by @yingdachen in #83
- Add custom route for sticky session by @Yunnglin in #82
- [bugfix] fix 4d attention mask device by @hjh0119 in #85
- add more comment for node resouces by @tastelikefeet in #79
- Update doc and fix bugs by @tastelikefeet in #84
- Fix logps by @tastelikefeet in #86
- recover cp sequence before loss by @hjh0119 in #88
- [bugfix] fix logps with PP by @hjh0119 in #89
- Fix megatron loss by @tastelikefeet in #90
- Dev feature by @hzher in #92
- Fix proxy by @Yunnglin in #87
- fix TEGroupedLinear by @tastelikefeet in #94
- [bugfix] fix grpo loss by @hjh0119 in #93
- fix numpy version by @tastelikefeet in #95
- [bugfix] fix contiguous by @hjh0119 in #96
- Add a sample script by @tastelikefeet in #97
New Contributors
- @tastelikefeet made their first contribution in #46
- @tpx818 made their first contribution in #65
- @wangxingjun778 made their first contribution in #68
- @hzher made their first contribution in #92
Full Changelog: https://github.com/modelscope/twinkle/commits/v0.1