Qwen3.5 OPD training with multimodal data: vLLM tokens do not match the model's #9414

### Checklist

- [x] I have searched existing issues, and this is a new bug report.

### Bug Description

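I am using the 4.3 dev version of ms-swift for OPD (on-policy distillation) training. The goal is to use Qwen3.5-35B-A3B as the teacher and distill its capabilities into a smaller Qwen3.5 model. When neither the teacher nor the student uses vLLM, training works and the loss starts at around 0.4, but training is very slow. With vLLM enabled and text-only samples, training also works.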

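Bug 1: When vLLM is used and the training data includes multimodal samples, the loss is very high: it reaches 10, and on some steps exceeds 20. The number of tokens returned by the teacher does not match the number in the rollout. (This diagnosis comes from Claude: after adding a lot of logging, its conclusion was that for multimodal inputs the vLLM token sequence does not line up with the model's.)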
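
To make the mismatch in Bug 1 easier to pin down, here is a minimal sketch of the kind of check that can be run on two logged token-ID sequences. Everything in it is hypothetical: the helper names `first_mismatch` and `collapse_runs`, the `IMAGE_PAD` id, and the sample ids are made up for illustration and are not part of ms-swift or vLLM. It also assumes one possible failure pattern (one side expands an image placeholder into several tokens while the other keeps a single one), which has not been confirmed as the cause here.

```python
from itertools import groupby
from typing import List, Optional, Sequence, Tuple


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where a and b differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def collapse_runs(ids: Sequence[int]) -> List[Tuple[int, int]]:
    """Run-length encode the ids as [(token_id, run_length), ...]."""
    return [(tok, sum(1 for _ in grp)) for tok, grp in groupby(ids)]


# Hypothetical data: 9 stands in for an image placeholder token id.
IMAGE_PAD = 9
trainer_ids = [1, 2, IMAGE_PAD, IMAGE_PAD, IMAGE_PAD, IMAGE_PAD, 3, 4]
server_ids = [1, 2, IMAGE_PAD, 3, 4]

print("lengths:", len(trainer_ids), len(server_ids))
print("first mismatch at index:", first_mismatch(trainer_ids, server_ids))

# If the sequences agree once repeated tokens are collapsed, the difference
# is only in how many times a token (here the placeholder) is repeated.
same_after_collapse = (
    [tok for tok, _ in collapse_runs(trainer_ids)]
    == [tok for tok, _ in collapse_runs(server_ids)]
)
print("identical after collapsing runs:", same_after_collapse)
```

If the sequences agree after collapsing runs, the extra length comes only from repeated tokens. If they still differ, the divergence index shows where to look in the chat template or preprocessing output.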

Bug 2: GKD training with Qwen3.5 does not work with DeepSpeed ZeRO-3.

### How to Reproduce

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
TRITON_CACHE_DIR=/tmp/.triton_cache/ \
MODELSCOPE_CACHE=/tmp/msscope/ \
SWIFT_CACHE_DIR=/tmp/swift_cache/ \
NPROC_PER_NODE=8 \
nohup swift rlhf \
    --rlhf_type gkd \
    --model /tmp/Qwen3.5-4B/ \
    --teacher_model_server http://0.0.0.0:8000/ \
    --gkd_logits_topk 256 \
    --optim adamw_8bit \
    --tuner_type lora \
    --lora_rank 256 \
    --lora_alpha 256 \
    --torch_dtype bfloat16 \
    \
    --dataset /tmp/combined_train_predictions/train_text_only.jsonl /tmp/combined_train_predictions/train_with_images.jsonl \
    --split_dataset_ratio 0.001 \
    --do_eval false \
    --max_length 8190 \
    --truncation_strategy delete \
    --freeze_llm false \
    --freeze_aligner true \
    --freeze_vit true \
    --torch_empty_cache_steps 1 \
    \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 3e-5 \
    --warmup_steps 200 \
    --max_grad_norm 1.0 \
    \
    --beta 1.0 \
    --temperature 1.0 \
    --seq_kd false \
    --lmbda 1.0 \
    --sleep_level 1 \
    \
    --eval_strategy steps \
    --eval_steps 200 \
    --save_strategy steps \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --log_completions true \
    --output_dir /tmp/ \
    --report_to tensorboard \
    --save_only_model true \
    \
    --deepspeed zero2 \
    --dataloader_num_workers 4

### Additional Information

_No response_
