Description
Environment
- Repo: Megatron‑Bridge (main, as of Jan 2026)
- Model/recipe: Qwen‑VL 32B Instruct (GPTModelProvider via examples/recipes/qwen_vl/finetune_qwen_vl.py)
- Data: Energon WDS, no packing (MBS=1)
- Parallelism: TP=8, PP=1, CP=3, sequence_parallel=True
- model.seq_length=24576 (divisible by lcm(TP,CP)=24)
- Container: nvcr.io/nvidia/pytorch:25.11-py3
- Transformers, TE, Megatron‑Core installed via uv sync --all-extras --all-groups
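The divisibility claim above can be sanity-checked with a few lines. This is only a sketch: the `2 * CP` constraint in the last assertion is an assumption based on Megatron-Core's usual load-balanced context-parallel chunking, not something verified against this exact version.

```python
import math

TP, CP = 8, 3

# lcm(TP, CP) = 24, as stated above; both the seq_length from the
# environment bullet (24576) and the one in the repro command (98304)
# satisfy it.
assert math.lcm(TP, CP) == 24
for seq_length in (24576, 98304):
    assert seq_length % math.lcm(TP, CP) == 0
    # Assumption: ring/hybrid context parallelism typically also needs
    # divisibility by 2 * CP for balanced head/tail chunk assignment.
    assert seq_length % (2 * CP) == 0

print("divisibility checks passed")
```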
Repro
- Command (per node; 3 nodes total):
/opt/venv/bin/python -m torch.distributed.run --nproc_per_node=8 --nnodes=3 --node_rank=0 --master_addr=172.31.36.153 --master_port=29500 \
  examples/recipes/qwen_vl/finetune_qwen_vl.py --recipe qwen3_vl_8b_finetune_config \
  --hf_path /workspace/models/Qwen3-VL-32B-Instruct \
  --pretrained-checkpoint /workspace/models/Qwen3-VL-32B-Instruct-Nemo \
  --dataset-type energon --data-path /workspace/nvme_data/qwen3_vl \
  model.tensor_model_parallel_size=8 model.pipeline_model_parallel_size=1 \
  model.context_parallel_size=3 model.hierarchical_context_parallel_sizes=[3] \
  train.micro_batch_size=1 train.global_batch_size=18 \
  dataset.micro_batch_size=1 dataset.global_batch_size=18 \
  model.seq_length=98304 model.freeze_language_model=false model.freeze_vision_model=false \
  train.train_iters=112632 train.eval_interval=5000 checkpoint.save_interval=10000 \
  optimizer.lr=5e-6 optimizer.min_lr=1e-7 optimizer.clip_grad=0.5 \
  scheduler.lr_warmup_iters=10000 scheduler.lr_decay_iters=112632 \
  checkpoint.save=/workspace/nvme_data/checkpoints/qwen3_vl_32B-instruct/0122 \
  ddp.data_parallel_sharding_strategy=optim_grads_params ddp.use_distributed_optimizer=true \
  model.attention_backend=flash model.sequence_parallel=true \
  model.recompute_granularity=full model.recompute_method=uniform model.recompute_num_layers=1 \
  vision_pre_debug=true target_height=1440 target_width=800 keep_aspect_ratio_and_padding=false \
  logger.filter_warnings=false logger.logging_level=20 logger.log_interval=1
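For reference, the parallel layout implied by the launch flags works out as follows (a sketch using the standard Megatron decomposition world = TP × PP × CP × DP; numbers are taken directly from the command line):

```python
# Layout arithmetic for the repro command above.
nnodes, nproc_per_node = 3, 8
world_size = nnodes * nproc_per_node            # 3 nodes x 8 GPUs = 24 ranks

TP, PP, CP = 8, 1, 3
DP = world_size // (TP * PP * CP)               # 24 / (8*1*3) = 1 data-parallel replica

global_batch, micro_batch = 18, 1
grad_accum_steps = global_batch // (DP * micro_batch)   # 18 micro-batches per step

print(world_size, DP, grad_accum_steps)         # 24 1 18
```

So every optimizer step runs 18 gradient-accumulation micro-batches on a single data-parallel replica, which matches the `consumed samples` increments of 18 per iteration in the logs below.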
Observed
- With CP=1, the LM loss baseline is normal:
[2026-01-23 11:58:00] iteration 68/ 112632 | consumed samples: 1224 | elapsed time per iteration (ms): 6182.7 | learning rate: 3.400000E-08 | global batch size: 18 | lm loss: 1.305835E+00 | loss scale: 1.0 | grad norm: 43.360 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:07] iteration 69/ 112632 | consumed samples: 1242 | elapsed time per iteration (ms): 6978.3 | learning rate: 3.450000E-08 | global batch size: 18 | lm loss: 1.143949E+00 | loss scale: 1.0 | grad norm: 116.335 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:13] iteration 70/ 112632 | consumed samples: 1260 | elapsed time per iteration (ms): 5880.9 | learning rate: 3.500000E-08 | global batch size: 18 | lm loss: 1.225305E+00 | loss scale: 1.0 | grad norm: 433.015 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:19] iteration 71/ 112632 | consumed samples: 1278 | elapsed time per iteration (ms): 6621.9 | learning rate: 3.550000E-08 | global batch size: 18 | lm loss: 1.121844E+00 | loss scale: 1.0 | grad norm: 61.098 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:25] iteration 72/ 112632 | consumed samples: 1296 | elapsed time per iteration (ms): 5969.4 | learning rate: 3.600000E-08 | global batch size: 18 | lm loss: 1.276615E+00 | loss scale: 1.0 | grad norm: 108.640 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:33] iteration 73/ 112632 | consumed samples: 1314 | elapsed time per iteration (ms): 7592.6 | learning rate: 3.650000E-08 | global batch size: 18 | lm loss: 1.106227E+00 | loss scale: 1.0 | grad norm: 259.462 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:39] iteration 74/ 112632 | consumed samples: 1332 | elapsed time per iteration (ms): 6059.9 | learning rate: 3.700000E-08 | global batch size: 18 | lm loss: 1.142453E+00 | loss scale: 1.0 | grad norm: 111.241 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:47] iteration 75/ 112632 | consumed samples: 1350 | elapsed time per iteration (ms): 7477.5 | learning rate: 3.750000E-08 | global batch size: 18 | lm loss: 1.152739E+00 | loss scale: 1.0 | grad norm: 105.764 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:58:54] iteration 76/ 112632 | consumed samples: 1368 | elapsed time per iteration (ms): 7182.7 | learning rate: 3.800000E-08 | global batch size: 18 | lm loss: 1.058158E+00 | loss scale: 1.0 | grad norm: 46.327 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-23 11:59:00] iteration 77/ 112632 | consumed samples: 1386 | elapsed time per iteration (ms): 6576.0 | learning rate: 3.850000E-08 | global batch size: 18 | lm loss: 1.156631E+00 | loss scale: 1.0 | grad norm: 87.955 | number of skipped iterations: 0 | number of nan iterations: 0 |
- With CP=3, the LM loss is ≈3× larger:
[2026-01-24 00:12:37] iteration 2/ 112632 | consumed samples: 36 | elapsed time per iteration (ms): 26896.3 | learning rate: 1.000000E-09 | global batch size: 18 | lm loss: 4.303481E+00 | loss scale: 1.0 | grad norm: 246.339 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:13:00] iteration 3/ 112632 | consumed samples: 54 | elapsed time per iteration (ms): 23193.8 | learning rate: 1.500000E-09 | global batch size: 18 | lm loss: 4.540686E+00 | loss scale: 1.0 | grad norm: 3974.289 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:13:23] iteration 4/ 112632 | consumed samples: 72 | elapsed time per iteration (ms): 22700.9 | learning rate: 2.000000E-09 | global batch size: 18 | lm loss: 3.649251E+00 | loss scale: 1.0 | grad norm: 1125.224 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:13:46] iteration 5/ 112632 | consumed samples: 90 | elapsed time per iteration (ms): 22594.7 | learning rate: 2.500000E-09 | global batch size: 18 | lm loss: 4.623797E+00 | loss scale: 1.0 | grad norm: 148.790 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:14:10] iteration 6/ 112632 | consumed samples: 108 | elapsed time per iteration (ms): 24711.0 | learning rate: 3.000000E-09 | global batch size: 18 | lm loss: 4.260278E+00 | loss scale: 1.0 | grad norm: 1165.394 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:14:31] iteration 7/ 112632 | consumed samples: 126 | elapsed time per iteration (ms): 20853.8 | learning rate: 3.500000E-09 | global batch size: 18 | lm loss: 5.881763E+00 | loss scale: 1.0 | grad norm: 633.288 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:14:54] iteration 8/ 112632 | consumed samples: 144 | elapsed time per iteration (ms): 22760.5 | learning rate: 4.000000E-09 | global batch size: 18 | lm loss: 3.615558E+00 | loss scale: 1.0 | grad norm: 3085.513 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:15:15] iteration 9/ 112632 | consumed samples: 162 | elapsed time per iteration (ms): 20916.0 | learning rate: 4.500000E-09 | global batch size: 18 | lm loss: 3.936383E+00 | loss scale: 1.0 | grad norm: 511.562 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:15:37] iteration 10/ 112632 | consumed samples: 180 | elapsed time per iteration (ms): 21995.7 | learning rate: 5.000000E-09 | global batch size: 18 | lm loss: 5.027537E+00 | loss scale: 1.0 | grad norm: 1450.643 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:15:58] iteration 11/ 112632 | consumed samples: 198 | elapsed time per iteration (ms): 21033.6 | learning rate: 5.500000E-09 | global batch size: 18 | lm loss: 4.097773E+00 | loss scale: 1.0 | grad norm: 196.101 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:16:19] iteration 12/ 112632 | consumed samples: 216 | elapsed time per iteration (ms): 21374.8 | learning rate: 6.000000E-09 | global batch size: 18 | lm loss: 4.429692E+00 | loss scale: 1.0 | grad norm: 1511.395 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:16:41] iteration 13/ 112632 | consumed samples: 234 | elapsed time per iteration (ms): 21179.5 | learning rate: 6.500000E-09 | global batch size: 18 | lm loss: 3.719966E+00 | loss scale: 1.0 | grad norm: 782.273 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-01-24 00:17:02] iteration 14/ 112632 | consumed samples: 252 | elapsed time per iteration (ms): 21538.9 | learning rate: 7.000000E-09 | global batch size: 18 | lm loss: 4.510926E+00 | loss scale: 1.0 | grad norm: 714.650 | number of skipped iterations: 0 | number of nan iterations: 0 |
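One hypothesis for the ≈3× factor (not verified against the Megatron-Bridge/Megatron-Core loss-reduction code, purely an illustration): if each CP rank computes the mean loss over its local sequence chunk and the per-rank values are then summed (all-reduce SUM) without dividing by CP, the reported loss scales linearly with CP. A toy sketch:

```python
import random

random.seed(0)
CP = 3
# Per-token losses for one 24-token sequence, split into CP equal chunks
# (stand-ins for the sequence shards held by the context-parallel ranks).
per_token_loss = [random.random() for _ in range(24)]
n = len(per_token_loss) // CP
chunks = [per_token_loss[i * n:(i + 1) * n] for i in range(CP)]

# Correct reduction: one mean over all tokens of the sequence.
correct = sum(per_token_loss) / len(per_token_loss)

# Hypothetical buggy reduction: each rank reports its *local* mean, and
# the ranks' values are summed without dividing by CP.
buggy = sum(sum(c) / len(c) for c in chunks)

print(round(buggy / correct, 6))  # ≈ 3.0 with equal-sized chunks
```

With equal chunk sizes the ratio is exactly CP, matching the observed jump from ~1.1–1.3 (CP=1) to ~3.6–5.9 (CP=3). Checking how Megatron-Bridge averages the LM loss across the context-parallel group would confirm or rule this out.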