We are training GLM 4.5 Air using LoRA. Our configuration is below:
model: "zai-org/GLM-4.5-Air"
use_hf: true
merge_lora: true
save_safetensors: true
train_type: "lora" # lora or full
lora_rank: 32
lora_alpha: 32
lora_dropout: 0.05
target_modules: "all-linear"
finetune: true
lr: 1e-4
min_lr: 1e-5
lr_warmup_fraction: 0.05
clip_grad: 0.5
micro_batch_size: 1
global_batch_size: 64
adam_beta2: 0.999
weight_decay: 0.0
tensor_model_parallel_size: 4
context_parallel_size: 2
pipeline_model_parallel_size: 1
expert_model_parallel_size: 4
sequence_parallel: true
moe_permute_fusion: true
moe_grouped_gemm: true
moe_shared_expert_overlap: true
moe_aux_loss_coeff: 0
recompute_granularity: "full"
recompute_method: "uniform"
recompute_num_layers: 1
load_from_cache_file: true
dataloader_num_workers: 8
dataset_num_proc: 48
max_length: 32768  # sequence length
packing: false
num_train_epochs: 1
save_steps: 1000
no_save_optim: false
no_save_rng: false
attention_backend: "flash"
cross_entropy_loss_fusion: true
2026-05-18 16:23:39
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
    megatron_sft_main()
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 92, in megatron_sft_main
    return MegatronSft(args).main()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/swift/pipelines/base.py", line 52, in main
    result = self.run()
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 67, in run
    trainer.train(train_dataset, val_dataset)
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 643, in train
    self.save_checkpoint()
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 686, in save_checkpoint
    save_mcore_checkpoint(
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/megatron_lm_utils.py", line 281, in save_mcore_checkpoint
    async_save_request = dist_checkpointing.save(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/serialization.py", line 403, in save
    sharded_state_dict, state_dict = save_preprocess(
                                     ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/state_dict_utils.py", line 51, in save_preprocess
    determine_global_metadata(sharded_part)[1],
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/validation.py", line 558, in determine_global_metadata
    torch.distributed.all_gather_object(global_metadata, local_metadata)
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3268, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size, group)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3173, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '\x00'.
[rank127]: Traceback (most recent call last):
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank127]:     megatron_sft_main()
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 92, in megatron_sft_main
[rank127]:     return MegatronSft(args).main()
[rank127]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/pipelines/base.py", line 52, in main
[rank127]:     result = self.run()
[rank127]:              ^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 67, in run
[rank127]:     trainer.train(train_dataset, val_dataset)
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 643, in train
[rank127]:     self.save_checkpoint()
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 686, in save_checkpoint
[rank127]:     save_mcore_checkpoint(
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/megatron_lm_utils.py", line 281, in save_mcore_checkpoint
[rank127]:     async_save_request = dist_checkpointing.save(
[rank127]:                          ^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/serialization.py", line 403, in save
[rank127]:     sharded_state_dict, state_dict = save_preprocess(
[rank127]:                                      ^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/state_dict_utils.py", line 51, in save_preprocess
[rank127]:     determine_global_metadata(sharded_part)[1],
[rank127]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/validation.py", line 558, in determine_global_metadata
[rank127]:     torch.distributed.all_gather_object(global_metadata, local_metadata)
[rank127]:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank127]:     return func(*args, **kwargs)
[rank127]:            ^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3268, in all_gather_object
[rank127]:     object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank127]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3173, in _tensor_to_object
[rank127]:     return _unpickler(io.BytesIO(buf)).load()
[rank127]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]: _pickle.UnpicklingError: invalid load key, '\x00'.
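For readers unfamiliar with the final message: `invalid load key, '\x00'` is what Python's unpickler reports when the first byte it reads is a NUL byte. The sketch below is a single-process illustration using only the standard library, not the Megatron or PyTorch code path. It assumes, roughly, that an object gather pickles each rank's object, pads the bytes to a common length, and decodes each buffer using the sender's reported size; the helper names `decode` and `unpickle_error` are hypothetical. It shows only what kind of buffer produces this message and does not establish why such a buffer appeared in this run.

```python
import pickle


def decode(buf: bytes, size: int):
    """Unpickle the first `size` bytes of a padded buffer (hypothetical helper)."""
    return pickle.loads(buf[:size])


def unpickle_error(buf: bytes):
    """Return the UnpicklingError message for `buf`, or None if it loads."""
    try:
        pickle.loads(buf)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return None


# A well-formed payload survives zero padding when the true size is known.
payload = pickle.dumps({"rank": 127})
padded = payload + b"\x00" * 8
roundtrip = decode(padded, len(payload))

# A buffer that starts with NUL bytes (for example, one that was never
# filled in) fails with the same message as in the traceback above.
zero_error = unpickle_error(b"\x00" * 16)
print(roundtrip)
print(zero_error)
```

If this reading is right, the failing rank received a gather buffer whose leading bytes were zero, which would point at the collective or the size exchange rather than at the metadata object itself. That is a hypothesis to check, not a diagnosis.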
We are using 16 P5en.48xlarge instances (8 H200 GPUs each). The error disappears when we set pipeline_model_parallel_size to 2 instead of 1. We made a similar observation when training GLM 4.7: PP=2 throws the same error, but with PP=4 the error disappears.
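To make the two GLM 4.5 Air setups easier to compare, here is a small sketch of the parallel layout arithmetic. It assumes the usual Megatron convention that the data-parallel size is the world size divided by TP × CP × PP, and that the number of gradient-accumulation steps is the global batch size divided by (micro batch size × data-parallel size). The `Layout` class is hypothetical and exists only for this illustration; expert parallelism is left out of the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical helper describing one parallel configuration."""
    world_size: int
    tp: int
    cp: int
    pp: int
    micro_batch_size: int = 1
    global_batch_size: int = 64

    @property
    def dp(self) -> int:
        # Data-parallel size: GPUs left after TP, CP and PP are accounted for.
        model_parallel = self.tp * self.cp * self.pp
        if self.world_size % model_parallel:
            raise ValueError("world_size must be divisible by tp * cp * pp")
        return self.world_size // model_parallel

    @property
    def grad_accum_steps(self) -> int:
        # Micro-batches each data-parallel rank processes per optimizer step.
        per_step = self.micro_batch_size * self.dp
        if self.global_batch_size % per_step:
            raise ValueError("global_batch_size must be divisible by mbs * dp")
        return self.global_batch_size // per_step


# 16 nodes x 8 GPUs = 128 ranks, matching the [rank127] prefix in the log.
world_size = 16 * 8
failing = Layout(world_size=world_size, tp=4, cp=2, pp=1)
working = Layout(world_size=world_size, tp=4, cp=2, pp=2)

print("PP=1:", failing.dp, failing.grad_accum_steps)
print("PP=2:", working.dp, working.grad_accum_steps)
```

Under these assumptions, PP=1 gives a data-parallel size of 16 with 4 accumulation steps, and PP=2 gives a data-parallel size of 8 with 8 accumulation steps, so the two runs differ in how ranks are grouped when the checkpoint metadata is gathered.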
Checklist
Bug Description
We get this error when the first checkpoint is saved.
We are using ms-swift 4.0.2
How to Reproduce
See the configuration and traceback in the description above.
Additional Information
No response