Megatron: GLM 4.5 Air Checkpoint Save Unpickling Error #9402

@blackPython

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

We are training GLM 4.5 Air with LoRA. Our configuration is as follows:

model: "zai-org/GLM-4.5-Air"
use_hf: true
merge_lora: true
save_safetensors: true
train_type: "lora"         # lora or full
lora_rank: 32
lora_alpha: 32
lora_dropout: 0.05
target_modules: "all-linear"
finetune: true
lr: 1e-4
min_lr: 1e-5
lr_warmup_fraction: 0.05
clip_grad: 0.5
micro_batch_size: 1
global_batch_size: 64
adam_beta2: 0.999
weight_decay: 0.0
tensor_model_parallel_size: 4
context_parallel_size: 2
pipeline_model_parallel_size: 1
expert_model_parallel_size: 4
sequence_parallel: true
moe_permute_fusion: true
moe_grouped_gemm: true
moe_shared_expert_overlap: true
moe_aux_loss_coeff: 0
recompute_granularity: "full"
recompute_method: "uniform"
recompute_num_layers: 1
load_from_cache_file: true
dataloader_num_workers: 8
dataset_num_proc: 48
max_length: 32768 #sequence length
packing: false
num_train_epochs: 1
save_steps: 1000
no_save_optim: false
no_save_rng: false
attention_backend: "flash"
cross_entropy_loss_fusion: true

We get the following error when the first checkpoint is saved:

2026-05-18 16:23:39
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
    megatron_sft_main()
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 92, in megatron_sft_main
    return MegatronSft(args).main()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/swift/pipelines/base.py", line 52, in main
    result = self.run()
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 67, in run
    trainer.train(train_dataset, val_dataset)
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 643, in train
    self.save_checkpoint()
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 686, in save_checkpoint
    save_mcore_checkpoint(
  File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/megatron_lm_utils.py", line 281, in save_mcore_checkpoint
    async_save_request = dist_checkpointing.save(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/serialization.py", line 403, in save
    sharded_state_dict, state_dict = save_preprocess(
                                     ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/state_dict_utils.py", line 51, in save_preprocess
    determine_global_metadata(sharded_part)[1],
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/validation.py", line 558, in determine_global_metadata
    torch.distributed.all_gather_object(global_metadata, local_metadata)
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3268, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size, group)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3173, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '\x00'.
[rank127]: Traceback (most recent call last):
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank127]:     megatron_sft_main()
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 92, in megatron_sft_main
[rank127]:     return MegatronSft(args).main()
[rank127]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/pipelines/base.py", line 52, in main
[rank127]:     result = self.run()
[rank127]:              ^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 67, in run
[rank127]:     trainer.train(train_dataset, val_dataset)
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 643, in train
[rank127]:     self.save_checkpoint()
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 686, in save_checkpoint
[rank127]:     save_mcore_checkpoint(
[rank127]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/megatron_lm_utils.py", line 281, in save_mcore_checkpoint
[rank127]:     async_save_request = dist_checkpointing.save(
[rank127]:                          ^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/serialization.py", line 403, in save
[rank127]:     sharded_state_dict, state_dict = save_preprocess(
[rank127]:                                      ^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/state_dict_utils.py", line 51, in save_preprocess
[rank127]:     determine_global_metadata(sharded_part)[1],
[rank127]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/validation.py", line 558, in determine_global_metadata
[rank127]:     torch.distributed.all_gather_object(global_metadata, local_metadata)
[rank127]:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank127]:     return func(*args, **kwargs)
[rank127]:            ^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3268, in all_gather_object
[rank127]:     object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank127]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3173, in _tensor_to_object
[rank127]:     return _unpickler(io.BytesIO(buf)).load()
[rank127]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank127]: _pickle.UnpicklingError: invalid load key, '\x00'.
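
For context on the last frame: the failure happens while `all_gather_object` unpickles the bytes it received from another rank. The following is a minimal standalone sketch (standard library only, no distributed setup; the helper name `try_unpickle` is hypothetical) showing that the same message appears whenever the buffer handed to the unpickler starts with a zero byte. That would be consistent with a receive buffer that was never filled or was overwritten, but it does not by itself identify the cause in this run.

```python
import io
import pickle


def try_unpickle(buf: bytes):
    """Return (ok, result_or_message) for a raw byte buffer."""
    try:
        return True, pickle.Unpickler(io.BytesIO(buf)).load()
    except pickle.UnpicklingError as exc:
        return False, str(exc)


# A valid pickled payload round-trips without error.
good = pickle.dumps({"rank": 0, "keys": ["a", "b"]})
print(try_unpickle(good))

# A zero-filled buffer of the same length fails on its first byte,
# producing an "invalid load key" message like the one in the traceback.
zeros = bytes(len(good))
print(try_unpickle(zeros))
```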

We are using 16 p5en.48xlarge instances (8 H200 GPUs each). The error disappears when we change pipeline_model_parallel_size to 2. We observed something similar when training GLM 4.7: PP=2 throws the same error, but with PP=4 it disappears.
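
To make the difference between the failing and working layouts concrete, here is a rough sketch of the rank arithmetic. It assumes the usual Megatron-style decomposition world_size = TP × PP × CP × DP and the usual microbatch count of global_batch_size / (micro_batch_size × DP); expert parallelism is left out of the product. The function names are hypothetical and not part of ms-swift or Megatron.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel size, assuming world_size = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


def num_microbatches(global_batch: int, micro_batch: int, dp: int) -> int:
    """Microbatches per step, assuming global = micro * DP * num_microbatches."""
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global batch size is not divisible by micro_batch * DP")
    return global_batch // (micro_batch * dp)


world_size = 16 * 8  # 16 instances with 8 GPUs each, as described above

for pp in (1, 2):  # PP=1 is the failing config, PP=2 the working one
    dp = data_parallel_size(world_size, tp=4, pp=pp, cp=2)
    mb = num_microbatches(global_batch=64, micro_batch=1, dp=dp)
    print(f"PP={pp}: DP={dp}, microbatches per step={mb}")
```

Under these assumptions the job has 128 ranks (matching the `[rank127]` prefix as the highest rank), with DP=16 at PP=1 and DP=8 at PP=2, so the two settings differ in how the ranks are grouped during the checkpoint save.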

We are using ms-swift 4.0.2

How to Reproduce

See the description above.

Additional Information

No response
