
Training loss not decreasing #307

@desktoop


This is my training loss curve (it breaks around epoch [110/200]):

[image: training loss curve]

I found that "epoch" here is not used in the usual deep-learning sense. It looks as if the model weights are reset at every saving epoch (or the latest checkpoint is not being loaded?). Did I set the wrong parameters?

The total number of tokens in my dataset is train-num-samples * epochs = 49597346 * 200 = 9,919,469,200. I want the model to see the whole dataset exactly once.
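The budget above can be sanity-checked with a few lines. Note this assumes, as the issue text does, that --train-num-samples counts tokens per "epoch" so that total tokens = train-num-samples * epochs; if open_lm instead counts sequences (as open_clip does), the budget would be off by a factor of the sequence length.

```python
# Single-pass token budget, under the assumption stated in the issue:
# total tokens seen = train_num_samples * epochs.
train_num_samples = 49_597_346   # --train-num-samples
epochs = 200                     # --epochs
total_tokens = train_num_samples * epochs
print(total_tokens)  # 9919469200, roughly the 10B tokens of fineweb-edu-10BT
```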

dataset: fineweb-edu-10BT
tokenization (from DCLM, not open_lm):

--input /path/to/fineweb-10BT \
--local-cell-dir tmp/path/to/storage/for/local/cells \
--output path/to/tokenization \
--tokenizer "EleutherAI/gpt-neox-20b" \
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512
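For cross-checking, the capacity of the shard set referenced below in --train-data can be estimated from these tokenization flags. This is a rough sketch assuming every .tar shard is full (--wds-chunk-size sequences each) and that --seqlen 2049 means 2048 trained tokens plus one token for the shifted next-token target; the last shard is usually partial, so the real count is a bit lower.

```python
# Estimated contents of shard_{00000000..00000064}.tar, assuming full shards.
n_shards = 65            # shard indices 00000000..00000064 inclusive
seqs_per_shard = 8192    # --wds-chunk-size
seqlen = 2049            # --seqlen (2048 trained tokens + 1 shifted target)
sequences = n_shards * seqs_per_shard
tokens = sequences * (seqlen - 1)  # tokens actually trained on per full pass
print(sequences, tokens)  # 532480 sequences, ~1.09B tokens
```

If that ~1.09B-token estimate is right, 65 shards hold only about a tenth of fineweb-edu-10BT, which may be worth double-checking against the goal of a single full pass.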

training:

 --model open_lm_411m_v2 \
 --train-data /my/tokens/path/shard_{00000000..00000064}.tar \
 --train-num-samples 49597346 \
 --workers 8 \
 --dataset-resampled \
 --precision amp_bfloat16 \
 --grad-checkpointing \
 --log-every-n-steps 10 \
 --global-batch-size 64 \
 --epochs 200 \
 --grad-clip-norm 1 \
 --data-key json.gz \
 --lr 3e-4 \
 --fsdp --fsdp-amp \
 --warmup 2000 \
 --wd 0.1 \
 --beta2 0.95 \
 --report-to wandb \
 --name open_lm_ex_$RANDOM \
 --logs /mnt/nas/copora-evaluation/public-model/checkpoint/fineweb-10BT/checkpoint \
 --resume latest
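One more hedged sanity check on the schedule: as far as I understand, --dataset-resampled makes webdataset sample shards with replacement, so an "epoch" is really a checkpointing interval rather than a guaranteed full pass. And if --train-num-samples counts sequences rather than tokens (an assumption, not confirmed here), each such "epoch" would be a very large number of optimizer steps at this batch size:

```python
import math

# Hypothetical interpretation: --train-num-samples counts sequences,
# so steps per "epoch" = ceil(train_num_samples / global_batch_size).
train_num_samples = 49_597_346   # --train-num-samples
global_batch_size = 64           # --global-batch-size
steps_per_epoch = math.ceil(train_num_samples / global_batch_size)
print(steps_per_epoch)  # 774959 steps per "epoch"
```

Under that reading, 200 epochs would be ~155M steps, which suggests one of the two flags is not set the way the single-pass goal intends.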

device info: 4*A40
