Accelerate and AdamW run OOM #133

@AlexPiche

Description

The script trains fine when launched directly, but goes OOM when launched with accelerate; a rough memory sketch follows the setup list below.

Setup:

  • 1 H100
  • Llama 3.1 8B
  • AdamW
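
For context, a back-of-the-envelope look at where the memory goes. This is my own arithmetic, not something the logs confirm: PyTorch's AdamW keeps its exp_avg and exp_avg_sq state in the parameter dtype, so weights + grads + optimizer state come to roughly four copies of the model.

```python
# Rough AdamW memory arithmetic for an ~8B-parameter model.
# Assumptions (mine, not from the logs): weights, grads, and both AdamW
# state tensors share the parameter dtype; activations are ignored.
N = 8.0e9          # approximate Llama 3.1 8B parameter count
GiB = 1024 ** 3

def footprint_gib(bytes_per_param: int) -> float:
    # weights + grads + exp_avg + exp_avg_sq = 4 tensors of N elements
    return 4 * N * bytes_per_param / GiB

print(f"bf16 params: ~{footprint_gib(2):.0f} GiB")  # ~60 GiB, fits on one H100
print(f"fp32 params: ~{footprint_gib(4):.0f} GiB")  # ~119 GiB, does not fit
```

One guess consistent with these numbers: if the plain run holds bf16 parameters while the accelerate run ends up with fp32 parameters or extra fp32 copies (accelerate's --mixed_precision=bf16 autocasts compute but does not cast the parameters themselves), the first would fit in 79 GiB and the second would not. The traceback below neither confirms nor rules this out.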

Without accelerate

python examples/rl_gsm8k/run_finetune.py --config-dir /mnt/llmd/results/exps/alex/debug_accelerate/conf --config-name 0
12/05/2024 20:27:25 - INFO - tapeagents.finetune.context - epoch 0 ended
12/05/2024 20:27:27 - INFO - tapeagents.finetune.context - epoch 1 ended
12/05/2024 20:27:28 - INFO - tapeagents.finetune.context - epoch 2 ended
12/05/2024 20:27:30 - INFO - tapeagents.finetune.context - epoch 3 ended
12/05/2024 20:27:32 - INFO - tapeagents.finetune.context - epoch 4 ended
12/05/2024 20:27:33 - INFO - tapeagents.finetune.context - Completed steps 1: {'stats/lr': '0.000', 'stats/grad_norm': '0.840', 'stats/samples': '64.000', 'stats/passes': '16.000', 'stats/completed_steps': '1.000', 'stats/epoch': '5.000', 'throughput/tokens_perGPU_per_sec': '3009.996', 'throughput/tokens_per_sec': '3009.996', 'throughput/passes_per_sec': '1.568', 'throughput/steps_per_sec': '0.098', 'throughput/sec_per_step': '0.040', 'loss/train': '-0.067', 'dataset_stats/max_batch_len': '656.000', 'dataset_stats/min_batch_len': '480.000', 'rl/max_new_log_probs': '0.000', 'rl/max_ratio_new_old': '10.092', 'rl/max_loss': '0.023', 'rl/reward': '0.538', 'rl/max_reward': '1.000', 'rl/min_reward': '0.000', 'rl/mean_old_logprobs': '-0.263', 'rl/mean_new_logprobs': '-0.278', 'rl/mean_new_logprobs_positive_log_p_weights': '-0.027', 'rl/mean_new_logprobs_negative_log_p_weights': '-0.109', 'rl/mean_ref_logprobs': '-0.279', 'rl/advantage': '-0.006', 'rl/max_advantage': '0.707', 'rl/min_advantage': '-0.707', 'rl/loss': '-0.017', 'rl/kl': '0.000', 'rl/max_kl': '0.029', 'rl/min_kl': '-0.000', 'rl/surr1': '0.000', 'rl/surr2': '0.000', 'rl/ratio_new_old': '1.002', 'rl/ratio_ref_new': '1.000', 'rl/ratio_ref_old': '1.002'}

With accelerate

accelerate launch --mixed_precision=bf16 --config_file conf/accelerate/accelerate_base.yaml examples/rl_gsm8k/run_finetune.py --config-dir /mnt/llmd/results/exps/alex/debug_accelerate/conf --config-name 0
12/05/2024 20:28:19 - INFO - tapeagents.finetune.context - epoch 0 ended
12/05/2024 20:28:23 - INFO - tapeagents.finetune.context - epoch 1 ended
12/05/2024 20:28:27 - INFO - tapeagents.finetune.context - epoch 2 ended
12/05/2024 20:28:32 - INFO - tapeagents.finetune.context - epoch 3 ended
12/05/2024 20:28:36 - INFO - tapeagents.finetune.context - epoch 4 ended
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/toolkit/TapeAgents/examples/rl_gsm8k/run_finetune.py", line 8, in finetune_with_config
    run_finetuning_loop(cfg)
  File "/home/toolkit/TapeAgents/tapeagents/finetune/finetune.py", line 219, in run_finetuning_loop
    optimizer.step()
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/accelerate/optimizer.py", line 171, in step
    self.optimizer.step(closure)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
    out = func(*args, **kwargs)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/adamw.py", line 227, in step
    adamw(
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
    return func(*args, **kwargs)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/adamw.py", line 767, in adamw
    func(
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
    exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 101.88 MiB is free. Process 1230303 has 78.99 GiB memory in use. Of the allocated memory 77.24 GiB is allocated by PyTorch, and 655.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
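
A few things that may be worth trying, sketched below; none of this is verified against this repo's finetune config, and the learning rate is a placeholder. The allocator hint is the one the error message itself suggests; fused=True makes AdamW run as a single kernel and skip the _foreach_* temporaries that this traceback shows failing in torch._foreach_sqrt; the 8-bit optimizer is a swap-in that assumes bitsandbytes is installed.

```python
# Mitigation sketch (untested here; function names and lr are placeholders).
import os

# 1) Allocator hint from the error message; must be set before the
#    process makes its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch


def make_fused_adamw(model: torch.nn.Module, lr: float = 1e-6):
    # 2) fused=True updates params in one kernel, avoiding the large
    #    _foreach_* intermediates where this run OOMs.
    return torch.optim.AdamW(model.parameters(), lr=lr, fused=True)


def make_8bit_adamw(model: torch.nn.Module, lr: float = 1e-6):
    # 3) 8-bit AdamW keeps exp_avg/exp_avg_sq quantized, shrinking
    #    optimizer state memory substantially.
    import bitsandbytes as bnb

    return bnb.optim.AdamW8bit(model.parameters(), lr=lr)
```

Note that fused AdamW still allocates full-precision state, so it only helps with the final ~112 MiB spike here, not the overall ~77 GiB footprint; the 8-bit optimizer (or offloading optimizer state via DeepSpeed/FSDP through accelerate) addresses the footprint itself.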
