Accelerate and AdamW runs OOM #133
Launching the script without accelerate runs fine, but it runs OOM with accelerate.
Setup:
- 1 H100
- Llama 3.1 8b
- AdamW
Without accelerate
python examples/rl_gsm8k/run_finetune.py --config-dir /mnt/llmd/results/exps/alex/debug_accelerate/conf --config-name 0
12/05/2024 20:27:25 - INFO - tapeagents.finetune.context - epoch 0 ended
12/05/2024 20:27:27 - INFO - tapeagents.finetune.context - epoch 1 ended
12/05/2024 20:27:28 - INFO - tapeagents.finetune.context - epoch 2 ended
12/05/2024 20:27:30 - INFO - tapeagents.finetune.context - epoch 3 ended
12/05/2024 20:27:32 - INFO - tapeagents.finetune.context - epoch 4 ended
12/05/2024 20:27:33 - INFO - tapeagents.finetune.context - Completed steps 1: {'stats/lr': '0.000', 'stats/grad_norm': '0.840', 'stats/samples': '64.000', 'stats/passes': '16.000', 'stats/completed_steps': '1.000', 'stats/epoch': '5.000', 'throughput/tokens_perGPU_per_sec': '3009.996', 'throughput/tokens_per_sec': '3009.996', 'throughput/passes_per_sec': '1.568', 'throughput/steps_per_sec': '0.098', 'throughput/sec_per_step': '0.040', 'loss/train': '-0.067', 'dataset_stats/max_batch_len': '656.000', 'dataset_stats/min_batch_len': '480.000', 'rl/max_new_log_probs': '0.000', 'rl/max_ratio_new_old': '10.092', 'rl/max_loss': '0.023', 'rl/reward': '0.538', 'rl/max_reward': '1.000', 'rl/min_reward': '0.000', 'rl/mean_old_logprobs': '-0.263', 'rl/mean_new_logprobs': '-0.278', 'rl/mean_new_logprobs_positive_log_p_weights': '-0.027', 'rl/mean_new_logprobs_negative_log_p_weights': '-0.109', 'rl/mean_ref_logprobs': '-0.279', 'rl/advantage': '-0.006', 'rl/max_advantage': '0.707', 'rl/min_advantage': '-0.707', 'rl/loss': '-0.017', 'rl/kl': '0.000', 'rl/max_kl': '0.029', 'rl/min_kl': '-0.000', 'rl/surr1': '0.000', 'rl/surr2': '0.000', 'rl/ratio_new_old': '1.002', 'rl/ratio_ref_new': '1.000', 'rl/ratio_ref_old': '1.002'}
With accelerate
accelerate launch --mixed_precision=bf16 --config_file conf/accelerate/accelerate_base.yaml examples/rl_gsm8k/run_finetune.py --config-dir /mnt/llmd/results/exps/alex/debug_accelerate/conf --config-name 0
12/05/2024 20:28:19 - INFO - tapeagents.finetune.context - epoch 0 ended
12/05/2024 20:28:23 - INFO - tapeagents.finetune.context - epoch 1 ended
12/05/2024 20:28:27 - INFO - tapeagents.finetune.context - epoch 2 ended
12/05/2024 20:28:32 - INFO - tapeagents.finetune.context - epoch 3 ended
12/05/2024 20:28:36 - INFO - tapeagents.finetune.context - epoch 4 ended
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/toolkit/TapeAgents/examples/rl_gsm8k/run_finetune.py", line 8, in finetune_with_config
run_finetuning_loop(cfg)
File "/home/toolkit/TapeAgents/tapeagents/finetune/finetune.py", line 219, in run_finetuning_loop
optimizer.step()
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/accelerate/optimizer.py", line 171, in step
self.optimizer.step(closure)
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
out = func(*args, **kwargs)
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/adamw.py", line 227, in step
adamw(
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
return func(*args, **kwargs)
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/adamw.py", line 767, in adamw
func(
File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 101.88 MiB is free. Process 1230303 has 78.99 GiB memory in use. Of the allocated memory 77.24 GiB is allocated by PyTorch, and 655.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
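As a possible (untested) mitigation, following the allocator hint in the error message itself, the same launch can be retried with expandable segments enabled in the CUDA caching allocator. The line below is simply the accelerate command from above prefixed with the suggested environment variable:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch --mixed_precision=bf16 --config_file conf/accelerate/accelerate_base.yaml examples/rl_gsm8k/run_finetune.py --config-dir /mnt/llmd/results/exps/alex/debug_accelerate/conf --config-name 0

Note that this only works around fragmentation; with only ~655 MiB reserved but unallocated, it is unlikely to be enough on its own, since 77.24 GiB is already allocated by PyTorch when the AdamW step fails.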