Set ddp_find_unused_parameters to False when using distributed training #222
aresnow1 wants to merge 2 commits into artidoro:main
Conversation
Thank you for the PR! Using DDP is quite powerful with QLoRA, as even the large LLaMA models can fit on a single 48GB GPU. My question about your change is the following: why do you think this solution is better than manually adding the following setting to DDP scripts? I am slightly leaning in favor of adding a section to the README about DDP, with sample scripts showing how to use it. But I am happy to hear your thoughts.
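The exact setting referenced above is not reproduced in this thread; a minimal sketch of the manual approach, assuming the standard transformers.TrainingArguments API that qlora.py builds on, might look like this:

```python
# Minimal sketch (not the exact snippet referenced above): pass the flag
# explicitly when building TrainingArguments for a DDP run that also uses
# gradient checkpointing.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",             # hypothetical output path
    per_device_train_batch_size=1,
    gradient_checkpointing=True,       # saves activation memory under QLoRA
    ddp_find_unused_parameters=False,  # needed for DDP + checkpointing to coexist
)
```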
When using DDP, if gradient checkpointing is enabled, ddp_find_unused_parameters has to be set to False, otherwise DDP errors out during the backward pass.
I agree! It took me two days to get "QLoRA + Expand Vocab + DDP + Gradient Checkpointing" to work properly. I encountered numerous bugs, conflicts, and obscure configurations, including this particular issue. These different aspects were intertwined with each other, making the entire process difficult and frustrating. Therefore, I strongly believe that explicit code and detailed comments are highly preferable.
@chenjiasheng |
As described in the Hugging Face docs, `ddp_find_unused_parameters` should be set to False if `gradient_checkpointing` is enabled. I've tested on my machine with 2 RTX 3090 Ti GPUs, running with the following scripts:
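The scripts themselves are not included in the thread. A rough sketch of what a two-GPU test of this change could look like, assuming torchrun as the launcher and a WORLD_SIZE-based check (both illustrative, not the actual diff in this PR):

```python
# Illustrative sketch: enable the flag automatically only when the run is
# actually distributed, so single-GPU runs are unaffected.
import os

from transformers import TrainingArguments

world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun

training_args = TrainingArguments(
    output_dir="./output",  # hypothetical path
    gradient_checkpointing=True,
    # Only meaningful under DDP; leave at the default (None) otherwise.
    ddp_find_unused_parameters=False if world_size > 1 else None,
)

# A two-GPU launch of the repo's entry point would then look something like:
#   torchrun --nproc_per_node=2 qlora.py ...
```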
It may resolve #12.