Set ddp_find_unused_parameters to False when using distributed training #222
aresnow1 wants to merge 2 commits into artidoro:main
Conversation
Thank you for the PR! Using DDP is quite powerful with QLoRA, as even the large LLaMA models can fit on a single 48GB GPU. My question about your change is the following: why do you think this solution is better than manually adding the following setting to DDP scripts? I am slightly leaning in favor of adding a section to the README about DDP, with sample scripts showing how to use it. But I am happy to hear your thoughts.
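The exact setting referenced above is not reproduced in this thread; a minimal sketch of the manual approach, assuming the standard transformers.TrainingArguments API that qlora.py builds on, might look like this:

```python
# Minimal sketch (not the exact snippet referenced above): pass the flag
# explicitly when building TrainingArguments for a DDP run that also uses
# gradient checkpointing.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",             # hypothetical output path
    per_device_train_batch_size=1,
    gradient_checkpointing=True,       # saves activation memory under QLoRA
    ddp_find_unused_parameters=False,  # needed for DDP + checkpointing to coexist
)
```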
When using DDP, if gradient checkpointing is enabled, ddp_find_unused_parameters has to be set to False, otherwise DDP errors out during the backward pass.
I agree! It took me two days to get "QLoRA + Expand Vocab + DDP + Gradient Checkpointing" to work properly. I encountered numerous bugs, conflicts, and obscure configurations, including this particular issue. These different aspects were intertwined with each other, making the entire process difficult and frustrating. Therefore, I strongly believe that explicit code and detailed comments are highly preferable.
@chenjiasheng |
As described in the Hugging Face docs, `ddp_find_unused_parameters` should be set to False if `gradient_checkpointing` is enabled. I've tested on my machine with 2 RTX 3090 Ti GPUs, running with the following scripts:
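The scripts themselves are not included in the thread. A rough sketch of what a two-GPU test of this change could look like, assuming torchrun as the launcher and a WORLD_SIZE-based check (both illustrative, not the actual diff in this PR):

```python
# Illustrative sketch: enable the flag automatically only when the run is
# actually distributed, so single-GPU runs are unaffected.
import os

from transformers import TrainingArguments

world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun

training_args = TrainingArguments(
    output_dir="./output",  # hypothetical path
    gradient_checkpointing=True,
    # Only meaningful under DDP; leave at the default (None) otherwise.
    ddp_find_unused_parameters=False if world_size > 1 else None,
)

# A two-GPU launch of the repo's entry point would then look something like:
#   torchrun --nproc_per_node=2 qlora.py ...
```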
It may resolve #12.