fix bnb multi gpu training #2714
Conversation
Makes sense thanks @SunMarc !
Good catch! 🚀
## Describe your changes

There is a bug in accelerate where it assumes that bnb (`loaded_in_4bit`) models spread across multiple GPUs cannot be trained, but this is not the case. On AML computes with multiple GPUs, the model may not have any weights on device 0, which is the accelerator's default device, so it hits the bug at https://github.com/huggingface/accelerate/blob/e82de1215ae701b6bf567eb705615c656e7f55c7/src/accelerate/accelerator.py#L1374. This has been fixed on main by huggingface/accelerate#2714 but is not in a release yet. We work around this bug for current stable releases of accelerate by setting `ACCELERATE_TORCH_DEVICE` to the first device the model is on (see the sketch after the checklist below).

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
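A minimal sketch of the workaround described above, assuming `model` is an already-loaded quantized transformers model exposing an `hf_device_map`; the helper name is hypothetical and the env var must be set before the Accelerator is created:

```python
import os

import torch


def pin_accelerate_device_to_model(model) -> str:
    """Hypothetical helper: point ACCELERATE_TORCH_DEVICE at the first device the
    model's weights actually occupy, so stable accelerate releases pass the
    device-mismatch check. Call this before constructing the Accelerator."""
    first_device = next(iter(model.hf_device_map.values()))
    # hf_device_map values may be ints (GPU indices), strings ("cuda:1", "cpu"),
    # or torch.device objects; normalize them to a device string.
    if isinstance(first_device, int):
        device_str = f"cuda:{first_device}"
    else:
        device_str = str(torch.device(first_device))
    os.environ["ACCELERATE_TORCH_DEVICE"] = device_str
    return device_str
```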
What does this PR do?
This PR fixes a bug that users hit when performing multi-GPU training with bnb models and the first GPU is not used. In this PR, you can see that we did not allow multi-GPU training before, and all the code after that check assumed a single-GPU setup. Since then, we added the possibility to perform multi-GPU training using naive PP. However, the logic after the multi-GPU check stayed the same, as you can see on main. A quick fix is to turn that follow-up device check into an elif so it only runs in the single-device case (sketched below).
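A rough sketch of the shape of the fix, not the exact diff: it assumes the check operates on the model's `hf_device_map`, and the function and error messages are illustrative stand-ins for the logic in `Accelerator.prepare_model`.

```python
import torch


def check_quantized_model_placement(hf_device_map: dict, accelerator_device: torch.device,
                                    is_distributed: bool) -> None:
    """Illustrative stand-in for the placement check, not the actual accelerate source."""
    model_devices = set(hf_device_map.values())

    if len(model_devices) > 1 and is_distributed:
        # A quantized model sharded across GPUs (naive PP) cannot also be wrapped
        # for distributed training.
        raise ValueError("You can't train a bnb model sharded across multiple GPUs "
                         "with a distributed wrapper.")
    elif len(model_devices) == 1:
        # The elif is the essence of the fix: the device comparison only makes sense
        # when the model truly lives on a single device. Previously it ran
        # unconditionally, so a sharded model with no weights on the accelerator's
        # default device raised an error even though training it is fine.
        current_device = next(iter(model_devices))
        if isinstance(current_device, int):
            current_device = f"cuda:{current_device}"
        if torch.device(current_device) != accelerator_device:
            raise ValueError("The quantized model must be on the same device as the "
                             "accelerator to be trained.")
    # More than one device without a distributed wrapper (naive PP): training is allowed.
```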
Fixes #2713 #2429