fix bnb multi gpu training #2714
Conversation
Makes sense thanks @SunMarc !
Good catch! 🚀
## Describe your changes

There is a bug in accelerate where it assumes that bnb (`loaded_in_4bit`) models spread across multiple GPUs cannot be trained, but this is not the case. On AML computes with multiple GPUs, the model may not have any weights on device 0, which is the accelerator's default device, so it hits the bug at https://github.com/huggingface/accelerate/blob/e82de1215ae701b6bf567eb705615c656e7f55c7/src/accelerate/accelerator.py#L1374. This has been fixed on main by huggingface/accelerate#2714 but is not in a release yet. We work around this bug for current stable releases of accelerate by setting `ACCELERATE_TORCH_DEVICE` to the first device the model is on (see the sketch after the checklist below).

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
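A minimal sketch of the workaround described above, assuming `model` is an already-loaded quantized transformers model exposing an `hf_device_map`; the helper name is hypothetical and the env var must be set before the Accelerator is created:

```python
import os

import torch


def pin_accelerate_device_to_model(model) -> str:
    """Hypothetical helper: point ACCELERATE_TORCH_DEVICE at the first device the
    model's weights actually occupy, so stable accelerate releases pass the
    device-mismatch check. Call this before constructing the Accelerator."""
    first_device = next(iter(model.hf_device_map.values()))
    # hf_device_map values may be ints (GPU indices), strings ("cuda:1", "cpu"),
    # or torch.device objects; normalize them to a device string.
    if isinstance(first_device, int):
        device_str = f"cuda:{first_device}"
    else:
        device_str = str(torch.device(first_device))
    os.environ["ACCELERATE_TORCH_DEVICE"] = device_str
    return device_str
```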
What does this PR do?
This PR fixes a bug that users hit when performing multi-GPU training with bnb models and the first GPU is not used. In this PR, you can see that we did not allow multi-GPU training before, and all the code after that check assumed a single-GPU setup. Since then, we added the possibility to perform multi-GPU training using naive PP. However, the logic after the multi-GPU check stayed the same, as you can see on main. A quick fix is to turn that follow-up device check into an elif so it only runs in the single-device case (sketched below).
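A rough sketch of the shape of the fix, not the exact diff: it assumes the check operates on the model's `hf_device_map`, and the function and error messages are illustrative stand-ins for the logic in `Accelerator.prepare_model`.

```python
import torch


def check_quantized_model_placement(hf_device_map: dict, accelerator_device: torch.device,
                                    is_distributed: bool) -> None:
    """Illustrative stand-in for the placement check, not the actual accelerate source."""
    model_devices = set(hf_device_map.values())

    if len(model_devices) > 1 and is_distributed:
        # A quantized model sharded across GPUs (naive PP) cannot also be wrapped
        # for distributed training.
        raise ValueError("You can't train a bnb model sharded across multiple GPUs "
                         "with a distributed wrapper.")
    elif len(model_devices) == 1:
        # The elif is the essence of the fix: the device comparison only makes sense
        # when the model truly lives on a single device. Previously it ran
        # unconditionally, so a sharded model with no weights on the accelerator's
        # default device raised an error even though training it is fine.
        current_device = next(iter(model_devices))
        if isinstance(current_device, int):
            current_device = f"cuda:{current_device}"
        if torch.device(current_device) != accelerator_device:
            raise ValueError("The quantized model must be on the same device as the "
                             "accelerator to be trained.")
    # More than one device without a distributed wrapper (naive PP): training is allowed.
```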
Fixes #2713 #2429