fix accelerate tests for roberta xl #31288
Conversation
@@ -572,7 +572,7 @@ class XLMRobertaXLPreTrainedModel(PreTrainedModel):
     config_class = XLMRobertaXLConfig
     base_model_prefix = "roberta"
-    _no_split_modules = ["XLMRobertaXLEmbeddings", "XLMRobertaXLSelfAttention"]
+    _no_split_modules = ["XLMRobertaXLEmbeddings", "XLMRobertaXLLayer"]
If you can leave a comment (in the PR page is OK) explaining why this fixes the issue, that would be very nice for knowledge sharing ❤️.
This is needed because there are a lot of residual connections at the level of the layer, so we need to make sure that all the ops happening inside a layer stay on the same device. Otherwise, we would have to move the tensors around manually with .to(device), which is not good. If you check the _no_split_modules of other transformers architectures (llama, mistral, ...), you will see that most of them include the layer class.
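For context, here is a minimal sketch of the pattern that breaks when a layer is split across devices. ToyLayer is a hypothetical, simplified stand-in, not the actual XLMRobertaXLLayer code: the residual addition needs both operands on the same device, so the whole layer has to be placed on a single GPU.

import torch
from torch import nn

class ToyLayer(nn.Module):
    # Simplified transformer layer: one sub-block plus a residual connection.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.attention = nn.Linear(hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        attn_out = self.attention(hidden_states)
        # Residual connection: `hidden_states` and the output of `self.output`
        # must live on the same device. If a device map placed `self.output`
        # on another GPU, a manual .to(device) would be needed right here.
        return hidden_states + self.output(attn_out)

Listing "XLMRobertaXLLayer" in _no_split_modules tells accelerate to treat each layer as an atomic block when building the device map, so these additions never cross device boundaries.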
Thanks a lot!
Thank you for fixing it so quickly!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for fixing!
What does this do?
This PR fixes the no_split_modules for the roberta xl model introduced in PR #31223. We also need to change model_split_percents so that there is enough space on the first GPU for the embeddings. The tests are passing.
cc @ydshieh since you pinged me on this one
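As a rough illustration of what the fix enables (the checkpoint name and multi-GPU setup below are assumptions, not taken from this PR), the model can now be sharded with an automatic device map without manual tensor moves; model_split_percents is the test-suite knob that caps how much of the model the first device is allowed to hold in the parallelism tests, which is why it had to leave room for the embeddings.

import torch
from transformers import AutoModel

# With _no_split_modules = ["XLMRobertaXLEmbeddings", "XLMRobertaXLLayer"],
# accelerate's device-map planner keeps the embeddings and each encoder layer
# whole on a single device, so residual connections never straddle GPUs.
model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xl",  # illustrative checkpoint name
    device_map="auto",
    torch_dtype=torch.float16,
)
print(model.hf_device_map)  # e.g. {"embeddings": 0, "encoder.layer.0": 0, ...}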