Add XPU type to the workaround for -inf mask causing SDPA NaN issue in modeling files #35647
Conversation
cc @muellerzr @SunMarc because I'm not sure who to ping for xpu! Feel free to ping someone else if needed
Could you check if you get the same issue as here with XPU? Also, does this PR fix your issue? cc @ydshieh as you might know something
Yes, we used XPU for fine-tuning and got the same issue. With this PR the issue is fixed.
I can only say that this seems reasonable to me, but not more than that. @Liangliang-Ma It would be better to provide a tiny code snippet to demonstrate the issue, like building a mask with a row where all positions are masked and passing it to torch.nn.functional.scaled_dot_product_attention.
Thanks @ydshieh, I modified your test and got the NaN result like this:
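A minimal sketch of this kind of reproduction (not the exact snippet from the thread): an additive float mask with one query row set entirely to -inf makes scaled_dot_product_attention return NaN for that row.

```python
import torch
import torch.nn.functional as F

# One query position (row 0) has every key masked out with -inf,
# so the softmax over that row degenerates and the output becomes NaN.
q = torch.randn(1, 1, 2, 8)
k = torch.randn(1, 1, 4, 8)
v = torch.randn(1, 1, 4, 8)

mask = torch.zeros(1, 1, 2, 4)
mask[0, 0, 0, :] = float("-inf")  # fully masked row

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out[0, 0, 0])        # NaN values for the fully masked query position
print(out.isnan().any())   # tensor(True)
```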
I found that this issue is caused by casting
@Rocketknight1 Hi, may I know whether the CI workflow failures are expected? It seems the modeling code generated from the original source differs from what I modified. Thanks!
Hey @Liangliang-Ma, you need to modify the modular files for mistral and bamba, as the modeling files are generated automatically from them. This should fix the CI issue.
Thanks for the snippet, LGTM with the fix I proposed for the CI
@SunMarc Thanks. The CI passed with the fix.
gentle ping @ArthurZucker as this concerns the attention class
@SunMarc @ArthurZucker Soft reminder about this PR.
gentle ping @SunMarc @ArthurZucker again.
Merging this as it only concerns the XPU workflow.
…ling files (#35647) * add xpu for unmask * change modular for generated matching * add latest modeling for helium
Sorry @Liangliang-Ma! And thanks for the fix 🤗
Recently, when we ran transformers + QLoRA fine-tuning, we found NaN values produced by torch.nn.functional.scaled_dot_product_attention. Given that XPU has a similar implementation of fused SDPA, we would like to follow the temporary PyTorch/Transformers solution of modifying the mask here, which solves the issue on XPU devices too.
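For context, a simplified sketch of the workaround being extended (helper names here are illustrative, not the library's exact code): rows of the expanded additive mask that are entirely masked get "unmasked" before the fused kernel runs, and the device check gating this path now includes XPU alongside CUDA.

```python
import torch

def unmask_fully_masked_rows(expanded_mask: torch.Tensor, min_dtype: float) -> torch.Tensor:
    # A row whose entries all equal min_dtype has every key position masked;
    # zeroing that row out (i.e. unmasking it) prevents fused SDPA kernels
    # from producing NaN for it. Transformers does this via
    # AttentionMaskConverter._unmask_unattended.
    fully_masked = torch.all(expanded_mask == min_dtype, dim=-1, keepdim=True)
    return expanded_mask.mul(~fully_masked)

def should_apply_workaround(device: torch.device) -> bool:
    # Before this PR only "cuda" took the workaround path; the change adds "xpu".
    return device.type in ("cuda", "xpu")

# Tiny usage example: the fully masked first row is reset to zeros (unmasked).
min_dtype = torch.finfo(torch.float16).min
mask = torch.zeros(1, 1, 2, 4)
mask[0, 0, 0, :] = min_dtype
print(unmask_fully_masked_rows(mask, min_dtype))
```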