Wrong output of Gemma-2 models using flash_attention_2
#32309
Yes, there's a PR to fix it (#32188)
Thank you, @zucchini-nlp! After the fix, will we be able to use `flash_attention_2`? Since FlashAttention currently doesn't support a static cache, do you think this issue will also impact other libraries (e.g., vLLM and other frameworks) when using `flash_attention_2`?
@tanliboy FA2 should now work for transformers, in forward and backward. For other libraries, I am not super familiar with all of them, but for vLLM Gemma2 should work the same way as other models because they do not use the same `StaticCache` we do. Also note that currently vLLM doesn't do sliding window in every second attn block, as per the comment I see here.
Thank you for the details, @zucchini-nlp!
@tanliboy Glad to see it's fixed! Let me know if I can close the issue 😇
Sure, the PR is merged already, closing the issue :)
@zucchini-nlp, thank you very much for fixing this issue.
@HuangBugWei correct! We might have a release soon, but until then it should be installed from source.
@zucchini-nlp thanks for the fix! I installed the latest release but ran into the below error while using `flash_attention_2`.
Here is the testing code to repro:
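The snippet itself wasn't captured above; a minimal sketch of the kind of repro described here (loading Gemma-2 with `attn_implementation="flash_attention_2"` and generating) might look like the following. The checkpoint name and prompt are assumptions, not the reporter's original code:

```python
# Minimal sketch (not the reporter's original snippet): load Gemma-2 with
# FlashAttention-2 and generate. The checkpoint and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Write a short poem about the ocean.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```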
Did I miss something?
@tanliboy yeah, seems like there were some other changes in how the attn mask is prepared, which broke FA2 again... Will open a new PR.
Thank you, @zucchini-nlp!
I tested the fix, and it worked well. Thank you! I also ran a side-by-side comparison during fine-tuning with and without `flash_attention_2`. The "GPU Time Spent Accessing Memory" was around 40% with `flash_attention_2`, which is lower than the ~47% observed without it.
@zucchini-nlp, is this warning still true? Or should we remove it given the fix?
Yes, I believe it still holds true as it wasn't related to FA2 not being supported, but rather due to small numerical precision differences between eager and non-eager attn.
No, it's no longer true, as flash attention soft capping is supported. Will remove.
I guess SDPA is not yet supported?
Yes, we need to integrate it.
The warning is still there in 4.48.1. Can you confirm that we can safely ignore this warning?
If you don't have the correct version of flash-attn then it's expected, but otherwise yes, it can be ignored!
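For reference, a quick way to check what is installed is sketched below. The thread doesn't state the exact minimum flash-attn version, so no threshold is hard-coded; treat this as a diagnostic sketch only:

```python
# Sketch: print the installed versions and whether transformers detects a
# usable flash-attn build. Required minimum versions are not stated in this
# thread, so no comparison threshold is included here.
import flash_attn
import transformers
from transformers.utils import is_flash_attn_2_available

print("transformers:", transformers.__version__)
print("flash-attn:", flash_attn.__version__)
print("FA2 available to transformers:", is_flash_attn_2_available())
```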
I remember that the soft-capping issue was resolved for the forward pass in flash_attn. However, I am still seeing poor model outputs when I enable `use_flash_attention_2` in Transformers, even for inference.
Did I miss something? Or is it a recent regression?
Who can help?
@ArthurZucker
Reproduction
Use `use_flash_attention_2` to load the Gemma-2 7B IT model and run inference (a sketch is given below), then compare against a run with `use_flash_attention_2=False`. This can be consistently reproduced.
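The original Python block from the report wasn't captured here. A comparison sketch along these lines reproduces the described setup by generating the same completion with eager attention and with FlashAttention-2; it uses the newer `attn_implementation` argument rather than the `use_flash_attention_2` flag, and the checkpoint and prompt are assumptions:

```python
# Sketch (assumed checkpoint/prompt): generate the same completion with eager
# attention and with FlashAttention-2, then compare the two outputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # assumed checkpoint
prompt = "Explain the difference between lists and tuples in Python."

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt, return_tensors="pt")

for attn in ("eager", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn,
        device_map="auto",
    )
    out = model.generate(**inputs.to(model.device), max_new_tokens=64, do_sample=False)
    print(f"--- {attn} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()
```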
Expected behavior
See the difference below:

