CUDA: fix Gemma 2 numerical issues for FA #9166
Conversation
Just pulled the PR and checked; it looks solved to me. Nice work!
Also checked another Gemma 2 model (gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf) and Gemma 2 9B, both fine.
Benchmarks, for what it's worth:
- gemma-2-27b-it-imat-Q6_K_(bartowski).gguf
- gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf
(benchmark tables not reproduced here)
You should be seeing more of a difference if you set higher values for
Thanks for the tip. I was mostly just testing and wanted to show that setting FP32 doesn't seem to cause any issues or slowdown. Here are some results with higher values.
Looks like a nice improvement to me.
Fixup to #8542.
There seem to be numerical issues which (depending on the inputs) can cause incorrect results with Gemma 2 when using FlashAttention. This PR sets the FA precision to FP32, which fixes the issue described in #8542 (comment).
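For reference, requesting higher precision on the FlashAttention op in ggml looks roughly like the sketch below. This is an illustrative example rather than the exact diff in this PR; the helper name `build_attn_fp32` and the placeholder tensors/scales are assumptions for the sake of a self-contained snippet.

```cpp
// Minimal sketch (not the exact patch): forcing FP32 accumulation for the
// fused FlashAttention op in a ggml graph. q, k, v, mask and the scale /
// softcap values are placeholders for illustration.
#include "ggml.h"

static struct ggml_tensor * build_attn_fp32(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        struct ggml_tensor  * mask,
        float                 kq_scale,
        float                 logit_softcap) {
    // Build the fused FlashAttention op (Gemma 2 uses logit softcapping,
    // added to the FA path in #8542).
    struct ggml_tensor * cur = ggml_flash_attn_ext(
        ctx, q, k, v, mask, kq_scale, /*max_bias=*/0.0f, logit_softcap);

    // Request FP32 precision for the op; with the default (lower) precision
    // the softcapped logits can lose enough accuracy to produce incorrect
    // results for some inputs.
    ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);

    return cur;
}
```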