
CUDA: fix Gemma 2 numerical issues for FA #9166

Merged

Conversation

JohannesGaessler (Collaborator)

Fixup to #8542.

There seem to be numerical issues which, depending on the inputs, can cause incorrect results with Gemma 2 when using FlashAttention. This PR sets the FA precision to FP32, which fixes the issue described in #8542 (comment).
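
For reference, here is a minimal sketch of what forcing FP32 precision for the fused FlashAttention op looks like at the ggml graph level; the function name, tensor names, and the exact call site are illustrative assumptions, not the actual llama.cpp code:

```cpp
#include "ggml.h"

// Sketch: build the fused FlashAttention op and request FP32 accumulation for it.
// Tensor shapes, KV-cache handling, and the surrounding graph code are simplified.
static struct ggml_tensor * build_attn_fa_fp32(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,        // query, permuted to [head_dim, n_head, n_tokens]
        struct ggml_tensor  * k,        // key cache view
        struct ggml_tensor  * v,        // value cache view
        struct ggml_tensor  * kq_mask,  // attention mask
        float                 kq_scale,
        float                 logit_softcap) { // Gemma 2 uses attention logit softcapping
    // fused FlashAttention: mask, scale, and softcapping are applied inside the kernel
    struct ggml_tensor * cur = ggml_flash_attn_ext(
            ctx, q, k, v, kq_mask, kq_scale, /*max_bias =*/ 0.0f, logit_softcap);

    // the relevant change: accumulate in FP32 instead of the default FP16
    ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);

    return cur;
}
```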

@strawberrymelonpanda (Contributor) commented Aug 25, 2024

Just pulled the PR and checked; it looks solved to me. Nice work!

./llama-cli --model "gemma-2-27b-it-imat-Q6_K (bartowski).gguf" --prompt "<start_of_turn>user\nhow many squares are on a chessboard?<end_of_turn>\n<start_of_turn>model\n" --verbose --special --flash-attn -ngl 99
This is a classic riddle! There are more squares on a chessboard than you might initially think. Here's how to figure it out: 
[...]

I also checked another Gemma 2 model (gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf) as well as Gemma 2 9B; both are fine.

@strawberrymelonpanda (Contributor) commented Aug 25, 2024

Benchmarks, for what it's worth:

gemma-2-27b-it-imat-Q6_K_(bartowski).gguf

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  0 |         pp512 |  1149.13 ± 15.39 |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  0 |         tg128 |     29.53 ± 0.28 |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  1 |         pp512 |  1230.19 ± 10.37 |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  1 |         tg128 |     30.58 ± 0.12 |

gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |         pp512 |   1299.05 ± 8.52 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |         tg128 |     34.25 ± 0.23 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |         pp512 |  1350.98 ± 22.11 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |         tg128 |     35.62 ± 0.23 |

@JohannesGaessler (Collaborator, Author)

You should see more of a difference if you set higher values for -n and -p, since with a mostly empty context the attention is comparatively fast.
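
For example, assuming the tables above were produced with llama-bench, a run along these lines would exercise a fuller context:

./llama-bench -m gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf -ngl 99 -fa 0,1 -p 4096 -n 1024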

@strawberrymelonpanda (Contributor) commented Aug 25, 2024

Thanks for the tip. I was mostly just testing and wanted to show that setting FP32 doesn't seem to cause any issues or slowdowns.

Here are some results with higher values.

| model                          |       size |     params | backend    | ngl | fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |        pp4096 |   1052.16 ± 8.75 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |        tg1024 |     33.08 ± 0.09 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |        pp4096 |   1256.27 ± 5.57 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |        tg1024 |     35.13 ± 0.17 |

Looks like a nice improvement to me.

@JohannesGaessler JohannesGaessler merged commit f91fc56 into ggml-org:master Aug 25, 2024
49 of 52 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Feb 25, 2025