
CUDA: fix Gemma 2 numerical issues for FA #9166

Merged

Conversation

JohannesGaessler (Collaborator)

Fixup to #8542.

There seem to be numerical issues which, depending on the inputs, can cause incorrect results with Gemma 2 when using FlashAttention. This PR sets the FA precision to FP32, which fixes the issue described in #8542 (comment).
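
For reference, here is a minimal sketch of what forcing FP32 precision for the fused FlashAttention op looks like at the ggml graph level; the function name, tensor names, and the exact call site are illustrative assumptions, not the actual llama.cpp code:

```cpp
#include "ggml.h"

// Sketch: build the fused FlashAttention op and request FP32 accumulation for it.
// Tensor shapes, KV-cache handling, and the surrounding graph code are simplified.
static struct ggml_tensor * build_attn_fa_fp32(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,        // query, permuted to [head_dim, n_head, n_tokens]
        struct ggml_tensor  * k,        // key cache view
        struct ggml_tensor  * v,        // value cache view
        struct ggml_tensor  * kq_mask,  // attention mask
        float                 kq_scale,
        float                 logit_softcap) { // Gemma 2 uses attention logit softcapping
    // fused FlashAttention: mask, scale, and softcapping are applied inside the kernel
    struct ggml_tensor * cur = ggml_flash_attn_ext(
            ctx, q, k, v, kq_mask, kq_scale, /*max_bias =*/ 0.0f, logit_softcap);

    // the relevant change: accumulate in FP32 instead of the default FP16
    ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);

    return cur;
}
```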

@strawberrymelonpanda (Contributor) commented Aug 25, 2024

Just pulled the PR and checked; it looks solved to me. Nice work!

./llama-cli --model "gemma-2-27b-it-imat-Q6_K (bartowski).gguf" --prompt "<start_of_turn>user\nhow many squares are on a chessboard?<end_of_turn>\n<start_of_turn>model\n" --verbose --special --flash-attn -ngl 99
This is a classic riddle! There are more squares on a chessboard than you might initially think. Here's how to figure it out: 
[...]

I also checked another Gemma 2 model (gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf) as well as Gemma 2 9B; both are fine.

@strawberrymelonpanda (Contributor) commented Aug 25, 2024

Benchmarks, for what it's worth:

gemma-2-27b-it-imat-Q6_K_(bartowski).gguf

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  0 |         pp512 |  1149.13 ± 15.39 |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  0 |         tg128 |     29.53 ± 0.28 |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  1 |         pp512 |  1230.19 ± 10.37 |
| gemma2 27B Q6_K                |  21.70 GiB |    28.41 B | CUDA       |  99 |  1 |         tg128 |     30.58 ± 0.12 |

gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |         pp512 |   1299.05 ± 8.52 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |         tg128 |     34.25 ± 0.23 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |         pp512 |  1350.98 ± 22.11 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |         tg128 |     35.62 ± 0.23 |

@JohannesGaessler (Collaborator, Author)

You should see more of a difference if you set higher values for -n and -p, since with a mostly empty context the attention is comparatively fast.
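
For example, assuming the tables above were produced with llama-bench, a run along these lines would exercise a fuller context:

./llama-bench -m gemma-2-27b-it-SimPO-37K-Q5_K_M.gguf -ngl 99 -fa 0,1 -p 4096 -n 1024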

@strawberrymelonpanda (Contributor) commented Aug 25, 2024

Thanks for the tip. I was mostly just testing and wanted to show that setting FP32 doesn't seem to cause any issues or slowdowns.

Here are some results with higher values.

| model                          |       size |     params | backend    | ngl | fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |        pp4096 |   1052.16 ± 8.75 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  0 |        tg1024 |     33.08 ± 0.09 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |        pp4096 |   1256.27 ± 5.57 |
| gemma2 27B Q5_K - Medium       |  18.97 GiB |    28.41 B | CUDA       |  99 |  1 |        tg1024 |     35.13 ± 0.17 |

Looks like a nice improvement to me.

@JohannesGaessler JohannesGaessler merged commit f91fc56 into ggml-org:master Aug 25, 2024
49 of 52 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Feb 25, 2025