
vulkan: further optimize q5_k mul_mat_vec #10479

Merged 1 commit into ggerganov:master on Nov 27, 2024
Conversation

@jeffbolznv (Collaborator)

Do some of the logic ops in packed u32.

Perf results on RTX 4070. Note that this "phi3 3B Q4_K" model uses Q5_K maybe a third of the time.

before
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    98.65 us/run - 117.44 MFLOP/run -   1.19 TFLOPS
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        108.54 ± 1.25 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |        112.41 ± 2.25 |

after
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  60492 runs -    82.96 us/run - 117.44 MFLOP/run -   1.42 TFLOPS
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.39 ± 0.47 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |        117.24 ± 1.19 |

@jeffbolznv jeffbolznv requested a review from 0cc4m November 25, 2024 04:04
@daniandtheweb (Contributor)

These changes make quite a big difference on my Radeon 5700XT.

| model                 | size     | params | backend | ngl | threads | test  | branch | t/s          |
| --------------------- | -------- | ------ | ------- | --- | ------- | ----- | ------ | ------------ |
| qwen2 7B Q5_K - Small | 4.94 GiB | 7.62 B | Vulkan  | 99  | 4       | tg128 | master | 41.07 ± 0.06 |
| qwen2 7B Q5_K - Small | 4.94 GiB | 7.62 B | Vulkan  | 99  | 4       | tg128 | PR     | 49.23 ± 0.42 |

@netrunnereve (Collaborator)

I haven't tried it with an actual model, but our tests show that it's now about 6% faster on an RX 570.

Master:
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -  1195.05 us/run - 117.44 MFLOP/run -  98.27 GFLOPS
PR:
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -  1124.40 us/run - 117.44 MFLOP/run - 104.45 GFLOPS

@0cc4m 0cc4m merged commit 249a790 into ggerganov:master Nov 27, 2024
7 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024