
vulkan: further optimize q5_k mul_mat_vec #10479

Merged 1 commit into ggerganov:master on Nov 27, 2024
Conversation

@jeffbolznv (Collaborator)

Do some of the logic ops in packed u32.

Perf results on RTX 4070. Note that this "phi3 3B Q4_K" model uses Q5_K maybe a third of the time.

before
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    98.65 us/run - 117.44 MFLOP/run -   1.19 TFLOPS
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        108.54 ± 1.25 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |        112.41 ± 2.25 |

after
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  60492 runs -    82.96 us/run - 117.44 MFLOP/run -   1.42 TFLOPS
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.39 ± 0.47 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |        117.24 ± 1.19 |

@jeffbolznv jeffbolznv requested a review from 0cc4m November 25, 2024 04:04
@daniandtheweb (Contributor)

These changes make quite a big difference on my Radeon 5700XT.

| model                 | size     | params | backend | ngl | threads | test  | branch | t/s          |
| --------------------- | -------- | ------ | ------- | --- | ------- | ----- | ------ | ------------ |
| qwen2 7B Q5_K - Small | 4.94 GiB | 7.62 B | Vulkan  | 99  | 4       | tg128 | master | 41.07 ± 0.06 |
| qwen2 7B Q5_K - Small | 4.94 GiB | 7.62 B | Vulkan  | 99  | 4       | tg128 | PR     | 49.23 ± 0.42 |

@netrunnereve (Collaborator)

I haven't tried it with an actual model, but our tests show that it's now about 6% faster on an RX 570.

Master:
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -  1195.05 us/run - 117.44 MFLOP/run -  98.27 GFLOPS
PR:
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -  1124.40 us/run - 117.44 MFLOP/run - 104.45 GFLOPS

@0cc4m 0cc4m merged commit 249a790 into ggerganov:master Nov 27, 2024
7 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024