
vulkan: small mul_mat_vec optimizations #10665

Merged
44 commits merged into ggerganov:master from netrunnereve:vulkan on Dec 13, 2024

Conversation

@netrunnereve (Collaborator) commented Dec 4, 2024

Here are a couple of small optimizations for mul_mat_vec that came from studying the generated GCN assembly. The biggest improvement came from moving the delta multiplication out of the inner loop where possible (8 extra multiplications per inner-loop iteration add up fast).

Using the dot() function still generates FMA instructions, but due to the change in instruction ordering it now runs a little faster. I think some newer cards have dedicated dot-product units, and the compiler may have a better chance of using them if we call this function.
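To make the delta change concrete, here's a minimal C++ analog of the idea (the real change lives in the GLSL mul_mat_vec shaders; the block layout and names below are simplified placeholders, not the actual code):

```cpp
#include <cmath>
#include <cstdint>

// Toy stand-in for a quantized block: one per-block scale (delta) plus
// 32 quantized weights. The real shaders work on q4_0/q8_0/etc. blocks.
struct block_q8_0_like {
    float  d;       // per-block delta, already widened to f32 here
    int8_t qs[32];  // quantized weights
};

// Before: the delta is applied inside the innermost loop, so every
// element costs an extra multiplication on top of the FMA.
float dot_delta_inside(const block_q8_0_like &b, const float *y) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        acc = std::fma(b.d * (float) b.qs[i], y[i], acc);
    }
    return acc;
}

// After: accumulate the raw quant * activation products and apply the
// delta once per block, removing one multiply per inner-loop iteration.
float dot_delta_outside(const block_q8_0_like &b, const float *y) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        acc = std::fma((float) b.qs[i], y[i], acc);
    }
    return b.d * acc;
}
```

The dot() change mentioned above is the shader-side counterpart of the inner accumulation here: it still compiles to FMAs, just in an ordering the hardware handles a bit better.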

I tested this on my RX 470; I got it recently and no longer have access to the 570. Interestingly enough, my 470 (and my W8100 as well) downclocks pretty heavily when running mul_mat_vec because the hardware hits its power limit: it literally drops from 1200 MHz to 750 MHz during inference. I discussed this with an ex-AMD engineer, and after confirming that the board's power circuitry could handle it we made some VBIOS adjustments that let the card draw more power and run inference at around 1100 MHz. I mention this in case someone with a 470 is wondering why they can't get 27 t/s; I can explain more if people are interested. Just don't blame me if you blow your computer up 😁...

Master:

model size params backend ngl threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 pp512 131.27 ± 0.17
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 tg128 25.19 ± 0.01
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 pp512 127.40 ± 0.09
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 tg128 23.60 ± 0.06
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 pp512 128.87 ± 0.13
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 tg128 16.53 ± 0.01
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   273.43us/run - 117.44 MFLOP/run - 429.51 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   280.16us/run - 117.44 MFLOP/run - 419.20 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   378.93us/run - 117.44 MFLOP/run - 309.92 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   361.89us/run - 117.44 MFLOP/run - 324.52 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   441.92us/run - 117.44 MFLOP/run - 265.75 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 4260 runs -   291.88us/run - 117.44 MFLOP/run - 402.36 GFLOPS

PR:

model size params backend ngl threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 pp512 131.09 ± 0.35
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 tg128 26.94 ± 0.07
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 pp512 127.18 ± 0.18
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 tg128 25.24 ± 0.07
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 pp512 129.33 ± 0.03
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 tg128 17.08 ± 0.00
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   254.45 us/run - 117.44 MFLOP/run - 461.54 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   279.52 us/run - 117.44 MFLOP/run - 420.15 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   350.07 us/run - 117.44 MFLOP/run - 335.48 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   352.30 us/run - 117.44 MFLOP/run - 333.35 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   429.12 us/run - 117.44 MFLOP/run - 273.68 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 4260 runs -   266.08 us/run - 117.44 MFLOP/run - 441.37 GFLOPS

By the way @0cc4m, have you tried implementing the dequantization and matrix multiplication logic in F16 for the F16 shaders? The activations and deltas are already in F16, and it might not be necessary to do all the math in F32.
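To illustrate what I mean, here's a rough host-side sketch (assuming a compiler and target with _Float16 support; the real change would live in the F16 shader variants, and the accumulator precision trade-off would need testing):

```cpp
#include <cstddef>

// Current pattern: widen every f16 operand to f32 and do all math in f32.
float dot_f32_math(const _Float16 *x, const _Float16 *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += (float) x[i] * (float) y[i];
    }
    return acc;
}

// Proposed idea: keep the multiply-accumulate in f16 and widen only the
// final result, at the cost of some accumulation precision.
float dot_f16_math(const _Float16 *x, const _Float16 *y, size_t n) {
    _Float16 acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return (float) acc;
}
```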

netrunnereve and others added 20 commits December 3, 2024 11:57
* server : force F16 KV cache for the draft model

ggml-ci

* server : fix draft params

ggml-ci

* server : various params fixes

ggml-ci
this doesn't work as expected
* metal : small-batch mat-mul kernels

ggml-ci

* metal : add rest of types

ggml-ci

* metal : final adjustments

ggml-ci

* metal : add comments

ggml-ci
* readme : document --no-display-prompt

* readme : update default prompt context size

* readme : remove unnecessary indentation

Indenting a line with four spaces makes Markdown treat that section as
plain text.

* readme : indent commands under bullets

* readme : indent commands in lettered list
* wip

* wip implementation f32

* kernel conv transpose 1d f32 working

* initial commit
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmit

* added 3 tests cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
* kqmax_new_j is the same in every thread within the warp after the operation at line 199, so this reduction can be omitted

* same problem in vec32

---------

Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
* hide buttons in dropdown menu

* use npm as deps manager and vite as bundler

* fix build

* fix build (2)

* fix responsive on mobile

* fix more problems on mobile

* sync build

* (test) add CI step for verifying build

* fix ci

* force rebuild .hpp files

* cmake: clean up generated files pre build
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
Co-authored-by: piDack <pcdack@hotmail.co>
* Add notes for a static build

* Update docs/build.md

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
…IDIA backend (#10584)

* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend

Move to compile-time backend selection to avoid latency at run time.
Add it to all oneMKL GEMM calls, and only for the NVIDIA backend.

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

* Formatting

* Address PR comments to increase readability

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
polynomial iq4_nl test (slower but keep as reference)
@netrunnereve requested a review from @0cc4m on December 4, 2024 22:52
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 4, 2024

@netrunnereve (Collaborator Author)

> If you can suggest a cheaper instance on Azure cloud that is compatible with Vulkan - would be great. Currently using Standard_NC4as_T4_v3

I took a quick look at the CI, and since it runs a 7B model in FP16 you pretty much need the T4, as that's the cheapest 16 GB Linux option on Azure. On the other hand, you could probably save a bit on your cloud bill by switching the CUDA V100 for a T4 and the Metal M4 for an M1 or M2.

@jeffbolznv (Collaborator)

I think most of the current warnings in ggml-vulkan.cpp are my fault, so I'll fix them.

@ggerganov (Owner)

> Metal M4 for an M1 or M2.

The M4 is a self-hosted Mac Mini that I bought to run the CI.

@netrunnereve (Collaborator Author)

Hmm the CI is failing since it's using llvmpipe instead of the GPU.

@ggerganov (Owner)

> Hmm the CI is failing since it's using llvmpipe instead of the GPU.

It should be using the correct device now. But some tests are still failing:

https://github.com/ggml-org/ci/tree/results/llama.cpp/26/a8406ba9198eb6fdd8329fa717555b4f77f05f/ggml-6-x86-vulkan-t4

@jeffbolznv (Collaborator)

#10763 should fix the mat_mul failures. The im2col and rope failures are news to me.

@jeffbolznv (Collaborator)

I can reproduce the other failures on Turing, I'll debug them.

@jeffbolznv (Collaborator)

I filed #10764 to track the Turing failures.

@netrunnereve marked this pull request as draft on December 11, 2024 02:04
@netrunnereve (Collaborator Author)

Setting this as draft until #10721 is merged.

@0cc4m (Collaborator) commented Dec 11, 2024

> Setting this as draft until #10721 is merged.

I'm sorry that it's taking a while. I'm still trying to figure out Intel and AMD. I think I'll merge a basic extension support version and then figure out how to deal with their specific issues.

@0cc4m (Collaborator) commented Dec 12, 2024

Once subgroup size control is merged, you'll probably have to rebase this PR so that it only contains your commits. I think the conflict comes from my branch, which I've already rebased and fixed.

@netrunnereve marked this pull request as ready for review on December 12, 2024 22:06
@jeffbolznv (Collaborator) left a comment

LGTM. I had previously tested on my system and perf was fine, so I don't think I need to retest.

@netrunnereve marked this pull request as draft on December 12, 2024 22:30
@netrunnereve (Collaborator Author)

Setting this back to draft, as this updated logic sets device->subgroup_size_control to 0 on my computer after the #10721 rebase. I'll need to dig more into this.

        device->subgroup_size_control = device->subgroup_size_control &&
                (subgroup_size_control_props.requiredSubgroupSizeStages & vk::ShaderStageFlagBits::eCompute) &&
                subgroup_size_control_features.subgroupSizeControl;

@netrunnereve marked this pull request as ready for review on December 13, 2024 01:27
@netrunnereve (Collaborator Author)

Okay this actually works properly now and I'm getting the same performance as before. There's no harm in setting subgroup_min_size and subgroup_max_size even if subgroup_size_control is set to false.

@0cc4m (Collaborator) commented Dec 13, 2024

> Setting this back to draft, as this updated logic sets device->subgroup_size_control to 0 on my computer after the #10721 rebase. I'll need to dig more into this.

        device->subgroup_size_control = device->subgroup_size_control &&
                (subgroup_size_control_props.requiredSubgroupSizeStages & vk::ShaderStageFlagBits::eCompute) &&
                subgroup_size_control_features.subgroupSizeControl;

I'm surprised this doesn't go through on your system. Can you upload your vulkaninfo output? I thought this would be true on any modern driver.

Edit: I see that Mesa reports 0 (no shader stages supported) for requiredSubgroupSizeStages on GCN. I didn't notice that earlier.
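For reference, here's a hedged vulkan-hpp sketch of the query behind that check (illustrative only, not the exact ggml-vulkan.cpp code): on Mesa/GCN the requiredSubgroupSizeStages mask comes back empty, so the compute-stage test fails, while the reported min/max subgroup sizes remain valid to record.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

void query_subgroup_size_control(vk::PhysicalDevice physical_device) {
    // Chain the subgroup size control properties into the properties2 query.
    vk::PhysicalDeviceSubgroupSizeControlPropertiesEXT subgroup_size_control_props;
    vk::PhysicalDeviceProperties2 props2;
    props2.pNext = &subgroup_size_control_props;
    physical_device.getProperties2(&props2);

    // The reported size range is meaningful regardless of the stage mask,
    // so storing min/max is harmless even when the control path stays off.
    const uint32_t subgroup_min_size = subgroup_size_control_props.minSubgroupSize;
    const uint32_t subgroup_max_size = subgroup_size_control_props.maxSubgroupSize;

    // Mesa on GCN reports requiredSubgroupSizeStages == 0, so this is the
    // test that ends up disabling subgroup_size_control there.
    const bool compute_stage_supported = static_cast<bool>(
        subgroup_size_control_props.requiredSubgroupSizeStages & vk::ShaderStageFlagBits::eCompute);

    (void) subgroup_min_size;
    (void) subgroup_max_size;
    (void) compute_stage_supported;
}
```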

@0cc4m (Collaborator) commented Dec 13, 2024

Radeon Pro VII:

MASTER
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         61.32 ± 0.31 |
NUM_ROWS=2
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         64.32 ± 0.06 |
NUM_ROWS=4
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         66.75 ± 0.03 |

Intel A770

MASTER
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         35.57 ± 0.03 |
NUM_ROWS=2
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         28.63 ± 0.03 |
NUM_ROWS=4
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         38.63 ± 0.04 |

RTX 3090

MASTER
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |        106.72 ± 0.23 |
NUM_ROWS=2
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |        110.11 ± 0.21 |
NUM_ROWS=4
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |        108.73 ± 0.16 |

@0cc4m merged commit 64ae065 into ggerganov:master on Dec 13, 2024
47 checks passed
@netrunnereve deleted the vulkan branch on December 14, 2024 00:34
@netrunnereve (Collaborator Author)

Unsurprisingly Intel's being its usual weird self again with NUM_ROWS=2. My 470 could probably beat it if I used a 7B model.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
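Read together, those commits amount to something like the following hedged sketch; the helper, the base row count, and the GCN heuristic are illustrative stand-ins, not the actual ggml-vulkan selection code:

```cpp
#include <cstdint>

// Pick the number of matrix rows each mul_mat_vec workgroup handles.
uint32_t rows_per_workgroup(bool is_amd, bool is_intel,
                            uint32_t min_subgroup, uint32_t max_subgroup) {
    const uint32_t base_rows = 1;  // placeholder default, not the real constant

    // "use subgroup min and max to check for gcn": GCN only exposes wave64,
    // so both reported bounds are 64, unlike RDNA's 32..64 range.
    const bool looks_like_gcn = is_amd && min_subgroup == 64 && max_subgroup == 64;

    // "only increase the number of rows for amd and subgroup size 64" plus
    // "Also double the number of rows for Intel GPUs".
    if (looks_like_gcn || is_intel) {
        return base_rows * 2;
    }
    return base_rows;
}
```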