
vulkan: small mul_mat_vec optimizations #10665

Merged
44 commits merged into ggerganov:master from netrunnereve:vulkan on Dec 13, 2024

Conversation

@netrunnereve (Collaborator) commented Dec 4, 2024

Here are a couple of small optimizations for mul_mat_vec that came from studying the generated GCN assembly. The biggest improvement came from moving the delta multiplication out of the inner loop where possible (8 extra multiplications per inner-loop iteration add up fast).

Using the dot() function still generates FMA instructions, but due to the change in instruction ordering it now runs a little faster. I think some newer cards have dedicated dot-product units, and the compiler may have a better chance of using them if we call this function.
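To make the delta change concrete, here's a minimal C++ analog of the idea (the real change lives in the GLSL mul_mat_vec shaders; the block layout and names below are simplified placeholders, not the actual code):

```cpp
#include <cmath>
#include <cstdint>

// Toy stand-in for a quantized block: one per-block scale (delta) plus
// 32 quantized weights. The real shaders work on q4_0/q8_0/etc. blocks.
struct block_q8_0_like {
    float  d;       // per-block delta, already widened to f32 here
    int8_t qs[32];  // quantized weights
};

// Before: the delta is applied inside the innermost loop, so every
// element costs an extra multiplication on top of the FMA.
float dot_delta_inside(const block_q8_0_like &b, const float *y) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        acc = std::fma(b.d * (float) b.qs[i], y[i], acc);
    }
    return acc;
}

// After: accumulate the raw quant * activation products and apply the
// delta once per block, removing one multiply per inner-loop iteration.
float dot_delta_outside(const block_q8_0_like &b, const float *y) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        acc = std::fma((float) b.qs[i], y[i], acc);
    }
    return b.d * acc;
}
```

The dot() change mentioned above is the shader-side counterpart of the inner accumulation here: it still compiles to FMAs, just in an ordering the hardware handles a bit better.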

I tested this on my RX 470; I got it recently and no longer have access to the 570. Interestingly enough, my 470 (and my W8100 as well) downclocks pretty heavily when running mul_mat_vec because the hardware hits its power limit: it literally drops from 1200 MHz to 750 MHz during inference. I discussed this with an ex-AMD engineer, and after confirming that the board's power circuitry could handle it we made some VBIOS adjustments that let the card draw more power and run inference at around 1100 MHz. I mention this in case someone with a 470 is wondering why they can't get 27 t/s; I can explain more if people are interested. Just don't blame me if you blow your computer up 😁...

Master:

model size params backend ngl threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 pp512 131.27 ± 0.17
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 tg128 25.19 ± 0.01
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 pp512 127.40 ± 0.09
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 tg128 23.60 ± 0.06
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 pp512 128.87 ± 0.13
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 tg128 16.53 ± 0.01
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   273.43us/run - 117.44 MFLOP/run - 429.51 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   280.16us/run - 117.44 MFLOP/run - 419.20 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   378.93us/run - 117.44 MFLOP/run - 309.92 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   361.89us/run - 117.44 MFLOP/run - 324.52 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   441.92us/run - 117.44 MFLOP/run - 265.75 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 4260 runs -   291.88us/run - 117.44 MFLOP/run - 402.36 GFLOPS

PR:

model size params backend ngl threads test t/s
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 pp512 131.09 ± 0.35
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 100 8 tg128 26.94 ± 0.07
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 pp512 127.18 ± 0.18
llama 8B IQ4_NL - 4.5 bpw 4.35 GiB 8.03 B Vulkan 100 8 tg128 25.24 ± 0.07
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 pp512 129.33 ± 0.03
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 100 8 tg128 17.08 ± 0.00
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   254.45 us/run - 117.44 MFLOP/run - 461.54 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   279.52 us/run - 117.44 MFLOP/run - 420.15 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   350.07 us/run - 117.44 MFLOP/run - 335.48 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   352.30 us/run - 117.44 MFLOP/run - 333.35 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   429.12 us/run - 117.44 MFLOP/run - 273.68 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 4260 runs -   266.08 us/run - 117.44 MFLOP/run - 441.37 GFLOPS

By the way @0cc4m, have you tried implementing the dequantization and matrix multiplication logic in F16 for the F16 shaders? The activations and deltas are already in F16, and it might not be necessary to do all the math in F32.
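To illustrate what I mean, here's a rough host-side sketch (assuming a compiler and target with _Float16 support; the real change would live in the F16 shader variants, and the accumulator precision trade-off would need testing):

```cpp
#include <cstddef>

// Current pattern: widen every f16 operand to f32 and do all math in f32.
float dot_f32_math(const _Float16 *x, const _Float16 *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += (float) x[i] * (float) y[i];
    }
    return acc;
}

// Proposed idea: keep the multiply-accumulate in f16 and widen only the
// final result, at the cost of some accumulation precision.
float dot_f16_math(const _Float16 *x, const _Float16 *y, size_t n) {
    _Float16 acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return (float) acc;
}
```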

netrunnereve and others added 20 commits December 3, 2024 11:57
* server : force F16 KV cache for the draft model

ggml-ci

* server : fix draft params

ggml-ci

* server : various params fixes

ggml-ci
this doesn't work as expected
* metal : small-batch mat-mul kernels

ggml-ci

* metal : add rest of types

ggml-ci

* metal : final adjustments

ggml-ci

* metal : add comments

ggml-ci
* readme : document --no-display-prompt

* readme : update default prompt context size

* readme : remove unnecessary indentation

Indenting a line with four spaces makes Markdown treat that section as
plain text.

* readme : indent commands under bullets

* readme : indent commands in lettered list
* wip

* wip implementation f32

* kernel conv transpose 1d f32 working

* initial commit
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmit

* added 3 tests cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
* kqmax_new_j is the same in every thread within the warp after the operation at line 199, so this reduction can be omitted

* same problem in vec32

---------

Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
* hide buttons in dropdown menu

* use npm as deps manager and vite as bundler

* fix build

* fix build (2)

* fix responsive on mobile

* fix more problems on mobile

* sync build

* (test) add CI step for verifying build

* fix ci

* force rebuild .hpp files

* cmake: clean up generated files pre build
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
Co-authored-by: piDack <pcdack@hotmail.co>
* Add notes for a static build

* Update docs/build.md

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
…IDIA backend (#10584)

* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend

Move to compile-time backend selection to avoid latency at run time.
Add it to all oneMKL GEMM calls, and only for the NVIDIA backend.

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

* Formatting

* Address PR comments to increase readability

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
polynomial iq4_nl test (slower but keep as reference)
@netrunnereve requested a review from @0cc4m on December 4, 2024 22:52
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 4, 2024

@netrunnereve (Collaborator Author)

> If you can suggest a cheaper instance on Azure cloud that is compatible with Vulkan - would be great. Currently using Standard_NC4as_T4_v3

I took a quick look at the CI, and since it runs a 7B model in FP16 you pretty much need the T4, as that's the cheapest 16 GB Linux option on Azure. On the other hand, you could probably save a bit on your cloud bill by switching the CUDA V100 for a T4 and the Metal M4 for an M1 or M2.

@jeffbolznv (Collaborator)

I think most of the current warnings in ggml-vulkan.cpp are my fault, so I'll fix them.

@ggerganov (Owner)

> Metal M4 for an M1 or M2.

The M4 is a self-hosted Mac Mini that I bought to run the CI.

@netrunnereve (Collaborator Author)

Hmm the CI is failing since it's using llvmpipe instead of the GPU.

@ggerganov (Owner)

> Hmm the CI is failing since it's using llvmpipe instead of the GPU.

It should be using the correct device now. But some tests are still failing:

https://github.com/ggml-org/ci/tree/results/llama.cpp/26/a8406ba9198eb6fdd8329fa717555b4f77f05f/ggml-6-x86-vulkan-t4

@jeffbolznv (Collaborator)

#10763 should fix the mat_mul failures. The im2col and rope failures are news to me.

@jeffbolznv (Collaborator)

I can reproduce the other failures on Turing, I'll debug them.

@jeffbolznv (Collaborator)

I filed #10764 to track the Turing failures.

@netrunnereve marked this pull request as draft on December 11, 2024 02:04
@netrunnereve (Collaborator Author)

Setting this as draft until #10721 is merged.

@0cc4m (Collaborator) commented Dec 11, 2024

> Setting this as draft until #10721 is merged.

I'm sorry that it's taking a while. I'm still trying to figure out Intel and AMD. I think I'll merge a basic extension support version and then figure out how to deal with their specific issues.

@0cc4m (Collaborator) commented Dec 12, 2024

Once subgroup size control is merged, you'll probably have to rebase this PR so that it only contains your commits. I think the conflict comes from my branch, which I've already rebased and fixed.

@netrunnereve marked this pull request as ready for review on December 12, 2024 22:06
@jeffbolznv (Collaborator) left a comment

LGTM. I had previously tested on my system and perf was fine, so I don't think I need to retest.

@netrunnereve marked this pull request as draft on December 12, 2024 22:30
@netrunnereve (Collaborator Author)

Setting this back to draft, as this updated logic sets device->subgroup_size_control to 0 on my computer after the #10721 rebase. I'll need to dig more into this.

        device->subgroup_size_control = device->subgroup_size_control &&
                (subgroup_size_control_props.requiredSubgroupSizeStages & vk::ShaderStageFlagBits::eCompute) &&
                subgroup_size_control_features.subgroupSizeControl;

@netrunnereve marked this pull request as ready for review on December 13, 2024 01:27
@netrunnereve (Collaborator Author)

Okay this actually works properly now and I'm getting the same performance as before. There's no harm in setting subgroup_min_size and subgroup_max_size even if subgroup_size_control is set to false.

@0cc4m (Collaborator) commented Dec 13, 2024

> Setting this back to draft, as this updated logic sets device->subgroup_size_control to 0 on my computer after the #10721 rebase. I'll need to dig more into this.

        device->subgroup_size_control = device->subgroup_size_control &&
                (subgroup_size_control_props.requiredSubgroupSizeStages & vk::ShaderStageFlagBits::eCompute) &&
                subgroup_size_control_features.subgroupSizeControl;

I'm surprised this doesn't go through on your system. Can you upload your vulkaninfo output? I thought this would be true on any modern driver.

Edit: I see that Mesa reports 0 (no shader stages supported) for requiredSubgroupSizeStages on GCN. I didn't notice that earlier.
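For reference, here's a hedged vulkan-hpp sketch of the query behind that check (illustrative only, not the exact ggml-vulkan.cpp code): on Mesa/GCN the requiredSubgroupSizeStages mask comes back empty, so the compute-stage test fails, while the reported min/max subgroup sizes remain valid to record.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

void query_subgroup_size_control(vk::PhysicalDevice physical_device) {
    // Chain the subgroup size control properties into the properties2 query.
    vk::PhysicalDeviceSubgroupSizeControlPropertiesEXT subgroup_size_control_props;
    vk::PhysicalDeviceProperties2 props2;
    props2.pNext = &subgroup_size_control_props;
    physical_device.getProperties2(&props2);

    // The reported size range is meaningful regardless of the stage mask,
    // so storing min/max is harmless even when the control path stays off.
    const uint32_t subgroup_min_size = subgroup_size_control_props.minSubgroupSize;
    const uint32_t subgroup_max_size = subgroup_size_control_props.maxSubgroupSize;

    // Mesa on GCN reports requiredSubgroupSizeStages == 0, so this is the
    // test that ends up disabling subgroup_size_control there.
    const bool compute_stage_supported = static_cast<bool>(
        subgroup_size_control_props.requiredSubgroupSizeStages & vk::ShaderStageFlagBits::eCompute);

    (void) subgroup_min_size;
    (void) subgroup_max_size;
    (void) compute_stage_supported;
}
```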

@0cc4m (Collaborator) commented Dec 13, 2024

Radeon Pro VII:

MASTER
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         61.32 ± 0.31 |
NUM_ROWS=2
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         64.32 ± 0.06 |
NUM_ROWS=4
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         66.75 ± 0.03 |

Intel A770

MASTER
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         35.57 ± 0.03 |
NUM_ROWS=2
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         28.63 ± 0.03 |
NUM_ROWS=4
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         38.63 ± 0.04 |

RTX 3090

MASTER
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |        106.72 ± 0.23 |
NUM_ROWS=2
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |        110.11 ± 0.21 |
NUM_ROWS=4
| llama 7B Q4_0                  |   3.83 GiB |     7.24 B | Vulkan     |  99 |         tg128 |        108.73 ± 0.16 |

@0cc4m merged commit 64ae065 into ggerganov:master on Dec 13, 2024
47 checks passed
@netrunnereve deleted the vulkan branch on December 14, 2024 00:34
@netrunnereve (Collaborator Author)

Unsurprisingly Intel's being its usual weird self again with NUM_ROWS=2. My 470 could probably beat it if I used a 7B model.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
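Read together, those commits amount to something like the following hedged sketch; the helper, the base row count, and the GCN heuristic are illustrative stand-ins, not the actual ggml-vulkan selection code:

```cpp
#include <cstdint>

// Pick the number of matrix rows each mul_mat_vec workgroup handles.
uint32_t rows_per_workgroup(bool is_amd, bool is_intel,
                            uint32_t min_subgroup, uint32_t max_subgroup) {
    const uint32_t base_rows = 1;  // placeholder default, not the real constant

    // "use subgroup min and max to check for gcn": GCN only exposes wave64,
    // so both reported bounds are 64, unlike RDNA's 32..64 range.
    const bool looks_like_gcn = is_amd && min_subgroup == 64 && max_subgroup == 64;

    // "only increase the number of rows for amd and subgroup size 64" plus
    // "Also double the number of rows for Intel GPUs".
    if (looks_like_gcn || is_intel) {
        return base_rows * 2;
    }
    return base_rows;
}
```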