vulkan: implement initial support for IQ2 and IQ3 quantizations #11360

Merged: 12 commits into ggerganov:master on Jan 29, 2025

Conversation

@remyoudompheng (Contributor) commented Jan 22, 2025

This pull request implements basic support for IQ2 and IQ3 quantizations in the Vulkan backend, with tentatively acceptable performance (there are probably further possible improvements). Unfortunately, I do not have access to coopmat2 hardware, so there may be typos in the proposed implementation.

One commit modifies the Q3_K implementation to improve performance, but it may be unwelcome in this PR.

The existing init_iq4nl_shmem function has been renamed to a more generic name in order to simplify ifdef logic.

Tests were performed on a Radeon 780M iGPU with Mesa 24.3.3 using the default compiler (ACO, not LLVM). It supports KHR_coopmat.

Performance results:

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 3B IQ2_M - 2.7 bpw       |   1.06 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        479.81 ± 0.67 |
| qwen2 3B IQ2_M - 2.7 bpw       |   1.06 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         26.03 ± 0.08 |
| qwen2 3B IQ3_XS - 3.3 bpw      |   1.29 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        516.05 ± 0.81 |
| qwen2 3B IQ3_XS - 3.3 bpw      |   1.29 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         39.28 ± 0.12 |
| qwen2 3B IQ3_S mix - 3.66 bpw  |   1.38 GiB |     3.09 B | Vulkan     |  99 |         pp512 |       509.56 ± 14.44 |
| qwen2 3B IQ3_S mix - 3.66 bpw  |   1.38 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         35.83 ± 0.55 |
| qwen2 3B Q3_K - Small          |   1.35 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        502.33 ± 0.74 |
| qwen2 3B Q3_K - Small          |   1.35 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         43.28 ± 0.54 |
| qwen2 3B Q3_K - Medium         |   1.48 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        504.33 ± 0.63 |
| qwen2 3B Q3_K - Medium         |   1.48 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         40.19 ± 0.33 |
| qwen2 3B Q4_K - Medium         |   1.79 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        509.78 ± 1.15 |
| qwen2 3B Q4_K - Medium         |   1.79 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         34.43 ± 0.19 |

Performance numbers from test-backend-ops:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   268.20 us/run - 117.44 MFLOP/run - 437.89 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   541.34 us/run - 117.44 MFLOP/run - 216.94 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   431.30 us/run - 117.44 MFLOP/run - 272.29 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                3408 runs -   344.17 us/run - 117.44 MFLOP/run - 341.23 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   365.16 us/run - 117.44 MFLOP/run - 321.62 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2556 runs -   490.09 us/run - 117.44 MFLOP/run - 239.63 GFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                3408 runs -   330.88 us/run - 117.44 MFLOP/run - 354.93 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2556 runs -   392.55 us/run - 117.44 MFLOP/run - 299.18 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   427.37 us/run - 117.44 MFLOP/run - 274.80 GFLOPS

Before Q3_K change:

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   541.34 us/run - 117.44 MFLOP/run - 216.94 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   34 runs - 29597.18 us/run -  60.13 GFLOP/run -   2.03 TFLOPS

After Q3_K change:

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   337.65 us/run - 117.44 MFLOP/run - 347.82 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   52 runs - 19365.63 us/run -  60.13 GFLOP/run -   3.10 TFLOPS

@github-actions bot added the "testing" (everything test related), "Vulkan" (issues specific to the Vulkan backend), and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Jan 22, 2025
@0cc4m self-requested a review on January 23, 2025 at 07:39
@0cc4m (Collaborator) commented Jan 23, 2025

Thank you, very cool! This will take a bit of time to review, I'll take a look this weekend. Can you fix the conflict? The mmq_wg_denoms fix has to be applied here too: #11343

@jeffbolznv (Collaborator):

Exciting to see this. I've done a quick check with coopmat2 and there are a few failures:

  MUL_MAT_ID(type_a=iq2_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 0.083714476 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=iq3_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 1.604342081 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=iq3_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 0.533599553 > 0.000500000 FAIL

Not seeing failures with coopmat1 or no coopmat. I'll try to debug these later. I haven't looked at the code yet.

@jeffbolznv jeffbolznv self-requested a review January 23, 2025 16:22
@jeffbolznv (Collaborator):

I was surprised it was only the MUL_MAT_ID tests failing, but it was due to a gap in test coverage, which #11375 will fix. MUL_MAT also fails for coopmat2 with the same types.

@jeffbolznv (Collaborator):

I went ahead and did the straightforward unoptimized "port" of the failing dequant callbacks from mul_mm.comp - just divide the index by 2 (because mul_mm does pairs at a time) and replace data_a with the block reference. Code is at 078ebe5. Feel free to pull this in however you want. IMO it would be OK to have these be unoptimized at first and you or I can optimize them later. I haven't done any perf testing yet.

@remyoudompheng (Contributor, Author):

Thanks for the comments. I rebased the branch to include #11343 and cherry-picked 078ebe5.

@jeffbolznv (Collaborator) left a review:

I didn't review the actual dequantization logic, but I reviewed the rest. And I still need to do some perf testing.

{
// copy the table into shared memory and sync
if (gl_LocalInvocationIndex.x < 32) {
for (uint i = gl_LocalInvocationIndex.x; i < 512; i += 32) {
Review comment (Collaborator):

loop bound mismatches the array size

Review comment (Collaborator):

Also, the increment could use gl_WorkGroupSize.x instead of hardcoding 32, but it probably won't affect performance much in practice.

Reply (Contributor, Author):

The loop looks better and is less error-prone using the workgroup size; I made the fix.
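
For illustration, here is a minimal sketch of the agreed loop shape (not the PR's actual shader; the array is shortened and its values are placeholders for the real 512-entry grid in types.comp). The stride comes from gl_WorkGroupSize.x and the bound from the shared array itself, so neither can silently drift out of sync with the table. (As later comments in this thread show, this exact form went on to hit a glslang constant-folding quirk.)

  #version 450

  layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

  // Placeholder values standing in for the real 512-entry lookup table.
  const uvec2 iq2xs_grid_const[4] = uvec2[](
      uvec2(0u, 0u), uvec2(1u, 1u), uvec2(2u, 2u), uvec2(3u, 3u)
  );

  shared uvec2 iq2xs_grid[4];

  void init_iq_shmem() {
      // Each invocation copies a strided slice of the table into shared memory.
      for (uint i = gl_LocalInvocationIndex; i < uint(iq2xs_grid.length()); i += gl_WorkGroupSize.x) {
          iq2xs_grid[i] = iq2xs_grid_const[i];
      }
      barrier();
  }

  void main() {
      init_iq_shmem();
      // ... the rest of the shader reads iq2xs_grid instead of the const table
  }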

uvec2(0x082b082b, 0x2b2b2b2b), uvec2(0x082b2b08, 0x2b2b2b2b), uvec2(0x2b082b08, 0x2b2b2b2b), uvec2(0x2b2b2b2b, 0x2b2b2b2b)
};

shared uvec2 iq2s_grid[1024];
Review comment (Collaborator):

We should probably account for these array sizes in ggml_vk_matmul_shmem_support. It's a bug that we didn't for iq4_nl, but these are significantly larger.

0x3e1c1c1c, 0x3e1c3404, 0x3e24140c, 0x3e24240c, 0x3e2c0404, 0x3e2c0414, 0x3e2c1424, 0x3e341c04,
};

shared uint32_t iq3xxs_grid[512];
Review comment (Collaborator):

array sizes don't match

Reply (Contributor, Author):

Fixed

@jeffbolznv (Collaborator):

I've been working on optimizing the cm2 dequant callbacks and am getting good speedups. I'm out of time to finish it tonight, but I'll share it in the morning.

@jeffbolznv (Collaborator):

Here are the cm2 optimizations: jeffbolznv@9079f06

RTX 4070 (including Q4_K for reference):

before:
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  616 runs -  1625.19 us/run -  60.13 GFLOP/run -  37.00 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               294 runs -  3402.81 us/run -  60.13 GFLOP/run -  17.67 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                308 runs -  3250.11 us/run -  60.13 GFLOP/run -  18.50 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 324 runs -  3099.17 us/run -  60.13 GFLOP/run -  19.40 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               378 runs -  2656.58 us/run -  60.13 GFLOP/run -  22.63 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 438 runs -  2291.55 us/run -  60.13 GFLOP/run -  26.24 TFLOPS  
  
after:
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  632 runs -  1583.83 us/run -  60.13 GFLOP/run -  37.96 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               596 runs -  1682.00 us/run -  60.13 GFLOP/run -  35.75 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                608 runs -  1647.64 us/run -  60.13 GFLOP/run -  36.49 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 490 runs -  2044.33 us/run -  60.13 GFLOP/run -  29.41 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               682 runs -  1467.80 us/run -  60.13 GFLOP/run -  40.97 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 672 runs -  1490.99 us/run -  60.13 GFLOP/run -  40.33 TFLOPS

@remyoudompheng (Contributor, Author):

Branch updated:

@remyoudompheng (Contributor, Author):

For some reason, commit 6ed3047 causes a huge performance regression on my device (Radeon 780M, with either the Mesa ACO or LLVM compiler), as if all benefit from the shared array were lost.

I restored the hardcoded 32 value in 3f7aa9d.

@jeffbolznv (Collaborator):

> fix array lengths issues

Looks good.

> no cherry-pick of jeffbolznv@9079f06

Do you want me to do this separately after you merge? Either way is OK with me.

> no change to shmem checks (they seem quite complex)

The tricky part is that we don't currently track which sizes are supported per-type, and now the shared memory usage depends on the type (well, it did previously for iq4_nl, but we got away with it). Maybe the easiest thing to do is just make bool mul_mat_l and friends into arrays and do the computation for all types. A simpler and less precise way might just be to say that these new types require 48KB of shared memory or more.

@0cc4m (Collaborator) left a review:

I don't see any issues on AMD, Intel or Nvidia in my tests. Performance isn't that good yet, but that can be improved separately.

@netrunnereve (Collaborator):

Well you beat me to this 😉

As @0cc4m mentioned I think that speed is not really a priority here and the important part is to have a functional implementation we can improve on in the future. This will probably need dedicated mat vec shaders like the K quants to get full performance.

Please see my review but aside from that your code looks fine (I didn't verify the actual dequantization algorithm though) and runs fine on my AMD GCN cards. If you want to be sure that your implementation is correct it's worth running a perplexity check against a different backend to see if the numbers match up.

@remyoudompheng (Contributor, Author) commented Jan 26, 2025

The bug regarding gl_WorkGroupSize happens at the glslc step: the emitted SPIR-V already looks wrong. Even with [[unroll]], the loop is compiled with a stride of 1, making each shader invocation very costly.

$ glslc  -DDATA_A_IQ2_XS=1 -DB_TYPE=float -DB_TYPE_VEC2=vec2 -DB_TYPE_VEC4=vec4 -DD_TYPE=float mul_mat_vec.comp -DFLOAT_TYPE=float -o mul_mat_vec_iq2xs.spv
$ spirv-dis mul_mat_vec_iq2xs.spv
...
               ; Function init_iq_shmem_
%init_iq_shmem_ = OpFunction %void None %7

         %11 = OpLabel
        %i_0 =   OpVariable %_ptr_Function_uint Function
  %indexable =   OpVariable %_ptr_Function__arr_v2uint_uint_512 Function
                 OpLine %4 553 0
                 OpStore %i_0 %uint_0
                 OpBranch %65

         %65 = OpLabel
                 OpLine %4 553 0
                 OpLoopMerge %67 %68 None
                 OpBranch %69

         %69 =     OpLabel
                     OpLine %4 553 0
         %70 =       OpLoad %uint %i_0
         %72 =       OpULessThan %bool %70 %uint_512
                     OpBranchConditional %72 %66 %67

         %66 =         OpLabel
                         OpLine %4 554 0
         %77 =           OpLoad %uint %i_0
         %80 =           OpLoad %uint %gl_LocalInvocationIndex
         %81 =           OpIAdd %uint %77 %80
        %676 =           OpLoad %uint %i_0
        %677 =           OpLoad %uint %gl_LocalInvocationIndex
        %678 =           OpIAdd %uint %676 %677
                         OpStore %indexable %675
        %682 =           OpAccessChain %_ptr_Function_v2uint %indexable %678
        %683 =           OpLoad %v2uint %682
        %685 =           OpAccessChain %_ptr_Workgroup_v2uint %iq2xs_grid %81
                         OpStore %685 %683
                         OpBranch %68

         %68 =   OpLabel
                   OpLine %4 553 0
        %687 =     OpLoad %uint %i_0
        %688 =     OpIAdd %uint %687 %uint_1
                   OpStore %i_0 %688
                   OpBranch %65

(notice the %uint_1 in instruction %688; it is %uint_32 when 32 is hardcoded)

It seems to be this issue: KhronosGroup/glslang#2479 (more precisely KhronosGroup/glslang#2627)

The workaround is to include types.comp after the workgroup size is declared in the shader. If this is fine with you, I can update the PR to use this method (meaning minor changes to all shaders to move the include, or at least the ones calling init_iq_shmem).
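
For illustration, a simplified sketch of this include-order workaround (a fixed workgroup size is used here for brevity; the real shaders are more involved). The point is only the ordering: the layout declaration comes before the include, so glslang already knows the real workgroup size when it constant-folds the stride inside init_iq_shmem:

  #version 450
  #extension GL_GOOGLE_include_directive : enable

  // Declared before the include, so gl_WorkGroupSize.x is 32 rather than
  // the default 1 when glslang folds the loop stride in init_iq_shmem.
  layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

  #include "types.comp"   // defines the lookup tables and init_iq_shmem

  void main() {
      init_iq_shmem();
      // ... mul_mat_vec body ...
  }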

@remyoudompheng (Contributor, Author):

PR updated:

llvmpipe is now happy

@jeffbolznv (Collaborator):

> The workaround is to include types.comp after the workgroup size is declared in the shader.

I'm worried that somebody will accidentally break this in the future and it'll be very confusing. Another option might be to pass the workgroup size as a function parameter.
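
A rough sketch of that option, reusing the array declarations from the earlier sketch (the signature is hypothetical, not necessarily what the branch ended up with): the caller forwards its own workgroup size, so the helper no longer reads gl_WorkGroupSize and the include order stops mattering:

  // In the shared helper: take the stride as an explicit parameter.
  void init_iq_shmem(const uint wg_size) {
      for (uint i = gl_LocalInvocationIndex; i < uint(iq2xs_grid.length()); i += wg_size) {
          iq2xs_grid[i] = iq2xs_grid_const[i];
      }
      barrier();
  }

  // In each calling shader, after its own layout declaration:
  //     init_iq_shmem(gl_WorkGroupSize.x);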

@0cc4m (Collaborator) commented Jan 26, 2025

I think a function parameter is a good idea, yeah.

@sorasoras:

Looking forward to IQ4_XS support as well.

@remyoudompheng (Contributor, Author) commented Jan 26, 2025

Indeed, it looks better with a function parameter; the branch is updated.

@sorasoras, feel free to test branch remyoudompheng@e955cbed.

@sorasoras:

> Indeed, it looks better with a function parameter; the branch is updated.
>
> @sorasoras, feel free to test branch remyoudompheng@e955cbed.

 .\llama-bench.exe -m W:\model\sakura-14b-qwen2beta-v0.9-IQ4_XS.gguf -ngl 99 -sm none
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| qwen2 13B IQ4_XS - 4.25 bpw    |   7.37 GiB |    14.17 B | Vulkan     |  99 |  none |         pp512 |       1418.05 ± 2.73 |
| qwen2 13B IQ4_XS - 4.25 bpw    |   7.37 GiB |    14.17 B | Vulkan     |  99 |  none |         tg128 |         50.25 ± 0.29 |

build: e955cbed (4570)

it looks ok

@remyoudompheng (Contributor, Author):

The test failure seems unrelated to patch correctness (llvmpipe is too slow for test-backend-ops to complete following the increase in test cases).

@jeffbolznv (Collaborator):

Is it possible to increase the timeout? Or what do we need to do to unblock this change?

@netrunnereve (Collaborator):

> Indeed, it looks better with a function parameter; the branch is updated.

We should probably make the change for iq4_nl as well so things are consistent.

> Is it possible to increase the timeout?

Yeah you can change it here, and if it's still faster than the CUDA and HIP builds I don't think anyone will mind.

ctest -L main --verbose --timeout 900

@github-actions bot added the "devops" (improvements to build systems and github actions) label on Jan 28, 2025
@remyoudompheng (Contributor, Author):

Branch updated:

  • refactor init_iq_shmem for iq4nl (0886297 replaced by aa17d32)
  • increase test timeout

@netrunnereve (Collaborator):

So I tried out the new GitHub ARM CI machine for fun, and that one actually finishes the llvmpipe test stage for master in only 7 minutes. The problem is that it has a subgroup size of 4 with a 128-bit-wide SIMD, and that's causing new test failures.

@jeffbolznv (Collaborator):

> The problem is that it has a subgroup size of 4 with a 128-bit-wide SIMD, and that's causing new test failures.

Can you clarify what you mean by 128b wide SIMD, and whether these new test failures are specific to this PR?

IMO this change is complete other than accounting for the additional shared memory usage. I think it would be OK to merge even with that bug and then I or @remyoudompheng could fix it after.

@0cc4m (Collaborator) commented Jan 29, 2025

> The problem is that it has a subgroup size of 4 with a 128-bit-wide SIMD, and that's causing new test failures.

> Can you clarify what you mean by 128b wide SIMD, and whether these new test failures are specific to this PR?

I think that's about llvmpipe using SIMD width as subgroup size. For AVX2 it was 8, on ARM it is apparently 4. But that is unrelated to this PR.

> IMO this change is complete other than accounting for the additional shared memory usage. I think it would be OK to merge even with that bug and then I or @remyoudompheng could fix it after.

Yeah, I think it's okay to merge. We can clean up the shared memory checks in another PR. I think the checks got messed up in some refactor anyway and do some redundant work; your original version of the check and my later implementation currently coexist.

@netrunnereve (Collaborator):

> I think that's about llvmpipe using SIMD width as subgroup size. For AVX2 it was 8, on ARM it is apparently 4. But that is unrelated to this PR.

Yep that's what I meant. I think we're fine to merge.

@0cc4m (Collaborator) commented Jan 29, 2025

Thank you for your contribution, @remyoudompheng!

@0cc4m merged commit 66ee4f2 into ggerganov:master on Jan 29, 2025 (45 checks passed)
@jeffbolznv (Collaborator):

FYI I've started implementing the shared memory accounting logic.

@remyoudompheng (Contributor, Author):

I think some typos have slipped into the dequant_iq* shaders, but it is very unclear how to spot them with test-backend-ops (see this diff for a few examples: 1b0739c).
