vulkan: implement initial support for IQ2 and IQ3 quantizations #11360

Merged: 12 commits into ggerganov:master on Jan 29, 2025

Conversation

@remyoudompheng (Contributor) commented Jan 22, 2025

This pull request implements basic support for IQ2 and IQ3 quantizations in the Vulkan backend, with tentatively acceptable performance (there are probably further possible improvements). Unfortunately, I do not have access to coopmat2 hardware, so there may be typos in the proposed implementation.

One commit modifies the Q3_K implementation to improve performance, but it may be unwelcome in this PR.

The existing init_iq4nl_shmem function has been renamed to a more generic name in order to simplify ifdef logic.

Tests were performed on a Radeon 780M iGPU with Mesa 24.3.3 using the default compiler (ACO, not LLVM). It supports KHR_coopmat.

Performance results:

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 3B IQ2_M - 2.7 bpw       |   1.06 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        479.81 ± 0.67 |
| qwen2 3B IQ2_M - 2.7 bpw       |   1.06 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         26.03 ± 0.08 |
| qwen2 3B IQ3_XS - 3.3 bpw      |   1.29 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        516.05 ± 0.81 |
| qwen2 3B IQ3_XS - 3.3 bpw      |   1.29 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         39.28 ± 0.12 |
| qwen2 3B IQ3_S mix - 3.66 bpw  |   1.38 GiB |     3.09 B | Vulkan     |  99 |         pp512 |       509.56 ± 14.44 |
| qwen2 3B IQ3_S mix - 3.66 bpw  |   1.38 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         35.83 ± 0.55 |
| qwen2 3B Q3_K - Small          |   1.35 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        502.33 ± 0.74 |
| qwen2 3B Q3_K - Small          |   1.35 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         43.28 ± 0.54 |
| qwen2 3B Q3_K - Medium         |   1.48 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        504.33 ± 0.63 |
| qwen2 3B Q3_K - Medium         |   1.48 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         40.19 ± 0.33 |
| qwen2 3B Q4_K - Medium         |   1.79 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        509.78 ± 1.15 |
| qwen2 3B Q4_K - Medium         |   1.79 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         34.43 ± 0.19 |

Performance numbers from test-backend-ops:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   268.20 us/run - 117.44 MFLOP/run - 437.89 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   541.34 us/run - 117.44 MFLOP/run - 216.94 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   431.30 us/run - 117.44 MFLOP/run - 272.29 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                3408 runs -   344.17 us/run - 117.44 MFLOP/run - 341.23 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   365.16 us/run - 117.44 MFLOP/run - 321.62 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2556 runs -   490.09 us/run - 117.44 MFLOP/run - 239.63 GFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                3408 runs -   330.88 us/run - 117.44 MFLOP/run - 354.93 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2556 runs -   392.55 us/run - 117.44 MFLOP/run - 299.18 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   427.37 us/run - 117.44 MFLOP/run - 274.80 GFLOPS

Before Q3_K change:

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   541.34 us/run - 117.44 MFLOP/run - 216.94 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   34 runs - 29597.18 us/run -  60.13 GFLOP/run -   2.03 TFLOPS

After Q3_K change:

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   337.65 us/run - 117.44 MFLOP/run - 347.82 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   52 runs - 19365.63 us/run -  60.13 GFLOP/run -   3.10 TFLOPS

@github-actions bot added the "testing" (everything test related), "Vulkan" (issues specific to the Vulkan backend), and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Jan 22, 2025
@0cc4m self-requested a review on January 23, 2025 at 07:39
@0cc4m (Collaborator) commented Jan 23, 2025

Thank you, very cool! This will take a bit of time to review, I'll take a look this weekend. Can you fix the conflict? The mmq_wg_denoms fix has to be applied here too: #11343

@jeffbolznv (Collaborator):

Exciting to see this. I've done a quick check with coopmat2 and there are a few failures:

  MUL_MAT_ID(type_a=iq2_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 0.083714476 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=iq3_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 1.604342081 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=iq3_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 0.533599553 > 0.000500000 FAIL

Not seeing failures with coopmat1 or no coopmat. I'll try to debug these later. I haven't looked at the code yet.

@jeffbolznv jeffbolznv self-requested a review January 23, 2025 16:22
@jeffbolznv (Collaborator):

I was surprised it was only the MUL_MAT_ID tests failing, but it was due to a gap in test coverage, which #11375 will fix. MUL_MAT also fails for coopmat2 with the same types.

@jeffbolznv (Collaborator):

I went ahead and did the straightforward unoptimized "port" of the failing dequant callbacks from mul_mm.comp - just divide the index by 2 (because mul_mm does pairs at a time) and replace data_a with the block reference. Code is at 078ebe5. Feel free to pull this in however you want. IMO it would be OK to have these be unoptimized at first and you or I can optimize them later. I haven't done any perf testing yet.

@remyoudompheng (Contributor, Author):

Thanks for the comments. I rebased the branch to include #11343 and cherry-picked 078ebe5.

@jeffbolznv (Collaborator) left a review:

I didn't review the actual dequantization logic, but I reviewed the rest. And I still need to do some perf testing.

{
// copy the table into shared memory and sync
if (gl_LocalInvocationIndex.x < 32) {
for (uint i = gl_LocalInvocationIndex.x; i < 512; i += 32) {
Review comment (Collaborator):

loop bound mismatches the array size

Review comment (Collaborator):

Also, the increment could use gl_WorkGroupSize.x instead of hardcoding 32, but it probably won't affect performance much in practice.

Reply (Contributor, Author):

The loop looks better and is less error-prone using the workgroup size; I made the fix.
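
For illustration, here is a minimal sketch of the agreed loop shape (not the PR's actual shader; the array is shortened and its values are placeholders for the real 512-entry grid in types.comp). The stride comes from gl_WorkGroupSize.x and the bound from the shared array itself, so neither can silently drift out of sync with the table. (As later comments in this thread show, this exact form went on to hit a glslang constant-folding quirk.)

  #version 450

  layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

  // Placeholder values standing in for the real 512-entry lookup table.
  const uvec2 iq2xs_grid_const[4] = uvec2[](
      uvec2(0u, 0u), uvec2(1u, 1u), uvec2(2u, 2u), uvec2(3u, 3u)
  );

  shared uvec2 iq2xs_grid[4];

  void init_iq_shmem() {
      // Each invocation copies a strided slice of the table into shared memory.
      for (uint i = gl_LocalInvocationIndex; i < uint(iq2xs_grid.length()); i += gl_WorkGroupSize.x) {
          iq2xs_grid[i] = iq2xs_grid_const[i];
      }
      barrier();
  }

  void main() {
      init_iq_shmem();
      // ... the rest of the shader reads iq2xs_grid instead of the const table
  }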

uvec2(0x082b082b, 0x2b2b2b2b), uvec2(0x082b2b08, 0x2b2b2b2b), uvec2(0x2b082b08, 0x2b2b2b2b), uvec2(0x2b2b2b2b, 0x2b2b2b2b)
};

shared uvec2 iq2s_grid[1024];
Review comment (Collaborator):

We should probably account for these array sizes in ggml_vk_matmul_shmem_support. It's a bug that we didn't for iq4_nl, but these are significantly larger.

0x3e1c1c1c, 0x3e1c3404, 0x3e24140c, 0x3e24240c, 0x3e2c0404, 0x3e2c0414, 0x3e2c1424, 0x3e341c04,
};

shared uint32_t iq3xxs_grid[512];
Review comment (Collaborator):

array sizes don't match

Reply (Contributor, Author):

Fixed

@jeffbolznv (Collaborator):

I've been working on optimizing the cm2 dequant callbacks and am getting good speedups. I'm out of time to finish it tonight, but I'll share it in the morning.

@jeffbolznv (Collaborator):

Here are the cm2 optimizations: jeffbolznv@9079f06

RTX 4070 (including Q4_K for reference):

before:
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  616 runs -  1625.19 us/run -  60.13 GFLOP/run -  37.00 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               294 runs -  3402.81 us/run -  60.13 GFLOP/run -  17.67 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                308 runs -  3250.11 us/run -  60.13 GFLOP/run -  18.50 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 324 runs -  3099.17 us/run -  60.13 GFLOP/run -  19.40 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               378 runs -  2656.58 us/run -  60.13 GFLOP/run -  22.63 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 438 runs -  2291.55 us/run -  60.13 GFLOP/run -  26.24 TFLOPS  
  
after:
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  632 runs -  1583.83 us/run -  60.13 GFLOP/run -  37.96 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               596 runs -  1682.00 us/run -  60.13 GFLOP/run -  35.75 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                608 runs -  1647.64 us/run -  60.13 GFLOP/run -  36.49 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 490 runs -  2044.33 us/run -  60.13 GFLOP/run -  29.41 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               682 runs -  1467.80 us/run -  60.13 GFLOP/run -  40.97 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 672 runs -  1490.99 us/run -  60.13 GFLOP/run -  40.33 TFLOPS

@remyoudompheng (Contributor, Author):

Branch updated:

@remyoudompheng (Contributor, Author):

For some reason, commit 6ed3047 causes a huge performance regression on my device (Radeon 780M, with either the Mesa ACO or LLVM compiler), as if all benefit from the shared array were lost.

I restored the hardcoded 32 value in 3f7aa9d.

@jeffbolznv (Collaborator):

> fix array lengths issues

Looks good.

> no cherry-pick of jeffbolznv@9079f06

Do you want me to do this separately after you merge? Either way is OK with me.

> no change to shmem checks (they seem quite complex)

The tricky part is that we don't currently track which sizes are supported per-type, and now the shared memory usage depends on the type (well, it did previously for iq4_nl, but we got away with it). Maybe the easiest thing to do is just make bool mul_mat_l and friends into arrays and do the computation for all types. A simpler and less precise way might just be to say that these new types require 48KB of shared memory or more.

@0cc4m (Collaborator) left a review:

I don't see any issues on AMD, Intel or Nvidia in my tests. Performance isn't that good yet, but that can be improved separately.

@netrunnereve (Collaborator):

Well you beat me to this 😉

As @0cc4m mentioned I think that speed is not really a priority here and the important part is to have a functional implementation we can improve on in the future. This will probably need dedicated mat vec shaders like the K quants to get full performance.

Please see my review but aside from that your code looks fine (I didn't verify the actual dequantization algorithm though) and runs fine on my AMD GCN cards. If you want to be sure that your implementation is correct it's worth running a perplexity check against a different backend to see if the numbers match up.

@remyoudompheng (Contributor, Author) commented Jan 26, 2025

The bug regarding gl_WorkGroupSize happens at the glslc step: the emitted SPIR-V already looks wrong. Even with [[unroll]], the loop is compiled with a stride of 1, making each shader invocation very costly.

$ glslc  -DDATA_A_IQ2_XS=1 -DB_TYPE=float -DB_TYPE_VEC2=vec2 -DB_TYPE_VEC4=vec4 -DD_TYPE=float mul_mat_vec.comp -DFLOAT_TYPE=float -o mul_mat_vec_iq2xs.spv
$ spirv-dis mul_mat_vec_iq2xs.spv
...
               ; Function init_iq_shmem_
%init_iq_shmem_ = OpFunction %void None %7

         %11 = OpLabel
        %i_0 =   OpVariable %_ptr_Function_uint Function
  %indexable =   OpVariable %_ptr_Function__arr_v2uint_uint_512 Function
                 OpLine %4 553 0
                 OpStore %i_0 %uint_0
                 OpBranch %65

         %65 = OpLabel
                 OpLine %4 553 0
                 OpLoopMerge %67 %68 None
                 OpBranch %69

         %69 =     OpLabel
                     OpLine %4 553 0
         %70 =       OpLoad %uint %i_0
         %72 =       OpULessThan %bool %70 %uint_512
                     OpBranchConditional %72 %66 %67

         %66 =         OpLabel
                         OpLine %4 554 0
         %77 =           OpLoad %uint %i_0
         %80 =           OpLoad %uint %gl_LocalInvocationIndex
         %81 =           OpIAdd %uint %77 %80
        %676 =           OpLoad %uint %i_0
        %677 =           OpLoad %uint %gl_LocalInvocationIndex
        %678 =           OpIAdd %uint %676 %677
                         OpStore %indexable %675
        %682 =           OpAccessChain %_ptr_Function_v2uint %indexable %678
        %683 =           OpLoad %v2uint %682
        %685 =           OpAccessChain %_ptr_Workgroup_v2uint %iq2xs_grid %81
                         OpStore %685 %683
                         OpBranch %68

         %68 =   OpLabel
                   OpLine %4 553 0
        %687 =     OpLoad %uint %i_0
        %688 =     OpIAdd %uint %687 %uint_1
                   OpStore %i_0 %688
                   OpBranch %65

(notice the %uint_1 in instruction %688; it is %uint_32 when 32 is hardcoded)

It seems to be this issue: KhronosGroup/glslang#2479 (more precisely KhronosGroup/glslang#2627)

The workaround is to include types.comp after the workgroup size is declared in the shader. If this is fine with you, I can update the PR to use this method (meaning minor changes to all shaders to move the include, or at least the ones calling init_iq_shmem).
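
For illustration, a simplified sketch of this include-order workaround (a fixed workgroup size is used here for brevity; the real shaders are more involved). The point is only the ordering: the layout declaration comes before the include, so glslang already knows the real workgroup size when it constant-folds the stride inside init_iq_shmem:

  #version 450
  #extension GL_GOOGLE_include_directive : enable

  // Declared before the include, so gl_WorkGroupSize.x is 32 rather than
  // the default 1 when glslang folds the loop stride in init_iq_shmem.
  layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

  #include "types.comp"   // defines the lookup tables and init_iq_shmem

  void main() {
      init_iq_shmem();
      // ... mul_mat_vec body ...
  }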

@remyoudompheng (Contributor, Author):

PR updated:

llvmpipe is now happy

@jeffbolznv (Collaborator):

> The workaround is to include types.comp after the workgroup size is declared in the shader.

I'm worried that somebody will accidentally break this in the future and it'll be very confusing. Another option might be to pass the workgroup size as a function parameter.
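
A rough sketch of that option, reusing the array declarations from the earlier sketch (the signature is hypothetical, not necessarily what the branch ended up with): the caller forwards its own workgroup size, so the helper no longer reads gl_WorkGroupSize and the include order stops mattering:

  // In the shared helper: take the stride as an explicit parameter.
  void init_iq_shmem(const uint wg_size) {
      for (uint i = gl_LocalInvocationIndex; i < uint(iq2xs_grid.length()); i += wg_size) {
          iq2xs_grid[i] = iq2xs_grid_const[i];
      }
      barrier();
  }

  // In each calling shader, after its own layout declaration:
  //     init_iq_shmem(gl_WorkGroupSize.x);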

@0cc4m (Collaborator) commented Jan 26, 2025

I think a function parameter is a good idea, yeah.

@sorasoras:

Looking forward to IQ4_XS support as well.

@remyoudompheng (Contributor, Author) commented Jan 26, 2025

Indeed, it looks better with a function parameter; the branch is updated.

@sorasoras, feel free to test branch remyoudompheng@e955cbed.

@sorasoras:

> Indeed, it looks better with a function parameter; the branch is updated.
>
> @sorasoras, feel free to test branch remyoudompheng@e955cbed.

 .\llama-bench.exe -m W:\model\sakura-14b-qwen2beta-v0.9-IQ4_XS.gguf -ngl 99 -sm none
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| qwen2 13B IQ4_XS - 4.25 bpw    |   7.37 GiB |    14.17 B | Vulkan     |  99 |  none |         pp512 |       1418.05 ± 2.73 |
| qwen2 13B IQ4_XS - 4.25 bpw    |   7.37 GiB |    14.17 B | Vulkan     |  99 |  none |         tg128 |         50.25 ± 0.29 |

build: e955cbed (4570)

it looks ok

@remyoudompheng (Contributor, Author):

The test failure seems unrelated to patch correctness (llvmpipe is too slow for test-backend-ops to complete following the increase in test cases).

@jeffbolznv (Collaborator):

Is it possible to increase the timeout? Or what do we need to do to unblock this change?

@netrunnereve (Collaborator):

> Indeed, it looks better with a function parameter; the branch is updated.

We should probably make the change for iq4_nl as well so things are consistent.

> Is it possible to increase the timeout?

Yeah you can change it here, and if it's still faster than the CUDA and HIP builds I don't think anyone will mind.

ctest -L main --verbose --timeout 900

@github-actions bot added the "devops" (improvements to build systems and github actions) label on Jan 28, 2025
@remyoudompheng (Contributor, Author):

Branch updated:

  • refactor init_iq_shmem for iq4nl (0886297 replaced by aa17d32)
  • increase test timeout

@netrunnereve (Collaborator):

So I tried out the new GitHub ARM CI machine for fun, and that one actually finishes the llvmpipe test stage for master in only 7 minutes. The problem is that it has a subgroup size of 4 with a 128-bit-wide SIMD, and that's causing new test failures.

@jeffbolznv (Collaborator):

> The problem is that it has a subgroup size of 4 with a 128-bit-wide SIMD, and that's causing new test failures.

Can you clarify what you mean by 128b wide SIMD, and whether these new test failures are specific to this PR?

IMO this change is complete other than accounting for the additional shared memory usage. I think it would be OK to merge even with that bug and then I or @remyoudompheng could fix it after.

@0cc4m (Collaborator) commented Jan 29, 2025

> The problem is that it has a subgroup size of 4 with a 128-bit-wide SIMD, and that's causing new test failures.

> Can you clarify what you mean by 128b wide SIMD, and whether these new test failures are specific to this PR?

I think that's about llvmpipe using SIMD width as subgroup size. For AVX2 it was 8, on ARM it is apparently 4. But that is unrelated to this PR.

> IMO this change is complete other than accounting for the additional shared memory usage. I think it would be OK to merge even with that bug and then I or @remyoudompheng could fix it after.

Yeah, I think it's okay to merge. We can clean up the shared memory checks in another PR. I think the checks got messed up in some refactor anyway and do some redundant work; your original version of the check and my later implementation currently coexist.

@netrunnereve (Collaborator):

> I think that's about llvmpipe using SIMD width as subgroup size. For AVX2 it was 8, on ARM it is apparently 4. But that is unrelated to this PR.

Yep that's what I meant. I think we're fine to merge.

@0cc4m (Collaborator) commented Jan 29, 2025

Thank you for your contribution, @remyoudompheng!

@0cc4m merged commit 66ee4f2 into ggerganov:master on Jan 29, 2025 (45 checks passed)
@jeffbolznv (Collaborator):

FYI I've started implementing the shared memory accounting logic.

@remyoudompheng (Contributor, Author):

I think some typos have slipped into the dequant_iq* shaders, but it is very unclear how to spot them with test-backend-ops (see this diff for a few examples: 1b0739c).
