
vulkan: optimize and reenable split_k #10637

Merged · 1 commit · Dec 3, 2024
Conversation

jeffbolznv (Collaborator)

Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the GPU's shader cores.

Split out from #10206.
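The split_k decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual ggml-vulkan code; the tile sizes `BM`/`BN`, the cap of 4, and the `shader_core_count` parameter are all hypothetical stand-ins:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the split_k heuristic: if tiling the m x n output produces
// fewer workgroups than the GPU has shader cores, split the k dimension
// so the extra workgroups fill the device. All constants are illustrative.
uint32_t choose_split_k(uint32_t m, uint32_t n,
                        uint32_t BM, uint32_t BN,
                        uint32_t shader_core_count) {
    uint32_t wg = ((m + BM - 1) / BM) * ((n + BN - 1) / BN);
    if (wg >= shader_core_count) {
        return 1; // enough workgroups to fill the device already
    }
    // Split k until the workgroup count covers the device, capped at 4
    // (each extra split adds partial-result memory and reduce work).
    return std::min<uint32_t>(4, (shader_core_count + wg - 1) / wg);
}
```

With the small m/n shapes benchmarked below (e.g. m=128, n=128 with 64x64 tiles), such a heuristic would pick the maximum split, which matches the large speedups seen there.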

I did a quick touch test to verify split_k helps the non-coopmat shaders as well:

before:
  MUL_MAT(type_a=f32,type_b=f32,m=128,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    426 runs -  2600.37 us/run - 469.76 MFLOP/run - 180.65 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=256,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    428 runs -  2569.83 us/run - 939.52 MFLOP/run - 365.60 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=384,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    426 runs -  2579.22 us/run -   1.41 GFLOP/run - 546.40 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=512,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    432 runs -  2582.09 us/run -   1.88 GFLOP/run - 727.72 GFLOPS

after:
  MUL_MAT(type_a=f32,type_b=f32,m=128,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -   664.08 us/run - 469.76 MFLOP/run - 707.39 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=256,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1605 runs -   656.67 us/run - 939.52 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=384,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1562 runs -   659.93 us/run -   1.41 GFLOP/run -   2.14 TFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=512,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1512 runs -   678.08 us/run -   1.88 GFLOP/run -   2.77 TFLOPS
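The reduce pass sums the split_k partial results, and the optimization is to load four values at a time where the size allows and fall back to scalar loads for the remainder. A rough CPU-side illustration of that vector + scalar pattern (not the actual GLSL shader; buffer layout and names are assumptions):

```cpp
#include <cstddef>

// CPU-side sketch of a split_k reduce: sum `split_k` partial result
// buffers (laid out back to back in `partials`) into `dst`, handling
// four floats per step where possible (mirroring vec4 loads in the
// shader) and falling back to scalar accumulation for the tail.
void split_k_reduce(const float *partials, float *dst,
                    size_t n_elems, size_t split_k) {
    size_t i = 0;
    // Vectorized path: 4 elements per step.
    for (; i + 4 <= n_elems; i += 4) {
        float acc[4] = {0, 0, 0, 0};
        for (size_t s = 0; s < split_k; s++) {
            const float *p = partials + s * n_elems + i;
            acc[0] += p[0]; acc[1] += p[1];
            acc[2] += p[2]; acc[3] += p[3];
        }
        dst[i] = acc[0]; dst[i + 1] = acc[1];
        dst[i + 2] = acc[2]; dst[i + 3] = acc[3];
    }
    // Scalar fallback for the remaining elements.
    for (; i < n_elems; i++) {
        float acc = 0;
        for (size_t s = 0; s < split_k; s++) {
            acc += partials[s * n_elems + i];
        }
        dst[i] = acc;
    }
}
```

Keeping both paths in one shader avoids compiling and dispatching a separate vector-only variant, which is the design point 0cc4m calls out in the review below.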

@jeffbolznv jeffbolznv requested a review from 0cc4m December 3, 2024 14:52
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 3, 2024
@0cc4m 0cc4m (Collaborator) left a comment

Thank you! Pretty good improvement even without coopmat. I should have retested it myself. I wouldn't have thought of combining vector and scalar loads in one shader; at best I'd have created a separate vector version.

@0cc4m 0cc4m merged commit cc98896 into ggerganov:master Dec 3, 2024
43 of 44 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Labels: ggml (changes relating to the ggml tensor library for machine learning), Vulkan (Issues specific to the Vulkan backend)
2 participants