
vulkan: optimize and reenable split_k #10637

Merged · 1 commit · Dec 3, 2024
Conversation

jeffbolznv (Collaborator)

Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the GPU's shader cores.

Split out from #10206.
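The split_k decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual ggml-vulkan code; the tile sizes `BM`/`BN`, the cap of 4, and the `shader_core_count` parameter are all hypothetical stand-ins:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the split_k heuristic: if tiling the m x n output produces
// fewer workgroups than the GPU has shader cores, split the k dimension
// so the extra workgroups fill the device. All constants are illustrative.
uint32_t choose_split_k(uint32_t m, uint32_t n,
                        uint32_t BM, uint32_t BN,
                        uint32_t shader_core_count) {
    uint32_t wg = ((m + BM - 1) / BM) * ((n + BN - 1) / BN);
    if (wg >= shader_core_count) {
        return 1; // enough workgroups to fill the device already
    }
    // Split k until the workgroup count covers the device, capped at 4
    // (each extra split adds partial-result memory and reduce work).
    return std::min<uint32_t>(4, (shader_core_count + wg - 1) / wg);
}
```

With the small m/n shapes benchmarked below (e.g. m=128, n=128 with 64x64 tiles), such a heuristic would pick the maximum split, which matches the large speedups seen there.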

I did a quick touch test to verify split_k helps the non-coopmat shaders as well:

before:
  MUL_MAT(type_a=f32,type_b=f32,m=128,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    426 runs -  2600.37 us/run - 469.76 MFLOP/run - 180.65 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=256,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    428 runs -  2569.83 us/run - 939.52 MFLOP/run - 365.60 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=384,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    426 runs -  2579.22 us/run -   1.41 GFLOP/run - 546.40 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=512,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    432 runs -  2582.09 us/run -   1.88 GFLOP/run - 727.72 GFLOPS

after:
  MUL_MAT(type_a=f32,type_b=f32,m=128,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -   664.08 us/run - 469.76 MFLOP/run - 707.39 GFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=256,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1605 runs -   656.67 us/run - 939.52 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=384,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1562 runs -   659.93 us/run -   1.41 GFLOP/run -   2.14 TFLOPS
  MUL_MAT(type_a=f32,type_b=f32,m=512,n=128,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1512 runs -   678.08 us/run -   1.88 GFLOP/run -   2.77 TFLOPS
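The reduce pass sums the split_k partial results, and the optimization is to load four values at a time where the size allows and fall back to scalar loads for the remainder. A rough CPU-side illustration of that vector + scalar pattern (not the actual GLSL shader; buffer layout and names are assumptions):

```cpp
#include <cstddef>

// CPU-side sketch of a split_k reduce: sum `split_k` partial result
// buffers (laid out back to back in `partials`) into `dst`, handling
// four floats per step where possible (mirroring vec4 loads in the
// shader) and falling back to scalar accumulation for the tail.
void split_k_reduce(const float *partials, float *dst,
                    size_t n_elems, size_t split_k) {
    size_t i = 0;
    // Vectorized path: 4 elements per step.
    for (; i + 4 <= n_elems; i += 4) {
        float acc[4] = {0, 0, 0, 0};
        for (size_t s = 0; s < split_k; s++) {
            const float *p = partials + s * n_elems + i;
            acc[0] += p[0]; acc[1] += p[1];
            acc[2] += p[2]; acc[3] += p[3];
        }
        dst[i] = acc[0]; dst[i + 1] = acc[1];
        dst[i + 2] = acc[2]; dst[i + 3] = acc[3];
    }
    // Scalar fallback for the remaining elements.
    for (; i < n_elems; i++) {
        float acc = 0;
        for (size_t s = 0; s < split_k; s++) {
            acc += partials[s * n_elems + i];
        }
        dst[i] = acc;
    }
}
```

Keeping both paths in one shader avoids compiling and dispatching a separate vector-only variant, which is the design point 0cc4m calls out in the review below.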

@jeffbolznv jeffbolznv requested a review from 0cc4m December 3, 2024 14:52
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 3, 2024
@0cc4m 0cc4m (Collaborator) left a comment

Thank you! Pretty good improvement even without coopmat. I should have retested it myself. I wouldn't have thought of combining vector and scalar loads in one shader; at best I'd have created a separate vector version.

@0cc4m 0cc4m merged commit cc98896 into ggerganov:master Dec 3, 2024
43 of 44 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Labels: ggml (changes relating to the ggml tensor library for machine learning), Vulkan (Issues specific to the Vulkan backend)
2 participants