
vulkan: optimize add/mul/div #10270

Merged: 1 commit into ggerganov:master, Nov 14, 2024

Conversation

jeffbolznv
Collaborator

Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 have the same dimensions, so the additional modulus operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod variants that take a fast path when the calculation isn't needed or can be done more cheaply.
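
For illustration, here is a minimal GLSL sketch of these three ideas (index reuse, a broadcast-aware src1 lookup, and fast div/mod). It is not the actual shader code from this PR: the push-constant layout, the names ne00..ne13, fast_div, and fast_mod, and the assumption of contiguous tensors are all illustrative.

```glsl
#version 450
// Sketch only: layout and names are assumptions, not the PR's real code.
layout(push_constant) uniform P {
    uint ne00, ne01, ne02, ne03;   // src0/dst logical dimensions
    uint ne10, ne11, ne12, ne13;   // src1 logical dimensions
} p;

// Integer div/mod are slow on GPUs; skip or cheapen them where possible.
uint fast_mod(uint a, uint b) {
    if ((b & (b - 1)) == 0) {
        return a & (b - 1);        // power-of-two b, including the common b == 1
    }
    return a % b;
}

uint fast_div(uint a, uint b) {
    return (a < b) ? 0 : (a / b);  // broadcast shapes often make a < b
}

// Decompose the flat element index into 4D coordinates once, then reuse the
// same coordinates for src0, src1, and dst instead of repeating the chain.
void get_indices(uint idx, out uint i00, out uint i01, out uint i02, out uint i03) {
    i03 = fast_div(idx, p.ne02 * p.ne01 * p.ne00);
    const uint r3 = idx - i03 * p.ne02 * p.ne01 * p.ne00;
    i02 = fast_div(r3, p.ne01 * p.ne00);
    const uint r2 = r3 - i02 * p.ne01 * p.ne00;
    i01 = fast_div(r2, p.ne00);
    i00 = r2 - i01 * p.ne00;
}

// src1 may be broadcast along any dimension; a separate shader variant can
// omit these moduli entirely when src0 and src1 have identical shapes.
uint src1_index(uint i00, uint i01, uint i02, uint i03) {
    return fast_mod(i00, p.ne10)
         + p.ne10 * (fast_mod(i01, p.ne11)
         + p.ne11 * (fast_mod(i02, p.ne12)
         + p.ne12 *  fast_mod(i03, p.ne13)));
}
```

The power-of-two mask in fast_mod handles b == 1 for free, and the a < b early-out in fast_div covers the broadcast case where the divide would otherwise be pure overhead.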

Perf data from RTX 4070:

Before:
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                 417690 runs -     2.41 us/run -       48 kB/run -   18.96 GB/s
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                23222 runs -    44.08 us/run -    24576 kB/run -  531.86 GB/s
  ADD(type=f32,ne=[512,3072,1,1],nr=[1,1,1,1]):                30957 runs -    33.92 us/run -    18432 kB/run -  518.39 GB/s
  ADD(type=f32,ne=[3072,128,1,1],nr=[1,1,1,1]):               101948 runs -    10.14 us/run -     4608 kB/run -  433.44 GB/s
  ADD(type=f32,ne=[3072,1,1,1],nr=[1,128,1,1]):               101948 runs -    10.06 us/run -     4608 kB/run -  436.66 GB/s
  ADD(type=f32,ne=[8192,128,1,1],nr=[1,1,1,1]):                43696 runs -    23.37 us/run -    12288 kB/run -  501.66 GB/s

After:
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                 450450 runs -     2.24 us/run -       48 kB/run -   20.44 GB/s
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                54640 runs -    18.75 us/run -    24576 kB/run - 1250.15 GB/s
  ADD(type=f32,ne=[512,3072,1,1],nr=[1,1,1,1]):                80124 runs -    12.50 us/run -    18432 kB/run - 1407.27 GB/s
  ADD(type=f32,ne=[3072,128,1,1],nr=[1,1,1,1]):               218460 runs -     4.59 us/run -     4608 kB/run -  956.98 GB/s
  ADD(type=f32,ne=[3072,1,1,1],nr=[1,128,1,1]):               174768 runs -     5.80 us/run -     4608 kB/run -  757.36 GB/s
  ADD(type=f32,ne=[8192,128,1,1],nr=[1,1,1,1]):               111971 runs -     9.01 us/run -    12288 kB/run - 1300.86 GB/s

CUDA, for comparison:
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                 548730 runs -     1.84 us/run -       48 kB/run -   24.93 GB/s
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                56006 runs -    17.96 us/run -    24576 kB/run - 1305.45 GB/s
  ADD(type=f32,ne=[512,3072,1,1],nr=[1,1,1,1]):                69198 runs -    14.56 us/run -    18432 kB/run - 1207.90 GB/s
  ADD(type=f32,ne=[3072,128,1,1],nr=[1,1,1,1]):               174768 runs -     5.85 us/run -     4608 kB/run -  751.17 GB/s
  ADD(type=f32,ne=[3072,1,1,1],nr=[1,128,1,1]):               218460 runs -     4.65 us/run -     4608 kB/run -  946.08 GB/s
  ADD(type=f32,ne=[8192,128,1,1],nr=[1,1,1,1]):                92854 runs -    10.86 us/run -    12288 kB/run - 1079.51 GB/s

jeffbolznv requested a review from 0cc4m, November 12, 2024 18:08
github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) labels, Nov 12, 2024
@ExtReMLapin
Contributor

Thank you, Jeff, for your many contributions to the Vulkan backend.

Curiosity question: if we stick to NVIDIA, is there any point in using Vulkan over CUDA? Some of your PRs seem to benefit NVIDIA devices the most, but if we stick to NVIDIA, why not just use CUDA?

Is it for embedded systems where CUDA may not be available? The Nintendo Switch, for example?

@jeffbolznv
Collaborator Author

Hi,

CUDA and Vulkan can both be good for machine learning on NVIDIA GPUs. There may be other factors that influence which API to use: for example, if you want to do ML from an application that's already using Vulkan, or one that needs to support different hardware, the Vulkan backend may be more natural. However, CUDA still gives the best performance in a lot of cases.

@MaggotHATE
Contributor

MaggotHATE commented Nov 13, 2024

but if we stick to NVIDIA, why not just use CUDA?

Unless it was fixed by NVIDIA, the CUDA Toolkit on Windows assumes you use MSVC, and compiling with GCC (MinGW, w64devkit) is impossible. Since CLBlast was removed, Vulkan is the only option for GPU offloading in such environments.

As for the PR itself, I see improvements in both prompt processing and inference on a GTX 1060 3GB (22b model, 3 layers offloaded): from ~13.55 t/s to 15.09 t/s in prompt processing, and from 2.10 t/s to 2.15 t/s in token generation.

@0cc4m (Collaborator) left a comment

Looks good, I tested it and saw no issues on my devices. Can you resolve the merge conflict?

@jeffbolznv
Collaborator Author

Thanks, rebased/resolved.

0cc4m merged commit af148c9 into ggerganov:master, Nov 14, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request, Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request, Nov 17, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request, Nov 18, 2024