
vulkan: optimize add/mul/div #10270

Merged: 1 commit into ggerganov:master, Nov 14, 2024

Conversation

jeffbolznv
Collaborator

Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 have the same dimensions, so the additional modulus operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod variants that take a fast path when the calculation isn't needed or can be done more cheaply.
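
For illustration, here is a minimal GLSL sketch of these three ideas (index reuse, a broadcast-aware src1 lookup, and fast div/mod). It is not the actual shader code from this PR: the push-constant layout, the names ne00..ne13, fast_div, and fast_mod, and the assumption of contiguous tensors are all illustrative.

```glsl
#version 450
// Sketch only: layout and names are assumptions, not the PR's real code.
layout(push_constant) uniform P {
    uint ne00, ne01, ne02, ne03;   // src0/dst logical dimensions
    uint ne10, ne11, ne12, ne13;   // src1 logical dimensions
} p;

// Integer div/mod are slow on GPUs; skip or cheapen them where possible.
uint fast_mod(uint a, uint b) {
    if ((b & (b - 1)) == 0) {
        return a & (b - 1);        // power-of-two b, including the common b == 1
    }
    return a % b;
}

uint fast_div(uint a, uint b) {
    return (a < b) ? 0 : (a / b);  // broadcast shapes often make a < b
}

// Decompose the flat element index into 4D coordinates once, then reuse the
// same coordinates for src0, src1, and dst instead of repeating the chain.
void get_indices(uint idx, out uint i00, out uint i01, out uint i02, out uint i03) {
    i03 = fast_div(idx, p.ne02 * p.ne01 * p.ne00);
    const uint r3 = idx - i03 * p.ne02 * p.ne01 * p.ne00;
    i02 = fast_div(r3, p.ne01 * p.ne00);
    const uint r2 = r3 - i02 * p.ne01 * p.ne00;
    i01 = fast_div(r2, p.ne00);
    i00 = r2 - i01 * p.ne00;
}

// src1 may be broadcast along any dimension; a separate shader variant can
// omit these moduli entirely when src0 and src1 have identical shapes.
uint src1_index(uint i00, uint i01, uint i02, uint i03) {
    return fast_mod(i00, p.ne10)
         + p.ne10 * (fast_mod(i01, p.ne11)
         + p.ne11 * (fast_mod(i02, p.ne12)
         + p.ne12 *  fast_mod(i03, p.ne13)));
}
```

The power-of-two mask in fast_mod handles b == 1 for free, and the a < b early-out in fast_div covers the broadcast case where the divide would otherwise be pure overhead.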

Perf data from RTX 4070:

Before:
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                 417690 runs -     2.41 us/run -       48 kB/run -   18.96 GB/s
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                23222 runs -    44.08 us/run -    24576 kB/run -  531.86 GB/s
  ADD(type=f32,ne=[512,3072,1,1],nr=[1,1,1,1]):                30957 runs -    33.92 us/run -    18432 kB/run -  518.39 GB/s
  ADD(type=f32,ne=[3072,128,1,1],nr=[1,1,1,1]):               101948 runs -    10.14 us/run -     4608 kB/run -  433.44 GB/s
  ADD(type=f32,ne=[3072,1,1,1],nr=[1,128,1,1]):               101948 runs -    10.06 us/run -     4608 kB/run -  436.66 GB/s
  ADD(type=f32,ne=[8192,128,1,1],nr=[1,1,1,1]):                43696 runs -    23.37 us/run -    12288 kB/run -  501.66 GB/s

After:
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                 450450 runs -     2.24 us/run -       48 kB/run -   20.44 GB/s
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                54640 runs -    18.75 us/run -    24576 kB/run - 1250.15 GB/s
  ADD(type=f32,ne=[512,3072,1,1],nr=[1,1,1,1]):                80124 runs -    12.50 us/run -    18432 kB/run - 1407.27 GB/s
  ADD(type=f32,ne=[3072,128,1,1],nr=[1,1,1,1]):               218460 runs -     4.59 us/run -     4608 kB/run -  956.98 GB/s
  ADD(type=f32,ne=[3072,1,1,1],nr=[1,128,1,1]):               174768 runs -     5.80 us/run -     4608 kB/run -  757.36 GB/s
  ADD(type=f32,ne=[8192,128,1,1],nr=[1,1,1,1]):               111971 runs -     9.01 us/run -    12288 kB/run - 1300.86 GB/s

CUDA, for comparison:
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                 548730 runs -     1.84 us/run -       48 kB/run -   24.93 GB/s
  ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                56006 runs -    17.96 us/run -    24576 kB/run - 1305.45 GB/s
  ADD(type=f32,ne=[512,3072,1,1],nr=[1,1,1,1]):                69198 runs -    14.56 us/run -    18432 kB/run - 1207.90 GB/s
  ADD(type=f32,ne=[3072,128,1,1],nr=[1,1,1,1]):               174768 runs -     5.85 us/run -     4608 kB/run -  751.17 GB/s
  ADD(type=f32,ne=[3072,1,1,1],nr=[1,128,1,1]):               218460 runs -     4.65 us/run -     4608 kB/run -  946.08 GB/s
  ADD(type=f32,ne=[8192,128,1,1],nr=[1,1,1,1]):                92854 runs -    10.86 us/run -    12288 kB/run - 1079.51 GB/s

jeffbolznv requested a review from 0cc4m, November 12, 2024 18:08
github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) labels, Nov 12, 2024
@ExtReMLapin
Contributor

Thank you, Jeff, for your many contributions to the Vulkan backend.

Curiosity question: if we stick to NVIDIA, is there any point in using Vulkan over CUDA? Some of your PRs seem to benefit NVIDIA devices the most, but if we stick to NVIDIA, why not just use CUDA?

Is it for embedded systems where CUDA may not be available? The Nintendo Switch, for example?

@jeffbolznv
Collaborator Author

Hi,

CUDA and Vulkan can both be good for machine learning on NVIDIA GPUs. There may be other factors that influence which API to use: for example, if you want to do ML from an application that's already using Vulkan, or one that needs to support different hardware, the Vulkan backend may be more natural. However, CUDA still gives the best performance in a lot of cases.

@MaggotHATE
Contributor

MaggotHATE commented Nov 13, 2024

but if we stick to NVIDIA, why not just use CUDA?

Unless it was fixed by NVIDIA, the CUDA Toolkit on Windows assumes you use MSVC, and compiling with GCC (MinGW, w64devkit) is impossible. Since CLBlast was removed, Vulkan is the only option for GPU offloading in such environments.

As for the PR itself, I see improvements in both prompt processing and inference on a GTX 1060 3GB (22b model, 3 layers offloaded): from ~13.55 t/s to 15.09 t/s in prompt processing, and from 2.10 t/s to 2.15 t/s in token generation.

@0cc4m (Collaborator) left a comment

Looks good, I tested it and saw no issues on my devices. Can you resolve the merge conflict?

@jeffbolznv
Collaborator Author

Thanks, rebased/resolved.

0cc4m merged commit af148c9 into ggerganov:master, Nov 14, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request, Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request, Nov 17, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request, Nov 18, 2024