vulkan: Implement "fast divide" (mul+shift) for unary ops like copy #10642

jeffbolznv · 2024-12-03T22:21:50Z

Integer division is relatively expensive on GPUs. Division by a constant can instead be done with a multiply and a shift using a few precomputed values (see https://gmplib.org/~tege/divcnst-pldi94.pdf, Figure 4.1).
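To make the mul+shift idea concrete, here is a minimal CPU-side sketch in C of unsigned 32-bit division by a run-time invariant divisor, following the scheme from the paper linked above. The names `fastdiv_t`, `fastdiv_init`, and `fastdiv` are hypothetical and chosen for illustration only; this is not the shader code from this PR, which may use a simplified variant.

```c
#include <stdint.h>

/* Hypothetical container for the precomputed values (illustration only). */
typedef struct {
    uint32_t mp;   /* magic multiplier m' */
    uint32_t sh1;  /* first shift:  min(l, 1)     */
    uint32_t sh2;  /* second shift: max(l - 1, 0) */
} fastdiv_t;

/* Precompute the multiplier and shifts for a divisor d >= 1. */
static fastdiv_t fastdiv_init(uint32_t d) {
    /* l = ceil(log2(d)) */
    uint32_t l = 0;
    while (l < 32 && ((uint64_t)1 << l) < d) {
        l++;
    }
    fastdiv_t f;
    /* m' = floor(2^32 * (2^l - d) / d) + 1; the product fits in 64 bits. */
    f.mp  = (uint32_t)((((uint64_t)1 << 32) * (((uint64_t)1 << l) - d)) / d + 1);
    f.sh1 = l < 1 ? l : 1;
    f.sh2 = l > 0 ? l - 1 : 0;
    return f;
}

/* Compute n / d using one widening multiply, an add, a subtract, and two shifts. */
static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    /* High 32 bits of the 64-bit product m' * n. */
    uint32_t t1 = (uint32_t)(((uint64_t)f.mp * n) >> 32);
    /* t1 <= n, so neither the subtraction nor the addition can wrap. */
    return (t1 + ((n - t1) >> f.sh1)) >> f.sh2;
}
```

On a GPU the high half of the product would typically come from a wide-multiply intrinsic (for example GLSL's `umulExtended`), with the precomputed values supplied by the host; that wiring is an assumption here and is not shown.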

This change uses the fast divide for the coordinate calculations in generic_unary_head.comp and is primarily intended to speed up noncontiguous copy shaders. Copies are still relatively expensive in some models, and I see a 1-2% speedup in some cases (with coopmat) with this change. I also added a couple of test-backend-ops perf tests that benefit (measured on RTX 4070):

before:
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                   6669 runs -   160.82 us/run -    65536 kB/run -  389.02 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                  23222 runs -    43.37 us/run -    24576 kB/run -  540.66 GB/s
after:
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                   7182 runs -   148.61 us/run -    65536 kB/run -  420.98 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                  42346 runs -    24.18 us/run -    24576 kB/run -  969.74 GB/s
cuda:
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                   5130 runs -   195.87 us/run -    65536 kB/run -  319.41 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                  15026 runs -    70.60 us/run -    24576 kB/run -  332.12 GB/s
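For context on where the divisions in the coordinate calculations come from, the following C sketch splits a flat element index into four tensor coordinates, assuming the ggml convention that `ne[0]` is the fastest-varying dimension. The `coord4_t` type and `unflatten` function are hypothetical and use plain `/` for clarity; each of the three divisions has a divisor that is fixed for the whole tensor, which is the kind of division a mul+shift can replace.

```c
#include <stdint.h>

/* Hypothetical 4D coordinate, i0 being the fastest-varying index. */
typedef struct { uint32_t i0, i1, i2, i3; } coord4_t;

/* Split a flat index into (i0, i1, i2, i3) for a tensor of shape ne[0..3]. */
static coord4_t unflatten(uint32_t idx, const uint32_t ne[4]) {
    const uint32_t ne01  = ne[0] * ne[1];   /* elements per i2 slice */
    const uint32_t ne012 = ne01 * ne[2];    /* elements per i3 slice */
    coord4_t c;
    c.i3 = idx / ne012;        /* divisor is constant for the tensor */
    idx -= c.i3 * ne012;
    c.i2 = idx / ne01;         /* divisor is constant for the tensor */
    idx -= c.i2 * ne01;
    c.i1 = idx / ne[0];        /* divisor is constant for the tensor */
    c.i0 = idx - c.i1 * ne[0]; /* remainder without a second division */
    return c;
}
```

A noncontiguous copy needs such a decomposition per element to apply the source and destination strides, so the cost of each division is paid for every element copied.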

0cc4m

I don't see a benefit as big as you did, but it's a positive change on all my devices. Just difficult to read.

jeffbolznv requested a review from 0cc4m December 3, 2024 22:22

github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Dec 3, 2024

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy

fc3eb4c

jeffbolznv force-pushed the fastdiv branch from 654b69a to fc3eb4c Compare December 3, 2024 22:24

0cc4m approved these changes Dec 4, 2024

0cc4m merged commit 2759916 into ggerganov:master Dec 4, 2024
44 checks passed

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (g…

a037a76

…gerganov#10642)

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (g…

3c1539a

…gerganov#10642)
