Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy #10642

Merged
merged 1 commit into from
Dec 4, 2024

Conversation

jeffbolznv
Copy link
Collaborator

Integer division is relatively expensive on GPUs. Division by a constant can be done with a mul+shift of some precomputed values (See https://gmplib.org/~tege/divcnst-pldi94.pdf figure 4.1).

This change uses the fast divide for the coordinate calculations in generic_unary_head.comp, primarily intended to speed up noncontiguous copy shaders. Copies are still relatively expensive in some models, and I can see a 1-2% speedup in some cases (with coopmat) with this change. I also added a couple test-backend-ops perf tests that benefit (measured on RTX 4070):

before:
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                   6669 runs -   160.82 us/run -    65536 kB/run -  389.02 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                  23222 runs -    43.37 us/run -    24576 kB/run -  540.66 GB/s
after:
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                   7182 runs -   148.61 us/run -    65536 kB/run -  420.98 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                  42346 runs -    24.18 us/run -    24576 kB/run -  969.74 GB/s
cuda:
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                   5130 runs -   195.87 us/run -    65536 kB/run -  319.41 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                  15026 runs -    70.60 us/run -    24576 kB/run -  332.12 GB/s

@jeffbolznv jeffbolznv requested a review from 0cc4m December 3, 2024 22:22
@github-actions github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Dec 3, 2024
Copy link
Collaborator

@0cc4m 0cc4m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a benefit as big as you did, but it's a positive change on all my devices. Just difficult to read.

@0cc4m 0cc4m merged commit 2759916 into ggerganov:master Dec 4, 2024
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning testing Everything test related Vulkan Issues specific to the Vulkan backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants