vulkan: optimize add/mul/div #10270
Thank you, Jeff, for your many contributions to the Vulkan backend. A question out of curiosity: if we stick to NVIDIA, is there any point in using Vulkan over CUDA? Some of your PRs seem to benefit NVIDIA devices more, but if we stick to NVIDIA, why not just use CUDA? Is it for embedded systems where CUDA may not be available, for example the Nintendo Switch?
Hi, CUDA and Vulkan can both be good for machine learning on NVIDIA GPUs. Other factors may influence which API to use. For example, if you want to do ML from an application that already uses Vulkan or that needs to support different hardware, it may be more natural to use the Vulkan backend. However, CUDA still gives you the best performance in a lot of cases.
Unless NVIDIA has fixed it, the CUDA Toolkit on Windows assumes you use MSVS, so compiling with GCC (MinGW, w64devkit) is impossible. Since CLBlast was removed, Vulkan is the only option for GPU offloading in such environments. As for the PR itself, I see improvements in prompt processing and inference on a GTX 1060 3GB with a 22B model and 3 layers offloaded: from ~13.55 t/s to 15.09 t/s in prompt processing, and from 2.10 t/s to 2.15 t/s in token generation.
Looks good, I tested it and saw no issues on my devices. Can you resolve the merge conflict?
Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 have the same dimensions and the additional modulus operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod variants that take a fast path when the calculation isn't needed or can be done more cheaply.
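To make the idea concrete, here is a minimal CPU-side sketch in C++ rather than the actual shader code from this PR. The names `fast_mod`, `fast_div`, `Shape`, and `src1_index` are hypothetical, and the sketch assumes contiguous 4D tensors, nonzero dimensions, and that power-of-two divisors and "quotient is zero" are the cheap cases worth special-casing. The destination coordinates are decomposed once and reused for the broadcast operand, and a same-shape path skips the modulus work entirely.

```cpp
#include <cstdint>

// Hypothetical helper: modulus with a cheap path for power-of-two divisors.
// Assumes b > 0 (b == 1 also takes the mask path and yields 0).
constexpr uint32_t fast_mod(uint32_t a, uint32_t b) {
    if ((b & (b - 1)) == 0) {
        return a & (b - 1);          // mask instead of an integer modulus
    }
    return a % b;
}

// Hypothetical helper: skip the divide when the quotient is known to be 0.
constexpr uint32_t fast_div(uint32_t a, uint32_t b) {
    return (a < b) ? 0 : a / b;
}

// Elements per dimension; a contiguous layout is assumed for this sketch.
struct Shape { uint32_t ne[4]; };

// Map a flat dst index to the flat index of a broadcast src1 operand.
constexpr uint32_t src1_index(uint32_t idx, const Shape& dst,
                              const Shape& src1, bool same_shape) {
    if (same_shape) {
        return idx;                  // variant without any src1 modulus
    }
    // Decompose idx into dst coordinates once; other operands can reuse them.
    const uint32_t n01  = dst.ne[0] * dst.ne[1];
    const uint32_t n012 = n01 * dst.ne[2];
    const uint32_t i3 = fast_div(idx, n012);
    uint32_t rem = idx - i3 * n012;
    const uint32_t i2 = fast_div(rem, n01);
    rem -= i2 * n01;
    const uint32_t i1 = fast_div(rem, dst.ne[0]);
    const uint32_t i0 = rem - i1 * dst.ne[0];
    // Broadcast: wrap each coordinate into src1's extent.
    const uint32_t j0 = fast_mod(i0, src1.ne[0]);
    const uint32_t j1 = fast_mod(i1, src1.ne[1]);
    const uint32_t j2 = fast_mod(i2, src1.ne[2]);
    const uint32_t j3 = fast_mod(i3, src1.ne[3]);
    return ((j3 * src1.ne[2] + j2) * src1.ne[1] + j1) * src1.ne[0] + j0;
}

// Compile-time checks: both helpers agree with plain % and /.
static_assert(fast_mod(13, 8) == 13 % 8, "power-of-two path");
static_assert(fast_mod(13, 6) == 13 % 6, "general path");
static_assert(fast_div(3, 7) == 3 / 7, "zero-quotient path");
static_assert(fast_div(21, 7) == 21 / 7, "general path");
```

For instance, adding a 4-element row to a 4x3 matrix would call `src1_index(idx, Shape{{4, 3, 1, 1}}, Shape{{4, 1, 1, 1}}, false)`, so every row of the destination reads the same four src1 elements. Whether these shortcuts pay off depends on the hardware and on how often the cheap cases occur.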
Thanks, rebased/resolved.
Perf data from RTX 4070: