vulkan: small mul_mat_vec optimizations #10665
Conversation
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.
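The split_k idea can be sketched on the CPU: each of the K-dimension splits produces a buffer of partial sums, and a reduce pass adds them into the final result. The following C++ analog is purely illustrative (the names and layout are hypothetical, not ggml's actual code); processing four elements per iteration mirrors the shader's vec4 loads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU analog of a split_k reduce pass: `partials` holds one
// vector of partial results per K-split; the reduce step sums them
// element-wise. The 4-wide main loop mirrors the shader's vector loads.
std::vector<float> split_k_reduce(const std::vector<std::vector<float>>& partials) {
    const size_t n = partials[0].size();
    std::vector<float> out(n, 0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {               // "vector load" path: 4 at a time
        for (const auto& p : partials) {
            out[i + 0] += p[i + 0];
            out[i + 1] += p[i + 1];
            out[i + 2] += p[i + 2];
            out[i + 3] += p[i + 3];
        }
    }
    for (; i < n; ++i) {                       // scalar tail
        for (const auto& p : partials) {
            out[i] += p[i];
        }
    }
    return out;
}
```

In the shader the same structure shows up as vec4 loads from the split-k scratch buffer instead of four scalar loads, which is where the bandwidth win comes from.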
polynomial iq4_nl test (slower but keep as reference)
I took a quick look at the CI, and since it runs a 7B model in FP16 you pretty much need the T4, as that's the cheapest Linux 16 GB option on Azure. On the other hand, you could probably save a bit on your cloud bill by switching the CUDA V100 for a T4 and the Metal M4 for an M1 or M2.
I think most of the current warnings in ggml-vulkan.cpp are my fault, so I'll fix them.

The M4 is a self-hosted Mac Mini that I bought to run the CI.

Hmm, the CI is failing since it's using llvmpipe instead of the GPU.

It should be using the correct device now. But some tests are still failing:

#10763 should fix the mat_mul failures. The im2col and rope failures are news to me.

I can reproduce the other failures on Turing, I'll debug them.

I filed #10764 to track the Turing failures.

Setting this as draft until #10721 is merged.

I'm sorry that it's taking a while. I'm still trying to figure out Intel and AMD. I think I'll merge a basic extension support version and then figure out how to deal with their specific issues.

You probably have to rebase this PR so that it only contains your commits, once subgroup size control is merged. I think the conflict is from my branch, which I already rebased and fixed.
LGTM. I had previously tested on my system and perf was fine, so I don't think I need to retest.
Setting this back to draft as this updated logic is setting

Okay, this actually works properly now and I'm getting the same performance as before. There's no harm in setting
I'm surprised this doesn't go through on your system. Can you upload your vulkaninfo output? I thought this would be true on any modern driver. Edit: I see that Mesa reports 0 (no shader stages supported) for |
Radeon Pro VII:

Intel A770:

RTX 3090:
Unsurprisingly, Intel is being its usual weird self again with NUM_ROWS=2. My 470 could probably beat it if I used a 7B model.
* double the number of rows per workgroup
* Update ggml-vulkan.cpp
* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
* only increase the number of rows for amd and subgroup size 64
* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested
* use subgroup min and max to check for gcn (requires ggerganov#10721)
* manual merge ggml-vulkan.cpp
* set min and max subgroup size in any case
* Also double the number of rows for Intel GPUs
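For readers unfamiliar with `VK_EXT_subgroup_size_control`: it lets a compute pipeline request a specific subgroup size (e.g. wave64 on GCN) and require full subgroups. A minimal, hedged C++ sketch of how the extension is typically wired up follows; this is an illustration of the extension's API, not the PR's actual code, and the chosen size of 64 is just an example. Note that a driver advertising `requiredSubgroupSizeStages == 0` (as the Mesa report mentioned above) supports a required size in no shader stage at all.

```cpp
#include <vulkan/vulkan.h>

// Sketch: chain a required-subgroup-size struct into the compute stage info.
// Requires VK_EXT_subgroup_size_control (core in Vulkan 1.3) and that
// VK_SHADER_STAGE_COMPUTE_BIT is set in the device's
// requiredSubgroupSizeStages property.
VkPipelineShaderStageCreateInfo make_compute_stage(VkShaderModule module,
                                                   uint32_t subgroup_size) {
    static VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT size_info = {};
    size_info.sType =
        VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    size_info.requiredSubgroupSize = subgroup_size;  // e.g. 64 to force wave64 on GCN

    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext = &size_info;
    stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
    // Guarantee full subgroups, as needed for cooperative-matrix paths:
    stage.flags = VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;
    stage.module = module;
    stage.pName = "main";
    return stage;
}
```

The workgroup size must then be a multiple of the required subgroup size, otherwise pipeline creation fails.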
Here's a couple of small optimizations for `mul_mat_vec` that came from studying the generated GCN assembly. The biggest improvement came from moving the delta multiplication out of the inner loop where possible (8 additional multiplications per inner loop iteration add up fast).

Using the `dot()` function still generates FMA instructions, but due to changes in the instruction ordering it now runs a little bit faster. I think some newer cards have dedicated dot product units, which may have a better chance of being utilized by the compiler if we use this function.

I tested this on my RX 470; I only got it recently and no longer have access to the 570. Interestingly, my 470 (and my W8100 as well) downclocks pretty heavily when running `mul_mat_vec` as the hardware hits its power limit: it literally drops from 1200 MHz to 750 MHz during inference. I discussed this with an ex-AMD engineer, and after confirming that the board's power circuitry could handle it, we made some VBIOS adjustments allowing the card to draw more power and run inference at around 1100 MHz. I mention this in case someone with a 470 is wondering why they can't get 27 t/s; I can explain more if people are interested. Just don't blame me if you blow your computer up 😁

Master:

PR:

By the way @0cc4m, have you tried implementing the dequantization and matrix multiplication logic in F16 for the F16 shaders? The activations and deltas are already in F16, and it might not be necessary to do all the math in F32.
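The delta-hoisting optimization described above can be illustrated in plain C++. The block layout below is a simplified Q8_0-style stand-in (one float scale plus 32 quantized values), not ggml's exact struct: instead of multiplying each product by the per-block delta `d` inside the inner loop, the quantized dot product is accumulated first and scaled once per block, saving one multiply per element.

```cpp
#include <cstdint>

// Simplified Q8_0-style block: per-block delta plus 32 quantized weights.
// Illustrative layout only; not ggml's actual block_q8_0 definition.
struct Block {
    float  d;        // per-block delta (scale)
    int8_t qs[32];   // quantized weights
};

float dot_block(const Block& b, const float* x) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        acc += b.qs[i] * x[i];   // no multiply by d inside the loop
    }
    return b.d * acc;            // delta applied once per block (hoisted)
}
```

Mathematically `d * sum(q_i * x_i)` equals `sum(d * q_i * x_i)`, so hoisting changes only rounding, not the result up to floating-point associativity; on GCN it removes a vector multiply per element from the hot loop.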