vulkan: Optimize some mat-vec mul quant shaders #10296
Conversation
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. I'd also like to be able to disable robustness for some of these pipelines in the future, to get a bit more perf.

Optimize the Q4_K shader to vectorize most loads and reduce the number of bit-twiddling instructions. It should be possible to do something similar for the other Qi_K shaders. I can maybe do this, but I'm happy for somebody else to do it.

Perf results below. Still slower than CUDA (which is using dp4a), but a nice boost. Definitely worth testing on some other hardware, too.

Split out from #10206, but the code is pretty different.
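To make the first point concrete, here is a minimal sketch of the multi-row pattern. It is illustrative only, not the actual shader: it assumes plain float buffers instead of quant blocks, a fixed workgroup size, and hypothetical binding/push-constant names (`data_a`, `data_b`, `data_d`, `ncols` loosely follow the shader's conventions). The real shaders also process two values per thread per iteration and only apply the bounds check on the final iteration, which this simplified loop folds into its condition.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

// One workgroup computes NUM_ROWS consecutive output elements, so every value
// loaded from B is reused NUM_ROWS times and the addressing math is shared.
layout (local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

const uint BLOCK_SIZE = 32;   // workgroup size (a spec constant in the real shader)
const uint NUM_ROWS   = 2;    // rows per workgroup (also a spec constant there)

layout (binding = 0, std430) readonly  buffer A { float data_a[]; }; // row-major, pre-dequantized here
layout (binding = 1, std430) readonly  buffer B { float data_b[]; };
layout (binding = 2, std430) writeonly buffer D { float data_d[]; };
layout (push_constant) uniform PC { uint ncols; } pc;                // hypothetical push constant

shared float tmpsh[NUM_ROWS][BLOCK_SIZE];

void main() {
    const uint first_row = gl_WorkGroupID.x * NUM_ROWS;
    const uint tid       = gl_LocalInvocationID.x;

    float temp[NUM_ROWS];
    [[unroll]] for (uint n = 0; n < NUM_ROWS; ++n) temp[n] = 0.0;

    // Strided loop over the columns. The bounds check only matters on the last pass,
    // when ncols is not a multiple of BLOCK_SIZE.
    for (uint col = tid; col < pc.ncols; col += BLOCK_SIZE) {
        const float b = data_b[col];                      // loaded once per column...
        [[unroll]] for (uint n = 0; n < NUM_ROWS; ++n) {  // ...reused for every row
            temp[n] += data_a[(first_row + n) * pc.ncols + col] * b;
        }
    }

    // Shared-memory tree reduction of the per-thread partial sums.
    [[unroll]] for (uint n = 0; n < NUM_ROWS; ++n) tmpsh[n][tid] = temp[n];
    barrier();
    [[unroll]] for (uint s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) {
            [[unroll]] for (uint n = 0; n < NUM_ROWS; ++n) tmpsh[n][tid] += tmpsh[n][tid + s];
        }
        barrier();
    }
    if (tid == 0) {
        [[unroll]] for (uint n = 0; n < NUM_ROWS; ++n) data_d[first_row + n] = tmpsh[n][0];
    }
}
```

Each A element is still read exactly once, but the B reads and the per-column address math are now amortized over NUM_ROWS rows, which is where the win comes from.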
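And a rough sketch of what "vectorize loads and reduce bit-twiddling" can mean in practice, again purely illustrative rather than the actual Q4_K code (the buffer names and packing are made up): one 32-bit load yields eight 4-bit quants, which can then be separated with two vector shift/mask operations instead of eight scalar extractions.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

layout (local_size_x = 64) in;
layout (binding = 0, std430) readonly  buffer Qs  { uint  packed_qs[]; }; // hypothetical packing
layout (binding = 1, std430) writeonly buffer Out { float out_vals[]; };

void main() {
    const uint i = gl_GlobalInvocationID.x;
    const uint w = packed_qs[i];                                 // one 32-bit load = 8 nibbles

    // Low and high nibble of each byte, extracted as two 4-wide shift/mask ops.
    const uvec4 lo = (uvec4(w) >> uvec4(0, 8, 16, 24)) & uvec4(0xF);
    const uvec4 hi = (uvec4(w) >> uvec4(4, 12, 20, 28)) & uvec4(0xF);

    [[unroll]] for (uint k = 0; k < 4; ++k) {
        out_vals[8*i + k]     = float(lo[k]);
        out_vals[8*i + 4 + k] = float(hi[k]);
    }
}
```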
Thank you, this is quite impressive! I tested these models before and after on an Nvidia RTX 3090, an Nvidia Tesla P40, an AMD Radeon Pro VII, and an AMD Radeon RX 6800 XT.
This is 50% faster on Q4_0 with a RX 570 compared to master, very nice!
We also didn't break Intel yet. Intel ARC A770: it got a little faster in most cases, but Intel is still quirky. Besides low prompt processing because ANV doesn't like my mul_mm shader, it seems to have an issue with q5_0 matrix-vector multiplication too. I added q4_0 to make sure the issue is not the model. Maybe you have an idea what's going on. An alignment issue with the quant struct? I've had those in the past. Here are most of the quants on a 1.1B model on the Intel A770. Something's definitely wrong with Q5_0 on ARC.
Edit: But this is unrelated to this PR.
Just to make sure I understand: the issue with Q5_0 on Intel is not a functional issue, but a preexisting performance issue? I don't have any experience with those GPUs. I can't think of a reason Q5_0 would be so much worse than Q5_1.
Yeah, this is a preexisting performance issue. I've found many quirks with Intel GPUs over time while trying to make them work, and I've only gotten them to a usable state, not to the performance they should be capable of. I don't really know how to figure out what the cause of this is; I suspect the driver is not mature/optimized enough.
So I went ahead and tried out subgroup adds with these changes, and this time it has a negligible effect compared to #10206 (at least that's the case on the RX 570; I don't have the W8100 with me currently). I'm only seeing a 1% improvement with the code below.

```diff
--- a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec.comp
@@ -6,6 +6,7 @@
 #extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
 #extension GL_EXT_null_initializer : enable
+#extension GL_KHR_shader_subgroup_arithmetic : enable
 
 #include "mul_mat_vec_base.comp"
 
@@ -16,8 +17,6 @@ layout (constant_id = 1) const uint NUM_ROWS = 1;
 
 uint a_offset, b_offset, d_offset, y_offset;
 
-shared FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
-
 void iter(inout FLOAT_TYPE temp[NUM_ROWS], const uint first_row, const uint num_rows, const uint tid, const uint i, bool lastiter)
 {
     const uint col = i*BLOCK_SIZE + 2*tid;
@@ -79,21 +78,11 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
 
     // sum up partial sums and write back result
     [[unroll]] for (uint n = 0; n < num_rows; ++n) {
-        tmpsh[n][tid] = temp[n];
-    }
-    barrier();
-    [[unroll]] for (uint s = BLOCK_SIZE/2; s > 0; s >>= 1) {
-        if (tid < s) {
-            [[unroll]] for (uint n = 0; n < num_rows; ++n) {
-                tmpsh[n][tid] += tmpsh[n][tid + s];
-            }
-        }
-        barrier();
+        temp[n] = subgroupAdd(temp[n]);
     }
-    if (tid == 0) {
-        [[unroll]] for (uint n = 0; n < num_rows; ++n) {
-            data_d[d_offset + first_row + n] = D_TYPE(tmpsh[n][0]);
-        }
+
+    if (tid < num_rows) {
+        data_d[d_offset + first_row + tid] = D_TYPE(temp[tid]);
     }
 }
```
Out of curiosity I tried @netrunnereve's subgroupAdd change and it gives correct results on Nvidia and AMD, but not on Intel. So the subgroup operations issue still exists on ANV.
On the other hand, I'm able to get noticeably faster results on the 570 by adjusting the number of rows per workgroup. I wonder if it's worth making these things tweakable at some point, like how we have K_QUANTS_PER_ITERATION.
The numbers start going downhill once I go past 16 rows on Q4_0. Q8_0 also has some nice improvements, as seen below.
Since it's a spec constant, we could select the value in ggml_vk_load_shaders based on the GPU. I'm working on another change that does 8 consecutive K values per iteration and is able to use larger loads as a result. It would be good to test the number of rows again with that change. Maybe it's getting some of the same gains, or maybe the gains will stack.
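A sketch of how per-GPU selection might look from the shader side, purely illustrative: NUM_ROWS stays a specialization constant (as it already is in mul_mat_vec.comp), and the host code in ggml_vk_load_shaders would simply supply a different value through VkSpecializationInfo at pipeline-creation time. The push-constant name and the stubbed compute_outputs below are hypothetical.

```glsl
#version 450

layout (local_size_x = 32) in;

// Default used when the host does not specialize; the host can override this
// per GPU via VkSpecializationInfo without recompiling the shader source.
layout (constant_id = 1) const uint NUM_ROWS = 1;

layout (push_constant) uniform PC { uint nrows_total; } pc;   // hypothetical name

void compute_outputs(const uint first_row, const uint num_rows) {
    // accumulate + reduce + store for num_rows rows, as in the shaders above (omitted)
}

void main() {
    const uint first_row = NUM_ROWS * gl_WorkGroupID.x;
    if (first_row >= pc.nrows_total) {
        return;                                               // whole workgroup past the end
    }
    // The tail workgroup may have fewer than NUM_ROWS rows left to compute.
    compute_outputs(first_row, min(NUM_ROWS, pc.nrows_total - first_row));
}
```

Since specialization happens at pipeline-creation time, the compiler can still constant-fold and fully unroll loops over NUM_ROWS, so tuning it per device costs nothing at shader runtime.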