vulkan: implement initial support for IQ2 and IQ3 quantizations #11360
Thank you, very cool! This will take a bit of time to review, I'll take a look this weekend. Can you fix the conflict? The mmq_wg_denoms fix has to be applied here too: #11343
Exciting to see this. I've done a quick check with coopmat2 and there are a few failures:
Not seeing failures with coopmat1 or no coopmat. I'll try to debug these later. I haven't looked at the code yet.
I was surprised it was only the MUL_MAT_ID tests failing, but it was due to a gap in test coverage, which #11375 will fix. MUL_MAT also fails for coopmat2 with the same types.
I went ahead and did the straightforward unoptimized "port" of the failing dequant callbacks from mul_mm.comp - just divide the index by 2 (because mul_mm does pairs at a time) and replace data_a with the block reference. Code is at 078ebe5. Feel free to pull this in however you want. IMO it would be OK to have these be unoptimized at first and you or I can optimize them later. I haven't done any perf testing yet.
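The shape of that port is roughly the following (a hedged sketch, not the actual callback code in 078ebe5: `dequant_pair` stands in for the existing per-pair decode in mul_mm.comp, and block_iq2_xxs is just one of the affected types):

```glsl
// Hypothetical scalar callback reusing the pairwise logic from mul_mm.comp:
// halve the index to find the pair, run the same block math, then select
// one element of the pair.
float16_t dequant_scalar(const in block_iq2_xxs bl, const uint iqs) {
    const uint idx = iqs / 2;                 // mul_mm.comp decodes pairs at idx
    const f16vec2 v = dequant_pair(bl, idx);  // same table lookup as mul_mm.comp
    return (iqs & 1) != 0 ? v.y : v.x;        // low or high element of the pair
}
```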
Force-pushed from a56c535 to ecfabd7.
I didn't review the actual dequantization logic, but I reviewed the rest. And I still need to do some perf testing.
```glsl
{
    // copy the table into shared memory and sync
    if (gl_LocalInvocationIndex < 32) {
        for (uint i = gl_LocalInvocationIndex; i < 512; i += 32) {
```
loop bound mismatches the array size
Also, the increment could use gl_WorkGroupSize.x instead of hardcoding 32, but it probably won't affect performance much in practice.
The loop looks better using the workgroup size and is less error-prone; I made the fix.
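A minimal sketch of the fixed loop (hedged; `iq2s_grid_const` is an illustrative name for the constant table the shared copy mirrors): tying the bound to the array length and the stride to the workgroup size keeps both in sync with the declarations.

```glsl
// Every invocation participates; the bound and stride come from the
// declarations rather than the hardcoded 512 / 32 above.
void init_iq_shmem() {
    for (uint i = gl_LocalInvocationIndex; i < iq2s_grid.length(); i += gl_WorkGroupSize.x) {
        iq2s_grid[i] = iq2s_grid_const[i];
    }
    barrier();
}
```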
```glsl
    uvec2(0x082b082b, 0x2b2b2b2b), uvec2(0x082b2b08, 0x2b2b2b2b), uvec2(0x2b082b08, 0x2b2b2b2b), uvec2(0x2b2b2b2b, 0x2b2b2b2b)
};

shared uvec2 iq2s_grid[1024];
```
We should probably account for these array sizes in ggml_vk_matmul_shmem_support. It's a bug that we didn't for iq4_nl, but these are significantly larger.
```glsl
    0x3e1c1c1c, 0x3e1c3404, 0x3e24140c, 0x3e24240c, 0x3e2c0404, 0x3e2c0414, 0x3e2c1424, 0x3e341c04,
};

shared uint32_t iq3xxs_grid[512];
```
array sizes don't match
Fixed
I've been working on optimizing the cm2 dequant callbacks and am getting good speedups. I'm out of time to finish it tonight, but I'll share it in the morning.
Here are the cm2 optimizations: jeffbolznv@9079f06. RTX 4070 results (including Q4_K for reference):
Force-pushed from 890a125 to 6ed3047.
Branch updated:
Looks good.
Do you want me to do this separately after you merge? Either way is OK with me.
The tricky part is that we don't currently track which sizes are supported per-type, and now the shared memory usage depends on the type (well, it did previously for iq4_nl, but we got away with it). Maybe the easiest thing to do is just make …
I don't see any issues on AMD, Intel or Nvidia in my tests. Performance isn't that good yet, but that can be improved separately.
Well, you beat me to this 😉 As @0cc4m mentioned, I think speed is not really a priority here; the important part is to have a functional implementation we can improve on in the future. This will probably need dedicated mat vec shaders like the K-quants to get full performance. Please see my review, but aside from that your code looks fine (I didn't verify the actual dequantization algorithm though) and runs fine on my AMD GCN cards. If you want to be sure that your implementation is correct, it's worth running a perplexity check against a different backend to see if the numbers match up.
The bug regarding gl_WorkGroupSize seems to be this issue: KhronosGroup/glslang#2479 (more precisely KhronosGroup/glslang#2627). The workaround is to include …
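For context, this is roughly the declaration pattern the linked issues concern (a hedged sketch; the precise trigger conditions are in the issues, and the shader below is illustrative, not the actual workaround):

```glsl
#version 450

// With the workgroup size supplied via specialization constants like this,
// affected glslang versions could fold gl_WorkGroupSize to its default value
// instead of treating it as a specialization constant (see the issues above).
layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;

shared uint scratch[64];

void main() {
    // Strided loops like the shared-memory init copy rely on this value
    // being the real (specialized) workgroup size.
    for (uint i = gl_LocalInvocationIndex; i < 64; i += gl_WorkGroupSize.x) {
        scratch[i] = i;
    }
    barrier();
}
```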
Force-pushed from 3f7aa9d to edd2546.
I'm worried that somebody will accidentally break this in the future and it'll be very confusing. Another option might be to pass the workgroup size as a function parameter.
I think a function parameter is a good idea, yeah.
Looking forward to IQ4_XS support as well.
Force-pushed from edd2546 to 0886297.
Indeed, it looks better with a function parameter; the branch is updated. @sorasoras feel free to test branch remyoudompheng@e955cbed
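A minimal sketch of the function-parameter version (hedged; the table names are illustrative):

```glsl
// The helper takes the workgroup size explicitly instead of reading
// gl_WorkGroupSize internally, sidestepping the glslang issue above and
// making the dependency obvious at the call site.
void init_iq_shmem(uvec3 wgsize) {
    for (uint i = gl_LocalInvocationIndex; i < iq2s_grid.length(); i += wgsize.x) {
        iq2s_grid[i] = iq2s_grid_const[i];
    }
    barrier();
}

// Call sites pass the built-in:
//     init_iq_shmem(gl_WorkGroupSize);
```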
it looks ok
The test failure seems unrelated to patch correctness (llvmpipe is too slow for test-backend-ops to complete following the increase in test cases).
Is it possible to increase the timeout? Or what do we need to do to unblock this change?
We should probably make the change for iq4_nl as well so things are consistent.
Yeah, you can change it here, and if it's still faster than the CUDA and HIP builds I don't think anyone will mind: llama.cpp/.github/workflows/build.yml, line 349 (at a4417dd).
Force-pushed from 0886297 to 5be217a.
So I tried out the new GitHub ARM CI machine for fun, and that one actually finishes the llvmpipe test stage for master in only 7 minutes. The problem is that it has a subgroup size of 4 with 128-bit-wide SIMD, and that's causing new test failures.
Can you clarify what you mean by 128-bit-wide SIMD, and whether these new test failures are specific to this PR? IMO this change is complete other than accounting for the additional shared memory usage. I think it would be OK to merge even with that bug and then I or @remyoudompheng could fix it after.
I think that's about llvmpipe using SIMD width as subgroup size. For AVX2 it was 8, on ARM it is apparently 4. But that is unrelated to this PR.
Yeah, I think it's okay to merge. We can clean up the shared memory checks in another PR. I think the checks got messed up in some refactor anyway and do some redundant work; your original version of the check and my later implementation currently coexist.
Yep that's what I meant. I think we're fine to merge.
Thank you for your contribution, @remyoudompheng!
FYI I've started implementing the shared memory accounting logic.
I think some typos have slipped in the …
This pull request implements basic support for IQ2 and IQ3 quantizations in the Vulkan backend, with tentatively acceptable performance (there is probably room for improvement). Unfortunately I do not have access to coopmat2 hardware, so there might be typos in the proposed implementation.
A commit modifies the Q3_K implementation to optimize performance, but it may be unwelcome in this PR. The existing init_iq4nl_shmem function has been renamed to a more generic name in order to simplify ifdef logic. Tests were performed on a Radeon 780M iGPU with Mesa 24.3.3 using the default compiler (ACO, not LLVM). It supports KHR_coopmat.
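The ifdef simplification mentioned above ends up looking roughly like this (a hedged sketch; NEEDS_INIT_IQ_SHMEM is assumed as the generic per-type guard):

```glsl
// Each quant type that needs a shared lookup table defines the guard in its
// dequant header, so every shader entry point needs only one generic ifdef
// instead of one per quant type.
#ifdef NEEDS_INIT_IQ_SHMEM
    init_iq_shmem(gl_WorkGroupSize);
#endif
```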
Performance results (numbers from test-backend-ops):

Before Q3_K change:

After Q3_K change: