vulkan: initial support for IQ1_S and IQ1_M quantizations #11528
base: master
Conversation
Works with coopmat2 enabled! Perf is a bit low, but I'll fix it after it's merged.
```diff
@@ -217,7 +217,7 @@ void quantize(uint dst_idx, uint src_idx)
 #endif

 void main() {
-#if defined(DATA_A_IQ2_XXS) || defined(DATA_A_IQ2_XS) || defined(DATA_A_IQ2_S) || defined(DATA_A_IQ3_XXS) || defined(DATA_A_IQ3_S) || defined(DATA_A_IQ4_NL)
+#if defined(DATA_A_IQ1_S) || defined(DATA_A_IQ1_M) || defined(DATA_A_IQ2_XXS) || defined(DATA_A_IQ2_XS) || defined(DATA_A_IQ2_S) || defined(DATA_A_IQ3_XXS) || defined(DATA_A_IQ3_S) || defined(DATA_A_IQ4_XS) || defined(DATA_A_IQ4_NL)
     init_iq_shmem(gl_WorkGroupSize);
```
It would be nice to define a `NEED_INIT_IQ_SHMEM` macro in each place where `init_iq_shmem` is defined; then all of these `#if`s could be simplified.
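The suggestion could look something like the following (a hypothetical sketch, not the actual patch; the declaration site, table name, and sizes are assumptions):

```glsl
// Wherever a quant type declares its shared-memory tables and an
// init_iq_shmem() helper, also define a marker macro:
#if defined(DATA_A_IQ1_S)
shared uvec2 iq1s_grid[2048];      // name/size assumed for illustration
#define NEED_INIT_IQ_SHMEM
void init_iq_shmem(uvec3 wgsize) { /* ... fill iq1s_grid ... */ }
#endif

// Call sites then collapse from one long per-shader #if to:
#ifdef NEED_INIT_IQ_SHMEM
    init_iq_shmem(gl_WorkGroupSize);
#endif
```

This way, adding a new quant type only touches the definition site, and every call site stays unchanged.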
I added MMV kernels for the new quants. Some performance figures on Radeon 780M (~70 GB/s memory bandwidth): the LLVM/AMDGPU compiler does not like the generic code at all and behaves much better with the specialized shader.

Before MMV kernels:

After:
See branch https://github.com/remyoudompheng/llama.cpp/tree/vulkan-iq-mmv for MMV kernels for IQ2 and IQ3 quants.
This pull request implements basic support for the remaining I-quants (IQ1_S and IQ1_M).
Performance is not great but similar to IQ2 quantizations.
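A rough way to see why IQ1 performance should track the IQ2 quants: matrix-vector kernels are memory-bound, so throughput is capped by how fast the weights can be streamed. The sketch below is a back-of-envelope estimate, assuming the `block_iq1_s` layout from ggml's `ggml-common.h` (fp16 scale plus `qs`/`qh` index bytes per 256-weight superblock); the 7B model size and 70 GB/s figure are illustrative.

```python
QK_K = 256  # weights per superblock

# block_iq1_s (assumed layout): fp16 d (2 B) + qs[QK_K/8] (32 B)
#                               + uint16 qh[QK_K/32] (16 B) = 50 B
iq1_s_bytes = 2 + QK_K // 8 + 2 * (QK_K // 32)
bpw_iq1_s = iq1_s_bytes * 8 / QK_K        # bits per weight

# A mat-vec pass must read every weight once, so tokens/s is bounded by
# bandwidth / model size. Illustrative: 7B params at ~70 GB/s.
model_bytes = 7e9 * bpw_iq1_s / 8
ceiling_tok_s = 70e9 / model_bytes

print(bpw_iq1_s)                          # 1.5625
```

At these sizes the memory-traffic ceiling for IQ1_S is only slightly above that of the IQ2 quants, so similar measured throughput is expected.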
To avoid spamming shared memory, the IQ1_S grid has been compressed to 2 bits per value (4 kB shmem size).
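The arithmetic behind the 4 kB figure: the IQ1_S codebook has 2048 grid entries of 8 values each, and each value needs only three states, so 2 bits suffice. The packing below is a hypothetical illustration of that scheme (the PR's actual bit layout and value mapping may differ):

```python
def pack_entry(values):
    """Pack 8 values from {-1, 0, 1} into one 16-bit word, 2 bits each."""
    word = 0
    for i, v in enumerate(values):
        word |= (v + 1) << (2 * i)    # map -1, 0, 1 -> codes 0, 1, 2
    return word

def unpack_entry(word):
    """Recover the 8 values from a packed 16-bit word."""
    return [((word >> (2 * i)) & 3) - 1 for i in range(8)]

entry = [-1, 0, 1, 1, 0, -1, 0, 1]
assert unpack_entry(pack_entry(entry)) == entry

# Shared-memory footprint: 2048 entries * 2 bytes = 4096 B (4 kB),
# versus 16 kB at one byte per value.
print(2048 * 2)                       # 4096
```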
This pull request is a draft, waiting for #11501 and #11502 to be merged.