Use different bit arrangement for quants (nibbles) #1241
Comments
This could be done without breaking Q4_0/Q4_1 file compatibility, right? Just ensure you're doing Q8 in the right order. Edit: Actually I'm not sure what you refer to. The AVX implementation of
I realize it will be a complete mess to make all models incompatible, but at the same time I think getting the best performance is always the highest priority.
👍 agreed that vzip should be unnecessary.
In the existing `llama.cpp` implementation, quantization bits of consecutive model weights are packed together one after the other. E.g., for 4-bit quantization, the 8 bits of two consecutive weights are stored into a `uint8_t`. The disadvantage of this approach is that when the data is to be used in dot products or is being de-quantized for matrix multiplications done via BLAS, and the operations are performed using SIMD instructions, one needs to shuffle the de-quantized bytes to get them into the correct order. These shuffle operations can be avoided by arranging the bits differently. For instance, for 4-bit quantization in blocks of 32 weights (`Q4_0`), one can store the quants of the first 16 weights into the low 4 bits of the 16 `uint8_t`'s, and the quants of the second 16 weights in the block of 32 into the high 4 bits. The same or similar strategy can also be applied for other block sizes or when using 2 bits per weight.

The performance gain is not earth-shattering: in a synthetic benchmark performing `Q4_0_Q8_0` dot products I measured about a 10% speedup from avoiding the shuffle. Still, it is a trivial change, so why leave this low-hanging fruit hanging?
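To make the two nibble arrangements concrete, here is a minimal scalar sketch in C. It is not the `llama.cpp` code: the function names (`pack_interleaved`, `pack_split`, and their `unpack_*` counterparts) and the `QK` constant are hypothetical, chosen only for illustration, and the per-block scale factor is left out. It assumes each quant has already been reduced to a value in 0..15.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* weights per block, as for Q4_0 (illustrative constant) */

/* Existing arrangement: byte i holds weight 2*i in its low nibble
   and weight 2*i + 1 in its high nibble. */
static void pack_interleaved(const uint8_t q[QK], uint8_t out[QK / 2]) {
    for (int i = 0; i < QK / 2; ++i) {
        assert(q[2 * i] < 16 && q[2 * i + 1] < 16);
        out[i] = (uint8_t)(q[2 * i] | (q[2 * i + 1] << 4));
    }
}

/* Proposed arrangement: byte i holds weight i in its low nibble
   and weight i + 16 in its high nibble. */
static void pack_split(const uint8_t q[QK], uint8_t out[QK / 2]) {
    for (int i = 0; i < QK / 2; ++i) {
        assert(q[i] < 16 && q[i + QK / 2] < 16);
        out[i] = (uint8_t)(q[i] | (q[i + QK / 2] << 4));
    }
}

/* Unpacking the existing arrangement: low and high nibbles land at
   alternating positions. In a SIMD version this interleave is the shuffle. */
static void unpack_interleaved(const uint8_t in[QK / 2], uint8_t q[QK]) {
    for (int i = 0; i < QK / 2; ++i) {
        q[2 * i]     = in[i] & 0x0F;
        q[2 * i + 1] = in[i] >> 4;
    }
}

/* Unpacking the proposed arrangement: all low nibbles form weights 0..15
   and all high nibbles form weights 16..31, each already contiguous. */
static void unpack_split(const uint8_t in[QK / 2], uint8_t q[QK]) {
    for (int i = 0; i < QK / 2; ++i) {
        q[i]          = in[i] & 0x0F;
        q[i + QK / 2] = in[i] >> 4;
    }
}
```

With the split arrangement, a 16-byte vector load followed by one mask and one shift yields two vectors that are already in weight order, which is why the interleaving step drops out. Both arrangements use the same 16 bytes per block; only the mapping from weight index to nibble changes, so files written with one arrangement cannot be read with the other without repacking.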