ggml : PoC for normalizing weights for better quantization packing #2434
base: master
Conversation
Do you use [...] on this branch with [...]?
In #2160 I am changing how 8 bit integers are loaded as 32 bit integers. The gist of it is that if the int8 array is aligned to at least 2 bytes, then it's faster to load it via two 16 bit integers than it is to use memcpy (1-2% total token generation speedup; the fastest way is still to directly cast the int8 array to an int32 pointer). This would not work with block sizes that are not aligned, but I think the VRAM savings would be worth it. At some point it may be worthwhile to look into rearranging the blocks into larger blocks that are aligned to at least 4 bytes in order to improve effective memory bandwidth.
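To make the trick concrete, here is a minimal sketch of the two loading strategies. This is illustrative only and not the code from #2160; the 16-bit variant assumes a little-endian target and at least 2-byte alignment:

```c
#include <stdint.h>
#include <string.h>

// Portable way to read 4 consecutive int8 values as one 32-bit word.
static inline int32_t load_q8x4_memcpy(const int8_t * p) {
    int32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

// Variant for pointers known to be at least 2-byte aligned:
// two aligned 16-bit loads combined into a 32-bit value (little-endian).
static inline int32_t load_q8x4_u16(const int8_t * p) {
    const uint16_t * p16 = (const uint16_t *) p;
    return (int32_t) ((uint32_t) p16[0] | ((uint32_t) p16[1] << 16));
}
```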
@ggerganov what is the intended time scale for this change?
I noticed that the [...]. I did a brief speed test on the CPU, both ARM and x64, and there was no speed difference with [...]. I don't think the [...]. I don't plan to prioritize this. Want to first finish the GGUF change, which should happen in the second part of Aug.
@JohannesGaessler Yes, I understand why [...]
@ggerganov I just pushed a fix for the misaligned memory access, so you can now run TG with [...].
Perfect! Thanks
Probably both, but I suspect the main factor is memory alignment and maybe slightly less optimal data types.
I suggest keeping old quantization formats for compatibility and user choice. Users have their own hardware constraints and preferences about model quality vs speed vs memory usage. In PR #1508 the Q4_0 format could have been kept and renamed to something like Q4_0_fp32d, with the new one named Q4_0_fp16d. The format in this PR could be named Q4_0_rownorm. And if new formats get new identifiers instead of replacing the old ones, couldn't the quantization version metadata be dropped? Since the NN weights can be distributed totally differently in different models (not only LLMs), you have to test different quantization formats to see which one works best for a given model; the more formats you can test the better. Maybe no more breaking changes to model data once GGUF is released?
While that's understandable from a user perspective, you should keep in mind that backwards compatibility is not free. Each additional quantization format increases the amount of work required to roll out new features. And I think the time of developers is much more valuable than the time of users, especially for a project that is still relatively new. I also think that an optimized version of this approach will be universally better than the current implementation.
Even if this project is relatively new, the number of users has exploded at the same rate as the general use of LLMs. A quick search for "ggml" at Hugging Face returns some 1000 model folders, and many of them, like @TheBloke's, contain the models in several quantization formats. Like @philpax commented here #1408 (comment), it is a bad idea to reuse the old quantization names, since it is confusing and frustrating for end-users trying to find compatible model files. A typical model filename could be named [...]
The number of users is not an argument. The exact same number of users will need to deal with a breaking change as would benefit from developers being more productive. In fact, if the number of users continues to grow, this only weighs more heavily on the side of making a breaking change that causes trouble once but permanently reduces the amount of work that needs to be spent on maintenance.
I don't mind if you break the format/lose compatibility, but please use a different name. As @klosax says, it is extremely confusing and frustrating for users to download a model for hours or even days, only to discover that it's not compatible with the version of GGML they're using, especially for older models. We've seen this play out several times already, and it leads to much user frustration playing out in old and new GitHub issues. Most people are not paying attention to the vagaries of llama.cpp versioning. TheBloke is doing a fantastic job of including the file format version in the models, but not everyone is, and if it's not in the name of the quantisation format, people are unlikely to know there's a difference. Call it [...]
If the project needs increasingly more maintenance for each quantization format added, maybe the layout is not optimal. A start could be to have the quantization formats separated from ggml.c, like what @ikawrakow did when adding k-quants.
I agree with @philpax that disambiguation is probably the most important thing, since correctly supporting two different formats without being able to tell them apart makes things that much more difficult. Dropping support for ancient formats due to maintenance burden, as @JohannesGaessler mentioned, is perfectly understandable; these incompatibilities should just be easy to detect and handle for future clients (so definitely do not reuse the deprecated IDs or identifiers, e.g. Q4_3). I think when evaluating our options, it is helpful to examine other software that faces similar problems, such as video codecs, which have gone through decades of evolution. You have VP2 all the way to VP9, each version extending and iterating on the previous, dropping and adding capabilities as things change, and a good video player is able to differentiate and handle all of them because they're properly versioned.
@@ -17200,7 +17230,13 @@ struct ggml_cplan ggml_graph_plan(struct ggml_cgraph * cgraph, int n_threads) {
            }
        } break;
    case GGML_OP_SILU_BACK:
        {
            n_tasks = n_threads;
        } break;
You could move GGML_OP_SILU_BACK to after GGML_OP_MUL to avoid the duplicate code.
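For example, something along these lines. This is a simplified, self-contained sketch; the enum and function names are stand-ins, not the real ggml_graph_plan code:

```c
enum demo_op { DEMO_OP_MUL, DEMO_OP_SILU_BACK, DEMO_OP_OTHER };

// DEMO_OP_SILU_BACK falls through and shares the body of DEMO_OP_MUL,
// avoiding the duplicated "n_tasks = n_threads;" block from the diff above.
static int demo_n_tasks(enum demo_op op, int n_threads) {
    int n_tasks = 1;
    switch (op) {
        case DEMO_OP_MUL:
        case DEMO_OP_SILU_BACK:
            {
                n_tasks = n_threads;
            } break;
        default:
            break;
    }
    return n_tasks;
}
```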
I don't suppose there would be a way to convert from the old format (even with a slight quality loss)?
This is a proof-of-concept for alternative packing of the quantization scaling factors. The idea is to pre-normalize the model tensors with row-wise scaling factors that transform the weights in each row into the range [-1 .. 1]. Having the information that the weights are bounded in this range, the delta and min factors of the quantizations can be represented with fewer bits.

This PR demonstrates the technique for Q4_0, Q4_1 and Q5_1. CPU + CUDA + Metal implementations are provided. Here are some sample results:
[Results table not recovered in this extraction: comparison of master vs this PR for Q4_0, Q4_1 and Q5_1; builds 71d6975 (1129) and 8c2b881 (1131).]
M2 Ultra (Metal), master: PPL Q4_1 with Metal: 6.0702 +/- 0.03387
In these examples all tensors are quantized with the specified quantization, except output.weight and tok_embeddings.weight, which are quantized with Q6_K.

The technique is compatible with all quantization methods. It's quite cheap in terms of computation - the extra ggml_mul() operation is applied to the matrix multiplication results with broadcast across rows. The implementation requires very few changes to the existing code - we only compute d and m in a different way, all else is the same. For each tensor we add an extra tensor with the row scaling factors stored in F32. The number of bits to use for d and m can be easily adjusted as long as the sum of the bits is 8 or 16.
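For a concrete picture, here is a rough sketch of the row-wise normalization and the broadcasted rescale described above. It is illustrative only and not the PR's actual code; the function names and the plain float buffers stand in for the real quantized ggml kernels:

```c
#include <math.h>

// Scale each row of a row-major n_rows x n_cols matrix into [-1 .. 1] and
// remember the per-row factor (kept as an extra F32 tensor in the PR).
static void normalize_rows(float * w, int n_rows, int n_cols, float * row_scale) {
    for (int r = 0; r < n_rows; ++r) {
        float amax = 0.0f;
        for (int c = 0; c < n_cols; ++c) {
            const float v = fabsf(w[r*n_cols + c]);
            if (v > amax) {
                amax = v;
            }
        }
        row_scale[r] = amax;
        const float inv = amax > 0.0f ? 1.0f/amax : 0.0f;
        for (int c = 0; c < n_cols; ++c) {
            w[r*n_cols + c] *= inv; // weights in this row now lie in [-1 .. 1]
        }
    }
}

// After y = W_norm * x, the result for row r is multiplied back by row_scale[r].
// This is the cheap extra multiply with broadcast across rows (ggml_mul in the PR).
static void rescale_output(float * y, int n_rows, const float * row_scale) {
    for (int r = 0; r < n_rows; ++r) {
        y[r] *= row_scale[r];
    }
}
```

Because the per-row factor stays in F32, each quantized block only has to represent values in a known, bounded range, which is what lets d and m use fewer bits.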
It's also possible to extend the normalization step to transform the weights into the [0 .. 1] range using an extra normalization tensor containing row-wise offset factors. I tried it and it works, though I don't think it is worth the extra effort.
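As a purely hypothetical illustration of the adjustable d/m bit budget mentioned above (not necessarily the encoding used in this PR), two bounded values could be packed with a simple fixed-point split such as 8+8 bits; it only works because the normalization bounds both values:

```c
#include <math.h>
#include <stdint.h>

// Hypothetical packing of d and m into a single 16-bit word.
// Assumes weights were normalized to [-1 .. 1], so for a Q4_1-style block
// m = min lies in [-1, 1] and d = (max - min)/15 lies in [0, 2/15].
#define D_BITS 8
#define M_BITS 8              // D_BITS + M_BITS must be 8 or 16
#define D_MAX  (2.0f/15.0f)
#define M_MAX  1.0f

static uint16_t pack_dm(float d, float m) {
    // clamp to the assumed ranges, then quantize to fixed point
    const float dc = fminf(fmaxf(d, 0.0f),   D_MAX);
    const float mc = fminf(fmaxf(m, -M_MAX), M_MAX);
    const uint32_t dq = (uint32_t) lroundf(dc / D_MAX * (float)((1u << D_BITS) - 1));
    const uint32_t mq = (uint32_t) lroundf((mc + M_MAX) / (2.0f*M_MAX) * (float)((1u << M_BITS) - 1));
    return (uint16_t) ((dq << M_BITS) | mq);
}

static void unpack_dm(uint16_t packed, float * d, float * m) {
    const uint32_t dq = packed >> M_BITS;
    const uint32_t mq = packed & ((1u << M_BITS) - 1);
    *d = (float) dq / (float)((1u << D_BITS) - 1) * D_MAX;
    *m = (float) mq / (float)((1u << M_BITS) - 1) * (2.0f*M_MAX) - M_MAX;
}
```

Changing D_BITS/M_BITS trades precision between the scale and the offset without touching anything else.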