-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use full range for q4_0 quantization #729
Conversation
I used this for 2-bit quantization, where it did make a big difference (after all, it lets you use 4 instead of 3 values). For 4-bit the effect is less pronounced, but may still be worthwhile. Your code looks fine, but this should definitely be discussed more in depth before we decide to change this. If we do it, I believe we should update the SIMD implementations at the same time. The increase in maximum error is probably due to the case where there are two values of similar magnitude but opposite signs. The value closer to zero gets rounded to +8 and clipped to +7. I don't know if this is bad, but if yes, we might use the old method for this case specifically. |
Ideally all the SIMD implementations should be updated yes, but who has a PowerPC CPU lying around? :/ I think there is value in just updating the reference implementation, since that is the one we use for model quantization, maybe adding a comment to each SIMD implementation that it needs updating. |
Thanks to #728, I was able to test my q2 and q3 implementations, and as expected the changes are bigger with fewer bits:
Weirdly, the maximum error goes down for q2 and q3.
That would be acceptable if changing the SIMD implementations incurs a significant performance degradation. But then we shouldn't really call the function "reference" anymore. |
Not going to complete full perplexity run on my laptop, but from a partial run in 7B the difference looks close to 0.1 perplexity [-7,7] 7B perplexity
[-8,7] 7B perplexity
|
Pushed suggestions for vectorized implementations, but besides AVX and AVX2 they have not been tested. |
may need to pump version to 2? force everyone regen the file again? |
Ok, this looks promising. @howard0su |
read the change again. The change only applies to quantize to pick a better scale factor. we don't really need to pump version but ask the user to regain is enough. |
However if we bump the version we can warn the users using outdated models that they could gain some quality by re-quantizing their models. No need to break backwards compatibility, just bump the version and show a warning if the model is old, but still accept it. I am currently running the perplexity test on AVX2, so far the results seem similar to what @unbounded reported earlier, about 0.1 lower than before. I will try to complete the test but it is going to take a few hours. |
Trying to address the issue I raised (how important is max error though?), I came up with this: #define THRESHOLD 9
float max = -INFINITY;
float min = +INFINITY;
for (int l = 0; l < QK; l++) {
const float v = x[i*QK + l];
max = MAX(max, v);
min = MIN(min, v);
}
float d;
if (fabsf(max + min) < (max - min) / THRESHOLD) {
d = MAX(max, -min) / 7; // max and min are close in magnitude, use old method
}
else if (max > -min) {
d = max / -8;
}
else {
d = min / -8;
} Maybe someone can think of a better way of writing this. Anyway, the results look good:
I know that I was originally in favor of updating the SIMD to match this, but it's getting more complicated. Maybe we have to abandon the idea of it being the "reference" and just the "slow but low error" method. |
Very interesting experiment @sw - hard to know how much the metrics correspond to perplexity, but we could test the one with lowest RMS and see if it improves perplexity. I don't think we need to bump the version for pure quantization improvements, it's a shame to break compatibility when there's no actual change to the format. We might see multiple different ways to quantize later, e.g. error-minimizing or GPTQ-like methods and I don't think we want a breaking change for each one. It would be be nice to know the version that generated the model, e.g. for bug reports you could know if they used a known buggy version, but there's no unused space in the header to put it. Maybe next time the format changes we could reserve some extra space to add metadata like "quantization method" which does not change the format. |
Instead of playing with a threshold heuristic, would it be very costly to calculate the rms of the two quantization methods and choose the lower? Only in the reference implementation for converting the model of course. |
Perplexity with 7B/AVX2: 6.4481 Full output
|
The idea would be to keep compatibility, but show a warning to let the user know that it is possible to requantize the model for some gain. |
I will have a look at dynamically selecting 7 or 8 (or even a value inbetween) according to RMSE, but should we just ignore the maximum error? Essentially this is what #397 was about. There are some prototypes there, just need to select something that's not too bad in terms of performance. This would then not be a reference or fallback implementation, but we can keep the SIMD implementations as they are and provide a matching non-SIMD one (apart from the known rounding/associativity issues with floats). I'm not too keen on touching the version, as it would prevent older revisions from using newly generated files. Would it be acceptable to scan through part of some tensors to see if the value -8 is ever used, and print a warning if not? (during loading in llama.cpp of course, not in ggml.c) It's a bit of a hack and may affect mmap/cache behavior. Or just use a padding space as a minor version. |
I believe we were discussing that we pump the version to 2 but we load version 2 as version 1 except to tell the user if you have original file, please re-run quantize gain a better result. |
Let's apply this approach to the I'll try to figure out the change and do it for ARM NEON Using
|
I have rebased this branch on latest
Also, is there an extension of this approach to Q4_0 with "RMSE optimized"
Q4_0 with "Full Range"
Q4_2 with "RMSE optimized"
Q4_2 with "Full Range"
Used branches: |
These already use the full range, |
Based on the results in my previous comment, it seems that RMSE optimization is not optimal for perplexity - at least for I think one other "feature" of this approach is that it uses the Edit: Started the 4 13B runs - will report the results tomorrow Results for 13B
Q4_0 with "RMSE optimized"
Q4_0 with "Full Range"
Q4_2 with "RMSE optimized"
Q4_2 with "Full Range"
|
@ggerganov Rebased #1106 on latest master, re-quantized, started a |
@ggerganov , you are right. The approach in this PR (full range) really is better for perplexity than RMSE minimization as per #1106 for However, for |
@ikawrakow that is surprising to me, so far the RMSE has mapped fairly well to the performance of different quantization methods. Very interesting result! I think there is a good argument to use this method as a default then. Besides measuring actual activations like GPTQ, one other error measure I've seen mentioned, is cosine similarity
It might make sense that if it's most important to preserve the angle, the largest value affects that the most and should be preserved (and q4_1 preserves the largest positive and negative value). The "tie break" @sw tested when there are negative and positive values of the same magnitude might still be beneficial then. |
@unbounded Thanks for the suggestion, but the cosine similarity cannot be used to define the scale of a bucket (it cancels out). Else I would have picked that (or the closely related correlation coefficient). We are quantizing "buckets", not whole layers, and my guess would be that activations in the next layer do depend on the relative magnitude of the quantized weights in the buckets (although independent on the overall scale of the layer as per the paper you linked). What has been done so far here in But yes, I agree, RMSE-minimization being worse than "full range" was a surprise for me too. My simple-minded explanation is that a) the asymmetry around 0 does matter (as we see by the much better results for |
Lets finalize this PR, including Also, for now disable the
|
By keeping the sign of the highest magnitude, we can make sure the highest value maps to -8, which is currently unused. This is a bit of a freebie since it is fully backwards compatible with the current format.
Untested
944c703
to
5808fcf
Compare
Updated for q4_2 |
Thanks Only need to decide if we bump the version and print a warning message for P.S. I think the latter is better, but want to hear if there are other opinions |
I feel like a version bump is unnecessary at this point, if people have a model they're happy with we don't need to prod them to upgrade. |
So, one can indeed improve To follow up on the second hypothesis I made above (one needs more bits to match My third hypothesis (one can accurately represent model weights via SVD decomposition using fewer bits) seems to be incorrect in general. It does work pretty well for the attention related tensors of the first layer, but that's about it. |
Thank you for the additional analysis. It looks like 5-bit quantization provides almost perfect quality. |
By preserving the sign of the highest magnitude value, we can make sure the highest value maps to -8 in our [-8, 7] range, which is currently unused. This is a bit of a freebie since the change is fully backwards compatible with the current format.
This was also noted in #397 (comment) but has not been fixed in the code yet.
This PR only updates the reference implementation, not the SIMD accelerated versions.
quantize-stats output: (see #728)
before(7B):
q4_0 : mse 0.00000492, maxerr 0.14257812, 95pct<0.0040, median<0.0018
q4_1 : mse 0.00000318, maxerr 0.12756348, 95pct<0.0034, median<0.0014
after(7B):
q4_0 : mse 0.00000386, maxerr 0.18200684, 95pct<0.0036, median<0.0016
q4_1 : mse 0.00000318, maxerr 0.12756348, 95pct<0.0034, median<0.0014
Most layers seem to have reduced maxerr, but the total max error is actually slightly higher
TODO: run perplexity
quantize-stats before (7B)
quantize-stats after (7B)