Quality of 4-bit quantization #62
Comments
Do you have examples showing the poor quality here and higher quality with other quantization methods? Are you sure the hyperparameters are all the same?
I agree that this seems like the single biggest "bang for the buck" of quality improvement versus effort. Also, I wonder whether, using this technique and the 65B model, you could get down to 3 bits or even 2 bits (say, only for the last 25% of the layers?). On top of that, using some kind of high-speed streaming compression (zstd, for example), the quantized weights could perhaps be reduced even further, which might help with model load speed (assuming you are I/O bound rather than compute bound during loading). I wish I knew more C++ to help with this.
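For what it's worth, here is a minimal sketch of the zstd idea using libzstd's one-shot API. The function and variable names are purely illustrative and not part of llama.cpp; this assumes the weights have already been quantized and packed into a byte buffer.

```cpp
// Sketch: compress a buffer of already-quantized (packed) weights with zstd
// before writing it to disk. Requires linking against libzstd (-lzstd).
#include <zstd.h>

#include <cstdint>
#include <cstdio>
#include <vector>

// Returns the zstd-compressed bytes of `quantized`, or an empty vector on error.
std::vector<uint8_t> compress_weights(const std::vector<uint8_t> &quantized, int level = 3) {
    std::vector<uint8_t> out(ZSTD_compressBound(quantized.size()));
    const size_t n = ZSTD_compress(out.data(), out.size(),
                                   quantized.data(), quantized.size(), level);
    if (ZSTD_isError(n)) {
        std::fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(n));
        return {};
    }
    out.resize(n);  // shrink to the actual compressed size
    return out;
}
```

Loading would reverse this with `ZSTD_decompress`. Whether it pays off depends on the assumption in the comment above that loading is I/O bound: packed quantized weights are fairly high-entropy, so the size reduction (and therefore any load-time win) may be modest.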
No, unfortunately I didn't try to set the parameters to be exactly the same, but the output of the 4-bit quantized 65B model in llama.cpp was obviously rubbish and I wasn't able to get good results from it.
Better quantization will be added in the future (#9).
The quality of the 4-bit quantization is really abysmal compared to both non-quantized models and GPTQ quantization
(https://github.com/qwopqwop200/GPTQ-for-LLaMa). Wouldn't it make sense for llama.cpp to support loading pre-quantized LLaMA models?
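For context on what "the 4-bit quantization" refers to here, below is a rough, illustrative sketch of round-to-nearest block quantization with one scale per block of weights. It shows the general style of scheme in question, not llama.cpp's actual code; the block size, struct layout, and function name are assumptions.

```cpp
// Illustrative round-to-nearest 4-bit block quantization: each block of QK
// weights shares a single fp32 scale, and each weight is rounded independently
// to one of 16 levels. Assumed layout, not the actual llama.cpp format.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK = 32;               // assumed block size

struct BlockQ4 {
    float   scale;                   // per-block scale factor
    uint8_t nibbles[QK / 2];         // two 4-bit values packed per byte
};

// Quantize n floats (n must be a multiple of QK).
std::vector<BlockQ4> quantize_q4(const float *x, int n) {
    std::vector<BlockQ4> blocks(n / QK);
    for (int b = 0; b < n / QK; ++b) {
        const float *xb = x + b * QK;
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(xb[i]));
        const float scale = amax / 7.0f;              // map values into roughly [-7, 7]
        const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;
        blocks[b].scale = scale;
        for (int i = 0; i < QK; i += 2) {
            // round each weight to the nearest level, store as an unsigned nibble
            int q0 = std::clamp((int)std::lround(xb[i]     * inv) + 8, 0, 15);
            int q1 = std::clamp((int)std::lround(xb[i + 1] * inv) + 8, 0, 15);
            blocks[b].nibbles[i / 2] = (uint8_t)(q0 | (q1 << 4));
        }
    }
    return blocks;
}
```

In a scheme like this, every weight in a block is rounded independently against a single scale, so outliers inflate the scale and wash out the smaller weights. GPTQ instead chooses the quantized values to minimize the layer's output error using approximate second-order information, which is largely why it tends to hold up better at 4 bits.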