Quality of 4-bit quantization #62

Closed
thement opened this issue Mar 12, 2023 · 4 comments
Labels
duplicate This issue or pull request already exists

Comments

@thement
Contributor

thement commented Mar 12, 2023

The quality of the 4-bit quantization is really abysmal compared to both non-quantized models and GPTQ quantization (https://github.com/qwopqwop200/GPTQ-for-LLaMa). Wouldn't it make sense for llama.cpp to load prequantized LLaMa models?
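
For readers unfamiliar with where the quality loss comes from, here is a minimal sketch of plain round-to-nearest block quantization. It assumes one scale per block of 32 values and symmetric integer codes; the function names, the block size, and the synthetic Gaussian weights are illustrative assumptions, and this is neither llama.cpp's actual q4_0 code nor GPTQ.

```python
import math
import random

def quantize_block(values, bits=4):
    """Symmetric round-to-nearest: one float scale per block, integer codes in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4 bits, 127 for 8 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    # Clamp guards against floating-point overshoot at the block maximum.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Map integer codes back to approximate float values."""
    return [scale * c for c in codes]

def rms_error(weights, block_size=32, bits=4):
    """Root-mean-square difference between the weights and their quantized round trip."""
    total = 0.0
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale, codes = quantize_block(block, bits)
        restored = dequantize_block(scale, codes)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return math.sqrt(total / len(weights))

# Synthetic stand-in for a weight tensor (an assumption, not real LLaMA weights).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
err4 = rms_error(weights, bits=4)
err8 = rms_error(weights, bits=8)
print(f"RMS error at 4 bits: {err4:.6f}, at 8 bits: {err8:.6f}")
```

Round-to-nearest treats every weight independently, which is why methods that account for how errors interact, as GPTQ aims to, can do better at the same bit width.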

@jminardi

Do you have examples showing the poor quality here and higher quality with other quantization methods? Are you sure the hyperparameters are all the same?

@Dicklesworthstone

Dicklesworthstone commented Mar 13, 2023

I agree that this seems like the single biggest "bang for the buck" in quality improvement versus effort. Also, I wonder whether, using this technique and the 65B model, you could get down to 3 bits or even 2 bits (say, only for the last 25% of the layers?).

On top of that, with some kind of high-speed streaming compression (zstd, for example), perhaps the quantized weights could be reduced even further, which might help with model load speed (assuming you are IO-bound rather than compute-bound during loading).

I wish I knew more C++ to help with this.
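
One way to check the compression idea before writing any C++ is to measure how compressible packed 4-bit codes are. The sketch below is only an illustration under stated assumptions: it uses Python's standard `zlib` as a stand-in for zstd, and the codes are synthetic (a clipped, rounded Gaussian), not real quantized model weights, so the ratio on an actual model file may differ.

```python
import random
import zlib

def pack_nibbles(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count; builds a new list
    return bytes(
        (codes[i] & 0x0F) | ((codes[i + 1] & 0x0F) << 4)
        for i in range(0, len(codes), 2)
    )

def compression_ratio(data, level=6):
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, level)) / len(data)

# Synthetic 4-bit codes clustered around the middle of the range (an assumption).
random.seed(0)
codes = [max(0, min(15, round(random.gauss(7.5, 2.5)))) for _ in range(1 << 16)]
packed = pack_nibbles(codes)
ratio = compression_ratio(packed)
print(f"compressed / original = {ratio:.3f}")
```

If the ratio on real weights turns out close to 1, the codes are already near their entropy limit and decompression time would likely outweigh any IO savings.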

@thement
Contributor Author

thement commented Mar 13, 2023

> Do you have examples showing the poor quality here and higher quality with other quantization models? Are you sure the hyper parameters are all the same?

No, unfortunately I didn't try to set the parameters to be exactly the same, but the output of the 4-bit quantized 65B model in llama.cpp was obviously rubbish, and I wasn't able to get good results from it:

...
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

Building a website can be done in 10 simple steps:
Step One – Pick Your Domain Name (Your Website Address)
Domain names are the addresses of websites, and every single site has one. A domain name is an address assigned to you so that others may find your business online. It’s like having a unique physical street or mailing address for each company; without it people would have no idea where to look when they want something from a specific website!
The most popular types of websites are those used by companies as their main marketing tool, and typically these will end with .com (for “company”), so that visitors know exactly what kind of site this is. If you’re building your business online then we recommend choosing one of the many available TLDs ending in ‘dot com.’
Step Two – Pick Your Host Name And URL Extension (.COM, ORG., ETC.) Is it www or Non-www? Does It Really Matter and Should You Care About This Stuff as a Newbie Webmaster Building A Website For The First Time Ever ??. You betcha !!! To make things easy for you , we recommend using WordPress on Cloud.
Step Three – Choose Your Platform (The Technology And Framework That Will Power Up Your Site) Hosting is like the land your house sits upon; technically, it’s not part of building a website as such since websites can be built anywhere—but without hosting there would just be nowhere for them to go.

ggerganov added the duplicate label Mar 13, 2023
@ggerganov
Owner

ggerganov commented Mar 13, 2023

Better quantization will be added in the future (#9)

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023