Add Q3_K_XS #5060
Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.
Just a reminder of the table obtained after the optimizations you made to Q2_K and Q3_K_S in late August 2023 (#2807). That Q2_K is the one I suggested renaming to Q3_K_XS: it already exists and has been proven over a long time, its perplexity increase (<1%) was less than half of its size reduction (>2%), and that change alone gains about 1k of context with an f16 KV cache. But it would of course be great to also have an intermediate quant below it, the Q3_K_XS that you PRed, which looks like a Q3_K_XXS to me!
I was taking the values from my notes, and I guess I forgot to update the notes when I made PR #2807. So, what we see in the above tables/graph is what we had before PR #2807. Here is an updated graph with the values post #2807 (i.e., current master).
Q3_K_XS seems to give broken results for Mixtral-type models. Generation just ends immediately or prints a few symbols then stops. I've tested and hit the bug with Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss, TenyxChat-8x7B and bagel-dpo-8x7b-v0.2.
Thank you for noticing. It should be fixed via PR #5113.
I've tested the patch, it works. Thanks! |
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
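The "first 1/8 of ffn_down layers" rule from the commit message can be illustrated with a small sketch. This is not the llama.cpp implementation: the function name `ffn_down_type`, the use of integer division for the cutoff, and the `Q3_K` fallback for the remaining layers are all assumptions made for illustration. The 80-layer count is that of LLaMA-v2-70B.

```python
def ffn_down_type(layer_index: int, n_layers: int,
                  boosted_type: str = "Q4_K",
                  default_type: str = "Q3_K") -> str:
    """Pick a quantization type for the ffn_down tensor of one layer.

    Hypothetical helper: the first n_layers // 8 layers get the
    higher-precision type, all later layers get the default type.
    The default type and the rounding rule are assumptions.
    """
    if not 0 <= layer_index < n_layers:
        raise ValueError("layer_index out of range")
    n_boosted = n_layers // 8  # assumed cutoff: floor(n_layers / 8)
    return boosted_type if layer_index < n_boosted else default_type


# LLaMA-v2-70B has 80 transformer layers, so 10 layers are boosted here.
types = [ffn_down_type(i, 80) for i in range(80)]
print(types.count("Q4_K"), "of", len(types), "ffn_down tensors use Q4_K")
```

Under these assumptions only layers 0 through 9 pay the extra size cost of Q4_K, which is presumably why the overall model stays close to a 3-bit size.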
TL;DR See #5055
Before the recent two-bit quantization and importance matrix related changes, there were two low-bit quantization types available in `llama.cpp`: `Q2_K` and `Q3_K_S`. `Q2_K` was basically a 3-bit quantization with just the `attn_k` and `attn_q` tensors quantized with 2 bits. The table shows their model sizes and perplexities (`wiki.test.raw, n_ctx = 512`) for LLaMA-v2-70B:

After the recent changes, `Q2_K` has become an actual 2-bit quantization (less than 3 bits-per-weight), has a LLaMA-v2-70B model size of 23.71 GiB, and a perplexity of `4.0039` (using an importance matrix derived from `wiki.train.raw`). `Q3_K_S` has increased very slightly in size to 27.86 GiB, but has a better perplexity of `3.6603`. Based on #5005 there is a need to have an intermediate step in terms of model size between the new `Q2_K` and `Q3_K_S`. This PR adds such a quantization type as `Q3_K_XS`. The following table summarizes the new situation for LLaMA-v2-70B:

The table on a graph:
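The "less than 3 bits-per-weight" statement can be sanity-checked from the model sizes quoted in the description. The sketch below is a rough estimate only: the parameter count of about 68.98 billion for LLaMA-v2-70B is an assumption not stated in the PR, and the helper name `bits_per_weight` is hypothetical. The result is an average over all tensors in the file, including any kept at higher precision.

```python
GIB = 2**30                # bytes per GiB
N_PARAMS_70B = 68.98e9     # assumed parameter count of LLaMA-v2-70B


def bits_per_weight(size_gib: float, n_params: float = N_PARAMS_70B) -> float:
    """Average bits per weight implied by a quantized model file size."""
    return size_gib * GIB * 8 / n_params


q2_k_bpw = bits_per_weight(23.71)    # new Q2_K size from the description
q3_k_s_bpw = bits_per_weight(27.86)  # new Q3_K_S size from the description
gap_gib = 27.86 - 23.71              # size range an intermediate type can fill

print(f"Q2_K:   {q2_k_bpw:.2f} bpw")
print(f"Q3_K_S: {q3_k_s_bpw:.2f} bpw")
print(f"gap:    {gap_gib:.2f} GiB")
```

With the assumed parameter count, the new `Q2_K` comes out just under 3 bits per weight and `Q3_K_S` near 3.5, leaving a gap of roughly 4 GiB between the two file sizes.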