Replies: 1 comment
Good point. See #5060
-
On a Llama 70b model, the old Q2_K saved about 600MB compared to Q3_K_S for only a minor bump in perplexity (< 1%), and that is precious for 36GB-VRAM users like me. I imagine the same trade-off matters for smaller models and users with less RAM.
Considering that it was working well, and that we now have the XS & XXS quants, could we have it back in the form of a Q3_K_XS, @ikawrakow, and perhaps even an intermediate Q3_K_XXS to fill the gap down to Q2_K with finer increments?
Some 70b models with 32k context are starting to appear, and they also exist in smaller sizes with various context lengths; that kind of granularity would be a great way to take advantage of them.
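Purely as illustration of the 600MB argument (this is not llama.cpp code, and the bits-per-weight figures below are rough assumptions, not the exact values of any particular quant mix), here is the back-of-envelope arithmetic:

```python
# Back-of-envelope sketch: estimate the quantized weight footprint of a ~70B
# model at different average bits-per-weight (bpw), to show why a few hundred
# MB can decide whether a model fits in a fixed VRAM budget.
# The bpw values are illustrative assumptions, not measured llama.cpp sizes.

PARAMS = 70e9          # ~70B weights
GIB = 1024 ** 3

def weight_size_gib(bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return PARAMS * bits_per_weight / 8 / GIB

# Hypothetical average bpw for the two mixes being compared (assumed):
mixes = {
    "old Q2_K-style mix": 3.35,
    "Q3_K_S":             3.42,
}

for name, bpw in mixes.items():
    print(f"{name:>20}: ~{weight_size_gib(bpw):.1f} GiB of weights at {bpw} bpw")

# A ~0.07 bpw difference over 70B weights is already ~0.6 GiB, which can be
# the margin between fitting and not fitting (KV cache and buffers come on top).
```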