Replies: 1 comment
Good point. See #5060
-
On a Llama 70b model, the old Q2_K saved about 600MB compared to Q3_K_S for only a minor bump in perplexity (< 1%), and that is precious for 36GB-VRAM users like me. I imagine the same trade-off matters for smaller models and users with less RAM.
Considering that it was working well, and that we now have the XS & XXS quants, could we have it back in the form of a Q3_K_XS, @ikawrakow, and perhaps even an intermediate Q3_K_XXS to fill the gap down to Q2_K with finer increments?
Some 70b models with 32k context are starting to appear, and they also exist in smaller sizes with various context lengths; that kind of granularity would be a great way to take advantage of them.
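Purely as illustration of the 600MB argument (this is not llama.cpp code, and the bits-per-weight figures below are rough assumptions, not the exact values of any particular quant mix), here is the back-of-envelope arithmetic:

```python
# Back-of-envelope sketch: estimate the quantized weight footprint of a ~70B
# model at different average bits-per-weight (bpw), to show why a few hundred
# MB can decide whether a model fits in a fixed VRAM budget.
# The bpw values are illustrative assumptions, not measured llama.cpp sizes.

PARAMS = 70e9          # ~70B weights
GIB = 1024 ** 3

def weight_size_gib(bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return PARAMS * bits_per_weight / 8 / GIB

# Hypothetical average bpw for the two mixes being compared (assumed):
mixes = {
    "old Q2_K-style mix": 3.35,
    "Q3_K_S":             3.42,
}

for name, bpw in mixes.items():
    print(f"{name:>20}: ~{weight_size_gib(bpw):.1f} GiB of weights at {bpw} bpw")

# A ~0.07 bpw difference over 70B weights is already ~0.6 GiB, which can be
# the margin between fitting and not fitting (KV cache and buffers come on top).
```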