ggml : remove bit shuffling #1405

Merged
merged 32 commits into from
May 11, 2023

ggerganov
Owner

@ggerganov ggerganov commented May 11, 2023

Close #1241

  • Drop Q4_2 support
  • Changed bit-order for Q4 and Q5 (breaking change)
  • Perplexity is perplexing as usual

New timings:

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| 7B | ms/tok @ 4th | 128 | 50 | 54 | 75 | 83 | 75 |
| 7B | ms/tok @ 8th | 123 | 44 | 52 | 53 | 58 | 72 |
| 13B | ms/tok @ 4th | 332* | 93 | 101 | 150 | 164 | 141 |
| 13B | ms/tok @ 8th | 308* | 81 | 96 | 96 | 104 | 136 |

\* these numbers vary a lot since the model is on the 32GB limit of my MacBook

Old timings:

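| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| 7B | ms/tok @ 4th | 128 | 56 | 61 | 91 | 95 | 75 |
| 7B | ms/tok @ 8th | 128 | 47 | 55 | 53 | 59 | 75 |
| 13B | ms/tok @ 4th | 239 | 104 | 113 | 176 | 185 | 141 |
| 13B | ms/tok @ 8th | 240 | 85 | 99 | 108 | 117 | 147 |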

Overall, all these numbers seem to have about +/- 10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.

@ggerganov ggerganov marked this pull request as ready for review May 11, 2023 18:36
@ggerganov ggerganov requested a review from sw May 11, 2023 18:36
@ggerganov ggerganov mentioned this pull request May 11, 2023
18 tasks
**Hot topics:**

- Qauntization formats `Q4` and `Q5` have changed - requantize any old models [(info)](https://github.com/ggerganov/llama.cpp/pull/1405)

Is "qauntization" a typo? 🤔

@redthing1

Is there a script to upgrade the old models to new? I don't have the source models because they're huge.

@sw
Contributor

sw commented May 12, 2023

Well this was a bit rushed, I think for such a significant change it would have been nice to allow a discussion.

> Ah, that's unfortunate. I don't think there's a way to fix this efficiently for the ARM_NEON branch

Fix what? Were the NEON implementations not working in #1305 / #1384?

@ggerganov
Owner Author

#1384 does not work for NEON because when we remove the vzip calls, the high bits are no longer in the right place.

This is the relevant section before this PR:

llama.cpp/ggml.c

Lines 3335 to 3360 in b608b55

// extract the 5th bit
uint32_t qh;
memcpy(&qh, x0->qh, sizeof(qh));
tmp[0] = table_b2b_u[(qh >> 0) & 0xFF];
tmp[1] = table_b2b_u[(qh >> 8) & 0xFF];
tmp[2] = table_b2b_u[(qh >> 16) & 0xFF];
tmp[3] = table_b2b_u[(qh >> 24) ];
const int8x16_t qhl = vld1q_s8((const int8_t *)(tmp + 0));
const int8x16_t qhh = vld1q_s8((const int8_t *)(tmp + 2));
const uint8x16_t v0 = vld1q_u8(x0->qs);
// 4-bit -> 8-bit
const int8x16_t v0l = vreinterpretq_s8_u8(vandq_u8 (v0, m4b));
const int8x16_t v0h = vreinterpretq_s8_u8(vshrq_n_u8(v0, 4));
// interleave
const int8x16_t v0lz = vzip1q_s8(v0l, v0h);
const int8x16_t v0hz = vzip2q_s8(v0l, v0h);
// add high bit and sub 16
const int8x16_t v0lf = vsubq_s8(vorrq_s8(v0lz, qhl), s16b);
const int8x16_t v0hf = vsubq_s8(vorrq_s8(v0hz, qhh), s16b);

We were ORing the 5th bit after the vzip.
I didn't see a way to fix this without shuffling the tables or the bits in some way.
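
To make the interleaving issue concrete, here is a scalar C sketch (not the ggml code) of two ways to pack 32 4-bit values into 16 bytes. It assumes the old layout keeps two consecutive values in the nibbles of one byte, and that the new layout keeps value `j` and value `j + 16` together. The function names and `QK` are made up for illustration. `unpack_lo_hi` mimics what `vandq_u8` / `vshrq_n_u8` produce, and `zip_halves` stands in for the `vzip1q_s8` / `vzip2q_s8` pair.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK 32 /* values per block (assumed, as in Q4_0) */

/* Assumed old layout: byte j holds value 2j (low nibble) and 2j+1 (high nibble). */
static void pack_interleaved(const uint8_t v[QK], uint8_t qs[QK / 2]) {
    for (int j = 0; j < QK / 2; j++) {
        assert(v[2 * j] < 16 && v[2 * j + 1] < 16);
        qs[j] = (uint8_t)(v[2 * j] | (v[2 * j + 1] << 4));
    }
}

/* Assumed new layout: byte j holds value j (low nibble) and j + QK/2 (high nibble). */
static void pack_split(const uint8_t v[QK], uint8_t qs[QK / 2]) {
    for (int j = 0; j < QK / 2; j++) {
        assert(v[j] < 16 && v[j + QK / 2] < 16);
        qs[j] = (uint8_t)(v[j] | (v[j + QK / 2] << 4));
    }
}

/* Mask-and-shift unpack: all low nibbles first, then all high nibbles. */
static void unpack_lo_hi(const uint8_t qs[QK / 2], uint8_t out[QK]) {
    for (int j = 0; j < QK / 2; j++) {
        out[j]          = qs[j] & 0x0F;
        out[j + QK / 2] = qs[j] >> 4;
    }
}

/* Scalar stand-in for the zip step: interleave the two halves. */
static void zip_halves(const uint8_t in[QK], uint8_t out[QK]) {
    for (int j = 0; j < QK / 2; j++) {
        out[2 * j]     = in[j];
        out[2 * j + 1] = in[j + QK / 2];
    }
}

/* Returns 1 if both paths recover the values in their original order. */
static int layouts_roundtrip(void) {
    uint8_t v[QK], qs[QK / 2], tmp[QK], out[QK];
    for (int i = 0; i < QK; i++) v[i] = (uint8_t)((i * 7 + 3) & 0x0F);

    /* Old layout: unpack, then zip to restore value order. */
    pack_interleaved(v, qs);
    unpack_lo_hi(qs, tmp);
    zip_halves(tmp, out);
    if (memcmp(v, out, QK) != 0) return 0;

    /* New layout: unpack only, no zip needed. */
    pack_split(v, qs);
    unpack_lo_hi(qs, out);
    return memcmp(v, out, QK) == 0;
}
```

With the split layout the mask-and-shift output is already in value order, so the zip step disappears. Any per-value high bits (such as the 5th bit above) then have to be stored in that same order for a plain OR to line up.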

@M00N-MAN

M00N-MAN commented May 12, 2023

> Is there a script to upgrade the old models to new? I don't have the source models because they're huge.

Hello All

Could someone please share the way of requantizing?
At least on this example https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/blob/main/README.md
What should be done with WizardLM-7B-uncensored.q5_1.bin, or which tool needs to be used for https://huggingface.co/ehartford/WizardLM-13B-Uncensored which is the source model?

upd:
./quantize --help
usage: ./quantize model-f32.bin [model-quant.bin] type [nthreads]
type = "q4_0" or 2
type = "q4_1" or 3
type = "q5_0" or 8
type = "q5_1" or 9
type = "q8_0" or 7

But it seems the main README.md lacks the non-q4 variants and a table matching the -n value to the quantization of the selected model.

@LostRuins
Collaborator

Cool to see this is merged, i'm slightly confused though, seems like the converter is still writing the old file version (1) ?

@philpax

philpax commented May 12, 2023

> Cool to see this is merged, i'm slightly confused though, seems like the converter is still writing the old file version (1) ?

I think that's intentional - the converter converts the f16 model to a GGJT v1 f16 model, which is then quantised to a GGJT v2 qX_Y model. (The saver always writes version 2, from what I can see.)

@LostRuins
Collaborator

LostRuins commented May 12, 2023

Doesn't seem like it though, I don't see references where File version 2 is set during quantization either.

Edit: i'm wrong. It's set in write_magic()
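
For anyone who wants to check which version a file carries, a minimal C sketch follows. It assumes the file starts with a little-endian 32-bit magic followed by a little-endian 32-bit version, and that the GGJT magic is `0x67676a74`. Both are assumptions to verify against the loader source, and `ggjt_version` is a hypothetical helper, not a llama.cpp function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GGJT_MAGIC 0x67676a74u /* 'ggjt'; assumed value, check the loader source */

/* Read a little-endian uint32 from 4 bytes, independent of host endianness. */
static uint32_t read_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the file version from the first 8 header bytes,
 * or -1 if the buffer is too short or the magic does not match. */
static int ggjt_version(const uint8_t *hdr, size_t n) {
    assert(hdr != NULL);
    if (n < 8) return -1;
    if (read_u32_le(hdr) != GGJT_MAGIC) return -1;
    return (int)read_u32_le(hdr + 4);
}
```

Read the first 8 bytes of the model file into a buffer and pass them in. Going by the comments above, an f16 file from the converter would report 1 and a freshly quantized file would report 2.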

@TheBloke
Contributor

TheBloke commented May 12, 2023

> Could someone please share the way of requantizing? At least on this example https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/blob/main/README.md What should be done with WizardLM-7B-uncensored.q5_1.bin, or which tool needs to be used for https://huggingface.co/ehartford/WizardLM-13B-Uncensored which is the source model?

Check my repos again. I've re-quantised all my GGMLs using the latest code, in q4_0, q5_0, q5_1 and q8_0 variants. So no need to do it yourself unless you want to.

@sw sw mentioned this pull request May 18, 2023
Labels

  • breaking change: Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility.
  • performance: Speed related topics
Development

Successfully merging this pull request may close these issues.

Use different bit arrangement for quants (nibbles)
9 participants