Improving quality with 8bit? #53

neuhaus · 2023-03-12T17:06:45Z

I can achieve around 1 token per second on a Ryzen 7 3700X on Linux with the 65B model and 4bit quantization.

If we use 8bit instead, would it run faster? I have 128GB RAM. Is 8bit already supported?

$ ./main -m models/65B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 14010.35 ms
main:   sample time =   335.09 ms
main:  predict time = 140527.48 ms / 1089.36 ms per token
main:    total time = 157951.48 ms

The text was updated successfully, but these errors were encountered:

gjmulder · 2023-03-12T17:32:56Z

I tried the intermediate fp16 and could get the model to run in 122GB of resident memory. With a Ryzen 1950X 16 Core CPU and slower memory than you:

4bit quantized:

main: mem per token = 71159620 bytes
main:     load time = 18022.09 ms
main:   sample time =   279.06 ms
main:  predict time = 139437.72 ms / 787.78 ms per token

fp16:

main: mem per token = 71159620 bytes
main:     load time = 136686.84 ms
main:   sample time =   372.38 ms
main:  predict time = 303936.28 ms / 2356.10 ms per token
main:    total time = 482714.19 ms

apollotsantos · 2023-03-12T18:08:00Z

I tried the intermediate fp16 and could get the model to run in 122GB of resident memory. With a Ryzen 1950X 16 Core CPU and slower memory than you:

4bit quantized:
main: mem per token = 71159620 bytes
main:     load time = 18022.09 ms
main:   sample time =   279.06 ms
main:  predict time = 139437.72 ms / 787.78 ms per token
fp16:
main: mem per token = 71159620 bytes
main:     load time = 136686.84 ms
main:   sample time =   372.38 ms
main:  predict time = 303936.28 ms / 2356.10 ms per token
main:    total time = 482714.19 ms
```

How did you run with the fp16 version?

gjmulder · 2023-03-12T18:08:37Z

./main -m ./models/65B/ggml-model-f16.bin -t 16 -n 128

apollotsantos · 2023-03-12T18:25:22Z

Thanks. Hlw much memory it uses? Em dom, 12 de mar de 2023 15:08, Gary Mulder ***@***.***> escreveu:

…

./main -m ./models/65B/ggml-model-f16.bin -t 16 -n 128 — Reply to this email directly, view it on GitHub <#53 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEJ5P3CCH2TZQTJ2KN7IMDW3YGK7ANCNFSM6AAAAAAVYHEBBY> . You are receiving this because you commented.Message ID: <ggerganov/llama. ***@***.***>

neuhaus · 2023-03-12T18:52:54Z

OK i tried it with the fp16 model too, it only swapped a little bit (i have an 8-core Ryzen 7 3700X and 128GB RAM):

$ ./main -m models/65B/ggml-model-f16.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 71429.04 ms
main:   sample time =   324.53 ms
main:  predict time = 402116.09 ms / 3117.18 ms per token
main:    total time = 483291.78 ms

I also tried using -t 16 (to take advantage of multithreading) but it ended up being slightly slower.

I'm still hoping that 8bit could be faster than 4bit - it it likely?

apollotsantos · 2023-03-12T19:06:31Z

Is there an 8 bit version of the conversion script? Em dom, 12 de mar de 2023 15:53, S. Neuhaus ***@***.***> escreveu:

…

OK i tried it with the fp16 model too, it only swapped a little bit (i have an 8-core Ryzen 7 3700X and 128GB RAM): $ ./main -m models/65B/ggml-model-f16.bin -t 8 -n 128 main: mem per token = 70897348 bytes main: load time = 71429.04 ms main: sample time = 324.53 ms main: predict time = 402116.09 ms / 3117.18 ms per token main: total time = 483291.78 ms I also tried using -t 16 (for take advantage of multithreading) but it ended up being slightly slower. I'm still hoping that 8bit might be faster than 4bit...? — Reply to this email directly, view it on GitHub <#53 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEJ5P3766UC4HEPYABQ4ZLW3YLRDANCNFSM6AAAAAAVYHEBBY> . You are receiving this because you commented.Message ID: <ggerganov/llama. ***@***.***>

neuhaus · 2023-03-12T19:14:18Z

As of now "quantize" only knows how to do 4bit.

gjmulder · 2023-03-12T19:14:33Z

122GB.

What would be interesting is to benchmark quality versus memory size, i.e. does say a fp16 13B model generate better output than a int4 60GB model?

@apollotsantos are you in Lisboa? I'm in Carcavelos.

ggerganov · 2023-03-12T19:15:14Z

No 8-bit support atm, but can be added similar to 4-bit
I expect it will be slower, because it will increase memory traffic. But it also depends on how efficient the SIMD is implemented

neuhaus · 2023-03-12T19:15:30Z

I believe to have noticed a significant quality increase going from 7B to 13B and from 13B to 30B (on GPU) and i've just started with 65B and it is a bit slow on my CPU.

apollotsantos · 2023-03-12T19:15:45Z

@gjmulder actually not. I'm in Brazil

gjmulder · 2023-03-13T13:14:00Z

This issue is perhaps misnamed, now, as 8bit will likely improve quality over 4bit but not performance.

In summary:

Inference performance: 4bit > 8bit > fp16 (as the code looks to be primarily memory-bound, with only a 50% performance increase from going from 8 cores to 16 cores on my 16 core / 32 hyperthread Ryzen 1650X)
Precision quality: fp16 > 8bit > 4bit (as more precision improves inference quality)
Scaling quality: 65B > 30B > 13B > 7B (scaling of models improves inference quality significantly)

Which led me to wonder where the sweet spots are among two parameters for a given memory footprint?

Once the model is able to be loaded once and called repeatedly (issue #23) and the python bindings are merged (issue #82 and https://github.com/thomasantony/llama.cpp/tree/feature/pybind), I can test all the permutations against say the SQuAD benchmark and we can understand the impact of quantization versus model size.

AGSaidi · 2023-03-13T19:31:36Z

Arm has SMMLA instructions which for newer arm targets should give another 4x over fp16.

MarkSchmidty · 2023-03-14T07:50:10Z

122GB.

What would be interesting is to benchmark quality versus memory size, i.e. does say a fp16 13B model generate better output than a int4 60GB model?

The answer is no, At around 20B parameters you only need 3 bits to get about the same performance quality as the same 20B parameter model in uncompressed fp16. As a rule of thumb, for each 4x more parameters you can drop a bit off while still getting close to 16bit quality.

So an 80B parameter model would have around the same quality in 2bit as in 16bit and a 320B parameter model would have around the same quality in 1bit as in 16bit. Beyond 1bit quantization can be achieved through various methods, such as re-using bins of bits from non-connected layers, which are only applicable to massive models and which will only maintain output quality for ~1T+ parameter models.

I'm not going to list every source for this. But these papers are a good start:
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | Oct 2022
The case for 4-bit precision: k-bit Inference Scaling Laws | Dec 2022 - Updated Feb 2023
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | Jan 2023

Also we're running empirical tests to validate this with LLaMA specifically over in #9 and so far they are turning out as expected. (No surprises there since the same tests have already been done on a half a dozen different models in a dozen sizes from 200M to over 500B parameters in the papers linked above.)

P.s. The only LLaMA which will have a quality benefit from 8-bit is 7B. But the benefit will be so small as to be insignificant. Even a minor amount of finetuning worth $10 of compute is enough to overcome the difference between 8-bit and 4-bit at 7B parameters.

MarkSchmidty · 2023-03-14T08:09:04Z

Which led me to wonder where the sweet spots are among two parameters for a given memory footprint?

13B appears to have negligible quality difference at 3-bit.

So you'll want to 13B-65B in 3-bit to save memory and run faster for effectively the same quality output, once it is implemented.

For 7B 4bit is practically always best. If you really want to run it in 4GB of memory then 3bit will make it fit at a reduced quality, but not so much as to make it unusable, especially with finetuning.

Some interesting use cases for 4GB inference including running at near native speeds fully in a web browser on any device with WebAssembly and running on the very popular Raspberry Pi 4GB. :)

j-f1 · 2023-03-14T12:25:44Z

Some interesting use cases for 4GB inference including running at near native speeds fully in a web browser on any device with WebAssembly and running on the very popular Raspberry Pi 4GB. :)

Also newish iPhones which allow up to 4080MB of memory use with the “Increased Memory Limit” entitlement!

riverzhou · 2023-04-09T10:26:17Z

Waiting for int8 quantization....

* Split of model loader and context mostly similar to how llama.cpp did that already: falcon_context now contains the falcon_model and a pointer to vocab the model contains the vocab now this allows to now conveniently load more than one context as well as loading multiple models at once A ton of changes - if anything acts funky please report * System prompt improvements Structural improvements on libfalcon * updates * Further adapted system prints for all fine tunes --------- Co-authored-by: John <nolife+git@gmail.com>

ggerganov added the enhancement New feature or request label Mar 12, 2023

gjmulder mentioned this issue Mar 13, 2023

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) #102

Closed

gjmulder mentioned this issue Mar 14, 2023

[Feature request?]: Running larger models without quantization. #118

Closed

prusnak changed the title ~~Improving performance with 8bit?~~ Improving quality with 8bit? Mar 18, 2023

neuhaus closed this as completed Apr 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving quality with 8bit? #53

Improving quality with 8bit? #53

neuhaus commented Mar 12, 2023 •

edited

Loading

gjmulder commented Mar 12, 2023 •

edited

Loading

apollotsantos commented Mar 12, 2023

gjmulder commented Mar 12, 2023

apollotsantos commented Mar 12, 2023 via email

neuhaus commented Mar 12, 2023 •

edited

Loading

apollotsantos commented Mar 12, 2023 via email

neuhaus commented Mar 12, 2023 •

edited

Loading

gjmulder commented Mar 12, 2023 •

edited

Loading

ggerganov commented Mar 12, 2023

neuhaus commented Mar 12, 2023 •

edited

Loading

apollotsantos commented Mar 12, 2023

gjmulder commented Mar 13, 2023 •

edited

Loading

AGSaidi commented Mar 13, 2023

MarkSchmidty commented Mar 14, 2023 •

edited

Loading

MarkSchmidty commented Mar 14, 2023 •

edited

Loading

j-f1 commented Mar 14, 2023

riverzhou commented Apr 9, 2023

Improving quality with 8bit? #53

Improving quality with 8bit? #53

Comments

neuhaus commented Mar 12, 2023 • edited Loading

gjmulder commented Mar 12, 2023 • edited Loading

apollotsantos commented Mar 12, 2023

gjmulder commented Mar 12, 2023

apollotsantos commented Mar 12, 2023 via email

neuhaus commented Mar 12, 2023 • edited Loading

apollotsantos commented Mar 12, 2023 via email

neuhaus commented Mar 12, 2023 • edited Loading

gjmulder commented Mar 12, 2023 • edited Loading

ggerganov commented Mar 12, 2023

neuhaus commented Mar 12, 2023 • edited Loading

apollotsantos commented Mar 12, 2023

gjmulder commented Mar 13, 2023 • edited Loading

AGSaidi commented Mar 13, 2023

MarkSchmidty commented Mar 14, 2023 • edited Loading

MarkSchmidty commented Mar 14, 2023 • edited Loading

j-f1 commented Mar 14, 2023

riverzhou commented Apr 9, 2023

neuhaus commented Mar 12, 2023 •

edited

Loading

gjmulder commented Mar 12, 2023 •

edited

Loading

neuhaus commented Mar 12, 2023 •

edited

Loading

neuhaus commented Mar 12, 2023 •

edited

Loading

gjmulder commented Mar 12, 2023 •

edited

Loading

neuhaus commented Mar 12, 2023 •

edited

Loading

gjmulder commented Mar 13, 2023 •

edited

Loading

MarkSchmidty commented Mar 14, 2023 •

edited

Loading

MarkSchmidty commented Mar 14, 2023 •

edited

Loading