
int8 quantization #312

Closed · wants to merge 4 commits

Conversation

karpathy (Owner)

[WIP] just thought I'd put up my draft PR as I go through this, allowing for people to pitch in / comment as we go along.

Step 1: re-write the model export to do int8 quantization in groups, with an ideal group size and a fallback as needed, and also rework the file header into a much better format. (A rough sketch of the grouped quantization follows the step list below.)

Note: I am currently planning this PR to be a totally breaking change to current behavior. The reason is that quantization is, imo, totally necessary to be practical, and also a "free lunch", especially if we stay around Q8_0. So afaik there is no great need to keep around the old float32 code.

Step 2 (coming up): the run.c changes.
Step 3 (coming up): also the Meta Llama 2 export
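
For readers following along, here is a minimal sketch of what grouped Q8_0-style quantization (Step 1) can look like; the function name, signature, and per-group scale layout are illustrative assumptions, not the PR's actual export code:

    #include <math.h>
    #include <stdint.h>

    // sketch: quantize n floats to int8 in groups of size gs (Q8_0-style),
    // storing one float scale per group; assumes n is a multiple of gs
    void quantize(const float* x, int8_t* q, float* scale, int n, int gs) {
        for (int g = 0; g < n / gs; g++) {
            // find the max absolute value in this group
            float wmax = 0.0f;
            for (int i = 0; i < gs; i++) {
                float v = fabsf(x[g * gs + i]);
                if (v > wmax) wmax = v;
            }
            // map [-wmax, wmax] onto the int8 range [-127, 127]
            float s = wmax / 127.0f;
            scale[g] = s;
            for (int i = 0; i < gs; i++) {
                q[g * gs + i] = (int8_t) roundf(x[g * gs + i] / (s > 0.0f ? s : 1.0f));
            }
        }
    }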

@karpathy (Owner, Author)

Update: this PR is now very out of date; I totally gutted the export and run.c and tied myself into knots about how to implement this without making things ugly. This won't be easy.

@karpathy (Owner, Author)

Update: I got it to work!! Ended up being not too too terrible ugliness-wise. The actual PR will be fairly small. I'll clean it up a bit more and submit the commits.

@karpathy (Owner, Author)

karpathy commented Aug 18, 2023

Ok I pushed my first stab at a grouped int8 implementation, quantizing the activations dynamically, and the weights at model export once. The matmul uses actual int8 arithmetic for the dot products, in configurable groups. (I'm not sure how llama.cpp does it)
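
For reference, a minimal scalar sketch of such a grouped int8 matmul; the function name, signature, and per-group scale layout are assumptions for illustration, not the PR's exact code:

    #include <stdint.h>

    // sketch: out = W @ x, with W (d x n) and x (n) both quantized to int8
    // in groups of gs consecutive elements; xs and ws hold one float scale
    // per group (row-major layout assumed, n a multiple of gs)
    void matmul_q8(float* out, const int8_t* xq, const float* xs,
                   const int8_t* wq, const float* ws, int n, int d, int gs) {
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            int in = i * n; // start of row i in the weight matrix
            for (int j = 0; j <= n - gs; j += gs) {
                // integer dot product within one group
                int32_t ival = 0;
                for (int k = 0; k < gs; k++) {
                    ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
                }
                // rescale the group's integer result back to float
                val += ((float) ival) * xs[j / gs] * ws[(in + j) / gs];
            }
            out[i] = val;
        }
    }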

For my big models I'm getting around 50% speedup, which is a lot less than I was hoping for. I have to run off to work now but I'll be looking into optimizing this over the weekend probably.

I want to try the 7B model too

@karpathy (Owner, Author)

Looking at the code and what's needed, I think the final version will certainly be uglier than our original nice float version. So I'm not sure what to do now. One option is to preserve the float-only legacy version in the root of the repo, something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:

  • runf.c (original float)
  • run.c (int8 quantized)
  • run.cu (cuda version)

But if that is the case we'll almost certainly want to take a lot of the copy-paste code between them and shuttle it into common files that can be imported. Maybe this is okay...

@ggerganov

The matmul uses actual int8 arithmetic for the dot products, in configurable groups. (I'm not sure how llama.cpp does it)

In llama.cpp, the integer arithmetic is pretty much the same as in this PR, but we manually write the SIMD intrinsics to make it more efficient (but way more ugly :) instead of relying on the compiler to do it.

For Q8 quantizations we ended up using group size of 32. It seemed like the best fit for most types of SIMD instructions, tensor shapes and accuracy. The value is a compile-time constant, so we don't change it at run-time. Not sure if it is the best group size - probably for 8-bit one can easily increase it up to 128 or 256 without losing too much accuracy, but it gets tricky when tensor dimensions are not a multiple of GS.
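
A trivial sketch of one way to handle dimensions that are not a multiple of the preferred group size, in the spirit of the "fallback" mentioned in Step 1 (the helper name is hypothetical, not from this PR or llama.cpp):

    // sketch: fall back to the largest power-of-two group size <= preferred
    // that evenly divides the tensor dimension n
    int pick_group_size(int n, int preferred) {
        int gs = preferred;
        while (gs > 1 && n % gs != 0) {
            gs /= 2;
        }
        return gs;
    }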

@karpathy (Owner, Author)

Got it (and ty for stopping by @ggerganov !). I'll play with the group size but I expect I will settle on 32 in that case too.

I actually don't mind some ugliness if it is contained only to the matmul function, because it's easy to give a simple algorithmic reference, and it's also a well-understood operation, so the optimizations don't "get in the way" of understanding too much, imo.

So I'll most likely end up doing some SIMD hacking too (I'll look at llama.cpp for reference). I'm slightly nervous that one suddenly has to introduce 10 IFDEFs for all the different platforms and instruction-set levels, but ah well.
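
For context, the kind of per-platform dispatch being alluded to typically looks something like the skeleton below; this is a generic illustration with scalar placeholders, not code from this PR or llama.cpp:

    #include <stdint.h>

    // skeleton of compile-time SIMD dispatch around a hot loop; each branch
    // would hold platform-specific intrinsics, with scalar code as fallback
    static int32_t dot_i8(const int8_t* a, const int8_t* b, int n) {
        int32_t acc = 0;
    #if defined(__AVX2__)
        // x86 path: AVX2 intrinsics would go here
        for (int i = 0; i < n; i++) acc += (int32_t) a[i] * (int32_t) b[i];
    #elif defined(__ARM_NEON)
        // ARM path: NEON intrinsics would go here
        for (int i = 0; i < n; i++) acc += (int32_t) a[i] * (int32_t) b[i];
    #else
        // portable scalar fallback
        for (int i = 0; i < n; i++) acc += (int32_t) a[i] * (int32_t) b[i];
    #endif
        return acc;
    }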

@ggerganov

Yup, I totally get it - been through some very similar dilemmas with llama.cpp :)
I learned most of the SIMD stuff that I know thanks to that project, so it is a great learning experience. It's easy enough to dig into the low-level CPU optimizations.

I keep an eye on the progress in this repo and enjoy reading the issues / PRs - great work!

@byte-6174 (Contributor)

byte-6174 commented Aug 18, 2023

something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:

I vote for keeping the original float code intact. For someone coming in new to the repo, it is clean and easy to follow. Optimizations can go in another file.

int j;
for (j = 0; j <= n - GS; j += GS) {
    for (int k = 0; k < GS; k++) {
        ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
Contributor:

Maybe this would be faster, due to better SIMD:

Suggested change
-        ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
+        ival += ((int16_t) xq[j + k]) * ((int16_t) wq[in + j + k]);

Owner Author (karpathy):

on my linuxbox this does nothing

Contributor:

Whether speed improves or not depends, as always. It's logically more correct IMHO. It seems that gcc figures out that mine is actually equivalent to yours and already optimizes it "my" way with vpmullw:
https://godbolt.org/z/TbdoxT7TT

Maybe VPDPBUSD would be faster:
https://en.wikichip.org/wiki/x86/avx512_4vnniw

Contributor:

Hmm, it may be possible to coax VPDPBUSD out of the compiler either way:
https://godbolt.org/z/ccahMeTn5

using the "official" idiom:
https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-llvm-gcc-cpus-gpus.html#gs.411p4x
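
For reference, the idiom in question is essentially an unsigned-by-signed byte dot product accumulated into int32, which VNNI-aware compilers can lower to VPDPBUSD. A minimal sketch follows; note that VPDPBUSD takes one unsigned and one signed operand, so applying it to this PR's signed-by-signed int8 case would need an extra offset/correction step:

    #include <stdint.h>

    // sketch: u8 x s8 dot product accumulated in int32; with AVX512-VNNI
    // enabled, compilers can auto-vectorize this loop using VPDPBUSD
    int32_t dot_u8s8(const uint8_t* a, const int8_t* b, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (int32_t) a[i] * (int32_t) b[i];
        }
        return acc;
    }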

@ozabluda (Contributor)

Using BS=32 would prevent AVX-512 VPDPBUSD and, I think, would prevent optimal GPU code, which are all 512 bits wide [1]. BS=64 is better cache-aligned on CPUs, and 128 would be better cache-aligned on GPUs.

BTW, since the file format is being changed, it's a good time to make sure all the weights are at least 64-byte cache-line aligned for mmap (or maybe 128-byte for some kind of GPUDirect Storage?). I'll add a sample suggestion to the code in this PR.

[1] I think. They used to be; not 100% sure about now. Not an issue for llama.cpp right now, because on Mac the SIMD registers are 128 bits.
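
As a rough illustration of the alignment idea (separate from the sample suggestion promised above), padding the export file after the header so the weight blob starts on a cache-line boundary could look like this; the helper name is hypothetical:

    #include <stdio.h>

    // sketch: pad the checkpoint file with zeros up to the next align-byte
    // boundary, so that the mmap'ed weights start cache-line aligned
    static void pad_to_alignment(FILE* f, long align) {
        long pos = ftell(f);
        long pad = (align - (pos % align)) % align;
        for (long i = 0; i < pad; i++) {
            fputc(0, f);
        }
    }

e.g. calling pad_to_alignment(f, 64) right after the header is written.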

@karpathy (Owner, Author)

"BTW, since file format is being changed, it's a good time to make sure all the weights are at least 64 byte cache-line aligned for mmap (or maybe 128 byte for some kind of GPUDirect Storage"?) I'll add a sample suggestion to the code in this PR."

@ozabluda I'd like this. I tried to make the header exactly a nice round size, as an example. And I believe with the current version things should be fairly well aligned?

@efocht

efocht commented Aug 21, 2023

Looking at the code and what's needed, I think the final version will certainly be uglier than our original nice float version. So I'm not sure what to do now. One option is to preserve the float only legacy version in root of the repo, something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:

  • runf.c (original float)
  • run.c (int8 quantized)
  • run.cu (cuda version)

But if that is the case we'll almost certainly want to take a lot of the copy paste code between them and shuttle it into common files that can be imported. Maybe this is okay...

Yes, please please keep the original floating point version! I've seen the patch in this PR and was worried you'd throw it away. On the SX-Aurora vector engine porting the fp32 version is trivial, and an fp16/bf16 version also makes sense for other CPUs like Sapphire Rapids.

@karpathy (Owner, Author)

Closing this PR as we now have attempt #2 at int8 quantization in #364.
