
int8 quantization #312

Closed · wants to merge 4 commits

Conversation

karpathy (Owner)

[WIP] just thought I'd put up my draft PR as I go through this, allowing for people to pitch in / comment as we go along.

Step 1: re-write the model export to do int8 quantization in groups, with an ideal group size and a fallback as needed, and also rework the file header into a much better format. (A rough sketch of the grouped quantization follows the step list below.)

Note: I am currently planning this PR to be a totally breaking change to current behavior. The reason is that quantization is, imo, totally necessary to be practical, and also a "free lunch", especially if we stay around Q8_0. So afaik there is no great need to keep around the old float32 code.

Step 2 (coming up): the run.c changes.
Step 3 (coming up): also the Meta Llama 2 export
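
For readers following along, here is a minimal sketch of what grouped Q8_0-style quantization (Step 1) can look like; the function name, signature, and per-group scale layout are illustrative assumptions, not the PR's actual export code:

    #include <math.h>
    #include <stdint.h>

    // sketch: quantize n floats to int8 in groups of size gs (Q8_0-style),
    // storing one float scale per group; assumes n is a multiple of gs
    void quantize(const float* x, int8_t* q, float* scale, int n, int gs) {
        for (int g = 0; g < n / gs; g++) {
            // find the max absolute value in this group
            float wmax = 0.0f;
            for (int i = 0; i < gs; i++) {
                float v = fabsf(x[g * gs + i]);
                if (v > wmax) wmax = v;
            }
            // map [-wmax, wmax] onto the int8 range [-127, 127]
            float s = wmax / 127.0f;
            scale[g] = s;
            for (int i = 0; i < gs; i++) {
                q[g * gs + i] = (int8_t) roundf(x[g * gs + i] / (s > 0.0f ? s : 1.0f));
            }
        }
    }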

@karpathy (Owner, Author)

Update: this PR is now very out of date; I totally gutted the export and run.c and tied myself into knots about how to implement this without making things ugly. This won't be easy.

@karpathy (Owner, Author)

Update: I got it to work!! Ended up being not too too terrible ugliness-wise. The actual PR will be fairly small. I'll clean it up a bit more and submit the commits.

@karpathy (Owner, Author)

karpathy commented Aug 18, 2023

Ok I pushed my first stab at a grouped int8 implementation, quantizing the activations dynamically, and the weights at model export once. The matmul uses actual int8 arithmetic for the dot products, in configurable groups. (I'm not sure how llama.cpp does it)
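
For reference, a minimal scalar sketch of such a grouped int8 matmul; the function name, signature, and per-group scale layout are assumptions for illustration, not the PR's exact code:

    #include <stdint.h>

    // sketch: out = W @ x, with W (d x n) and x (n) both quantized to int8
    // in groups of gs consecutive elements; xs and ws hold one float scale
    // per group (row-major layout assumed, n a multiple of gs)
    void matmul_q8(float* out, const int8_t* xq, const float* xs,
                   const int8_t* wq, const float* ws, int n, int d, int gs) {
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            int in = i * n; // start of row i in the weight matrix
            for (int j = 0; j <= n - gs; j += gs) {
                // integer dot product within one group
                int32_t ival = 0;
                for (int k = 0; k < gs; k++) {
                    ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
                }
                // rescale the group's integer result back to float
                val += ((float) ival) * xs[j / gs] * ws[(in + j) / gs];
            }
            out[i] = val;
        }
    }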

For my big models I'm getting around 50% speedup, which is a lot less than I was hoping for. I have to run off to work now but I'll be looking into optimizing this over the weekend probably.

I want to try the 7B model too

@karpathy (Owner, Author)

Looking at the code and what's needed, I think the final version will certainly be uglier than our original nice float version. So I'm not sure what to do now. One option is to preserve the float-only legacy version in the root of the repo, something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:

  • runf.c (original float)
  • run.c (int8 quantized)
  • run.cu (cuda version)

But if that is the case we'll almost certainly want to take a lot of the copy-paste code between them and shuttle it into common files that can be imported. Maybe this is okay...

@ggerganov

The matmul uses actual int8 arithmetic for the dot products, in configurable groups. (I'm not sure how llama.cpp does it)

In llama.cpp, the integer arithmetic is pretty much the same as in this PR, but we manually write the SIMD intrinsics to make it more efficient (but way more ugly :) instead of relying on the compiler to do it.

For Q8 quantizations we ended up using group size of 32. It seemed like the best fit for most types of SIMD instructions, tensor shapes and accuracy. The value is a compile-time constant, so we don't change it at run-time. Not sure if it is the best group size - probably for 8-bit one can easily increase it up to 128 or 256 without losing too much accuracy, but it gets tricky when tensor dimensions are not a multiple of GS.
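
A trivial sketch of one way to handle dimensions that are not a multiple of the preferred group size, in the spirit of the "fallback" mentioned in Step 1 (the helper name is hypothetical, not from this PR or llama.cpp):

    // sketch: fall back to the largest power-of-two group size <= preferred
    // that evenly divides the tensor dimension n
    int pick_group_size(int n, int preferred) {
        int gs = preferred;
        while (gs > 1 && n % gs != 0) {
            gs /= 2;
        }
        return gs;
    }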

@karpathy (Owner, Author)

Got it (and ty for stopping by @ggerganov !). I'll play with the group size but I expect I will settle on 32 in that case too.

I actually don't mind some ugliness if it is contained only to the matmul function, because it's easy to give a simple algorithmic reference, and it's also a well-understood operation, so the optimizations don't "get in the way" of understanding too much, imo.

So I'll most likely end up doing some SIMD hacking too (I'll look at llama.cpp for reference). I'm slightly nervous that one suddenly has to introduce 10 IFDEFs for all the different platforms and instruction-set levels, but ah well.
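
For context, the kind of per-platform dispatch being alluded to typically looks something like the skeleton below; this is a generic illustration with scalar placeholders, not code from this PR or llama.cpp:

    #include <stdint.h>

    // skeleton of compile-time SIMD dispatch around a hot loop; each branch
    // would hold platform-specific intrinsics, with scalar code as fallback
    static int32_t dot_i8(const int8_t* a, const int8_t* b, int n) {
        int32_t acc = 0;
    #if defined(__AVX2__)
        // x86 path: AVX2 intrinsics would go here
        for (int i = 0; i < n; i++) acc += (int32_t) a[i] * (int32_t) b[i];
    #elif defined(__ARM_NEON)
        // ARM path: NEON intrinsics would go here
        for (int i = 0; i < n; i++) acc += (int32_t) a[i] * (int32_t) b[i];
    #else
        // portable scalar fallback
        for (int i = 0; i < n; i++) acc += (int32_t) a[i] * (int32_t) b[i];
    #endif
        return acc;
    }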

@ggerganov

Yup, I totally get it - been through some very similar dilemmas with llama.cpp :)
I learned most of the SIMD stuff that I know thanks to that project, so it is a great learning experience. It's easy enough to dig into the low-level CPU optimizations.

I keep an eye on the progress in this repo and enjoy reading the issues / PRs - great work!

@byte-6174 (Contributor)

byte-6174 commented Aug 18, 2023

something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:

I vote for keeping the original float code intact. For someone coming in new to the repo, it is clean and easy to follow. Optimizations can go in another file.

int j;
for (j = 0; j <= n - GS; j += GS) {
    for (int k = 0; k < GS; k++) {
        ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
Contributor:

Maybe this would be faster, due to better SIMD:

Suggested change
-        ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
+        ival += ((int16_t) xq[j + k]) * ((int16_t) wq[in + j + k]);

Owner Author (karpathy):

on my linuxbox this does nothing

Contributor:

Whether speed improves or not depends, as always. It's logically more correct IMHO. It seems that gcc figures out that mine is actually equivalent to yours and already optimizes it "my" way with vpmullw:
https://godbolt.org/z/TbdoxT7TT

Maybe VPDPBUSD would be faster:
https://en.wikichip.org/wiki/x86/avx512_4vnniw

Contributor:

Hmm, it may be possible to coax VPDPBUSD out of the compiler either way:
https://godbolt.org/z/ccahMeTn5

using the "official" idiom:
https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-llvm-gcc-cpus-gpus.html#gs.411p4x
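
For reference, the idiom in question is essentially an unsigned-by-signed byte dot product accumulated into int32, which VNNI-aware compilers can lower to VPDPBUSD. A minimal sketch follows; note that VPDPBUSD takes one unsigned and one signed operand, so applying it to this PR's signed-by-signed int8 case would need an extra offset/correction step:

    #include <stdint.h>

    // sketch: u8 x s8 dot product accumulated in int32; with AVX512-VNNI
    // enabled, compilers can auto-vectorize this loop using VPDPBUSD
    int32_t dot_u8s8(const uint8_t* a, const int8_t* b, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (int32_t) a[i] * (int32_t) b[i];
        }
        return acc;
    }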

@ozabluda (Contributor)

Using BS=32 would prevent AVX-512 VPDPBUSD and, I think, would prevent optimal GPU code, which are all 512 bits wide [1]. BS=64 is better cache-aligned on CPUs, and 128 would be better cache-aligned on GPUs.

BTW, since the file format is being changed, it's a good time to make sure all the weights are at least 64-byte cache-line aligned for mmap (or maybe 128-byte for some kind of GPUDirect Storage?). I'll add a sample suggestion to the code in this PR.

[1] I think. They used to be; not 100% sure about now. Not an issue for llama.cpp right now, because on Mac the SIMD registers are 128 bits.
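
As a rough illustration of the alignment idea (separate from the sample suggestion promised above), padding the export file after the header so the weight blob starts on a cache-line boundary could look like this; the helper name is hypothetical:

    #include <stdio.h>

    // sketch: pad the checkpoint file with zeros up to the next align-byte
    // boundary, so that the mmap'ed weights start cache-line aligned
    static void pad_to_alignment(FILE* f, long align) {
        long pos = ftell(f);
        long pad = (align - (pos % align)) % align;
        for (long i = 0; i < pad; i++) {
            fputc(0, f);
        }
    }

e.g. calling pad_to_alignment(f, 64) right after the header is written.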

@karpathy (Owner, Author)

"BTW, since file format is being changed, it's a good time to make sure all the weights are at least 64 byte cache-line aligned for mmap (or maybe 128 byte for some kind of GPUDirect Storage"?) I'll add a sample suggestion to the code in this PR."

@ozabluda I'd like this. I tried to make the header exactly a nice round size, as an example. And I believe with the current version things should be fairly well aligned?

@efocht

efocht commented Aug 21, 2023

Looking at the code and what's needed, I think the final version will certainly be uglier than our original nice float version. So I'm not sure what to do now. One option is to preserve the float only legacy version in root of the repo, something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:

  • runf.c (original float)
  • run.c (int8 quantized)
  • run.cu (cuda version)

But if that is the case we'll almost certainly want to take a lot of the copy paste code between them and shuttle it into common files that can be imported. Maybe this is okay...

Yes, please please keep the original floating point version! I've seen the patch in this PR and was worried you'd throw it away. On the SX-Aurora vector engine porting the fp32 version is trivial, and an fp16/bf16 version also makes sense for other CPUs like Sapphire Rapids.

@karpathy (Owner, Author)

Closing this PR as we now have attempt #2 at int8 quantization in #364.
