int8 quantization #312
Conversation
Update: this PR is now very out of date, I totally gutted the export and run.c and tied myself into knots about how to implement this without making things ugly. This won't be easy.
Update: I got it to work!! Ended up being not too too terrible ugliness-wise. The actual PR will be fairly small. I'll clean it up a bit more and submit the commits.
Ok I pushed my first stab at a grouped int8 implementation, quantizing the activations dynamically, and the weights at model export once. The matmul uses actual int8 arithmetic for the dot products, in configurable groups. (I'm not sure how llama.cpp does it) For my big models I'm getting around 50% speedup, which is a lot less than I was hoping for. I have to run off to work now but I'll be looking into optimizing this over the weekend probably. I want to try the 7B model too.
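For illustration, here is a minimal sketch of what "quantize in groups" can look like: each group of GS floats gets one float scale and the values are rounded to int8 (Q8_0-style). The helper name and the exact scale formula are assumptions for this sketch, not necessarily the code in this PR.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical helper: symmetrically quantize one group of GS floats to int8
// with a single per-group float scale. Dequantization is simply xq[i] * scale.
void quantize_group(const float* x, int8_t* xq, float* scale, int GS) {
    // find the largest magnitude in the group
    float wmax = 0.0f;
    for (int i = 0; i < GS; i++) {
        float v = fabsf(x[i]);
        if (v > wmax) wmax = v;
    }
    // map [-wmax, wmax] onto the int8 range [-127, 127]
    float s = wmax / 127.0f;
    if (s == 0.0f) s = 1.0f; // avoid division by zero for an all-zero group
    for (int i = 0; i < GS; i++) {
        xq[i] = (int8_t) roundf(x[i] / s);
    }
    *scale = s;
}
```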
Looking at the code and what's needed, I think the final version will certainly be uglier than our original nice float version. So I'm not sure what to do now. One option is to preserve the float only legacy version in root of the repo, something like runf.c, and the export code as it was, and layer this on top. So we'd have a few configurations:
But if that is the case we'll almost certainly want to take a lot of the copy paste code between them and shuttle it into common files that can be imported. Maybe this is okay...
In llama.cpp, the integer arithmetic is pretty much the same as in this PR, but we manually write the SIMD intrinsics to make it more efficient (but way more ugly :) instead of relying on the compiler to do it. For …
Got it (and ty for stopping by @ggerganov !). I'll play with the group size but I expect I will settle on 32 in that case too. I actually don't mind some ugliness if it is contained only to the matmul function, because it's easy to give a simple algorithmic reference, and it's also a well-understood operation, so the optimizations don't "get in the way" of understanding too much, imo. So I'll most likely end up looking around for some SIMD hacking too (will look at llama.cpp for references). I'm slightly nervous about whether one has to suddenly introduce 10 IFDEFs for all the different platform versions and supports, but ah well.
Yup, I totally get it - been through some very similar dilemmas with llama.cpp :) I keep an eye on the progress in this repo and enjoy reading the issues / PRs - great work!
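To make the SIMD idea above concrete, here is one hedged sketch of how the inner group dot product might be hand-vectorized with AVX2 intrinsics (assuming GS = 32 and signed int8 on both sides; this is not llama.cpp's code nor this PR's, just one possible shape of it): widen int8 to int16 and let _mm256_madd_epi16 do the multiply-accumulate.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch: dot product of one GS=32 group of signed int8 values using AVX2.
// Illustrative only; compile with -mavx2.
static inline int32_t dot_group_avx2(const int8_t* xq, const int8_t* wq) {
    // load 32 signed bytes from each operand (unaligned loads)
    __m256i x8 = _mm256_loadu_si256((const __m256i*) xq);
    __m256i w8 = _mm256_loadu_si256((const __m256i*) wq);
    // sign-extend the low and high 16 bytes to 16-bit lanes
    __m256i xlo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(x8));
    __m256i xhi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(x8, 1));
    __m256i wlo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(w8));
    __m256i whi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(w8, 1));
    // multiply 16-bit lanes and add adjacent pairs into 32-bit lanes
    __m256i acc = _mm256_add_epi32(_mm256_madd_epi16(xlo, wlo),
                                   _mm256_madd_epi16(xhi, whi));
    // horizontal sum of the eight 32-bit lanes
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_hadd_epi32(s, s);
    s = _mm_hadd_epi32(s, s);
    return _mm_cvtsi128_si32(s);
}
```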
I vote for keeping the original float code intact. For someone coming in new to the repo, it is clean and easy to follow. Opts can go in another file.
int j;
for (j = 0; j <= n - GS; j += GS) {
    for (int k = 0; k < GS; k++) {
        ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
Maybe this would be faster, due to better SIMD:
ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);   // current
ival += ((int16_t) xq[j + k]) * ((int16_t) wq[in + j + k]);   // suggested
On my Linux box this does nothing.
Whether speed improves or not depends, as always. It's logically more correct IMHO. It seems that gcc figures out that mine is actually equivalent to yours and already optimizes "my" way with vpmullw: https://godbolt.org/z/TbdoxT7TT. Maybe VPDPBUSD would be faster (https://en.wikichip.org/wiki/x86/avx512_4vnniw).
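For reference, a scalar sketch spelling out the range argument behind that equivalence (the sketch itself is an illustration, not code from the thread):

```c
#include <stdint.h>

// Both operands come from int8_t, so each product lies in [-16256, 16384],
// which fits in 16 bits; only the running sum needs 32 bits. In C both casts
// are promoted to int before the multiply anyway, so the two forms are
// semantically identical -- the int16_t cast is just a clearer hint that a
// 16-bit vector multiply (vpmullw) is sufficient.
int32_t dot_group_scalar(const int8_t* xq, const int8_t* wq, int GS) {
    int32_t ival = 0;
    for (int k = 0; k < GS; k++) {
        ival += ((int16_t) xq[k]) * ((int16_t) wq[k]);  // or (int32_t) casts
    }
    return ival;
}
```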
Hmm, it may be possible to extract VPDPBUSD from a compiler following either way (https://godbolt.org/z/ccahMeTn5) using the "official" idiom: https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-llvm-gcc-cpus-gpus.html#gs.411p4x

Using BS=32 would prevent AVX-512 [1]. BTW, since the file format is being changed, it's a good time to make sure all the weights are at least 64-byte cache-line aligned for mmap (or maybe 128-byte for some kind of GPUDirect Storage?). I'll add a sample suggestion to the code in this PR.

[1] I think. They used to be, not 100% sure about now. Not an issue for llama.cpp right now, because on Mac SIMD registers are 128 bits.
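As a rough illustration of the idiom the Intel article describes (an assumption about what compilers can pattern-match into VPDPBUSD, not code from this PR): the instruction multiplies unsigned 8-bit by signed 8-bit values and accumulates into 32-bit lanes, so one operand has to be unsigned.

```c
#include <stdint.h>

// Sketch of a VNNI-friendly dot product: VPDPBUSD is unsigned(8) x signed(8)
// accumulated into int32 lanes, hence the uint8_t operand here. Whether a given
// compiler actually emits vpdpbusd for this loop depends on target flags
// (e.g. -mavx512vnni or -mavxvnni) and on its vectorizer.
int32_t dot_u8_s8(const uint8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t) a[i] * (int32_t) b[i];
    }
    return acc;
}
```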
"BTW, since file format is being changed, it's a good time to make sure all the weights are at least 64 byte cache-line aligned for mmap (or maybe 128 byte for some kind of GPUDirect Storage"?) I'll add a sample suggestion to the code in this PR." @ozabluda I'd like this. I tried to make the header be exactly a nice number, as an example. And I believe with the current version things should be fairly aligned? |
Yes, please please keep the original floating point version! I've seen the patch of this PR and was worried you'd throw it away. On the SX-Aurora vector engine porting the fp32 version is trivial, and a fp16/bf16 version also makes sense for other CPUs like Sapphire Rapids.
Closing this PR as we now have attempt #2 at int8 quantization at #364.
[WIP] just thought I'd put up my draft PR as I go through this, allowing for people to pitch in / comment as we go along.
Step 1: re-write the model export to do int8 quantization in groups, with an ideal group size and a fallback as needed, and also change the header to be much better.
Note: I am currently planning this PR to be a totally breaking change to current behavior. The reason for this is that quantization is, imo, totally necessary to be practical, and also a "free lunch", especially if we stick around Q8_0. So afaik there is no great need to keep around the old float32 code.
Step 2 (coming up): the run.c changes.

Step 3 (coming up): also the Meta Llama 2 export.
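Regarding the "ideal group size and a fallback as needed" in Step 1, one hypothetical fallback policy (illustrative only, not the PR's actual logic) is to halve the group size until it divides the weight dimension:

```c
// Hypothetical fallback: if the ideal group size does not evenly divide a
// weight dimension, halve it until it does (worst case GS = 1, i.e. a scale
// per element). Name and policy are illustrative, not this PR's exact code.
int choose_group_size(int dim, int ideal_gs) {
    int gs = ideal_gs;
    while (gs > 1 && dim % gs != 0) {
        gs /= 2;
    }
    return gs;
}
```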