GPTQ Quantization (3-bit and 4-bit) #9
Comments
The GitHub Issue for text-generation-webui's implementation of GPTQ-for-LLaMA may also be a helpful reference. |
There’s no real benefit to using this method for 4-bit. Would be neat to see 3-bit or 2-bit attempts though. |
Well, there is likely a minor, currently unknown (pending benchmarks) benefit to GPTQ for 4-bit, yes? Additionally, once 4-bit GPTQ is implemented, 3-bit and 2-bit are not much additional work and could have much larger benefits to VRAM/URAM consumption on top of the current 4-bit implementation, with potentially very little (if acceptable) output quality loss. |
WebAssembly implementation is blocked pending 3-bit inference due to WASM's 4GB memory constraint. |
I had a quick glance at the GPTQ paper yesterday, but haven't dug into details yet. Do you think it is possible to demonstrate a simple routine for performing quantization using this method?
// src - input 32-bit floats
// dst - output quantized data
// n - number of input floats
void quantize_gptq(float * src, void * dst, int n);
If I can get a prototype of this and it does not look too complex, I can try to plug it in |
@zoidbb or @qwopqwop200 might have an answer for the question above. |
The actual quantization algorithm (spread across that file and another) seems to be a little hairy, and uses some nontrivial linear algebra (Cholesky decomposition) that we'd either have to reimplement or pull in another dependency for (LAPACK?). However, if I read the CUDA kernels that they have for evaluation correctly, the format in which the quantized weights wind up is more or less equivalent to the Q4_1 mode (4-bit quantization with blockwise f16 zero offset) that we already have support for, though there currently is no AVX2 implementation for that mode. Since someone has uploaded precomputed GPTQ weights to Huggingface, it might be worthwhile to start out by implementing the right form of accelerated Q4_1 evaluation plus some code to directly convert the Huggingface pickles into an appropriate format for us. |
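To make the Q4_1-style format mentioned above concrete, here is a minimal sketch of blockwise 4-bit quantization with a per-block scale and offset, using plain round-to-nearest. It is an illustration only, not the llama.cpp or GPTQ code: the struct name `block_q4_1_sketch`, both function names, and the nibble packing order are hypothetical, and `float` is used for the scale and offset to keep the example portable C, whereas the comment above describes f16 offsets. The block size of 32 follows the QK=32 value mentioned elsewhere in this thread.

```c
#include <stdint.h>

#define QK 32  /* weights per block (QK=32, as mentioned in this thread) */

/* Hypothetical layout for illustration: float stands in for f16. */
typedef struct {
    float   d;           /* step size (scale) */
    float   m;           /* block minimum (offset) */
    uint8_t qs[QK / 2];  /* QK 4-bit codes, two per byte */
} block_q4_1_sketch;

/* Round-to-nearest quantization of one block: x[i] ~= d * q[i] + m. */
static void quantize_block_q4_1(const float *x, block_q4_1_sketch *b) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < QK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    b->d = (hi - lo) / 15.0f;  /* 16 levels span [lo, hi] */
    b->m = lo;
    const float id = (b->d != 0.0f) ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < QK; i += 2) {
        /* (x - lo) * id is non-negative, so adding 0.5 and truncating rounds */
        int q0 = (int)((x[i]     - lo) * id + 0.5f);
        int q1 = (int)((x[i + 1] - lo) * id + 0.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        b->qs[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
}

/* Inverse mapping: unpack the two codes in each byte and rescale. */
static void dequantize_block_q4_1(const block_q4_1_sketch *b, float *y) {
    for (int i = 0; i < QK; i += 2) {
        y[i]     = b->d * (float)(b->qs[i / 2] & 0x0F) + b->m;
        y[i + 1] = b->d * (float)(b->qs[i / 2] >> 4)   + b->m;
    }
}
```

With round-to-nearest, the reconstruction error of each weight is at most about half a step (`d / 2`). GPTQ, as described in this thread, would choose the codes differently (data-dependent, with weight reconstruction), but the stored result could in principle use the same kind of layout.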
https://twitter.com/NolanoOrg/status/1635409631530057728 Q4_0 mode will be worse with GPTQ for 7B than current Round-to-nearest quantization approach. In Q4_1 and 13B it can not only reduce RAM (by changing bin size Also 3-bit 13B GPTQ will perform better than 7B at FP16. Disclaimer - these were observed on a small subset of WikiText and Penn TreeBank (following GPTQ). |
Also, the above has results on both Q4_0 and Q4_1. |
I have no understanding of GPTQ quantization algorithms. |
According to this paper, 3 or 2 bit quantization is not a very good idea. |
That does not consider grouping/binning of less than 64 and data-dependent quantization with weight reconstruction, which is already being used now (with QK=32, i.e. a bin of size 32) |
RTN is the one currently being used and it is the go-to baseline (not the best, but decent for int4). But empirically, when zeroOffset is fixed, GPTQ does worse than RTN on 7B LLaMA. |
I'm curious what your actual benchmark results were. A handful of use cases are blocked pending fitting 7B with inference into 4GB of RAM, including LLaMA in WebAssembly (which has 32bit addressing with 4GB max address space) and LLaMA on Raspberry Pi. Aside from that, the benefits of GPTQ seem to go up with model size; like potentially enabling both 16GB and 32GB devices to move up one entire model size as well as increasingly better performance. |
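The 4 GB concern above can be checked with rough arithmetic. The sketch below is not from the thread: `quantized_bytes` is a hypothetical helper, "7B" is taken as a nominal 7,000,000,000 weights (the real parameter count differs somewhat), and the 4 bytes of per-block overhead (an f16 scale plus an f16 offset for every 32 weights) is an assumption about the storage layout. It also ignores activations, the KV cache, and any tensors kept at higher precision.

```c
#include <stdint.h>

/* Bytes needed to store n_weights as `bits`-bit codes in blocks of `block`
 * weights, where each block also carries `overhead_bytes` of scale/offset.
 * Hypothetical helper for back-of-the-envelope estimates only. */
static uint64_t quantized_bytes(uint64_t n_weights, unsigned bits,
                                unsigned block, unsigned overhead_bytes) {
    uint64_t n_blocks  = (n_weights + block - 1) / block;  /* round up */
    uint64_t code_bits = n_blocks * block * bits;
    return (code_bits + 7) / 8 + n_blocks * overhead_bytes;
}
```

Under these assumptions, `quantized_bytes(7000000000ULL, 4, 32, 4)` gives 4,375,000,000 bytes, which is above the 4,294,967,296 bytes addressable with 32 bits, while `quantized_bytes(7000000000ULL, 3, 32, 4)` gives 3,500,000,000 bytes, which leaves some headroom for everything else.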
Depends on what you mean by benchmark:
|
I should be out with at least one of the two (int-3 quant or quant 4 kernels and their running time) by tomorrow - will share the code once I am done. |
Those 3bit graphs look better than I expected actually. This is quite promising. Thanks for your contributions @Ayushk4! |
I'm curious if anyone has tried LLaMA on an 8GB RPi. If not, I might be the first. |
Please post your results in the issue for that: #58 |
It sure looks interesting. |
I just committed some code to get AVX2-accelerated Q4_1 inference into a new branch. Since I've never written any sort of SIMD code before, this was a bit of a learning experience, and I can't guarantee its optimality. As it stands, I think it's somewhere around 50% slower than Q4_0 on the 7B model. (I get 300~400ms/token, whereas I had 200~300 on Q4_0.) However, it's still significantly faster than the unvectorised implementation that was there before, which was more in the region of a second or two per token, and in fact seems to make the difference between Q4_1 being academic and tolerable. @ggerganov @Const-me Care to take a look? |
@blackhole89 You can probably combine xsum/ysum into a single vector, like that:
Similarly, combine the multipliers:
These instructions are cheap, but they are saving Another thing, you should definitely remove the Also, you’re loading from these float pointers multiple times to get the same values. |
Yeah, haven't looked how 3-bit works, but I think it will be a bit more difficult to achieve. Maybe at a later stage when we have the 4bit stuff properly integrated |
Just FYI I would not trust these models. They were converted with the very first initial commit version of GPTQ-for-llama and will likely cause numerous problems if they even behave slightly. |
Well, let's keep an eye out for a better conversion of a GPTQ HF model; once we have one, is the procedure to convert it the same, using the updated script? |
Existing converter script for gptq->ggml doesn't work for this model: https://huggingface.co/elinas/alpaca-30b-lora-int4/tree/main
(I know it's early days and this is just some random model from HF, but just posting in case it's helpful data or shows a new failure case 🤷) |
the 4-bit gptq models seem to work fine in llama.cpp and anecdotally produce marginally better results, however i havent done any proper perplexity testing or such yet. also i cannot run 65b properly because i run out of ram. if someone with better pc want to try 4b 65b gptq #382 (comment) i would be interested how that works out |
the integration in llama.cpp of gptq is not ready yet, you cannot load it directly |
See #9 (comment). Seems to work just fine. |
It seems a superior GPTQ approach already exists (from the authors of the original GPTQ paper). @ggerganov, @blackhole89 care to take a look? |
From @Tom-Neverwinter in #1411:
|
@ggerganov Thanks for this code. I enjoy running it on my computer. I have an observation on compatibility with GPTQ, or any asymmetric uniform quantizers. Don't we want the number zero preserved for neural nets? I expect dequantize(quantize(0)) to return 0, but I don't believe q4_1 does. The q4_1 quantization code and the q4_1 dequantization code don't seem to preserve zero. The floating-point bias ( You must have a good reason to use a floating-point bias, and I may have misread the code. I'm also aware of the better perplexity due to q4_1 vs. q4_0 (symmetric). I wonder what your thoughts are regarding q4_1 vs. other asymmetric quantizers, which evaluate
|
Not sure we want such a requirement. The quantized values are used only during matrix multiplication. They are not directly used to compute the activations. I could be wrong, but my intuition so far from the experiments is that quantization quality mainly depends on the amount of data that you effectively use. The specific approach for representing that data has second-order effects. In any case, it is very easy to add a zero-preserving quantization and see the effect on the perplexity. |
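For readers following the zero-preservation question, here is a small sketch contrasting the two conventions being discussed. It is not the llama.cpp code and not the pseudo code from this thread: both function names are hypothetical, the first assumes the q4_1 scheme can be read as `x ~= d * q + m` with a floating-point offset `m`, and the second assumes the common integer zero-point form `x ~= d * (q - z)`. Both assume `lo <= 0 <= hi` and `hi > lo`.

```c
/* Round to the nearest integer and clamp to the 4-bit range [0, 15].
 * Negative inputs end up clamped to 0. */
static int round_clamp4(float v) {
    int q = (int)(v + 0.5f);
    return q < 0 ? 0 : (q > 15 ? 15 : q);
}

/* Floating-point offset: quantize then dequantize x for the range [lo, hi].
 * Zero maps back to zero only if -lo / d happens to be an integer. */
static float roundtrip_float_bias(float x, float lo, float hi) {
    float d = (hi - lo) / 15.0f;
    int   q = round_clamp4((x - lo) / d);
    return d * (float)q + lo;
}

/* Integer zero-point: the offset is itself a 4-bit code z, so x = 0
 * quantizes to q = z and dequantizes to d * (z - z) = 0 exactly. */
static float roundtrip_zero_point(float x, float lo, float hi) {
    float d = (hi - lo) / 15.0f;
    int   z = round_clamp4(-lo / d);
    int   q = round_clamp4(x / d + (float)z);
    return d * (float)(q - z);
}
```

For example, with the range [-1.0, 2.3], `roundtrip_zero_point(0.0f, -1.0f, 2.3f)` returns exactly 0, while `roundtrip_float_bias(0.0f, -1.0f, 2.3f)` returns roughly 0.1. The trade-off is that rounding the offset to an integer code shifts the representable range slightly away from [lo, hi], so values near the ends of the range may be clipped; whether that matters for perplexity would need to be measured, as suggested above.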
Yes, I understand that only the inputs to the matrix multiplication are converted from the quantized format to what the hardware supports (for example, fp16 or even fp32). In fact, I believe only the weights are converted. My question is less about optimality and more about compatibility with other asymmetric quantizers vs. q4_1. Perhaps the following pseudo code helps show what I'm thinking about.
The fp16 perplexity baselines are different between GPTQ and llama.cpp. Has someone already compared GPTQ and llama.cpp using the same baseline? |
4-bit quantization tends to come at a cost of output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit/2-bit) quantization methods and even when compared with uncompressed fp16 inference.
It would be good to see benchmarks on the existing implementation. It's possible there is substantial quality loss from the 4-bit quantization. It's also possible that it isn't very substantial. We'd have to see benchmarks to know.
The related project GPTQ-for-LLaMA has some benchmarks available for their implementation.
References:
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
The case for 4-bit precision: k-bit Inference Scaling Laws
Related work:
https://github.com/qwopqwop200/GPTQ-for-LLaMA/