float16 and 8-bit CUDA implementations #310
Conversation
- Add simple CUDA implementation for llama2 inference: ~60 tokens/second on RTX 4090 (sequence length of 269)
- Rename laama2.cu.cu -> llama2.cu (what I originally wanted), for easier diff with top of the tree
- Fix issue with the latest tokenizer.bin
- Very tiny improvement in performance, at the cost of being less general
- Split into 3 stages: 3x faster than before for large sequence lengths, and we are now able to get full memory bandwidth utilization
- Get rid of the redundant memcpys
- This fixes these errors (a possible one-line fix is sketched after this list):

  ```
  ~/llama2.cu$ nvcc llama2.cu -o llama2
  llama2.cu(157): error: ambiguous "?" operation: second operand of type "const half" can be converted to third operand type "int", and vice versa
    loaded_fragment[0][threadIdx.y][threadIdx.x] = ((n < N) && (k < K)) ? weight[offset] : 0;
                                                                                          ^
  llama2.cu(177): error: ambiguous "?" operation: second operand of type "const half" can be converted to third operand type "int", and vice versa
    loaded_fragment[buf_i][threadIdx.y][threadIdx.x] = ((n < N) && (k < K)) ? weight[offset] : 0;
                                                                                              ^
  2 errors detected in the compilation of "llama2.cu".
  ```
- Speed up softmax (or at least make it more readable)
- Fix build with CUDA 12.2
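The two "ambiguous "?" operation" errors above come from mixing a `half` operand with the integer literal `0` in a ternary, so nvcc has two equally valid conversions to choose from. A minimal sketch of the kind of change that resolves it (the kernel and variable names here are illustrative, not the PR's actual code):

```
#include <cuda_fp16.h>

// Sketch: force both branches of the ternary to be `half` so nvcc no longer has two
// equally valid conversions to pick from. __float2half(0.0f) is one explicit way;
// (half)0.0f also works on recent CUDA toolkits.
__global__ void load_tile(const half* weight, half* tile, int n, int N, int k, int K, int offset) {
    // before (ambiguous): tile[threadIdx.x] = ((n < N) && (k < K)) ? weight[offset] : 0;
    tile[threadIdx.x] = ((n < N) && (k < K)) ? weight[offset] : __float2half(0.0f);
}
```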
Fwiw, I believe it is key to have a CUDA implementation in the repo (and later an OpenCL one, and so on). I hope this will be accepted soon ...
@mgrabban You are right. Good catch! I will fix it. Thank you!
I know I'm annoying, but this is exactly why I believe it's beneficial to have this version in the repo.
This is really cool! Wow. Do you have any stats on the 110M, or even better the 7B model? What are your thoughts on how we maintain all the copy-paste code between all these different versions? etc. etc. :\
To keep them aligned, I would push the differences into specific functions like "load_weights", etc. Btw, on my machine the previous version of llama2.cu ran more than 12x faster than the CPU version. I'll test the new version (and the int8 one) when I'm home.
Performance with an RTX 3090:
- stories110M
- Llama2 7B

The above tests were not on the same machine. The performance is consistent (has low variability) on the same machine. I was also wondering about using ...
In the beginning we could do that copy and paste by ourselves. But this is the kind of job that an AI coder model like WizardCoder should do; it is a mostly repetitive task. It could be set up as a CI job, with some test cases to check if the model did something wrong. We would need some good prompts. That could be fun! I just don't know if these models are good at editing; I suspect they are better at writing new code. But I agree that it is a good idea to have them in separate files. It is good for understanding and also for performance. Regarding the names to use, here is a separate issue to discuss and select proper names: #323
Great job! I think you can get some more performance by optimizing the mat_vec_q8_kernel() a bit - by loading multiple int8 elements at a time (I think loading just 4 int8 elements - i.e. a load size of uint32_t - should be enough to get pretty close to max performance).
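A minimal sketch of that vectorized-load idea, assuming a row-major int8 weight matrix and per-tensor scale/zero-point dequantization (the kernel name, launch shape, and parameters are illustrative, not the PR's actual mat_vec_q8_kernel):

```
#include <stdint.h>

// Sketch only: one block per output row; each thread accumulates a partial dot product,
// reading 4 int8 weights per iteration through a single 32-bit load (char4).
// Assumes n_cols is a multiple of 4, the weight rows are 4-byte aligned, and the kernel
// is launched with a power-of-two block size <= 256, e.g. mat_vec_q8_sketch<<<n_rows, 256>>>.
__global__ void mat_vec_q8_sketch(const int8_t* __restrict__ w, const float* __restrict__ x,
                                  float* __restrict__ out, int n_cols,
                                  float scale, float zero_point) {
    int row = blockIdx.x;
    const int8_t* w_row = w + (size_t)row * n_cols;

    float sum = 0.0f;
    for (int j = threadIdx.x * 4; j < n_cols; j += blockDim.x * 4) {
        char4 q = *reinterpret_cast<const char4*>(w_row + j);   // one 32-bit load, 4 weights
        sum += (scale * (q.x - zero_point)) * x[j + 0];
        sum += (scale * (q.y - zero_point)) * x[j + 1];
        sum += (scale * (q.z - zero_point)) * x[j + 2];
        sum += (scale * (q.w - zero_point)) * x[j + 3];
    }

    // Block-wide reduction of the partial sums.
    __shared__ float partial[256];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}
```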
Hey @ankan-ban, good to see you back here! If you wanna do it, you can modify this branch and then send a PR to it (on my fork). When the PR is merged, the commit will appear here
One thing lacking is to apply the ...
(Btw I really want to get around to the CUDA versions but still a lot of the "basics" of the repo are not where I want them to be. I submitted a bunch of refactors yesterday that I thought clean up the repo quite a lot. I'm still continuing that work a bit, and I'm also still thinking through Chat and Quantization and have to get those into a happy state before I can move on to CUDA)
@karpathy, I see your point; that's why I submitted those minimal PRs, in the hope they can help you move faster to your desired state. Of course it's your call. Just let us know if you see anything that could help speed up the addition of CUDA to the official repo. With the two PRs I submitted today (the one on avoiding qsort in encode and the other on having a "generate()" function), I believe that adding a simple "chat mode" would be easier. If you think they are ok, my next PR would probably be for a "chat mode".
I tried quantize.c at my end (on a Windows system) and it crashes for the llama7b model (when quantizing the q-matrix for the 9th layer). I still need to figure out what's wrong (maybe some limitation of memory-mapped file size on Windows?). For quantization, I see this implementation uses a pretty simple scheme with a single scale and zero-point value per tensor. This is often known as per-tensor quantization. For better accuracy, there are more advanced techniques like:
With INT8 weights, the above techniques are probably not required, but when using INT4 they do help a lot. As batch-1 LLM inference is purely memory-bandwidth bound, INT4 quantization makes a lot of sense (it also reduces the memory footprint and would allow running even the 70B-parameter model on systems with ~40GB of memory). I had been playing around with AWQ quantization - I just hacked the weights generated by the AWQ repo using their Python scripts (https://github.com/mit-han-lab/llm-awq), converted them to binary files, imported them into the llama2.c codebase, and then integrated just the CUDA kernel from the AWQ repo for matmul with quantized weights. (I wasted weeks debugging an issue that turned out to be a different layout used by the rotary embedding operation - but finally I have something working.)
(My very hacky/test/debug WIP code is here: https://github.com/ankan-ban/llama2.cu/tree/int4-expts)
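For reference, a minimal sketch of the per-tensor scheme described above (a single scale and zero point for the whole tensor); the function names and the exact range mapping are illustrative and not taken from quantize.c:

```
#include <math.h>
#include <stdint.h>

// Per-tensor INT8 quantization sketch: one (scale, zero_point) pair for the whole tensor.
// Per-channel or group-wise schemes keep a separate pair per output channel or per group
// of weights, which is what improves accuracy at lower bit widths (e.g. INT4).
void quantize_per_tensor(const float* w, int8_t* q, int n, float* scale, float* zero_point) {
    float wmin = w[0], wmax = w[0];
    for (int i = 1; i < n; i++) {
        if (w[i] < wmin) wmin = w[i];
        if (w[i] > wmax) wmax = w[i];
    }
    *scale = (wmax - wmin) / 255.0f;          // map [wmin, wmax] onto [-128, 127]
    if (*scale == 0.0f) *scale = 1.0f;        // degenerate case: all values equal
    *zero_point = -128.0f - wmin / *scale;    // quantized value that real 0.0 maps to
    for (int i = 0; i < n; i++) {
        float v = roundf(w[i] / *scale + *zero_point);
        if (v < -128.0f) v = -128.0f;
        if (v > 127.0f) v = 127.0f;
        q[i] = (int8_t)v;
    }
}

// Dequantization is the inverse affine map.
static inline float dequantize(int8_t q, float scale, float zero_point) {
    return scale * ((float)q - zero_point);
}
```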
@ankan-ban Cool! Is the output quality good enough with int4? Karpathy implemented a grouped version; it is in #312. @atamurad also implemented an int4 quantization using AWQ. You can check it here: https://huggingface.co/atamurad/llama2-7b-4bit-awq
I ran into ...
I got the same issue, but I only have 16GB of RAM at the moment. I told myself I would try with a bigger machine, but I never did.
I have the first version of my AWQ-quantized int4 GPU version here: I get ~160 tokens per second on an RTX 4090 with the llama2-7b model:
It's still pretty small at < 1000 lines of code (but I got rid of some sampling logic that I will probably add back later, moving more stuff to the GPU). I hope to optimize it further. It would be nice if we could reach 200 tokens per second with the 7B model. Will try bigger models too.
Thanks for the links. I finished the first version of my implementation too (above). The output looks reasonable; just looking at it, I can't make out much difference vs the fp16 version. I will try to implement a way to compute perplexity to get a better sense of the output quality.
This is based on the work of @ankan-ban on #159

It has 2 implementations in separate files:

- run.cu uses float16
- run-q8.cu uses 8-bit quantization

Example Usage
For float16:
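A hedged example invocation, assuming a default nvcc toolchain and a checkpoint exported as model.bin (a placeholder name); flags or libraries (e.g. -arch) may need adjusting for your setup:

```
nvcc -O3 run.cu -o runcu
./runcu model.bin
```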
For the 8-bit quantization:
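And a similar sketch for the quantized build (the quantized checkpoint name model_q8.bin is again a placeholder, not the PR's actual file name):

```
nvcc -O3 run-q8.cu -o runq8
./runq8 model_q8.bin
```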