float16 and 8-bit CUDA implementations #310

Open · wants to merge 41 commits into master

Conversation

@kroggen (Contributor) commented Aug 16, 2023

This is based on the work of @ankan-ban on #159

It has 2 implementations in separate files:

  • run.cu uses float16
  • run-q8.cu uses 8-bit quantization

Example Usage

For float16:

```
git clone -b cuda https://github.com/kroggen/llama2.c
cd llama2.c
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
make cuda
./run-cuda stories110M.bin
```

For the 8-bit quantization:

```
git clone https://github.com/kroggen/llama2.c
cd llama2.c
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

git checkout quantization-q8
gcc -O3 quantize.c -o quantize -lm
./quantize stories110M.bin

git checkout cuda
make cuda-q8
./run-cuda-q8 data.bin
```

ankan-ban and others added 30 commits, July 28, 2023 19:19. Selected commit messages:

  • Add simple cuda implementation for llama2 inference: ~60 tokens/second on RTX 4090 (sequence length of 269)
  • Rename laama2.cu.cu -> llama2.cu (what I originally wanted), for easier diff with top of the tree
  • Fix issue with latest tokenizer.bin
  • Very tiny improvement in performance at the cost of being less general
  • Split into 3 stages: 3x faster than before for large seq lengths, and we are now able to get full memory bandwidth utilization; get rid of the redundant memcpys
  • Fix these nvcc compile errors:

```
~/llama2.cu$ nvcc llama2.cu -o llama2
llama2.cu(157): error: ambiguous "?" operation: second operand of type "const half" can be converted to third operand type "int", and vice versa
      loaded_fragment[0][threadIdx.y][threadIdx.x] = ((n < N) && (k < K)) ? weight[offset] : 0;
                                                                                           ^

llama2.cu(177): error: ambiguous "?" operation: second operand of type "const half" can be converted to third operand type "int", and vice versa
          loaded_fragment[buf_i][threadIdx.y][threadIdx.x] = ((n < N) && (k < K)) ? weight[offset] : 0;
                                                                                                   ^

2 errors detected in the compilation of "llama2.cu".
```
  • Speed up softmax (or at least make it more readable)
  • Fix build with CUDA 12.2
@kroggen mentioned this pull request Aug 17, 2023
@rdentato (Contributor) commented Aug 18, 2023

FWIW, I believe it is key to have a CUDA implementation in the repo (and later an OpenCL one, and so on).
This would allow focusing the effort on GPU support while keeping consistency with the official run.c.
It would also help in refactoring the code so that the logic stays simple and clean while the implementation details are abstracted away.
To avoid confusion, we could add a 'cuda' directory with a README explaining that it is not the official release.

I hope this will be accepted soon ...

@mgrabban

@kroggen can you check rmsnorm_kernel? I think there are bugs on line 108 (only adds x[index], not x[index] * x[index]) and line 113 (sums ss * ss, not ss).
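For reference, the accumulation should look roughly like this (a sketch only - kernel name, block size, and epsilon are illustrative, not the actual code in this PR):

```
// Illustrative rmsnorm kernel: one block normalizes one vector of length `size`.
// The two points above correspond to accumulating x*x (not x) and adding the
// partial sums as-is (not their squares) in the reduction.
__global__ void rmsnorm_kernel_sketch(float* o, const float* x, const float* weight, int size) {
    __shared__ float partial[256];                 // assumes blockDim.x == 256
    float ss = 0.0f;
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        ss += x[i] * x[i];                         // sum of squares, not sum of values
    partial[threadIdx.x] = ss;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];  // add partial sums directly
        __syncthreads();
    }
    float scale = rsqrtf(partial[0] / size + 1e-5f);
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        o[i] = weight[i] * (scale * x[i]);
}
```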

@kroggen (Contributor, Author) commented Aug 18, 2023

@mgrabban You are right. Good catch! I will fix it. Thank you!

@rdentato (Contributor)

I know I'm annoying but this is exactly why I believe it's beneficial to have this version in the repo.

@karpathy (Owner)

This is really cool! Wow. Do you have any stats on the 110M, or even better 7B model?

What are your thoughts on how we maintain all the copy paste code between all these different versions?

  • run.c
  • runq.c
  • run.cu
  • runq.cu
  • run.cl

etc. etc. :\

@rdentato (Contributor) commented Aug 19, 2023

To keep them aligned, I would push the differences into specific functions like "load_weights" etc.
If you are not opposed to the idea of creating an llm "object" (as I proposed a few PRs ago), I'll submit a new one.
I'm traveling these days and will be back home tomorrow night. I'll try to cook something up on Monday.
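Just to illustrate the idea (the names here are placeholders, not from any submitted PR):

```
/* Hypothetical backend "object": each implementation (run.c, run.cu, run-q8.cu, ...)
   fills in this table, and the shared driver code only calls through it. */
typedef struct {
    void   (*load_weights)(const char* checkpoint_path, void* ctx);
    float* (*forward)(void* ctx, int token, int pos);   /* returns logits */
    void   (*free_ctx)(void* ctx);
} LLMBackend;
```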

Btw, on my machine the previous version of llama2.cu ran at more than 12x the speed of the CPU version. I'll test the new version (and the int8 one) when I'm home.

@kroggen (Contributor, Author) commented Aug 19, 2023

Performance with an RTX 3090:

stories110M

fp16: achieved tok/s: 1300.813008
int8: achieved tok/s: 1262.295082

Llama2 7B

fp16: achieved tok/s: 57.862491
int8: achieved tok/s: 77.062557

The fp16 and int8 tests above were not run on the same machine, but on a given machine the performance is consistent (low variability).

I was also wondering about using bfloat16 instead of half, since it loses less of float32's dynamic range, but not all cards support it. So maybe as side work.
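A compile-time switch could look something like this (just a sketch; native bfloat16 arithmetic needs compute capability 8.0+):

```
// Sketch of a compile-time switch between half and bfloat16 storage.
#ifdef USE_BF16
#include <cuda_bf16.h>
typedef __nv_bfloat16 weight_t;
#define FLOAT_TO_W(x) __float2bfloat16(x)
#define W_TO_FLOAT(x) __bfloat162float(x)
#else
#include <cuda_fp16.h>
typedef __half weight_t;
#define FLOAT_TO_W(x) __float2half(x)
#define W_TO_FLOAT(x) __half2float(x)
#endif
```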

@kroggen (Contributor, Author) commented Aug 19, 2023

What are your thoughts on how we maintain all the copy paste code between all these different versions?

  • run.c
  • runq.c
  • run.cu
  • runq.cu
  • run.cl

In the beginning we could do that copy-and-paste ourselves.

But this is the kind of job that an AI coder model like WizardCoder should do. It is a mostly repetitive task. It could be set up as a CI job, with some test cases to check whether the model did something wrong. We would need some good prompts. That could be fun! I just don't know if these models are good at editing; I suspect they are better at writing new code.

But I agree that it is a good idea to have them in separate files. It is good for understanding and also for performance.

Regarding the names to use, here is a separate issue to discuss and select proper names: #323

@ankan-ban

Great job! I think you can get some more performance by optimizing the mat_vec_q8_kernel() a bit - loading multiple int8 elements at a time (I think loading just 4 int8 elements, i.e. a load the size of a uint32_t, should be enough to get pretty close to max performance).
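Roughly what I mean (a sketch only, with made-up names, one warp per output row, and assuming n is a multiple of 4 and rows are 4-byte aligned):

```
#include <stdint.h>

// Illustrative vectorized int8 mat-vec kernel: each thread reads 4 weights per
// iteration as a single 32-bit load instead of 4 separate byte loads.
// Launch with one block per row and blockDim.x == 32.
__global__ void mat_vec_q8_sketch(float* out, const int8_t* w, const float* x,
                                  float scale, int n) {
    int row = blockIdx.x;
    const uint32_t* w_row = reinterpret_cast<const uint32_t*>(w + (size_t)row * n);
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n / 4; i += 32) {
        uint32_t packed = w_row[i];                       // 4 int8 weights in one load
        const int8_t* q = reinterpret_cast<const int8_t*>(&packed);
        for (int j = 0; j < 4; j++)
            sum += (float)q[j] * x[i * 4 + j];
    }
    for (int offset = 16; offset > 0; offset /= 2)        // warp-level reduction
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    if (threadIdx.x == 0)
        out[row] = scale * sum;
}
```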

@kroggen (Contributor, Author) commented Aug 20, 2023

Hey @ankan-ban, good to see you back here!

If you wanna do it, you can modify this branch and then send a PR to it (on my fork). When the PR is merged, the commit will appear here

```
git remote add bernardo https://github.com/kroggen/llama2.c
git fetch bernardo
git checkout -b cuda2 bernardo/cuda
```

One thing still lacking is applying n_kv_heads in the MHA; I only did it for the other parts.
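For reference, the mapping that would need to go into the attention kernel is roughly the standard grouped-query one (a sketch with illustrative names, not code from this branch):

```
#include <stddef.h>

// Sketch of the grouped-query head mapping (n_kv_heads <= n_heads):
// query head h reads keys/values from kv head h / (n_heads / n_kv_heads).
__device__ const float* kv_row_for_head(const float* kv_cache, int t, int h,
                                        int n_heads, int n_kv_heads, int head_size) {
    int kv_mul  = n_heads / n_kv_heads;   // query heads sharing one kv head
    int kv_head = h / kv_mul;             // kv head used by query head h
    return kv_cache + (size_t)t * n_kv_heads * head_size + (size_t)kv_head * head_size;
}
```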

@karpathy (Owner)

(Btw I really want to get around to the CUDA versions, but a lot of the "basics" of the repo are still not where I want them to be. I submitted a bunch of refactors yesterday that I think clean up the repo quite a lot. I'm still continuing that work a bit, and I'm also still thinking through Chat and Quantization and have to get those into a happy state before I can move on to CUDA.)

@rdentato (Contributor) commented Aug 21, 2023

@karpathy, I see your point; that's why I submitted those minimal PRs, in the hope they can help you move faster toward your desired state.
However, not having this CUDA version in the repo, which @kroggen is actively working on, makes it more complicated to help him converge toward a stable state. Take the bug that @mgrabban noted earlier: if the code hadn't been posted here, maybe it would not have been found.
Working across two separate repositories when you are trying to make the code converge is quite complicated.

Of course it's your call. Just let us know if you see anything that could help speed up the addition of CUDA to the official repo.

With the two PRs I submitted today (the one avoiding qsort in encode and the other adding a "generate()" function), I believe adding a simple "chat mode" would be easier. If you think they are OK, my next PR would probably be for a "chat mode".

@ankan-ban

I tried quantize.c at my end (on a Windows system) and it crashes for the llama7b model (when quantizing the q-matrix for the 9th layer). I still need to figure out what's wrong (maybe some limitation on memory-mapped file size on Windows?).

For quantization, I see this implementation is using a pretty simple scheme with a single scale and zero-point value per tensor. This is often known as per-tensor quantization.

For better accuracy, there are more advanced techniques like:

  • Per-channel (or per-column) quantization, where we quantize each column of the matrix separately and end up with multiple scale/zero-point values - one per column.
  • Grouped quantization, which is even finer grained: we typically quantize small groups of elements together (like 128 elements), so we end up with even more scale/zero-point values (one for each group).
  • Activation-aware quantization (AWQ), which in addition to the grouped quantization scheme scales the input to the layer based on the relative magnitudes of activations and weights. (My understanding is that this is kind of the state of the art now...)

With INT8 weights, the above techniques are probably not required, but when using INT4 they do help a lot. Since batch-1 LLM inference is purely memory-bandwidth bound, INT4 quantization makes a lot of sense (it also reduces the memory footprint and would allow running even the 70B-parameter model on systems with ~40GB memory).
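For concreteness, grouped quantization boils down to something like this (a sketch only - the group size, names, and the symmetric no-zero-point scheme are illustrative choices):

```
#include <math.h>
#include <stdint.h>

// Illustrative grouped quantization: one scale per group of GS weights
// (symmetric, no zero point). n is assumed to be a multiple of GS.
#define GS 128

void quantize_grouped_q8(const float* w, int n, int8_t* q, float* scales) {
    for (int g = 0; g < n / GS; g++) {
        float wmax = 0.0f;
        for (int i = 0; i < GS; i++) {               // max magnitude in the group
            float v = fabsf(w[g * GS + i]);
            if (v > wmax) wmax = v;
        }
        float scale = wmax / 127.0f;                 // map [-wmax, wmax] to [-127, 127]
        scales[g] = scale;
        for (int i = 0; i < GS; i++)
            q[g * GS + i] = (int8_t)roundf(w[g * GS + i] / (scale > 0.0f ? scale : 1.0f));
    }
}
```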

I had been playing around with AWQ quantization - I hacked the weights generated by the AWQ repo using their python scripts (https://github.com/mit-han-lab/llm-awq), converted them to binary files and imported them into the llama2.c codebase, and then integrated just the cuda kernel from the AWQ repo for the matmul with quantized weights. (I wasted weeks debugging an issue that turned out to be a different layout used by the rotary embedding operation - but finally I have something working.)
With that I am getting slightly more than 2x the performance of the FP16 CUDA implementation at my end (RTX 4090) with llama2-7b model:

  • FP16: 62 tokens per second
  • INT4: 128 tokens per second

I am hoping to write my own kernel for the mat-mul and get better than this.

(My very hacky/test/debug WIP code is here: https://github.com/ankan-ban/llama2.cu/tree/int4-expts)

@kroggen (Contributor, Author) commented Aug 22, 2023

@ankan-ban Cool! Is the output content good enough with the int4? Karpathy implemented a grouped version, it is on #312. @atamurad also implemented an int4 quantization using AWQ. You can check it here: https://huggingface.co/atamurad/llama2-7b-4bit-awq

@calvintwr

I ran into "malloc failed" with this. I was using llama-13b. Any idea?

@rdentato (Contributor)

I got the same issue, but I only have 16GB of RAM at the moment. I told myself I would try with a bigger machine but never did.
How much RAM do you have?

@ankan-ban

I have the first version of my AWQ-quantized int4 GPU implementation here:
https://github.com/ankan-ban/llama_cu_awq/tree/main
(I decided to put it in a separate repository as I am not using llama2.c weights anymore, and had to change some logic related to the RoPE rotation to make it work with AWQ weights.)

I get ~160 tokens per second on an RTX 4090 with the llama2-7b model:

```
>llama2.cu.exe C:\LLM\llama2-awq-q4.bin 256 "write an essay about GPUs"

Model params:-
dim: 4096
hidden_dim: 11008
n_heads: 32
n_kv_heads: 32
n_layers: 32
seq_len: 2048
vocab_size: 32000


loaded weights
<s>
write an essay about GPUs

Introduction:

GPU (Graphics Processing Unit) is a specialized electronic circuit designed to accelerate the manipulation of graphical data. It is a key component of a computer's hardware that is used to improve the performance of graphics-intensive applications such as video games, computer-aided design (CAD) software, and scientific simulations. In this essay, we will explore the history of GPUs, their architecture, and their impact on the computer industry.
History of GPUs:
The concept of a GPU can be traced back to the 1960s when computer graphics were still in their infancy. At that time, computer graphics were primarily used for scientific visualization and were not yet a major component of mainstream computing. However, as computer graphics became more popular in the 1980s and 1990s, the need for specialized hardware to handle the increasingly complex graphics tasks became apparent. In the early 1990s, the first GPUs were developed, which were designed to offload the computationally intensive graphics tasks from the CPU (Central Processing Unit) to the GPU.
Architecture
achieved tok/s: 161.494617. Tokens: 255, seconds: 1.579
```

It's still pretty small at < 1000 lines of code (but I got rid of some sampling logic that I will probably add back later, moving that stuff to the GPU). I hope to optimize it further. It would be nice if we could reach 200 tokens per second with the 7b model. I will try bigger models too.

@ankan-ban

@ankan-ban Cool! Is the output content good enough with the int4? Karpathy implemented a grouped version, it is on #312. @atamurad also implemented an int4 quantization using AWQ. You can check it here: https://huggingface.co/atamurad/llama2-7b-4bit-awq

Thanks for the links. I finished the first version of my implementation too (above). The output looks reasonable - just looking at it, I can't make out much difference vs the fp16 version. I will try to implement a way to compute perplexity to get a better sense of the output quality.
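For reference, the perplexity computation I have in mind is roughly this (a sketch only; the forward() function-pointer signature is an assumption, not actual code from either repo):

```
#include <math.h>

// Illustrative perplexity: feed a tokenized held-out text through the model,
// accumulate the log-probability assigned to each actual next token, and
// exponentiate the average negative log-likelihood.
float perplexity(float* (*forward)(int token, int pos),
                 const int* tokens, int n_tokens, int vocab_size) {
    double nll = 0.0;
    for (int pos = 0; pos < n_tokens - 1; pos++) {
        float* logits = forward(tokens[pos], pos);       // logits over the vocabulary
        float max_logit = logits[0];                     // log-softmax, numerically stable
        for (int i = 1; i < vocab_size; i++)
            if (logits[i] > max_logit) max_logit = logits[i];
        double sum_exp = 0.0;
        for (int i = 0; i < vocab_size; i++)
            sum_exp += exp((double)(logits[i] - max_logit));
        double logprob = (double)(logits[tokens[pos + 1]] - max_logit) - log(sum_exp);
        nll -= logprob;                                  // negative log-likelihood
    }
    return (float)exp(nll / (n_tokens - 1));
}
```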
