llama2.cu - a simple cuda implementation #159
base: master
Conversation
Add a simple CUDA implementation for llama2 inference: ~60 tokens/second on RTX 4090 (sequence length of 269)
> Cherry-picked a pending pull request to add support for chat (much easier to use and test).
Good job! I suspect this would be better as a separate repo, as it may have different instructions to run it, and other people may create different implementations.
Suggestion: unless Karpathy wants to have one CUDA implementation right here.
llama2.cu.cu
Outdated
float val = 0.0f;
for (int t = 0; t < seq_len; t++)
    val += att[t] * (float)value_cache[loff + t * dim + h * head_size + i];
output[h * head_size + i] = (half) val;
The logic for this part was updated for better performance, and even readability. Please check run.c.
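For reference, the corresponding CPU loop in run.c is structured roughly as below. This is a from-memory sketch assuming the usual variable names from run.c's attention block (s, h, head_size, loff, dim, att, pos); check the actual file for the exact current form.

// weighted sum of the values, store back into xb (fp32, single-threaded)
float* xb = s->xb + h * head_size;
memset(xb, 0, head_size * sizeof(float));
for (int t = 0; t <= pos; t++) {
    // value vector for this head at timestep t
    float* v = s->value_cache + loff + t * dim + h * head_size;
    // attention weight for this timestep
    float a = att[t];
    // accumulate the weighted value into xb
    for (int i = 0; i < head_size; i++) {
        xb[i] += a * v[i];
    }
}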
maybe this?
// weighted sum of the values, store back into xb
half* xb = s->xb + h * head_size;
cudaMemset(xb, 0, head_size * sizeof(half));
for (int t = threadIdx.x; t < seq_len; t+= blockDim.x) {
// get the value vector for this head and at this timestep
half* v = s->value_cache + loff + t * dim + h * head_size;
// get the attention weight for this timestep
float a = att[t];
// accumulate the weighted value into xb
for (int i = 0; i < head_size; i++) {
xb[i] += a * (float)v[i];
}
}
No, the above is wrong because it needs to convert half to float, compute, and then store back as half. It must also use output instead of s->xb.
maybe this:
// weighted sum of the values, store back into xb
half* xb = output + h * head_size;
cudaMemset(xb, 0, head_size * sizeof(half));
for (int t = threadIdx.x; t < seq_len; t+= blockDim.x) {
// get the value vector for this head and at this timestep
half* v = s->value_cache + loff + t * dim + h * head_size;
// get the attention weight for this timestep
float a = att[t];
// accumulate the weighted value into the output
for (int i = 0; i < head_size; i++) {
xb[i] += (half)(a * (float)v[i]);
}
}
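If the intent is the behaviour described above (accumulate in fp32 and convert to half only once when storing into output), another arrangement is to parallelise over the output elements i instead of the timesteps t, as the original diff does. A rough, untested sketch reusing the same variable names:

// weighted sum of the values, written to output
// one thread per element i of this head; accumulate in fp32, store as half once
for (int i = threadIdx.x; i < head_size; i += blockDim.x) {
    float val = 0.0f;
    for (int t = 0; t < seq_len; t++) {
        // value for this head, timestep t, element i (stored as half)
        half v = s->value_cache[loff + t * dim + h * head_size + i];
        val += att[t] * (float)v;
    }
    output[h * head_size + i] = (half)val;
}

This also avoids the memset and the concurrent read-modify-write of xb[i] from multiple threads, since each thread owns its own output elements.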
!!! On quick skim - amazing, I love it. I'll take a close look and think through how this should interact with the CPU version.
@ankan-ban Wouldn't it be better to make the computations in FP16 as well? Currently it has lots of conversions. BTW, I am learning a lot from your code. Thank you!
I tested the compiled binary and I could not see any outputs from the transformer, even though the GPU showed that the file was loaded. I ran:
while the output of the GPU during that time is:
I am purposefully doing all the computations in FP32, as computations are not the bottleneck for batch-size-1 inference of these models, so it makes zero difference in speed. Any free improvement in accuracy is good; however, I don't expect any difference in accuracy with FP16 calculations either.
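A rough back-of-the-envelope check of that claim, using my own numbers: assuming ~2 FLOPs per weight per generated token, fp16 weight storage, and the nominal RTX 4090 specs (~82 TFLOPS fp32, ~1008 GB/s), the per-token compute time is tiny compared to the time spent just streaming the weights. A sketch in plain C:

#include <stdio.h>

// Rough per-token compute time vs. memory time for batch-size-1 inference of a
// 7B-parameter model on an RTX 4090 (nominal specs assumed, not measured).
int main(void) {
    double params     = 7e9;            // model parameters
    double flops_tok  = 2.0 * params;   // ~2 FLOPs per weight per token (mul + add)
    double bytes_tok  = 2.0 * params;   // fp16 weights: each weight read once per token
    double peak_flops = 82e12;          // RTX 4090 fp32 throughput (approx.)
    double peak_bw    = 1008e9;         // RTX 4090 memory bandwidth (approx.)

    double t_compute = flops_tok / peak_flops;  // seconds spent on math
    double t_memory  = bytes_tok / peak_bw;     // seconds spent streaming weights

    printf("compute: %.3f ms/token, memory: %.3f ms/token\n",
           t_compute * 1e3, t_memory * 1e3);
    // memory time is ~80x larger, so fp32 vs fp16 math makes no measurable difference
    return 0;
}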
Same with me. It is failing somewhere:
It is failing on this line: if (fread(vocab[i], len, 1, file) != 1) { return 1; } with these values:
It works when we use a
But the output is gibberish:
I am using
Sorry about the issues. I was testing with code that was a bit old (old tokenizer and potentially incorrect code for handling prompts). I am going to sync to latest and update it today after testing with more models.
- for easier diff with top of the tree
- rename llama2.cu.cu -> llama2.cu (what I originally wanted)
- fixes issue with latest tokenizer.bin
I just synced llama2.cu with the latest run.c. The issues you were facing should now be fixed. Tested with 4 models:
The instructions I followed:
Oh, forget about the errors. It was the server instance I was using. It worked perfectly when I changed to another RTX 3090 instance. Good job!! \o/
After adding #179 I was able to load the llama2 7B model with this patch on Windows, and I am getting great results on my 3090. In fact it beats what I am getting with llama.cpp on the same machine!
vs llama.cpp
@richinseattle if you are measuring performance, you may want to try this branch:
I sent some PRs to https://github.com/ankan-ban/llama2.cu/pulls. They increase performance even more.
With @kroggen's patches I am seeing double the speed on llama2 7B: 60 tok/s.
Worked perfectly for me, 3090, CUDA 11.4. Although the output isn't great:
@ss32 Llama2 7B is not a chat LM. It only does completion of the prompt. Try "Here is a story about chess engines:"
Is there still a chance this can be merged?
This branch is no longer actively maintained. If you are interested, you can use this repo, which uses INT4 weight quantization for ~3.3x more speed and a 3x reduction in memory footprint: https://github.com/ankan-ban/llama_cu_awq
Add a simple CUDA implementation for llama2 inference.
Other unrelated changes:
Cherry-picked a pending pull request to add support for chat (much easier to use and test).
I am actually very impressed by the original llama2.c. With OpenMP enabled on my system (AMD R9 5900X, 3200 MHz DDR4 dual-channel memory, Windows 10), I get ~1.6 tokens per second on the 7B model, which is ~85% of peak memory bandwidth. So not only is the implementation simple, it's almost as fast as it can possibly get. For small sequence lengths, 60 tokens/s on RTX 4090 is again close to 85% of peak memory bandwidth utilization, so I believe the only way to make this significantly faster is to use weight-quantization techniques.
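For anyone who wants to sanity-check the ~85% figures, this is the arithmetic I believe is behind them (my own sketch; the bandwidth numbers are nominal specs, so treat the percentages as approximate):

#include <stdio.h>

// Bandwidth-bound token-rate estimate for a 7B model at batch size 1:
// every weight has to be read once per generated token.
int main(void) {
    double params = 7e9;

    // CPU case: run.c with fp32 weights on dual-channel DDR4-3200
    // (2 channels * 8 bytes * 3200 MT/s = 51.2 GB/s nominal)
    double cpu_bw = 51.2e9;
    double cpu_peak_toks = cpu_bw / (4.0 * params);   // fp32 weights
    printf("CPU peak: %.2f tok/s, observed 1.6 -> %.0f%% of peak\n",
           cpu_peak_toks, 100.0 * 1.6 / cpu_peak_toks);

    // GPU case: llama2.cu with fp16 weights on an RTX 4090 (~1008 GB/s nominal)
    double gpu_bw = 1008e9;
    double gpu_peak_toks = gpu_bw / (2.0 * params);   // fp16 weights
    printf("GPU peak: %.1f tok/s, observed 60 -> %.0f%% of peak\n",
           gpu_peak_toks, 100.0 * 60.0 / gpu_peak_toks);

    return 0;
}

Both come out in the 80-90% range, consistent with the ~85% quoted above.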