
CUDA mul mat vec q kernels for k-quants #2203

Merged

Conversation

JohannesGaessler
Collaborator

As a follow-up to #2067, this PR adds CUDA matrix-vector multiplication kernels based on integer intrinsics for k-quants. The implementations seem to work, but I still need to clean up and rewrite the code to make it more readable. Only a block size of 256 is supported, not 64. Implementing kernels for k-quants is already tedious enough as it is, and I don't want to spend time on a band-aid fix when the real solution should be to just allow blocks to span multiple rows if they don't exactly divide the row size. Right now I'm too tired to do performance testing, but the new kernels should be faster. However, the performance of the older data formats still seems to be better, presumably because their simpler layout allows for more coalesced memory accesses.
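
For readers unfamiliar with the approach, below is a minimal, hypothetical CUDA sketch of the core idea behind the integer-intrinsic path: pack four signed 8-bit quants into one 32-bit int and let `__dp4a` accumulate a 4-way 8-bit dot product per instruction, applying a float scale only once per row. This is not the kernel from this PR; the real k-quant layouts with their super-block scales and minima are considerably more involved, and the per-row scale here is a stand-in for the per-block scales.

```cuda
// Illustrative sketch only (not the PR's kernels). Compile with: nvcc -arch=sm_61 dp4a_sketch.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block (one warp) computes one output element: the dot product of one
// matrix row with the vector y. Quants are packed 4 per int so that __dp4a
// does 4 int8 multiply-adds per instruction; the float scale d is applied once.
__global__ void mul_mat_vec_q8_sketch(const int * x, const int * y, const float * d,
                                      float * dst, int n_int) {
    const int * x_row = x + blockIdx.x * n_int;
    int sumi = 0;
    for (int i = threadIdx.x; i < n_int; i += blockDim.x) {
        sumi = __dp4a(x_row[i], y[i], sumi);             // 4 int8 MACs per call
    }
    for (int mask = warpSize/2; mask > 0; mask >>= 1) {  // warp-level reduction
        sumi += __shfl_xor_sync(0xffffffffu, sumi, mask);
    }
    if (threadIdx.x == 0) {
        dst[blockIdx.x] = d[blockIdx.x] * (float) sumi;  // scale the integer sum
    }
}

int main() {
    const int n_rows = 4, n = 256, n_int = n / 4;
    std::vector<int>   hx(n_rows * n_int, 0x01010101);  // every weight quant = 1
    std::vector<int>   hy(n_int,          0x02020202);  // every activation quant = 2
    std::vector<float> hd(n_rows, 0.5f);                 // per-row scale (simplified)
    int *dx, *dy; float *dd, *ddst;
    cudaMalloc(&dx, hx.size() * sizeof(int));
    cudaMalloc(&dy, hy.size() * sizeof(int));
    cudaMalloc(&dd, hd.size() * sizeof(float));
    cudaMalloc(&ddst, n_rows * sizeof(float));
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), hy.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dd, hd.data(), hd.size() * sizeof(float), cudaMemcpyHostToDevice);
    mul_mat_vec_q8_sketch<<<n_rows, 32>>>(dx, dy, dd, ddst, n_int);
    std::vector<float> out(n_rows);
    cudaMemcpy(out.data(), ddst, n_rows * sizeof(float), cudaMemcpyDeviceToHost);
    printf("row 0: %.1f (expect %.1f)\n", out[0], 0.5f * n * 1 * 2);
    return 0;
}
```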

@JohannesGaessler JohannesGaessler marked this pull request as ready for review July 13, 2023 18:34
@JohannesGaessler
Collaborator Author

Alright, I now consider this ready to be merged. These are the performance numbers on my systems:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 7b q2_K | tg128 | 82.65 | 115.50 | 1.40 |
| RTX 3090 | 7b q3_K_M | tg128 | 71.55 | 92.50 | 1.29 |
| RTX 3090 | 7b q4_K_M | tg128 | 72.76 | 94.20 | 1.29 |
| RTX 3090 | 7b q5_K_M | tg128 | 70.78 | 87.42 | 1.24 |
| RTX 3090 | 7b q6_K | tg128 | 71.02 | 86.35 | 1.22 |
| P40 | 7b q2_K | tg128 | 28.91 | 40.30 | 1.39 |
| P40 | 7b q3_K_M | tg128 | 25.38 | 31.09 | 1.22 |
| P40 | 7b q4_K_M | tg128 | 24.61 | 32.16 | 1.31 |
| P40 | 7b q5_K_M | tg128 | 22.79 | 25.84 | 1.13 |
| P40 | 7b q6_K | tg128 | 22.21 | 27.60 | 1.24 |

@abc-nix

abc-nix commented Jul 14, 2023

Hi.

Is it correct to assume that I need to use the LLAMA_CUBLAS make option to test this PR? If so, these are my results for q4_K_M quantized models on an NVIDIA RTX 3060 12GB. I am not seeing a token generation boost, but rather a speed decrease.

Max context: 2048 tokens, Prompt: 544 tokens, Generation: ~300 tokens.

| Model | Layers (offloaded/total) | main TG (t/s) | PR TG (t/s) |
|---|---|---|---|
| 7B | 35/35 | 32.87 | 30.88 |
| 13B | 43/43 | 18.39 | 17.19 |
| 30B | 29/63 | 2.49 | 2.36 |

Please let me know what I am doing wrong and whether I need to use different make options to build and correctly test this PR.

Thanks for your work on improving GPU performance on llama.cpp.

@JohannesGaessler
Collaborator Author

Can you compare the speed of this PR with and without the compile option LLAMA_CUDA_FORCE_DMMV? It's possible that something changed on master while I was working on this PR that affects performance.

@abc-nix

abc-nix commented Jul 14, 2023

> Can you compare the speed of this PR with and without the compile option LLAMA_CUDA_FORCE_DMMV? It's possible that something changed on master while I was working on this PR that affects performance.

Sure. This is the updated table:

| Model | Layers (offloaded/total) | LLAMA_CUDA_FORCE_DMMV | main TG (t/s) | PR TG (t/s) |
|---|---|---|---|---|
| 7B | 35/35 | no | 32.87 | 30.88 |
| 7B | 35/35 | yes | 33.32 | 33.45 |
| 13B | 43/43 | no | 18.39 | 17.19 |
| 13B | 43/43 | yes | 18.08 | 17.85 |
| 30B | 29/63 | no | 2.49 | 2.36 |
| 30B | 29/63 | yes | 2.49 | 2.37 |

@JohannesGaessler JohannesGaessler merged commit 4304bd3 into ggerganov:master Jul 14, 2023
@ikawrakow
Contributor

Here are my results on an RTX 4080 for 7B and 13B:

| Quantization | TG before this PR (t/s) | TG after this PR (t/s) | Difference |
|---|---|---|---|
| Q2_K - 7B | 125.0 | 132.4 | +5.9% |
| Q3_K_S - 7B | 111.3 | Broken | - |
| Q4_K_S - 7B | 119.3 | 109.5 | -8.2% |
| Q5_K_S - 7B | 104.5 | 95.3 | -8.9% |
| Q6_K - 7B | 92.7 | 87.9 | -5.2% |
| Q2_K - 13B | 72.6 | 79.3 | +9.2% |
| Q3_K_S - 13B | 65.1 | 78.4 | +20.0% |
| Q4_K_S - 13B | 68.5 | 63.2 | -7.7% |
| Q5_K_S - 13B | 60.1 | 55.3 | -8.0% |
| Q6_K - 13B | 52.7 | 50.9 | -3.5% |

2- and 3-bit are better (apart from 3-bit producing gibberish for 7B), but all the others are slower.

@JohannesGaessler
Collaborator Author

Thank you for sharing those results; I'll look into the 7b Q3_K_S correctness issue. Just to make sure: are you setting the compile option LLAMA_CUDA_MMV_Y? The default value is 1, but on my RTX 3090 I get better performance by setting it to 2. In any case, the previous implementation should still be available via LLAMA_CUDA_FORCE_DMMV.
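
For context, here is a hedged, purely illustrative sketch (not llama.cpp's actual code, and simplified to FP32) of what a rows-per-block tuning knob like LLAMA_CUDA_MMV_Y typically controls in a mat-vec kernel: each thread block processes `MMV_Y` matrix rows with one warp per row, so the launch grid shrinks and the shared vector y is reused across the rows handled by a block. The best value is hardware dependent, which is why 1 vs. 2 can behave differently on a P40, an RTX 3090 and an RTX 4080.

```cuda
// Illustrative only: a rows-per-block knob for a mat-vec kernel (FP32, not quantized).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#ifndef MMV_Y
#define MMV_Y 2   // hypothetical default for this sketch; tune per GPU
#endif

__global__ void mat_vec_rows_per_block(const float * x, const float * y,
                                       float * dst, int ncols) {
    const int row = blockIdx.x * MMV_Y + threadIdx.y;    // one warp per row
    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols; i += warpSize) {
        sum += x[row * ncols + i] * y[i];
    }
    for (int mask = warpSize / 2; mask > 0; mask >>= 1) { // warp reduction
        sum += __shfl_xor_sync(0xffffffffu, sum, mask);
    }
    if (threadIdx.x == 0) {
        dst[row] = sum;
    }
}

int main() {
    const int nrows = 4, ncols = 1024;
    std::vector<float> hx(nrows * ncols, 1.0f), hy(ncols, 2.0f);
    float *dx, *dy, *ddst;
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dy, hy.size() * sizeof(float));
    cudaMalloc(&ddst, nrows * sizeof(float));
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), hy.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(32, MMV_Y);         // one warp per row, MMV_Y rows per block
    dim3 grid(nrows / MMV_Y);      // fewer blocks as MMV_Y grows
    mat_vec_rows_per_block<<<grid, block>>>(dx, dy, ddst, ncols);
    std::vector<float> out(nrows);
    cudaMemcpy(out.data(), ddst, nrows * sizeof(float), cudaMemcpyDeviceToHost);
    printf("row 0: %.1f (expect %.1f)\n", out[0], 2.0f * ncols);
    return 0;
}
```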

@ikawrakow
Contributor

I see no difference between LLAMA_CUDA_MMV_Y=1 and LLAMA_CUDA_MMV_Y=2.

@JohannesGaessler
Collaborator Author

I see, that's very unfortunate. At some point I want to implement a script that automatically runs benchmarks and optimizes settings; ideally that will also give us more information about good defaults.

@JohannesGaessler
Collaborator Author

Looking at the specs, it seems that compared to older generations the RTX 4000 series has more compute relative to its memory bandwidth, so the reduction in floating-point arithmetic from using integer intrinsics may not matter as much.
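
As a rough sanity check of that reasoning, one can compare the FLOP-per-byte ratios implied by the approximate public spec numbers of the cards mentioned in this thread (ballpark figures only, host-only snippet, not from the PR):

```cuda
// Back-of-the-envelope check of the compute-to-bandwidth argument, using
// approximate public spec numbers (FP32 TFLOPS and GB/s); treat the exact
// values as ballpark figures. A higher FLOP-per-byte ratio means the card is
// relatively less starved for compute, so trading FP math for integer
// intrinsics saves proportionally less.
#include <cstdio>

int main() {
    struct Gpu { const char * name; double tflops_fp32; double gbps; };
    const Gpu gpus[] = {
        { "P40",      11.8, 347.0 },
        { "RTX 3090", 35.6, 936.0 },
        { "RTX 4080", 48.7, 717.0 },
    };
    for (const Gpu & g : gpus) {
        // FLOPs available per byte of memory traffic
        const double flop_per_byte = g.tflops_fp32 * 1e12 / (g.gbps * 1e9);
        printf("%-9s %5.1f FLOP/byte\n", g.name, flop_per_byte);
    }
    return 0;
}
```

Under those ballpark numbers the RTX 4080 has roughly 1.8x the compute per byte of memory traffic of the RTX 3090, which would be consistent with the floating-point savings of the integer path mattering less there.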
