
CUDA mul mat vec q kernels for k-quants #2203

Merged

Conversation

JohannesGaessler
Collaborator

As a follow-up to #2067, this PR adds CUDA matrix-vector multiplication kernels based on integer intrinsics for k-quants. The implementations seem to work, but I still need to clean up and rewrite the code to make it more readable. Only a block size of 256 is supported, not 64. Implementing kernels for k-quants is already tedious enough as it is, and I don't want to spend time on a band-aid fix when the real solution should be to just allow blocks to span multiple rows if they don't exactly divide the row size. Right now I'm too tired to do performance testing, but the new kernels should be faster. However, the performance of the older data formats still seems to be better, presumably because their simpler layout allows for more coalesced memory accesses.
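
For readers unfamiliar with the approach, below is a minimal, hypothetical CUDA sketch of the core idea behind the integer-intrinsic path: pack four signed 8-bit quants into one 32-bit int and let `__dp4a` accumulate a 4-way 8-bit dot product per instruction, applying a float scale only once per row. This is not the kernel from this PR; the real k-quant layouts with their super-block scales and minima are considerably more involved, and the per-row scale here is a stand-in for the per-block scales.

```cuda
// Illustrative sketch only (not the PR's kernels). Compile with: nvcc -arch=sm_61 dp4a_sketch.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block (one warp) computes one output element: the dot product of one
// matrix row with the vector y. Quants are packed 4 per int so that __dp4a
// does 4 int8 multiply-adds per instruction; the float scale d is applied once.
__global__ void mul_mat_vec_q8_sketch(const int * x, const int * y, const float * d,
                                      float * dst, int n_int) {
    const int * x_row = x + blockIdx.x * n_int;
    int sumi = 0;
    for (int i = threadIdx.x; i < n_int; i += blockDim.x) {
        sumi = __dp4a(x_row[i], y[i], sumi);             // 4 int8 MACs per call
    }
    for (int mask = warpSize/2; mask > 0; mask >>= 1) {  // warp-level reduction
        sumi += __shfl_xor_sync(0xffffffffu, sumi, mask);
    }
    if (threadIdx.x == 0) {
        dst[blockIdx.x] = d[blockIdx.x] * (float) sumi;  // scale the integer sum
    }
}

int main() {
    const int n_rows = 4, n = 256, n_int = n / 4;
    std::vector<int>   hx(n_rows * n_int, 0x01010101);  // every weight quant = 1
    std::vector<int>   hy(n_int,          0x02020202);  // every activation quant = 2
    std::vector<float> hd(n_rows, 0.5f);                 // per-row scale (simplified)
    int *dx, *dy; float *dd, *ddst;
    cudaMalloc(&dx, hx.size() * sizeof(int));
    cudaMalloc(&dy, hy.size() * sizeof(int));
    cudaMalloc(&dd, hd.size() * sizeof(float));
    cudaMalloc(&ddst, n_rows * sizeof(float));
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), hy.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dd, hd.data(), hd.size() * sizeof(float), cudaMemcpyHostToDevice);
    mul_mat_vec_q8_sketch<<<n_rows, 32>>>(dx, dy, dd, ddst, n_int);
    std::vector<float> out(n_rows);
    cudaMemcpy(out.data(), ddst, n_rows * sizeof(float), cudaMemcpyDeviceToHost);
    printf("row 0: %.1f (expect %.1f)\n", out[0], 0.5f * n * 1 * 2);
    return 0;
}
```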

@JohannesGaessler JohannesGaessler marked this pull request as ready for review July 13, 2023 18:34
@JohannesGaessler
Collaborator Author

Alright, I now consider this ready to be merged. These are the performance numbers on my systems:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 7b q2_K | tg128 | 82.65 | 115.50 | 1.40 |
| RTX 3090 | 7b q3_K_M | tg128 | 71.55 | 92.50 | 1.29 |
| RTX 3090 | 7b q4_K_M | tg128 | 72.76 | 94.20 | 1.29 |
| RTX 3090 | 7b q5_K_M | tg128 | 70.78 | 87.42 | 1.24 |
| RTX 3090 | 7b q6_K | tg128 | 71.02 | 86.35 | 1.22 |
| P40 | 7b q2_K | tg128 | 28.91 | 40.30 | 1.39 |
| P40 | 7b q3_K_M | tg128 | 25.38 | 31.09 | 1.22 |
| P40 | 7b q4_K_M | tg128 | 24.61 | 32.16 | 1.31 |
| P40 | 7b q5_K_M | tg128 | 22.79 | 25.84 | 1.13 |
| P40 | 7b q6_K | tg128 | 22.21 | 27.60 | 1.24 |

@abc-nix

abc-nix commented Jul 14, 2023

Hi.

Is it correct to assume that I need to use the LLAMA_CUBLAS make option to test this PR? If so, these are my results for q4_K_M quantized models on an NVIDIA RTX 3060 12GB. I am not seeing a token generation boost, but rather a speed decrease.

Max context: 2048 tokens, Prompt: 544 tokens, Generation: ~300 tokens.

| Model | Layers (offloaded/total) | main TG (t/s) | PR TG (t/s) |
|---|---|---|---|
| 7B | 35/35 | 32.87 | 30.88 |
| 13B | 43/43 | 18.39 | 17.19 |
| 30B | 29/63 | 2.49 | 2.36 |

Please let me know what I am doing wrong and whether I need to use different make options to build and correctly test this PR.

Thanks for your work on improving GPU performance on llama.cpp.

@JohannesGaessler
Collaborator Author

Can you compare the speed of this PR with and without the compile option LLAMA_CUDA_FORCE_DMMV? It's possible that something changed on master while I was working on this PR that affects performance.

@abc-nix

abc-nix commented Jul 14, 2023

> Can you compare the speed of this PR with and without the compile option LLAMA_CUDA_FORCE_DMMV? It's possible that something changed on master while I was working on this PR that affects performance.

Sure. This is the updated table:

| Model | Layers (offloaded/total) | LLAMA_CUDA_FORCE_DMMV | main TG (t/s) | PR TG (t/s) |
|---|---|---|---|---|
| 7B | 35/35 | no | 32.87 | 30.88 |
| 7B | 35/35 | yes | 33.32 | 33.45 |
| 13B | 43/43 | no | 18.39 | 17.19 |
| 13B | 43/43 | yes | 18.08 | 17.85 |
| 30B | 29/63 | no | 2.49 | 2.36 |
| 30B | 29/63 | yes | 2.49 | 2.37 |

@JohannesGaessler JohannesGaessler merged commit 4304bd3 into ggerganov:master Jul 14, 2023
@ikawrakow
Contributor

Here are my results on an RTX 4080 for 7B and 13B:

| Quantization | TG before this PR (t/s) | TG after this PR (t/s) | Difference |
|---|---|---|---|
| Q2_K - 7B | 125.0 | 132.4 | +5.9% |
| Q3_K_S - 7B | 111.3 | Broken | - |
| Q4_K_S - 7B | 119.3 | 109.5 | -8.2% |
| Q5_K_S - 7B | 104.5 | 95.3 | -8.9% |
| Q6_K - 7B | 92.7 | 87.9 | -5.2% |
| Q2_K - 13B | 72.6 | 79.3 | +9.2% |
| Q3_K_S - 13B | 65.1 | 78.4 | +20.0% |
| Q4_K_S - 13B | 68.5 | 63.2 | -7.7% |
| Q5_K_S - 13B | 60.1 | 55.3 | -8.0% |
| Q6_K - 13B | 52.7 | 50.9 | -3.5% |

2- and 3-bit are better (apart from 3-bit producing gibberish for 7B), but all the others are slower.

@JohannesGaessler
Collaborator Author

Thank you for sharing those results; I'll look into the 7b Q3_K_S correctness issue. Just to make sure: are you setting the compile option LLAMA_CUDA_MMV_Y? The default value is 1, but on my RTX 3090 I get better performance by setting it to 2. In any case, the previous implementation should still be available via LLAMA_CUDA_FORCE_DMMV.
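
For context, here is a hedged, purely illustrative sketch (not llama.cpp's actual code, and simplified to FP32) of what a rows-per-block tuning knob like LLAMA_CUDA_MMV_Y typically controls in a mat-vec kernel: each thread block processes `MMV_Y` matrix rows with one warp per row, so the launch grid shrinks and the shared vector y is reused across the rows handled by a block. The best value is hardware dependent, which is why 1 vs. 2 can behave differently on a P40, an RTX 3090 and an RTX 4080.

```cuda
// Illustrative only: a rows-per-block knob for a mat-vec kernel (FP32, not quantized).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#ifndef MMV_Y
#define MMV_Y 2   // hypothetical default for this sketch; tune per GPU
#endif

__global__ void mat_vec_rows_per_block(const float * x, const float * y,
                                       float * dst, int ncols) {
    const int row = blockIdx.x * MMV_Y + threadIdx.y;    // one warp per row
    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols; i += warpSize) {
        sum += x[row * ncols + i] * y[i];
    }
    for (int mask = warpSize / 2; mask > 0; mask >>= 1) { // warp reduction
        sum += __shfl_xor_sync(0xffffffffu, sum, mask);
    }
    if (threadIdx.x == 0) {
        dst[row] = sum;
    }
}

int main() {
    const int nrows = 4, ncols = 1024;
    std::vector<float> hx(nrows * ncols, 1.0f), hy(ncols, 2.0f);
    float *dx, *dy, *ddst;
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dy, hy.size() * sizeof(float));
    cudaMalloc(&ddst, nrows * sizeof(float));
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), hy.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(32, MMV_Y);         // one warp per row, MMV_Y rows per block
    dim3 grid(nrows / MMV_Y);      // fewer blocks as MMV_Y grows
    mat_vec_rows_per_block<<<grid, block>>>(dx, dy, ddst, ncols);
    std::vector<float> out(nrows);
    cudaMemcpy(out.data(), ddst, nrows * sizeof(float), cudaMemcpyDeviceToHost);
    printf("row 0: %.1f (expect %.1f)\n", out[0], 2.0f * ncols);
    return 0;
}
```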

@ikawrakow
Contributor

I see no difference between LLAMA_CUDA_MMV_Y=1 and LLAMA_CUDA_MMV_Y=2.

@JohannesGaessler
Collaborator Author

I see, that's very unfortunate. At some point I want to implement a script that automatically runs benchmarks and optimizes settings; ideally that will also give us more information about good defaults.

@JohannesGaessler
Collaborator Author

Looking at the specs, it seems that compared to older generations the RTX 4000 series has more compute relative to its memory bandwidth, so the reduction in floating-point arithmetic from using integer intrinsics may not matter as much.
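
As a rough sanity check of that reasoning, one can compare the FLOP-per-byte ratios implied by the approximate public spec numbers of the cards mentioned in this thread (ballpark figures only, host-only snippet, not from the PR):

```cuda
// Back-of-the-envelope check of the compute-to-bandwidth argument, using
// approximate public spec numbers (FP32 TFLOPS and GB/s); treat the exact
// values as ballpark figures. A higher FLOP-per-byte ratio means the card is
// relatively less starved for compute, so trading FP math for integer
// intrinsics saves proportionally less.
#include <cstdio>

int main() {
    struct Gpu { const char * name; double tflops_fp32; double gbps; };
    const Gpu gpus[] = {
        { "P40",      11.8, 347.0 },
        { "RTX 3090", 35.6, 936.0 },
        { "RTX 4080", 48.7, 717.0 },
    };
    for (const Gpu & g : gpus) {
        // FLOPs available per byte of memory traffic
        const double flop_per_byte = g.tflops_fp32 * 1e12 / (g.gbps * 1e9);
        printf("%-9s %5.1f FLOP/byte\n", g.name, flop_per_byte);
    }
    return 0;
}
```

Under those ballpark numbers the RTX 4080 has roughly 1.8x the compute per byte of memory traffic of the RTX 3090, which would be consistent with the floating-point savings of the integer path mattering less there.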
