CUDA mul mat vec q kernels for k-quants #2203
Conversation
Force-pushed from 75de6d1 to 358dcf0.
Alright, I now consider this ready to be merged. These are the performance numbers on my systems:
[performance tables not captured]
Hi. Is it correct to assume I need to use the LLAMA_CUBLAS make option to test this PR? If so, these are my results for q4_K_M quantized models on an Nvidia RTX 3060 12GB. I am not seeing a token generation speedup, but rather a slowdown. Max context: 2048 tokens, prompt: 544 tokens, generation: ~300 tokens.
Please let me know what I am doing wrong and whether I need to use different make options to build and correctly test this PR. Thanks for your work on improving GPU performance in llama.cpp.
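For context: CUDA support in llama.cpp is enabled at build time with `make LLAMA_CUBLAS=1`, or with `-DLLAMA_CUBLAS=ON` when building via CMake; the kernels in this PR are only compiled and used when that option is set.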
Can you compare the speed of this PR with and without the compile option?
Sure. This is the updated table:
[updated table not captured]
Here are my results on an RTX 4080 for 7B & 13B:
[results table not captured]
2- and 3-bit are better (apart from 3-bit producing gibberish for 7B), but all others are slower.
Thank you for sharing those results; I'll look into 7B Q3_K_S correctness. Just to make sure: are you setting the compile option?
I see no difference with or without that compile option.
I see, that's very unfortunate. I want to at some point implement a script that automatically runs benchmarks and optimizes settings; ideally that will then also give us more information regarding good defaults. |
Looking at the specs, it seems that compared to older generations, the RTX 4000 series has more compute relative to its memory bandwidth. So the reduction in floating-point arithmetic from using integer intrinsics may not matter as much.
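To put rough numbers on that (approximate published specs): an RTX 4080 offers about 49 TFLOPS of FP32 compute against about 717 GB/s of memory bandwidth, i.e. roughly 68 FLOPS per byte loaded, while an RTX 3060 offers about 12.7 TFLOPS against 360 GB/s, roughly 35 FLOPS per byte. With nearly twice the compute available per byte, Ada cards are less likely to be bottlenecked by floating-point throughput in the first place.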
As a follow-up to #2067, this PR adds CUDA matrix-vector multiplication kernels based on integer intrinsics for k-quants. The implementations seem to work, but I'll still need to clean up and rewrite the code to make it more readable. Only a block size of 256 is supported, not 64. Implementing kernels for k-quants is already tedious enough as it is, and I don't want to spend time on a band-aid fix when the real solution should be to just allow blocks to span multiple rows if they don't exactly divide the row size.

Right now I'm too tired to do performance testing, but the new kernels should be faster. However, the performance of the older data formats still seems to be better, presumably because their simpler layout allows for more coalesced memory accesses.
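For readers unfamiliar with the technique: the core idea is to replace per-weight floating-point multiply-adds with the `__dp4a` intrinsic, which computes a 4-way int8 dot product with int32 accumulation in a single instruction, so float math is only needed once per block for the scales. The sketch below is illustrative only; the `block_q8` layout and the `mul_mat_vec_q8` kernel are hypothetical simplifications, not the k-quant layouts or the kernels added by this PR.

```cuda
// Illustrative sketch: a hypothetical simple Q8 block format,
// NOT the k-quant layouts or the kernels from this PR.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

#define QK 32                        // quantized values per block

struct block_q8 {
    float  d;                        // per-block scale
    int8_t qs[QK];                   // quantized values
};

// One thread per output row: dot product of a quantized matrix row
// with a quantized vector. The inner loop runs on integer intrinsics;
// float math happens only once per block, for the scales.
__global__ void mul_mat_vec_q8(const block_q8 *A, const block_q8 *x,
                               float *y, const int nrows, const int ncols) {
    const int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }
    const int blocks_per_row = ncols / QK;

    float sum = 0.0f;
    for (int ib = 0; ib < blocks_per_row; ++ib) {
        const block_q8 *a = &A[row*blocks_per_row + ib];
        const block_q8 *b = &x[ib];

        // reinterpret the 32 int8 values as 8 ints for __dp4a
        const int *av = (const int *) a->qs;
        const int *bv = (const int *) b->qs;

        int isum = 0;
#pragma unroll
        for (int i = 0; i < QK/4; ++i) {
            isum = __dp4a(av[i], bv[i], isum); // 4x int8 mul + int32 add
        }
        sum += a->d * b->d * (float) isum;     // apply scales once per block
    }
    y[row] = sum;
}

int main() {
    const int nrows = 2, ncols = 2*QK;
    const int bpr   = ncols / QK;

    block_q8 *A, *x; float *y;
    cudaMallocManaged(&A, nrows*bpr*sizeof(block_q8));
    cudaMallocManaged(&x, bpr*sizeof(block_q8));
    cudaMallocManaged(&y, nrows*sizeof(float));

    // fill with constants so the result is easy to verify by hand
    for (int i = 0; i < nrows*bpr; ++i) {
        A[i].d = 0.5f;
        for (int j = 0; j < QK; ++j) A[i].qs[j] = 1;
    }
    for (int i = 0; i < bpr; ++i) {
        x[i].d = 0.5f;
        for (int j = 0; j < QK; ++j) x[i].qs[j] = 2;
    }

    mul_mat_vec_q8<<<1, nrows>>>(A, x, y, nrows, ncols);
    cudaDeviceSynchronize();

    printf("y[0] = %.1f (expected %.1f)\n", y[0], 0.5f*0.5f*2.0f*ncols);
    return 0;
}
```

`__dp4a` requires compute capability 6.1 or newer (build with e.g. `nvcc -arch=sm_61`). The actual k-quant kernels are more involved because k-quants pack per-sub-block scales (and, for some formats, minimums) inside a 256-value super-block, all of which has to be unpacked before the integer dot products can run.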