Bug: K cache without FA #103
Comments
Thanks for the report. Happens for me too. I'll investigate.
This is also broken on mainline.
CUDA on mainline.
Indeed, it's on mainline also.
It was puzzling to me why …
Thinking more about this, it is kind of strange. It does work on the CPU, where …
@Nexesenex Does this PR fix it for you? It is approved and all, but I still get NaNs with a quantized model. It does appear to work with the …
I also get NaN with a q8_0 model when using …
It is not just …
@ikawrakow I just confirmed that all quantized-K-cache no-FA modes present on mainline are now working: ggerganov/llama.cpp#10011 (comment) I also used ggerganov/llama.cpp#10015 while I was at it.
What happened?
With the non-FA quantized K cache, q6_0 works.
But q4_0, q4_1, q5_0, q5_1, and q8_0 no longer work as K-cache quants without FA; on both IK_L and mainline they produce NaN. So does iq4_nl as K cache without FA.
(I personally don't mind; K q6_0 is my new bff K cache quant.)
Tested on Llama 3.1 8b Q5_K.
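A minimal reproduction sketch of the behavior described above, assuming llama.cpp's `llama-cli` binary with its `-ctk` (K-cache type) and `-fa` (FlashAttention) flags; the model filename, prompt, and token count are illustrative, not taken from the report:

```shell
# Hypothetical repro; model path is an assumption, not from the report.
MODEL=Llama-3.1-8B-Q5_K.gguf

# Reported to work: q6_0 K cache without FlashAttention
./llama-cli -m "$MODEL" -ctk q6_0 -p "Hello" -n 32

# Reported to produce NaN: q4_0 K cache without FlashAttention
# (likewise q4_1, q5_0, q5_1, q8_0, iq4_nl)
./llama-cli -m "$MODEL" -ctk q4_0 -p "Hello" -n 32

# Same quant with FlashAttention enabled is the working baseline
./llama-cli -m "$MODEL" -ctk q4_0 -fa -p "Hello" -n 32
```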
Name and Version
b3962 on mainline.
Pre-Granite-merge on IK.
What operating system are you seeing the problem on?
Windows
Relevant log output