Assertion failure in ggml_mul_mat_q4_0_q8_1_cuda (g_compute_capabilities[id] >= MIN_CC_DP4A) #4229
Comments
The choice to use mmq or not is made in ggml-cuda.cu (see the relevant code linked further down); somehow we got there even though the device's compute capability is below MIN_CC_DP4A.
I can reproduce this on the latest master with the following commands:
I can reproduce this all the way back to d0cee0d, so it's probably an issue with the original implementation of #2506. One thing to note is that this is specific to the model size: llama-2-13b.Q4_K_S.gguf will not trigger this, even though both the Q4_K_S and the Q6_K models fully fit into the P40's VRAM.
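For reference, the mmq-vs-cuBLAS choice mentioned a couple of comments above is gated on the minimum compute capability among the participating devices. The following is a simplified sketch modeled on the logic in ggml-cuda.cu around this time, not a verbatim copy; the helper name should_use_mul_mat_q and the exact split handling are illustrative.

```cpp
#include <climits>

// MIN_CC_DP4A is 610 in ggml-cuda.cu: the quantized mat-mul kernels rely on
// the __dp4a instruction, which needs compute capability >= 6.1.
static const int MIN_CC_DP4A = 610;

// Illustrative helper: decide whether the custom quantized kernels (mmq) may
// be used, based on the lowest compute capability among the devices that
// actually receive rows according to the tensor split.
static bool should_use_mul_mat_q(const int * compute_capabilities,
                                 const float * tensor_split,
                                 int device_count) {
    int min_compute_capability = INT_MAX;
    for (int id = 0; id < device_count; ++id) {
        const float next = id + 1 < device_count ? tensor_split[id + 1] : 1.0f;
        // Devices with an empty split range are skipped, so their (possibly
        // lower) compute capability does not count towards the minimum.
        if (tensor_split[id] < next && compute_capabilities[id] < min_compute_capability) {
            min_compute_capability = compute_capabilities[id];
        }
    }
    // If any participating device lacks DP4A, fall back to the cuBLAS path.
    return min_compute_capability >= MIN_CC_DP4A;
}
```

If the set of devices considered here ever disagrees with the devices the kernel is later launched on, an mmq kernel could end up running on a pre-6.1 card such as the GTX 970 and trip the assertion.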
I'm still hitting this on latest master (799fc22):
I don't have a GTX 9XX GPU, but I edited the code in such a way that one of my GPUs should be treated as such. Still, I am not able to reproduce this bug. I don't know what would be wrong with the multi-GPU logic either.
Try a few different Q6_K 13B models, with various (short) prompts. For some reason I haven't been able to trigger it with all models and all prompts, but with the right prompt and model it seems 100% reproducible.
I tried a bunch of quantization formats and models and I still can't reproduce it. Is the model you were using one of those where the output tensor has 32001 instead of 32000 rows?
I can reproduce it on a 13B with n_vocab=32032 (chronos-hermes-13b-v2.Q6_K.gguf) and a 20B with n_vocab=32001, but not on a 13B or 20B with n_vocab=32000. So you're right, that's the difference.
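One possible reason the vocabulary size matters: when the rows of the output tensor are split across GPUs, boundaries are typically rounded to some block granularity, and 32001 or 32032 rows divide up differently than 32000. The sketch below is purely illustrative (the rounding constant and the function are made up, not taken from ggml-cuda.cu); it only shows how an off-by-one row count changes the per-device row ranges.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative only: granularity to which row boundaries are rounded.
static const long ROW_ROUNDING = 32;

// Print the row range each device would get, given cumulative tensor-split
// fractions (tensor_split[0] == 0.0f) and the total number of rows.
static void print_row_split(long n_rows, const float * tensor_split, int device_count) {
    for (int id = 0; id < device_count; ++id) {
        long row_low  = std::lround(n_rows * tensor_split[id]);
        long row_high = id + 1 < device_count
                      ? std::lround(n_rows * tensor_split[id + 1])
                      : n_rows;
        row_low -= row_low % ROW_ROUNDING;                 // round boundaries down
        if (id + 1 < device_count) row_high -= row_high % ROW_ROUNDING;
        const long n = row_high - row_low;
        std::printf("n_rows=%ld device %d: [%ld, %ld) -> %ld rows (%s multiple of %ld)\n",
                    n_rows, id, row_low, row_high, n,
                    n % ROW_ROUNDING == 0 ? "a" : "NOT a", ROW_ROUNDING);
    }
}

int main() {
    const float split[2] = {0.0f, 0.5f};  // two GPUs, 50/50
    print_row_split(32000, split, 2);     // n_vocab of a stock llama-2 model
    print_row_split(32001, split, 2);     // n_vocab of the models that crash
    return 0;
}
```

Whether this particular rounding is the real trigger here is not established; it is just one way a 32001- or 32032-row tensor can end up on a different code path than a 32000-row one.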
Current Behavior
I got this crash on https://github.com/cebtenzzre/llama.cpp/tree/18fe116e9a5aa45a83bd1d6f043f98dc395f218e:
Failure Information (for bugs)
Backtrace:
Relevant code: https://github.com/cebtenzzre/llama.cpp/blob/18fe116e9a5aa45a83bd1d6f043f98dc395f218e/ggml-cuda.cu#L5054-L5077
It asserts that g_compute_capabilities[id] >= MIN_CC_DP4A (610), where id is the current device. But it is 520, which matches my GTX 970.
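For context on the constant: MIN_CC_DP4A is 610 because the quantized kernels use the __dp4a intrinsic (a packed int8 dot product), which only exists on compute capability 6.1 and newer, while a GTX 970 is sm_52. Below is a minimal sketch of the usual guard pattern, with an illustrative scalar fallback rather than the actual kernel code.

```cpp
#include <cstdint>

// Sketch of how DP4A use is normally guarded in CUDA code such as ggml-cuda.cu.
// __dp4a(a, b, c) multiplies the four packed int8 lanes of a and b, sums the
// products and adds c; the intrinsic requires __CUDA_ARCH__ >= 610.
#define MIN_CC_DP4A 610

static __device__ __forceinline__ int dot_int8x4(const int a, const int b, const int c) {
#if __CUDA_ARCH__ >= MIN_CC_DP4A
    return __dp4a(a, b, c);
#else
    // Illustrative scalar fallback for pre-6.1 GPUs (e.g. sm_52 / GTX 970).
    const int8_t * va = reinterpret_cast<const int8_t *>(&a);
    const int8_t * vb = reinterpret_cast<const int8_t *>(&b);
    int sum = c;
    for (int i = 0; i < 4; ++i) {
        sum += int(va[i]) * int(vb[i]);
    }
    return sum;
#endif
}
```

The assertion in ggml_mul_mat_q4_0_q8_1_cuda is the corresponding host-side check: it refuses to launch an mmq kernel on a device whose recorded compute capability is below 610.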
Steps to Reproduce
I'm not exactly sure how I ran into this issue, because I've been using the same build for weeks without seeing it. It could be an issue with my fork; I should investigate whether the latest llama.cpp is still significantly slower on my GPUs. I still have the coredump handy if any further information would help.
cc @slaren