Bug: Speculative Decoding "Segmentation fault (core dumped)" #10176
Comments
Looks like a crash in the DRY sampler when it is cloned due to
While running this with address sanitizer, it also detects a buffer overflow after generating tokens for a while (unrelated to the DRY issue):
cc @ggerganov
Interestingly, I seem to have run into a different issue with the --sampling-seq modifier when using speculative decoding with Qwen 2.5; Llama 3.1 seems to work just fine:
Looks like an issue with
It seems it only occurs when using Qwen2.5-0.5B as the draft model; 1.5B and larger operate as expected.
I can reproduce it now. I think this is because the model is so small that the tensor does not have enough rows, so some devices end up with 0 rows, which causes the event to not be created. It can be reproduced with
I see, that does make sense.
Thank you for the heads-up; I will try to get this fixed ASAP.
@slaren Do you have a repro? I'm running a few tests here with |
I can reproduce it reliably with this command line:
I am unable to reproduce on my end and the address sanitizer does not produce any errors up until the context is filled. If you could generate a detailed log with |
What happened?
Hey all, I wanted to report a segmentation fault with llama-speculative. I have never once gotten this executable to work, and I don't believe the problem is my command, as I have also tried copy-pasting the speculative example commands.
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
version: 4031 (d5a409e)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
No response
Relevant log output