Support speculative decoding in server
example
#5877
Comments
Any updates for this? |
@vietanh125 Not yet, but contributions are welcome 😃 |
There is ongoing related work in #6828, though I haven't had time to look at it in detail yet. |
Sorry, does that mean the server doesn't support speculative decoding? I can run it with commands like the one below in Kubernetes. Just a sample:
|
Not yet supported |
Ok so the |
Also interested in this PR. Thank you to everyone contributing to a solution here. |
PR #6828 is a distinct technique that uses a lookup file to speculate tokens instead of using a draft model; it seems to give less speedup than draft-based speculative decoding. |
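To make the distinction concrete, here is a minimal sketch of lookup-based drafting. It is not the code from #6828: the function name `lookup_draft`, its parameters, and the use of an in-memory token list as a stand-in for the lookup file are all assumptions made for illustration. The idea is only that draft tokens come from matching the most recent n-gram against previously seen text, rather than from running a second model.

```python
from typing import List


def lookup_draft(context: List[int], corpus: List[int],
                 ngram: int = 2, n_draft: int = 4) -> List[int]:
    """Propose up to n_draft tokens by n-gram lookup (hypothetical helper).

    `corpus` stands in for the lookup file: a flat list of token ids.
    The last `ngram` tokens of `context` are searched for in `corpus`,
    most recent occurrence first, and the tokens that followed that
    occurrence are returned as the draft.
    """
    if len(context) < ngram:
        return []
    key = context[-ngram:]
    for start in range(len(corpus) - ngram, -1, -1):
        if corpus[start:start + ngram] == key:
            continuation = corpus[start + ngram:start + ngram + n_draft]
            if continuation:  # skip a match that sits at the very end
                return continuation
    return []  # no match: nothing to speculate on this step
```

The proposed tokens would still have to be verified by the main model, exactly as with a draft model; only the source of the guesses differs.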
Support would be really nice to have because now there is the official Llama 3.2 in 1B and 3B, which should be suitable for 8B/70B 3.1, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb |
Yeah definitely. With draft: LLama-3.2-3B Q8 and model LLama-3.1-70B-Instruct (Q5_K, to fit on 2 32GB Tesla V100) we go from 10 t/s to 30 t/s. Very impressive I'd say.
CUDA_VISIBLE_DEVICES=0,1 ./llama-speculative \
-m Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf \
-md Llama-3.2-3B-Instruct-Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage" \
-t 4 -n 512 -c 8192 -s 8 --top_k 1 \
--draft 16 -ngl 88 -ngld 30 --temp 0
EDIT: with LLama-3.2-1B Q8 that can go to 40 t/s |
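For readers new to the technique, the draft-and-verify loop behind numbers like these can be sketched as follows. This is a toy illustration under stated assumptions, not llama.cpp's implementation: a "model" here is just a hypothetical Python function that returns the greedy next token for a token list, and `speculative_generate` and `greedy_generate` are made-up names. It covers only the greedy case (comparable to `--temp 0` above), where the output is identical to decoding with the main model alone.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token list to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_generate(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the main model only (baseline)."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n_new: int,
                         n_draft: int = 16) -> List[int]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The small draft model proposes a run of tokens.
        proposal: List[int] = []
        for _ in range(min(n_draft, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed token; keep the matching prefix.
        #    (A real implementation scores all positions in one batch.)
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The target contributes one token itself: the correction after
        #    a mismatch, or a bonus token if everything was accepted.
        if len(out) < goal:
            out.append(target(out))
    return out
```

The speedup comes from step 2: verifying a run of drafted tokens costs the main model roughly one batched pass, so the more tokens the draft gets right, the fewer passes the large model needs per generated token.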
Wait, what happened? I used to run llama-server with speculative decoding with -md. I just "upgraded" and -md went away. Now there's a separate program called llama-speculative, but it doesn't appear to be a server. Sigh :( Guess I have to downgrade and find the version where it went away.... |
@enn-nafnlaus Did you find the version where it went away? Would appreciate any leads. |
The last commit with |
Came to ask the same as other folks have stated here - looks like |
Is anyone working on this issue? Or is this possibly blocked by something? I am already preparing for this feature to be implemented in Ollama, but depend on this feature being implemented in llama-server here. I don't mind giving this issue here a shot, it is labeled as good first issue and if that's true would make it suitable for my first commit. I had a quick look and from what I see there is already an example of implementation in speculative. I assume I can use that as a hint for implementing it at the server level. Are there any additional pointers or specific considerations for the implementation I should be aware of? |
At the very least the |
FWIW, I went to test this a.m. before I went hunting and stumbled into this thread:
using a q4_k_l Qwen2.5-Coder-7B-Instruct draft with a q4_k_l Qwen2.5-coder-32B-Instruct-GGUF main model (bartowski quants from hf)
was perf without the draft model m3max mbp 128GB. ~53% performance increase when using the draft model, based on
Went immediately to see if I could add it on the server, since I remembered abetlen merging draft model support way back when (although that required the Python bindings), and found this thread. In case I was doing something errant, my CLI:
and
|
Glad to hear this, this is pretty similar to ExllamaV2. The Qwen 2.5 model family is a good example for this as well, you can basically use the small 1.5b or even 0.5b model for the draft with the big 72b model and get an excellent boost. |
I also ran some smaller-scale tests, which I wanted to share to bring some additional perspective (this is on an RTX 3060 with 12GB of VRAM, so I can't fit models as large):
Which, if I read that correctly, bumps the speed from Edit: this was on |
Any progress on allowing speculative decoding in the server? |
It's already supported - this issue hasn't been closed. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Provide speculative decoding through the server example.
Motivation
Noticed this topic popped up in several comments (1, 2, 3) but it seems we haven't officially opened an issue for it. I'm creating this to provide a space for focused discussion on how we can implement this feature and actually get this started.
Possible Implementation
Perhaps move the speculative sampling implementation to common or sampling?