Enhancement: Speculative decoding – load 2 models at the same time! #1207
Comments
Support for this was added to llama.cpp today: It shows 40-60% speedups in practice. |
Support is now added to KoboldCpp. Do give it a try! |
How do I confirm whether it is working or not? It prints While I know that the Mistral tokenizer is partially different between Large and 7B, no errors followed. During generation, I see 0% load on the GPU, periodically spiking to 40-60%. Is that the draft model running? But where are the speculative sampling statistics? For example, "drafted 40 tokens, accepted 29 tokens, on average 3.4 tokens per step". Still, can you add a flag to switch to the draft model at runtime? (At least that: not actually using the two models independently, but simply forcing generation on the small model instead of running both speculatively.)
You can run it in --debugmode. The draft model output will be displayed along with the validated result and whether it matches. To verify drafting, you can compare two kinds of instructions:
- "Please give me the first 100 positive integers": the draft model will do very well on this at low temperature. You will see large chunks of correctly guessed tokens being output.
- "Write a funny story about zebras": the draft model will do poorly on this, as it's hard to speculate creative output.
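For the kind of summary asked about above ("drafted N tokens, accepted M tokens, X tokens per step"), a small counter is enough. This is only a sketch: `SpecStats`, `record` and `summary` are hypothetical names and are not part of KoboldCpp or llama.cpp. It assumes every verification step yields the accepted draft tokens plus one token from the main model.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical accumulator for speculative decoding statistics.
struct SpecStats {
    long drafted = 0;   // tokens proposed by the draft model
    long accepted = 0;  // drafted tokens confirmed by the main model
    long emitted = 0;   // tokens actually produced
    long steps = 0;     // main-model verification steps

    // Call once per verification step.
    void record(int n_drafted, int n_accepted) {
        assert(n_accepted >= 0 && n_accepted <= n_drafted);
        drafted += n_drafted;
        accepted += n_accepted;
        emitted += n_accepted + 1;  // assumption: one main-model token per step
        steps += 1;
    }

    std::string summary() const {
        char buf[128];
        double per_step = steps ? static_cast<double>(emitted) / steps : 0.0;
        std::snprintf(buf, sizeof(buf),
                      "drafted %ld tokens, accepted %ld tokens, %.1f tokens per step",
                      drafted, accepted, per_step);
        return buf;
    }
};
```

Calling `record` after each drafting round and printing `summary()` at the end of generation would give a single line to compare between the "integers" and "zebras" prompts.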
Oh, Debug Mode, should have thought of that! It's working, but it renders UTF-8 characters in system locale I think? For me it's win-1251. I even tried to execute Or are you afraid of decoding Unicode at that point? UPD: when the full generated text is printed on console at the end as a string, it renders correctly. |
Unicode characters are often made of multiple tokens. For example, You should cross reference the token IDs with the byte representations in the vocab. It's more complicated than you think. For example, |
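To illustrate why printing token by token can garble text: a character such as "Ж" is two bytes in UTF-8 (0xD0 0x96), and those bytes may arrive in different tokens. A minimal sketch of buffering bytes until a character is complete follows. The helper names are hypothetical, and it assumes the input is valid UTF-8 apart from a possibly truncated tail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns how many leading bytes of `buf` form complete UTF-8 sequences.
// Bytes past that offset start a character whose remaining bytes are
// still missing (for example, they belong to the next token).
std::size_t complete_utf8_prefix(const std::string &buf)
{
    std::size_t i = 0;
    while (i < buf.size()) {
        unsigned char c = static_cast<unsigned char>(buf[i]);
        std::size_t len = 1;                    // ASCII or a stray byte
        if ((c & 0xE0) == 0xC0) len = 2;        // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;   // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;   // 11110xxx
        if (i + len > buf.size()) break;        // sequence is cut off
        i += len;
    }
    return i;
}

// Appends one token's raw bytes to `pending` and returns the printable part,
// leaving any incomplete trailing sequence in `pending` for the next call.
std::string feed_token_bytes(std::string &pending, const std::string &piece)
{
    pending += piece;
    std::size_t n = complete_utf8_prefix(pending);
    assert(n <= pending.size());
    std::string out = pending.substr(0, n);
    pending.erase(0, n);
    return out;
}
```

Feeding the bytes 0xD0 and 0x96 as two separate pieces prints nothing after the first call and the whole character after the second.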
Oh, I got it. You print individual tokens up to the first failed guess but not more. To be able to visually compare texts, it is enough to just utf8-decode whatever we have from the draft model so far, not truncating it on the very first wrong token. (Yeah, if the failed one was the very last – we get the same exact thing: nothing known after it) If you would additionally print the full drafted string, it will be trivial to compare what was generated against "what the smaller model was trying to say"! |
sure. you can print the full drafted tokens by printing out the token IDs here: I have a helper function |
@LostRuins, I was mistaken.
This is what I tried to add into the code, as you suggested: Before the line
// if we have somehow skipped ahead (e.g drafting), ensure that all tokens after npast are purged
(after the big loop that drafts tokens)
    if (debugmode == 1 && draft_used) {
        printf("\nSpeculation: [%s] (correct=%d/%d)\n", get_tok_vec_str_concat(draft_results.draftids).c_str(), logits_sampled, logits_to_sample);
    }
Where get_tok_vec_str_concat is like yours but concatenating:
    static std::string get_tok_vec_str_concat(std::vector<int> &embd)
    {
        std::string tmp = "";
        for (auto id : embd)
        {
            tmp += FileFormatTokenizeID(id, file_format, true);
        }
        ::utreplace(tmp, "\n", "\\n");
        return tmp;
    }

Here are full logs of a run (Mistral-Large-Instruct-2407 + Mistral-7B-Instruct-v0.3): CMD shell
The drafted part of So, two final questions:
For 1: most of the time your .exe is run via Explorer and not from a terminal, so it is your process that creates and ultimately owns the console. But checking for this on Windows is even more inconvenient, so I agree that it is easy enough to just create a .bat file that runs chcp and then koboldcpp.
Weird, now But then I clicked "Extra → Unpack KoboldCpp To Folder" and then tried |
Like I said before, it's likely a setting from the terminal console itself and not within koboldcpp. Different terminals can have different settings for text encoding. |
ggerganov#2926
ggerganov#3624
ggerganov#5625
Feature request
Background
I used the MIQU model (https://huggingface.co/miqudev/miqu-1-70b) for quite a long time, many months ago. It is a 70B model, and of course it won't fit in my 3060 GPU with 12 GB of VRAM.
That model was better than anything else I had ever tried! I didn't want to run it heavily quantized to 2 bits and sacrifice its quality, especially since I have 128 GB of DDR4 RAM.
I could get, like, 1 token/second (or slightly more while the context is short) by activating CuBLAS with 0 offloaded layers.
But later, llama.cpp was updated with a new quantization algorithm that hurts performance for older models (they have to be requantized, which is not possible for this stolen/unofficial miqu model). Anyway, it was not that bad even when I continued to play with miqu.
But recently a Mistral Large 2 model came out (https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF) that has 123b parameters!
For me it was superior to miqu for every possible task, and it is even less censored.
Unfortunately, such a huge model runs at 0.3 tokens/second from an empty context, and it gets even slower over time…
I tried different attempts to speed it up, but CuBLAS with 0 layers is still the best (and I cannot roll back to the older koboldcpp version because of GGUF format changes, to see if the previous CUDA kernel versions might be faster or not).
A Q4 quant instead of Q5 gives a slight improvement: 0.4 tokens/second (+0.1 compared to Q5).
After searching information about which model can be used as a draft model for speculative sampling for Mistral Large 2 I decided to try Mistral 7B Instruct v0.3 (https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF)
Strangely enough, llama.cpp has some redundant vocabulary checks (https://github.com/ggerganov/llama.cpp/blob/f018acba22095b8995bf6c5ef815b16a3ce4cf1b/examples/speculative/speculative.cpp#L119-L136). I had to recompile from source with those asserts commented out to make it accept Mistral 7B Instruct v0.3 (as Q5_K_M) as draft model for Mistral Large Instruct 2407 (as Q4_K_S).
I also had to build with full CUDA support for a fair comparison.
The final speedup was huge! With 3-5 drafted tokens, the speed more than doubled, to 0.85-1.0 tokens/second!
I think it is worth having in koboldcpp as well.
What exactly I propose
Sampling
Only greedy sampling (temp=0 or top_k=1) is straightforward to implement for speculative decoding. Some algorithms do exist that allow stochastic sampling from several token probabilities, though. (I'm not quite sure how it is implemented in llama.cpp: do they recursively generate a tree of the most probable continuations? Do they just estimate output probabilities, sacrificing the authenticity of the main model's actual logits?)
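The greedy case can be sketched in a few lines. This is an illustration under stated assumptions, not the llama.cpp or KoboldCpp implementation: `verify_greedy` and `VerifyResult` are hypothetical names, and the large model's argmax tokens are assumed to come from one batched evaluation over the drafted positions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct VerifyResult {
    std::vector<int> emitted;  // tokens to append to the output this step
    std::size_t accepted;      // how many drafted tokens were kept
};

// draft  : k tokens proposed by the small model.
// target : k+1 greedy (argmax) tokens from the large model, where target[i]
//          is its choice given the context followed by draft[0..i).
VerifyResult verify_greedy(const std::vector<int> &draft,
                           const std::vector<int> &target)
{
    assert(target.size() == draft.size() + 1);
    VerifyResult r{{}, 0};
    // Keep drafted tokens while they match what the large model would pick.
    while (r.accepted < draft.size() &&
           draft[r.accepted] == target[r.accepted]) {
        r.emitted.push_back(draft[r.accepted]);
        ++r.accepted;
    }
    // The large model's own token at the first mismatch (or the extra token
    // after a fully accepted draft) is always valid, so emit it as well.
    r.emitted.push_back(target[r.accepted]);
    return r;
}
```

Because every emitted token equals the large model's own argmax choice, the output is the same as plain greedy decoding with the large model; only the number of large-model evaluations changes.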
Here I suggest living with whatever is implemented in llama.cpp. Even if only greedy sampling worked correctly, this would still be a huge improvement, because:
Use cases!
I see another improvement that will technically be possible once everything is implemented: the ability to use two cache contexts while still running one model:
Two things need to be done: loading 2 (or more) models and contexts at the same time, and speculative decoding using 2 models.
If you implement support for several models, you can then rather easily add speculative decoding too.
Otherwise, if you want speculative decoding, you would have to implement loading of several models for it anyway.
Then, with two models in memory, you can imagine something like "model offloading" or "switching on demand", where a model may be unloaded and replaced with another model at runtime.
But those are future possible improvements, while the speculative decoding is a useful thing by itself!