speculative : add grammar support #2991
f4682ee to 11b2050
Looks good! Let me know what you think on the suggested approach, but nothing is really a blocker.
struct llama_grammar * llama_grammar_copy(const struct llama_grammar * grammar) {
    llama_grammar * result = new llama_grammar{ grammar->rules, grammar->stacks, grammar->partial_utf8 };

    // redirect elements in stacks to point to new rules
Might be a reason to allow grammar states to share a common rules array, as previously discussed for beam search - I think that would avoid the need for relocating pointers.
Yup. I guess we can make the rules std::shared_ptr so they persist as long as at least one grammar references them. The rules never mutate after we create a grammar with llama_grammar_init(), correct?
Yes, the rules are read-only. Having them alongside the actual parse state (stacks + partial UTF-8) is really just a convenience. We could maybe accomplish sharing just by splitting out the rules and managing them separately from the actual state(s), but the added complexity of that might not be worth it compared to the shared_ptr approach.
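To illustrate the shared_ptr idea discussed above, here is a minimal sketch. The types grammar_element, grammar_rules and grammar_state and the function grammar_state_copy are hypothetical, simplified stand-ins and not the actual llama.cpp definitions. The sketch assumes, as confirmed in the thread, that the rules never change after the grammar is created.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for the real llama.cpp types.
struct grammar_element {
    int      type;
    uint32_t value;
};

using grammar_rules = std::vector<std::vector<grammar_element>>;

struct grammar_state {
    // Read-only rules, shared by every copy of the grammar.
    std::shared_ptr<const grammar_rules> rules;
    // Mutable parse state: each stack holds pointers into *rules.
    std::vector<std::vector<const grammar_element *>> stacks;
};

// Copying duplicates only the parse state. The stack pointers stay valid
// without any relocation, because both copies reference the same rules
// storage, which lives until the last grammar_state releases it.
grammar_state grammar_state_copy(const grammar_state & src) {
    return src; // memberwise copy: the shared_ptr is shared, the stacks are copied
}
```

With this layout a copy costs one reference-count increment plus a copy of the stacks, and the order in which the copies are destroyed does not matter.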
examples/speculative/speculative.cpp
Outdated
// sample n_draft tokens from the draft model picking the best token
int n_past_cur = n_past_dft;
for (int i = 0; i < n_draft; ++i) {
    // remember the grammar state
    if (grammar_dft != NULL) {
        grammar_mem[i] = llama_grammar_copy(grammar_dft);
    }
I wonder if it's possible to scope grammar_dft to the drafting phase, copying it from grammar_tgt when you start drafting, and avoid the need for saving per-token state. The grammar state is entirely determined by the sequence of characters accepted. If I understand the flow here correctly, and drafting starts at the end of the token sequence accepted by the target model thus far, then it should be valid to just clone the target grammar for drafting.
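The suggestion above can be sketched with a toy example. Everything here is hypothetical: toy_grammar tracks only bracket nesting depth, whereas the real llama_grammar holds rule stacks and partial UTF-8 state, and draft() is not a function from speculative.cpp. The point is only that the draft grammar can be cloned once from the target grammar when drafting starts, with no per-token snapshots.

```cpp
#include <string>

// Toy stand-in for a grammar parse state (an assumption for illustration).
// Its state depends only on the characters accepted so far.
struct toy_grammar {
    int depth = 0;

    // Returns true if `c` is allowed in the current state.
    bool allows(char c) const {
        return c == '[' || (c == ']' && depth > 0);
    }

    // Advances the state; call only for characters that allows() permits.
    void accept(char c) {
        depth += (c == '[') ? 1 : -1;
    }
};

// Drafts up to n_draft characters from `proposal`, stopping at the first
// character the grammar rejects.
std::string draft(const toy_grammar & grammar_tgt, const std::string & proposal, int n_draft) {
    toy_grammar grammar_dft = grammar_tgt; // clone once, when drafting starts
    std::string out;
    for (char c : proposal) {
        if ((int) out.size() >= n_draft || !grammar_dft.allows(c)) {
            break;
        }
        grammar_dft.accept(c);
        out += c;
    }
    return out; // grammar_dft is discarded here; grammar_tgt is untouched
}
```

If the target model rejects part of the draft, nothing needs to be rolled back: the target grammar only ever advances over accepted characters, and the next drafting round clones it again.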
Great idea! Removed grammar_mem via ba199d8
FWIW, for the schema stuff,
ba199d8 to c79d130
I forgot about this script - thanks! Thank you very much for this review - very useful insights and the example is now much simpler.
Looks great! And you're welcome, happy to contribute!
I created a VS Code extension to support that file: https://github.com/iddar/gbnf-highlighter
Does this feature (Speculative Decoding) support multi-batch? @ggerganov
After I merge #3624 it will be possible to implement batched speculative decoding. But currently the
ref #2030
This improves upon #2926 by adding constraints on the generated text using a grammar. This helps the speculative approach, because the draft model has an easier time suggesting "correct" tokens.
This approach should be useful for things like generating JSON or other highly-structured text.
Here is an example of using this strategy to summarize a short text.
We use a LLaMA v1 30B F16 target model in combination with a LLaMA v1 7B Q4_1 draft model to achieve ~20 t/s on M2 Ultra:
speculative-grammar-0.mp4
Usage
assistant.txt