[WIP] Add Support for masking output using a Context-Free-Grammar #26520
Conversation
Thanks for your PR! I'm prompting @gante for review once he's back from leave :) I appreciate your effort and patience @jvhoffbauer!
Hey @LysandreJik, did you already have time to review?
My 2 cents: it would be desirable to have compatibility with the BNF grammars in the llama.cpp repository: https://github.com/ggerganov/llama.cpp/tree/master/grammars
Existing work in this direction (using custom …):
https://github.com/Shopify/torch-grammar
https://github.com/im-not-tom/text-generation-webui-output-template-extension/
The original llama.cpp PR:
Hey @jvhoffbauer, I will let @gante, who just came back from holiday, review as soon as he has a spare cycle; he's the owner of `generate`. Thanks for your contribution!
(it's not forgotten, it's in my queue -- I should be able to review it over the next few days)
Awesome! It's just a draft. Please let me know your thoughts and I will focus on wrapping it up into a complete PR, potentially already over the coming week.
Thank you for opening the PR @jvhoffbauer! And apologies for my delayed review 🤗
This is a very challenging PR to review. On one hand, I see the value and the possibilities it unlocks. On the other hand, it is hard to incorporate inside `generate` -- first allow me to explain the problem, to then propose an alternative avenue!
Problem: this operation depends on the tokenizer. We really want to avoid having tokenizer-related operations inside `generate`, as a) it would introduce a new degree of freedom and complexity to an already complex function; and b) tokenization happens on the CPU, which means this would be an (unexpected) cause of slowdowns for most GPU users AND would hinder/restrict our hardware optimization goals.
Proposed alternative: In technical terms, we can work around the problem described above if we iteratively call `generate` one token at a time, passing `past_key_values` around for inference speed, and performing this additional logic outside the `generate` call. My suggestion would be to add an advanced tutorial to our `transformers` documentation, using `lark` and an example like the one in your notebook (FYI, we intend to do the same with RAG). In other words, this wouldn't be a logits processor, but the code to add this feature would be just as easy to find, and our team wouldn't suffer from the complexity expansion 🤗
WDYT?
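A minimal sketch of the loop described above, assuming a causal LM and greedy decoding: it reuses `past_key_values` across steps via the model's forward pass and applies the constraint outside `generate`. The `allowed_token_ids` hook is hypothetical and stands in for the grammar-based filtering; nothing here is code from this PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; chosen here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def allowed_token_ids(generated_text: str) -> list[int]:
    """Hypothetical hook: return the token ids the grammar allows next.

    In the tutorial this would query a Lark-based parser for the regex of
    the current terminal and match it against the vocabulary.
    """
    return list(range(len(tokenizer)))  # placeholder: allow everything


prompt = "JSON: "
generated = tokenizer(prompt, return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(20):  # generate up to 20 constrained tokens
        outputs = model(
            # After the first step, only the newest token needs to be fed in,
            # because the cache already covers the rest of the sequence.
            input_ids=generated if past_key_values is None else generated[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = outputs.past_key_values
        next_logits = outputs.logits[:, -1, :]

        # Mask every token the grammar does not allow, then pick greedily.
        mask = torch.full_like(next_logits, float("-inf"))
        allowed = torch.tensor(allowed_token_ids(tokenizer.decode(generated[0])))
        mask[:, allowed] = 0.0
        next_token = torch.argmax(next_logits + mask, dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0]))
```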
@gante no worries, I am still here! I understand you do not want to make the `generate` function dependent on a CPU-bound tokenizer. The way you explain the approach makes total sense, and I trust it is the best way of integrating this functionality into `transformers`. Does it make sense if I start drafting out the code that could be used for such an article?
@jvhoffbauer absolutely! You can start by drafting a stand-alone example. After this is done, I would love to invite you to write a community blog post explaining the virtues of context-free grammars and to share a Space with this technique! 💛
Sounds great! I will prepare a draft and we can iterate.
Is it a bit overkill to introduce a dependency on Lark to implement the grammar-constrained decoding feature?
@Saibo-creator It's okay, since
For this feature, I think there are several aspects to take into account while implementing:
See #27557 for an implementation compatible with llama.cpp.
Is anyone still working on this? I need this functionality to test an idea. I can either do a quick fork for my own use or spend a little longer and make something worth sharing.
Hey @RadixSeven, due to the complexity of this feature, the HF team has decided it would be best to transition it to a separate project instead of integrating it. Feel free to check out https://github.com/epfl-dlab/transformers-CFG, which was created from this PR.
What does this PR do?
Fixes #25778
Review
@gante @LysandreJik as discussed. Also, @oobabooga @ArthurZucker @jorgemcgomes feel free to look at this.
This is in large part inspired by rellm and parserllm by @r2d4, so I am tagging him here too for visibility.
Discussion
This is still totally WIP. I added a notebook showcasing how it might work when using the generation API from a low-level perspective. Please let me know if that approach is OK. Long-term, I want to add tests.
Generally, we use Lark to parse the grammar. The grammar itself would most likely come from a text file or a string, but Lark also has a grammar object that one can use. I built a class CfgParsingStepper that we can use to get the state of the parser for any input string. It gives us the terminal symbol we are currently processing and the regex for this symbol, which we can use to find valid tokens. The CfgLogitsProcessor fetches this state on every generation step for every beam. We could consider persisting the state during generation instead of recalculating it, which might be faster.
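For illustration, here is a rough sketch of how a Lark-backed logits processor along these lines could look. It is not the implementation in this PR: the class name and constructor arguments are illustrative, the prefix check is deliberately naive (it assumes the already-generated text ends on a terminal boundary and loops over the full vocabulary), and error handling is omitted.

```python
import math
import re

import torch
from lark import Lark
from transformers import LogitsProcessor


class CfgLogitsProcessor(LogitsProcessor):
    """Masks logits so that only tokens matching an accepted terminal survive."""

    def __init__(self, grammar: str, tokenizer, prompt_length: int):
        self.parser = Lark(grammar, parser="lalr")  # interactive parsing needs LALR
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # prompt tokens to skip when re-parsing

    def _allowed_regexes(self, generated_text: str) -> list[str]:
        """Return the regexes of the terminals the grammar accepts next."""
        interactive = self.parser.parse_interactive(generated_text)
        interactive.exhaust_lexer()  # assumes the text ends on a terminal boundary
        accepted = interactive.accepts()  # set of accepted terminal names
        by_name = {t.name: t.pattern.to_regexp() for t in self.parser.terminals}
        return [by_name[name] for name in accepted if name in by_name]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        for beam, ids in enumerate(input_ids):
            text = self.tokenizer.decode(ids[self.prompt_length:])
            patterns = [re.compile(rx) for rx in self._allowed_regexes(text)]
            mask = torch.full_like(scores[beam], -math.inf)
            for token_id in range(scores.shape[-1]):
                piece = self.tokenizer.decode([token_id])
                # Very rough check: could this piece start one of the accepted terminals?
                if any(p.match(piece) for p in patterns):
                    mask[token_id] = 0.0
            scores[beam] = scores[beam] + mask
        return scores
```

A processor like this would be passed to `generate` through the standard `logits_processor=LogitsProcessorList([...])` argument.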
The whole codebase is still very much WIP. Most importantly, we still need to settle on a user-facing API along the lines of
`model.generate(..., grammar=Grammar(...))`
Anyhow, I wanted to put this out there for discussion. Let me know if it makes sense, whether you would like me to continue in this direction, or if I should change anything!