
[WIP] Add Support for masking output using a Context-Free-Grammar #26520

Open
wants to merge 2 commits into main

Conversation

@jvhoffbauer commented Oct 1, 2023

What does this PR do?

Fixes #25778

Review

@gante @LysandreJik as discussed. Also, @oobabooga @ArthurZucker @jorgemcgomes feel free to look at this.
This is in large part inspired by rellm and parserllm by @r2d4, so I am tagging him here too for visibility.

Discussion

This is totally WIP still. I added a notebook showcasing how it might work when using the generation API from a low-level perspective. Please let me know if that approach is okay. Long-term I want to add tests.
Generally, we use Lark to parse the grammar. The grammar itself would most likely come from a text file or a string, but Lark also has a grammar object that one can use. I built a class CfgParsingStepper that we can use to get the state of the parser for any input string; it gives us the terminal symbol we are currently processing and the regex for that symbol, which we can use to find valid tokens. The CfgLogitsProcessor recomputes that state every step for every beam. We could consider persisting the state during generation instead of recalculating it, which might be faster.
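For concreteness, here is a rough, self-contained sketch of what such a processor could look like. This is illustrative only and not the code in this PR: it reuses the CfgLogitsProcessor name from the description above, brute-forces a full re-parse per candidate token for clarity instead of caching parser state, and leaves out EOS handling.

```python
import regex  # third-party `regex` package; unlike `re`, it supports partial matching
import torch
from lark import Lark
from lark.exceptions import UnexpectedCharacters, UnexpectedInput
from transformers import LogitsProcessor


class CfgLogitsProcessor(LogitsProcessor):
    """Mask logits so that the generated text stays parseable under `grammar`."""

    def __init__(self, grammar: str, tokenizer, prompt_length: int = 0):
        self.parser = Lark(grammar, parser="lalr")
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # only constrain the newly generated part
        # Pre-decode the vocabulary once; decoding every token at every step is slow.
        self.token_strings = [tokenizer.decode([i]) for i in range(len(tokenizer))]
        self.terminal_regexps = {
            t.name: t.pattern.to_regexp() for t in self.parser.terminals
        }

    def _is_valid_prefix(self, text: str) -> bool:
        """True if `text` can still be extended to a string accepted by the grammar."""
        try:
            interactive = self.parser.parse_interactive(text)
            interactive.exhaust_lexer()
            return True  # everything lexed and parsed cleanly so far
        except UnexpectedCharacters as exc:
            # The tail of `text` is an incomplete terminal: accept it if it is a
            # prefix of a terminal that the parser would accept at this point.
            head, tail = text[: exc.pos_in_stream], text[exc.pos_in_stream :]
            interactive = self.parser.parse_interactive(head)
            interactive.exhaust_lexer()
            return any(
                regex.fullmatch(self.terminal_regexps[name], tail, partial=True)
                for name in interactive.accepts()
                if name in self.terminal_regexps
            )
        except UnexpectedInput:
            return False  # a complete token appeared where the grammar forbids it

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        for beam_idx, sequence in enumerate(input_ids):
            generated = self.tokenizer.decode(sequence[self.prompt_length :])
            allowed = torch.zeros_like(scores[beam_idx], dtype=torch.bool)
            for token_id, token_str in enumerate(self.token_strings[: scores.shape[-1]]):
                allowed[token_id] = self._is_valid_prefix(generated + token_str)
            scores[beam_idx] = scores[beam_idx].masked_fill(~allowed, float("-inf"))
        return scores
```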

The whole codebase is still very much WIP. Most importantly, we still need to

  • Add tests (instead of a Jupyter notebook)
  • Add error handling to make sure the parser throws an error if it is started with invalid input
  • Connect the logits processor to the actual generation APIs. I am thinking of something like model.generate(..., grammar=Grammar(...)) (see the usage sketch after this list)
  • Think about cases where you want the grammar constraint to apply only to the newly generated text and not to the input prompt
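
To sketch the intended usage while the dedicated grammar= keyword does not exist yet, the processor sketched above can be wired through the existing logits_processor argument of generate. The model name and the toy grammar below are placeholders chosen for illustration only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList

# Toy Lark grammar: the model may only answer "yes" or "no" (whitespace is ignored).
grammar = r"""
start: ANSWER
ANSWER: "yes" | "no"
%import common.WS
%ignore WS
"""

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Is the sky blue? Answer:", return_tensors="pt")
processor = CfgLogitsProcessor(
    grammar, tokenizer, prompt_length=inputs.input_ids.shape[1]
)

outputs = model.generate(
    **inputs,
    max_new_tokens=3,
    logits_processor=LogitsProcessorList([processor]),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```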

Anyhow, I wanted to put this out there for discussion. Let me know if this makes sense and whether you would like me to continue in this direction or change anything!

@LysandreJik (Member)

Thanks for your PR! I'm prompting @gante for review once he's back from leave :)

I appreciate your effort and patience @jvhoffbauer!

@jvhoffbauer (Author)

> Thanks for your PR! I'm prompting @gante for review once he's back from leave :)
>
> I appreciate your effort and patience @jvhoffbauer!

Hey @LysandreJik, did you already have time to review?

@oobabooga (Contributor)

My 2 cents: it would be desirable to have compatibility with the BNF grammars in the llama.cpp repository:

https://github.com/ggerganov/llama.cpp/tree/master/grammars

Existing work in this direction (using custom LogitsProcessor in the transformers library):

https://github.com/Shopify/torch-grammar

https://github.com/im-not-tom/text-generation-webui-output-template-extension/

The original llama.cpp PR:

ggerganov/llama.cpp#1773

@LysandreJik (Member)

Hey @jvhoffbauer, I will let @gante, who just came back from holiday, review as soon as he has a spare cycle; he's the owner of generate and should be able to provide a much better review than I can :)

Thanks for your contribution

@gante (Member) commented Oct 16, 2023

(it's not forgotten, it's in my queue -- I should be able to review it over the next few days)

@jvhoffbauer (Author)

Awesome! It’s just a draft. Please let me know your thoughts and I will focus on wrapping it up into a complete PR, potentially already over the coming week.

@gante (Member) left a comment

Thank you for opening the PR @jvhoffbauer! And apologies for my delayed review 🤗

This is a very challenging PR to review. On one hand, I see the value and the possibilities it unlocks. On the other hand, it is hard to incorporate inside generate -- first allow me to explain the problem, and then propose an alternative avenue!

Problem: this operation depends on the tokenizer. We really want to avoid having tokenizer-related operations inside generate, as a) it will introduce a new degree of freedom and complexity to an already complex function; b) tokenization happens on the CPU, which means this will be an (unexpected) cause of slowdowns for most GPU users AND will hinder/restrict our hardware optimization goals.

Proposed alternative: In technical terms, we can work around the problem I described above if we iteratively call generate one token at a time, passing around past_key_values for inference speed and performing this additional logic outside the generate call. My suggestion would be to add an advanced tutorial to our transformers documentation, using lark and an example like the one in your notebook (FYI, we intend to do the same with RAG). In other words, this wouldn't be a logits processor, but the code for this feature would be just as easy to find, and our team wouldn't suffer from the complexity expansion 🤗
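A minimal sketch of that loop, reusing the toy grammar and the Lark-based CfgLogitsProcessor sketched earlier in this thread as the grammar check (greedy decoding, no EOS handling, model and grammar are placeholders). It drives the model's forward pass directly with past_key_values rather than calling generate with max_new_tokens=1, but the idea is the same: the grammar logic lives entirely in user code, outside generate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

grammar = r"""
start: ANSWER
ANSWER: "yes" | "no"
%import common.WS
%ignore WS
"""

inputs = tokenizer("Is the sky blue? Answer:", return_tensors="pt")
input_ids = inputs.input_ids
# The grammar check happens entirely in user code, outside of `generate`.
processor = CfgLogitsProcessor(grammar, tokenizer, prompt_length=input_ids.shape[1])

past_key_values = None
for _ in range(3):  # max new tokens
    with torch.no_grad():
        out = model(
            input_ids if past_key_values is None else input_ids[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,
        )
    past_key_values = out.past_key_values
    next_logits = out.logits[:, -1, :]
    next_logits = processor(input_ids, next_logits)  # mask grammar-invalid tokens
    next_token = next_logits.argmax(dim=-1, keepdim=True)  # greedy pick
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```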

WDYT?

@jvhoffbauer (Author)

@gante no worries, I am still here!

I understand that you do not want to make the generate function dependent on a CPU-bound tokenizer. The way you explain the approach makes total sense, and I trust it is the best way of integrating this functionality into transformers.

Does it make sense if I start drafting out the code that could be used for such an article?

@gante (Member) commented Nov 8, 2023

@jvhoffbauer absolutely! You can start by drafting a stand-alone .py file (easier to see the diff and iterate than a notebook :) ), and when we're happy with it, I will share further pointers on integrating it into the documentation.

After this is done, I would love to invite you to write a community blog post explaining the virtues of context-free grammars and to share a Space with this technique! 💛

@jvhoffbauer (Author)

Sounds great! I will prepare a draft and we can iterate.

@Saibo-creator (Contributor)

Is it a bit overkill to introduce a dependency on Lark to implement the grammar-constrained decoding feature?
It seems the llama.cpp repo simply made a standalone implementation? @gante https://github.com/ggerganov/llama.cpp/tree/master/grammars

@gante (Member) commented Nov 15, 2023

@Saibo-creator It's okay, since lark is stable and has no dependencies of its own :)

@Saibo-creator (Contributor) commented Nov 15, 2023

For this feature, I think there are several aspects to take into account while implementing:

  1. How does this work with different tokenizers (BPE, Unigram, WordPiece, etc.)? For example, llama.cpp's implementation is specific to the LLaMA tokenizer; I don't think it will work for WordPiece.
  2. Unicode support. This is a bit tricky but important for multilingual usage; llama.cpp's first implementation didn't take care of that, and they later spotted and fixed it (Emojis in grammar cause assertion failure ggerganov/llama.cpp#2501).
  3. Possible compatibility with other sampling methods such as top-k sampling.
  4. [future] Compatibility with UTF-16?

@Saibo-creator (Contributor)

> My 2 cents: it would be desirable to have compatibility with the BNF grammars in the llama.cpp repository:
>
> https://github.com/ggerganov/llama.cpp/tree/master/grammars
>
> Existing work in this direction (using custom LogitsProcessor in the transformers library):
>
> https://github.com/Shopify/torch-grammar
>
> https://github.com/im-not-tom/text-generation-webui-output-template-extension/
>
> The original llama.cpp PR:
>
> ggerganov/llama.cpp#1773

See #27557 for an implementation compatible with llama-cpp

github-actions bot closed this Dec 21, 2023
@LysandreJik reopened this Dec 31, 2023
@huggingface deleted a comment from github-actions bot Dec 31, 2023
@LysandreJik added the WIP label Dec 31, 2023
@RadixSeven

Is anyone still working on this? I need this functionality to test an idea. I can either do a quick fork for my own use or spend a little longer and make something worth sharing.

@Saibo-creator (Contributor)

> Is anyone still working on this? I need this functionality to test an idea. I can either do a quick fork for my own use or spend a little longer and make something worth sharing.

Hey @RadixSeven, due to the complexity of this feature, the HF team has decided it would be best to transition it to a separate project instead of integrating it.

Feel free to check out https://github.com/epfl-dlab/transformers-CFG, which was created from this PR.

Labels: WIP
Successfully merging this pull request may close these issues.

Support for context-free-grammars (CFG) to constrain model output
6 participants