
Add assistant prefill for chat templates and TextGenerationPipeline #33198

Merged: 16 commits into main, Sep 2, 2024

Conversation

@Rocketknight1 (Member) commented Aug 29, 2024

Something that's been requested several times, both internally and on GitHub, is assistant prefill: the ability to begin the model's response for it and let it continue.

We use a slightly hacky solution suggested by @Narsil, which I think will work in every case I know of: when we do assistant prefill, we simply truncate the formatted chat at the end of the final message's text, removing any tokens that indicate the end of an assistant message. The model then simply continues from wherever the final message ends.
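
For illustration, here is a minimal sketch of that idea (not the PR's actual implementation; the checkpoint name is just an example of a model with a chat template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat = [
    {"role": "user", "content": "Write a haiku about autumn."},
    {"role": "assistant", "content": "Crisp leaves drift down"},  # prefill to be continued
]

rendered = tokenizer.apply_chat_template(chat, tokenize=False)

# Cut the rendered string right after the final message's text, dropping whatever
# the template appends afterwards (e.g. an end-of-turn token such as <|im_end|>),
# so generation picks up from the prefill instead of starting a new turn.
final_text = chat[-1]["content"]
prompt = rendered[: rendered.rindex(final_text) + len(final_text)]
```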

This PR also updates TextGenerationPipeline so that if you pass a chat where the final message is an assistant message, it will assume that you want to treat the final message as an assistant prefill. Before this PR, it would start a new message if you did this, which would generally result in malformed output because most models were not trained with multiple consecutive assistant messages.
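
A rough usage sketch of the new pipeline behaviour (the model id is illustrative; any chat model with a template behaves the same way):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

chat = [
    {"role": "user", "content": "Give me a JSON object describing a cat."},
    {"role": "assistant", "content": '{"name": "'},  # trailing assistant message acts as a prefill
]

# With this PR, the pipeline continues the final assistant message
# instead of opening a brand-new assistant turn.
out = generator(chat, max_new_tokens=40)
print(out[0]["generated_text"])  # the conversation, with the final assistant message continued
```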

Fixes #33096

Fixes #32213

@LysandreJik (Member)

Let me know when you'd like a review @Rocketknight1!

@Rocketknight1 (Member, Author) commented Aug 29, 2024

@LysandreJik should be ready for review now! I had to tweak a couple of other tests that ended in assistant messages - the output was garbage in those cases anyway, and the tests were just checking that it wasn't changing unexpectedly.

For visibility:
cc @philschmid, who requested this. cc @Narsil for TGI and @xenova for transformers.js, though note that this doesn't change Jinja behaviour at all; it just adds a flag that trims the output string in Python! Also cc @gante for generation.

See also #33096 and #32213, where this was requested.

@DIYer22 commented Aug 29, 2024

Instead of assistant_prefill, maybe just add a new parameter called prefill_last_message to apply_chat_template. I feel there's no need to limit prefill to the assistant role. That way it's compatible with the current needs and can also adapt to potential future ones. Plus, the parameter name and the corresponding changes are simple and clear enough that there won't be any misunderstandings.

@Rocketknight1 (Member, Author)

@DIYer22 I'm not sure about that - the "assistant" role is universal across all chat models that I know of. Although some users might want the model to continue a user or system message, I think that's more of a fun/research thing than a common use-case, and I think it's okay to expect people to construct their own prompts/templates for those cases.

@Tostino commented Aug 29, 2024

There are other private models that don't use the standard assistant/user roles; they will likely be forced to by this type of change, or will just use something other than chat templates for formatting their prompts (hurting standardization even more).

I don't think using a model through the standard chat/completion-style endpoint to auto-complete a user's message is a far-off use case. It's just something that OpenAI doesn't support with their endpoints, so tooling hasn't been built to support it yet. That was one of the main things I was trying to enable (along with assistant prefill).

Not to mention that chat templates are not just for chat models; there are so many ways the API can be used if things aren't artificially limited.

Edit: I am good with the approach in general though (not needing to modify the templates themselves is great).

@Rocketknight1 (Member, Author) commented Aug 29, 2024

Hmm, okay, there's more demand for it than I thought - you were right, @DIYer22!

@Tostino @DIYer22 I've replaced assistant_prefill with continue_final_message. It has the same behaviour as assistant_prefill, but will no longer raise an error if the final message is not an assistant message.
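
A rough usage sketch with the renamed argument (the chat content and checkpoint are made up for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat = [
    {"role": "user", "content": "Can you describe the weather today?"},
    {"role": "assistant", "content": "Certainly! Today the weather is"},
]

# continue_final_message=True renders the chat without the end-of-turn tokens
# after the last message, so generation continues it rather than starting a new turn.
prompt = tokenizer.apply_chat_template(
    chat, tokenize=False, continue_final_message=True
)
```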

I've kept the TextGenerationPipeline behaviour the same, however. I'll think about how I could possibly add support there without disrupting the UX for simple use-cases.

@Rocketknight1 (Member, Author) commented Aug 29, 2024

Update: Pipeline now correctly handles the continue_final_message kwarg, which overrides the default behaviour of continuing assistant messages only.
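
For example (a sketch, assuming an illustrative chat model), passing continue_final_message=True makes the pipeline continue even a trailing user message, while continue_final_message=False would force a new assistant turn even after a trailing assistant message:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

chat = [
    {"role": "user", "content": "The capital of France is"},
]

# Override the default (which only continues trailing assistant messages)
# so the trailing user message itself is continued.
out = generator(chat, max_new_tokens=5, continue_final_message=True)
```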

@Rocketknight1 (Member, Author)

@LysandreJik sorry for the delay, should be ready for actual review now!

@LysandreJik (Member) left a comment:

Ok makes sense! Thanks @Rocketknight1

Comment on lines 416 to 417:

```python
if continue_final_message is None:
    continue_final_message = prompt_text.messages[-1]["role"] == "assistant"
```
@LysandreJik (Member):

Maybe add a comment explaining what is being done under the hood here (i.e., that we assume the request is to continue the final message if the last message was already sent by the assistant)?

@Rocketknight1 (Member, Author): Done!

```python
# Here we check that passing a chat that ends in an assistant message is handled correctly
# by continuing the final message rather than starting a new one
text_generator = pipeline(
    task="text-generation", model="rocketknight1/tiny-gpt2-with-chatml-template", framework="pt"
)
```
@LysandreJik (Member):

Can you move that model to hf-internal-testing? We're trying to move away from user/contributor/maintainer-hosted checkpoints in our CI.

@Rocketknight1 (Member, Author): Done!

```python
# Here we check that passing a chat that ends in an assistant message is handled correctly
# by continuing the final message rather than starting a new one
text_generator = pipeline(
    task="text-generation", model="rocketknight1/tiny-gpt2-with-chatml-template", framework="pt"
)
```
@LysandreJik (Member):

same here

@Rocketknight1 (Member, Author): Done!

@Rocketknight1 (Member, Author) commented Aug 30, 2024

@LysandreJik all comments addressed/merged, and moved all the repos to hf-internal-testing. Also updated a couple of other cases where I was using personal repos for tests that were already merged to main. I also added a section to the chat templating docs about this.

Let me know if you're happy to merge, or if you want to re-review any of that!

@LysandreJik (Member) left a comment:

Looks good! Thanks for iterating on this @Rocketknight1

@Rocketknight1 merged commit 52a0213 into main on Sep 2, 2024 (25 checks passed) and deleted the add_assistant_prefill branch on September 2, 2024 at 12:23.
@Tostino commented Sep 2, 2024

Thank you very much for the work on this @Rocketknight1. I'm really happy you came up with a solution that helps solve all the needs brought up.

@Rocketknight1 (Member, Author)

Thanks @Tostino! Please let me know if you encounter any issues while using it - you can try it out right now by installing from main.

@Tostino commented Sep 6, 2024

@Rocketknight1 There seems to be an issue...

I was testing it the other day and just couldn't get it working...and then I needed to continue on my roadtrip. Just got back to testing and realized there is a discrepancy between the code and the documentation.

The documentation says continue_last_message, but the code actually uses continue_final_message.

@Rocketknight1 (Member, Author)

@Tostino I have no idea how I overlooked that! I think I just changed the name of the argument quite late in the PR, opening a fix now.

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…uggingface#33198)

* Add assistant prefill to chat templates

* Add assistant prefill to pipeline

* Add assistant prefill to pipeline

* Tweak another test that ended in assistant message

* Update tests that ended in assistant messages

* Update tests that ended in assistant messages

* Replace assistant_prefill with continue_final_message

* Allow passing continue_final_message to pipeline

* Small fixup

* Add continue_final_message as a pipeline kwarg

* Update docstrings

* Move repos to hf-internal-testing!

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <hi@lysand.re>

* Add explanatory comment

* make fixup

* Update chat templating docs to explain continue_last_message

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>