token healing impl #29081
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI is failing due to an automatic update in the pytest package; we are tracking it. Will let you know when it is sorted -- it will need a rebase.
Thanks for the follow-up!
(@Ayenem we're trying to fix the merge conflicts for you, and we're experimenting with a few GH permissions on our side. You may see a few test commits 🤗)
Now rebased after #29320 was merged, which was causing the last set of errors seen here. If everything went well, we should see a green CI here 🤞
@Ayenem FYI, I've reverted the tokenizer input to your original suggestion (tokenizer passed to `generate`).
It does feel better to offload the tokenizer choice and loading to the caller. Thanks again for following up on this 🙏
CI is green! It was possible :')
ping @ArthurZucker :)
Sorry for the late review on it!
Left a few nits: mostly, safely import and protect the function, as the new dependency is (or should be) optional. Potentially use our own trie?
src/transformers/generation/utils.py (outdated)

```diff
@@ -22,6 +22,7 @@

 import torch
 import torch.distributed as dist
+from pygtrie import CharTrie
```
If this is an optional dependency we need to protect the import.
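For reference, a minimal sketch of what a guarded import could look like; `is_pygtrie_available` is a hypothetical helper modeled on the existing availability checks in `transformers.utils` (such as `is_torch_available`), not something the PR defines:

```python
# Sketch only: `is_pygtrie_available` is hypothetical, modeled on the
# availability checks that already exist in transformers.utils.
import importlib.util


def is_pygtrie_available() -> bool:
    # Detect the optional dependency without importing it eagerly.
    return importlib.util.find_spec("pygtrie") is not None


if is_pygtrie_available():
    from pygtrie import CharTrie
```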
""" | ||
if tokenizer is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
if tokenizer is None: | |
""" | |
requires_backends(self, ["pygtrie"]) |
We also need to make sure this function errors out correctly if used without the dependency installed.
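A sketch of how that early error could look; `requires_backends` is the existing transformers utility, but registering `"pygtrie"` in the backend mapping it consults is an assumption here:

```python
# Sketch: fail fast with a descriptive ImportError rather than hitting a
# NameError on CharTrie later. Assumes "pygtrie" is registered as a known
# backend for requires_backends.
from transformers.utils import requires_backends


def heal_tokens(self, input_ids, tokenizer=None):
    requires_backends(self, ["pygtrie"])
    ...
```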
"argument of `generate`." | ||
) | ||
bos_id, pad_id = tokenizer.bos_token_id, tokenizer.pad_token_id | ||
vocab_trie = CharTrie(tokenizer.get_vocab()) |
BTW we have https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L52
which could be used for this? It would remove the dependency. (It might be additional work as well.)
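For illustration, a sketch of how the built-in `Trie` might replace `CharTrie` for prefix lookups. Walking its `data` attribute relies on an internal layout (nested dicts with `""` keys marking word ends) that could change, and `tokenizer` is assumed to be in scope:

```python
from transformers.tokenization_utils import Trie

trie = Trie()
for token in tokenizer.get_vocab():
    trie.add(token)


def prefix_extensions(prefix: str) -> list[str]:
    """Collect all vocab tokens starting with `prefix` (a CharTrie-style query)."""
    node = trie.data
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Depth-first walk; "" keys mark complete words in this Trie's layout.
    out, stack = [], [(prefix, node)]
    while stack:
        word, cur = stack.pop()
        for ch, nxt in cur.items():
            if ch == "":
                out.append(word)
            else:
                stack.append((word + ch, nxt))
    return out
```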
```python
input_ids = torch.where(input_ids == bos_id, pad_id, input_ids)

tail_ids = input_ids[:, -1].tolist()
space_tok = tokenizer.tokenize(" ")[0]
```
Not 100% sure this will always do what you want; specifically, for tokenizers that add a prefix token you could get `[▁▁]`.
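To make the concern concrete, a small check one could run. The checkpoint is just an example, and the printed outputs are illustrative rather than guaranteed, since behavior varies by tokenizer:

```python
from transformers import AutoTokenizer

# Any SentencePiece-style tokenizer that prepends "▁" illustrates the issue.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
print(tok.tokenize(" "))    # may yield ['▁▁'] once the prefix '▁' is added
print(tok.tokenize("a b"))  # e.g. ['▁a', '▁b'] -> the mid-text space marker is '▁'
```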
Hi @Ayenem, thanks for this feature. I was curious to look into this as an early feature to see how it works on my domain data, but I had some issues with generation using the example data provided (stacktrace attached below). Could you share an example script showing how to test it?
Environment used:
What does this PR do?
Token healing rectifies the token boundary bias in greedy tokenization. It does this by trimming and regrowing the prompt to better align with the model's tokenizer, thus enhancing generation quality. The improvement is clearest with completion models.
Token boundary bias is a silent performance killer that doesn't seem to be very well known. It has a clear impact on completion quality.
A more thorough explanation of the problem: *The Art of Prompt Design: Prompt Boundaries and Token Healing*, by Scott Lundberg.
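A hedged usage sketch, inferred from this PR's discussion (the caller passes the tokenizer to `generate`); the exact flag name `token_healing` is an assumption, not confirmed by this thread:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # any completion model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompt whose tail token ':' biases the model away from the '://' token.
inputs = tokenizer("The link is <a href='http:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=8,
    token_healing=True,   # assumed flag name
    tokenizer=tokenizer,  # caller supplies the tokenizer, per the review
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```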
Motivation

Given a completion prompt with a partial url ending with `:`, the model might have seen the expected completion `://` as a single token in training. However, the prompt's tail token `:` tells it that the next token is not `//`, and so it generates a wrong completion. Such errors compound in auto-regressive language models.

Fixes #28346
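For intuition, a minimal sketch of the trim-and-regrow idea described above — not the PR's exact implementation: drop the prompt's tail token, then constrain the first generated token to vocabulary entries whose text extends it:

```python
def heal_prompt_sketch(prompt_ids: list[int], tokenizer):
    """Return (trimmed_ids, allowed_first_ids) for constrained regrowth."""
    tail_text = tokenizer.decode(prompt_ids[-1:])
    trimmed = prompt_ids[:-1]
    # Tokens whose surface form extends the tail, e.g. ':' -> '://'.
    allowed = [
        tid
        for tok, tid in tokenizer.get_vocab().items()
        if tokenizer.convert_tokens_to_string([tok]).startswith(tail_text)
    ]
    return trimmed, allowed
```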
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?