[Tokenizer] Fix handling of out-of-vocabulary IDs in PreTrainedTokenizer #29162
Conversation
…okenizer Co-authored-by: Hanyu Wang <1065708749@qq.com>
Thanks for the PR. This is a breaking change: code that previously broke on out-of-vocabulary IDs will now silently no longer break.
I am down to add it though!
- We probably need some tests
- We need to do a deprecation cycle:
If the ID is OOV, we raise the error ourselves, saying that this behaviour will change and can be controlled through tokenizer.oov_error = "replace" or "strict" to keep the old behaviour. We'll then make "replace" the default!
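The deprecation cycle suggested above could look roughly like the following sketch. Note that the class, the `oov_error` attribute, and both mode names are taken from the review comment, not from the actual transformers API, so treat every name here as hypothetical:

```python
import warnings

# Hypothetical sketch of the suggested deprecation cycle; `oov_error` and
# the "strict"/"replace" modes come from the review comment above, not
# from the real transformers implementation.
class TokenizerSketch:
    def __init__(self, vocab, oov_error="strict"):
        self.id_to_token = dict(enumerate(vocab))
        self.oov_error = oov_error  # "strict" raises, "replace" returns ""

    def convert_id_to_token(self, idx):
        if idx in self.id_to_token:
            return self.id_to_token[idx]
        if self.oov_error == "strict":
            warnings.warn(
                "Out-of-vocabulary ids will stop raising and return '' by "
                "default in a future release; set oov_error='replace' to "
                "opt in now.",
                FutureWarning,
            )
            raise KeyError(f"id {idx} is out of vocabulary")
        return ""  # "replace": mirror the Rust tokenizer, which skips OOV ids
```

After the cycle completes, `"replace"` would become the default and the `KeyError` branch would be removed.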
Got it, thanks for the suggestion, I'll try to use the deprecation strategy and add some test cases. 😀
…cycle, and optimize the structure of the `convert_ids_to_tokens()` function. Co-authored-by: Hanyu Wang <1065708749@qq.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Just in case you may not have seen this message: what else do I need to do to finish up this PR? @ArthurZucker
Was this pull request abandoned?
Hey, not at all 🤗
A few nits and it will be alright!
This is a pretty big change, which is why we need to be careful!
I see! 😀
Feel free to ping me again once the CIs are green!
Ok, I will try my best, but I think some of the problems come from the tokenizers themselves. I can provide a report on the problems that are hard to fix. 🤝
LGTM, I want a second look from @Lysandre!
While trying to make all CI tests green, I found another inconsistency, as shown in commit 80e3436: when the decode parameter was changed from a single id to a list containing that single id, more models passed the tests. This new inconsistent behavior is demonstrated in issue #29489. To keep this pull request loosely coupled, a list containing the single id is used here so the program can run, and the issue can be addressed in other pull requests. I know the problem is getting annoying. 😂 @ArthurZucker
Could you fix the red CIs before we do another review? 🤗
```diff
- if isinstance(ids, int):
-     if ids in self._added_tokens_decoder:
-         return self._added_tokens_decoder[ids].content
-     else:
-         return self._convert_id_to_token(ids)
+ is_single_element = isinstance(ids, int)
+ if is_single_element:
+     ids = [ids]
```
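The normalization shown in this hunk can be illustrated by the following runnable sketch. It is not the actual `PreTrainedTokenizer.convert_ids_to_tokens` implementation; the helper name and the plain-dict vocabulary are stand-ins used only for demonstration:

```python
# Sketch of the single-id normalization from the diff above: wrap an int
# in a list so one code path serves both cases, then unwrap the result.
def convert_ids_to_tokens_sketch(ids, id_to_token):
    is_single_element = isinstance(ids, int)
    if is_single_element:
        ids = [ids]
    # "" stands in for out-of-vocabulary ids, per the behaviour this PR targets
    tokens = [id_to_token.get(i, "") for i in ids]
    return tokens[0] if is_single_element else tokens
```

This keeps the int and list signatures working while funneling both through the same lookup loop.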
I don't think we should address this here
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR addresses and resolves issue #29159 by fixing the inconsistent behaviors observed between slow and fast tokenizers regarding the tokenization of out-of-vocabulary (OOV) IDs.
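A toy model of the inconsistency being fixed, using plain dicts rather than the real transformers classes; the exact pre-fix slow-path behaviour is modeled here as `None` for illustration only:

```python
# Toy illustration of the slow/fast mismatch on out-of-vocabulary ids;
# the real code lives in PreTrainedTokenizer, not in these helpers.
vocab = {0: "hello", 1: "world"}

def slow_convert_old(idx):
    return vocab.get(idx)      # pre-fix slow path, modeled as None for OOV

def fast_convert(idx):
    return vocab.get(idx, "")  # Rust-backed fast path: empty string for OOV

def slow_convert_fixed(idx):
    return vocab.get(idx, "")  # after this PR: slow path matches the fast one
```

The point of the fix is that `slow_convert_fixed` and `fast_convert` now agree on every id, in vocabulary or not.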
Specifically, it manually sets the tokens for OOV IDs to empty strings, aligning the behavior with the implementation in the Rust version of the tokenizer. For reference and further understanding, the approach to handling OOV tokens in the Rust version can be examined in the models folder. This folder contains various tokenizer algorithms, each with a `model.rs` file implementing a `tokenize` method. These methods are designed to avoid UNK (unknown) tokens or None tokens. For instance, the BPE algorithm's `tokenize` method is a relevant example.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker and @younesbelkada