
[Tokenizer] Fix handling of out-of-vocabulary IDs in PreTrainedTokenizer #29162

Closed
wants to merge 13 commits

Conversation

Ki-Seki
Contributor

@Ki-Seki Ki-Seki commented Feb 21, 2024

What does this PR do?

This PR addresses and resolves issue #29159 by fixing the inconsistent behaviors observed between slow and fast tokenizers regarding the tokenization of out-of-vocabulary (OOV) IDs.

Specifically, it manually sets the tokens for OOV IDs to empty strings, aligning the behavior with the implementation in the Rust version of the tokenizer. For reference and further understanding, the approach to handling OOV tokens in the Rust version can be examined in the models folder. This folder contains various tokenizer algorithms, each with a model.rs file implementing a tokenize method. These methods are designed to avoid UNK (unknown) tokens or None tokens. For instance, the BPE algorithm's tokenize method is a relevant example.
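The fix described above can be sketched roughly as follows. This is an illustration of the intended behaviour, not the actual transformers code; the `id_to_token` table and function name are made up for the example:

```python
def convert_id_to_token(id_to_token: dict, token_id: int) -> str:
    # The slow (Python) tokenizer used to return None for ids outside the
    # vocabulary, while the fast (Rust) tokenizer never emits None/UNK here.
    # Aligning with the Rust behaviour: map out-of-vocabulary ids to "".
    return id_to_token.get(token_id, "")


vocab = {0: "[PAD]", 1: "hello", 2: "world"}
print(convert_id_to_token(vocab, 1))    # "hello"
print(convert_id_to_token(vocab, 999))  # "" (OOV id, no None and no crash)
```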

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker and @younesbelkada

…okenizer

Co-authored-by: Hanyu Wang <1065708749@qq.com>
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for the PR. This is a breaking change: cases that used to break will now silently no longer break.
I am down to add it tho!

  1. We probably need some tests.
  2. We need to do a deprecation cycle:
    If an id is OOV, we raise the error ourselves, saying that this behaviour will change and can be set through tokenizer.oov_error = "replace" or "strict" to keep the old behaviour. We'll then make "replace" the default!
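The deprecation cycle proposed above could look roughly like this. A minimal sketch only: the class is a stand-in for `PreTrainedTokenizer`, and `oov_error` is the hypothetical attribute name from the suggestion, not an existing transformers API:

```python
class TokenizerSketch:
    def __init__(self, vocab: dict, oov_error: str = "strict"):
        self.vocab = vocab
        # "strict" keeps the old raising behaviour during the deprecation
        # window; "replace" opts in to the future default early.
        self.oov_error = oov_error

    def convert_id_to_token(self, token_id: int) -> str:
        if token_id in self.vocab:
            return self.vocab[token_id]
        if self.oov_error == "strict":
            # Old behaviour, with a message announcing the coming change.
            raise KeyError(
                f"Id {token_id} is out of vocabulary. This behaviour will "
                'change to returning "" by default; set '
                'tokenizer.oov_error = "replace" to opt in early.'
            )
        return ""  # future default: silently replace OOV ids


tok = TokenizerSketch({1: "hello"}, oov_error="replace")
print(tok.convert_id_to_token(99))  # "" (no error in "replace" mode)
```

In "strict" mode the same call raises a `KeyError`, preserving backward compatibility until the default is flipped.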

@Ki-Seki
Contributor Author

Ki-Seki commented Feb 23, 2024

Got it, thanks for the suggestion, I'll try to use the deprecation strategy and add some test cases.😀

Ki-Seki and others added 3 commits February 27, 2024 03:13
…cycle, and optimize the structure of the `convert_ids_to_tokens()` function.

Co-authored-by: Hanyu Wang <1065708749@qq.com>
Co-authored-by: Hanyu Wang <1065708749@qq.com>
@Ki-Seki Ki-Seki requested a review from ArthurZucker February 26, 2024 19:50
Ki-Seki and others added 3 commits February 27, 2024 10:39
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@Ki-Seki Ki-Seki requested a review from ArthurZucker February 27, 2024 03:09
Co-authored-by: Hanyu Wang <1065708749@qq.com>
@Ki-Seki Ki-Seki requested a review from ArthurZucker February 27, 2024 06:31
@Ki-Seki
Contributor Author

Ki-Seki commented Feb 29, 2024

Just in case you may not have seen this message: what else do I need to do to finish this PR? @ArthurZucker

@Ki-Seki
Contributor Author

Ki-Seki commented Mar 2, 2024

Was this pull request abandoned? ☹️ @ArthurZucker

@ArthurZucker
Collaborator

Hey not at all 🤗

Collaborator

@ArthurZucker ArthurZucker left a comment


A few nits and it will be alright!
This is a pretty big change, which is why we need to be careful!

@Ki-Seki
Contributor Author

Ki-Seki commented Mar 2, 2024

A few nits and it will be alright! This is a pretty big change, which is why we need to be careful!

I see! 😀

@Ki-Seki Ki-Seki requested a review from ArthurZucker March 2, 2024 05:27
Co-authored-by: Hanyu Wang <1065708749@qq.com>
@ArthurZucker
Collaborator

Feel free to ping me again once the CIs are green!

@Ki-Seki
Contributor Author

Ki-Seki commented Mar 4, 2024

Feel free to ping me again once the CIs are green!

Ok, I will try my best, but I think some of the failures come from the tokenizers themselves. I can provide a report on the problems that are hard to fix. 🤝

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM, I want a second look from @Lysandre!

@Ki-Seki
Contributor Author

Ki-Seki commented Mar 6, 2024

While trying to make all CI tests green, I found another inconsistency, as shown in commit 80e3436: when the `decode` argument was changed from a single id to a list containing that id, more models passed the tests. This new inconsistent behavior is demonstrated in issue #29489. To keep this pull request loosely coupled, a one-element list is used here so the code runs, and the issue itself can be addressed in separate pull requests. I know the problem is getting annoying. 😂 @ArthurZucker
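The workaround mentioned above amounts to normalizing a lone id into a one-element list before decoding, so both call shapes take the same code path. A small illustration; the helper name is made up:

```python
def normalize_ids(ids):
    # Some tokenizers behave differently for decode(5) vs decode([5]);
    # wrapping a single int makes the downstream code path uniform.
    if isinstance(ids, int):
        return [ids]
    return list(ids)


print(normalize_ids(5))       # [5]
print(normalize_ids([1, 2]))  # [1, 2]
```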

@Ki-Seki Ki-Seki requested a review from ArthurZucker March 6, 2024 15:46
@ArthurZucker ArthurZucker requested review from LysandreJik and removed request for LysandreJik March 7, 2024 04:21
@ArthurZucker
Collaborator

Could you fix the red CIs before we do another review? 🤗

Comment on lines -969 to +982
- if isinstance(ids, int):
-     if ids in self._added_tokens_decoder:
-         return self._added_tokens_decoder[ids].content
-     else:
-         return self._convert_id_to_token(ids)
+ is_single_element = isinstance(ids, int)
+ if is_single_element:
+     ids = [ids]
Collaborator


I don't think we should address this here.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Apr 8, 2024