[Tokenizer] Fix handling of out-of-vocabulary IDs in PreTrainedTokenizer #29162
Conversation
…okenizer Co-authored-by: Hanyu Wang <1065708749@qq.com>
Thanks for the PR. This is a breaking change: code that previously broke on out-of-vocabulary IDs will now silently no longer break.
I am down to add it though!
- We probably need some tests
- We need to do a deprecation cycle:
If the ID is OOV, we raise the error ourselves, saying that this behaviour will change and can be controlled through tokenizer.oov_error = "replace" or "strict" to keep the old behaviour. We'll then make "replace" the default!
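The deprecation cycle suggested above could look roughly like the following sketch. Note that the class, the `oov_error` attribute, and both mode names are taken from the review comment, not from the actual transformers API, so treat every name here as hypothetical:

```python
import warnings

# Hypothetical sketch of the suggested deprecation cycle; `oov_error` and
# the "strict"/"replace" modes come from the review comment above, not
# from the real transformers implementation.
class TokenizerSketch:
    def __init__(self, vocab, oov_error="strict"):
        self.id_to_token = dict(enumerate(vocab))
        self.oov_error = oov_error  # "strict" raises, "replace" returns ""

    def convert_id_to_token(self, idx):
        if idx in self.id_to_token:
            return self.id_to_token[idx]
        if self.oov_error == "strict":
            warnings.warn(
                "Out-of-vocabulary ids will stop raising and return '' by "
                "default in a future release; set oov_error='replace' to "
                "opt in now.",
                FutureWarning,
            )
            raise KeyError(f"id {idx} is out of vocabulary")
        return ""  # "replace": mirror the Rust tokenizer, which skips OOV ids
```

After the cycle completes, `"replace"` would become the default and the `KeyError` branch would be removed.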
Got it, thanks for the suggestion, I'll try to use the deprecation strategy and add some test cases. 😀
…cycle, and optimize the structure of the `convert_ids_to_tokens()` function. Co-authored-by: Hanyu Wang <1065708749@qq.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Just in case you may not have seen this message: what else do I need to do to finish up this PR? @ArthurZucker
Was this pull request abandoned?
Hey, not at all 🤗
A few nits and it will be alright!
This is a pretty big change, which is why we need to be careful!
I see! 😀
Feel free to ping me again once the CIs are green!
Ok, I will try my best, but I think some of the problems come from the tokenizers themselves. I can provide a report on the problems that are hard to fix. 🤝
LGTM, I want a second look from @Lysandre!
While trying to make all CI tests green, I found another inconsistency, as shown in commit 80e3436: when the decode parameter was changed from a single id to a list containing that single id, more models passed the tests. This new inconsistent behavior is demonstrated in issue #29489. To keep this pull request loosely coupled, a list containing the single id is used here so the program can run, and the issue can be addressed in other pull requests. I know the problem is getting annoying. 😂 @ArthurZucker
Could you fix the red CIs before we do another review? 🤗
```diff
- if isinstance(ids, int):
-     if ids in self._added_tokens_decoder:
-         return self._added_tokens_decoder[ids].content
-     else:
-         return self._convert_id_to_token(ids)
+ is_single_element = isinstance(ids, int)
+ if is_single_element:
+     ids = [ids]
```
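The normalization shown in this hunk can be illustrated by the following runnable sketch. It is not the actual `PreTrainedTokenizer.convert_ids_to_tokens` implementation; the helper name and the plain-dict vocabulary are stand-ins used only for demonstration:

```python
# Sketch of the single-id normalization from the diff above: wrap an int
# in a list so one code path serves both cases, then unwrap the result.
def convert_ids_to_tokens_sketch(ids, id_to_token):
    is_single_element = isinstance(ids, int)
    if is_single_element:
        ids = [ids]
    # "" stands in for out-of-vocabulary ids, per the behaviour this PR targets
    tokens = [id_to_token.get(i, "") for i in ids]
    return tokens[0] if is_single_element else tokens
```

This keeps the int and list signatures working while funneling both through the same lookup loop.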
I don't think we should address this here
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR addresses and resolves issue #29159 by fixing the inconsistent behaviors observed between slow and fast tokenizers regarding the tokenization of out-of-vocabulary (OOV) IDs.
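A toy model of the inconsistency being fixed, using plain dicts rather than the real transformers classes; the exact pre-fix slow-path behaviour is modeled here as `None` for illustration only:

```python
# Toy illustration of the slow/fast mismatch on out-of-vocabulary ids;
# the real code lives in PreTrainedTokenizer, not in these helpers.
vocab = {0: "hello", 1: "world"}

def slow_convert_old(idx):
    return vocab.get(idx)      # pre-fix slow path, modeled as None for OOV

def fast_convert(idx):
    return vocab.get(idx, "")  # Rust-backed fast path: empty string for OOV

def slow_convert_fixed(idx):
    return vocab.get(idx, "")  # after this PR: slow path matches the fast one
```

The point of the fix is that `slow_convert_fixed` and `fast_convert` now agree on every id, in vocabulary or not.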
Specifically, it manually sets the tokens for OOV IDs to empty strings, aligning the behavior with the implementation in the Rust version of the tokenizer. For reference and further understanding, the approach to handling OOV tokens in the Rust version can be examined in the models folder. This folder contains various tokenizer algorithms, each with a `model.rs` file implementing a `tokenize` method. These methods are designed to avoid UNK (unknown) tokens or None tokens. For instance, the BPE algorithm's `tokenize` method is a relevant example.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker and @younesbelkada