[Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID #29489
Comments
That's very interesting, and I can confirm we have this issue.
Yes, you're right. I added this test case:

```python
def test_single_id(self):
    tokenizer = self.get_tokenizer()
    rust_tokenizer = self.get_rust_tokenizer()
    vocab_size = len(tokenizer)
    int_single_id = vocab_size - 1
    list_single_id = [vocab_size - 1]
    self.assertEqual(tokenizer.decode(int_single_id), tokenizer.decode(list_single_id))
    self.assertEqual(rust_tokenizer.decode(int_single_id), rust_tokenizer.decode(list_single_id))
```

The test results are below (scroll to the bottom to view the 33 failed models).
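The invariant that test asserts can be sketched with a hypothetical toy tokenizer (not the real transformers classes, which require loading a model): `decode` should return the same text whether it is given a bare ID or a one-element list.

```python
# Hypothetical toy tokenizer, only to illustrate the invariant checked above:
# decode(id) == decode([id]).
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab  # maps token ID -> token string

    def decode(self, token_ids):
        # Normalize a bare int to a one-element list so both call
        # styles take the same code path.
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        return "".join(self.vocab[i] for i in token_ids)

tok = ToyTokenizer({0: "hello", 1: "##~"})
assert tok.decode(1) == tok.decode([1]) == "##~"
```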
Feel free to open a PR for a fix. IMO we should not have spaces added in this case.
No problem, I will try to do this, but there is some other research work that needs to be pushed forward right now, so I may get to it later.
Hi :)
I ran a couple of the test cases reported as failing above with a slightly modified version of the test function proposed by @Ki-Seki, and they pass now.
Unfortunately, I can't run all of the test cases (I keep running into odd Python segmentation faults that occur even without having changed the library at all). Does anyone know a trick for running the test cases anyway, or is it OK if I create a pull request and wait for the CI tests?
You can create a PR and rely on the CIs for sure! 🤗
Hello @ArthurZucker and all. It looks to me like the problem is a signature mismatch between:

- transformers/src/transformers/tokenization_utils_base.py, lines 3913 to 3915 in 74b92c6
- transformers/src/transformers/tokenization_utils.py, lines 1062 to 1064 in 74b92c6
- transformers/src/transformers/tokenization_utils_fast.py, lines 640 to 642 in 74b92c6
Consequently, the slow tokenizer's `_decode` handles only a list of IDs, not a single ID. If `filtered_tokens` is a single string rather than a list of strings, the loop iterates over its characters and processes each one, so @MariaHei is totally right.
Also, there are not many decoding tests, though lots of encoding tests 😊 I added a quick signature fix and return statements, and also added some decode tests in my PR.
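The string-iteration pitfall described in that comment can be sketched in a few lines: a join helper that expects a list of tokens will silently iterate over the characters of a bare string (`join_tokens` here is a hypothetical stand-in, not the library code).

```python
# Sketch of the pitfall: if a helper expects a list of tokens but receives a
# bare string, Python's for-loop iterates over the string's characters.
def join_tokens(filtered_tokens):
    sub_texts = []
    for token in filtered_tokens:  # iterates characters when given a str
        sub_texts.append(token)
    return " ".join(sub_texts)

print(join_tokens(["##~"]))  # ##~
print(join_tokens("##~"))    # # # ~
```

This is exactly why decoding a single ID can come back with spaces inserted between what should be one sub-token.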
System Info
transformers version: 4.39.0.dev0

Who can help?
@ArthurZucker
Information
Tasks

- examples folder (such as GLUE/SQuAD, ...)

Reproduction
Code
Output
Expected behavior
Consistent behaviors. For example, when decoding the single ID, the output could also be `##~`.

Suspected rationale: in `src/transformers/tokenization_utils.py`, the `_decode` function incorrectly uses `spaces_between_special_tokens` and then adds spaces between the sub-tokens.
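A minimal sketch of the kind of fix the thread converges on, assuming a hypothetical `decode_ids` helper and `vocab` mapping (this is not the actual transformers implementation): accept both a single ID and a list of IDs in the signature, normalize the int case up front, and join sub-tokens without inserting spaces.

```python
from typing import List, Union

# Hedged sketch, not the real transformers _decode: a decode-style helper
# whose signature accepts both a single ID and a list of IDs. `vocab` is a
# hypothetical id-to-token mapping used only for illustration.
def decode_ids(token_ids: Union[int, List[int]], vocab: dict) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]  # a single ID takes the same path as a list
    tokens = [vocab[i] for i in token_ids]
    return "".join(tokens)  # no spaces between sub-tokens like "##~"

vocab = {7: "##~"}
assert decode_ids(7, vocab) == decode_ids([7], vocab) == "##~"
```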