-
Notifications
You must be signed in to change notification settings - Fork 27.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clean_up_tokenization_spaces=False if unset #31938
clean_up_tokenization_spaces=False if unset #31938
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
e689ec3
to
ec6f78a
Compare
tests/test_tokenization_common.py
Outdated
@@ -4247,52 +4247,6 @@ def test_save_slow_from_fast_and_reload_fast(self): | |||
# Should not raise an error | |||
self.rust_tokenizer_class.from_pretrained(tmp_dir_2) | |||
|
|||
# TODO This is ran for all models but only tests bert... | |||
def test_clean_up_tokenization_spaces(self): | |||
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this only ever tested Bert, so I don't think this is valuable to keep or to update to be tested for each model, because it behaves differently with special tokens and might have to be customized for some models. Since we will deprecate, I don't think it's useful to start maintaining this test now for all models!
@@ -79,7 +79,7 @@ def get_tokenizer(self, **kwargs): | |||
# Copied from transformers.tests.models.gpt2.test_tokenization_gpt2.GPT2TokenizationTest.get_input_output_texts | |||
def get_input_output_texts(self, tokenizer): | |||
input_text = "lower newer" | |||
output_text = "lower newer" | |||
output_text = "lower[SPACE]newer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing sets a small vocab here, so this should be the expected behaviour. see unmodified test ClvpTokenizationTest --> test_full_tokenizer
for an example where [SPACE]
was expected
self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"]) | ||
self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"]) | ||
self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"]) | ||
self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tokenizer.word_delimiter_token
is replaced with
. See :
def convert_tokens_to_string(self, tokens: List[str]) -> str:
"""
Converts a connectionist-temporal-classification (CTC) output tokens into a single string.
"""
...
# replace delimiter token
string = "".join([" " if token == self.word_delimiter_token else token for token in filtered_tokens]).strip()
if self.do_lower_case:
string = string.lower()
return string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not have to do this! maybe cleanup tokenization should be true for wav2vec2 no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so too but the sample_ids
have tokenizer.word_delimiter_token_id
which is a space " "
so I think it would be expected in the output? wdyt @ArthurZucker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, let's make sure we keep the default to True for now!
45b6a0e
to
b7b2b09
Compare
i believe the issue is if "clean_up_tokenization_spaces" not in kwargs:
warnings.warn(
"`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This "
"behavior will be depracted in transformers v4.45, and will be then set to `False` by default. "
"For more details check this issue: https://github.com/huggingface/transformers/issues/31884",
FutureWarning,
)
# By default, cleaning tokenization spaces for both fast and slow tokenizers
self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True) by default it setting but i don't understand where to update it to remove the warning completely |
@rishi23root I'll update the message! |
self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"]) | ||
self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"]) | ||
self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"]) | ||
self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not have to do this! maybe cleanup tokenization should be true for wav2vec2 no?
warnings.warn( | ||
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.", | ||
FutureWarning, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not warn, we won't remove it!
warnings.warn( | |
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.", | |
FutureWarning, | |
) |
* clean_up_tokenization_spaces=False if unset * deprecate warning * updating param for old models * update models * make fix-copies * fix-copies and update bert models * warning msg * update prophet and clvp * updating test since space before is arbitrarily removed * remove warning for 4.45
* clean_up_tokenization_spaces=False if unset * deprecate warning * updating param for old models * update models * make fix-copies * fix-copies and update bert models * warning msg * update prophet and clvp * updating test since space before is arbitrarily removed * remove warning for 4.45
FUTURE DEPRECATION
fixes #31884
start of deprecating
clean_up_tokenization_spaces
. Right now it defaults to True, update to default to False.clean_up_tokenization_spaces=True
, so set this in class initclean_up_tokenization_spaces=True
.@ArthurZucker