Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean_up_tokenization_spaces=False if unset #31938

Merged
merged 10 commits into from
Sep 26, 2024

Conversation

itazap
Copy link
Collaborator

@itazap itazap commented Jul 12, 2024

FUTURE DEPRECATION

fixes #31884

start of deprecating clean_up_tokenization_spaces. Right now it defaults to True, update to default to False.

  • BERT based models need clean_up_tokenization_spaces=True, so set this in class init
  • some models like wav2vec needed test updated since they don't really expect clean_up_tokenization_spaces=True.

@ArthurZucker

@itazap itazap requested a review from ArthurZucker July 12, 2024 16:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap force-pushed the clean_up_tokenization_spaces_false_default branch 4 times, most recently from e689ec3 to ec6f78a Compare July 25, 2024 08:23
@@ -4247,52 +4247,6 @@ def test_save_slow_from_fast_and_reload_fast(self):
# Should not raise an error
self.rust_tokenizer_class.from_pretrained(tmp_dir_2)

# TODO This is ran for all models but only tests bert...
def test_clean_up_tokenization_spaces(self):
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
Copy link
Collaborator Author

@itazap itazap Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only ever tested Bert, so I don't think this is valuable to keep or to update to be tested for each model, because it behaves differently with special tokens and might have to be customized for some models. Since we will deprecate, I don't think it's useful to start maintaining this test now for all models!

@itazap itazap marked this pull request as ready for review July 26, 2024 10:29
@itazap itazap requested a review from ArthurZucker July 26, 2024 10:30
@@ -79,7 +79,7 @@ def get_tokenizer(self, **kwargs):
# Copied from transformers.tests.models.gpt2.test_tokenization_gpt2.GPT2TokenizationTest.get_input_output_texts
def get_input_output_texts(self, tokenizer):
input_text = "lower newer"
output_text = "lower newer"
output_text = "lower[SPACE]newer"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing sets a small vocab here, so this should be the expected behaviour. see unmodified test ClvpTokenizationTest --> test_full_tokenizer for an example where [SPACE] was expected

self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tokenizer.word_delimiter_token is replaced with . See :

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts a connectionist-temporal-classification (CTC) output tokens into a single string.
        """
      ...
        # replace delimiter token
        string = "".join([" " if token == self.word_delimiter_token else token for token in filtered_tokens]).strip()

        if self.do_lower_case:
            string = string.lower()

        return string
        

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not have to do this! maybe cleanup tokenization should be true for wav2vec2 no?

Copy link
Collaborator Author

@itazap itazap Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too but the sample_ids have tokenizer.word_delimiter_token_id which is a space " " so I think it would be expected in the output? wdyt @ArthurZucker

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, let's make sure we keep the default to True for now!

src/transformers/tokenization_utils_base.py Outdated Show resolved Hide resolved
src/transformers/tokenization_utils_base.py Show resolved Hide resolved
@itazap itazap force-pushed the clean_up_tokenization_spaces_false_default branch from 45b6a0e to b7b2b09 Compare September 6, 2024 11:29
@rishi23root
Copy link

i believe the issue is

if "clean_up_tokenization_spaces" not in kwargs:
            warnings.warn(
                "`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This "
                "behavior will be depracted in transformers v4.45, and will be then set to `False` by default. "
                "For more details check this issue: https://github.com/huggingface/transformers/issues/31884",
                FutureWarning,
            )

# By default, cleaning tokenization spaces for both fast and slow tokenizers
self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)

by default it setting but i don't understand where to update it to remove the warning completely

@itazap
Copy link
Collaborator Author

itazap commented Sep 14, 2024

@rishi23root I'll update the message!

@itazap itazap requested a review from ArthurZucker September 20, 2024 09:21
self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not have to do this! maybe cleanup tokenization should be true for wav2vec2 no?

Comment on lines 1611 to 1614
warnings.warn(
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
FutureWarning,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not warn, we won't remove it!

Suggested change
warnings.warn(
"The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
FutureWarning,
)

@ArthurZucker ArthurZucker merged commit 6730485 into main Sep 26, 2024
21 of 24 checks passed
@ArthurZucker ArthurZucker deleted the clean_up_tokenization_spaces_false_default branch September 26, 2024 17:38
ArthurZucker pushed a commit that referenced this pull request Sep 26, 2024
* clean_up_tokenization_spaces=False if unset

* deprecate warning

* updating param for old models

* update models

* make fix-copies

* fix-copies and update bert models

* warning msg

* update prophet and clvp

* updating test since space before is arbitrarily removed

* remove warning for 4.45
@itazap itazap mentioned this pull request Sep 27, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* clean_up_tokenization_spaces=False if unset

* deprecate warning

* updating param for old models

* update models

* make fix-copies

* fix-copies and update bert models

* warning msg

* update prophet and clvp

* updating test since space before is arbitrarily removed

* remove warning for 4.45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GPT-2 tokenizer is NOT invertible
4 participants