Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GemmaConverter] use user_defined_symbols #29473

Merged
merged 17 commits into from
Mar 19, 2024
Merged

[GemmaConverter] use user_defined_symbols #29473

merged 17 commits into from
Mar 19, 2024

Conversation

ArthurZucker
Copy link
Collaborator

What does this PR do?

Fixes #29440 by always adding the user_defined_symbols to the tokenizer, otherwise because no merges exist for them, they will be split.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from pcuenca March 6, 2024 02:42
Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! There are a few additional tokens inuser_defined_symbols, I'll push PRs to the repos.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 6, 2024 13:47
@ArthurZucker
Copy link
Collaborator Author

I will add a test and we can merge

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood! cc @ydshieh that this will introduce some failing tests in the slow CI, we're aware (maybe we can skip those until we get to it? after the release)

@@ -52,7 +52,7 @@
@require_sentencepiece
@require_tokenizers
class LlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
from_pretrained_id = "hf-internal-testing/llama-tokenizer"
from_pretrained_id = ["hf-internal-testing/llama-tokenizer", "meta-llama/Llama-2-7b-hf"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@ArthurZucker ArthurZucker changed the title [GemmaConverte] use user_defined_symbols [GemmaConverter] use user_defined_symbols Mar 19, 2024
@ydshieh
Copy link
Collaborator

ydshieh commented Mar 19, 2024

Thank you for warning me. But could you share with me the reason to merge before we resolve the failing tests?

@ArthurZucker
Copy link
Collaborator Author

We need to release !

@ydshieh
Copy link
Collaborator

ydshieh commented Mar 19, 2024

It would be very appreciated that the expected failing tests to be skipped temporarily (before they would be fixed/updated after the release).

@ArthurZucker
Copy link
Collaborator Author

We can open a PR to skip / fix them tomorrow!

@ArthurZucker
Copy link
Collaborator Author

Failing tests are unrelated, merging!

@ArthurZucker ArthurZucker merged commit 2f9a3ed into main Mar 19, 2024
19 of 21 checks passed
@ArthurZucker ArthurZucker deleted the fix-spm-converter branch March 19, 2024 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AutoTokenizer tokenization issue
5 participants