Fixing tokenizer when `transformers` is installed without `tokenizers` #26236
Conversation
Hey! Thanks for the catch! 🤗 It seems indeed that the representation does not show this. `__repr__` should also be updated to show the content, `single_word`, etc., the same as if `tokenizers` were present.
Thanks @ArthurZucker, I added a …
Thanks. Let's try to match the fast implementation!
```python
def __repr__(self):
    return self.content
```
Nope, let's either take the default repr from the dataclass overload, or when we print the added token we need to have: `AddedToken("▁<PRE>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=True)`
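The dataclass-default behaviour suggested above can be sketched as follows. This is a hypothetical, simplified `AddedToken` stand-in (not the actual transformers implementation); the field names mirror the ones discussed in this thread:

```python
from dataclasses import dataclass

# Hypothetical sketch of a slow (pure-Python) AddedToken; the real class in
# transformers may differ in fields and defaults.
@dataclass
class AddedToken:
    content: str
    rstrip: bool = False
    lstrip: bool = False
    single_word: bool = False
    normalized: bool = True
    special: bool = False

    def __str__(self):
        # str() yields the raw token text, so code doing string
        # comparisons against the token keeps working.
        return self.content


token = AddedToken("<s>", lstrip=True, special=True)

# The @dataclass-generated __repr__ shows every field, matching what the
# fast (tokenizers-backed) representation displays.
print(repr(token))
# AddedToken(content='<s>', rstrip=False, lstrip=True, single_word=False, normalized=True, special=True)
print(str(token))
# <s>
```

With this split, `repr()` stays informative for debugging while `str()` remains the bare token content.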
This reverts commit 1883950.
Hi @ArthurZucker, I'm not sure that the … Most of the special tokens are strings rather than …, so I am keeping only the … Thanks,
Thanks, I'll open a follow-up PR to update the `__repr__` then.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for contributing this fix back to upstream, @urialon!
What does this PR do?
This PR fixes the tokenization of the `<s>` and `</s>` tokens when `transformers` is installed but `tokenizers` is not, by fixing the string representation of the `AddedToken` class.

Using `transformers` without `tokenizers` installed results in the following problem:

Prints:
In other words, the tokenizer knows that the correct IDs for `<s>` and `</s>` are `0` and `2`, but when encoding an arbitrary string it adds the new IDs `50265` and `50266` (which are not known to the model!).

With this fix, the tokenizer does not add token IDs 50265 and up, because it recognizes those tokens as already existing at IDs 0-3.
Then, encoding a string using the tokenizer results in adding `0` and `2` as the `<s>` and `</s>` tokens. The two lines of:

result in the same output of `[0, 102, 2]`, as expected.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @ydshieh @hvaara @amyeroberts