Detokenization discrepancy with Llama3.1 #35175
Comments
As you mentioned, |
Is it a problem to be fixed in the code in master or should we set it manually every time?
@ArthurZucker yes, same question as @denadai2. It seems counterintuitive for the tokenizer's default to be |
It IS counter intuitive, but we can't easily break stuff in |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
System Info
transformers version: 4.47.0
Who can help?
@ArthurZucker @itazap
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Spaces are being stripped from the space-prefixed token `Ġ'` when it is followed by a common abbreviation (e.g., `n't`, `'m`, `'s`, `'ve`), even when it is not appropriate to do so. This happens because `clean_up_tokenization_spaces` is True by default for the Llama 3.1 tokenizer.
Produces
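To illustrate the mechanism without loading the model, here is a minimal sketch (not the script from this report). It assumes the cleanup step amounts to a fixed list of plain string replacements, which is my understanding of the library's `clean_up_tokenization` helper. The function name `clean_up_tokenization_sketch` and the example sentence are hypothetical.

```python
# Sketch of the cleanup applied when clean_up_tokenization_spaces=True.
# Assumption: the library performs plain substring replacements like these.
CLEANUP_RULES = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'"),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_up_tokenization_sketch(text: str) -> str:
    """Apply each replacement in order, wherever the pattern occurs."""
    for spaced, joined in CLEANUP_RULES:
        text = text.replace(spaced, joined)
    return text


original = "He called it 'super' and 'very' good"
decoded = clean_up_tokenization_sketch(original)

# The quoted words start with "'s" and "'ve", so the preceding spaces are lost.
print(repr(original))
print(repr(decoded))
```

With the real tokenizer, passing `clean_up_tokenization_spaces=False` to `tokenizer.decode(...)` should skip this step for that call.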
Expected behavior
I would expect the `original` string to match the `decoded` string in all cases unless it actually contains "traditional" tokenization spacing (e.g., `it 's` vs `it's`). Perhaps a good approach could be to modify the clean_up_tokenization function to only apply this rule when the common abbreviation is followed immediately by another space.
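The suggestion above could be sketched roughly as follows. This is only an illustration of the idea, not a patch against the library. The name `clean_up_contractions_proposed` is hypothetical, and the regular expression covers only the contraction rules, taking "followed immediately by another space" literally.

```python
import re

# Remove the space before a contraction suffix only when the suffix is itself
# followed by a space, i.e. when it stands alone as in "it 's fine".
_SPACED_CONTRACTION = re.compile(r" (n't|'m|'s|'ve|'re)(?= )")


def clean_up_contractions_proposed(text: str) -> str:
    """Join detached contraction suffixes; leave quoted words untouched."""
    return _SPACED_CONTRACTION.sub(r"\1", text)


# Traditional tokenization spacing is still cleaned up.
print(clean_up_contractions_proposed("it 's fine and I do n't mind"))

# A quoted word that merely starts with 's or 've keeps its leading space.
print(clean_up_contractions_proposed("He called it 'super' and 'very' good"))
```

A suffix at the very end of the string, or one followed by punctuation, would not be joined by this literal reading, so a real fix would probably need to decide how to treat those cases.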