Inconsistent number of vocab from pretrained T5Tokenizer and T5ForConditionalGeneration #4875
Comments
Hey @cstorm125, I think those …
Hi all, I ran into this too, and I did find a bug as a result of this mismatch: I tried to resize the embedding to be smaller and got a CUDA assert error. See the bug report.
I found this mismatch recently and I think it may result in many bugs. I wish someone would fix it.
This is wrong; it shouldn't be this way. If the model predicts a wrong index, it will cause serious issues when you calculate the loss. It's hard to believe no one cares about this.
Hey @s4sarath, during training all input_ids and labels are defined by the tokenizer. If the tokenizer has a vocab_size of 32000, there is no way that it will tokenize to an id >= 32000, neither for input_ids nor for labels. Because no label ever has an id >= 32000, the model learns to never predict those ids. I don't really see a problem with this, to be honest.
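As a quick sanity check of that claim, here is a minimal sketch (assuming the t5-base checkpoint, which is not specified in this thread): the tokenizer never emits ids at or above its own length, so no label can point at the extra embedding rows.

```python
# Minimal sanity check (assuming the t5-base checkpoint): the tokenizer never
# emits ids at or above len(tokenizer), so labels never reference the extra
# embedding rows beyond 32100.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
ids = tokenizer("translate English to German: The house is wonderful.").input_ids

print(len(tokenizer))             # 32100
print(max(ids) < len(tokenizer))  # True
```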
Hi Patrick,
Thanks for the reply.
If the embedding matrix is 32128 x d, then if the predicted id is, say, 32099 and we are using the SentencePiece tokenizer (not the Hugging Face one), it will fail to decode it.
And the special tokens (100 tokens) are added as extras, right? They are not actually part of the official SentencePiece model. That's why I said it shouldn't be that way.
Thanks anyway, I really appreciate your reply. :-)
Upvoting this. Another subtle bug this causes is when doing prompt tuning. The common way to do it is to call …
Temporary solution: call model.resize_token_embeddings(len(tokenizer)) after loading the model.
I just found that the generate function sometimes produces ids > 32100, especially when I evaluate a fine-tuned model at a very early step during training. Thanks, @Darshan2104! model.resize_token_embeddings(len(tokenizer)) temporarily resolves my issue.
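For reference, a minimal sketch of that workaround (assuming the t5-base checkpoint):

```python
# Workaround sketch (assuming the t5-base checkpoint): shrink the embedding
# matrix to len(tokenizer) == 32100 so generate() cannot return ids that the
# tokenizer is unable to decode.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # torch.Size([32100, 768])
```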
I am also facing the …
I tried this, but it is not helping.
@kanak8278, could you double-check that you are using the right tokenizer for the model? For the model, could you show me what happens when you run this code? {n: p.shape for n, p in model.named_parameters() if "embedding" in n} For the tokenizer, could you do …? And then could you do this on your input ids? …
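Put together, those checks look roughly like the sketch below, using get_input_embeddings() as a direct way to read the embedding shape; the checkpoint name and example sentence are stand-ins for whatever is actually being loaded.

```python
# Debugging sketch: compare the model's embedding shape with the tokenizer's
# length and with the largest id actually fed to the model.
# "t5-base" and the example sentence are placeholders.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
input_ids = tokenizer("translate English to German: Hello.", return_tensors="pt").input_ids

print(model.get_input_embeddings().weight.shape)  # e.g. torch.Size([32128, 768])
print(len(tokenizer))                             # e.g. 32100
print(int(input_ids.max()))                       # must stay below the embedding's first dimension
```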
This is a bit troubling, especially because I'm only interested in using a model for inference. I'm generating some sequences using multinomial sampling from …
The model was never incentivized to predict those tokens, so the weights for the tokens with ids > len(tokenizer) will have extraordinarily low scores. I did a quick test, and the scores for those extra tokens summed to something on the order of 1e-30 per token, which is basically 0. Could you share your sampling approach?
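A rough version of that check, as a sketch assuming the t5-base checkpoint: sum the probability mass the model assigns to the 28 extra ids at the first decoding step.

```python
# Sketch (assuming the t5-base checkpoint): the probability mass assigned to
# the extra ids (32100-32127) at the first decoding step should be negligible.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits  # shape (1, 1, 32128)

probs = logits[0, -1].softmax(dim=-1)
print(probs[len(tokenizer):].sum().item())  # vanishingly small (the comment above reports ~1e-30 per token)
```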
Never mind, it was an error on my end! I apologize for the confusion! I thought I had tried everything and was desperate.
❓ Questions & Help
Pretrained T5Tokenizer has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of T5ForConditionalGeneration has size (32128, 768). I checked the google-research implementation of T5 and found that it also has a vocab size of 32100. Where did the extra 28 embeddings come from, and how can we map them to the tokenizer?
To reproduce
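A minimal sketch along these lines (assuming the t5-base checkpoint) shows the mismatch:

```python
# Reproduction sketch (assuming the t5-base checkpoint): the tokenizer reports
# 32100 tokens while the shared embedding matrix has 32128 rows.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

print(len(tokenizer))                             # 32100
print(model.get_input_embeddings().weight.shape)  # torch.Size([32128, 768])
```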
Output: the tokenizer reports a vocab size of 32100, while the model's shared embedding has shape (32128, 768).