
Inconsistent number of vocab from pretrained T5Tokenizer and T5ForConditionalGeneration #4875

Closed
cstorm125 opened this issue Jun 9, 2020 · 15 comments

@cstorm125

❓ Questions & Help

The pretrained T5Tokenizer has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of T5ForConditionalGeneration has shape (32128, 768). I checked the google-research implementation of T5 and found that it also uses a vocab size of 32100.

Where did the extra 28 embeddings come from, and how can we map them to the tokenizer?

To reproduce

from transformers import (
    T5Tokenizer, 
    T5ForConditionalGeneration,
)

tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')
len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape

Output:

(32100, torch.Size([32128, 768]))
@patrickvonplaten patrickvonplaten self-assigned this Jun 9, 2020
@patrickvonplaten
Contributor

Hey @cstorm125,

I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has 32128 rows simply because 32128 is a more GPU-friendly number (32128 = 128 * 251) than 32100 (= 4 * 8025): the GPU is generally more efficient when it can work with shapes that are multiples of a power of two.

Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
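
The arithmetic behind that, as a minimal sketch (rounding the vocab up to a multiple of 128 is my reading of why the checkpoint is sized this way, not something taken from the T5 code):

import math

vocab_size = 32100   # 32000 SentencePiece tokens + 100 extra_ids
multiple = 128       # assumed GPU-friendly multiple, for illustration only

padded_size = math.ceil(vocab_size / multiple) * multiple
print(padded_size)   # 32128, the number of rows in the shared embedding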

@huu4ontocord

Hi all, I ran into this too, and I did find a bug as a result of this mismatch: I tried to resize the embedding to be smaller and got a CUDA assert error. See the bug report:

#8643

@libing125

I ran into this mismatch recently as well, and I think it may lead to many bugs. I hope someone can fix it.

@s4sarath

s4sarath commented Dec 9, 2021

> Hey @cstorm125,
>
> I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has 32128 rows simply because 32128 is a more GPU-friendly number (32128 = 128 * 251) than 32100 (= 4 * 8025): the GPU is generally more efficient when it can work with shapes that are multiples of a power of two.
>
> Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes

This is wrong; it shouldn't be this way. If the model predicts a wrong index and you calculate the loss on it, it will cause serious issues. It's hard to believe no one cares about this.

@patrickvonplaten
Contributor

Hey @s4sarath,

During training, all input_ids and labels are defined by the tokenizer. If the tokenizer has a vocab_size of 32000, there is no way it will produce an id >= 32000, either for input_ids or for labels. Because no label ever has an id >= 32000, the model learns to never predict those ids. I don't really see a problem with this, to be honest.
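
A quick way to convince yourself of that, as a small sketch against the same t5-base checkpoint discussed above:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

ids = tokenizer("translate English to German: The house is wonderful.").input_ids
print(max(ids), len(tokenizer))      # the largest id is always < 32100
assert max(ids) < len(tokenizer)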

@s4sarath

s4sarath commented Dec 10, 2021 via email

@ZhaofengWu
Contributor

Upvoting this. Another subtle bug this causes shows up when doing prompt tuning. The common way to do it is to call add_tokens to add some special prompt tokens, and also to create a special embedding class that consists of two embedding matrices, the original one plus one for the prompt tokens; the forward call simply indexes into the two matrices concatenated together. Then all parameters except the prompt-token embedding matrix are frozen. The expected behavior is that the IDs of the added tokens correspond to the prompt-token embeddings when concatenated with the original. However, this mismatch causes the tokenizer to assign IDs starting from 32100, which still fall inside the original embedding matrix, which doesn't get gradients.
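
A minimal sketch of the pattern I mean (the class and names here are mine, not from any library); with t5-base the first added token gets id 32100, which still lands in the 32128-row frozen matrix rather than in the trainable prompt matrix:

import torch
import torch.nn as nn

class ConcatPromptEmbedding(nn.Module):
    """Frozen pretrained embeddings concatenated with trainable prompt embeddings."""

    def __init__(self, original: nn.Embedding, num_prompt_tokens: int):
        super().__init__()
        self.original = original
        self.original.weight.requires_grad = False       # freeze pretrained rows
        self.prompt = nn.Embedding(num_prompt_tokens, original.embedding_dim)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        # Prompt ids are expected to start right after the *tokenizer* vocab (32100),
        # but the original T5 matrix has 32128 rows, so ids 32100-32127 index into
        # frozen pretrained rows and the prompt matrix never receives gradients.
        weight = torch.cat([self.original.weight, self.prompt.weight], dim=0)
        return nn.functional.embedding(input_ids, weight)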

@Darshan2104

Temporary solution: model.resize_token_embeddings(len(tokenizer))
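
For example, a quick check along these lines (note that shrinking the matrix simply drops the 28 unused rows):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

model.resize_token_embeddings(len(tokenizer))       # 32128 -> 32100
print(model.get_input_embeddings().weight.shape)    # torch.Size([32100, 768])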

@theejung

theejung commented Feb 9, 2022

I just found that the generate function sometimes produces input ids > 32100, especially when I evaluate a fine-tuned model at a very early step during training. Thanks, @Darshan2104! model.resize_token_embeddings(len(tokenizer)) temporarily resolves my issue.

@kanak8278

I am also facing the IndexError: index out of range in self issue due to this difference between the vocab size of the T5 tokenizer and the T5ForConditionalGeneration model. Should I resize the model's token embeddings?

@kanak8278

model.resize_token_embeddings(len(tokenizer))

I tried this, but it isn't helping.

@nbroad1881
Contributor

> model.resize_token_embeddings(len(tokenizer))
>
> I tried this, but it isn't helping.

@kanak8278, could you double-check that you are using the right tokenizer for the model?

For the model, could you show me what happens when you run this code?

{n:p.shape for n, p in model.named_parameters() if "embedding" in n}

For the tokenizer, could you do len(tokenizer) and report what it says?

And then could you do this on your input ids? torch.tensor(input_ids).max()
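
Putting those three checks together, roughly (swap in your own model, tokenizer, and batch; note that T5 registers its tied embedding as shared.weight, so I read it via get_input_embeddings() rather than filtering parameter names):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

print(model.get_input_embeddings().weight.shape)   # e.g. torch.Size([32128, 768])
print(len(tokenizer))                              # e.g. 32100

input_ids = tokenizer("an example sentence", return_tensors="pt").input_ids
print(input_ids.max())                             # must be < the embedding row count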

@PastelBelem8

This is a bit troubling, especially because I'm only interested in using a model for inference. I'm generating some sequences with multinomial sampling from the pythia-70M model. When I attempt to obtain the corresponding scores for these sequences, I hit a CUDA assertion (which, when running on the CPU, reveals itself as an indexing error). Upon checking the sizes of the model and the tokenizer, I find that they differ. Although I understand @patrickvonplaten's justification, I am not sure how to proceed in terms of replacing these tokens; the fact is that they are being selected during random sampling (even though they shouldn't be, since they were never trained). The other troubling consequence of having a model head larger than the vocab size is that, by definition, these tokens will still carry some probability mass.
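
For now I am thinking of masking out the logits of the padded-only ids before sampling; a minimal sketch of what I mean (the processor class is my own, not an official API):

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class MaskPaddedVocab(LogitsProcessor):
    """Set the logits of ids the tokenizer can never produce to -inf before sampling."""

    def __init__(self, tokenizer_vocab_size: int):
        self.tokenizer_vocab_size = tokenizer_vocab_size

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.tokenizer_vocab_size:] = float("-inf")
        return scores

# usage sketch:
# model.generate(**inputs, do_sample=True,
#                logits_processor=LogitsProcessorList([MaskPaddedVocab(len(tokenizer))]))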

@nbroad1881
Contributor

@PastelBelem8

The model was never incentivized to predict those tokens, so the logits for the tokens with ids >= len(tokenizer) end up extraordinarily low. I did a quick test, and the scores for those extra tokens are on the order of 1e-30 per token; even summed together, that is basically 0.
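
Roughly the kind of check I ran, as a sketch (assuming the EleutherAI/pythia-70m checkpoint mentioned above; exact numbers will vary with the input):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

probs = next_token_logits.softmax(dim=-1)
print(probs[len(tokenizer):].sum())   # total mass on ids the tokenizer never emits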

Could you share your sampling approach?

@PastelBelem8

Never mind, it was an error on my end! I apologize for the confusion! I thought I had tried everything and was desperate.
