
Inconsistent number of vocab from pretrained T5Tokenizer and T5ForConditionalGeneration #4875

Closed
cstorm125 opened this issue Jun 9, 2020 · 15 comments

@cstorm125

❓ Questions & Help

The pretrained T5Tokenizer has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of T5ForConditionalGeneration has shape (32128, 768). I checked the google-research implementation of T5 and found that it also uses a vocab size of 32100.

Where did the extra 28 embeddings come from, and how can we map them to the tokenizer?

To reproduce

from transformers import (
    T5Tokenizer, 
    T5ForConditionalGeneration,
)

tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')
len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape

Output:

(32100, torch.Size([32128, 768]))
@patrickvonplaten patrickvonplaten self-assigned this Jun 9, 2020
@patrickvonplaten
Contributor

Hey @cstorm125,

I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has 32128 rows simply because 32128 is a more GPU-friendly number (32128 = 128 * 251) than 32100 (= 4 * 8025): the GPU is generally more efficient when it can work with shapes that are multiples of a power of two.

Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
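
The arithmetic behind that, as a minimal sketch (rounding the vocab up to a multiple of 128 is my reading of why the checkpoint is sized this way, not something taken from the T5 code):

import math

vocab_size = 32100   # 32000 SentencePiece tokens + 100 extra_ids
multiple = 128       # assumed GPU-friendly multiple, for illustration only

padded_size = math.ceil(vocab_size / multiple) * multiple
print(padded_size)   # 32128, the number of rows in the shared embedding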

@huu4ontocord

Hi all, I ran into this too, and I did find a bug as a result of this mismatch: I tried to resize the embedding to be smaller and got a CUDA assert error. See the bug report:

#8643

@libing125

I ran into this mismatch recently as well, and I think it may lead to many bugs. I hope someone can fix it.

@s4sarath

s4sarath commented Dec 9, 2021

> Hey @cstorm125,
>
> I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has 32128 rows simply because 32128 is a more GPU-friendly number (32128 = 128 * 251) than 32100 (= 4 * 8025): the GPU is generally more efficient when it can work with shapes that are multiples of a power of two.
>
> Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes

This is wrong; it shouldn't be this way. If the model predicts a wrong index and you calculate the loss on it, it will cause serious issues. It's hard to believe no one cares about this.

@patrickvonplaten
Contributor

Hey @s4sarath,

During training, all input_ids and labels are defined by the tokenizer. If the tokenizer has a vocab_size of 32000, there is no way it will produce an id >= 32000, either for input_ids or for labels. Because no label ever has an id >= 32000, the model learns to never predict those ids. I don't really see a problem with this, to be honest.
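
A quick way to convince yourself of that, as a small sketch against the same t5-base checkpoint discussed above:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

ids = tokenizer("translate English to German: The house is wonderful.").input_ids
print(max(ids), len(tokenizer))      # the largest id is always < 32100
assert max(ids) < len(tokenizer)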

@s4sarath

s4sarath commented Dec 10, 2021 via email

@ZhaofengWu
Contributor

Upvoting this. Another subtle bug this causes shows up when doing prompt tuning. The common way to do it is to call add_tokens to add some special prompt tokens, and also to create a special embedding class that consists of two embedding matrices, the original one plus one for the prompt tokens; the forward call simply indexes into the two matrices concatenated together. Then all parameters except the prompt-token embedding matrix are frozen. The expected behavior is that the IDs of the added tokens correspond to the prompt-token embeddings when concatenated with the original. However, this mismatch causes the tokenizer to assign IDs starting from 32100, which still fall inside the original embedding matrix, which doesn't get gradients.
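
A minimal sketch of the pattern I mean (the class and names here are mine, not from any library); with t5-base the first added token gets id 32100, which still lands in the 32128-row frozen matrix rather than in the trainable prompt matrix:

import torch
import torch.nn as nn

class ConcatPromptEmbedding(nn.Module):
    """Frozen pretrained embeddings concatenated with trainable prompt embeddings."""

    def __init__(self, original: nn.Embedding, num_prompt_tokens: int):
        super().__init__()
        self.original = original
        self.original.weight.requires_grad = False       # freeze pretrained rows
        self.prompt = nn.Embedding(num_prompt_tokens, original.embedding_dim)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        # Prompt ids are expected to start right after the *tokenizer* vocab (32100),
        # but the original T5 matrix has 32128 rows, so ids 32100-32127 index into
        # frozen pretrained rows and the prompt matrix never receives gradients.
        weight = torch.cat([self.original.weight, self.prompt.weight], dim=0)
        return nn.functional.embedding(input_ids, weight)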

@Darshan2104

Temporary solution: model.resize_token_embeddings(len(tokenizer))
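
For example, a quick check along these lines (note that shrinking the matrix simply drops the 28 unused rows):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

model.resize_token_embeddings(len(tokenizer))       # 32128 -> 32100
print(model.get_input_embeddings().weight.shape)    # torch.Size([32100, 768])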

@theejung

theejung commented Feb 9, 2022

I just found that the generate function sometimes produces input ids > 32100, especially when I evaluate a fine-tuned model at a very early step during training. Thanks, @Darshan2104! model.resize_token_embeddings(len(tokenizer)) temporarily resolves my issue.

@kanak8278

I am also facing the IndexError: index out of range in self issue due to this difference between the vocab size of the T5 tokenizer and the T5ForConditionalGeneration model. Should I resize the model's token embeddings?

@kanak8278

model.resize_token_embeddings(len(tokenizer))

I tried this, but it isn't helping.

@nbroad1881
Contributor

> model.resize_token_embeddings(len(tokenizer))
>
> I tried this, but it isn't helping.

@kanak8278, could you double-check that you are using the right tokenizer for the model?

For the model, could you show me what happens when you run this code?

{n:p.shape for n, p in model.named_parameters() if "embedding" in n}

For the tokenizer, could you do len(tokenizer) and report what it says?

And then could you do this on your input ids? torch.tensor(input_ids).max()
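
Putting those three checks together, roughly (swap in your own model, tokenizer, and batch; note that T5 registers its tied embedding as shared.weight, so I read it via get_input_embeddings() rather than filtering parameter names):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

print(model.get_input_embeddings().weight.shape)   # e.g. torch.Size([32128, 768])
print(len(tokenizer))                              # e.g. 32100

input_ids = tokenizer("an example sentence", return_tensors="pt").input_ids
print(input_ids.max())                             # must be < the embedding row count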

@PastelBelem8

This is a bit troubling, especially because I'm only interested in using a model for inference. I'm generating some sequences with multinomial sampling from the pythia-70M model. When I attempt to obtain the corresponding scores for these sequences, I hit a CUDA assertion (which, when running on the CPU, reveals itself as an indexing error). Upon checking the sizes of the model and the tokenizer, I find that they differ. Although I understand @patrickvonplaten's justification, I am not sure how to proceed in terms of replacing these tokens; the fact is that they are being selected during random sampling (even though they shouldn't be, since they were never trained). The other troubling consequence of having a model head larger than the vocab size is that, by definition, these tokens will still carry some probability mass.
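
For now I am thinking of masking out the logits of the padded-only ids before sampling; a minimal sketch of what I mean (the processor class is my own, not an official API):

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class MaskPaddedVocab(LogitsProcessor):
    """Set the logits of ids the tokenizer can never produce to -inf before sampling."""

    def __init__(self, tokenizer_vocab_size: int):
        self.tokenizer_vocab_size = tokenizer_vocab_size

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.tokenizer_vocab_size:] = float("-inf")
        return scores

# usage sketch:
# model.generate(**inputs, do_sample=True,
#                logits_processor=LogitsProcessorList([MaskPaddedVocab(len(tokenizer))]))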

@nbroad1881
Contributor

@PastelBelem8

The model was never incentivized to predict those tokens, so the logits for the tokens with ids >= len(tokenizer) end up extraordinarily low. I did a quick test, and the scores for those extra tokens are on the order of 1e-30 per token; even summed together, that is basically 0.
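
Roughly the kind of check I ran, as a sketch (assuming the EleutherAI/pythia-70m checkpoint mentioned above; exact numbers will vary with the input):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

probs = next_token_logits.softmax(dim=-1)
print(probs[len(tokenizer):].sum())   # total mass on ids the tokenizer never emits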

Could you share your sampling approach?

@PastelBelem8

Never mind, it was an error on my end! I apologize for the confusion! I thought I had tried everything and was desperate.
