Current Output
For T5:
Tokenizer vocab_size: 32100
Model vocab size: 32128
For BART:
Tokenizer vocab_size: 50265
Model vocab size: 50265
Expected behavior
Both the model and the corresponding tokenizer should ideally have the same vocab_size. If the additional vocab_size in the model is there to accommodate prefix tokens, why are they not included in the corresponding tokenizer?
My main goal is to use add_tokens with the T5Tokenizer, but I'm not sure how the model's vocab_size would be affected by this after resize_token_embeddings. It would help to get an idea of how to handle the above.
I need to add 3 special tokens to google/long-t5-tglobal-large.
The tokenizer vocabulary is 32100, which will increase by 3.
The model vocab size / embedding matrix size is 32128 (with 28 buffer tokens, as is usual for the T5 models).
My question: when I add the 3 special tokens, do I need to resize the embedding matrix from 32128 to 32131, or will the 3 additional special tokens be covered by the buffer tokens so that the embedding matrix stays at 32128 (leaving 25 buffer tokens)?
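For reference, a minimal sketch of the flow I have in mind (using AutoTokenizer / AutoModelForSeq2SeqLM and placeholder token names, so the details are assumptions on my part). It only grows the embedding matrix when the tokenizer actually outgrows it, since an unconditional resize_token_embeddings(len(tokenizer)) would shrink the matrix from 32128 to 32103 here rather than grow it:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/long-t5-tglobal-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder names for the 3 special tokens I need to add.
new_special_tokens = ["<special_1>", "<special_2>", "<special_3>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_special_tokens})

embedding_rows = model.get_input_embeddings().weight.shape[0]
print("len(tokenizer) after adding:", len(tokenizer))  # 32100 + 3 = 32103
print("embedding rows:", embedding_rows)               # 32128

# The 3 new tokens are expected to get ids 32100-32102, which still fall
# inside the 32128-row embedding matrix, so only resize if the tokenizer
# actually outgrows it; resizing to len(tokenizer) would otherwise shrink it.
if len(tokenizer) > embedding_rows:
    model.resize_token_embeddings(len(tokenizer))
```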
Environment info
transformers version: 4.13.0
Who can help
@patrickvonplaten, @patil-suraj, @LysandreJik
Information
Model I am using (Bert, XLNet ...): BART and T5
The problem arises when using:
To reproduce
Steps to reproduce the behavior:
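A minimal sketch of the comparison that produces the output above (the exact checkpoints, t5-base and facebook/bart-base, are an assumption on my part):

```python
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

checkpoints = [
    ("t5-base", T5Tokenizer, T5ForConditionalGeneration),
    ("facebook/bart-base", BartTokenizer, BartForConditionalGeneration),
]

for name, tokenizer_cls, model_cls in checkpoints:
    tokenizer = tokenizer_cls.from_pretrained(name)
    model = model_cls.from_pretrained(name)
    print(f"For {name}:")
    print("  Tokenizer vocab_size:", tokenizer.vocab_size)
    # The model reports its vocabulary size via the config / embedding matrix.
    print("  Model vocab size:", model.config.vocab_size)
```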