
Difference between vocab_size in model T5ForConditionalGeneration “t5-small” and its corresponding Tokenizer “t5-small” #14727

Closed
ashutoshml opened this issue Dec 11, 2021 · 3 comments

Comments

ashutoshml commented Dec 11, 2021

Environment info

  • transformers version: 4.13.0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (False)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten, @patil-suraj, @LysandreJik

Information

Model I am using (Bert, XLNet ...): BART and T5

The problem arises when using:

  • [Yes] the official example scripts (see the snippet below)

To reproduce

Steps to reproduce the behavior:

from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import BartTokenizer, BartForConditionalGeneration

t5model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5tokenizer = T5Tokenizer.from_pretrained('t5-small')

print("For T5:")
print("Tokenizer vocab_size: {}".format(t5tokenizer.vocab_size))
print("Model vocab size: {}\n".format(t5model.config.vocab_size))

bartmodel = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
barttokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

print("For BART:")
print("Tokenizer vocab_size: {}".format(barttokenizer.vocab_size))
print("Model vocab size: {}".format(bartmodel.config.vocab_size))

Current Output

For T5:
Tokenizer vocab_size: 32100
Model vocab size: 32128

For BART:
Tokenizer vocab_size: 50265
Model vocab size: 50265

Expected behavior

Ideally, the model and the corresponding tokenizer should have the same vocab_size. In case the additional entries in the model's vocabulary are meant to accommodate prefix tokens, why are they not included in the corresponding tokenizer?
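
For reference, a quick check (just a sketch) suggests that every id the tokenizer actually produces stays below 32100, so the extra 28 rows in the model's embedding matrix are never indexed by tokenizer output:

from transformers import T5Tokenizer, T5ForConditionalGeneration

t5tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5model = T5ForConditionalGeneration.from_pretrained('t5-small')

# All ids produced by the tokenizer (including the <extra_id_*> sentinels)
# lie in [0, 32100), i.e. inside the model's 32128-row embedding matrix.
ids = t5tokenizer("translate English to German: The house is wonderful.").input_ids
print(max(ids))                   # < 32100
print(len(t5tokenizer))           # 32100
print(t5model.config.vocab_size)  # 32128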

ashutoshml (Author) commented

My main goal is to add tokens to the T5Tokenizer via add_tokens, but I am not sure how the model's vocab_size would be affected by this after calling resize_token_embeddings. It would help to get an idea of how to handle the above.
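
A minimal sketch of the add-then-resize pattern (the token strings here are placeholders, not the tokens I actually intend to add):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Hypothetical new tokens, purely for illustration.
num_added = tokenizer.add_tokens(['<new_tok_1>', '<new_tok_2>'])
print(num_added, len(tokenizer))  # 2, 32102

# Resize the (tied) embeddings to match the tokenizer. This sets the
# embedding matrix and config.vocab_size to len(tokenizer), i.e. 32102 here,
# which is smaller than the original 32128.
model.resize_token_embeddings(len(tokenizer))
print(model.config.vocab_size)    # 32102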

NielsRogge (Contributor) commented

See #4875

KeertiPremGadde commented

Hi @ashutoshml,

I need to add 3 special tokens to google/long-t5-tglobal-large.

The tokenizer vocabulary size is 32100, which will increase by 3.
The model vocab size / embedding matrix size is 32128 (i.e. 28 buffer tokens, as is usual for T5 models).

My question: when I add the 3 special tokens, do I need to resize the embedding matrix from 32128 to 32131, or will the 3 special tokens fall into the buffer so that the embedding matrix stays at 32128 (leaving 25 buffer tokens)?
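
A sketch of one way to check what actually happens (the special-token strings below are placeholders):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('google/long-t5-tglobal-large')
model = AutoModelForSeq2SeqLM.from_pretrained('google/long-t5-tglobal-large')

# Hypothetical special tokens, purely for illustration.
tokenizer.add_tokens(['<spc_1>', '<spc_2>', '<spc_3>'], special_tokens=True)

new_ids = tokenizer.convert_tokens_to_ids(['<spc_1>', '<spc_2>', '<spc_3>'])
print(new_ids)                  # expected [32100, 32101, 32102], inside the 32128 rows
print(len(tokenizer))           # 32103
print(model.config.vocab_size)  # 32128

# Because the new ids already fit inside the existing embedding matrix, a
# resize is not strictly required. Calling
#   model.resize_token_embeddings(len(tokenizer))
# would instead sync the two sizes by shrinking the matrix to 32103 rather
# than growing it to 32131.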
