
Difference between vocab_size in model T5ForConditionalGeneration “t5-small” and its corresponding Tokenizer “t5-small” #14727

Closed
ashutoshml opened this issue Dec 11, 2021 · 3 comments

Comments

ashutoshml commented Dec 11, 2021

Environment info

  • transformers version: 4.13.0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (False)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten, @patil-suraj, @LysandreJik

Information

Model I am using (Bert, XLNet ...): BART and T5

The problem arises when using:

  • [Yes] the official example scripts (see the snippet below)

To reproduce

Steps to reproduce the behavior:

from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import BartTokenizer, BartForConditionalGeneration

t5model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5tokenizer = T5Tokenizer.from_pretrained('t5-small')

print("For T5:")
print("Tokenizer vocab_size: {}".format(t5tokenizer.vocab_size))
print("Model vocab size: {}\n".format(t5model.config.vocab_size))

bartmodel = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
barttokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

print("For BART:")
print("Tokenizer vocab_size: {}".format(barttokenizer.vocab_size))
print("Model vocab size: {}".format(bartmodel.config.vocab_size))

Current Output

For T5:
Tokenizer vocab_size: 32100
Model vocab size: 32128

For BART:
Tokenizer vocab_size: 50265
Model vocab size: 50265

Expected behavior

Ideally, the model and the corresponding tokenizer should have the same vocab_size. In case the additional entries in the model's vocabulary are meant to accommodate prefix tokens, why are they not included in the corresponding tokenizer?
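
For reference, a quick check (just a sketch) suggests that every id the tokenizer actually produces stays below 32100, so the extra 28 rows in the model's embedding matrix are never indexed by tokenizer output:

from transformers import T5Tokenizer, T5ForConditionalGeneration

t5tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5model = T5ForConditionalGeneration.from_pretrained('t5-small')

# All ids produced by the tokenizer (including the <extra_id_*> sentinels)
# lie in [0, 32100), i.e. inside the model's 32128-row embedding matrix.
ids = t5tokenizer("translate English to German: The house is wonderful.").input_ids
print(max(ids))                   # < 32100
print(len(t5tokenizer))           # 32100
print(t5model.config.vocab_size)  # 32128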

ashutoshml (Author) commented

My main goal is to add tokens to the T5Tokenizer via add_tokens, but I am not sure how the model's vocab_size would be affected by this after calling resize_token_embeddings. It would help to get an idea of how to handle the above.
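
A minimal sketch of the add-then-resize pattern (the token strings here are placeholders, not the tokens I actually intend to add):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Hypothetical new tokens, purely for illustration.
num_added = tokenizer.add_tokens(['<new_tok_1>', '<new_tok_2>'])
print(num_added, len(tokenizer))  # 2, 32102

# Resize the (tied) embeddings to match the tokenizer. This sets the
# embedding matrix and config.vocab_size to len(tokenizer), i.e. 32102 here,
# which is smaller than the original 32128.
model.resize_token_embeddings(len(tokenizer))
print(model.config.vocab_size)    # 32102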

NielsRogge (Contributor) commented

See #4875

KeertiPremGadde commented

Hi @ashutoshml,

I need to add 3 special tokens to google/long-t5-tglobal-large.

The tokenizer vocabulary size is 32100, which will increase by 3.
The model vocab size / embedding matrix size is 32128 (i.e. 28 buffer tokens, as is usual for T5 models).

My question: when I add the 3 special tokens, do I need to resize the embedding matrix from 32128 to 32131, or will the 3 special tokens fall into the buffer so that the embedding matrix stays at 32128 (leaving 25 buffer tokens)?
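
A sketch of one way to check what actually happens (the special-token strings below are placeholders):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('google/long-t5-tglobal-large')
model = AutoModelForSeq2SeqLM.from_pretrained('google/long-t5-tglobal-large')

# Hypothetical special tokens, purely for illustration.
tokenizer.add_tokens(['<spc_1>', '<spc_2>', '<spc_3>'], special_tokens=True)

new_ids = tokenizer.convert_tokens_to_ids(['<spc_1>', '<spc_2>', '<spc_3>'])
print(new_ids)                  # expected [32100, 32101, 32102], inside the 32128 rows
print(len(tokenizer))           # 32103
print(model.config.vocab_size)  # 32128

# Because the new ids already fit inside the existing embedding matrix, a
# resize is not strictly required. Calling
#   model.resize_token_embeddings(len(tokenizer))
# would instead sync the two sizes by shrinking the matrix to 32103 rather
# than growing it to 32131.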
