Camenbert length Tokenizer not equal config vocab_size #2020

Keisn1 · 2019-12-02T10:30:44Z

❓ Questions & Help

Hi there,
when I load the pretrained CamemBERT model and tokenizer via

from transformers import CamembertForMaskedLM, CamembertTokenizer
model = CamembertForMaskedLM.from_pretrained('camembert-base')
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

the length of the tokenizer is 32004 but the vocab_size of the model is 32005:
print(len(tokenizer))  # 32004
print(model.config.vocab_size)  # 32005

This raises the error

Index out of range

when I adapt the lm_finetuning example, because of the line
model.resize_token_embeddings(len(tokenizer))

It runs when I comment out this line. So my question is: is this the intended behaviour, and what is the reason for the mismatch between the tokenizer length and the model's vocab_size?
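While the mismatch exists, one possible workaround is to make sure the resize never shrinks the embedding matrix. The sketch below assumes the error comes from `resize_token_embeddings` cutting the matrix from 32005 rows to 32004, so that the highest token id no longer has a row; this cause is not confirmed in the thread. `target_embedding_size` is a hypothetical helper, not part of transformers.

```python
def target_embedding_size(tokenizer_len: int, model_vocab_size: int) -> int:
    """Return a row count that never shrinks the model's embedding matrix.

    Hypothetical helper: it only grows the matrix when the tokenizer has
    gained tokens, and otherwise keeps the size the checkpoint came with.
    """
    return max(tokenizer_len, model_vocab_size)


# Sizes reported above for 'camembert-base'.
tokenizer_len = 32004
model_vocab_size = 32005

new_size = target_embedding_size(tokenizer_len, model_vocab_size)
print(new_size)

# With the real objects the call would look like this (not executed here):
# model.resize_token_embeddings(
#     target_embedding_size(len(tokenizer), model.config.vocab_size)
# )
```

With the reported sizes the helper keeps all 32005 rows, which has the same effect as commenting out the resize line, but it still grows the matrix if tokens are later added to the tokenizer.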


thomwolf · 2019-12-05T12:30:51Z

Indeed, upon deeper investigation, it appears that the original fairseq model has a bunch of duplicate tokens in the dictionary:

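import torch
# Load the original fairseq CamemBERT checkpoint from the PyTorch Hub.
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
# Inspect the first ten dictionary entries; several special tokens appear twice.
[camembert.task.source_dictionary[i] for i in range(10)]
>>> ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁de', '.']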

I'm cleaning and updating for this in #2065
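To make the duplication concrete, here is a minimal sketch that takes the ten entries listed above and reports which symbols occur at more than one index. `find_duplicates` is a hypothetical helper written for illustration; it does not use fairseq and says nothing about entries beyond the first ten.

```python
from collections import defaultdict

# First ten dictionary entries, copied from the output above.
symbols = ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁de', '.']


def find_duplicates(entries):
    """Map each symbol that appears more than once to the indices where it occurs."""
    positions = defaultdict(list)
    for index, symbol in enumerate(entries):
        positions[symbol].append(index)
    return {symbol: idxs for symbol, idxs in positions.items() if len(idxs) > 1}


duplicates = find_duplicates(symbols)
print(duplicates)
# {'<s>': [0, 5], '</s>': [2, 6], '<unk>': [3, 4]}

# Ten slots hold only seven distinct symbols.
print(len(symbols), len(set(symbols)))
# 10 7
```

When several indices share one symbol, a count of distinct symbols comes out smaller than the number of rows in the model's embedding matrix, which is the kind of gap reported in this issue.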

stale · 2020-02-03T12:55:27Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

thomwolf mentioned this issue Dec 5, 2019

Fixing camembert tokenization #2065

Merged

stale bot added the wontfix label Feb 3, 2020

stale bot closed this as completed Feb 10, 2020

ari9dam mentioned this issue Feb 11, 2021

T5 Base length of Tokenizer not equal config vocab_size #10144

Closed
