Whisper: timestamp tokens are missing in the tokenizer vocabulary #20225
Hey! Though I agree with you that normally the tokenizer vocab size is the same as the model's, in this case the original model was similar. cc @LysandreJik, as we had a lot of issues with other models; there were discussions on whether to always add the extra tokens or not.
Thank you for the explanation! Feel free to close this issue if you want to keep it this way; I can work around it in my own code. For context, I'm converting some Transformers models to another format and I want the tokenizer vocabulary size to always match the model vocabulary size. In many cases I need to add some tokens (most often "madeupword" tokens to pad the vocabulary to a multiple of 8) and sometimes I need to remove some (e.g. for …).
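As a rough illustration of that padding step, here is a minimal sketch (the `madeupword` naming follows the fairseq convention; the checkpoint is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/whisper-tiny")

# Add filler tokens until the vocabulary size is a multiple of 8.
n_pad = (-len(tok)) % 8
tok.add_tokens([f"madeupword{i:04d}" for i in range(n_pad)])

assert len(tok) % 8 == 0
```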
Since the original OpenAI implementation moved from HF tokenizers to their own tiktoken library, it seems timestamp tokens are now handled and converted to token ids. Right now the timestamp tokens in HF are handled as strings instead of individual tokens:

```python
from transformers import WhisperTokenizer
from whisper.tokenizer import get_tokenizer

hf_tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
openai_tok = get_tokenizer(multilingual=True, language="en", task="transcribe")

openai_tok.encode("<|1.00|>", disallowed_special=[])
# [27, 91, 16, 13, 628, 91, 29]
hf_tok.encode("<|1.00|>", add_special_tokens=False)
# [27, 91, 16, 13, 628, 91, 29]

openai_tok.encode("<|1.00|>", allowed_special=set(openai_tok.special_tokens.keys()))
# [50414]
hf_tok.encode("<|1.00|>", add_special_tokens=True)
# [50258, 50363, 27, 91, 16, 13, 628, 91, 29, 50257]
```

Could it be time to revisit this issue?
Nope, we also added support for decoding with timestamps. For that you just need to specify the `decode_with_timestamps` argument when decoding.
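A minimal sketch of that decoding path, assuming the `decode_with_timestamps` flag of `WhisperTokenizer.decode` (id 50414 is `<|1.00|>`, per the snippet above):

```python
from transformers import WhisperTokenizer

hf_tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

hf_tok.decode([50414], decode_with_timestamps=True)
# '<|1.00|>'
```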
Yeah, but if you want to train using the right timestamp tokens, there's no support for that AFAIK. We had to add the tokens manually. The encoding function is a bit more convoluted to modify to support encoding of the timestamp tokens with a flag, like it's now implemented for decoding.
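For illustration, the manual workaround could look roughly like this (a sketch, assuming the standard 1501 timestamp tokens from `<|0.00|>` to `<|30.00|>` in 0.02 s steps; not necessarily the exact code used):

```python
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Register the timestamp strings as added tokens so each one encodes
# to a single id instead of being split into character pieces.
timestamps = [f"<|{i * 0.02:.2f}|>" for i in range(1501)]
tok.add_tokens(timestamps)

tok.encode("<|1.00|>", add_special_tokens=False)
# [50414], provided the newly added ids line up with the model's vocabulary
```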
Then we should probably add them to the tokenizer's vocabulary.
Yep, I agree. Took a look through and @versae is spot on: the new OpenAI tokenizer has these tokens as part of their tokenizer, so they can be encoded directly. We should follow suit and update our encoding function accordingly.
Also, if they are part of the special tokens, they will not be skipped by default, and would have to be skipped using `skip_special_tokens=True`.
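A small sketch of that default skipping behaviour with the existing special tokens (ids as in the earlier snippet):

```python
from transformers import WhisperTokenizer

hf_tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# 50258 = <|startoftranscript|>, 50363 = <|notimestamps|>, 50257 = <|endoftext|>
hf_tok.decode([50258, 50363, 50257], skip_special_tokens=True)
# '' (all three are special tokens, so they are dropped)
```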
Hey! Sorry, I haven't had the time to properly implement this, but I can confirm that using added tokens works.
I'll open a PR. I am not entirely sure just using added tokens will solve this. We need backward compatibility, so I'll add a new argument like …
Ok! So it seems that …
Keep us posted!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Skip special tokens was merged in #25081, so closing this now.
System Info

`transformers` version: 4.24.0

Who can help?

@ArthurZucker

Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction

The vocabulary size returned by the `WhisperTokenizer` does not match the vocabulary size reported in the configuration (`config.vocab_size`). The timestamp tokens are missing from the tokenizer vocabulary; consider the example sketched below. The token surface used in the code snippet is copied from the reference implementation:
https://github.com/openai/whisper/blob/9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d/whisper/tokenizer.py#L151
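A minimal sketch of the mismatch being reported, assuming the `openai/whisper-tiny` checkpoint:

```python
from transformers import WhisperConfig, WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
cfg = WhisperConfig.from_pretrained("openai/whisper-tiny")

print(len(tok))        # tokenizer vocabulary size, timestamp tokens absent
print(cfg.vocab_size)  # 51865 for the multilingual checkpoints

# The 1501 timestamp tokens (<|0.00|> ... <|30.00|>) account for the gap.
assert len(tok) < cfg.vocab_size
```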
Expected behavior
The vocabulary size returned by the tokenizer should match the model vocabulary size.