
Whisper: timestamp tokens are missing in the tokenizer vocabulary #20225

Closed
guillaumekln opened this issue Nov 15, 2022 · 14 comments

guillaumekln (Contributor) commented Nov 15, 2022

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.13.0+cu117 (True)

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The vocabulary size returned by the WhisperTokenizer does not match the vocabulary size reported in the configuration (config.vocab_size): the timestamp tokens are missing from the tokenizer vocabulary. Consider this example:

import transformers

tokenizer = transformers.WhisperTokenizer.from_pretrained("openai/whisper-tiny")
config = transformers.WhisperConfig.from_pretrained("openai/whisper-tiny")

vocab = tokenizer.get_vocab()

print(len(vocab) == config.vocab_size)  # prints False

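# Append the 1501 timestamp tokens <|0.00|> ... <|30.00|> (0.02 s increments)
# that the model expects but the tokenizer vocabulary does not contain.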
for i in range(1500 + 1):
    timestamp = "<|%.2f|>" % (i * 0.02)
    vocab[timestamp] = len(vocab)

print(len(vocab) == config.vocab_size)  # prints True

The token surface used in the code snippet is copied from the reference implementation:

https://github.com/openai/whisper/blob/9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d/whisper/tokenizer.py#L151

Expected behavior

The vocabulary size returned by the tokenizer should match the model vocabulary size.

ArthurZucker self-assigned this Nov 15, 2022
ArthurZucker (Collaborator) commented

Hey! Though I agree with you that the tokenizer vocab size normally matches the model's, in this case the behaviour mirrors the original implementation. The timestamp tokens are all out of vocabulary and are decoded as "" by the fast GPT-2 tokenizer used in the original code. The WhisperTokenizer was adapted to follow this, so as not to bother with tokens that are only used by the timestamp_logits_processor. Indeed, all the extra tokens (>50363) are treated as timestamp predictions and "ignored".
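
To illustrate that behaviour, a small sketch against the transformers 4.24 tokenizer reported in this issue (the exact result may differ in other versions):

import transformers

tokenizer = transformers.WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# 50364 is the first timestamp id (<|0.00|>). It is outside the tokenizer
# vocabulary here, so decoding it is expected to produce an empty string.
print(repr(tokenizer.decode([50364])))  # expected: ''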

cc @LysandreJik, as we had a lot of issues with other models; there was discussion on whether to always add the extra tokens or not.

guillaumekln (Contributor, Author) commented

Thank you for the explanation! Feel free to close this issue if you want to keep it this way. I can work around it in my own code.

For context, I'm converting some Transformers models to another format and I want the tokenizer vocabulary size to always match the model vocabulary size. In many cases I need to add some tokens (most often "madeupword" tokens to pad the vocabulary to a multiple of 8), and sometimes I need to remove some (e.g. for facebook/bart-large-cnn the tokenizer has 1 additional token for some reason). It would be great if len(tokenizer.get_vocab()) were always consistent with the model vocabulary size.
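
For reference, a minimal sketch of that kind of padding (the <|madeupwordN|> surface is just an illustrative placeholder, not a token defined by any checkpoint):

import transformers

tokenizer = transformers.WhisperTokenizer.from_pretrained("openai/whisper-tiny")
config = transformers.WhisperConfig.from_pretrained("openai/whisper-tiny")

# Pad the tokenizer vocabulary with placeholder tokens until it matches the
# model's embedding size.
missing = config.vocab_size - len(tokenizer.get_vocab())
if missing > 0:
    tokenizer.add_tokens(["<|madeupword%d|>" % i for i in range(missing)])

print(len(tokenizer.get_vocab()) == config.vocab_size)  # should print True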

versae (Contributor) commented May 4, 2023

Since the original OpenAI implementation moved from HF tokenizers to their own tiktoken library, it seems timestamp tokens are now handled there and converted to token ids. Right now the timestamp tokens in HF are handled as plain strings rather than individual tokens.

from transformers import WhisperTokenizer
from whisper.tokenizer import get_tokenizer

hf_tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
openai_tok = get_tokenizer(multilingual=True, language="en", task="transcribe")

openai_tok.encode("<|1.00|>", disallowed_special=[])
# [27, 91, 16, 13, 628, 91, 29]
hf_tok.encode("<|1.00|>", add_special_tokens=False)
# [27, 91, 16, 13, 628, 91, 29]

openai_tok.encode("<|1.00|>", allowed_special=set(openai_tok.special_tokens.keys()))
# [50414]
hf_tok.encode("<|1.00|>", add_special_tokens=True)
# [50258, 50363, 27, 91, 16, 13, 628, 91, 29, 50257]

Could it be time to revisit this issue?

versae mentioned this issue on May 4, 2023
ArthurZucker (Collaborator) commented

Nope, we also added support for decoding with timestamps. For that you just need to specify decode_with_timestamps; see here.
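
A quick sketch of that decoding path (50364 and 50414 should be <|0.00|> and <|1.00|> in the multilingual vocab, per the snippet above):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Wrap some text ids between two timestamp ids and decode with timestamps enabled.
token_ids = [50364] + tokenizer.encode(" Hello world", add_special_tokens=False) + [50414]
print(tokenizer.decode(token_ids, decode_with_timestamps=True))
# expected to print something like "<|0.00|> Hello world<|1.00|>"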

versae (Contributor) commented May 25, 2023

Yeah, but if you want to train using the right timestamp tokens, there's no support for that AFAIK. We had to add the tokens manually. The encoding function is a bit more convoluted to modify to support encoding the timestamp tokens behind a flag, the way it's now implemented for decoding.
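
For reference, the manual workaround looks roughly like this (a sketch; it assumes that adding the 1501 timestamp surfaces as AddedTokens lands them on the ids the model expects):

from transformers import AddedToken, WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Register every timestamp surface so it encodes to a single id.
timestamps = [AddedToken("<|%.2f|>" % (i * 0.02), lstrip=False, rstrip=False)
              for i in range(1500 + 1)]
tokenizer.add_tokens(timestamps)

print(tokenizer.encode("<|1.00|>", add_special_tokens=False))
# expected: a single id instead of the 7-id string encoding shown above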

ArthurZucker (Collaborator) commented

Then we should probably add them to added_tokens_encoder and refactor the tokenizer a bit for encoding/decoding. wdyt @sanchit-gandhi @hollance?

sanchit-gandhi (Contributor) commented

Yep, I agree - took a look through and @versae is spot on: the new OpenAI tokenizer has these tokens as part of its vocabulary, so they can be encoded directly. We should follow suit and update our encoding function accordingly.

ArthurZucker (Collaborator) commented

Also, if they are part of the special tokens, they will not be skipped by default and would have to be skipped with skip_special_tokens=True. But that should be alright! @versae feel free to open a PR and ping me if you have time; otherwise I might be able to tackle that in 2 weeks.
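
Something along these lines; the sketch below uses the already-registered special tokens to show the effect of the flag, since the timestamp tokens are not special tokens yet at this point:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

ids = tokenizer.encode("<|startoftranscript|><|en|><|transcribe|> Hello", add_special_tokens=False)

# Special tokens are kept by default and only dropped when explicitly requested.
print(tokenizer.decode(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))  # expected: " Hello"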

versae (Contributor) commented Jun 24, 2023

Hey! Sorry, I haven't had the time to properly implement this, but I can confirm that using AddedTokens works well 👌

huggingface deleted a comment from the github-actions bot on Jun 25, 2023
ArthurZucker (Collaborator) commented

I'll open a PR; I am not entirely sure just using added tokens will solve this. We need backward compatibility, so I'll add a new argument like encoder_special. We'll see.

ArthurZucker (Collaborator) commented

Ok! So it seems that skip_special_tokens when encoding will make its way to transformers 😉

sanchit-gandhi (Contributor) commented

Keep us posted!

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker (Collaborator) commented

Skip special tokens was merged in #25081, so closing this now.
