
Whisper: timestamp tokens are missing in the tokenizer vocabulary #20225

Closed
guillaumekln opened this issue Nov 15, 2022 · 14 comments

guillaumekln (Contributor) commented Nov 15, 2022

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.13.0+cu117 (True)

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The vocabulary size returned by the WhisperTokenizer does not match the vocabulary size reported in the configuration (config.vocab_size): the timestamp tokens are missing from the tokenizer vocabulary. Consider this example:

import transformers

tokenizer = transformers.WhisperTokenizer.from_pretrained("openai/whisper-tiny")
config = transformers.WhisperConfig.from_pretrained("openai/whisper-tiny")

vocab = tokenizer.get_vocab()

print(len(vocab) == config.vocab_size)  # prints False

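# Append the 1501 timestamp tokens <|0.00|> ... <|30.00|> (0.02 s increments)
# that the model expects but the tokenizer vocabulary does not contain.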
for i in range(1500 + 1):
    timestamp = "<|%.2f|>" % (i * 0.02)
    vocab[timestamp] = len(vocab)

print(len(vocab) == config.vocab_size)  # prints True

The token surface used in the code snippet is copied from the reference implementation:

https://github.com/openai/whisper/blob/9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d/whisper/tokenizer.py#L151

Expected behavior

The vocabulary size returned by the tokenizer should match the model vocabulary size.

ArthurZucker self-assigned this Nov 15, 2022
ArthurZucker (Collaborator) commented

Hey! Though I agree with you that the tokenizer vocab size normally matches the model's, in this case the behaviour mirrors the original implementation. The timestamp tokens are all out of vocabulary and are decoded as "" by the fast GPT-2 tokenizer used in the original code. The WhisperTokenizer was adapted to follow this, so as not to bother with tokens that are only used by the timestamp_logits_processor. Indeed, all the extra tokens (>50363) are treated as timestamp predictions and "ignored".
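
To illustrate that behaviour, a small sketch against the transformers 4.24 tokenizer reported in this issue (the exact result may differ in other versions):

import transformers

tokenizer = transformers.WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# 50364 is the first timestamp id (<|0.00|>). It is outside the tokenizer
# vocabulary here, so decoding it is expected to produce an empty string.
print(repr(tokenizer.decode([50364])))  # expected: ''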

cc @LysandreJik, as we had a lot of issues with other models; there was discussion on whether to always add the extra tokens or not.

guillaumekln (Contributor, Author) commented

Thank you for the explanation! Feel free to close this issue if you want to keep it this way. I can work around it in my own code.

For context, I'm converting some Transformers models to another format and I want the tokenizer vocabulary size to always match the model vocabulary size. In many cases I need to add some tokens (most often "madeupword" tokens to pad the vocabulary to a multiple of 8), and sometimes I need to remove some (e.g. for facebook/bart-large-cnn the tokenizer has 1 additional token for some reason). It would be great if len(tokenizer.get_vocab()) were always consistent with the model vocabulary size.
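
For reference, a minimal sketch of that kind of padding (the <|madeupwordN|> surface is just an illustrative placeholder, not a token defined by any checkpoint):

import transformers

tokenizer = transformers.WhisperTokenizer.from_pretrained("openai/whisper-tiny")
config = transformers.WhisperConfig.from_pretrained("openai/whisper-tiny")

# Pad the tokenizer vocabulary with placeholder tokens until it matches the
# model's embedding size.
missing = config.vocab_size - len(tokenizer.get_vocab())
if missing > 0:
    tokenizer.add_tokens(["<|madeupword%d|>" % i for i in range(missing)])

print(len(tokenizer.get_vocab()) == config.vocab_size)  # should print True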

versae (Contributor) commented May 4, 2023

Since the original OpenAI implementation moved from HF tokenizers to their own tiktoken library, it seems timestamp tokens are now handled there and converted to token ids. Right now the timestamp tokens in HF are handled as plain strings rather than individual tokens.

from transformers import WhisperTokenizer
from whisper.tokenizer import get_tokenizer

hf_tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
openai_tok = get_tokenizer(multilingual=True, language="en", task="transcribe")

openai_tok.encode("<|1.00|>", disallowed_special=[])
# [27, 91, 16, 13, 628, 91, 29]
hf_tok.encode("<|1.00|>", add_special_tokens=False)
# [27, 91, 16, 13, 628, 91, 29]

openai_tok.encode("<|1.00|>", allowed_special=set(openai_tok.special_tokens.keys()))
# [50414]
hf_tok.encode("<|1.00|>", add_special_tokens=True)
# [50258, 50363, 27, 91, 16, 13, 628, 91, 29, 50257]

Could it be time to revisit this issue?

versae mentioned this issue on May 4, 2023
ArthurZucker (Collaborator) commented

Nope, we also added support for decoding with timestamps. For that you just need to specify decode_with_timestamps; see here.
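
A quick sketch of that decoding path (50364 and 50414 should be <|0.00|> and <|1.00|> in the multilingual vocab, per the snippet above):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Wrap some text ids between two timestamp ids and decode with timestamps enabled.
token_ids = [50364] + tokenizer.encode(" Hello world", add_special_tokens=False) + [50414]
print(tokenizer.decode(token_ids, decode_with_timestamps=True))
# expected to print something like "<|0.00|> Hello world<|1.00|>"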

versae (Contributor) commented May 25, 2023

Yeah, but if you want to train using the right timestamp tokens, there's no support for that AFAIK. We had to add the tokens manually. The encoding function is a bit more convoluted to modify to support encoding the timestamp tokens behind a flag, the way it's now implemented for decoding.
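
For reference, the manual workaround looks roughly like this (a sketch; it assumes that adding the 1501 timestamp surfaces as AddedTokens lands them on the ids the model expects):

from transformers import AddedToken, WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Register every timestamp surface so it encodes to a single id.
timestamps = [AddedToken("<|%.2f|>" % (i * 0.02), lstrip=False, rstrip=False)
              for i in range(1500 + 1)]
tokenizer.add_tokens(timestamps)

print(tokenizer.encode("<|1.00|>", add_special_tokens=False))
# expected: a single id instead of the 7-id string encoding shown above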

ArthurZucker (Collaborator) commented

Then we should probably add them to added_tokens_encoder and refactor the tokenizer a bit for encoding/decoding. wdyt @sanchit-gandhi @hollance?

sanchit-gandhi (Contributor) commented

Yep, I agree - took a look through and @versae is spot on: the new OpenAI tokenizer has these tokens as part of its vocabulary, so they can be encoded directly. We should follow suit and update our encoding function accordingly.

ArthurZucker (Collaborator) commented

Also, if they are part of the special tokens, they will not be skipped by default and would have to be skipped with skip_special_tokens=True. But that should be alright! @versae feel free to open a PR and ping me if you have time; otherwise I might be able to tackle that in 2 weeks.
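
Something along these lines; the sketch below uses the already-registered special tokens to show the effect of the flag, since the timestamp tokens are not special tokens yet at this point:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

ids = tokenizer.encode("<|startoftranscript|><|en|><|transcribe|> Hello", add_special_tokens=False)

# Special tokens are kept by default and only dropped when explicitly requested.
print(tokenizer.decode(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))  # expected: " Hello"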

versae (Contributor) commented Jun 24, 2023

Hey! Sorry, I haven't had the time to properly implement this, but I can confirm that using AddedTokens works well 👌

huggingface deleted a comment from the github-actions bot on Jun 25, 2023
ArthurZucker (Collaborator) commented

I'll open a PR; I am not entirely sure just using added tokens will solve this. We need backward compatibility, so I'll add a new argument like encoder_special. We'll see.

ArthurZucker (Collaborator) commented

Ok! So it seems that skip_special_tokens when encoding will make its way to transformers 😉

sanchit-gandhi (Contributor) commented

Keep us posted!

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker (Collaborator) commented

Skip special tokens was merged in #25081, so closing this now.
