
Discrepancies between whisper training script and notebook #29973

Closed
MarhyCZ opened this issue Mar 31, 2024 · 2 comments · Fixed by huggingface/blog#1949
MarhyCZ commented Mar 31, 2024

Hi @sanchit-gandhi , first of all, thanks for such nicely curated transformers examples and fine-tuning tutorials. I would be completely lost without them.

I was comparing your latest whisper fine-tuning notebook huggingface/blog#1944 with the end-to-end script run_speech_recognition_seq2seq.py available in examples, and also with the latest bugfix PRs like #29938.

I still found some discrepancies, but maybe they are intentional, so I am hesitant to open any PRs.

In the seq2seq script, in the DataCollatorSpeechSeq2SeqWithPadding class, you have:

# if bos token is appended in previous tokenization step,
# cut bos token here as it's append later anyways
if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
    labels = labels[:, 1:]

whereas the fine-tuning notebook declares:

if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
    labels = labels[:, 1:]

I get that this is meant to make the script generic to any model, not only Whisper. But when we look into the Whisper config.json, we see different values:

"bos_token_id": 50257,
"decoder_start_token_id": 50258,

They correspond to different tokens:
https://huggingface.co/openai/whisper-large-v3/blob/main/added_tokens.json

  "<|startoftranscript|>": 50258,
  "<|endoftext|>": 50257

Is this the intended behavior?
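
For reference, a quick way to confirm which token each ID maps to (a minimal sketch, assuming transformers is installed and using the openai/whisper-large-v3 checkpoint referenced above):

from transformers import WhisperConfig, WhisperTokenizer

config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")

print(config.bos_token_id)            # 50257
print(config.decoder_start_token_id)  # 50258
print(tokenizer.convert_ids_to_tokens([config.bos_token_id, config.decoder_start_token_id]))
# ['<|endoftext|>', '<|startoftranscript|>']

So for Whisper, the notebook's check would seem to compare against <|endoftext|> rather than the <|startoftranscript|> token that the collator comment says may lead the label sequence.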

A second thing: in your new fine-tuning Jupyter notebook you no longer manually clear forced_decoder_ids and suppress_tokens in model.config, but in model.generation_config:

model.generation_config.language = "hi"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

So shouldn't this line also be changed in run_speech_recognition_seq2seq.py?

config.update({"forced_decoder_ids": model_args.forced_decoder_ids, "suppress_tokens": model_args.suppress_tokens})
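
For illustration, the equivalent change in the script could look roughly like this (a sketch only, not the exact diff from any linked fix; it assumes the script's data_args carries the language and task fields it uses elsewhere):

# instead of writing the decoding constraints into model.config via config.update,
# clear them on the generation config after the model is loaded
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = None

# language and task would then also live on the generation config,
# mirroring the notebook snippet above
model.generation_config.language = data_args.language
model.generation_config.task = data_args.task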

I was getting slightly different WERs, but only in the range of 1-2%, so that could just be run-to-run variation (I didn't set any seed), so no big deal.

Have a nice day!

@sanchit-gandhi (Contributor) commented

Hey @MarhyCZ - thanks for these astute remarks! Regarding the discrepancy in the data collator, you're absolutely right: we should be using the decoder_start_token_id in both cases. I've fixed the blog post in this PR: huggingface/blog#1949
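
For concreteness, the corrected check in context (a minimal sketch of the collator, assuming decoder_start_token_id is populated from model.config.decoder_start_token_id when the collator is constructed):

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # pad the log-mel input features and the tokenized labels separately,
        # since they have different lengths and padding methods
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so it is ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the decoder start token was prepended during tokenization,
        # cut it here: the model prepends it again itself
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

It would be instantiated along the lines of DataCollatorSpeechSeq2SeqWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id).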

For the forced/suppressed token ids, I've fixed the fine-tuning example script in this commit: a20a96f.

I'm running both the blog post and script side-by-side to check whether I get equivalent results. I'll report back in ~3 hours when these are finished.

MarhyCZ (Author) commented Apr 1, 2024

@sanchit-gandhi You are amazing, thanks for such a quick review!
I made those edits yesterday and I am now getting almost the same WER, so I think we are good. Let me know anyway, just out of curiosity :)

Have a nice day.
