
Discrepancies between whisper training script and notebook #29973

Closed
MarhyCZ opened this issue Mar 31, 2024 · 2 comments · Fixed by huggingface/blog#1949
MarhyCZ commented Mar 31, 2024

Hi @sanchit-gandhi , first of all, thanks for such nicely curated transformers examples and fine-tuning tutorials. I would be completely lost without them.

I was comparing your latest whisper fine-tuning notebook huggingface/blog#1944 with the end-to-end script run_speech_recognition_seq2seq.py available in examples, and also with the latest bugfix PRs like #29938.

I still found some discrepancies, but maybe they are intentional, so I am hesitant to open any PRs.

In the seq2seq script, in the DataCollatorSpeechSeq2SeqWithPadding class, you have:

# if bos token is appended in previous tokenization step,
# cut bos token here as it's append later anyways
if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
    labels = labels[:, 1:]

whereas the fine-tuning notebook declares:

if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
    labels = labels[:, 1:]

I get that this is meant to make the script generic to any model, not only Whisper. But when we look into the Whisper config.json, we see different values:

"bos_token_id": 50257,
"decoder_start_token_id": 50258,

They correspond to different tokens:
https://huggingface.co/openai/whisper-large-v3/blob/main/added_tokens.json

  "<|startoftranscript|>": 50258,
  "<|endoftext|>": 50257

Is this the intended behavior?
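
For reference, a quick way to confirm which token each ID maps to (a minimal sketch, assuming transformers is installed and using the openai/whisper-large-v3 checkpoint referenced above):

from transformers import WhisperConfig, WhisperTokenizer

config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")

print(config.bos_token_id)            # 50257
print(config.decoder_start_token_id)  # 50258
print(tokenizer.convert_ids_to_tokens([config.bos_token_id, config.decoder_start_token_id]))
# ['<|endoftext|>', '<|startoftranscript|>']

So for Whisper, the notebook's check would seem to compare against <|endoftext|> rather than the <|startoftranscript|> token that the collator comment says may lead the label sequence.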

A second thing: in your new fine-tuning Jupyter notebook you no longer manually clear forced_decoder_ids and suppress_tokens in model.config, but in model.generation_config:

model.generation_config.language = "hi"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

So shouldn't this line also be changed in run_speech_recognition_seq2seq.py?

config.update({"forced_decoder_ids": model_args.forced_decoder_ids, "suppress_tokens": model_args.suppress_tokens})
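
For illustration, the equivalent change in the script could look roughly like this (a sketch only, not the exact diff from any linked fix; it assumes the script's data_args carries the language and task fields it uses elsewhere):

# instead of writing the decoding constraints into model.config via config.update,
# clear them on the generation config after the model is loaded
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = None

# language and task would then also live on the generation config,
# mirroring the notebook snippet above
model.generation_config.language = data_args.language
model.generation_config.task = data_args.task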

I was getting slightly different WERs, but only in the range of 1-2%, so that could just be run-to-run variation (I didn't set any seed), so no big deal.

Have a nice day!

@sanchit-gandhi (Contributor) commented

Hey @MarhyCZ - thanks for these astute remarks! Regarding the discrepancy in the data collator, you're absolutely right: we should be using the decoder_start_token_id in both cases. I've fixed the blog post in this PR: huggingface/blog#1949
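
For concreteness, the corrected check in context (a minimal sketch of the collator, assuming decoder_start_token_id is populated from model.config.decoder_start_token_id when the collator is constructed):

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # pad the log-mel input features and the tokenized labels separately,
        # since they have different lengths and padding methods
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so it is ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the decoder start token was prepended during tokenization,
        # cut it here: the model prepends it again itself
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

It would be instantiated along the lines of DataCollatorSpeechSeq2SeqWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id).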

For the forced/suppressed token ids, I've fixed the fine-tuning example script in this commit: a20a96f.

I'm running both the blog post and script side-by-side to check whether I get equivalent results. I'll report back in ~3 hours when these are finished.

MarhyCZ (Author) commented Apr 1, 2024

@sanchit-gandhi You are amazing, thanks for such a quick review!
I made those edits yesterday and I am now getting almost the same WER, so I think we are good. Let me know anyway, just out of curiosity :)

Have a nice day.
