Whisper encoder + No 30 second padding #5

farzadab · 2024-06-04T17:58:58Z

This PR enables the Whisper encoder for training and inference without padding every audio clip to the full 30 seconds.
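For a sense of what skipping the padding saves, here is a rough sketch of the sequence lengths involved. It relies on Whisper's standard front end (16 kHz audio, a 160-sample hop, a 30-second window, and a stride-2 convolution in the encoder). The helper name `encoder_positions` is made up for illustration and is not part of this PR.

```python
import math

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio
HOP_LENGTH = 160      # samples per mel frame, i.e. 100 frames per second
CONV_STRIDE = 2       # the encoder's second conv layer halves the frame count
MAX_SECONDS = 30      # the fixed window Whisper normally pads to


def encoder_positions(duration_s: float, pad_to_max: bool) -> int:
    """Approximate number of encoder positions for one clip (hypothetical helper)."""
    seconds = MAX_SECONDS if pad_to_max else min(duration_s, MAX_SECONDS)
    mel_frames = math.ceil(seconds * SAMPLE_RATE / HOP_LENGTH)
    return math.ceil(mel_frames / CONV_STRIDE)


padded = encoder_positions(3.0, pad_to_max=True)     # 1500
unpadded = encoder_positions(3.0, pad_to_max=False)  # 150
print(padded, unpadded)
```

Under these assumptions a 3-second clip occupies a tenth of the encoder positions it would take after padding to 30 seconds.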

I verified that inference works with:

$ python -m ultravox.tools.infer_tool  -T 32 -n 10 -d boolq_in --asr \
    -m wandb://fixie/ultravox/model-llama3_whisper_s__gs__cont_lora_gs_ai_constant__longest_pad:v1

--- Sample 0 ---
Q: Transcribe <|audio|> ["do iran and afghanistan speak the same language?"]
A: Do Iran and Afghanistan speak the same language?
X: do iran and afghanistan speak the same language? [wer: 0.00, avg: 0.00]
--- Sample 1 ---
Q: Transcribe <|audio|> ["do good samaritan laws protect those who help at an accident?"]
A: Do good Samaritan laws protect those who help at an accident?
X: do good samaritan laws protect those who help at an accident? [wer: 0.00, avg: 0.00]
--- Sample 2 ---
Q: Transcribe <|audio|> ["is windows movie maker part of windows essentials?"]
A: Is Windows Movie Maker part of windows essentials?
X: is windows movie maker part of windows essentials? [wer: 0.00, avg: 0.00]
--- Sample 3 ---
Q: Transcribe <|audio|> ["is confectionary sugar the same as powdered sugar?"]
A: Is confectionary sugar the same as powdered sugar?
X: is confectionary sugar the same as powdered sugar? [wer: 0.00, avg: 0.00]
--- Sample 4 ---
Q: Transcribe <|audio|> ["is elder scrolls online the same as skyrim?"]
A: Is elder scrolls Online the same as Skyrim?
X: is elder scrolls online the same as skyrim? [wer: 0.00, avg: 0.00]
--- Sample 5 ---
Q: Transcribe <|audio|> ["can you use oyster card at epsom station?"]
A: Can you use Oyster card at Upminster station?
X: can you use oyster card at epsom station? [wer: 0.12, avg: 0.02]
--- Sample 6 ---
Q: Transcribe <|audio|> ["will there be a season 4 of da vinci's demons?"]
A: Will there be a season four of Downton Abbey?
X: will there be a season 4 of da vinci's demons? [wer: 0.45, avg: 0.08]
...
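The `wer` values in the log are word error rates, with `avg` as the running mean over the samples seen so far. A minimal sketch of such a metric follows, assuming a simple normalization (lowercase, strip punctuation). The text normalization used by `infer_tool` is not shown in this PR and may differ, so this sketch is not guaranteed to reproduce every figure in the log.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, keep only letters, digits and spaces, then split into words."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


scores = [
    wer("do iran and afghanistan speak the same language?",
        "Do Iran and Afghanistan speak the same language?"),
    wer("can you use oyster card at epsom station?",
        "Can you use Oyster card at Upminster station?"),
]
print(scores)  # [0.0, 0.125]
```

The second score corresponds to Sample 5: one substituted word out of eight reference words.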

ultravox/model/modified_whisper.py

ultravox/data/datasets.py

ultravox/inference/ultravox_infer.py


* enable whisper model (no need for max_padding)
* bugfix: boolq_in didn't have audio_transcript -> use _get_transcribe_sample
* bugfix: move model to device before merging lora weights
* rename modified_whisper -> whisper_model_modified


farzadab added 4 commits June 4, 2024 10:52

enable whisper model (no need for max_padding)

e2ee6da

bugfix: boolq_in didn't have audio_transcript

8b46222

bugfix: move model to device before merging lora weights

ab67298

formatting

1a52c9e

farzadab marked this pull request as ready for review June 4, 2024 18:20

farzadab requested a review from juberti June 4, 2024 18:23

farzadab commented Jun 4, 2024


ultravox/model/modified_whisper.py (outdated, resolved)

ultravox/data/datasets.py (outdated, resolved)

ultravox/inference/ultravox_infer.py (resolved)

ultravox/inference/ultravox_infer.py (outdated, resolved)

juberti approved these changes Jun 4, 2024


farzadab added 5 commits June 4, 2024 12:53

rename modified_whisper -> whisper_model_modified

79c9dbd

boolq_in: use _get_transcribe_sample

988b33a

add comment for right padding audio_values

a6f10ab

improve whisper comments

333ce1c

reducing exposure of configurable audio_padding

60b330f
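The commit titles above mention right padding of `audio_values`, and the checkpoint name used for verification ends in `longest_pad`. One plausible reading, offered as an assumption rather than a description of this PR's code, is that each batch is padded on the right to its longest clip instead of to 30 seconds. A minimal sketch of that idea, using plain Python lists and a hypothetical helper `right_pad_batch`:

```python
def right_pad_batch(batch, pad_value=0.0):
    """Pad every sequence on the right to the longest length in the batch.

    Hypothetical helper for illustration; returns the padded sequences and the
    original lengths, which a model would need in order to mask the padding.
    """
    longest = max(len(seq) for seq in batch)
    lengths = [len(seq) for seq in batch]
    padded = [list(seq) + [pad_value] * (longest - len(seq)) for seq in batch]
    return padded, lengths


padded, lengths = right_pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(padded)   # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
print(lengths)  # [3, 1]
```

Keeping the original lengths alongside the padded values lets downstream code ignore the padded tail.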

farzadab merged commit edc3797 into main Jun 4, 2024
1 check passed

farzadab deleted the farzad-whisper-pr branch June 4, 2024 21:51

zqhuang211 pushed a commit that referenced this pull request Feb 12, 2025

[Part-2] LSM training (#5)

9412bde

* LSM training
* use WavTokenizer directly instead of CustomWavTokenizer
