Add step to whisperx role to cache models required for transcription #1607

Merged 9 commits on Feb 4, 2025
2 changes: 2 additions & 0 deletions roles/whisperx/defaults/main.yml
@@ -0,0 +1,2 @@
---
model_script_stage: PROD
1 change: 1 addition & 0 deletions roles/whisperx/meta/main.yml
@@ -1,6 +1,7 @@
---
dependencies:
  - pip3
  - aws-tools
  - role: packages
    packages:
      - ffmpeg
17 changes: 15 additions & 2 deletions roles/whisperx/tasks/main.yml
@@ -9,13 +9,26 @@
  pip:
    name:
      - torch==2.0.0
      - torchvision==0.15.1
      - torchaudio==2.0.1
    executable: pip3
    extra_args: "--index-url https://download.pytorch.org/whl/cu118"

- name: Install whisperx
  pip:
    name: whisperx
    name: "{{ whisperx_package | default('whisperx') }}"
    executable: pip3

# We use a python script to fetch the models which is owned by the transcription service. See the below PRs for details:
# - https://github.com/guardian/amigo/pull/1607
# - https://github.com/guardian/transcription-service/pull/130
- name: Download models script
  shell: |
    aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py

Review comment (Member): Should we make the bucket a parameter too? Interestingly, the cdk-base role uses a bucket without the stage suffix.

    exit 0

Review comment (Member): Is this exit 0 needed?
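
Following up on the two comments above, here is one possible shape for this task. It is a sketch under assumptions, not what the PR merged. It assumes a new default, `model_script_bucket` (a hypothetical name that does not exist in the role), and swaps `shell` for `command` so there is no trailing `exit 0`. In a multi-line `shell` block without `set -e`, a final `exit 0` makes the task report success even when `aws s3 cp` fails, so dropping it lets a failed download fail the task.

```yaml
---
# roles/whisperx/defaults/main.yml (sketch)
model_script_stage: PROD
# Hypothetical variable, not part of the PR: lets callers override the bucket,
# for example to point at a bucket without the stage suffix.
model_script_bucket: "amigo-data-{{ model_script_stage | lower }}"
---
# roles/whisperx/tasks/main.yml (sketch)
- name: Download models script
  # With `command`, a non-zero exit from `aws s3 cp` fails the task directly.
  command: >-
    aws --quiet s3 cp
    s3://{{ model_script_bucket }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py
    /tmp/download_whisperx_models.py
```

With the default left in place, the bucket still resolves to `amigo-data-prod` for `model_script_stage: PROD`, so existing recipes would not need to change.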


# The script lives here https://github.com/guardian/transcription-service/blob/main/whisperx-model-fetch/download_whisperx_models.py
# If you are changing these parameters it may be helpful to run it locally to test the changes.
- name: Download whisperx models
  command: "python3 /tmp/download_whisperx_models.py --whisper-models --diarization-models --torch-align-models --huggingface-token {{ huggingface_token }}"
  become: yes

Review comment (Member): become: yes?!?! Ansible's API is confusing! 😅

  become_user: ubuntu
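
On the `become` question: in Ansible, `become: yes` only switches privilege escalation on, and `become_user` names the account to switch to (root when omitted), so both keys are needed to run the script as `ubuntu`. Below is a sketch of the same task with the two keys annotated and the boolean spelled `true`, which is equivalent to `yes`. The reason for running as `ubuntu` is an assumption here: presumably it is so the models are cached under that user's home directory, though the PR does not say so.

```yaml
- name: Download whisperx models
  command: >-
    python3 /tmp/download_whisperx_models.py
    --whisper-models --diarization-models --torch-align-models
    --huggingface-token {{ huggingface_token }}
  become: true          # turn privilege escalation on (same as `yes`)
  become_user: ubuntu   # switch to this user instead of the default, root
```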