This project uses the Automatic Speech Recognition (ASR) model OpenAI Whisper to create subtitles for talks and similar videos. Whisper transcribes most words and sentences correctly with the base model, but the Word Error Rate (WER) can be reduced by using the larger (and more resource-hungry) models.
This tool can take much of the workload out of transcribing subtitles. However, manual correction MUST be performed afterwards to ensure precision.
An example of incorrect word recognition with the base model is that the word 'batch' is sometimes recognized as 'patch'. While this happens with the tiny and base models, it is not necessarily an issue with the larger models. Read the OpenAI Whisper model card and the paper Robust Speech Recognition via Large-Scale Weak Supervision by Radford et al. for more information on transcription precision.
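To make the WER figure concrete, here is a minimal sketch of how it is commonly computed: the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of this project or of Whisper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("process the next batch", "process the next patch"))
```

With the 'batch'/'patch' confusion above, one of four words is wrong, so the sketch reports a WER of 0.25 for that sentence.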
Fetch a talk from media.ccc.de to test the program out.
Performance has been tested on the 18-minute talk "This years badge" by Thomas Flummer from Bornhack 2022.
Processor | Model | Transcription duration |
---|---|---|
3 GHz CPU | base model | 15 min 12 sec |
Nvidia Tesla M60, 1 core | base model | 1 min 36 sec |
Nvidia Tesla M60, 1 core | medium model | 7 min 11 sec |
Nvidia RTX 3090 | tiny model | 21 sec |
Nvidia RTX 3090 | base model | 35 sec |
Nvidia RTX 3090 | small model | 1 min 4 sec |
Nvidia RTX 3090 | medium model | 2 min 3 sec |
Nvidia RTX 3090 | large model | 2 min 53 sec |
Nvidia RTX A4000 | tiny model | 47 sec |
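One way to compare the rows above is to express each timing as a multiple of real time. The sketch below does this for three of the measurements, assuming the talk is exactly 18 minutes long (the stated length is rounded, so the results are approximate). The helper names `to_seconds` and `speedup` are made up for this illustration.

```python
TALK_SECONDS = 18 * 60  # talk length as stated above, rounded to whole minutes


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'min sec' duration from the table to seconds."""
    return minutes * 60 + seconds


def speedup(transcribe_seconds: int, media_seconds: int = TALK_SECONDS) -> float:
    """Seconds of media processed per second of wall-clock time."""
    return media_seconds / transcribe_seconds


# A few rows copied from the table above.
timings = {
    ("3 GHz CPU", "base"): to_seconds(15, 12),
    ("Nvidia RTX 3090", "base"): to_seconds(0, 35),
    ("Nvidia RTX 3090", "large"): to_seconds(2, 53),
}

for (processor, model), seconds in timings.items():
    print(f"{processor}, {model} model: {speedup(seconds):.1f}x real time")
```

Under that assumption, the base model runs at roughly 1.2x real time on the CPU and roughly 30.9x on the RTX 3090, while the large model on the RTX 3090 runs at roughly 6.2x.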
As noted in the OpenAI Whisper repository, the library should work with Python 3.7 and later.
Required dependencies are ffmpeg, Python 3 with the virtual environment package, the Python dependencies listed in the requirements.txt file, and Nvidia drivers for your GPU.
sudo apt update
sudo apt upgrade -y
sudo apt install ffmpeg python3.9 python3.9-venv
Nvidia drivers
Install GPU drivers. If OpenAI Whisper cannot find the drivers, it falls back to the CPU for transcription, which takes significantly longer.
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-common ubuntu-drivers-common -y
sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
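Once the drivers and the Python dependencies are installed, a quick way to check whether the GPU is visible is to ask PyTorch, which Whisper runs on. This is a minimal sketch and not part of app.py; the function name `pick_device` is an assumption made for this example, and it reports "cpu" if PyTorch is not installed yet.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of Whisper
    except ImportError:
        # Without PyTorch there is nothing to query, so report the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Device visible to PyTorch: {pick_device()}")
```

If this prints "cpu" on a machine with an Nvidia GPU, the driver installation is the first thing to revisit.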
Install the following packages
sudo apt update
sudo apt install linux-headers-amd64 ffmpeg python3.11 python3.11-venv
See the following wiki article for Nvidia driver installation instructions.
Install the following packages
sudo pacman -Sy ffmpeg python python-virtualenv
More information on the Arch wiki about Nvidia drivers.
Create a virtual environment and install dependencies. Look into the OpenAI Whisper setup if you encounter dependency errors.
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install git+https://github.com/openai/whisper.git
Enter the virtual environment and run the program:
source venv/bin/activate
python app.py --video <video_file> --model <whisper_model>
Parameters:
usage: app.py [-h] [-v VIDEO] [-l] [-m WHISPER_MODEL]
Create subtitle file from video.
options:
-h, --help show this help message and exit
-v VIDEO, --video VIDEO
Video file to be processed
-l, --language Manually set transcription language
-m WHISPER_MODEL, --model WHISPER_MODEL
Set OpenAI Whisper model
The example below generates subtitles for a directory of videos with the large OpenAI Whisper model and times the run:
time python app.py --video videos/ --model large
The program outputs an SRT file named <video_file>.srt in the same directory as the video file. You can use VLC or other media players to play the video with the subtitles.
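For readers unfamiliar with the SRT format, the sketch below shows how timed segments can be turned into SRT text. It assumes segments shaped like the ones Whisper's `transcribe()` returns in its `"segments"` list (dictionaries with `start`, `end` and `text`), but it is only an illustration and not necessarily how app.py writes its output. The function names and the sample segments are made up.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text: a counter, a time range and the text for each segment."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical segments standing in for real transcription output.
    example = [
        {"start": 0.0, "end": 2.5, "text": " Welcome to the talk."},
        {"start": 2.5, "end": 5.0, "text": " Thanks for coming."},
    ]
    print(segments_to_srt(example))
```

Because SRT is plain text, the manual correction mentioned earlier can be done in any text editor.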
Exit virtual environment
deactivate
Update Whisper library
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
- OpenAI Whisper for their wonderful models
- Much inspiration has been drawn from Whisper-ASR-youtube-subtitles