diff --git a/README.md b/README.md
index 45a579d9..a99343eb 100644
--- a/README.md
+++ b/README.md
@@ -12,63 +12,53 @@ This implementation is up to 4 times faster than [openai/whisper](https://github
 For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:
 
-* [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
-* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
-* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
+* [openai/whisper](https://github.com/openai/whisper)@[v20240930](https://github.com/openai/whisper/tree/v20240930)
+* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[v1.7.2](https://github.com/ggerganov/whisper.cpp/tree/v1.7.2)
+* [transformers](https://github.com/huggingface/transformers)@[v4.46.3](https://github.com/huggingface/transformers/tree/v4.46.3)
+* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[v1.1.0](https://github.com/SYSTRAN/faster-whisper/tree/v1.1.0)
 
 ### Large-v2 model on GPU
 
-| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
-| --- | --- | --- | --- | --- | --- |
-| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
-| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
-| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |
-
-*Executed with CUDA 11.7.1 on a NVIDIA Tesla V100S.*
+| Implementation | Precision | Beam size | Time | VRAM Usage |
+| --- | --- | --- | --- | --- |
+| openai/whisper | fp16 | 5 | 2m23s | 4708MB |
+| whisper.cpp (Flash Attention) | fp16 | 5 | 1m05s | 4127MB |
+| transformers (SDPA)[^1] | fp16 | 5 | 1m52s | 4960MB |
+| faster-whisper | fp16 | 5 | 1m03s | 4525MB |
+| faster-whisper (`batch_size=8`) | fp16 | 5 | 17s | 6090MB |
+| faster-whisper | int8 | 5 | 59s | 2926MB |
+| faster-whisper (`batch_size=8`) | int8 | 5 | 16s | 4500MB |
 
-### Small model on CPU
+### distil-whisper-large-v3 model on GPU
 
-| Implementation | Precision | Beam size | Time | Max. memory |
+| Implementation | Precision | Beam size | Time | YT Commons WER |
 | --- | --- | --- | --- | --- |
-| openai/whisper | fp32 | 5 | 10m31s | 3101MB |
-| whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
-| whisper.cpp | fp16 | 5 | 12m39s | 873MB |
-| faster-whisper | fp32 | 5 | 2m44s | 1675MB |
-| faster-whisper | int8 | 5 | 2m04s | 995MB |
-
-*Executed with 8 threads on a Intel(R) Xeon(R) Gold 6226R.*
+| transformers (SDPA) (`batch_size=16`) | fp16 | 5 | 46m12s | 14.801 |
+| faster-whisper (`batch_size=16`) | fp16 | 5 | 25m50s | 13.527 |
 
+*GPU benchmarks were executed with CUDA 12.4 on an NVIDIA RTX 3070 Ti 8GB.*
+[^1]: transformers runs out of memory (OOM) for any batch size > 1
 
-### Distil-whisper
+### Small model on CPU
 
-| Implementation | Precision | Beam size | Time | Gigaspeech WER |
+| Implementation | Precision | Beam size | Time | RAM Usage |
 | --- | --- | --- | --- | --- |
-| distil-whisper/distil-large-v2 | fp16 | 4 |- | 10.36 |
-| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
-| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
-| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
-
-*Executed with CUDA 11.4 on a NVIDIA 3090.*
-
-<details>
-<summary>testing details (click to expand)</summary>
+| openai/whisper | fp32 | 5 | 6m58s | 2335MB |
+| whisper.cpp | fp32 | 5 | 2m05s | 1049MB |
+| whisper.cpp (OpenVINO) | fp32 | 5 | 1m45s | 1642MB |
+| faster-whisper | fp32 | 5 | 2m37s | 2257MB |
+| faster-whisper (`batch_size=8`) | fp32 | 5 | 1m06s | 4230MB |
+| faster-whisper | int8 | 5 | 1m42s | 1477MB |
+| faster-whisper (`batch_size=8`) | int8 | 5 | 51s | 3608MB |
 
-For `distil-whisper/distil-large-v2`, the WER is tested with code sample from [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). for `faster-distil-whisper`, the WER is tested with setting:
-```python
-from faster_whisper import WhisperModel
+*Executed with 8 threads on an Intel Core i7-12700K.*
 
-model_size = "distil-large-v2"
-# model_size = "distil-medium.en"
-# Run on GPU with FP16
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
-```
-</details>
 
 ## Requirements
 
 * Python 3.8 or greater
 
+Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.
+
 ### GPU
diff --git a/faster_whisper/version.py b/faster_whisper/version.py
index b4c21869..f99ce29e 100644
--- a/faster_whisper/version.py
+++ b/faster_whisper/version.py
@@ -1,3 +1,3 @@
 """Version information."""
 
-__version__ = "1.1.0rc0"
+__version__ = "1.1.0"
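The `batch_size` rows in the benchmark tables above come from the batched inference pipeline that ships in this release. A minimal sketch of what such a run looks like (the model size, device, and `audio.mp3` path are placeholder choices, and the exact settings of the benchmark harness may differ):

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

# Load the model once; fp16 on GPU matches the fp16 rows in the GPU table.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Wrap the model in the batched pipeline and transcribe with batch_size=8,
# mirroring the `batch_size=8` benchmark rows.
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", beam_size=5, batch_size=8)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

`segments` is a generator, so the transcription only runs as it is iterated.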
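The new paragraph under Requirements replaces the system FFmpeg dependency with the FFmpeg libraries bundled in PyAV. A short sketch of that decoding path, assuming a local `audio.mp3`; `decode_audio` is the helper faster-whisper exports for it:

```python
from faster_whisper import WhisperModel, decode_audio

# PyAV decodes the file with the FFmpeg libraries bundled in its wheel,
# so no system-wide FFmpeg installation is needed. The helper resamples
# to 16 kHz mono and returns a float32 NumPy array.
audio = decode_audio("audio.mp3", sampling_rate=16000)

model = WhisperModel("small", device="cpu", compute_type="int8")

# transcribe() accepts a file path, a file-like object, or a decoded waveform.
segments, info = model.transcribe(audio, beam_size=5)
for segment in segments:
    print(segment.text)
```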