VDR fixes #42

Merged · 1 commit · May 6, 2022
8 changes: 4 additions & 4 deletions New-language-adaptation/German/README.md
@@ -22,9 +22,9 @@ When adapting Riva to a whole new language, a large amount of high-quality trans

For German, there are several significant sources of public datasets that we can readily leverage:

-- [Mozila Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours
-- [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours
-- [Voxpopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours
+- [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours, ~26 GB.
+- [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours, ~115 GB.
+- [Voxpopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours, ~4.6 GB.

The total amount of public datasets is thus ~2700 hours of transcribed German speech audio data.

@@ -58,7 +58,7 @@ Preparation of the tokenizer is made simple by the [process_asr_text_tokenizer.p

This step is carried out to filter out some outlying samples in the datasets.

-- Samples that are too long, too short or empty are filtered out.
+- Samples that are too long (>20 s), too short (<0.1 s) or empty are filtered out.

- In addition, we also filter out samples that are considered 'noisy', that is, samples having a very high WER (word error rate) or CER (character error rate) w.r.t. previously trained German models (a minimal filtering sketch follows below).

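For illustration, the duration and noisiness criteria above can be applied with a short Python pass over a NeMo-style JSON-lines manifest. This is a minimal sketch, not the actual filtering script used for this model: the manifest file names, the `pred_text` field (hypotheses from the previously trained model), and the 0.75 WER cutoff are assumptions made for the example.

```python
import json

def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # min(deletion, insertion, substitution/match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)

def keep(sample: dict, min_dur=0.1, max_dur=20.0, max_wer=0.75) -> bool:
    """Apply the emptiness, duration and noisiness filters described above."""
    if not sample["text"].strip():
        return False
    if not min_dur <= sample["duration"] <= max_dur:
        return False
    return word_error_rate(sample["text"], sample["pred_text"]) <= max_wer

with open("train_manifest.json") as fin, open("train_manifest_filtered.json", "w") as fout:
    for line in fin:
        if keep(json.loads(line)):
            fout.write(line)
```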
50 changes: 41 additions & 9 deletions asr-python-advanced-customize-vocabulary-and-lexicon.ipynb
@@ -102,7 +102,24 @@
"\n",
"- BYO vocabulary file: provide a flat text file containing a list of vocabulary words, each on its own line. Note that this file must not only contain a small list of \"difficult words\", but must contains all the words that you want the ASR pipeline to be able to generate, that is, including all common words.\n",
"\n",
"- Modifying an existing one: Out of the box vocabulary files for Riva supported languages can be found on NGC, for example, for English, the vocabulary file named `flashlight_decoder_vocab.txt` can be found at this [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1). Alternatively, it can also be found in a deployed Riva ASR pipeline, for example, under `/data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/dict_vocab.txt` in the Riva server docker container.\n",
"- Modifying an existing one: This is the recommended approach. Out-of-the-box vocabulary files for Riva supported languages can be found either:\n",
" - On NGC, for example, for English, the vocabulary file named `flashlight_decoder_vocab.txt` can be found at this [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1).\n",
" - Or in a local Riva deployment: The actual physical location of Riva assets depends on the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder. The vocabulary file is bundled with the Flashlight decoder. \n",
" - By default, `riva_model_loc` is set to `riva-model-repo`, which is a docker volume. You can inspect this docker volume and copy the vocabulary file from within the docker volume to the host file system with commands such as:\n",
" \n",
" ```bash\n",
" # Inspect the Riva model docker volume\n",
" docker inspect riva-model-repo\n",
" \n",
" # Inspect the content of the Riva model docker volume\n",
" docker run --rm -v riva-model-repo:/riva-model-repo alpine ls /riva-model-repo\n",
" \n",
" # Copy the vocabulary file from the docker volume to the current directory\n",
" docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt /dest\n",
" ```\n",
" \n",
" - If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `<RIVA_REPO_DIR>` is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, `<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt`.\n",
"\n",
"You can make a copy, then extend this default vocabulary file with the words of interest.\n",
"\n",
"Once modified, you'll have to redeploy the Riva ASR pipeline with `riva-build` while passing the flag `--decoding_vocab=<modified_vocabulary_file>`."
@@ -114,15 +131,21 @@
"source": [
"## Customizing pronunciation with lexicon mapping\n",
"\n",
"The lexicon file that is used by the Flashlight decoder can be found in the Riva server docker container, under the Triton model folder, for example `/data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/lexicon.txt`.\n",
"The lexicon file that is used by the Flashlight decoder can be found in the Riva assets directory, as specified by the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder (see above).\n",
"\n",
"- If `riva_model_loc` points to a docker volume (by default), you can find and copy the lexicon file with:\n",
"```bash\n",
" # Copy the lexicon file from the docker volume to the current directory\n",
" docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt /dest\n",
"``` \n",
"\n",
"Note: the actual physical location of this file in the host file system, along with other Riva model assets, is specified by the `riva_model_loc` parameter in the Riva configuration file `config.sh`under the Riva quickstart root folder.\n",
"- If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `<RIVA_REPO_DIR>` is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, `<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt`.\n",
"\n",
"### How to modify the lexicon file\n",
"\n",
"First, locate and make a copy of the lexicon file. For example:\n",
"```\n",
"cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt\n",
"cp <RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt modified_lexicon.txt\n",
"```\n",
"\n",
"Next, modify it to add the sentencepiece tokenizations for the words of interest. For example, one could add:\n",
@@ -133,7 +156,7 @@
"```\n",
"which are 3 different pronunciations/tokenizations of the word `manu`. If the acoustic model predicts those tokens, they will be decoded as `manu`.\n",
"\n",
"Finally, once this is done, regenerate the model repository using that new decoding lexicon tokenization by passing `--decoding_lexicon=decoding_lexicon.txt` to `riva-build` instead of `--decoding_vocab=decoding_vocab.txt`.\n",
"Finally, once this is done, regenerate the model repository using that new decoding lexicon tokenization by passing `--decoding_lexicon=modified_lexicon.txt` to `riva-build` instead of `--decoding_vocab=decoding_vocab.txt`.\n",
"\n",
"### How to generate the correct tokenized form\n",
"\n",
@@ -143,7 +166,16 @@
"\n",
"- The tokens are valid tokens as determined by the tokenizer model (packaged with the Riva acoustic model).\n",
"\n",
"The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you’ll need the tokenizer model and the `sentencepiece` Python package (`pip install sentencepiece`). You can get the tokenizer model for the deployed pipeline from the model repository `ctc-decoder` directory for your model. It will be named `<hash>_tokenizer.model`.\n"
"The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you’ll need the tokenizer model and the `sentencepiece` Python package (`pip install sentencepiece`). You can get the tokenizer model for the deployed pipeline from the model repository `ctc-decoder-...` directory for your model. It will be named `<hash>_tokenizer.model`. For example:\n",
"\n",
"`<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model`\n",
"\n",
"When using a docker volume to store Riva assets (by default), you can copy the tokenizer model to the local directory with a command such as:\n",
"\n",
"```bash\n",
" # Copy the tokenizer model file from the docker volume to the current directory\n",
" docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model /dest\n",
"``` "
]
},
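To make the tokenization step concrete, here is a minimal `sentencepiece` sketch that prints candidate lexicon lines for a new word. The tokenizer file name reuses the example hash above; the word, the sampling parameters, and the tab-separated `word<TAB>tokens` output layout are assumptions to verify against the copied `lexicon.txt`:

```python
import sentencepiece as spm

# Load the tokenizer model copied out of the ctc-decoder-... directory.
sp = spm.SentencePieceProcessor()
sp.load("498056ba420d4bb3831ad557fba06032_tokenizer.model")

word = "manu"

# Best (deterministic) tokenization, plus sampled alternatives to cover
# several plausible tokenizations of the same word.
candidates = {tuple(sp.encode_as_pieces(word))}
for _ in range(20):
    candidates.add(tuple(sp.sample_encode_as_pieces(word, -1, 0.1)))

for pieces in sorted(candidates):
    print(word + "\t" + " ".join(pieces))
```

Each printed line is a candidate entry to append to the modified lexicon before rerunning `riva-build`.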
{
@@ -292,9 +324,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "venv-riva-tutorials",
"display_name": "Python 3",
"language": "python",
"name": "venv-riva-tutorials"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -306,7 +338,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
"version": "3.6.9"
}
},
"nbformat": 4,