VDR fixes #42

Merged · 1 commit · May 6, 2022
8 changes: 4 additions & 4 deletions New-language-adaptation/German/README.md
@@ -22,9 +22,9 @@ When adapting Riva to a whole new language, a large amount of high-quality trans

For German, there are several significant sources of public datasets that we can readily leverage:

-- [Mozila Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours
-- [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours
-- [Voxpopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours
+- [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours, ~26 GB.
+- [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours, ~115 GB.
+- [Voxpopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours, ~4.6 GB.

The total amount of public datasets is thus ~2700 hours of transcribed German speech audio data.

@@ -58,7 +58,7 @@ Preparation of the tokenizer is made simple by the [process_asr_text_tokenizer.p

This step is carried out to filter out some outlying samples in the datasets.

-- Samples that are too long, too short or empty are filtered out.
+- Samples that are too long (>20 s), too short (<0.1 s) or empty are filtered out.

- In addition, we also filter out samples that are considered 'noisy', that is, samples having a very high WER (word error rate) or CER (character error rate) w.r.t. previously trained German models (a minimal filtering sketch follows below).

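For illustration, the duration and noisiness criteria above can be applied with a short Python pass over a NeMo-style JSON-lines manifest. This is a minimal sketch, not the actual filtering script used for this model: the manifest file names, the `pred_text` field (hypotheses from the previously trained model), and the 0.75 WER cutoff are assumptions made for the example.

```python
import json

def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # min(deletion, insertion, substitution/match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)

def keep(sample: dict, min_dur=0.1, max_dur=20.0, max_wer=0.75) -> bool:
    """Apply the emptiness, duration and noisiness filters described above."""
    if not sample["text"].strip():
        return False
    if not min_dur <= sample["duration"] <= max_dur:
        return False
    return word_error_rate(sample["text"], sample["pred_text"]) <= max_wer

with open("train_manifest.json") as fin, open("train_manifest_filtered.json", "w") as fout:
    for line in fin:
        if keep(json.loads(line)):
            fout.write(line)
```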
50 changes: 41 additions & 9 deletions asr-python-advanced-customize-vocabulary-and-lexicon.ipynb
@@ -102,7 +102,24 @@
"\n",
"- BYO vocabulary file: provide a flat text file containing a list of vocabulary words, each on its own line. Note that this file must not only contain a small list of \"difficult words\", but must contains all the words that you want the ASR pipeline to be able to generate, that is, including all common words.\n",
"\n",
"- Modifying an existing one: Out of the box vocabulary files for Riva supported languages can be found on NGC, for example, for English, the vocabulary file named `flashlight_decoder_vocab.txt` can be found at this [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1). Alternatively, it can also be found in a deployed Riva ASR pipeline, for example, under `/data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/dict_vocab.txt` in the Riva server docker container.\n",
"- Modifying an existing one: This is the recommended approach. Out-of-the-box vocabulary files for Riva supported languages can be found either:\n",
" - On NGC, for example, for English, the vocabulary file named `flashlight_decoder_vocab.txt` can be found at this [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1).\n",
" - Or in a local Riva deployment: The actual physical location of Riva assets depends on the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder. The vocabulary file is bundled with the Flashlight decoder. \n",
" - By default, `riva_model_loc` is set to `riva-model-repo`, which is a docker volume. You can inspect this docker volume and copy the vocabulary file from within the docker volume to the host file system with commands such as:\n",
" \n",
" ```bash\n",
" # Inspect the Riva model docker volume\n",
" docker inspect riva-model-repo\n",
" \n",
" # Inspect the content of the Riva model docker volume\n",
" docker run --rm -v riva-model-repo:/riva-model-repo alpine ls /riva-model-repo\n",
" \n",
" # Copy the vocabulary file from the docker volume to the current directory\n",
" docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt /dest\n",
" ```\n",
" \n",
" - If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `<RIVA_REPO_DIR>` is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, `<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt`.\n",
"\n",
"You can make a copy, then extend this default vocabulary file with the words of interest.\n",
"\n",
"Once modified, you'll have to redeploy the Riva ASR pipeline with `riva-build` while passing the flag `--decoding_vocab=<modified_vocabulary_file>`."
@@ -114,15 +131,21 @@
"source": [
"## Customizing pronunciation with lexicon mapping\n",
"\n",
"The lexicon file that is used by the Flashlight decoder can be found in the Riva server docker container, under the Triton model folder, for example `/data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/lexicon.txt`.\n",
"The lexicon file that is used by the Flashlight decoder can be found in the Riva assets directory, as specified by the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder (see above).\n",
"\n",
"- If `riva_model_loc` points to a docker volume (by default), you can find and copy the lexicon file with:\n",
"```bash\n",
" # Copy the lexicon file from the docker volume to the current directory\n",
" docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt /dest\n",
"``` \n",
"\n",
"Note: the actual physical location of this file in the host file system, along with other Riva model assets, is specified by the `riva_model_loc` parameter in the Riva configuration file `config.sh`under the Riva quickstart root folder.\n",
"- If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `<RIVA_REPO_DIR>` is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, `<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt`.\n",
"\n",
"### How to modify the lexicon file\n",
"\n",
"First, locate and make a copy of the lexicon file. For example:\n",
"```\n",
"cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt\n",
"cp <RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt modified_lexicon.txt\n",
"```\n",
"\n",
"Next, modify it to add the sentencepiece tokenizations for the words of interest. For example, one could add:\n",
@@ -133,7 +156,7 @@
"```\n",
"which are 3 different pronunciations/tokenizations of the word `manu`. If the acoustic model predicts those tokens, they will be decoded as `manu`.\n",
"\n",
"Finally, once this is done, regenerate the model repository using that new decoding lexicon tokenization by passing `--decoding_lexicon=decoding_lexicon.txt` to `riva-build` instead of `--decoding_vocab=decoding_vocab.txt`.\n",
"Finally, once this is done, regenerate the model repository using that new decoding lexicon tokenization by passing `--decoding_lexicon=modified_lexicon.txt` to `riva-build` instead of `--decoding_vocab=decoding_vocab.txt`.\n",
"\n",
"### How to generate the correct tokenized form\n",
"\n",
@@ -143,7 +166,16 @@
"\n",
"- The tokens are valid tokens as determined by the tokenizer model (packaged with the Riva acoustic model).\n",
"\n",
"The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you’ll need the tokenizer model and the `sentencepiece` Python package (`pip install sentencepiece`). You can get the tokenizer model for the deployed pipeline from the model repository `ctc-decoder` directory for your model. It will be named `<hash>_tokenizer.model`.\n"
"The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you’ll need the tokenizer model and the `sentencepiece` Python package (`pip install sentencepiece`). You can get the tokenizer model for the deployed pipeline from the model repository `ctc-decoder-...` directory for your model. It will be named `<hash>_tokenizer.model`. For example:\n",
"\n",
"`<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model`\n",
"\n",
"When using a docker volume to store Riva assets (by default), you can copy the tokenizer model to the local directory with a command such as:\n",
"\n",
"```bash\n",
" # Copy the tokenizer model file from the docker volume to the current directory\n",
" docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model /dest\n",
"``` "
]
},
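To make the tokenization step concrete, here is a minimal `sentencepiece` sketch that prints candidate lexicon lines for a new word. The tokenizer file name reuses the example hash above; the word, the sampling parameters, and the tab-separated `word<TAB>tokens` output layout are assumptions to verify against the copied `lexicon.txt`:

```python
import sentencepiece as spm

# Load the tokenizer model copied out of the ctc-decoder-... directory.
sp = spm.SentencePieceProcessor()
sp.load("498056ba420d4bb3831ad557fba06032_tokenizer.model")

word = "manu"

# Best (deterministic) tokenization, plus sampled alternatives to cover
# several plausible tokenizations of the same word.
candidates = {tuple(sp.encode_as_pieces(word))}
for _ in range(20):
    candidates.add(tuple(sp.sample_encode_as_pieces(word, -1, 0.1)))

for pieces in sorted(candidates):
    print(word + "\t" + " ".join(pieces))
```

Each printed line is a candidate entry to append to the modified lexicon before rerunning `riva-build`.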
{
@@ -292,9 +324,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "venv-riva-tutorials",
"display_name": "Python 3",
"language": "python",
"name": "venv-riva-tutorials"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -306,7 +338,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
"version": "3.6.9"
}
},
"nbformat": 4,