This repository is based on NVIDIA's reference implementation of FastPitch, extracted from their DeepLearningExamples repository.
FastPitch learns to predict mel-scale spectrograms from input symbol sequences
(e.g. text or phones), with explicit duration and pitch prediction per symbol.
For example, you can use `prepare_dataset.py` to extract target features given a
list of audio files and corresponding forced alignments:
```
python prepare_dataset.py \
    --dataset-path $DATA_ROOT \
    --wav-text-filelists $FILELIST \
    --durations-from textgrid
```
`$DATA_ROOT` is the directory where all derived features will be stored, in
subdirectories `mels`, `pitches` and `durations`. A file listing global pitch
mean and standard deviation will also be written here, to
`$DATA_ROOT/pitches_stats__$FILELIST_STEM.json`.
You should make all audio files in `$FILELIST` accessible under
`$DATA_ROOT/wavs`, and corresponding forced alignments represented as Praat
TextGrids under `$DATA_ROOT/TextGrid`. `$FILELIST` should contain audio
filenames and transcripts in your desired symbol set, with a header row and
lines like:

```
audio|text[|speaker][|language]
/path/to/data_root/wavs/audio1.wav|this is a text transcript[|<speaker_id>][|<language_id>]
```

The expected path to the TextGrid file representing alignment information for
this utterance is `/path/to/data_root/TextGrid/audio1.TextGrid`.
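For example, with the default options the layout of `$DATA_ROOT` before and after feature extraction looks something like this (filenames here are illustrative):

```
$DATA_ROOT/
├── wavs/                   # input audio listed in $FILELIST
│   └── audio1.wav
├── TextGrid/               # forced alignments, one per audio file
│   └── audio1.TextGrid
├── mels/                   # written by prepare_dataset.py
├── pitches/                # written by prepare_dataset.py
├── durations/              # written by prepare_dataset.py
└── pitches_stats__$FILELIST_STEM.json
```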
Use the `--write-meta` option to write a metadata file
`$DATA_ROOT/$FILELIST_STEM.meta.txt` collecting paths to all extracted
features and preprocessed transcripts per utterance, which can be passed
directly to `train.py` for model training. Alternatively, you can put this
file together yourself, with a header row and lines like:

```
audio|duration|pitch|text[|speaker][|language]
mels/audio1.pt|durations/audio1.pt|pitches/audio1.pt|<transcript>[|<speaker_id>][|<language_id>]
```

Note that paths to feature files are relative to the provided `--dataset-path`,
which should also be passed to `train.py`. Optional `speaker` and `language`
fields can be used to train a multi-speaker or multi-lingual model, by passing
either speaker or language labels or paths to embeddings stored on disk.
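If you do assemble the metadata file yourself, a minimal sketch along these lines should work, assuming feature files were extracted with the default naming shown above (one `.pt` file per utterance, named after the audio file stem); the paths below are illustrative, so adjust them and the field handling to match your own filelist:

```python
from pathlib import Path

data_root = Path("/path/to/data_root")       # illustrative
filelist = data_root / "filelist.txt"        # audio|text[|speaker][|language]
meta_out = data_root / "filelist.meta.txt"

with open(filelist) as fin, open(meta_out, "w") as fout:
    header = fin.readline().strip().split("|")
    extra = header[2:]  # optional speaker/language columns, if present
    fout.write("|".join(["audio", "duration", "pitch", "text"] + extra) + "\n")
    for line in fin:
        fields = line.rstrip("\n").split("|")
        stem = Path(fields[0]).stem
        row = [f"mels/{stem}.pt", f"durations/{stem}.pt", f"pitches/{stem}.pt", fields[1]]
        fout.write("|".join(row + fields[2:]) + "\n")
```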
Use the `--input-type` and `--symbol-set` options to `prepare_dataset.py` to
specify the input symbols you are using. In the list below, top-level bullets
are possible values for `--input-type` and second-level for `--symbol-set`:
- `char` (raw text)
  - `english_basic`
  - `english_basic_lowercase`
  - `english_expanded`
  - `english_basic_sil`
- `phone` or `pf` (phonological feature vectors)
  - `arpabet`
  - `combilex`
  - `unisyn`
  - `globalphone` (French subset)
  - `ipa`
  - `ipa_all`
  - `xsampa`
  - `english_basic_sil`
- `unit` (integer symbol IDs)
  - Size of symbol vocabulary
See `common/text/symbols.py` for definitions of symbol sets corresponding to
each combination of options above.
Additional notes:
- `char` input can just look like regular English text with spaces between words, normalized to whatever degree makes sense given the range of characters in your chosen `--symbol-set`
  - Check `common/text/cleaners.py` and the `--text-cleaners` option for built-in text normalization options
  - You might want to insert additional space characters at the beginning and end of your text transcripts to represent any leading or trailing silence in the audio (especially if using monotonic alignment search for duration targets). Pass `add_spaces=True` when setting up your `TextProcessor` to do this automatically (see the sketch after this list).
- `phone`, `pf` and `unit` inputs should probably be individual symbols separated by spaces, e.g. an `xsampa` rendering of the phrase 'the cat' would have a transcript like `D @ k a t`
  - Some support is possible for phone transcripts which may be more human-readable when using `ipa{,_all}`, thanks to the way PanPhon handles such strings, e.g. transcripts like `ðə kæt`
- If you want to use phonological feature vectors as input (`--input-type pf`), transcripts should be phone strings, with symbols specified by `--symbol-set` as above. We convert each symbol to PF vectors using PanPhon.
- `unit` is intended for use with integer symbol IDs, e.g. k-means acoustic cluster IDs extracted from raw audio using a self-supervised model such as HuBERT
  - Pass the size of the symbol vocabulary in `--symbol-set`; if this is e.g. 100, we expect transcripts to be integer sequences like `79 3 14 14 25`, where possible individual symbol values are in the range [0, 99]
- `english_basic_sil` is a special case: it is intended for use with character-level alignments, where the pronunciation of each word is represented by the characters in that word separated by spaces (e.g. 'cat' → `c a t`), and also includes symbols for silence and short pauses between words. This symbol set can be used in two ways:
  - With `char` inputs, pass `handle_sil=True` to your `TextProcessor` to treat silence phones as single tokens while still splitting other words into character sequences.
  - With `phone` inputs, where your input texts should be preprocessed to insert spaces between every character.
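As a rough illustration of the `add_spaces` and `handle_sil` options mentioned above, setting up a `TextProcessor` might look like the sketch below. Treat it only as a sketch: the import path, constructor arguments and encoding method shown here are assumptions, so check the modules under `common/text/` for the actual interface.

```python
# Sketch only: the import path, constructor signature and method name below are
# assumptions; see common/text/ in this repository for the real interface.
from common.text import TextProcessor  # hypothetical import location

tp = TextProcessor(
    symbol_set="english_basic_sil",      # as for --symbol-set
    cleaner_names=["english_cleaners"],  # as for --text-cleaners (hypothetical name)
    add_spaces=True,    # pad transcripts with leading/trailing spaces for silence
    handle_sil=True,    # keep silence symbols as single tokens with char input
)
symbol_ids = tp.encode_text("this is a text transcript")  # hypothetical method name
```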
Mel spectrogram feature extraction is defined by several parameters passed to
`prepare_dataset.py`:

- `--sampling-rate`, the sampling rate of your audio data (default 22050 Hz)
- `--filter-length`, the number of STFT frequency bins used (default 512)
- `--hop-length`, frame shift of the STFT analysis window (default 256 samples)
- `--win-length`, length of the signal attenuation window (default 512)
- `--n-mel-channels`, number of bins in the mel-scale filter bank (default 80)
- `--mel-f{min,max}`, minimum and maximum frequencies of mel filter bank bins (defaults 0, 8000 Hz)
These parameters define the frequency and time resolution of acoustic feature
extraction. Each frequency bin in the STFT analysis window spans
`sampling-rate / filter-length` Hz, and each analysis window covers
`filter-length / sampling-rate` seconds of audio. For the default values specified above, this
gives 22050 / 512 = 43 Hz frequency resolution and 512 / 22050 = 23 ms
analysis windows. The frame shift moves the analysis window
`hop-length / sampling-rate` seconds forward each frame, for a stride of 256 / 22050 = 11 ms
(~50% overlap between adjacent frames). For efficiency, `filter-length` should
be some power of 2, and in general `win-length` should match `filter-length`.
See this document for additional discussion of STFT parameters.
If you want to use HuBERT unit sequences as input, then you will need to match
the 50 Hz frame rate of that feature extraction process by adjusting the
parameters listed above. If there is a mismatch, then the lengths of your mel
spectrograms will not line up with the durations calculated by run-length
encoding framewise HuBERT codes. For example, if your audio data is sampled at
16 kHz, use `--hop-length 320` for a 20 ms frame shift, i.e. a 50 Hz frame rate.
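These relationships are easy to check for your own settings; the short sketch below simply reproduces the arithmetic described above for the default parameters and for a HuBERT-matched 16 kHz setup (the 16 kHz `filter_length` of 512 is just an illustrative choice):

```python
def stft_summary(sampling_rate, filter_length, hop_length):
    """Frequency/time resolution implied by the STFT analysis parameters."""
    return {
        "freq_resolution_hz": sampling_rate / filter_length,  # width of each STFT bin
        "window_ms": 1000 * filter_length / sampling_rate,    # analysis window duration
        "frame_shift_ms": 1000 * hop_length / sampling_rate,  # stride between frames
        "frame_rate_hz": sampling_rate / hop_length,          # frames per second
    }

# Defaults: ~43 Hz bins, ~23 ms windows, ~11.6 ms frame shift (~86 frames/s)
print(stft_summary(22050, 512, 256))
# 16 kHz audio with --hop-length 320: 20 ms frame shift, i.e. a 50 Hz frame rate
print(stft_summary(16000, 512, 320))
```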
Additional options are available for applying peak normalization to audio data (for example if it was collected across multiple recording sessions) or for trimming excessive leading or trailing silences found during forced alignment.
We support three methods for providing duration targets during training:
- Reading segment durations from forced alignments provided in Praat TextGrid format
- Run-length encoding frame-level input symbol sequences
- Monotonic alignment search without explicit targets
Forced alignment:
Given audio files, transcripts and a pronunciation lexicon (mapping words to
phone strings or just character sequences, depending on your desired input), you
could generate the required TextGrid alignment files
per utterance using a tool such as the Montreal Forced Aligner
or our own KISS Aligner.
Then, pass `--durations-from textgrid` to `prepare_dataset.py` to extract
durations per input symbol and save them to disk.
This approach is similar to that in the
original FastPitch paper.
Run-length encoding:
Alternatively, we can extract durations by run-length encoding input symbol
sequences specified at the frame level. This is the expected method for
extracting symbol-level duration targets from HuBERT code sequences when
using `--input-type unit`, for example. Pass `--durations-from unit_rle` to use
this method.
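For illustration, run-length encoding a framewise unit sequence amounts to the following; this is a standalone sketch, not the repository's implementation:

```python
from itertools import groupby

def run_length_encode(frame_units):
    """Collapse a framewise symbol sequence into (symbol, duration) pairs."""
    return [(unit, sum(1 for _ in group)) for unit, group in groupby(frame_units)]

# e.g. framewise HuBERT cluster IDs, one per 20 ms frame
frames = [79, 79, 79, 3, 3, 14, 14, 14, 14, 25]
symbols, durations = zip(*run_length_encode(frames))
print(symbols)    # (79, 3, 14, 25)  -> unit transcript "79 3 14 25"
print(durations)  # (3, 2, 4, 1)     -> per-symbol duration targets, in frames
```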
Monotonic alignment search:
Instead of providing explicit duration targets, we can also use monotonic
alignment search (MAS) to learn the correspondence between input symbols and
acoustic features during training. This approach follows the implementation from
FastPitch 1.1 in the source repo, as described in this paper.
Pass `--durations-from attn_prior` to `prepare_dataset.py` and `--use-mas` to
`train.py` to use this method. In this case, diagonal attention priors are saved
to disk in the `durations` data directory for each utterance.
We default to the YIN algorithm for fundamental frequency estimation. Framewise estimates are averaged per input symbol for easier interpretation and more stable performance.
We also have an option to use the more accurate probabilistic YIN,
but this algorithm runs considerably slower. If you know the expected pitch range
in your data, then you can narrow the corresponding hypothesis space for pYIN by
adjusting `--pitch-f{min,max}` (defaults 40-600 Hz) to speed things up a bit.
Framewise pitch values are averaged per input symbol to provide pitch targets during training. This is done during data preprocessing when extracting target durations from TextGrid alignments or by run-length encoding input symbol sequences, and the symbol-level pitch values are saved to disk in this case. When training with MAS, we save frame-level pitch values to disk and average them online during training according to the discovered alignments.
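Conceptually, the per-symbol averaging works like the sketch below (a standalone illustration, not the code used in this repository), where framewise F0 estimates are grouped by per-symbol durations; unvoiced frames, marked by an F0 of 0, are assumed here to be excluded from each average.

```python
import numpy as np

def average_pitch_per_symbol(frame_f0, durations):
    """Average framewise F0 over the frames assigned to each input symbol.

    frame_f0:  framewise pitch estimates, with 0 marking unvoiced frames
    durations: frames per input symbol, summing to len(frame_f0)
    """
    averaged, start = [], 0
    for dur in durations:
        segment = frame_f0[start:start + dur]
        voiced = segment[segment > 0]                 # drop unvoiced frames
        averaged.append(voiced.mean() if voiced.size else 0.0)
        start += dur
    return np.array(averaged)

f0 = np.array([0.0, 0.0, 110.0, 115.0, 120.0, 0.0, 95.0, 98.0])
print(average_pitch_per_symbol(f0, durations=[2, 3, 3]))  # [  0.  115.   96.5]
```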
To train a phone-based system from IPA transcripts, after preparing a dataset and splitting into train and validation sets:
```
python train.py \
    --dataset-path $DATA_ROOT \
    --output $CHECKPOINT_DIR \
    --training-files $DATA_ROOT/train_meta.txt \
    --validation-files $DATA_ROOT/val_meta.txt \
    --pitch-mean-std-file $DATA_ROOT/pitches_stats__${FILELIST_STEM}.json \
    --input-type phone \
    --symbol-set ipa \
    --epochs 100 \
    --epochs-per-checkpoint 10 \
    --batch-size 16 \
    --cuda
```
Model checkpoints will be saved to `$CHECKPOINT_DIR` every 10 epochs, alongside
TensorBoard logs. Make sure to pass `--cuda` to run on a GPU if available.
Two options are available for multi-speaker/multi-lingual data:
- For joint training of speaker/language embeddings, include `speaker`/`language` fields in your input metadata file and pass `--{speaker,lang}-ids ${SPEAKER,LANG}_IDS.txt` files with lines like `<speaker/lang_id> <int>` to set embedding indices (an example file is shown after this list).
- To use pre-trained embeddings stored on disk, include paths to `.pt` files in your metadata file and optionally pass `--{speaker,lang}-emb-dim $N_DIM` if they do not match the symbol embedding dimensionality in your FastPitch model (default `--symbols-embedding-dim 384`).
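For reference, a speaker (or language) IDs file is just a plain-text mapping from the labels used in your metadata file to integer embedding indices, one per line; the labels below are illustrative:

```
speaker_a 0
speaker_b 1
speaker_c 2
```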
Note that by default both speaker and language embeddings are added to input
symbol embeddings before passing through the encoder transformer stack. To
change the position of either embedding, pass `--{speaker,lang}-cond [pre|post]`
to add it before (`pre`) or after (`post`) the encoder stack. Passing multiple
values will add the embedding at both positions, and an empty value will disable
conditioning.
To train using monotonic alignment search instead of passing explicit input
symbol duration targets, pass `--use-mas`.
To predict mel spectrogram values using trivariate-chain GMMs,
pass `--tvcgmm_k $K` to set the number of GMM components. Individual
time/frequency bins will then be sampled from learned GMMs, mitigating
over-smoothing in spectrogram prediction and the resulting vocoder artefacts.
Depending on how much variance there is in your data, you might start with values
between `K=1` and `K=5` (the default value `0` disables TVC-GMM).
To reduce model size with (probably) limited impact on performance, pass
`--use-sepconv` to replace all convolutional layers with depthwise separable
convolutions. Additional options are available for automatic mixed precision
(AMP) and gradient accumulation. We also support data-parallel distributed
training, at least on a single node: just set `CUDA_VISIBLE_DEVICES` and point
to a free port on your machine using `--master-{addr,port}`.
After training has completed, we can predict mel spectrograms from test transcripts, and optionally generate speech audio using a separate vocoder model.
First, prepare an input metadata file with pipe-separated values and a header row indicating whichever of the following fields you care to provide (so the order of fields does not matter; an example file is shown after this list):
- `text`, transcripts of test utterances
- `output`, path to save synthesized speech audio
- `mel_output`, path to save predicted mel spectrogram features
- `speaker`, speaker ID per utterance
- `language`, language ID per utterance
- `mel`, path to load mel spectrogram features, e.g. reference values for copy synthesis
- `pitch`, path to load reference pitch values
- `duration`, path to load reference duration values
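For example, a minimal `test_meta.txt` for the IPA phone setup used in the command below might look like the following (transcripts, filenames and speaker labels are illustrative, and the `speaker` field only applies to multi-speaker models):

```
text|output|speaker
ðə kæt sæt ɒn ðə mæt|cat1.wav|speaker_a
ðɪs ɪz ə tɛst|test1.wav|speaker_b
```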
To synthesize speech from IPA phone transcripts using the final checkpoint from
our `train.py` run above and a pre-trained HiFi-GAN vocoder:
```
python inference.py \
    --input $DATA_ROOT/test_meta.txt \
    --output $OUTPUT_DIR \
    --fastpitch $CHECKPOINT_DIR/FastPitch_checkpoint_100.pt \
    --input-type phone \
    --symbol-set ipa \
    --cuda \
    --hifigan $PATH_TO_HIFIGAN_CHECKPOINT \
    --hifigan-config hifigan/config/config_v1.json  # for 22.05 kHz audio
```
If `test_meta.txt` includes an `output` field, synthesized speech will be saved
to the corresponding WAV files under `$OUTPUT_DIR`. Otherwise, audio files will be
saved to sequential files named like `$OUTPUT_DIR/audio_1.wav`.
Use the `--sampling-rate` option to ensure output audio files are written at the
correct sampling rate. Predicted mel spectrograms are trimmed to remove noise
(introduced by batching multiple utterances to a fixed length), with target
durations calculated using `--stft-hop-length`, which should match the value of
`--hop-length` passed to `prepare_dataset.py`.
For a multi-speaker FastPitch model, pass a `--speaker-ids` file matching that
used with `train.py` and include a `speaker` field in `test_meta.txt` to
synthesize utterances with the appropriate voices. Alternatively, pass
`--speaker <speaker_id>` to override this and synthesize all utterances using a
single speaker's voice. The same also applies to the `--lang-ids` and
`--language` flags for multi-lingual models.
If your model checkpoint uses depthwise separable convolutional layers, then
also pass `--use-sepconv` to `inference.py`. Likewise, pass `--use-mas` if the
model was trained with monotonic alignment search, or `--tvcgmm_k $K` if it uses
TVC-GMM, so that the model architecture matches the checkpoint.
It should be possible to run any HiFi-GAN model checkpoint trained from the original repo, given an appropriate config file. We also have a sample config for 16 kHz audio with a 50 Hz frame rate, again to match audio preprocessing when working with HuBERT models.
To run copy synthesis through your chosen vocoder, load reference mel
spectrograms, pitch and duration values by adding the corresponding fields to
`test_meta.txt`, pointing to `.pt` files on disk for each utterance, for example
as extracted using `prepare_dataset.py`. Paths can be relative if you also pass
e.g. `--dataset-path $DATA_ROOT` to `inference.py`.
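For example, a copy-synthesis metadata file reusing features extracted by `prepare_dataset.py` might look like this (paths here are illustrative, relative to `--dataset-path`):

```
mel|pitch|duration|output
mels/audio1.pt|pitches/audio1.pt|durations/audio1.pt|audio1_copy.wav
```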
A similar method can be used to generate time-aligned synthetic data for vocoder
fine-tuning, i.e. predicting ground-truth audio from errorful mel spectrograms
predicted by a FastPitch model. In that case, you should pass reference
durations and pitch contours during synthesis to limit variation from reference
audio. Use `--save-mels` to save predicted mel spectrograms to disk, either to
the filepaths specified in the `{mel_,}output` field of `test_meta.txt`, or else
to sequential files like `$OUTPUT_DIR/mel_1.pt` if not specified.
There are several options for manipulating predicted audio, for example adjusting the pace of output speech or transforming the predicted pitch contours. See the original FastPitch v1.0 README for examples.