
tts: add speaker file support #12048

Merged (3 commits) Mar 3, 2025
Conversation

dm4 (Contributor) commented Feb 24, 2025

  • Added support for TTS speaker files, including a new command-line option --tts-speaker-file to specify the file path.
  • Implemented JSON handling in tts.cpp to load and parse speaker data, enhancing audio generation capabilities.

@dm4 force-pushed the dm4/tts-speaker-file branch from ea8711d to bf3f5ee on February 24, 2025 at 11:02
ngxson (Collaborator) commented Feb 26, 2025

@edwko Could you please have a look at this PR?

edwko commented Feb 27, 2025

@ngxson @dm4 Looks good! Just a couple of thoughts: this would handle only v0.2, so it might make sense to do this more dynamically, perhaps by adding versioning logic similar to PR #11287.

Maybe get the version from common_get_builtin_chat_template, or I could add more metadata to the speaker files (like a version field) to construct the prompt based on the specific version.

Something like this:

```cpp
double get_speaker_version(const json & speaker) {
    if (speaker.contains("version")) {
        return speaker["version"].get<double>();
    }
    // Could also get the version from the model itself:
    // if (common_get_builtin_chat_template(model) == "outetts-0.3") {
    //     return 0.3;
    // }
    return 0.2;
}

static std::string audio_text_from_speaker(const json & speaker) {
    std::string audio_text = "<|text_start|>";
    double version = get_speaker_version(speaker);

    if (version <= 0.3) {
        std::string separator = (version == 0.3) ? "<|space|>" : "<|text_sep|>";
        for (const auto & word : speaker["words"]) {
            audio_text += word["word"].get<std::string>() + separator;
        }
    } else {
        // Future version support could be added here
    }

    return audio_text;
}
```

`static std::string audio_data_from_speaker(json speaker)` would also need some adjustments to support different versions.
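For reference, a minimal speaker file shaped the way the sketch above reads it might look like the following (the real OuteTTS speaker files carry more fields; only `version`, `words`, and `word` are assumed here, taken from the code):

```json
{
  "version": 0.3,
  "words": [
    { "word": "hello" },
    { "word": "world" }
  ]
}
```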

@dm4 force-pushed the dm4/tts-speaker-file branch 2 times, most recently from 57c3835 to 888f57e on March 1, 2025 at 12:40
dm4 (Contributor, Author) commented Mar 1, 2025

Hello @ngxson and @edwko, I have already added support for version 0.3. Since common_get_builtin_chat_template() was removed in this commit, I have switched to using llama_model_chat_template() to obtain the model's tokenizer.chat_template metadata.

@dm4 force-pushed the dm4/tts-speaker-file branch from 888f57e to 986ade7 on March 1, 2025 at 13:19
@ngxson ngxson requested a review from ggerganov March 1, 2025 13:25
Koalamana9 commented
@ggerganov merge it please

ggerganov (Member) commented
Can you provide example commands for both v0.2 and v0.3 so I can run some tests?

Koalamana9 commented Mar 2, 2025

Example commands for v0.2 and v0.3 are identical:

```shell
llama-tts -m OuteTTS-v2-or-v3 -mv Wavtokenizer -c 4096 --tts-use-guide-tokens --tts-speaker-file en_female_1.json -p "Hello world"
```

Speaker files are from here: https://github.com/edwko/OuteTTS/tree/main/outetts/version/v1/default_speakers
OuteTTS v0.3: https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/tree/main
OuteTTS v0.2: https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/tree/main
Wavtokenizer: https://huggingface.co/novateur/WavTokenizer-large-speech-75token/tree/main (must be converted to gguf)

--tts-use-guide-tokens is optional; it sometimes gives better results for v0.2.

For prompts longer than about 10 words, generation can hit this assert and stop (tested only on CPU; not related to this PR, as the same assert fires on all previous builds):

```cpp
GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens");
```

Removing this assert allows longer prompts to generate.

ggerganov (Member) left a comment

The assert triggers because the audio codes exceed the size of the microbatch (default 512). The vocoder uses non-causal attention, so it requires that all codes fit in a single microbatch. The workaround is to add -ub 4096 to your arguments and the proper solution is to create the WavTokenizer's context with n_ubatch equal to the n_ctx_train() of the model.

@ggerganov ggerganov merged commit c43af92 into ggml-org:master Mar 3, 2025
5 checks passed
Koalamana9 commented
Awesome! With OuteTTS v0.3 it even generates all punctuation correctly! Honestly, this is already quite good quality for such a small model. Perhaps it is worth updating examples/tts/README and adding the -ub 4096 argument, as it is necessary for correct generation. I would like to see more PRs merged on top of this; for example, the server example in #11070 works really well without unloading models from memory. This example is worth developing further, as it is of great value to many, especially considering that OuteTTS 0.3-500M is allowed for commercial use.

Although Kokoro TTS is currently considered the highest-quality text-to-speech model at the smallest size, it is worth noting that it is a heavily distilled model that cannot emulate voices beyond simply blending existing ones. OuteTTS, meanwhile, offers a novel approach to speech synthesis using a plain LLM, where anyone can clone a voice in just a few seconds by passing a simple JSON file.

ggerganov (Member) commented
Feel free to contribute improvements. I think the tts example is in a very hacky state and can be improved in many ways. Ideally, it should become a more general purpose TTS example that would support more TTS models. But we first need the infra for that to be added to libllama, which I am working on atm.

Also, I think figuring out streaming first is crucial before making major changes and additions.

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
* tts: add speaker file support

Signed-off-by: dm4 <sunrisedm4@gmail.com>

* tts: handle outetts-0.3

* tts : add new line in error message

---------

Signed-off-by: dm4 <sunrisedm4@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>