
tts: add speaker file support #12048

Merged (3 commits) Mar 3, 2025
Conversation

dm4 (Contributor) commented Feb 24, 2025

  • Added support for TTS speaker files, including a new command-line option --tts-speaker-file to specify the file path.
  • Implemented JSON handling in tts.cpp to load and parse speaker data, enhancing audio generation capabilities.

@dm4 force-pushed the dm4/tts-speaker-file branch from ea8711d to bf3f5ee on February 24, 2025 at 11:02
ngxson (Collaborator) commented Feb 26, 2025

@edwko Could you please have a look at this PR?

edwko commented Feb 27, 2025

@ngxson @dm4 Looks good! Just a couple of thoughts: this would handle only v0.2, so it might make sense to do this more dynamically, perhaps by adding versioning logic similar to PR #11287.

Maybe get the version from common_get_builtin_chat_template, or I could add more metadata to the speaker files (like a version field) to construct the prompt based on the specific version.

Something like this:

```cpp
double get_speaker_version(const json & speaker) {
    if (speaker.contains("version")) {
        return speaker["version"].get<double>();
    }
    // Could also get the version from the model itself:
    // if (common_get_builtin_chat_template(model) == "outetts-0.3") {
    //     return 0.3;
    // }
    return 0.2;
}

static std::string audio_text_from_speaker(const json & speaker) {
    std::string audio_text = "<|text_start|>";
    double version = get_speaker_version(speaker);

    if (version <= 0.3) {
        std::string separator = (version == 0.3) ? "<|space|>" : "<|text_sep|>";
        for (const auto & word : speaker["words"]) {
            audio_text += word["word"].get<std::string>() + separator;
        }
    } else {
        // Future version support could be added here
    }

    return audio_text;
}
```

`static std::string audio_data_from_speaker(json speaker)` would also need some adjustments to support different versions.
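For reference, a minimal speaker file shaped the way the sketch above reads it might look like the following (the real OuteTTS speaker files carry more fields; only `version`, `words`, and `word` are assumed here, taken from the code):

```json
{
  "version": 0.3,
  "words": [
    { "word": "hello" },
    { "word": "world" }
  ]
}
```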

@dm4 force-pushed the dm4/tts-speaker-file branch 2 times, most recently from 57c3835 to 888f57e on March 1, 2025 at 12:40
dm4 (Contributor, Author) commented Mar 1, 2025

Hello @ngxson and @edwko, I have already added support for version 0.3. Since common_get_builtin_chat_template() was removed in this commit, I have switched to using llama_model_chat_template() to obtain the model's tokenizer.chat_template metadata.

@dm4 force-pushed the dm4/tts-speaker-file branch from 888f57e to 986ade7 on March 1, 2025 at 13:19
@ngxson ngxson requested a review from ggerganov March 1, 2025 13:25
Koalamana9 commented
@ggerganov merge it please

ggerganov (Member) commented
Can you provide example commands for both v0.2 and v0.3 so I can run some tests?

Koalamana9 commented Mar 2, 2025

Example commands for v0.2 and v0.3 are identical:

```shell
llama-tts -m OuteTTS-v2-or-v3 -mv Wavtokenizer -c 4096 --tts-use-guide-tokens --tts-speaker-file en_female_1.json -p "Hello world"
```

Speaker files are from here: https://github.com/edwko/OuteTTS/tree/main/outetts/version/v1/default_speakers
OuteTTS v0.3: https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/tree/main
OuteTTS v0.2: https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/tree/main
Wavtokenizer: https://huggingface.co/novateur/WavTokenizer-large-speech-75token/tree/main (must be converted to gguf)

--tts-use-guide-tokens is optional; it sometimes gives better results for v0.2.

For prompts longer than about 10 words, generation can hit this assert and stop (tested only on CPU; not related to this PR, as the same assert fires on all previous builds):

```cpp
GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens");
```

Removing this assert allows longer prompts to generate.

ggerganov (Member) left a comment

The assert triggers because the audio codes exceed the size of the microbatch (default 512). The vocoder uses non-causal attention, so it requires that all codes fit in a single microbatch. The workaround is to add -ub 4096 to your arguments and the proper solution is to create the WavTokenizer's context with n_ubatch equal to the n_ctx_train() of the model.

@ggerganov ggerganov merged commit c43af92 into ggml-org:master Mar 3, 2025
5 checks passed
Koalamana9 commented
Awesome! With OuteTTS v0.3 it even generates all punctuation correctly! Honestly, this is already quite good quality for such a small model. Perhaps it is worth updating examples/tts/README and adding the -ub 4096 argument, as it is necessary for correct generation. I would like to see more PRs merged on top of this; for example, the server example in #11070 works really well without unloading models from memory. This example is worth developing further, as it is of great value to many, especially considering that OuteTTS 0.3-500M is allowed for commercial use.

Although Kokoro TTS is currently considered the highest-quality text-to-speech model at the smallest size, it is worth noting that it is a heavily distilled model that cannot emulate voices beyond simply blending existing ones. OuteTTS, meanwhile, offers a novel approach to speech synthesis using a plain LLM, where anyone can clone a voice in just a few seconds by passing a simple JSON file.

ggerganov (Member) commented
Feel free to contribute improvements. I think the tts example is in a very hacky state and can be improved in many ways. Ideally, it should become a more general purpose TTS example that would support more TTS models. But we first need the infra for that to be added to libllama, which I am working on atm.

Also, I think figuring out streaming first is crucial before making major changes and additions.

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
* tts: add speaker file support

Signed-off-by: dm4 <sunrisedm4@gmail.com>

* tts: handle outetts-0.3

* tts : add new line in error message

---------

Signed-off-by: dm4 <sunrisedm4@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>