tts: add speaker file support #12048
@edwko Could you please have a look at this PR?
@ngxson @dm4 Looks good! Just a couple of thoughts: this would handle only v0.2, so it might make sense to do this more dynamically, maybe by adding versioning logic similar to PR #11287. Maybe get version from
// Something like this:
double get_speaker_version(json speaker) {
    if (speaker.contains("version")) {
        return speaker["version"].get<double>();
    }
    // Also could get version from model itself
    // if (common_get_builtin_chat_template(model) == "outetts-0.3") {
    //     return 0.3;
    // }
    return 0.2;
}
static std::string audio_text_from_speaker(json speaker) {
    std::string audio_text = "<|text_start|>";
    double version = get_speaker_version(speaker);
    if (version <= 0.3) {
        std::string separator = (version == 0.3) ? "<|space|>" : "<|text_sep|>";
        for (const auto &word : speaker["words"])
            audio_text += word["word"].get<std::string>() + separator;
    }
    else if (version > 0.3) {
        // Future version support could be added here
    }
    return audio_text;
}
// static std::string audio_data_from_speaker(json speaker) would also need some adjustments to support different versions.
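For anyone who wants to try the separator logic without pulling in a JSON library, here is a minimal standalone sketch of the same branching. The `speaker_sketch` struct and `audio_text_from_words` are hypothetical names used only for illustration; they are not part of tts.cpp, and the 0.2 default simply mirrors the fallback in the suggestion above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the parsed speaker file: only the fields the
// suggestion reads (the version and the "word" entries) are modeled here.
struct speaker_sketch {
    double version = 0.2;              // assumed fallback when "version" is absent
    std::vector<std::string> words;    // the "word" value of each entry in "words"
};

// Same branching as the suggestion: "<|space|>" for 0.3, "<|text_sep|>" for
// anything lower. Versions above 0.3 are left unhandled, as in the original.
static std::string audio_text_from_words(const speaker_sketch & spk) {
    std::string audio_text = "<|text_start|>";
    if (spk.version <= 0.3) {
        const std::string separator = (spk.version == 0.3) ? "<|space|>" : "<|text_sep|>";
        for (const auto & word : spk.words) {
            audio_text += word + separator;
        }
    }
    return audio_text;
}
```

Note that every word, including the last one, is followed by the separator, which matches what the loop in the suggestion does.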
Signed-off-by: dm4 <sunrisedm4@gmail.com>
Hello @ngxson and @edwko, I have already added support for version 0.3. Since
@ggerganov merge it please
Can you provide example commands for both v0.2 and v0.3 so I can run some tests?
Example commands for v0.2 and v0.3 are identical:
--tts-use-guide-tokens is optional and sometimes gives better results for v0.2.
For prompts longer than 10 words, it can hit this assert and stop generation (tested only on CPU; not related to this PR, as the same assert error is present on all previous builds):
Line 8470 in 14dec0c
Removing this assert allows for longer prompt generation.
The assert triggers because the audio codes exceed the size of the microbatch (default 512). The vocoder uses non-causal attention, so it requires that all codes fit in a single microbatch. The workaround is to add -ub 4096 to your arguments, and the proper solution is to create the WavTokenizer's context with n_ubatch equal to the n_ctx_train() of the model.
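To make the sizing rule concrete, here is a rough standalone sketch. `ctx_params_sketch`, `codes_fit_in_ubatch`, and `vocoder_params_for` are hypothetical stand-ins, not the llama.cpp API. The only things taken from the comment above are the default microbatch of 512 and the idea of setting n_ubatch to the model's training context size. The 2048 default for n_batch and the step that raises n_batch alongside n_ubatch are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical mirror of the context parameters that matter here.
struct ctx_params_sketch {
    std::uint32_t n_batch  = 2048; // assumed logical batch size
    std::uint32_t n_ubatch = 512;  // default microbatch size mentioned above
};

// With non-causal attention, all audio codes must fit in one microbatch.
static bool codes_fit_in_ubatch(std::uint32_t n_codes, const ctx_params_sketch & p) {
    return n_codes <= p.n_ubatch;
}

// Size the vocoder context so n_ubatch equals the training context size.
// Raising n_batch as well is an assumption: a microbatch is taken to be
// no larger than the logical batch.
static ctx_params_sketch vocoder_params_for(std::uint32_t n_ctx_train, ctx_params_sketch p = {}) {
    p.n_ubatch = n_ctx_train;
    p.n_batch  = std::max(p.n_batch, n_ctx_train);
    return p;
}
```

With the defaults, 600 codes would not fit in a single microbatch; after `vocoder_params_for(4096)` they would, which is the same effect as passing -ub 4096 by hand.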
Awesome! With OuteTTS v0.3 it even generates all punctuation correctly! To be honest, this is already quite good quality for such a small model. Perhaps it is worth updating examples/tts/README to add the -ub 4096 argument, as it is necessary for correct generation. I would like to see more PRs merged in this area; for example, the #11070 server example works really well without unloading models from memory. It is possible to develop this example further, as it is of great value to many, especially considering that OuteTTS 0.3-500M is allowed for commercial use.
Feel free to contribute improvements. I think the Also, I think figuring out streaming first is crucial before making major changes and additions.
* tts: add speaker file support
Signed-off-by: dm4 <sunrisedm4@gmail.com>
* tts: handle outetts-0.3
* tts : add new line in error message
---------
Signed-off-by: dm4 <sunrisedm4@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Adds --tts-speaker-file to specify the speaker file path.
Updates tts.cpp to load and parse speaker data, enhancing audio generation capabilities.