
added rudimentary support for outetts v0.3 500m and 1b models #11287

Open
wants to merge 2 commits into base: master
Conversation

LostRuins
Collaborator

Hi @ggerganov @edwko

This PR adds rudimentary support for the newly released OuteTTS v0.3 500m and 1b models, found at https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF and https://huggingface.co/OuteAI/OuteTTS-0.3-1B-GGUF

This will allow loading and generating with the new models, although crucially it ignores the new punctuation tokens. I had previously added them in my own fork, but they come with a lot of edge cases that may not be so easy to untangle, since they are grouped with other tokens and there are degenerate cases (e.g. www..!...google....com??) that will cause problems if they are simply swapped in as is.

The model types are differentiated by attempting to tokenize <|space|>, which is a single token in v0.3 but not in earlier versions. For the 1B model, the token <|0|> has a different offset, so it is now determined dynamically. The existing speaker voice is retained, but I swapped out your hardcoded token array for a runtime tokenization for the same reasons (and also to adapt to the v0.3 format).

It remains compatible with v0.2 and should be able to load all three model types.

It is actually ready to merge as-is, but feel free to make whatever changes you deem necessary. Cheers!
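The detection trick described above (tokenizing <|space|> and checking whether it comes back as a single token) can be sketched as follows. This is a minimal illustration with a toy tokenizer callable, not llama.cpp's actual C++ API, and the token ids are made up:

```python
# Minimal sketch of the version-detection idea, assuming a tokenizer callable
# that maps text to a list of token ids with special tokens parsed.
# In v0.3 vocabularies "<|space|>" is a single special token; in earlier
# versions it is split into several ordinary BPE pieces.
def is_outetts_v03(tokenize) -> bool:
    return len(tokenize("<|space|>")) == 1

# Toy tokenizers standing in for the two vocabularies (ids are invented):
v03_tok = lambda s: [151670] if s == "<|space|>" else [0]
v02_tok = lambda s: [27, 91, 8746, 91, 29]  # "<|space|>" splits into pieces

print(is_outetts_v03(v03_tok))  # True
print(is_outetts_v03(v02_tok))  # False
```

The same dynamic-tokenization approach covers the 1B model's shifted <|0|> offset: tokenize the literal token text once at load time instead of hardcoding an id.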

@LostRuins LostRuins requested a review from ggerganov January 18, 2025 10:58
@edwko

edwko commented Jan 18, 2025

Yeah, that’s why in the library I grouped them before and after words. It might not be the best solution, but it works:

Input: www..!...google....com??

Converts to:

<|im_start|>
<|text_start|>www<|period|><|period|><|exclamation_mark|><|period|><|period|><|period|><|space|>google<|period|><|period|><|period|><|period|><|space|>com<|question_mark|><|question_mark|><|text_end|>
<|audio_start|>

@LostRuins
Collaborator Author

LostRuins commented Jan 18, 2025

Yeah, but even if the TTC part works, I think the CTS part might fail. I can definitely do that if you think it's better.

@@ -371,7 +371,7 @@ static std::string replace_numbers_with_words(const std::string & input_text) {
}

// Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
static std::string process_text(const std::string & text) {
static std::string process_text(const std::string & text, bool is_version_0_3) {
Collaborator


btw to check if the version is 0.3, you can use:

bool is_version_0_3 = common_get_builtin_chat_template(model) == "outetts-0.3"

@edwko I planned to add this as a dedicated GGUF meta key, but it turns out I still haven't had the time to implement it. I'll try to do this next week! And btw congrats on the release of v0.3 😄

@LostRuins
Collaborator Author

@edwko how is this case currently handled for you:

google .. . . com

I had issues when encountering fragments with only spaces and punctuation but no readable text; the narration breaks down once that is encountered.

@edwko

edwko commented Jan 19, 2025

@LostRuins All punctuation is merged to the closest word in cases like this: google .. . . com

<|im_start|>
<|text_start|>google<|period|><|period|><|period|><|period|><|space|>com<|text_end|>
<|audio_start|>

Speech generation works fine if you follow this format. I just tested both google .. . . com and www..!...google....com??, and everything was generated correctly.

@LostRuins
Collaborator Author

<|text_start|>google<|period|><|period|><|period|><|period|><|space|>com<|text_end|>

I noticed you removed the in-between spaces. What are the rules for that? The naive approach would generate

<|text_start|>google<|period|><|period|><|space|><|period|><|space|><|space|><|period|><|space|>com<|text_end|>

@edwko

edwko commented Jan 19, 2025

It processes the text like this:
google .. . . com -> google.... com -> to prompt
For example, if the text was:
google .. . . ..com . . -> google.... ..com.. -> to prompt

Here’s the implementation for this: _process_text, together with self.normalize_token_spacing, constructs the spacing correctly.

When constructing the words back to create the audio prompt, it joins the punctuation like this:

word = i["word"]
if i.get("before", []):
    word = "".join(i["before"]) + word
if i.get("after", []):
    word += "".join(i["after"])
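As a rough illustration, the grouping behavior can be inferred from the examples in this thread (this is an assumed reconstruction, not the actual OuteTTS implementation, and the token names are modeled on the examples above): punctuation glued to the front of a word goes into its "before" list, punctuation after a word or standing free merges into the preceding word's "after" list, and words are then joined with <|space|>:

```python
import re

# Hypothetical punctuation-token names modeled on the examples in the thread.
PUNCT = {
    ".": "<|period|>",
    "!": "<|exclamation_mark|>",
    "?": "<|question_mark|>",
    ",": "<|comma|>",
}

def group_words(text):
    """Split text into words with 'before'/'after' punctuation lists."""
    words = []
    for chunk in text.split():
        pending = []        # punctuation seen before any word in this chunk
        seen_word = False
        for run in re.findall(r"[\w']+|[.!?,]+", chunk):
            if run[0] in PUNCT:
                if seen_word:
                    words[-1]["after"].extend(run)  # glue to preceding word
                else:
                    pending.extend(run)             # may prefix the next word
            else:
                words.append({"before": pending, "word": run, "after": []})
                pending, seen_word = [], True
        if pending and words and not seen_word:
            words[-1]["after"].extend(pending)      # all-punct chunk merges back
    return words

def to_prompt(text):
    """Render grouped words into the v0.3-style text prompt."""
    rendered = [
        "".join(PUNCT[c] for c in w["before"]) + w["word"]
        + "".join(PUNCT[c] for c in w["after"])
        for w in group_words(text)
    ]
    return "<|text_start|>" + "<|space|>".join(rendered) + "<|text_end|>"
```

Under these assumed rules, `to_prompt("google .. . . com")` and `to_prompt("www..!...google....com??")` reproduce the prompts shown earlier in the thread, and `google .. . . ..com . .` keeps the leading periods attached to `..com` as in edwko's second example.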

@LostRuins
Collaborator Author

Yeah, this is exactly what I meant by the various edge cases regarding punctuation that may need to be untangled, which is why I initially excluded it.

Perhaps we can consider starting with this, and then expanding the implementation? Happy for someone to improve upon it here, either before or after merging.

recommended way to check if the version is 0.3, as requested by ngxson
@LostRuins LostRuins requested a review from ngxson January 19, 2025 13:44
@ggerganov ggerganov mentioned this pull request Feb 11, 2025
@Koalamana9

@ngxson @ggerganov are there any problems with this?

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Feb 25, 2025
@ggerganov
Member

I am looking towards adding more general-purpose TTS support, so I don't want to spend too much effort on this example. Its main purpose was to demonstrate a possible TTS implementation.

@Koalamana9

OuteTTS v0.3 is a must-have for this example.

@cjcox17

cjcox17 commented Feb 26, 2025

This is currently broken when running:

./llama-cli -hf OuteAI/OuteTTS-0.3-500M-GGUF
build: 4506 (90a03493) with Debian clang version 14.0.6 for aarch64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
common_download_file: previous metadata file found /root/.cache/llama.cpp/OuteAI_OuteTTS-0.3-500M-GGUF_OuteTTS-0.3-500M-Q4_K_M.gguf.json: {"etag":"\"c5e44a0544a6da675bf9d75b808cdd31-26\"","lastModified":"Wed, 15 Jan 2025 08:59:01 GMT","url":"https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/resolve/main/OuteTTS-0.3-500M-Q4_K_M.gguf"}
curl_perform_with_retry: Trying to download from https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/resolve/main/OuteTTS-0.3-500M-Q4_K_M.gguf (attempt 1 of 3)...
llama_model_loader: loaded meta data with 25 key-value pairs and 290 tensors from /root/.cache/llama.cpp/OuteAI_OuteTTS-0.3-500M-GGUF_OuteTTS-0.3-500M-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = OuteTTS 0.3 500M
llama_model_loader: - kv   3:                           general.basename str              = OuteTTS-0.3
llama_model_loader: - kv   4:                         general.size_label str              = 500M
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,157696]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,157696]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151645
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = outetts-0.3
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_0:  132 tensors
llama_model_loader: - type q8_0:   13 tensors
llama_model_loader: - type q4_K:   12 tensors
llama_model_loader: - type q6_K:   12 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 378.94 MiB (6.37 BPW) 
load: special tokens cache size = 5151
load: token to piece cache size = 0.9712 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 896
print_info: n_layer          = 24
print_info: n_head           = 14
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 128
print_info: n_embd_v_gqa     = 128
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 4864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 499.19 M
print_info: general.name     = OuteTTS 0.3 500M
print_info: vocab type       = BPE
print_info: n_vocab          = 157696
print_info: n_merges         = 151387
print_info: BOS token        = 151644 '<|im_start|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151645 '<|im_end|>'
print_info: LF token         = 148848 'ÄĬ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors:   CPU_Mapped model buffer size =   378.94 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =    48.00 MiB
llama_init_from_model: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.60 MiB
llama_init_from_model:        CPU compute buffer size =   311.50 MiB
llama_init_from_model: graph nodes  = 846
llama_init_from_model: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
terminate called after throwing an instance of 'std::runtime_error'
  what():  this custom template is not supported
Aborted

But it works fine with OuteAI/OuteTTS-0.2-500M-GGUF; any thoughts?

@edwko

edwko commented Feb 26, 2025

@cjcox17 I think it's because v0.3 adds tokenizer.chat_template: outetts-0.3. Looking at the error (what(): this custom template is not supported), it seems that key is what triggers it; the v0.2 quants don't have a defined chat_template.
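The failure mode here is that an unrecognized built-in template name aborts instead of being ignored. A minimal sketch of a graceful fallback (the helper name, metadata key, and the set of known templates are illustrative assumptions, not llama.cpp's actual API):

```python
# Sketch: treat an unknown chat-template name as "no template" rather than
# throwing. KNOWN_TEMPLATES is an assumed placeholder set for illustration.
KNOWN_TEMPLATES = {"chatml", "llama2", "outetts-0.3"}

def select_template(metadata: dict):
    """Return a recognized template name, or None to fall back to raw prompting."""
    name = metadata.get("tokenizer.chat_template")
    if name in KNOWN_TEMPLATES:
        return name
    return None  # unknown or missing: fall back instead of aborting
```

With this shape, a GGUF carrying `tokenizer.chat_template: outetts-0.3` would either be handled (once the name is known) or silently ignored, rather than crashing llama-cli at startup.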

@ngxson
Collaborator

ngxson commented Feb 26, 2025

The chat template logic recently changed quite a lot after the introduction of jinja engine, I'll have a look later

Labels
demo Demonstrate some concept or idea, not intended to be merged
examples

6 participants