Added API support for local Zonos. #73

Open
wants to merge 13 commits into main


PhialsBasement

Add REST API Endpoints

This PR adds FastAPI endpoints to Zonos, allowing programmatic access to the model's functionality alongside the existing Gradio interface.

Added Features

  • /models endpoint to list available models
  • /generate endpoint for text-to-speech generation
  • /speaker_embedding endpoint for creating speaker embeddings

Changes

  • Added FastAPI integration
  • Model responses are streamed as WAV files
  • Added Pydantic models for request validation

Testing

Tested with curl commands:

  • GET /models works as expected
  • POST /generate successfully generates audio
  • POST /speaker_embedding successfully creates embeddings

The implementation reuses existing model management code and runs alongside the Gradio interface on a different port.
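As an illustration of the "streamed as WAV files" item, here is a minimal sketch of packing raw samples into a WAV container held in memory, which is the kind of byte string an endpoint could hand back. It is not the PR's code: the function name, the 44.1 kHz sample rate, and the mono 16-bit layout are assumptions chosen for the example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # assumption: not taken from the PR


def pcm_to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Pack float samples in [-1.0, 1.0] into a mono 16-bit WAV held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))
    return buffer.getvalue()


# Stand-in for model output: 0.1 s of a 440 Hz tone.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
wav_bytes = pcm_to_wav_bytes(tone)
```

Building the file in a `BytesIO` buffer avoids touching disk, so the same bytes can be returned directly as the HTTP response body.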

@PhialsBasement PhialsBasement mentioned this pull request Feb 14, 2025
@darkacorn
Contributor

darkacorn commented Feb 14, 2025

i would maybe separate that into a different api file without gradio -
as you most likely use one or the other, not both at the same time

and have an api-consuming gradio ui - as a refactor - if that is the goal

also as a request -

maybe try to keep in alignment with openai's tts api
that is very much integrated and supported everywhere, with optional features as separate parameters

this would allow easy integration for 3rd party systems without much hassle and with sane defaults
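To make the "OpenAI-aligned core, optional extras" idea concrete, here is a rough sketch of a request body builder. The five core keys (`model`, `input`, `voice`, `response_format`, `speed`) are the ones OpenAI's speech endpoint accepts; the function name, the `"default"` voice, and the other default values are hypothetical and only illustrate what "sane defaults" could look like.

```python
import json


def build_speech_payload(text, voice="default", model="Zyphra/Zonos-v0.1-transformer",
                         response_format="mp3", speed=1.0, **extras):
    """Return a request body whose core keys follow the OpenAI TTS schema.

    Anything model-specific (language, emotion, ...) travels as extra keys,
    so a client that only knows the OpenAI fields keeps working unchanged.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    payload.update(extras)  # optional, non-OpenAI parameters
    return payload


payload = build_speech_payload("Hello", language="en-us")
body = json.dumps(payload)  # what would be POSTed to the speech endpoint
```

A third-party client that sends only `model`, `input`, and `voice` would produce a valid request under this layout, which is the integration benefit described above.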

@Steveboy123

Thank you @PhialsBasement , you are a lifesaver.

@darkacorn
Contributor

darkacorn commented Feb 15, 2025

image

thats more akin to what im proposing .. (mind you, uploading a voice file for every request to a remote machine may be suboptimal)

we may even want to isolate loading transformer and hybrid at the same time so there is no need to swap over .. models are small enough to fit even in peanut cards - ( model loading time would hurt throughput ) ( optional pinning or full override-able but i would make that the default behaviour for any load bearing api)

in an api scenario batch processor with queue could be prefixed with just what model to take as both are present in vram ( i work on that once we get a go ahead or at least a LGTM from the team)

voice could be embedded as tensors on voice upload - and on usage we just pull in the tensor to save computation

atm i support mp3/wav while always converting to wav as a baseline

happy to help out .. but i think api and gradio should be clearly separated .. can someone from zyphra chip in here ?
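A minimal sketch of the "both models resident, queue entries name the model" idea from this comment. The stub callables stand in for real loaded models, and every name here (`MODELS`, `drain`, the job tuple layout) is hypothetical; real code would hold the actual model objects and return audio.

```python
import queue


def make_stub(name):
    """Stand-in for a loaded model; returns a label instead of audio."""
    def generate(text):
        return f"{name}:{text}"
    return generate


# Both variants are created once at startup and stay resident,
# so no job ever pays the cost of swapping models.
MODELS = {"transformer": make_stub("transformer"), "hybrid": make_stub("hybrid")}


def drain(jobs, models):
    """Process queued (model_name, text) jobs in order without reloading anything."""
    results = []
    while True:
        try:
            model_name, text = jobs.get_nowait()
        except queue.Empty:
            return results
        results.append(models[model_name](text))


jobs = queue.Queue()
jobs.put(("hybrid", "first request"))
jobs.put(("transformer", "second request"))
outputs = drain(jobs, MODELS)
```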

@zaydek

zaydek commented Feb 15, 2025

Just want to mention this thread as relevant for when a teammate comes around to see this PR: #37.

@darkacorn
Contributor

darkacorn commented Feb 15, 2025

agreed but that is different as their api has different sampling .. that should be possible to compensate for once we know what they use
the model cond. has params for min_p / top_k / top_p / temp and rep_pen .. which are not exposed or used atm in oss, only min_p for the time being
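For readers unfamiliar with min_p, here is a small sketch of the general technique: drop every token whose probability falls below `min_p` times the most likely token's probability, then renormalise. This is the textbook form written over plain Python lists; whether the project implements it exactly this way is an assumption.

```python
def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalise the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 for min_p <= 1, since the top token survives
    return [p / total for p in kept]


# With min_p = 0.2 the cut-off is 0.2 * 0.5 = 0.1, so only the last token is removed.
filtered = min_p_filter([0.5, 0.3, 0.15, 0.05], min_p=0.2)
```

Because the cut-off scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.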

@Ph0rk0z

Ph0rk0z commented Feb 15, 2025

With OAI endpoint and speakers from folder as returned voices it would work straight away in sillytavern. Unconditional emotions and it would be good "as-is".

@darkacorn
Contributor

With OAI endpoint and speakers from folder as returned voices it would work straight away in sillytavern. Unconditional emotions and it would be good "as-is".

pretty much why i proposed it that way .. integration in hundreds of systems would work w/o any extra work

@PhialsBasement
Author

@darkacorn just threw in some of your suggestions, check it out and tell me if its what you were thinking

@darkacorn
Contributor

amazing thanks for pulling that in, good baseline

@ther3zz

ther3zz commented Feb 16, 2025

I'm currently testing the openai endpoint, will report back if I run into any issues!
That being said, it makes sense to include a swagger docs endpoint as well (or at least some variable to enable/disable the docs page)

@ther3zz

ther3zz commented Feb 16, 2025

Has anyone been able to create embeddings? I'm running into this error:

{
    "detail": "'int' object has no attribute 'query'"
}

@PhialsBasement
Author

@ther3zz Fixed. Issue was in api.py, i was tryina use .query() on a CUDA stream handle, now its just a normal UNIX timestamp instead.
image

@ther3zz

ther3zz commented Feb 17, 2025

@ther3zz Fixed. Issue was in api.py, i was tryina use .query() on a CUDA stream handle, now its just a normal UNIX timestamp instead.

Looks like it's working!

@ther3zz

ther3zz commented Feb 17, 2025

Another issue I noticed is that MODEL_CACHE_DIR=/app/models doesn't seem to work. I'm not seeing the models cached there; they end up here instead: /root/.cache/huggingface/hub/
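One possible explanation, offered as a guess rather than a diagnosis: the Hugging Face hub client picks its cache location from its own environment variables (`HF_HUB_CACHE`, or `HF_HOME`) and does not know about a project-specific name like `MODEL_CACHE_DIR`. A sketch of bridging the two, where the helper name is hypothetical:

```python
def apply_model_cache_dir(env):
    """Copy MODEL_CACHE_DIR into HF_HUB_CACHE unless the latter is already set.

    `env` is any mutable mapping; in a real process it would be os.environ,
    and this must run before huggingface_hub is imported, because the
    library reads the variable at import time.
    """
    cache_dir = env.get("MODEL_CACHE_DIR")
    if cache_dir and "HF_HUB_CACHE" not in env:
        env["HF_HUB_CACHE"] = cache_dir
    return env.get("HF_HUB_CACHE")


# Demonstrated on a plain dict so the example has no side effects.
demo_env = {"MODEL_CACHE_DIR": "/app/models"}
resolved = apply_model_cache_dir(demo_env)
```

An explicit `HF_HUB_CACHE` still wins under this scheme, so existing setups are not overridden.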

@PhialsBasement
Author

Whack, ill look into it and see whats going on there

@Ph0rk0z

Ph0rk0z commented Feb 17, 2025

Why can't we just load models from a folder we manually saved? I get that huggingface hub is used for docker, but not all of us are doing that.

@darkacorn
Contributor

i dont think there is anything that prevents it .. you can even use it offline with the hf client

@Ph0rk0z

Ph0rk0z commented Feb 17, 2025

I've had to change loading to from_local in gradio and all. The from_pretrained is hijacked away from torch.
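A small sketch of the bookkeeping around loading from a manually saved folder: check that the expected files exist before handing their paths to a local loader such as the `from_local` mentioned above. The file names `config.json` and `model.safetensors` and the helper name are assumptions, and the loader call itself is left as a comment because its exact signature is not shown in this thread.

```python
import tempfile
from pathlib import Path


def resolve_local_checkpoint(folder, config_name="config.json",
                             weights_name="model.safetensors"):
    """Return (config_path, weights_path) or raise if either file is missing."""
    folder = Path(folder)
    config_path, weights_path = folder / config_name, folder / weights_name
    missing = [p.name for p in (config_path, weights_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"{folder} is missing: {', '.join(missing)}")
    # The two paths would then go to the local loader (call omitted here).
    return config_path, weights_path


# Demo against a throwaway folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    config_path, weights_path = resolve_local_checkpoint(tmp)
    found = (config_path.name, weights_path.name)
```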

@mathematicalmichael

@PhialsBasement
Author

@ther3zz can you move this to issues tab over on my fork?

@Sturmgewehr444

Sturmgewehr444 commented Mar 1, 2025

But if we manually clone it, Sillytavern would be supporting one specific branch of Zonos that may or may not continue to have its other features or be maintained. We would have to tell everyone : "No, you can't use its latest update, you have to go git switch and then use the API from that particular branch!"

@darkacorn
Contributor

welcome to opensource - you patch it your self - if you dont want or cant do that - use 11labs

@PhialsBasement
Author

@PhialsBasement Any chance on getting these suggestions implemented in your PR? #73 (comment) #73 (comment)

ill look into it soon

@PhialsBasement
Author

PhialsBasement commented Mar 2, 2025

Are there beginner-friendly instructions for the API setup I got the UI to work but I can't get the API part set up.

Is the API container running at all or do you mean you're trying to send API requests but that is not working?

my bad for not including instructions, you need to do

docker compose build
docker compose up zonos-api

after this wait until the api is up and running, it will first download the models and once done it will open the endpoint.
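Since the first start has to download the models, a client script may want to poll until the endpoint answers. A rough sketch follows; the helper names, the retry counts, and the choice of URL to probe are assumptions (port 8000 and `/v1/audio/voices` appear elsewhere in this thread).

```python
import time
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False


def wait_until_ready(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    raise TimeoutError(f"service not ready after {attempts} attempts")


# Real use (needs the container running):
#   wait_until_ready(lambda: http_probe("http://localhost:8000/v1/audio/voices"))

# Offline demo with a fake probe that succeeds on the third call.
answers = iter([False, False, True])
attempts_used = wait_until_ready(lambda: next(answers), sleep=lambda _: None)
```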

@PhialsBasement
Author

PhialsBasement commented Mar 2, 2025

FYI, i was on a bit of a break since the last comment i left here. Ill pick up with working on it now so any suggestions should be reiterated in case i miss them, for this i have enabled the issues and discussions tab on the fork, please add issues and suggestions over there. Thank you.

@PhialsBasement
Author

But if we manually clone it, Sillytavern would be supporting one specific branch of Zonos that may or may not continue to have its other features or be maintained. We would have to tell everyone : "No, you can't use its latest update, you have to go git switch and then use the API from that particular branch!"

I agree this will become an issue sooner or later if i ever am unable to continue maintaining this.

- Implement file-based storage for voice embeddings and metadata
- Add support for custom voice naming during creation
- Enable voice lookup by either name or ID
- Create new /v1/audio/voices endpoint to list saved voices
- Improve reliability with UUID-based voice ID generation
- Enhance error handling with descriptive messages
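
The changes listed above could look roughly like the following sketch of a file-backed voice registry with UUID-based IDs and lookup by name or ID. This is not the code from the PR: the class, the `voices.json` index, and the `.bin` file layout are hypothetical, and raw bytes stand in for a serialized embedding tensor.

```python
import json
import tempfile
import uuid
from pathlib import Path


class VoiceStore:
    """Hypothetical registry: one JSON index plus one embedding file per voice."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index_path = self.root / "voices.json"

    def _load(self):
        if self.index_path.exists():
            return json.loads(self.index_path.read_text())
        return {}

    def add(self, embedding_bytes, name=None):
        """Store an embedding under a fresh UUID-based ID and optional custom name."""
        index = self._load()
        if name is not None and any(meta["name"] == name for meta in index.values()):
            raise ValueError(f"voice name already in use: {name!r}")
        voice_id = f"voice_{uuid.uuid4().hex}"
        (self.root / f"{voice_id}.bin").write_bytes(embedding_bytes)
        index[voice_id] = {"name": name or voice_id}
        self.index_path.write_text(json.dumps(index, indent=2))
        return voice_id

    def resolve(self, name_or_id):
        """Accept either an ID or a custom name and return the ID."""
        index = self._load()
        if name_or_id in index:
            return name_or_id
        for voice_id, meta in index.items():
            if meta["name"] == name_or_id:
                return voice_id
        raise KeyError(f"unknown voice: {name_or_id!r}")

    def list_voices(self):
        return [{"id": vid, **meta} for vid, meta in self._load().items()]


with tempfile.TemporaryDirectory() as tmp:
    store = VoiceStore(tmp)
    vid = store.add(b"\x00" * 8, name="narrator")
    by_name, by_id = store.resolve("narrator"), store.resolve(vid)
    listed = store.list_voices()
```
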
@PhialsBasement
Author

@ther3zz just implemented your suggestions from #73 (comment)

@Sturmgewehr444

Sturmgewehr444 commented Mar 2, 2025

you patch it your self

Again, until the API has been merged, Sillytavern can not support Zonos TTS. You seem to be misunderstanding me. It has to be merged.

@darkacorn
Contributor

darkacorn commented Mar 2, 2025

you patch it your self

Again, until the API has been merged, Sillytavern can not support Zonos TTS. You seem to be misunderstanding me. It has to be merged.

you are wrong on that - a) the api has no splitting of long ctx so anything over 30 sec will fail - b) you dont need a custom integration - its an openai tts compatible endpoint, just link the url of the api .. no custom integration needed

image

pull the pr run the api and you are off to the races BY DEFAULT .. no custom stuff needed

@Ph0rk0z

Ph0rk0z commented Mar 2, 2025

So one of the concatenation methods has to go into the API. Currently they're targeting the gradio.
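A minimal sketch of the text-splitting half of such a concatenation method: break long input at sentence boundaries into chunks under a character budget, generate each chunk separately, then join the audio. The function name and the character limit are assumptions, and this is not one of the existing gradio-side methods referred to above.

```python
import re


def split_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "First sentence. Second one! A third? Fourth."
chunks = split_text(text, max_chars=30)
# chunks == ["First sentence. Second one!", "A third? Fourth."]
```

A character budget is only a proxy for audio duration, so a real limit would need tuning against the 30 second ceiling mentioned earlier.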

@Sturmgewehr444

Sturmgewehr444 commented Mar 2, 2025

pull the pr run the api and you are off to the races BY DEFAULT .. no custom stuff needed

As far as I know, this here is only supported by Linux. What about Windows users?

@darkacorn
Contributor

pull the pr run the api and you are off to the races BY DEFAULT .. no custom stuff needed

As far as I know, this here is only supported by Linux. What about Windows users?

if you manage to run zonos on windows that will run on windows too - there are no exotic dependencies for the api

@Sturmgewehr444

if you manage to run zonos on windows that will run on windows too - there are no exotic dependencies for the api

Taken from the description:

Installation
At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).

@darkacorn
Contributor

and if you look lower there is a experimental link for windows installations . but i would not recommend it .. albeit some run it on windows just fine

@Sturmgewehr444

and if you look lower there is a experimental link for windows installations . but i would not recommend it .. albeit some run it on windows just fine

Docker? I am talking about windows without any other software. How would you install this rep?

@darkacorn
Contributor

and if you look lower there is a experimental link for windows installations . but i would not recommend it .. albeit some run it on windows just fine

Docker? I am talking about windows without any other software. How would you install this rep?

read the documentation and maybe dont spam a PR - open an issue .. and if someone cares enough they will answer - this is the wrong thread for that

@PhialsBasement
Author

PhialsBasement commented Mar 3, 2025

So one of the concatenation methods has to go into the API. Currently they're targeting the gradio.

@Ph0rk0z ill look into this when i get back from work tonight

@Sturmgewehr444 windows should work just fine but this PR focuses primarily on adding API support. If i have extra time ill look into streamlining it for windows but it will not be a main focus.

@Napolitain

why do we have an endpoint for creating speaker embeddings?
from my understanding, this makes the service stateful. Shouldn't we have a stateless service, where we generate the embeddings as part of generating the audio, and if a request is repeated, the embedding is served from a cache instead? It seems a better design to let the backend handle its own resources
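The caching alternative described here can be sketched as a content-addressed cache: key each embedding by a hash of the uploaded audio bytes, so a repeated upload skips recomputation without the client managing voice IDs. The class name is hypothetical, and `embed_fn` stands in for whatever function actually computes a speaker embedding.

```python
import hashlib


class EmbeddingCache:
    """Memoise embed_fn by the SHA-256 digest of the audio bytes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.misses = 0  # number of times embed_fn actually ran

    def get(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(audio_bytes)
        return self._cache[key]


# Demo with a trivial stand-in "embedding" (the byte length).
cache = EmbeddingCache(lambda audio: len(audio))
first = cache.get(b"abc")
second = cache.get(b"abc")   # served from the cache
third = cache.get(b"abcd")   # new content, computed again
```

The trade-off raised in the replies still applies: the client has to resend the audio with every request, which costs bandwidth even when the computation is cached.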

@darkacorn
Contributor

could be seen as such and be optional sure - but generally you use the api for yourself or just change that and pass that over - even with user auth - you fence that off to a different s3 bucket and fetch from there, as that is faster in throughput than having the customer send the voices he uses frequently every time

  • there are many ways to rome - you can most certainly change that part
    however .. its a convenience factor for most people who use the api for a local integration - which is the majority of the customer base

@darkacorn
Contributor

darkacorn commented Mar 3, 2025

also this only stores the torch tensor, not the audio files / its to mimic the oai tts api as closely as possible, and that comes with fixed voices

@ther3zz

ther3zz commented Mar 3, 2025

@ther3zz just implemented your suggestions from #73 (comment)

Sorry for the delay!

Just tested this and its working perfectly!

@YellowRoseCx

would someone mind telling me how I setup the voice IDs with the JSON?

@darkacorn
Contributor

darkacorn commented Mar 8, 2025

would someone mind telling me how I setup the voice IDs with the JSON?

voice^^

curl -X POST "http://localhost:8000/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Zyphra/Zonos-v0.1-transformer",
    "input": "Hello, this is a test of the Zonos API.",
    "voice": "voice_12345_0",
    "speed": 1.0,
    "language": "en-us",
    "emotion": {
      "happiness": 1.0
    },
    "response_format": "mp3"
  }'

@YellowRoseCx

would someone mind telling me how I setup the voice IDs with the JSON?

voice^^

curl -X POST "http://localhost:8000/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Zyphra/Zonos-v0.1-transformer",
    "input": "Hello, this is a test of the Zonos API.",
    "voice": "voice_12345_0",
    "speed": 1.0,
    "language": "en-us",
    "emotion": {
      "happiness": 1.0
    },
    "response_format": "mp3"
  }'

thank you! And I was thinking a good way to improve this overall would be to use the 100ms silence that's included in the Zonos/Asset folder as the default value for the "prefix_audio" argument in the speechrequest, instead of none, because I read somewhere in the file docs or a commit that it increases the quality and stability of generations

@ther3zz

ther3zz commented Mar 20, 2025

any luck in getting this merged?
I've been using it and its working really well

@kleineluka

Still interested in Zonos having a built-in API like this - any idea if it's possible for merge?
