Exploring Gemini's Multimodal Live API.
- `script.py`: A script to interact with the API.
- `app.py`: A Flask app to interact with the API.
- Requirements.txt file
- API Key
The model automatically performs voice activity detection (VAD); it is always enabled and cannot be configured. This allows a natural, free-flowing conversation, but in practice it is problematic when using a speaker and a microphone together: the model's own audio feeds back into the microphone and cuts the model off. It also appears to struggle in noisy environments. For now, wearing headphones seems like the only viable approach.
As a workaround, I changed the script so that no input audio is sent while the model is speaking (see the sketch below).
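A minimal sketch of that gating pattern, assuming an asyncio-based capture loop; `model_is_speaking`, `session.send_audio`, and `mic_stream` are hypothetical placeholders for whatever playback-state flag, send call, and audio stream the script actually uses:

```python
import asyncio

# Hypothetical flag flipped by the playback task: set while model audio is
# playing, cleared once the model's turn has finished playing back.
model_is_speaking = asyncio.Event()

async def send_microphone_audio(session, mic_stream, chunk_size=1024):
    """Read microphone chunks and forward them, dropping chunks while the
    model is speaking so its own output is not fed back into the mic."""
    while True:
        chunk = await asyncio.to_thread(mic_stream.read, chunk_size)
        if model_is_speaking.is_set():
            # Model is talking: discard this chunk instead of sending it,
            # which prevents the speaker output from interrupting the model.
            continue
        await session.send_audio(chunk)  # hypothetical send call
```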
Session duration is limited to 15 minutes for audio. When a session exceeds this limit, the connection is terminated. Up to 3 concurrent sessions are allowed per API key.
Sessions with both audio and video input are limited to 2 minutes.
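Since the connection is simply dropped when a limit is hit, the client should be prepared to reconnect. A minimal sketch of that, where `connect_session()` and `run_conversation()` are hypothetical stand-ins for the actual connection and streaming code (the exception type will depend on the transport library in use):

```python
import asyncio

async def run_with_reconnect(max_retries=3):
    """Re-open a session when the server closes it (e.g. at the 15-minute
    audio limit), up to a small number of retries."""
    for attempt in range(max_retries):
        try:
            session = await connect_session()   # hypothetical: opens the Live API session
            await run_conversation(session)     # hypothetical: streams audio until the server disconnects
        except ConnectionError as exc:
            print(f"Session ended ({exc}), reconnecting ({attempt + 1}/{max_retries})...")
            await asyncio.sleep(1)  # brief backoff before opening a new session
        else:
            break  # conversation finished normally, no reconnect needed
```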
- How to handle a keyboard interrupt nicely? (one possible approach is sketched below)
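One possible approach, assuming the script's main loop is an asyncio coroutine and that `open_microphone()`, `connect_session()`, and `run_conversation()` are hypothetical stand-ins for the real setup and streaming code: catch the interrupt at the top level and release resources in a `finally` block, which also runs when `asyncio.run()` cancels the task.

```python
import asyncio

async def main():
    mic_stream = open_microphone()      # hypothetical: opens the audio input stream
    session = await connect_session()   # hypothetical: opens the Live API session
    try:
        await run_conversation(session, mic_stream)  # hypothetical streaming loop
    finally:
        # Runs on normal exit, errors, and cancellation alike,
        # so Ctrl+C still leaves the audio device and session closed.
        mic_stream.close()
        await session.close()

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        # asyncio.run() cancels main() and lets its finally block run as part
        # of shutdown, so a plain message here is enough.
        print("Interrupted, exiting cleanly.")
```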