Streaming real-time text to speech with XTTS V2

Company

Baseten

Date Published

April 18, 2024

Author

Het Trivedi, Philip Kiely

Word count

1318

Language

English

Hacker News points

None

URL

www.baseten.co/blog/streaming-real-time-text-to-speech-with-xtts-v2

Summary

XTTS V2 is a state-of-the-art open-source text-to-speech model with voice cloning capabilities. Deployed behind a streaming endpoint, it can power a new class of AI applications that need spoken output in near real time. The endpoint has a round-trip time to first chunk of as little as 200 milliseconds and delivers near real-time audio playback for a given text input. XTTS V2 natively supports streaming and can generate speech in 17 languages. A model server implemented in Truss keeps inference fast, and deploying the streaming endpoint requires setting GPU resources in config.yaml and running `truss push` to create a development deployment on Baseten. How the model output is consumed depends on the application, but it can be demonstrated with a short Python script that streams the audio with FFmpeg.
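
The consumption pattern described above can be sketched as follows. This is a minimal illustration, not the script from the original post: the helper name `pipe_audio_chunks` is hypothetical, and an in-memory list stands in for the streamed HTTP response so the example runs without a deployed model. It forwards each audio chunk to a sink as soon as it arrives and records the time to first chunk.

```python
import io
import time
from typing import BinaryIO, Iterable, Optional


def pipe_audio_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> dict:
    """Write audio chunks to `sink` as they arrive and report simple stats.

    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    first_chunk_s: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        if first_chunk_s is None:
            # Time from the start of iteration until the first audio bytes.
            first_chunk_s = time.monotonic() - start
        sink.write(chunk)
        sink.flush()  # push bytes to the player right away instead of buffering
        total_bytes += len(chunk)
    return {"first_chunk_s": first_chunk_s, "total_bytes": total_bytes}


# Stand-in for a streamed response body (assumption: real chunks would come
# from the deployed endpoint).
fake_stream = [b"RIFF", b"", b"\x00\x01" * 4]
buffer = io.BytesIO()
stats = pipe_audio_chunks(fake_stream, buffer)
print(stats["total_bytes"])
```

In a real client, the chunks might come from iterating over a streaming HTTP response (for example `iter_content` on a `requests` response opened with `stream=True`), and the sink might be the standard input of an FFmpeg player process such as `ffplay` reading from stdin. Both are assumptions about the setup; the endpoint URL, request payload, and audio format depend on the deployment.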