Octave TTS: the first text-to-speech system that understands what it’s saying

Company

Hume

Date Published

Feb. 26, 2025

Author

Word count

1635

Language

English

Hacker News points

None

URL

www.hume.ai/blog/octave-the-first-text-to-speech-model-that-understands-what-its-saying

Summary

Octave is a state-of-the-art large language model trained to understand and synthesize speech, offering a new level of expressiveness and nuance in text-to-speech. It predicts the tune, rhythm, and timbre of speech, inferring when to whisper a secret or shout triumphantly, and turns that understanding into lifelike speech. The model can also generate new voices from text prompts and take instructions to modify the emotion and style of a given utterance. In an internal blind comparison study with 180 human raters, Octave's outputs were preferred over ElevenLabs Voice Design on all three human preference metrics: audio quality, naturalness, and how well the generated speech matched the description of the desired voice. Octave is available today on platform.hume.ai and through its API, with developer tools such as Voice Design, Acting Instructions, and a voice library of more than 40 premade voices. Hume is also launching an Expressive TTS Arena to support broader comparative assessments of expressive speech synthesis, intended for evaluating how well new TTS systems handle the nuanced, creative, and emotionally rich content and prompts typical of real use cases.