Deep Learning Paper Recap - Transfer Learning
This paper, titled "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis", presents a novel method for generating high-quality voice clones from just 5 seconds of audio from speakers never seen during training. Previous state-of-the-art models needed tens of minutes of audio per speaker. The authors achieved this by decoupling the speaker encoder from the TTS network, relaxing the data requirements for each component and enabling zero-shot voice cloning. The speaker encoder network is trained on a large dataset for a speaker verification task, which requires speaker labels but no transcripts, and learns to generate fixed-dimensional speaker embedding vectors that capture a speaker's voice characteristics independently of the audio content. These embeddings are then fed, alongside embeddings of the user's input text, into a standard TTS pipeline that produces log-mel spectrograms, which a final vocoder network transforms into waveforms. This approach requires significantly less labeled data than previous end-to-end pipelines, which may open up opportunities for generating effectively unlimited high-quality labeled data through various tweaks and modifications.
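To make the inference-time data flow concrete, here is a minimal PyTorch sketch of the three decoupled networks. All module internals, dimensions, and names (SpeakerEncoder, Synthesizer, Vocoder, EMBED_DIM, N_MELS) are simplified assumptions for illustration, not the paper's implementation; the actual system uses an LSTM speaker encoder trained for speaker verification, a Tacotron 2-based synthesizer, and a WaveNet vocoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative constants (assumed, not from the paper's configs).
EMBED_DIM = 256   # fixed speaker-embedding size
N_MELS = 80       # log-mel channels, typical for Tacotron-style TTS

class SpeakerEncoder(nn.Module):
    """Maps a variable-length reference utterance to one fixed vector."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=N_MELS, hidden_size=EMBED_DIM,
                            batch_first=True)

    def forward(self, ref_mels):                  # (B, T_ref, N_MELS)
        _, (h, _) = self.lstm(ref_mels)
        # L2-normalize so the embedding encodes voice identity, not loudness.
        return F.normalize(h[-1], dim=-1)         # (B, EMBED_DIM)

class Synthesizer(nn.Module):
    """Toy stand-in: text embeddings + speaker embedding -> mel frames."""
    def __init__(self, vocab_size=100):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, EMBED_DIM)
        self.proj = nn.Linear(2 * EMBED_DIM, N_MELS)

    def forward(self, text_ids, spk_embed):       # (B, T_text), (B, EMBED_DIM)
        txt = self.text_embed(text_ids)           # (B, T_text, EMBED_DIM)
        # Broadcast the one speaker vector across every text step.
        spk = spk_embed.unsqueeze(1).expand(-1, txt.size(1), -1)
        return self.proj(torch.cat([txt, spk], dim=-1))  # (B, T_text, N_MELS)

class Vocoder(nn.Module):
    """Toy stand-in: upsamples mel frames into a waveform."""
    def __init__(self, hop=256):
        super().__init__()
        self.to_samples = nn.Linear(N_MELS, hop)

    def forward(self, mels):                      # (B, T, N_MELS)
        return self.to_samples(mels).reshape(mels.size(0), -1)  # (B, T * hop)

# Zero-shot inference: a few seconds of reference audio from an unseen speaker.
encoder, synth, vocoder = SpeakerEncoder(), Synthesizer(), Vocoder()
ref_mels = torch.randn(1, 500, N_MELS)    # ~5 s of reference mel frames
text_ids = torch.randint(0, 100, (1, 40)) # tokenized input text
spk_embed = encoder(ref_mels)             # voice identity only, no content
waveform = vocoder(synth(text_ids, spk_embed))
print(waveform.shape)                     # torch.Size([1, 10240])
```

The design choice this mirrors is that the speaker embedding is computed once from the reference audio and simply conditioned onto every synthesis step, so the TTS network never needs transcribed audio from the target speaker.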
Company
AssemblyAI
Date published
Aug. 10, 2022
Author(s)
Michael Liang
Word count
273
Language
English