Review - Text-Free Prosody-Aware Generative Spoken Language Modeling

Post Details

Company

AssemblyAI

Date Published

Sept. 24, 2021

Author

Steven Hillis

Word Count

333

Language

English

Hacker News Points

-

Source URL

www.assemblyai.com/blog/review-text-free-prosody-aware-generative-spoken-language-modeling

Summary

The paper "Text-Free Prosody-Aware Generative Spoken Language Modeling" extends generative spoken language modeling by adding prosody as an input feature. Text has traditionally served as the intermediate representation between speech inputs and NLP analyses, but this work argues that text is suboptimal because it is a lossy medium for capturing speech. By modeling directly in the spoken language domain rather than cascading through text, the authors aim for a representation that loses less of the speech signal. The model takes three input streams: self-supervised acoustic units representing phonetic content, quantized speaker-mean-normalized log F0 bins, and unit durations. A transformer language model models these streams jointly. The paper's findings show that prosodic input features improve both content modeling and prosody modeling. This research direction is promising but still exploratory, and it suggests that spoken language modeling could move towards end-to-end approaches in the future.
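To make the three input streams more concrete, here is a minimal Python sketch of how frame-level units and F0 values might be turned into (unit, duration, F0 bin) triples. It is an illustration under stated assumptions, not the paper's implementation: the function names, the equal-width binning, the bin count and range, the per-unit averaging of F0, and the extra bin index for unvoiced segments are all hypothetical choices made for this example.

```python
import math
from itertools import groupby


def run_length_encode(units):
    """Collapse consecutive repeated units into (unit, duration_in_frames) pairs."""
    return [(unit, len(list(group))) for unit, group in groupby(units)]


def normalized_log_f0(f0_hz):
    """Subtract the speaker's mean log F0 from each voiced frame.

    Frames with f0 <= 0 are treated as unvoiced and returned as None.
    """
    logs = [math.log(f) for f in f0_hz if f > 0]
    if not logs:
        return [None] * len(f0_hz)
    mean = sum(logs) / len(logs)
    return [math.log(f) - mean if f > 0 else None for f in f0_hz]


def quantize(value, lo, hi, n_bins):
    """Map a value to one of n_bins equal-width bins over [lo, hi].

    Out-of-range values are clamped to the first or last bin.
    Assumption for this sketch: index n_bins is reserved for unvoiced (None).
    """
    if value is None:
        return n_bins
    width = (hi - lo) / n_bins
    index = int((value - lo) / width)
    return max(0, min(n_bins - 1, index))


def build_streams(units, f0_hz, n_bins=4, lo=-1.0, hi=1.0):
    """Return one (unit, duration, f0_bin) triple per run of identical units.

    Assumes units and f0_hz are aligned frame by frame. The F0 of a run is
    summarized here as the mean normalized log F0 of its voiced frames.
    """
    norm = normalized_log_f0(f0_hz)
    streams = []
    start = 0
    for unit, duration in run_length_encode(units):
        voiced = [v for v in norm[start:start + duration] if v is not None]
        segment_mean = sum(voiced) / len(voiced) if voiced else None
        streams.append((unit, duration, quantize(segment_mean, lo, hi, n_bins)))
        start += duration
    return streams


if __name__ == "__main__":
    frame_units = [7, 7, 7, 3, 3, 9]          # hypothetical acoustic unit IDs
    frame_f0 = [100, 100, 100, 0, 0, 200]     # Hz, 0 marks an unvoiced frame
    print(build_streams(frame_units, frame_f0))
```

Each triple could then be fed to a language model as parallel token streams. Normalizing by the speaker's mean log F0 before quantizing means the bins describe pitch relative to the speaker rather than absolute pitch.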