Deep Learning Paper Recaps - Modality Matching and Masked Autoencoders
This week's recaps cover two Deep Learning papers: MAESTRO and Masked Autoencoders that Listen. The first paper proposes a method for learning unified representations from the speech and text modalities, outperforming the current state of the art on automatic speech recognition (ASR) tasks. Key findings include the incorporation of lexical information using text-only inputs, improved performance in both monolingual and multilingual setups, and efficient representation unification with minimal supervised data. The second paper presents a novel extension of masked autoencoders to audio. The model splits mel spectrograms into patches, masks most of the patches, and reconstructs them with an encoder-decoder approach; a minimal sketch of this pipeline follows below. Key findings include that the method extends naturally to temporal data such as audio and video, that very high masking ratios yield models that are more robust in both quality and bias evaluations, and that local attention outperforms global attention in the speech domain.
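To make the patch-and-mask step concrete, here is a minimal NumPy sketch of the preprocessing described above: splitting a mel spectrogram into patches and randomly masking most of them before encoding. The patch size, masking ratio, and spectrogram dimensions are illustrative assumptions, not the exact values from the paper, and the encoder/decoder themselves are omitted.

```python
import numpy as np

def patchify(mel, patch_h=16, patch_w=16):
    """Split a (freq, time) mel spectrogram into flattened patches."""
    f, t = mel.shape
    assert f % patch_h == 0 and t % patch_w == 0
    patches = (mel.reshape(f // patch_h, patch_h, t // patch_w, patch_w)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, patch_h * patch_w))
    return patches  # shape: (num_patches, patch_dim)

def random_masking(patches, mask_ratio=0.8, rng=None):
    """Keep a random subset of patches; return the visible patches,
    their indices, and a boolean mask marking the hidden ones."""
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False  # False = visible, True = masked
    return patches[keep_idx], keep_idx, mask

# Toy example: a 128-bin x 1024-frame mel spectrogram (assumed sizes).
mel = np.random.randn(128, 1024).astype(np.float32)
patches = patchify(mel)                                 # (512, 256)
visible, keep_idx, mask = random_masking(patches, 0.8)  # (102, 256)
print(visible.shape, int(mask.sum()), "patches masked")
# The encoder sees only `visible`; the decoder inserts learned mask
# tokens at the masked positions and reconstructs the full spectrogram.
```

Masking a large fraction of the input keeps the encoder cheap, since it only attends over the small visible subset, while forcing the model to learn representations strong enough to reconstruct the rest.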
Company
AssemblyAI
Date published
July 27, 2022
Author(s)
Luka Chkhetiani, Ruben Bousbib
Word count
332
Language
English
Hacker News points
None found.