Deep Learning Paper Recaps - Modality Matching and Masked Autoencoders
This week's recaps cover two Deep Learning papers: MAESTRO and Masked Autoencoders that Listen. The first paper proposes a method for learning unified representations from the speech and text modalities, outperforming the current state of the art on automatic speech recognition (ASR) tasks. Key findings include the incorporation of lexical information using text-only inputs, improved performance in both monolingual and multilingual setups, and efficient representation unification with minimal supervised data. The second paper presents a novel extension of masked autoencoders to audio. The model splits mel spectrograms into patches, masks most of the patches, and reconstructs them with an encoder-decoder approach; a minimal sketch of this pipeline follows below. Key findings include that the method extends naturally to temporal data such as audio and video, that very high masking ratios yield models that are more robust in both quality and bias evaluations, and that local attention outperforms global attention in the speech domain.
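To make the patch-and-mask step concrete, here is a minimal NumPy sketch of the preprocessing described above: splitting a mel spectrogram into patches and randomly masking most of them before encoding. The patch size, masking ratio, and spectrogram dimensions are illustrative assumptions, not the exact values from the paper, and the encoder/decoder themselves are omitted.

```python
import numpy as np

def patchify(mel, patch_h=16, patch_w=16):
    """Split a (freq, time) mel spectrogram into flattened patches."""
    f, t = mel.shape
    assert f % patch_h == 0 and t % patch_w == 0
    patches = (mel.reshape(f // patch_h, patch_h, t // patch_w, patch_w)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, patch_h * patch_w))
    return patches  # shape: (num_patches, patch_dim)

def random_masking(patches, mask_ratio=0.8, rng=None):
    """Keep a random subset of patches; return the visible patches,
    their indices, and a boolean mask marking the hidden ones."""
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False  # False = visible, True = masked
    return patches[keep_idx], keep_idx, mask

# Toy example: a 128-bin x 1024-frame mel spectrogram (assumed sizes).
mel = np.random.randn(128, 1024).astype(np.float32)
patches = patchify(mel)                                 # (512, 256)
visible, keep_idx, mask = random_masking(patches, 0.8)  # (102, 256)
print(visible.shape, int(mask.sum()), "patches masked")
# The encoder sees only `visible`; the decoder inserts learned mask
# tokens at the masked positions and reconstructs the full spectrogram.
```

Masking a large fraction of the input keeps the encoder cheap, since it only attends over the small visible subset, while forcing the model to learn representations strong enough to reconstruct the rest.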
Company
AssemblyAI
Date published
July 27, 2022
Author(s)
Luka Chkhetiani, Ruben Bousbib
Word count
332
Language
English
Hacker News points
None found.