Review - JUST: Joint Unsupervised and Supervised Training For Multilingual ASR
The paper "JUST - JOINT UNSUPERVISED AND SUPERVISED TRAINING FOR MULTILINGUAL ASR" presents a novel Wav2Vec2-inspired pre-training technique for multilingual automatic speech recognition (ASR). JUST utilizes a five-stage modeling architecture with three stage-level unsupervised and supervised loss functions. The proposed approach achieves a 32% performance increase over the first-stage Wav2Vec2 XLSR network in low-resource language ASR settings. Key findings include the use of contrastive MLM (Masked Language Modelling) and RNN-T losses for joint pre-training on audio-text pairs across multiple languages, leading to more useful information extraction, better generalization, and robust contextualized token prediction. JUST outperforms Wav2Vec2 by using only the MLS dataset for pre-training, demonstrating its effectiveness in multilingual ASR tasks with fewer data requirements.
Company
AssemblyAI
Date published
Dec. 15, 2021
Author(s)
Luka Chkhetiani
Word count
717
Language
English
Hacker News points
None found.