The paper "JUST: Joint Unsupervised and Supervised Training for Multilingual ASR" presents a Wav2Vec2-inspired pre-training technique for multilingual automatic speech recognition (ASR). JUST uses a multi-module architecture trained with three losses computed at different stages of the network: two unsupervised (a contrastive loss and a masked language modeling, MLM, loss) and one supervised (an RNN-T loss). The approach achieves a 32% relative improvement over the Wav2Vec2-based XLSR baseline in low-resource language ASR settings. The key idea is to jointly optimize the contrastive, MLM, and RNN-T losses on paired audio-text data across multiple languages, which the authors argue extracts more useful information, generalizes better, and yields more robust contextualized token predictions. Notably, JUST outperforms Wav2Vec2 while pre-training only on the Multilingual LibriSpeech (MLS) dataset, demonstrating strong multilingual ASR performance with a smaller data requirement.
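To make the joint objective concrete, here is a minimal NumPy sketch of how the three losses might be combined. The InfoNCE-style contrastive loss and the MLM cross-entropy are toy stand-ins for the paper's unsupervised terms, the RNN-T term is a placeholder constant (a real transducer loss requires a full alignment lattice), and the weight `w` on the unsupervised terms is a hypothetical hyperparameter, not a value from the paper.

```python
import numpy as np

def infonce_contrastive_loss(context, quantized, temperature=0.1):
    # context: (T, D) context vectors at masked positions; quantized: (T, D) targets.
    # Each position's own quantized vector is the positive; the others are distractors.
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = norm(context) @ norm(quantized).T / temperature   # (T, T) cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)              # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                     # NLL of the positives

def mlm_cross_entropy(logits, targets):
    # logits: (T, V) token predictions at masked positions; targets: (T,) token ids.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
context = rng.normal(size=(8, 16))
quantized = context + 0.1 * rng.normal(size=(8, 16))  # positives lie close to the context
logits = rng.normal(size=(8, 32))
targets = rng.integers(0, 32, size=8)

l_contrastive = infonce_contrastive_loss(context, quantized)
l_mlm = mlm_cross_entropy(logits, targets)
l_rnnt = 1.0  # placeholder for the supervised transducer loss
w = 0.5       # hypothetical weight on the unsupervised terms
total = w * (l_contrastive + l_mlm) + l_rnnt
print(l_contrastive, l_mlm, total)
```

In a real system all three terms would be differentiable and backpropagated jointly through the shared encoder, which is what distinguishes JUST's joint training from a pre-train-then-fine-tune pipeline.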