Building an End-to-End Speech Recognition Model in PyTorch
Deep learning has revolutionized Automatic Speech Recognition (ASR) by introducing end-to-end models such as Baidu's Deep Speech and Google's Listen, Attend and Spell (LAS). These models output transcriptions directly from audio inputs, simplifying speech recognition pipelines. Both are based on recurrent neural network architectures but take different approaches to modeling speech recognition. By leveraging large datasets, deep learning has enabled robust ASR models that eliminate the need for hand-engineered acoustic features or complex GMM-HMM architectures.

The tutorial guides readers through building an end-to-end speech recognition model in PyTorch, inspired by Deep Speech 2. It covers data preparation and augmentation, defining the model architecture, selecting appropriate optimizers and schedulers, implementing the CTC loss function, evaluating the model with WER and CER metrics, and monitoring experiments with Comet.ml. It also discusses advancements in speech recognition, such as Transformers, unsupervised pre-training, and word piece models, which can improve accuracy and efficiency.
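The WER and CER metrics mentioned above are both ratios of edit distance to reference length, computed over words and characters respectively. As a minimal sketch (a plain-Python Levenshtein implementation, not the tutorial's actual evaluation code), they can be written as:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:0] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                        # deletion
                dp[j - 1] + 1,                    # insertion
                prev + (ref[i - 1] != hyp[j - 1]) # substitution (or match)
            )
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```

For example, `wer("the cat sat", "the cat sit")` yields one substitution over three reference words, i.e. roughly 0.33.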
Company: AssemblyAI
Date published: Dec. 1, 2020
Author(s): Michael Nguyen
Word count: 3346
Hacker News points: 288
Language: English