An Overview of Transducer Models for ASR
In deep learning-based speech recognition, three main neural network architectures are widely used: Connectionist Temporal Classification (CTC), Listen-Attend-Spell (LAS) models, and Transducers. Transducers have recently become the best-performing architecture for most ASR tasks, surpassing both CTC and LAS models. The Transducer architecture was first introduced in 2012 by Alex Graves in the paper "Sequence Transduction with Recurrent Neural Networks."

RNN-Ts, or Recurrent Neural Network Transducers, were created to address a key shortcoming of CTC models: their reliance on an external language model to perform well. Unlike a CTC model, an RNN-T consists of three modules trained jointly, each with a distinct purpose: the Encoder maps acoustic frames to high-level representations, the Predictor acts as an internal language model over previously emitted labels, and the Joint Network combines the two into a distribution over output labels. A minimal sketch of how these modules fit together appears below.

AssemblyAI recently transitioned its core transcription model from a CTC model to a Transducer model, achieving substantially greater accuracy. The new model replaces recurrent neural networks with Transformers, in particular the Conformer variant. What makes Transformers so powerful is their ability to model global features in sequential data. For speech, however, it makes sense to capture not only global features but local ones as well, since acoustic features are more strongly correlated with adjacent frames than with distant ones. The Conformer is a Transformer variant built for exactly this, first introduced in the paper "Conformer: Convolution-augmented Transformer for Speech Recognition."
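To make the three modules concrete, here is a minimal RNN-T sketch in PyTorch. The module sizes, the use of LSTMs, and names like `Transducer` are illustrative assumptions rather than AssemblyAI's production architecture; the point is only how the Encoder, Predictor, and Joint Network fit together.

```python
import torch
import torch.nn as nn

class Transducer(nn.Module):
    def __init__(self, num_mels=80, vocab_size=29, hidden=256):
        super().__init__()
        # Encoder: maps acoustic frames to high-level representations.
        self.encoder = nn.LSTM(num_mels, hidden, batch_first=True)
        # Predictor: an internal "language model" that only sees
        # previously emitted labels, never the audio.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: fuses encoder and predictor outputs into a
        # distribution over the vocabulary plus a blank label.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size + 1),  # +1 for blank
        )

    def forward(self, audio_feats, labels):
        enc, _ = self.encoder(audio_feats)            # (B, T, H)
        pred, _ = self.predictor(self.embed(labels))  # (B, U, H)
        # Pair every audio frame with every label position.
        enc = enc.unsqueeze(2)    # (B, T, 1, H)
        pred = pred.unsqueeze(1)  # (B, 1, U, H)
        joint_in = torch.cat(
            [enc.expand(-1, -1, pred.size(2), -1),
             pred.expand(-1, enc.size(1), -1, -1)],
            dim=-1,
        )
        return self.joint(joint_in)  # (B, T, U, vocab_size + 1)

model = Transducer()
# 2 utterances, 100 frames of 80-dim features, 12-label transcripts.
logits = model(torch.randn(2, 100, 80), torch.randint(0, 29, (2, 12)))
print(logits.shape)  # torch.Size([2, 100, 12, 30])
```

In training, the resulting lattice of logits would be fed to an RNN-T loss (for example, `torchaudio.functional.rnnt_loss`), which marginalizes over all alignments much like CTC does but, through the Predictor, conditions on the label history that CTC ignores.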
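The Conformer block mentioned above can be sketched just as compactly. The version below is a condensed reading of the block structure described in the Conformer paper, with two half-step feed-forward modules sandwiching self-attention and a convolution module; the dimensions, kernel size, and pared-down convolution module are illustrative assumptions.

```python
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel_size=31):
        super().__init__()
        self.ff1 = self._feed_forward(dim)
        self.attn_norm = nn.LayerNorm(dim)
        # Self-attention models global context across the utterance.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Depthwise convolution models local acoustic patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, 1),  # pointwise channel mixing
        )
        self.ff2 = self._feed_forward(dim)
        self.final_norm = nn.LayerNorm(dim)

    @staticmethod
    def _feed_forward(dim):
        return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                             nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, T, dim)
        x = x + 0.5 * self.ff1(x)              # first macaron half
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]          # global features
        c = self.conv_norm(x).transpose(1, 2)  # (B, dim, T) for Conv1d
        x = x + self.conv(c).transpose(1, 2)   # local features
        x = x + 0.5 * self.ff2(x)              # second macaron half
        return self.final_norm(x)
```

The residual branches make the division of labor explicit: the attention branch attends over the whole sequence, while the depthwise convolution only mixes information within its kernel window, providing the local modeling that plain self-attention lacks.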
Company: AssemblyAI
Date published: Nov. 5, 2021
Author(s): Michael Nguyen, Kevin Zhang
Word count: 1164
Language: English
Hacker News points: 8