Combining Speech Recognition and Diarization in one model
Researchers from Carnegie Mellon University and Università Politecnica delle Marche propose an approach that combines Speaker Diarization (SD) and Automatic Speech Recognition (ASR) in a unified end-to-end framework. The goal is to simplify the speech processing pipeline while maintaining accurate speaker attribution and transcription. Traditional pipelines that couple SD and ASR rely on many distinct models, which causes practical problems: hyperparameters are hard to tune, the system is hard to evaluate, computational overhead grows, and errors propagate from one stage to the next. The proposed method, SLIDAR, takes a two-step approach to SD+ASR: it first analyzes fixed-length speech windows independently, then applies a clustering mechanism to assign speaker identities. Because each window is processed on its own, computational cost stays linear in recording length. The model performs comparably to state-of-the-art methods despite using significantly less supervised training data.
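The two-step idea in the summary (independent fixed-length windows, then clustering of speaker identities) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-window speaker embeddings, the 0.8 similarity threshold, and the greedy cosine-similarity clustering are all assumptions chosen to keep the example short and dependency-free.

```python
import math


def split_into_windows(duration_s, window_s, hop_s):
    """Return (start, end) times of fixed-length windows covering a recording.

    The number of windows grows linearly with duration_s.
    """
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        start += hop_s
    return windows


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_global_speakers(window_outputs, threshold=0.8):
    """Map per-window local speaker tags to recording-level speaker ids.

    window_outputs: one dict per window, {local_tag: embedding}.
    Hypothetical stand-in for the clustering step: each embedding joins the
    most similar existing cluster if similarity >= threshold, otherwise it
    starts a new cluster.
    """
    centroids = []  # running sum of embeddings per global speaker
    mapping = []
    for local in window_outputs:
        window_map = {}
        for tag, emb in local.items():
            sims = [cosine(emb, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__, default=None)
            if best is not None and sims[best] >= threshold:
                centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
            else:
                centroids.append(list(emb))
                best = len(centroids) - 1
            window_map[tag] = best
        mapping.append(window_map)
    return mapping
```

In this sketch, local tags such as "spk1" only have meaning inside one window, so the same person may be "spk1" in one window and "spk2" in the next. The clustering step is what ties those local tags to a single recording-level identity.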
Company
AssemblyAI
Date published
Oct. 27, 2023
Author(s)
Marco Ramponi
Word count
915
Hacker News points
None found.
Language
English