
Text Cleaning for ASR: The Case of Turkish

What's this blog post about?

Text cleaning is an essential step in natural language processing that prepares training data for automatic speech recognition (ASR) systems. It transforms raw text into a "cleaner" version that is closer to the phonetics of what was actually said. The process is language-dependent and requires a multi-step pipeline to produce accurate transcriptions. For Turkish, the challenges include handling the apostrophe, consonant assimilation, vowel harmony rules, and the processing of currencies and numbers. Text cleaning is crucial for ASR training because it improves transcription accuracy by ensuring a good match between what was spoken and the text used to transcribe it.
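To make the idea of a multi-step pipeline concrete, here is a minimal Python sketch of what a few such steps could look like for Turkish. It is not the pipeline from the blog post: the function names (`number_to_turkish`, `clean_turkish`), the choice to simply delete the apostrophe, the restriction to integers from 0 to 999, and the "TL"/"₺" to "lira" mapping are all assumptions made for illustration.

```python
import re

# Turkish has dotted and dotless i, so plain str.lower() is not enough:
# "I" must become "ı" and "İ" must become "i".
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})

ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli",
        "altmış", "yetmiş", "seksen", "doksan"]


def number_to_turkish(n: int) -> str:
    """Spell out an integer in the range 0..999 (illustrative range only)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n == 0:
        return "sıfır"
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if hundreds == 1:
        parts.append("yüz")            # 100 is "yüz", not "bir yüz"
    elif hundreds > 1:
        parts.extend([ONES[hundreds], "yüz"])
    if tens:
        parts.append(TENS[tens])
    if ones:
        parts.append(ONES[ones])
    return " ".join(parts)


def _expand_number(match: re.Match) -> str:
    """Replace a digit run with words; leave out-of-range numbers untouched."""
    value = int(match.group())
    return number_to_turkish(value) if value <= 999 else match.group()


def clean_turkish(text: str) -> str:
    """Hypothetical cleaning pass: casing, apostrophes, currency, numbers."""
    # 1. Turkish-aware lowercasing.
    text = text.translate(_TR_LOWER).lower()
    # 2. Assumption: drop the apostrophe that separates a stem from its suffix.
    text = re.sub(r"(?<=\w)['’](?=\w)", "", text)
    # 3. Assumption: read "TL" or "₺" after a number as "lira".
    text = re.sub(r"(\d+)\s*(?:tl\b|₺)", r"\1 lira", text)
    # 4. Spell out small integers.
    text = re.sub(r"\d+", _expand_number, text)
    # 5. Replace remaining punctuation with spaces and collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_turkish("Ankara'ya bilet 25 TL."))
```

The order of the steps matters in this sketch: lowercasing runs first so that "TL" can be matched as "tl", and the apostrophe is removed before numbers are expanded so that a suffix written after a digit stays attached to the spelled-out word. Consonant assimilation and vowel harmony are not modeled here; the sketch keeps whatever suffix the writer already chose.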

Company
Deepgram

Date published
Aug. 30, 2022

Author(s)
Morris Gevirtz

Word count
2160

Language
English

Hacker News points
None found.


By Matt Makai. 2021-2024.