
Decompounding with language-specific lexicons | Algolia

What's this blog post about?

Decompounding is the process of splitting a compound word into its meaningful parts. It matters for multilingual search engines because many languages form new words by compounding. The task is to identify the individual words inside a compound while preserving its original meaning, which is hard because the rules are language-specific and it is easy to over-decompose a word. Existing methods include rule-based approaches, statistical methods, and machine learning techniques.

Algolia uses a lexicon-based approach: it builds a list of words, called atoms, that should not be decomposed further, and then splits a compound by finding the longest atoms that fit exactly within it. To build an effective lexicon, Algolia collects large quantities of text in the target language and processes it with part-of-speech tagging and lemmatization. It filters out infrequently used words and very long words to improve lexicon quality, then uses statistical properties of the candidate atoms to remove unwanted compounds from the lexicon.

The decompounding algorithm itself is a greedy, right-to-left, longest-match procedure that accounts for language-specific linking morphemes between atoms. Comparing its results with other methods such as SECOS, Algolia finds that its method achieves a close F1-score with faster computation times. Algolia currently supports decompounding for six languages: Dutch, German, Finnish, Danish, Swedish, and Norwegian Bokmål. Customers can also force the decompounding of specific words through their customer dictionaries.
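The right-to-left longest-match idea described above can be illustrated with a minimal Python sketch. This is not Algolia's implementation: the function name `decompound`, the toy German lexicon `ATOMS`, the linking-morpheme list `LINKERS`, the minimum atom length, and the fallback to shorter atoms when the longest match leads nowhere are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical toy lexicon of atoms (words that must not be split further).
ATOMS: Set[str] = {"arbeit", "zimmer", "tür", "haus", "boot"}

# Illustrative German linking morphemes; "" means the atoms touch directly.
LINKERS: Tuple[str, ...] = ("", "s", "es", "n", "en")


def _split(word: str, end: int, atoms: Set[str],
           linkers: Tuple[str, ...], min_len: int) -> Optional[List[str]]:
    """Split word[:end] into atoms, scanning from the right, longest atom first."""
    if end < min_len:
        return None
    # A smaller start means a longer candidate atom ending at `end`.
    for start in range(0, end - min_len + 1):
        atom = word[start:end]
        if atom not in atoms:
            continue
        if start == 0:
            return [atom]  # the atom covers everything that was left
        # Allow an optional linking morpheme immediately before the atom.
        for link in linkers:
            if not word[:start].endswith(link):
                continue
            head = _split(word, start - len(link), atoms, linkers, min_len)
            if head is not None:
                return head + [atom]
        # Assumption of this sketch: fall back to the next shorter atom.
    return None


def decompound(word: str, atoms: Set[str] = ATOMS,
               linkers: Tuple[str, ...] = LINKERS,
               min_len: int = 3) -> Optional[List[str]]:
    """Return the atoms of `word`, or None if the lexicon cannot cover it."""
    word = word.lower()
    return _split(word, len(word), atoms, linkers, min_len)


if __name__ == "__main__":
    print(decompound("Hausboot"))          # ['haus', 'boot']
    print(decompound("Arbeitszimmer"))     # ['arbeit', 'zimmer']
    print(decompound("Arbeitszimmertür"))  # ['arbeit', 'zimmer', 'tür']
    print(decompound("xyz"))               # None
```

Because the longest atom wins, adding a compound such as "zimmertür" to the lexicon keeps it intact: with that extra atom, "Arbeitszimmertür" splits into "arbeit" and "zimmertür" rather than three parts. This is the sense in which the lexicon guards against over-decomposition.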

Company
Algolia

Date published
Aug. 2, 2023

Author(s)
Marc von Wyl

Word count
2404

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.