Decompounding with language-specific lexicons | Algolia
Decompounding is the process of breaking a compound word into its meaningful parts. It is crucial for multilingual search engines, since many languages form new words by compounding. The task is to identify and separate the individual words inside a compound while preserving its original meaning, which is challenging because of language-specific rules and the risk of over-splitting. Various decompounding methods exist, including rule-based approaches, statistical methods, and machine learning techniques.

Algolia uses a lexicon-based approach: it builds a list of words (atoms) that should not be decomposed further, then splits compound words by finding the longest atoms that fit exactly within them. To build an effective lexicon, Algolia collects and processes large quantities of text in the target language using part-of-speech tagging and lemmatization, filters out infrequently used and very long words, and uses statistical properties of candidate atoms to remove unwanted compounds from the lexicon.

The decompounding algorithm itself is a greedy right-to-left longest-match approach that also handles language-specific linking morphemes between atoms. Compared with other methods such as SECOS, Algolia's approach achieves a close F1-score while being faster to compute. Algolia currently supports decompounding for six languages: Dutch, German, Finnish, Danish, Swedish, and Norwegian Bokmål, and customers can force the decompounding of specific words through customer dictionaries.
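The greedy right-to-left longest-match split with linking morphemes can be sketched as below. This is a minimal illustration of the general technique, not Algolia's implementation; the atom lexicon and the linking morphemes (e.g. the German "e" in "Hundehütte" = Hund + e + Hütte) are toy examples chosen for demonstration.

```python
def decompound(word, atoms, linking_morphemes=()):
    """Greedily split `word` right-to-left into atoms from the lexicon.

    At each position, take the longest atom that ends there; between
    atoms, optionally skip a linking morpheme. Returns the list of atoms
    in reading order, or None if no full decomposition is found.
    """
    parts = []
    end = len(word)
    while end > 0:
        match = None
        # start=0 yields the longest candidate ending at `end`, so the
        # first hit is the longest match.
        for start in range(end):
            candidate = word[start:end]
            if candidate in atoms:
                match = (start, candidate)
                break
        if match is None:
            return None  # cannot cover the whole word with atoms
        start, atom = match
        parts.append(atom)
        end = start
        # Optionally skip one linking morpheme before the next atom.
        # (A production implementation would backtrack if this greedy
        # skip makes the rest of the word undecomposable.)
        for m in sorted(linking_morphemes, key=len, reverse=True):
            if word[:end].endswith(m) and end - len(m) > 0:
                end -= len(m)
                break
    parts.reverse()
    return parts


# Toy German lexicon: "Hundehütte" (dog house) = Hund + e + Hütte.
print(decompound("hundehütte", {"hund", "hütte"}, ("e",)))  # → ['hund', 'hütte']
print(decompound("bildschirm", {"bild", "schirm"}, ("e",)))  # → ['bild', 'schirm']
```

The right-to-left direction matters in Germanic languages because the rightmost atom is typically the head of the compound, so matching it first tends to produce linguistically sensible splits.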
Company
Algolia
Date published
Aug. 2, 2023
Author(s)
Marc von Wyl
Word count
2404
Hacker News points
None found.
Language
English