Company
Date Published
Author
Tom Nijhof
Word count
1340
Language
English
Hacker News points
None

Summary

The author, a back-end developer at CytoSMART, built a full-text search system for a graph database containing millions of chemical compounds. The goal was to link the different synonyms of the same compound to each other, allowing users to find similar chemicals. The stack consisted of Neo4j as the graph database, Lucene for full-text search, and Python scripts to wrangle the data and load it into the database. After loading 197M synonyms and 57M compounds, the developer created a full-text index on the synonym nodes and implemented several query variants: basic fuzzy matching, returning one synonym per compound, and optimized queries that limit the number of results to reduce processing time. The system also handles synonyms that are associated with multiple compounds, returning all related compounds without duplicates.
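
The query pattern described in the summary can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the index name `synonymsFullText`, the `Compound` label, the `IS_ATTRIBUTE_OF` relationship, and the `id` and `name` properties are all hypothetical placeholders. The only real API referenced is Neo4j's `db.index.fulltext.queryNodes` procedure and Lucene's `~` fuzzy operator. The sketch only builds the Cypher text and its parameters, so it runs without a database.

```python
import re

# All graph names below are hypothetical placeholders, not taken from the post.
INDEX_NAME = "synonymsFullText"  # assumed full-text index on the synonym nodes

# Characters that carry meaning in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def fuzzy_term(text: str) -> str:
    """Escape Lucene special characters and mark every word as fuzzy with '~'."""
    return " ".join(
        _LUCENE_SPECIAL.sub(r"\\\1", word) + "~" for word in text.split()
    )


def build_search(text: str, limit: int = 10):
    """Return (cypher, params) for a fuzzy synonym search; no database access."""
    cypher = (
        "CALL db.index.fulltext.queryNodes($index, $query) YIELD node, score\n"
        # Cap the number of synonym hits before expanding to compounds.
        "WITH node, score LIMIT $limit\n"
        # Assumed relationship type and label linking synonyms to compounds.
        "MATCH (node)-[:IS_ATTRIBUTE_OF]->(c:Compound)\n"
        # Aggregating per compound yields each compound once, even when
        # several of its synonyms matched.
        "RETURN c.id AS compound, collect(node.name) AS synonyms, "
        "max(score) AS best_score\n"
        "ORDER BY best_score DESC"
    )
    params = {"index": INDEX_NAME, "query": fuzzy_term(text), "limit": limit}
    return cypher, params
```

With the official Neo4j Python driver, the pair returned by `build_search` could be passed to `session.run(cypher, params)`. Placing `LIMIT` before the `MATCH` is one way to keep the expansion to compounds cheap; whether it matches the optimization used in the post is an assumption.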