Company
Date Published
Author
Nathan Smith
Word count
1364
Language
English
Hacker News points
None

Summary

A senior data scientist at Neo4j explores using embeddings to represent string edit distance in Neo4j, a graph database. The author uses a convolutional neural network (CNN) to create an embedding vector that represents the spelling of a string, allowing strings that differ by only a few edits to be compared efficiently. After generating the embeddings and loading them into Neo4j, the author creates a graph projection and runs k-nearest neighbors (KNN) to identify pairs of strings with low edit distance. The results are compared with record linkage methods: the CNN approach is faster but may miss certain pairs due to its reliance on a specific blocking scheme. The author concludes that embeddings show potential for inclusion in graph-based entity resolution pipelines.
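
To make the idea concrete, here is a minimal pure-Python sketch of the general pattern the summary describes: embed each string, find nearest neighbors in embedding space, then check the candidates with a true edit distance. It is not the author's method. The character-bigram count vector is a simple stand-in for the CNN embedding, the brute-force neighbor search is a stand-in for the KNN step run inside Neo4j, and the function names, sample names, and threshold of 2 are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigram_embedding(s: str) -> Counter:
    """Hypothetical stand-in embedding: character bigram counts (not a CNN)."""
    padded = f"^{s.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def knn_candidate_pairs(names, k=2):
    """Brute-force k-nearest neighbors; returns a set of unordered pairs."""
    vecs = {name: bigram_embedding(name) for name in names}
    pairs = set()
    for name in names:
        ranked = sorted((other for other in names if other != name),
                        key=lambda other: cosine(vecs[name], vecs[other]),
                        reverse=True)
        for other in ranked[:k]:
            pairs.add(tuple(sorted((name, other))))
    return pairs


# Illustrative data only; not taken from the article.
names = ["jonathan", "jonathon", "johnathan", "nathan", "margaret", "margret"]
candidates = knn_candidate_pairs(names, k=2)

# Keep only the candidates whose true edit distance is small.
close_pairs = sorted(p for p in candidates if levenshtein(*p) <= 2)
```

The point of the design is that the expensive edit-distance check runs only on the pairs the neighbor search proposes, rather than on every pair of strings. The trade-off is that a pair with low edit distance is missed whenever neither string appears among the other's k nearest neighbors.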