Wikipedia and Weaviate
This article outlines how to run semantic search queries at scale using a vector database. A backup of the complete English-language Wikipedia corpus, imported into Weaviate, has been open-sourced and can be reused for similar vector and semantic search solutions in other projects. The dataset contains 11,348,257 articles, 27,377,159 paragraphs, and 125,447,595 graph cross-references. The article gives step-by-step instructions for importing the data into Weaviate, creating a schema for semantic search, and querying the data using GraphQL. It also discusses implementation strategies for bringing semantic search solutions to production, emphasizing scalability and the need for data, ML models, and a vector database.
Company
Weaviate
Date published
Nov. 25, 2021
Author(s)
Bob van Luijt
Word count
1439
Hacker News points
None found.
Language
English