Wikipedia and Weaviate

Company

Weaviate

Date Published

Nov. 25, 2021

Author

Bob van Luijt

Word count

1439

Language

English

Hacker News points

None

URL

weaviate.io/blog/semantic-search-with-wikipedia-and-weaviate

Summary

This article describes how to run semantic search queries at large scale using a vector database. A backup of the complete English-language Wikipedia corpus, loaded into Weaviate, has been open-sourced and can serve as a starting point for similar vector and semantic search solutions in other projects. The dataset contains 11,348,257 articles, 27,377,159 paragraphs, and 125,447,595 graph cross-references. The article gives step-by-step instructions for importing the data into Weaviate, creating a schema for semantic search, and querying the data with GraphQL. It also discusses implementation strategies for bringing semantic search solutions to production, emphasizing scalability and the need for data, ML models, and a vector database.
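
The GraphQL querying step mentioned in the summary can be illustrated with a minimal sketch. It only builds the query text and the JSON request body, using the Python standard library. The class name `Paragraph`, the properties `title` and `content`, and the helper names `build_near_text_query` and `build_request_body` are assumptions made for illustration and are not taken from this summary. The `nearText` argument with a `concepts` list follows Weaviate's GraphQL `Get` syntax as commonly documented, but check it against the Weaviate version in use.

```python
import json


def build_near_text_query(class_name, concepts, properties, limit=5):
    """Build a Weaviate-style GraphQL Get query with a nearText argument.

    class_name and properties are hypothetical schema names; adapt them
    to the schema actually defined in your Weaviate instance.
    """
    # A JSON array of strings is also a valid GraphQL list-of-strings literal.
    concept_list = json.dumps(list(concepts))
    fields = " ".join(properties)
    return (
        "{ Get { "
        f"{class_name}(nearText: {{concepts: {concept_list}}}, limit: {int(limit)}) "
        f"{{ {fields} }} "
        "} }"
    )


def build_request_body(query):
    """Wrap the query in the JSON envelope a GraphQL HTTP endpoint expects."""
    return json.dumps({"query": query})


# Assumed example: search paragraphs semantically close to a concept.
query = build_near_text_query(
    "Paragraph", ["Italian cuisine"], ["title", "content"], limit=3
)
body = build_request_body(query)
print(query)
```

The resulting `body` would typically be sent as an HTTP POST to the instance's GraphQL endpoint (for Weaviate, usually `/v1/graphql`). The sketch stops short of the network call so that it runs without a live instance.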