Indexing All of Wikipedia on a Laptop

What's this blog post about?

Cohere has released a dataset containing all of Wikipedia, chunked and embedded as vectors with its multilingual-v3 model. For the first time, this makes it practical for an individual to build a semantic, vector-based index of Wikipedia. The JVector library now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. By using Locally-Adaptive Vector Quantization (LVQ) compression, JVector improves on previous methods, allowing faster searches while maintaining accuracy. Together, these changes make it possible to index all of English Wikipedia on a laptop, which was previously impractical.
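To give a feel for the kind of compression mentioned above, here is a minimal Python sketch of LVQ-style scalar quantization: each vector is centered on the dataset mean and then quantized to 8-bit codes using its own minimum and maximum. This is an illustration only, not JVector's implementation; the function names (dataset_mean, lvq_encode, lvq_decode) and the 8-bit setting are assumptions made for the example.

```python
from typing import List, Tuple


def dataset_mean(vectors: List[List[float]]) -> List[float]:
    """Per-dimension mean of the dataset, used to center every vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lvq_encode(vec: List[float], mean: List[float],
               bits: int = 8) -> Tuple[List[int], float, float]:
    """Quantize one mean-centered vector using its own min/max bounds.

    Returns the integer codes plus the per-vector offset and step size
    needed to reconstruct approximate values later.
    """
    centered = [x - m for x, m in zip(vec, mean)]
    lo, hi = min(centered), max(centered)
    levels = (1 << bits) - 1
    # Guard against a constant vector, where hi == lo.
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((c - lo) / step))) for c in centered]
    return codes, lo, step


def lvq_decode(codes: List[int], lo: float, step: float,
               mean: List[float]) -> List[float]:
    """Rebuild an approximation of the original vector from its codes."""
    return [lo + q * step + m for q, m in zip(codes, mean)]


if __name__ == "__main__":
    # Toy data standing in for real embeddings (hypothetical values).
    vectors = [[0.1, 0.5, -0.3, 0.9], [1.0, 1.2, 0.8, 1.1]]
    mean = dataset_mean(vectors)
    for v in vectors:
        codes, lo, step = lvq_encode(v, mean)
        approx = lvq_decode(codes, lo, step, mean)
        worst = max(abs(a - b) for a, b in zip(v, approx))
        print(codes, f"max reconstruction error: {worst:.6f}")
```

Because the bounds are chosen per vector, the reconstruction error in each dimension is at most half of that vector's own step size. In this sketch, each float is replaced by a one-byte code plus two extra numbers per vector.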

Company
DataStax

Date published
June 5, 2024

Author(s)
Jonathan Ellis

Word count
1581

Language
English

Hacker News points
None found.