The Sphere Dataset in Weaviate
Meta has released an open-source dataset called Sphere, which consists of 134 million documents broken up into 906 million 100-word snippets. It is one of the largest knowledge bases available for knowledge-intensive natural language tasks such as question answering and fact-checking. The dataset aims to act as a "universal, uncurated and unstructured source of knowledge." However, Sphere's sheer size makes it hard for the average developer to access and use in its current open-source format. To make this resource more accessible, Weaviate now offers Sphere as JSON or Parquet files that can be easily imported with Python and Spark.
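At this scale, a file is usually read as a stream rather than loaded whole. The following is a minimal sketch of that idea in plain Python, assuming the JSON export stores one object per line (JSON Lines); the actual file layout is not described here, and the helper names and the `sphere_sample.jsonl` path are hypothetical.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line (assumed JSON Lines)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(records: Iterable[dict], size: int) -> Iterator[list]:
    """Group records into lists of at most `size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def stream_batches(path: str, size: int = 100) -> Iterator[list]:
    """Read a file lazily and yield fixed-size batches of records."""
    with open(path, encoding="utf-8") as f:
        yield from batched(iter_records(f), size)


if __name__ == "__main__":
    # Hypothetical file name; replace with the path of a downloaded file.
    for batch in stream_batches("sphere_sample.jsonl", size=100):
        print(f"read a batch of {len(batch)} records")
```

Each batch can then be handed to whatever import routine is in use, so memory use stays bounded by the batch size rather than the file size.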
Company
Weaviate
Date published
Dec. 6, 2022
Author(s)
Zain Hasan
Word count
1129
Language
English
Hacker News points
None found.