The Sphere Dataset in Weaviate

Post Details

Company

Weaviate

Date Published

Dec. 6, 2022

Author

Zain Hasan

Word Count

1,129

Language

English

Hacker News Points

-

Source URL

weaviate.io/blog/sphere-dataset-in-weaviate

Summary

Meta has released an open-source dataset called Sphere, which consists of 134 million documents split into 906 million 100-word snippets. It is one of the largest knowledge bases available for knowledge-intensive natural language tasks such as question answering and fact-checking. The dataset aims to act as a "universal, uncurated and unstructured source of knowledge." However, its sheer size makes Sphere difficult for the average developer to access and use in its original open-source format. To make this resource more accessible, Weaviate now offers Sphere as JSON or Parquet files that can be easily imported with Python and Spark.
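As a rough illustration of what importing with Python can look like at this scale, the sketch below streams a file one record at a time and groups records into batches, so the whole file is never held in memory. It assumes the JSON export is line-delimited (one JSON object per line); the file name, the batch size, and the helper names `read_jsonl` and `batched` are hypothetical and not taken from the original post. The actual upload step to Weaviate is deliberately left out.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line without loading the whole file.

    Assumption: the file holds one JSON object per line.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(records: Iterable[dict], batch_size: int) -> Iterator[list]:
    """Group records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Hypothetical file name; replace it with the path of a downloaded file.
    for batch in batched(read_jsonl("sphere_sample.jsonl"), batch_size=100):
        # Each batch would be handed to an importer at this point.
        print(f"read a batch of {len(batch)} records")
```

Batching keeps memory use bounded by the batch size rather than the file size, which matters for a dataset with hundreds of millions of snippets.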