Date Published: May 1, 2024
Author: Together AI
Word count: 2248
Language: English
Hacker News points: None

Summary

The RedPajama-V2 dataset is a 30-trillion-token web dataset designed for training large language models (LLMs). It is not intended to be used out of the box; rather, it serves as a foundation for creating high-quality datasets. Its design favors high recall over high precision, enabling researchers to experiment with different data selection techniques and discover recipes that produce downstream models with desired properties.

To facilitate its use, the dataset ships with quality signals, duplicate IDs, and minhash signatures, which can be used to filter out low-quality documents, remove exact duplicates, and perform fuzzy deduplication. The data is split into head, middle, and tail partitions based on perplexity, with the head and middle partitions containing higher-quality data than the tail. Users can also filter the dataset by URL domain or date using metadata fields extracted by the CCNet pipeline.

The documentation provides an example of removing documents that contain ellipses and explains how to interpret the format of the quality signals. To ensure retries when downloading the dataset, users can pass a download config with the number of retries set to a value larger than 1.

The dataset's total size is approximately 260 TB, comprising four components: text documents, quality signals, minhash signatures, and duplicate IDs. It can serve as a starting point for creating high-quality datasets that resemble data from high-quality sources such as Wikipedia or OpenWebText.
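The ellipsis-filtering example mentioned above can be sketched as follows. This is a minimal illustration, assuming each quality signal is stored as a list of [start, end, score] spans over the document text (per the RedPajama-V2 documentation); the 0.1 threshold is an illustrative choice, not a recommendation from the dataset authors.

```python
# Sketch of filtering with one RedPajama-V2 quality signal.
# Assumption: document-level signals are a single [start, end, score] span.
ELLIPSIS_SIGNAL = "rps_doc_frac_lines_end_with_ellipsis"

def keep_document(quality_signals: dict, max_frac: float = 0.1) -> bool:
    """Return True if the document passes the ellipsis filter."""
    spans = quality_signals.get(ELLIPSIS_SIGNAL, [])
    # A document-level signal covers the whole text in one span; its
    # score here is the fraction of lines ending with an ellipsis.
    frac = spans[0][2] if spans else 0.0
    return frac is not None and frac <= max_frac

# Example record, shaped like an entry in the quality-signals files:
signals = {"rps_doc_frac_lines_end_with_ellipsis": [[0, 1024, 0.3]]}
keep_document(signals)  # False: 0.3 exceeds the 0.1 threshold
```

The same pattern extends to any of the other quality signals: read the span list, extract the score, and compare it against a threshold chosen for your use case.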
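Exact deduplication with the duplicate-IDs component reduces to a set-membership filter: drop every document whose ID appears in the duplicates list. A minimal sketch, assuming a hypothetical "doc_id" field on each document record:

```python
# Exact-deduplication sketch using the duplicate-IDs component.
# The "doc_id" field name is illustrative, not the dataset's exact schema.

def deduplicate(documents: list, duplicate_ids: list) -> list:
    """Drop documents whose id appears in the duplicates list."""
    dup_set = set(duplicate_ids)  # O(1) membership checks
    return [doc for doc in documents if doc["doc_id"] not in dup_set]

docs = [
    {"doc_id": "2023-06/0000/en_head.json.gz/0", "text": "first doc"},
    {"doc_id": "2023-06/0000/en_head.json.gz/1", "text": "second doc"},
]
dups = ["2023-06/0000/en_head.json.gz/1"]
deduplicate(docs, dups)  # keeps only the first document
```

Fuzzy deduplication works analogously but uses the precomputed minhash signatures to group near-duplicate documents before choosing which copy to keep.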
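Filtering by URL domain via the CCNet metadata can be sketched the same way. This assumes a hypothetical "url" metadata field and an illustrative allow-list; the actual CCNet field names are documented with the dataset.

```python
from urllib.parse import urlparse

# Metadata-based filtering sketch: keep only documents from allowed
# domains. The "url" field name and the allow-list are illustrative.
ALLOWED_DOMAINS = {"en.wikipedia.org"}

def from_allowed_domain(doc: dict) -> bool:
    """Return True if the document's source URL is on the allow-list."""
    host = urlparse(doc["url"]).netloc
    return host in ALLOWED_DOMAINS

docs = [
    {"url": "https://en.wikipedia.org/wiki/Language_model"},
    {"url": "https://example.com/page"},
]
[d for d in docs if from_allowed_domain(d)]  # keeps only the Wikipedia doc
```

Filtering by date follows the same shape, comparing a CCNet-extracted date field against a cutoff instead of matching the URL's host.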