Ray Datasets for large-scale machine learning ingest and scoring

Company

Anyscale

Date Published

Feb. 14, 2022

Author

Clark Zinzow, Alex Wu, Jiajun Yao, Eric Liang, Chen Shen

Word count

1661

Language

English

Hacker News points

URL

www.anyscale.com/blog/ray-datasets-for-machine-learning-training-and-scoring

Summary

Ray Datasets is a data loading and preprocessing library built on Ray. It simplifies machine learning (ML) pipelines by providing a flexible, scalable API for working with data inside Ray applications. It uses Ray's task, actor, and object APIs to support large-scale ML ingest, training, and inference within a single Python application. Datasets supports popular storage backends and file formats as well as common ML preprocessing operations, and it works with Ray-integrated libraries and ML frameworks such as TensorFlow and Torch. The library aims to be a universal parallel data loader that acts as a narrow waist: a single common data interface for Ray applications and libraries. It also supports running stateful computations on GPUs, which enables batch inference on large datasets. Because Ray already provides a robust distributed dataplane, Datasets delegates most of the heavy lifting to it and focuses on higher-level features such as convenient APIs, data format support, and stage pipelining.
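The stateful batch inference pattern mentioned in the summary can be illustrated with a minimal, standard-library-only sketch. This is not the Ray Datasets API: it assumes a thread pool in place of Ray actors, and every name in it (`DoublingModel`, `split_into_blocks`, `batch_inference`) is hypothetical. It only shows the idea of splitting data into blocks and reusing one expensive-to-create model per worker across many blocks.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class DoublingModel:
    """Hypothetical stand-in for a model that is expensive to load."""

    instances_created = 0
    _lock = threading.Lock()

    def __init__(self):
        # Count constructions so we can see that setup is not repeated per block.
        with DoublingModel._lock:
            DoublingModel.instances_created += 1

    def __call__(self, batch):
        # "Inference" here is just doubling each value in the batch.
        return [2 * x for x in batch]


def split_into_blocks(items, block_size):
    """Partition a list into consecutive blocks of at most block_size items."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


# Per-thread storage: each worker thread keeps its own model instance.
_worker_state = threading.local()


def _score_block(block):
    # Create the model lazily, once per worker thread, then reuse it.
    model = getattr(_worker_state, "model", None)
    if model is None:
        model = _worker_state.model = DoublingModel()
    return model(block)


def batch_inference(items, block_size=4, num_workers=2):
    """Score all items in parallel, block by block, preserving input order."""
    blocks = split_into_blocks(items, block_size)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        scored_blocks = list(pool.map(_score_block, blocks))
    return [y for block in scored_blocks for y in block]


predictions = batch_inference(list(range(10)))
```

In this sketch the model is constructed at most once per worker rather than once per block, which is the property that makes stateful workers attractive when model setup (for example, loading weights onto a GPU) is costly.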