The third generation of machine learning (ML) architectures improves performance by exploiting the full bandwidth of distributed memory. Distributed libraries such as Ray Datasets and Ray Train offer greater programmability and lower operational and development overheads, while achieving better performance by passing data in memory and pipelining ingest with training. A key capability of these architectures is composability: developers can combine existing distributed systems to solve complex problems. The example code snippet below shows how a Ray Dataset pipeline can be connected to a distributed Ray Train job, expressing the ML ingest and training pipeline in just a few lines of Python. Compared to second-generation architectures, this approach yields lower operational and development overheads, better performance, and greater programmability.
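The sketch below illustrates this composition using the Ray Data and Ray Train APIs from a recent Ray 2.x release; the bucket path, preprocessing step, batch size, and worker count are illustrative placeholders rather than values from the original snippet, and the exact API surface may differ across Ray versions.

```python
# A minimal sketch: compose Ray Data (ingest) with Ray Train (distributed training).
# Paths and hyperparameters are hypothetical; assumes a recent Ray 2.x release.
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each training worker receives a streamed shard of the "train" dataset,
    # so ingest and training overlap without materializing data to disk.
    shard = train.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        for batch in shard.iter_batches(batch_size=config["batch_size"]):
            pass  # forward/backward pass over `batch` would go here


# Ingest: read a Parquet dataset and apply a (placeholder) preprocessing step.
ds = ray.data.read_parquet("s3://example-bucket/training-data")  # hypothetical path
ds = ds.map_batches(lambda batch: batch)  # replace with real preprocessing

# Training: pass the dataset directly to a distributed Ray Train job.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 2, "batch_size": 256},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    datasets={"train": ds},
)
result = trainer.fit()
```

Because the dataset object is handed to the trainer in the same Python program, data moves between the ingest and training stages through Ray's distributed object store rather than through external storage, which is the in-memory, pipelined data passing the text describes.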