Company
Date Published
Author
Stephanie Wang
Word count
1675
Language
English
Hacker News points
None

Summary

A distributed shuffle is a data-intensive operation that usually calls for a purpose-built system, yet it can be expressed in just a few lines of Python using Ray, a general-purpose framework whose core API contains no shuffle primitives. Shuffling a small dataset is simple enough to do in memory, but larger datasets introduce new challenges: coordinating multiple nodes, spilling to external storage, and enforcing memory limits. The common approach to shuffling a large dataset is to split execution into a map phase and a reduce phase, with the data shuffled between the two. For small shuffles, Ray manages parallelism and task dependencies automatically; for larger shuffles, limited memory capacity forces tasks to be scheduled in waves, with objects spilled to external storage during the map phase and restored during the reduce phase. Ray's scheduler accounts for both CPU and memory availability to prevent deadlocks and enforce memory limits. The framework provides a high-performance shared-memory object store, spills objects transparently to external storage, applies admission control during task scheduling, and is designed to scale to large datasets, such as sorting 100TB.
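
The map/reduce shuffle pattern described above can be sketched in plain Python. This is a minimal, in-memory stand-in, not the article's actual Ray code: `map_task`, `reduce_task`, and `shuffle` are hypothetical names, and hash partitioning is one common choice of partitioning scheme. With Ray, the two task functions would be decorated with `@ray.remote`, the comprehensions would launch distributed tasks via `.remote()`, and the results would be fetched with `ray.get()`.

```python
from collections import defaultdict

def map_task(block, num_reducers):
    """Map phase: partition one input block by destination reducer
    (hash partitioning), producing one partition per reducer."""
    partitions = defaultdict(list)
    for record in block:
        partitions[hash(record) % num_reducers].append(record)
    return [partitions[r] for r in range(num_reducers)]

def reduce_task(pieces):
    """Reduce phase: merge the pieces destined for one reducer.
    Sorting stands in for whatever per-reducer work the job needs."""
    return sorted(record for piece in pieces for record in piece)

def shuffle(blocks, num_reducers):
    # Map phase: each block is split into num_reducers partitions.
    map_out = [map_task(block, num_reducers) for block in blocks]
    # Shuffle + reduce: reducer r gathers partition r from every mapper.
    return [reduce_task([out[r] for out in map_out])
            for r in range(num_reducers)]

blocks = [[5, 3, 8], [1, 9, 2], [7, 4, 6]]
result = shuffle(blocks, num_reducers=2)
```

The all-to-all data movement lives in the inner comprehension of the reduce step: every reducer reads one partition from every mapper, which is exactly the exchange that, at scale, requires wave scheduling and spilling to external storage.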