Company
Date Published
Author
Stephanie Wang
Word count
1675
Language
English
Hacker News points
None

Summary

A distributed shuffle is a data-intensive operation that usually calls for a purpose-built system, yet it can be expressed in just a few lines of Python using Ray, a general-purpose framework whose core API contains no shuffle primitives. Shuffling a small dataset is simple enough to do in memory, but larger datasets introduce new challenges: coordinating multiple nodes, spilling to external storage, and enforcing memory limits. The common approach to shuffling a large dataset is to split execution into a map phase and a reduce phase, with the data shuffled between the two. For small shuffles, Ray manages parallelism and task dependencies automatically; for larger shuffles, limited memory capacity forces tasks to be scheduled in waves, with objects spilled to external storage during the map phase and restored during the reduce phase. Ray's scheduler accounts for both CPU and memory availability to prevent deadlocks and enforce memory limits. The framework provides a high-performance shared-memory object store, spills objects transparently to external storage, applies admission control during task scheduling, and is designed to scale to large datasets, such as sorting 100TB.
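
The map/reduce shuffle pattern described above can be sketched in plain Python. This is a minimal, in-memory stand-in, not the article's actual Ray code: `map_task`, `reduce_task`, and `shuffle` are hypothetical names, and hash partitioning is one common choice of partitioning scheme. With Ray, the two task functions would be decorated with `@ray.remote`, the comprehensions would launch distributed tasks via `.remote()`, and the results would be fetched with `ray.get()`.

```python
from collections import defaultdict

def map_task(block, num_reducers):
    """Map phase: partition one input block by destination reducer
    (hash partitioning), producing one partition per reducer."""
    partitions = defaultdict(list)
    for record in block:
        partitions[hash(record) % num_reducers].append(record)
    return [partitions[r] for r in range(num_reducers)]

def reduce_task(pieces):
    """Reduce phase: merge the pieces destined for one reducer.
    Sorting stands in for whatever per-reducer work the job needs."""
    return sorted(record for piece in pieces for record in piece)

def shuffle(blocks, num_reducers):
    # Map phase: each block is split into num_reducers partitions.
    map_out = [map_task(block, num_reducers) for block in blocks]
    # Shuffle + reduce: reducer r gathers partition r from every mapper.
    return [reduce_task([out[r] for out in map_out])
            for r in range(num_reducers)]

blocks = [[5, 3, 8], [1, 9, 2], [7, 4, 6]]
result = shuffle(blocks, num_reducers=2)
```

The all-to-all data movement lives in the inner comprehension of the reduce step: every reducer reads one partition from every mapper, which is exactly the exchange that, at scale, requires wave scheduling and spilling to external storage.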