Company
Date Published
Author
Hao Zhang, Richard Liaw
Word count
1494
Language
English
Hacker News points
None

Summary

Ray 1.2.0 introduces collective communication primitives that let many distributed processes exchange information simultaneously, significantly speeding up certain distributed operations. These primitives allow programs to express complex communication patterns among many processes while giving low-level control over the communication backend, so they can be optimized for different types of computing devices. The `allreduce` primitive is one of the most widely adopted collective operations and appears in many distributed ML training systems, including Horovod and distributed TensorFlow. Ray's native collective primitives, such as `ray.util.collective.allreduce()`, simplify code and deliver significant performance gains over compositions of the standard Ray APIs `ray.get()` and `ray.put()`. They also enable fast point-to-point communication between distributed GPUs and support a variety of collective communication backends.
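
As a rough illustration of the pattern the summary describes, the minimal sketch below runs an in-place `allreduce` across two Ray actors using `ray.util.collective`. It assumes the GLOO backend (CPU tensors via NumPy, which requires `pygloo` to be installed); the actor class, group name, and tensor shapes are illustrative, and exact argument names may differ across Ray versions. The GPU path described in the post would instead use the NCCL backend with GPU tensors.

```python
# Minimal sketch: two Ray actors form a collective group and allreduce a tensor.
# Assumes the GLOO backend (CPU/NumPy); swap backend="nccl" and GPU tensors for GPUs.
import numpy as np
import ray
import ray.util.collective as col


@ray.remote
class Worker:
    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size
        # Each worker holds a small tensor; values differ per rank.
        self.buffer = np.ones(4, dtype=np.float32) * (rank + 1)

    def setup(self):
        # Register this process as rank `self.rank` of a collective group.
        col.init_collective_group(self.world_size, self.rank,
                                  backend="gloo", group_name="default")

    def compute(self):
        # In-place allreduce: every rank ends up with the elementwise sum.
        col.allreduce(self.buffer, group_name="default")
        return self.buffer


ray.init()
world_size = 2
workers = [Worker.remote(i, world_size) for i in range(world_size)]
ray.get([w.setup.remote() for w in workers])      # all ranks must join the group
print(ray.get([w.compute.remote() for w in workers]))  # each rank returns [3. 3. 3. 3.]
```

Compared with shuttling tensors through `ray.put()` and `ray.get()`, the collective call moves data directly between the participating processes, which is where the performance gains reported in the post come from.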