Online Resource Allocation with Ray at Ant Group
Ant Group has built a flexible, high-performance, stable, and scalable online resource allocation system on Ray to support Double 11, the world's largest online shopping event. The deployment has grown to more than 6,000 CPU cores, and the system now serves several application scenarios, including marketing and order allocation. The solution is a complex piece of engineering that draws on both offline and real-time data, and its algorithm combines real-time and iterative computation. Ray provides a simple, easy-to-use API, makes resource scheduling convenient, and recovers from faults within seconds, which keeps the service available. The Ray-based online resource allocation solution has been running stably at Ant Group and has supported large-scale events such as Double 11 and Double 12.
Company: Anyscale
Date published: March 30, 2021
Author(s): Xingyu Lu, Yang Liu, Tengwei Cai, Fengbin Fang
Word count: 1838
Hacker News points: 1
Language: English