Gang Scheduling Ray Clusters on Kubernetes with Multi-Cluster-App-Dispatcher (MCAD)

Company

Anyscale

Date Published

Nov. 16, 2022

Author

Abhishek Malvankar (IBM Research) and Dmitri Gekhtman (Anyscale)

Word count

1406

Language

English

Hacker News points

URL

www.anyscale.com/blog/gang-scheduling-ray-clusters-on-kubernetes-with-multi-cluster-app-dispatcher

Summary

Large-scale machine learning workloads on Kubernetes often suffer from the lack of a resource reservation system, which leads to gang scheduling issues in which jobs are stuck waiting for resources to become available. KubeRay with the Multi-Cluster-App-Dispatcher (MCAD) controller helps avoid such situations. MCAD is a Kubernetes controller that provides mechanisms for applications to manage batch jobs in a single Kubernetes cluster or in a multi-cluster environment. It queues each Ray workload until the aggregated resources it requires are available in one of the Kubernetes clusters, ensuring that all of its pods can be scheduled. With KubeRay and MCAD, users can scale their Python and AI applications from a laptop to a cluster seamlessly, using gang scheduling and workload pre-emption capabilities.
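
The queuing idea in the summary can be illustrated with a small toy model. This is only a sketch of all-or-nothing (gang) admission against an aggregate CPU budget, not MCAD's actual implementation: the `Workload` class, the `dispatch` function, the CPU-only accounting, and the strict first-in-first-out ordering are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical stand-in for a queued Ray cluster request."""
    name: str
    pods: int          # total pods (head + workers)
    cpus_per_pod: int  # CPU request per pod


def dispatch(queue, free_cpus):
    """Admit workloads in FIFO order, all pods or none.

    Assumption: a workload is admitted only when the aggregate CPU
    request of all its pods fits in the remaining free capacity.
    """
    admitted = []
    while queue:
        needed = queue[0].pods * queue[0].cpus_per_pod
        if needed > free_cpus:
            break  # head of the queue keeps waiting; no partial admission
        free_cpus -= needed
        admitted.append(queue.popleft().name)
    return admitted, free_cpus


queue = deque([
    Workload("ray-a", pods=4, cpus_per_pod=2),  # needs 8 CPUs
    Workload("ray-b", pods=3, cpus_per_pod=2),  # needs 6 CPUs
    Workload("ray-c", pods=1, cpus_per_pod=1),  # needs 1 CPU
])
admitted, remaining = dispatch(queue, free_cpus=10)
print(admitted, remaining)
```

In this toy run, `ray-b` stays queued as a whole rather than starting some of its pods, which is the behavior gang scheduling is meant to guarantee. `ray-c` also waits behind it because the sketch assumes strict ordering with no backfilling.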