Company
Date Published
Author
Abhishek Malvankar (IBM Research) and Dmitri Gekhtman (Anyscale)
Word count
1406
Language
English
Hacker News points
1

Summary

Large-scale machine learning workloads on Kubernetes often suffer from the lack of a resource reservation system, which leads to gang scheduling issues where jobs are stuck waiting for resources to become available. The Multi-Cluster-App-Dispatcher (MCAD) is a Kubernetes controller that provides mechanisms for applications to manage batch jobs in a single Kubernetes cluster or in a multi-cluster environment. Used with KubeRay, MCAD avoids stuck jobs by queuing each Ray workload until the aggregated resources it requires are available in one of the Kubernetes clusters, ensuring that all of its pods can be scheduled. With KubeRay and MCAD, users can scale their Python and AI applications from a laptop to a cluster seamlessly, using gang scheduling and workload pre-emption capabilities.
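
The queuing behavior described above can be illustrated with a small, self-contained Python sketch. This is not MCAD's implementation or API. The `Workload` class, the `dispatch` function, the CPU-only resource model, and the strict first-in-first-out policy are all assumptions made for illustration. The sketch shows only the gang admission idea: a workload leaves the queue when a single cluster has enough aggregated free capacity for all of its pods, and otherwise none of its pods are created.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical stand-in for a queued Ray cluster request."""
    name: str
    pods: int          # number of pods the workload needs
    cpus_per_pod: int  # CPU request of each pod


def dispatch(queue, free_cpus_by_cluster):
    """Dispatch workloads from the head of the queue while each one fits
    entirely in a single cluster. Returns a list of (workload, cluster) pairs.

    Assumption: strict FIFO order, so a head that does not fit blocks the
    workloads behind it. Real dispatchers may use other policies.
    """
    placements = []
    while queue:
        head = queue[0]
        needed = head.pods * head.cpus_per_pod
        # Pick the first cluster whose aggregated free CPUs cover the whole gang.
        target = next(
            (name for name, free in free_cpus_by_cluster.items() if free >= needed),
            None,
        )
        if target is None:
            break  # head stays queued; none of its pods are created
        free_cpus_by_cluster[target] -= needed
        placements.append((queue.popleft().name, target))
    return placements


# Usage: three queued workloads and two clusters with free CPUs.
queue = deque([
    Workload("ray-a", pods=4, cpus_per_pod=2),  # needs 8 CPUs
    Workload("ray-b", pods=8, cpus_per_pod=2),  # needs 16 CPUs
    Workload("ray-c", pods=1, cpus_per_pod=1),  # needs 1 CPU
])
free = {"cluster-1": 10, "cluster-2": 6}
placements = dispatch(queue, free)
print(placements)
print([w.name for w in queue])
```

In this run only `ray-a` is dispatched. `ray-b` needs more CPUs than any one cluster has free, so it waits in the queue instead of starting a subset of its pods. The sketch checks only aggregated capacity per cluster and ignores whether individual pods fit on individual nodes.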