Automatic and optimistic memory scheduling for ML workloads in Ray

Company

Anyscale

Date Published

March 2, 2023

Author

Clarence Ng, Jules S. Damji

Word count

2423

Language

English

Hacker News points

None

URL

www.anyscale.com/blog/automatic-and-optimistic-memory-scheduling-for-ml-workloads-in-ray

Summary

The OOM monitor is an out-of-memory (OOM) detection and prevention feature in Ray. It stops memory-intensive tasks and actors from exhausting a node's memory and degrading the whole cluster. It also aims to give Python users better visibility into memory usage, whether they use Ray native libraries or third-party Python libraries with Ray, so that machine learning engineers can observe and debug their applications more effectively. The monitor periodically inspects the collective memory usage on each worker node. Before an OOM event occurs, it terminates a task or actor as a preventive measure and reschedules it later if necessary. The policy for choosing which task to terminate has several steps: it prioritizes retriable tasks, groups tasks by caller, and picks one task from a group in a way that keeps the choice fair among callers. The feature is enabled by default in Ray 2.2 and 2.3, and is intended to prevent OOM errors while improving the observability and transparency of ML workloads.
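The selection policy described in the summary can be illustrated with a small plain-Python sketch. This is not Ray's implementation and it does not call any Ray API. The `RunningTask` type, its fields, and `pick_task_to_kill` are hypothetical names introduced for illustration. The summary does not say which group is chosen or which task within it, so the sketch assumes the caller with the most running tasks is chosen and that the most recently started task in that group is terminated.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningTask:
    """Hypothetical record of a task running on a node (not a Ray type)."""
    task_id: str
    caller_id: str      # who submitted the task
    retriable: bool     # whether the task can be rescheduled after being killed
    start_time: float   # seconds; a larger value means a more recent start


def pick_task_to_kill(tasks):
    """Return the task to terminate, or None if no task is running."""
    if not tasks:
        return None

    # Step 1: prefer retriable tasks, since they can be rescheduled later.
    # Fall back to all tasks only when none is retriable.
    retriable = [t for t in tasks if t.retriable]
    candidates = retriable if retriable else list(tasks)

    # Step 2: group the candidates by the caller that submitted them.
    groups = defaultdict(list)
    for task in candidates:
        groups[task.caller_id].append(task)

    # Step 3 (assumption): choose the caller with the most running tasks,
    # so a caller with a single task is not penalized first.
    # Ties are broken by caller id to keep the result deterministic.
    largest_group = max(
        groups.values(), key=lambda g: (len(g), g[0].caller_id)
    )

    # Step 4 (assumption): within the group, terminate the most recently
    # started task, which has done the least work so far.
    return max(largest_group, key=lambda t: t.start_time)


if __name__ == "__main__":
    running = [
        RunningTask("a1", "caller-A", True, 1.0),
        RunningTask("a2", "caller-A", True, 3.0),
        RunningTask("b1", "caller-B", True, 2.0),
        RunningTask("c1", "caller-C", False, 9.0),
    ]
    print(pick_task_to_kill(running).task_id)
```

In this example the non-retriable task from `caller-C` is set aside first, `caller-A` is chosen because it has two running tasks, and the newer of those two is selected. Under the assumptions above, each termination shrinks the largest group by one, so repeated terminations are spread across callers rather than falling on a single small caller.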