Deploy Ray Serve with up to 50% fewer nodes using Anyscale Replica Compaction

Company

Anyscale

Date Published

July 15, 2024

Author

Matt Connor, Akshay Malik, Cindy Zhang

Word count

883

Language

English

Hacker News points

None

URL

www.anyscale.com/blog/new-feature-replica-compaction

Summary

Ray Serve, a scalable model serving library built on Ray, scales up well to handle increased traffic but struggles to scale down efficiently once traffic abates, which leads to resource fragmentation and underutilized nodes. Anyscale's new Replica Compaction feature addresses this by optimizing resource usage for online inference and model serving: it automatically migrates replicas onto fewer nodes to reduce costs. With Replica Compaction, Anyscale detects when a deployment has been downscaled and migrates the excess replicas onto a single node, reducing instance seconds and therefore cost. The feature has shown an average efficiency gain of ~10% on high-end GPUs like A100s and H100s, which translates to substantial cost savings, especially in smaller-scale deployments, where costs can be reduced by 50% or more.
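
The compaction idea described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: it is not Anyscale's implementation and does not use any Ray Serve or Anyscale API. The `Node` class, the `compact` function, and the greedy best-fit strategy are hypothetical and chosen purely for illustration. Each node has a fixed capacity in abstract resource units, and each replica has a fixed demand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model: capacity and replica demands in abstract units."""
    name: str
    capacity: int
    replicas: list = field(default_factory=list)

    @property
    def used(self):
        return sum(self.replicas)

    @property
    def free(self):
        return self.capacity - self.used


def compact(nodes):
    """Greedy sketch: try to empty the least-loaded nodes first.

    A node is released only if ALL of its replicas fit on other nodes
    that are still active. Returns the names of the released nodes.
    """
    released = []
    for src in sorted(nodes, key=lambda n: n.used):
        if not src.replicas:
            continue  # already idle, nothing to migrate
        # Only consider other nodes that are still running replicas.
        targets = [n for n in nodes if n is not src and n.replicas]
        free = {t.name: t.free for t in targets}
        plan = []
        for demand in sorted(src.replicas, reverse=True):
            fits = [t for t in targets if free[t.name] >= demand]
            if not fits:
                plan = None  # cannot fully drain this node; leave it alone
                break
            # Best fit: pick the target with the least remaining space.
            target = min(fits, key=lambda t: free[t.name])
            free[target.name] -= demand
            plan.append((demand, target))
        if plan is None:
            continue
        for demand, target in plan:
            target.replicas.append(demand)
        src.replicas = []
        released.append(src.name)
    return released


# After a downscale, four replicas are spread over three 8-unit nodes.
nodes = [
    Node("A", 8, [2, 2]),
    Node("B", 8, [2]),
    Node("C", 8, [2]),
]
released = compact(nodes)
print(released, [(n.name, n.used) for n in nodes])
```

In this toy scenario the replicas on the two lightly loaded nodes are moved onto the third, so two of the three nodes can be released. A production system would also have to account for draining in-flight requests, multiple resource types, and placement constraints, none of which are modeled here.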