Ray Serve, a scalable model serving library built on Ray, helps absorb increased traffic but struggles to scale down once traffic abates, leaving resources fragmented and underutilized. Anyscale's new Replica Compaction feature addresses this for online inference and model serving by automatically migrating replicas onto fewer nodes. With Replica Compaction, Anyscale detects when a deployment is downscaled and migrates the excess replicas onto a single node, reducing instance seconds and therefore cost. The feature has shown an average efficiency gain of ~10% on high-end GPUs such as A100s and H100s, which translates to substantial cost savings, and in smaller-scale scenarios costs can be reduced by 50% or more.
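To make the idea of compaction concrete, here is a minimal toy sketch in plain Python. It is not Anyscale's implementation and does not use any Ray Serve API: the function name `plan_compaction`, the node and replica names, and the assumption that every replica occupies exactly one slot on a node with a fixed number of slots are all hypothetical, chosen only for illustration. The sketch greedily tries to vacate the least-loaded nodes by moving their replicas into free slots on busier nodes.

```python
from typing import Dict, List, Tuple

Placement = Dict[str, List[str]]      # node name -> replica names on that node
Migration = Tuple[str, str, str]      # (replica, source node, destination node)


def plan_compaction(placement: Placement, capacity: int) -> Tuple[Placement, List[Migration]]:
    """Greedy toy planner (hypothetical, for illustration only).

    Assumes each replica uses one slot and each node has `capacity` slots.
    A node is only drained if all of its replicas fit elsewhere, since a
    partially drained node still has to be paid for.
    """
    new = {node: list(replicas) for node, replicas in placement.items()}
    migrations: List[Migration] = []

    # Visit nodes from least to most loaded (order fixed by the initial load).
    for src in sorted(new, key=lambda n: len(new[n])):
        if not new[src]:
            continue
        # Only pack into nodes that are already in use, fullest first.
        targets = sorted(
            (n for n in new if n != src and new[n]),
            key=lambda n: -len(new[n]),
        )
        free_slots = sum(capacity - len(new[n]) for n in targets)
        if free_slots < len(new[src]):
            continue  # cannot fully vacate this node, so leave it alone
        for replica in list(new[src]):
            dst = next(n for n in targets if len(new[n]) < capacity)
            new[src].remove(replica)
            new[dst].append(replica)
            migrations.append((replica, src, dst))
    return new, migrations


def active_nodes(placement: Placement) -> int:
    """Number of nodes that still host at least one replica."""
    return sum(1 for replicas in placement.values() if replicas)


if __name__ == "__main__":
    # After a downscale, four replicas are spread over three 4-slot nodes.
    before = {"node-a": ["r1"], "node-b": ["r2"], "node-c": ["r3", "r4"]}
    after, moves = plan_compaction(before, capacity=4)
    print("active nodes before:", active_nodes(before))
    print("active nodes after: ", active_nodes(after))
    for replica, src, dst in moves:
        print(f"migrate {replica}: {src} -> {dst}")
```

This greedy heuristic does not try to minimize the number of migrations, and a real system would also have to account for heterogeneous resource requests and for keeping replicas available while they move. It only illustrates why consolidating replicas lets idle nodes be released.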