Company
Date Published
Author
Amog Kamsetty, Hao Chen, Liguang Xie
Word count
1872
Language
English
Hacker News points
7

Summary

At ByteDance, we use multi-modal models to power applications such as text-based image retrieval and object detection, with large-model offline inference as a key use case. To handle the scale of this workload, we adopt Ray as our computing framework, specifically Ray Data, which provides the flexibility and scalability needed for large-scale model-parallel inference. Because the models are too large to fit in a single GPU's memory, we apply pipeline sharding, splitting each model across GPU devices to stay within memory constraints. Ray Data's streaming execution paradigm and elastic resource scheduling let us build efficient, scalable offline inference applications for large models, and we use KubeRay to deploy and manage the underlying Ray clusters.
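To illustrate the two ideas the summary combines, here is a minimal, self-contained sketch of pipeline sharding with streaming execution. This is a toy simulation, not Ray code: the names `make_stage` and `run_pipeline` are illustrative assumptions, and each "device" is just a Python function. The point is that the model is split into stages, each held by one device, and batches stream through the stages one at a time instead of materializing the whole dataset.

```python
# Toy sketch of pipeline sharding + streaming execution (illustrative only;
# `make_stage` and `run_pipeline` are assumed names, not a Ray API).

def make_stage(weight):
    """One model shard: a scalar multiply standing in for a block of layers
    that would live on a single GPU."""
    def stage(batch):
        return [x * weight for x in batch]
    return stage

def run_pipeline(stages, batches):
    """Stream each batch through the stages in order, yielding each result
    as soon as it is ready rather than collecting the full dataset first."""
    for batch in batches:
        for stage in stages:
            batch = stage(batch)
        yield batch

# Split a 3-"layer" model across 3 simulated devices.
stages = [make_stage(w) for w in (2, 3, 5)]
batches = ([1, 2], [3, 4])
results = list(run_pipeline(stages, batches))
print(results)  # [[30, 60], [90, 120]]
```

In the real system described by the post, Ray Data plays the role of `run_pipeline`: it schedules batches across the sharded model stages and overlaps their execution, so no single device ever needs to hold the full model or the full dataset.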