Company
Date Published
Author
Amog Kamsetty, Hao Chen, Liguang Xie
Word count
1872
Language
English
Hacker News points
7

Summary

At ByteDance, we use multi-modal models to power applications such as text-based image retrieval and object detection, with large-model offline inference as a key use case. To handle the scale of this workload, we adopt Ray as our computing framework, specifically Ray Data, which provides the flexibility and scalability needed for large-scale model-parallel inference. Because the models are too large to fit in a single GPU's memory, we apply pipeline sharding, splitting each model across GPU devices to stay within memory constraints. Ray Data's streaming execution paradigm and elastic resource scheduling let us build efficient, scalable offline inference applications for large models, and we use KubeRay to deploy and manage the underlying Ray clusters.
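To illustrate the two ideas the summary combines, here is a minimal, self-contained sketch of pipeline sharding with streaming execution. This is a toy simulation, not Ray code: the names `make_stage` and `run_pipeline` are illustrative assumptions, and each "device" is just a Python function. The point is that the model is split into stages, each held by one device, and batches stream through the stages one at a time instead of materializing the whole dataset.

```python
# Toy sketch of pipeline sharding + streaming execution (illustrative only;
# `make_stage` and `run_pipeline` are assumed names, not a Ray API).

def make_stage(weight):
    """One model shard: a scalar multiply standing in for a block of layers
    that would live on a single GPU."""
    def stage(batch):
        return [x * weight for x in batch]
    return stage

def run_pipeline(stages, batches):
    """Stream each batch through the stages in order, yielding each result
    as soon as it is ready rather than collecting the full dataset first."""
    for batch in batches:
        for stage in stages:
            batch = stage(batch)
        yield batch

# Split a 3-"layer" model across 3 simulated devices.
stages = [make_stage(w) for w in (2, 3, 5)]
batches = ([1, 2], [3, 4])
results = list(run_pipeline(stages, batches))
print(results)  # [[30, 60], [90, 120]]
```

In the real system described by the post, Ray Data plays the role of `run_pipeline`: it schedules batches across the sharded model stages and overlaps their execution, so no single device ever needs to hold the full model or the full dataset.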