Speed matters when loading large language models, especially for the Llama 2 series. Loading a model into GPU memory today can take up to 10 minutes and involves several steps: getting a node from the cluster, pulling down the Docker image, setting up the environment, fetching the weights from S3, decoding the model, and transferring it to GPU memory. Because these steps run sequentially and disk I/O is the bottleneck, the overall process is slow. To address this, techniques such as parallel downloading with multiple threads, streaming data directly into GPU memory, and optimizing CPU memory usage are applied. The Anyscale Model Loader combines these ideas, using concurrent multi-threaded downloads and caching data on disk for later reuse, removing network bandwidth as the bottleneck and achieving a speedup of more than 20x. A sketch of the core technique follows below.
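To make the parallel-download and direct-to-GPU ideas concrete, here is a minimal sketch, not Anyscale's actual implementation: it assumes boto3 for ranged S3 GETs and PyTorch for a pinned-memory staging buffer and the device copy. The bucket name, object key, thread count, and chunk size are all hypothetical placeholders.

```python
# Sketch: multi-threaded ranged S3 download into one preallocated
# pinned-memory buffer, followed by a single bulk copy to GPU memory.
from concurrent.futures import ThreadPoolExecutor

import boto3
import torch

BUCKET = "my-model-bucket"       # hypothetical
KEY = "llama-2-70b/shard-0.bin"  # hypothetical
NUM_THREADS = 16
CHUNK = 64 * 1024 * 1024         # 64 MiB per ranged GET

s3 = boto3.client("s3")  # botocore clients are thread-safe for calls
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

# Pinned (page-locked) CPU buffer: skips the extra pageable-memory copy
# and lets the host-to-device transfer run asynchronously.
buf = torch.empty(size, dtype=torch.uint8, pin_memory=True)

def fetch(offset: int) -> None:
    """Download one byte range and write it into its slot in the buffer."""
    end = min(offset + CHUNK, size) - 1
    body = s3.get_object(
        Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}"
    )["Body"].read()
    buf[offset : offset + len(body)] = torch.frombuffer(
        bytearray(body), dtype=torch.uint8
    )

# Ranged GETs run concurrently, so aggregate throughput is limited by
# total network bandwidth rather than a single connection or the disk.
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    list(pool.map(fetch, range(0, size, CHUNK)))  # list() surfaces errors

# One bulk transfer from pinned CPU memory into GPU memory.
gpu_bytes = buf.to("cuda", non_blocking=True)
```

Staging chunks directly into a shared preallocated buffer avoids ever touching disk on the first load, which is the key design choice: the download and the eventual GPU copy stream through memory instead of serializing behind disk I/O.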