How ThirdAI uses Ray for Parallel Training of Billion-Parameter Neural Networks on Commodity CPUs

Company

Anyscale

Date Published

Aug. 29, 2023

Author

Vihan Lakshman, Pratik Pranav, Siddharth Jain, Tharun Medini

Word count

1643

Language

English

Hacker News points

URL

www.anyscale.com/blog/how-thirdai-uses-ray-for-parallel-training-of-billion-parameter-neural-networks-on-commodity-cpus

Summary

ThirdAI Corp, a startup, has developed BOLT, a deep learning framework that trains large models efficiently on standard CPU hardware by treating sparsity as a first-class design principle. The company used Ray to distribute training of its models, achieving near-linear scaling on terabyte-scale datasets and billion-parameter models. Building on Ray's distributed data-parallel engine let ThirdAI quickly deliver an industry-grade solution with fault tolerance, multiple modes of communication, and seamless scalability. Because no specialized hardware is required, the approach reduces cost and energy consumption, which helps democratize deep learning in a sustainable way. ThirdAI has also simplified its developer experience by moving from Ray Core to Ray Trainer, which provides a streamlined training pipeline, enhanced fault tolerance, and refined automatic scaling. Experimental results on several benchmarks show that BOLT is competitively efficient on CPUs.
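
For readers unfamiliar with the data-parallel pattern the summary refers to, the following is a minimal, framework-free sketch of the general idea: each worker computes a gradient on its own shard of the data, the gradients are averaged, and every worker applies the same update. It is not ThirdAI's or Ray's implementation. The workers are simulated in a single process, the model is a one-parameter linear fit, and all names (`shard_gradient`, `all_reduce_mean`, `split_evenly`, `train`) are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def shard_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of mean squared error for the model y ~ w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the communication step: average the workers' gradients."""
    return sum(values) / len(values)


def split_evenly(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Give each simulated worker an equal-sized, contiguous shard.

    Any remainder is dropped, so this sketch assumes len(data) is a
    multiple of num_workers.
    """
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def train(data: List[Sample], num_workers: int,
          lr: float = 0.01, steps: int = 200) -> float:
    """Synchronous data-parallel gradient descent, simulated in one process."""
    shards = split_evenly(data, num_workers)
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [shard_gradient(w, shard) for shard in shards]
        # All workers then apply the same averaged gradient.
        w -= lr * all_reduce_mean(grads)
    return w


# Toy data generated from y = 3x, so the fitted weight should approach 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_parallel = train(data, num_workers=4)
w_single = train(data, num_workers=1)
```

With equal-sized shards, the average of the per-shard gradients equals the full-batch gradient, so `w_parallel` and `w_single` agree up to floating-point rounding. In a real deployment the list comprehension would be replaced by work running on separate machines, and the averaging step by a network collective.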