How to Train Large Deep Learning Models as a Startup
OpenAI's GPT-3 is a large deep learning model with 175 billion parameters, and training it demands enormous compute: on a single GPU it would take hundreds of years. OpenAI instead trained GPT-3 in a matter of weeks on a high-bandwidth cluster of NVIDIA V100 GPUs provided by Microsoft. Setting up a comparable cluster of 1,024 NVIDIA A100 GPUs is estimated to cost almost $10 million, not counting electricity and hardware maintenance. Training large models is therefore expensive and slow, which is a serious problem for startups that need to iterate quickly.

AssemblyAI, a startup building large Automatic Speech Recognition (ASR) models, shares several lessons about training large models efficiently. To speed up iteration, they recommend scaling training across more GPUs, getting more throughput out of each GPU, and training in lower (mixed) precision; a sketch of the precision idea follows below. To cut costs, they suggest buying your own hardware or renting dedicated servers from smaller hosting providers such as Cirrascale rather than relying on public clouds like AWS or Google Cloud.
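As a rough illustration of the "reduce precision" point, here is a minimal sketch of mixed-precision training. It assumes PyTorch with a CUDA GPU and uses torch.cuda.amp; the summary above does not name a specific framework or API, and the toy model, data, and hyperparameters are hypothetical stand-ins for a real ASR training loop.

```python
# Minimal mixed-precision training sketch (assumes PyTorch + a CUDA GPU).
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Hypothetical toy model and optimizer; stand-ins for a real ASR model.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # rescales the loss to avoid FP16 gradient underflow

for step in range(100):
    # Random stand-in batch; a real pipeline would load audio features here.
    inputs = torch.randn(32, 512, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with autocast():                   # forward pass runs largely in FP16
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()
```

The same loop can be scaled across more GPUs by wrapping the model in torch.nn.parallel.DistributedDataParallel, which is one common way to apply the "use more GPUs" advice; the article's summary does not prescribe a particular multi-GPU setup.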
Company: AssemblyAI
Date published: Oct. 7, 2021
Author(s): Dylan Fox
Word count: 2099
Language: English
Hacker News points: 273