How to Train Large Deep Learning Models as a Startup
OpenAI's GPT-3 is a large deep learning model with 175 billion parameters, and training it demands enormous compute: on a single GPU it would take hundreds of years. OpenAI instead trained GPT-3 in a matter of weeks on a high-bandwidth cluster of NVIDIA V100 GPUs provided by Microsoft. Setting up a comparable cluster of 1,024 NVIDIA A100 GPUs is estimated to cost almost $10 million, not counting electricity and hardware maintenance. Training large models is therefore expensive and slow, which is a serious problem for startups that need to iterate quickly.

AssemblyAI, a startup building large Automatic Speech Recognition (ASR) models, shares several lessons about training large models efficiently. To speed up iteration, they recommend scaling training across more GPUs, getting more throughput out of each GPU, and training in lower (mixed) precision; a sketch of the precision idea follows below. To cut costs, they suggest buying your own hardware or renting dedicated servers from smaller hosting providers such as Cirrascale rather than relying on public clouds like AWS or Google Cloud.
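As a rough illustration of the "reduce precision" point, here is a minimal sketch of mixed-precision training. It assumes PyTorch with a CUDA GPU and uses torch.cuda.amp; the summary above does not name a specific framework or API, and the toy model, data, and hyperparameters are hypothetical stand-ins for a real ASR training loop.

```python
# Minimal mixed-precision training sketch (assumes PyTorch + a CUDA GPU).
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Hypothetical toy model and optimizer; stand-ins for a real ASR model.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # rescales the loss to avoid FP16 gradient underflow

for step in range(100):
    # Random stand-in batch; a real pipeline would load audio features here.
    inputs = torch.randn(32, 512, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with autocast():                   # forward pass runs largely in FP16
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()
```

The same loop can be scaled across more GPUs by wrapping the model in torch.nn.parallel.DistributedDataParallel, which is one common way to apply the "use more GPUs" advice; the article's summary does not prescribe a particular multi-GPU setup.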
Company: AssemblyAI
Date published: Oct. 7, 2021
Author(s): Dylan Fox
Word count: 2099
Language: English
Hacker News points: 273