Company
Date Published
Author
Chuan Li
Word count
2221
Language
English
Hacker News points
None

Summary

This summary compares the total cost of ownership (TCO) of Lambda servers and clusters built with NVIDIA A100 GPUs against cloud instances that use the same NVIDIA A100 GPUs. The analysis covers two Lambda server models, the Hyperplane-A100 and the Scalar-A100, which differ in how their A100 GPUs are interconnected. The study finds that owning a single Lambda Hyperplane-A100 server is significantly cheaper than renting an AWS p4d.24xlarge instance over a 3-year period, with savings ranging from 41.7% to 71.3%; which server is the better choice, however, depends on the use case, such as distributed training or inference. The analysis also shows that raising the occupancy rate from 50% to 100% doubles the total petaFLOPS delivered while increasing the TCO by only $10,534, yielding a higher FLOPS-per-dollar ratio. Finally, the study benchmarks these servers on a range of deep learning models and compares their training throughput, finding similar performance between the Lambda Hyperplane-A100 and the AWS p4d.24xlarge instance, with the Scalar-A100 trailing because of its slower GPU interconnect and lower power consumption.
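
To make the occupancy argument concrete, the sketch below shows one way such a TCO and FLOPS-per-dollar comparison can be computed. Every number in it (server price, power draw, electricity cost, hourly rate, peak throughput) is a hypothetical placeholder rather than a figure from the article; only the structure of the calculation is meant to mirror the comparison of owning a server versus renting a cloud instance over a 3-year period.

# Minimal sketch of the TCO comparison summarized above.
# All numeric inputs are illustrative placeholders, not the article's data.

HOURS_PER_YEAR = 24 * 365
YEARS = 3

def cloud_tco(hourly_rate, occupancy):
    """Renting: you pay the hourly rate for every hour the instance runs."""
    return hourly_rate * HOURS_PER_YEAR * YEARS * occupancy

def owned_tco(server_price, power_kw, kwh_price, colo_per_year, occupancy):
    """Owning: hardware and colocation are fixed; only power scales with use."""
    energy = power_kw * kwh_price * HOURS_PER_YEAR * YEARS * occupancy
    return server_price + colo_per_year * YEARS + energy

def pflop_hours_per_dollar(peak_pflops, occupancy, tco):
    """Delivered petaFLOP-hours per dollar of TCO (higher is better)."""
    return peak_pflops * HOURS_PER_YEAR * YEARS * occupancy / tco

for occ in (0.5, 1.0):
    own = owned_tco(server_price=150_000, power_kw=6.5, kwh_price=0.20,
                    colo_per_year=3_000, occupancy=occ)
    rent = cloud_tco(hourly_rate=32.0, occupancy=occ)
    print(f"occupancy {occ:.0%}: own ${own:,.0f}, rent ${rent:,.0f}, "
          f"savings {(rent - own) / rent:.1%}, "
          f"owned PFLOP-h/$ {pflop_hours_per_dollar(2.5, occ, own):.3f}")

Under these assumptions, doubling occupancy roughly doubles the rental bill but only adds the incremental power cost to the owned server's TCO, which is the mechanism behind the higher flops/$ ratio reported in the article.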