Company:
Date Published:
Author: Eole Cervenka
Word count: 1248
Language: English
Hacker News points: None

Summary

We present an inference benchmark of Stable Diffusion on a range of GPUs and CPUs to shed light on what hardware is needed to run this state-of-the-art text-to-image model. The findings show that many consumer-grade GPUs do a fine job: in single-image latency, the most powerful Ampere GPU (A100) is only 33% faster than the 3080, although the A100 outperforms the 3080 by 2.5x in throughput. We also observe that half precision reduces the time to generate a single output image by about 40% on Ampere GPUs and by 52% on the previous-generation RTX 8000. Throughput does not increase linearly with batch size: the GPU's tensor cores saturate once the batch size reaches a certain value. Removing autocast and running the model natively at half precision speeds up PyTorch inference by ~25%, and we verify the gains on both speed and memory usage. Finally, we observe visible differences between single-precision and half-precision outputs, especially in the early steps.
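
To make the half-precision setup concrete, below is a minimal sketch of fp16 inference, assuming the Hugging Face diffusers library; the model ID, prompt, and batch size are illustrative, not the benchmark's exact configuration. Loading the weights in fp16 runs the model natively at half precision, so no autocast context is needed.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the pipeline with fp16 weights so the model runs natively at half
    # precision; no torch.autocast context is required, which avoids the ~25%
    # overhead noted above.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # illustrative model ID
        torch_dtype=torch.float16,
    ).to("cuda")

    prompt = "a photograph of an astronaut riding a horse"

    # Single-image latency: one image per call.
    image = pipe(prompt).images[0]
    image.save("out_fp16.png")

    # Throughput: batch several prompts into one call. Gains are sub-linear
    # once the tensor cores saturate at larger batch sizes.
    images = pipe([prompt] * 8).images

The same pipeline call covers both measurements discussed above: timing a single-prompt call gives per-image latency, while dividing batch size by the time of a batched call gives throughput.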