How Imagen Actually Works
Imagen is a text-to-image generation model developed by Google Brain that achieves state-of-the-art performance in generating high-quality, diverse images. It uses a large language model (T5) to encode the input text prompt, and the resulting embeddings condition a cascade of diffusion models that synthesize the image: a base model followed by super-resolution models. The models were trained on two corpora: a large dataset of web-scraped image-text pairs and a smaller dataset of high-quality images.

In human evaluations on DrawBench, a comprehensive set of challenging prompts, Imagen outperformed other state-of-the-art text-to-image models, including DALL-E 2, GLIDE, VQGAN+CLIP, and Latent Diffusion.

The key takeaways from the Imagen paper:

- Scaling up the text encoder is very effective.
- Dynamic thresholding is critical.
- Noise conditioning augmentation in the super-resolution models is critical.
- Text conditioning via cross attention is critical.
- Efficient U-Net design is important.

Minimal sketches of dynamic thresholding, noise conditioning augmentation, and cross-attention conditioning follow the references.

Reference(s):
[1] Saharia, C., Chan, W., Saxena, S., et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487.
[2] Dhariwal, P., and Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. arXiv preprint arXiv:2105.05233.
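To make the dynamic thresholding takeaway concrete, here is a minimal NumPy sketch of the technique as the paper describes it: at each sampling step, the predicted clean image is clipped to a per-image percentile s of its absolute pixel values (floored at 1.0) and rescaled by s, rather than statically clipped to [-1, 1]. The function name and the 99.5 percentile default are illustrative choices, not the paper's exact hyperparameters.

```python
import numpy as np

def dynamic_threshold(x0_hat: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Dynamically threshold a batch of predicted clean images x0_hat ([B, H, W, C]).

    s is the given percentile of |x0_hat| per image, floored at 1.0 so
    in-range predictions are untouched. Out-of-range predictions are
    clipped to [-s, s] and rescaled back into [-1, 1].
    """
    reduce_axes = tuple(range(1, x0_hat.ndim))  # all axes except the batch axis
    s = np.percentile(np.abs(x0_hat), percentile, axis=reduce_axes, keepdims=True)
    s = np.maximum(s, 1.0)
    return np.clip(x0_hat, -s, s) / s
```

At a guidance weight of 1 the prediction rarely leaves [-1, 1] and this is effectively a no-op; the rescaling only kicks in at the high classifier-free guidance weights where static clipping would saturate pixels and wash out detail.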
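Similarly, a rough sketch of noise conditioning augmentation for the super-resolution stages: during training, the low-resolution conditioning image is corrupted with Gaussian noise at a randomly sampled level, and that level is also provided to the model as a conditioning signal, which makes each stage robust to artifacts produced by the stage before it. The variance-preserving mix below is a simplification, and the function and argument names are assumptions, not the paper's API.

```python
import numpy as np

def augment_low_res(low_res: np.ndarray, rng: np.random.Generator):
    """Corrupt a low-res conditioning image with Gaussian noise (sketch).

    aug_level in [0, 1] controls corruption strength via a
    variance-preserving mix. The sampled level is returned so it can be
    fed to the super-resolution model as an extra conditioning signal;
    at sampling time a fixed, moderate level would be used instead.
    """
    aug_level = rng.uniform(0.0, 1.0)
    eps = rng.standard_normal(low_res.shape)
    noisy = np.sqrt(1.0 - aug_level) * low_res + np.sqrt(aug_level) * eps
    return noisy, aug_level
```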
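Finally, a toy single-head version of the cross-attention text conditioning: queries come from the image feature map while keys and values come from the T5 token embeddings, so every spatial location can attend to the full prompt. Shapes and the projection-matrix arguments are illustrative; the actual model uses multi-head attention layers inside the U-Net.

```python
import numpy as np

def cross_attention(img_feats, txt_emb, W_q, W_k, W_v):
    """Single-head cross attention: image features attend to text tokens.

    img_feats: [N_img, D] flattened image feature map
    txt_emb:   [N_txt, D] T5 token embeddings for the prompt
    W_q, W_k, W_v: [D, D_head] projection matrices (learned in practice)
    """
    q = img_feats @ W_q                       # queries from the image
    k, v = txt_emb @ W_k, txt_emb @ W_v       # keys/values from the text
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ v                        # text-informed image features
```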
Company: AssemblyAI
Date published: June 23, 2022
Author(s): Ryan O'Connor
Word count: 6060
Hacker News points: 142
Language: English