How Imagen Actually Works
Imagen is a text-to-image generation model developed by Google Research's Brain Team that generates high-quality, diverse images. It encodes the input prompt with a large frozen language model (T5-XXL) and uses the resulting text embeddings to condition a cascade of diffusion models: a base model that generates a 64×64 image, followed by two super-resolution models that upsample it to 256×256 and then 1024×1024. The models were trained on a combination of large internal datasets of image-text pairs and the publicly available LAION-400M dataset. In human evaluations on DrawBench, a comprehensive set of challenging prompts, raters preferred Imagen over other text-to-image models, including DALL-E 2, GLIDE, VQGAN+CLIP, and Latent Diffusion. The key takeaways from the Imagen paper [1] are that scaling up the text encoder is very effective; that dynamic thresholding, noise conditioning augmentation in the super-resolution models, and text conditioning via cross attention are each critical; and that an efficient U-Net design is important for achieving high performance.
Reference(s):
[1] Saharia, C., Chan, W., Saxena, S., et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487.
[2] Dhariwal, P., and Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. arXiv preprint arXiv:2105.05233.
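To make the dynamic thresholding takeaway from the summary above concrete, here is a minimal sketch in plain Python. It assumes the predicted clean image is given as a flat list of pixel values nominally in [-1, 1]. The function names, the `percentile` helper, and the default `p=99.5` are illustrative choices, not the paper's code or settings. At each sampling step, the idea is to pick a threshold `s` as a high percentile of the absolute pixel values, clip to [-s, s] when `s` exceeds 1, and rescale by `s`. Static thresholding, by contrast, always clips to [-1, 1].

```python
def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, with p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    frac = rank - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def static_threshold(x0_pred):
    """Baseline: clip every predicted pixel to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x0_pred]


def dynamic_threshold(x0_pred, p=99.5):
    """Clip to [-s, s] and divide by s, where s is the p-th percentile
    of absolute pixel values (p is a hypothetical default here)."""
    s = percentile([abs(v) for v in x0_pred], p)
    s = max(s, 1.0)  # only takes effect when the prediction leaves [-1, 1]
    return [max(-s, min(s, v)) / s for v in x0_pred]


if __name__ == "__main__":
    saturated = [-4.0, -2.0, 0.0, 2.0, 4.0]
    print(static_threshold(saturated))          # interior values collapse onto the bounds
    print(dynamic_threshold(saturated, p=100))  # relative contrast is preserved
```

In this toy input, static clipping pushes every out-of-range pixel to the boundary, while the dynamic variant rescales the prediction so that fewer pixels saturate. Predictions that already lie within [-1, 1] pass through unchanged because `s` is floored at 1.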
Company
AssemblyAI
Date published
June 23, 2022
Author(s)
Ryan O'Connor
Word count
6060
Language
English
Hacker News points
142