MinImagen is a lightweight text-to-image model introduced by Google DeepMind in 2022. It demonstrates that it's possible to train a high quality text-to-image generator using a much smaller dataset and computational resources compared to models like DALL-E or Imagen.
The MinImagen model consists of two main components: a base U-Net which generates low-resolution images, and a super-resolution U-Net that upscales the generated images to higher resolutions. The key innovation in MinImagen is using classifier-free guidance, where both the unguided (text-only) and guided (text + image caption) logits are used during training and sampling to improve the quality of generated images.
Training a MinImagen model involves first training the base U-Net on low-resolution images paired with captions, followed by fine-tuning the super-resolution U-Net using the outputs from the base U-Net as inputs. The final MinImagen model can then be used to generate high quality images based on textual descriptions.
In summary, MinImagen is a significant step forward in making advanced text-to-image models more accessible and computationally efficient, paving the way for further improvements and applications in this area.