
Combining Images and Text Together: How Multimodal Retrieval Transforms Search

What's this blog post about?

The rise of multimodal models has led to a shift in search methods, with multimodal retrieval gaining popularity due to its ability to combine inputs from multiple modalities such as text and images. This approach allows for more nuanced and precise ways to capture users' search intents by leveraging the strengths of both modalities.

One common task within multimodal retrieval is Composed Image Retrieval (CIR), where users provide a query that includes a reference image along with a descriptive caption. This dual-input approach enables the retrieval of specific images by combining visual content with textual instructions, creating a more detailed and accurate query.

Various techniques have been developed for CIR, including Pic2Word, CompoDiff, CIReVL, and MagicLens. Each builds on the foundational capabilities of CLIP while adopting a different approach to improve retrieval. For example, Pic2Word transforms images into text tokens embedded in a text-based search, leveraging CLIP text embeddings for highly versatile, text-driven image retrieval. CompoDiff employs text-guided denoising, refining noisy visual embeddings with text input to conditionally reconstruct image embeddings, improving search precision. MagicLens uses Transformer models to process text and images in parallel, generating a unified embedding that captures both modalities and enhances retrieval performance.

Explore our multimodal search demo! We've developed an online demo for multimodal search powered by the Milvus vector database. In this demo, you can upload an image and input text instructions, which are processed by a composed image retrieval model to find matching images based on both visual and textual input.
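The core idea behind CIR can be sketched in a few lines: embed the reference image and the text instruction, fuse them into a single query vector, and run a nearest-neighbor search over a gallery of image embeddings. The sketch below uses random vectors as stand-ins for CLIP outputs and a simple average for fusion; real systems like Pic2Word or MagicLens learn this fusion step, and the gallery search would typically run in a vector database such as Milvus rather than in NumPy.

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def compose_query(image_emb, text_emb):
    # Simplest possible fusion: sum the two modality embeddings and re-normalize.
    # CIR models (Pic2Word, CompoDiff, MagicLens) learn this combination instead.
    return normalize(image_emb + text_emb)

def search(query_emb, gallery, k=3):
    # Cosine-similarity search over a normalized gallery of image embeddings.
    scores = gallery @ query_emb
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Hypothetical 512-dim embeddings standing in for a CLIP-style encoder's output.
rng = np.random.default_rng(0)
dim = 512
image_emb = normalize(rng.normal(size=dim))   # reference image embedding
text_emb = normalize(rng.normal(size=dim))    # instruction text embedding
gallery = normalize(rng.normal(size=(100, dim)))  # candidate image embeddings

query = compose_query(image_emb, text_emb)
indices, scores = search(query, gallery)
print(indices, scores)
```

In practice the gallery would hold millions of vectors, so the brute-force matrix product here would be replaced by an approximate nearest-neighbor index.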

Company
Zilliz

Date published
Oct. 22, 2024

Author(s)
David Wang

Word count
3733

Language
English

Hacker News points
None found.


By Matt Makai. 2021-2024.