The modality gap is a significant challenge in multimodal embedding models, which map text and images into a shared vector space and are used across many industries. The gap is a spatial separation between embeddings of different input types: a text and an image that are semantically similar can still end up far apart in the vector space. Despite advances in multimodal embedding models such as OpenAI's CLIP, these models still struggle to capture semantic relationships across modalities accurately.
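One simple way to make the gap concrete is to compare where the text embeddings and the image embeddings sit on average. The sketch below is a minimal illustration, assuming the gap is measured as the Euclidean distance between the centroids of the normalized embeddings; the function names and the toy 3-dimensional vectors are hypothetical and are not outputs of any real model.

```python
import math

def normalize(v):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modality_gap(text_embs, image_embs):
    """Euclidean distance between the centroids of the normalized embeddings."""
    text_center = centroid([normalize(v) for v in text_embs])
    image_center = centroid([normalize(v) for v in image_embs])
    return math.dist(text_center, image_center)

# Hypothetical toy embeddings (assumption: not produced by a real model).
text_embs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
image_embs = [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1]]

gap = modality_gap(text_embs, image_embs)
print(f"modality gap: {gap:.3f}")
```

A value near zero would mean the two modalities occupy the same region of the space; a large value means text and image embeddings form separate clusters, regardless of how well individual pairs match.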
To address this issue, JinaCLIP builds on the original CLIP architecture, supporting longer text input and using an adapted BERT v2 architecture for text encoding. Its training process focuses on overcoming the limitations of the short text inputs found in image captions and on introducing hard negatives. Together, these changes significantly improve the model's text-only performance while maintaining strong performance on multimodal tasks.
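To show why hard negatives matter during contrastive training, the sketch below computes a generic InfoNCE-style loss for a single anchor, first with an easy negative only and then with a hard negative added. It is an illustration of the general technique, not JinaCLIP's actual training objective; the function `info_nce`, the temperature value, and the 2-dimensional vectors are all assumptions made for the example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Contrastive loss for one anchor: negative log-softmax of the positive's score."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical embeddings (assumption: illustrative values only).
anchor = [1.0, 0.0]
positive = [0.9, 0.436]
easy_negative = [-1.0, 0.0]  # points away from the anchor
hard_negative = [0.8, 0.6]   # nearly as close to the anchor as the positive

loss_easy = info_nce(anchor, positive, [easy_negative])
loss_hard = info_nce(anchor, positive, [easy_negative, hard_negative])
print(loss_easy, loss_hard)
```

The easy negative contributes almost nothing to the loss, so the model gets little training signal from it. The hard negative competes with the positive and raises the loss, which pushes the model to learn finer distinctions.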
The discussion also includes a practical example of building a multimodal retrieval system with JinaCLIP and Milvus, an open-source vector database. The system accepts either text or an image as a query and retrieves the most semantically relevant results from a mixed dataset. Understanding why the modality gap occurs, and applying strategies to mitigate it, makes multimodal retrieval systems more accurate and efficient across a range of applications.
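The retrieval step itself can be sketched without any external services. The example below is a minimal in-memory stand-in, assuming every text and image has already been embedded into the same vector space; in a real system the vectors would come from JinaCLIP and the linear scan would be replaced by a Milvus similarity search. The `index` records, their field names, and the vector values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def search(index, query_vec, limit=2):
    """Rank every item, text or image, by cosine similarity to the query."""
    scored = [(cosine(query_vec, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:limit]]

# Hypothetical precomputed embeddings for a mixed text and image dataset.
index = [
    {"id": 1, "modality": "text", "content": "a photo of a cat", "vector": [0.9, 0.1, 0.0]},
    {"id": 2, "modality": "image", "content": "cat.jpg", "vector": [0.8, 0.3, 0.1]},
    {"id": 3, "modality": "text", "content": "stock market report", "vector": [0.0, 0.2, 0.9]},
    {"id": 4, "modality": "image", "content": "chart.png", "vector": [0.1, 0.1, 0.95]},
]

# The query vector may come from either a text or an image query.
query_vec = [1.0, 0.2, 0.0]
results = search(index, query_vec, limit=2)
result_ids = [r["id"] for r in results]
print(result_ids)
```

Because texts and images share one index, a single query can return results from both modalities. A large modality gap would bias such a ranking toward items of the same modality as the query, which is why reducing the gap matters for mixed retrieval.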