
Training Text Embeddings with Jina AI

What's this blog post about?

Bo Wang from Jina AI discussed the development of state-of-the-art text embeddings, which power modern vector search and Retrieval-Augmented Generation (RAG) systems. The release of Jina-Embeddings-V2 drew significant attention in the AI community, with over 3 million downloads on Hugging Face. It has been integrated into AI frameworks such as LangChain and LlamaIndex, as well as vector databases such as Milvus and Zilliz Cloud, and it competes closely with OpenAI embeddings.

Jina AI began by fine-tuning existing models like BERT but soon realized that the industry was not ready for fine-tuning techniques. This led the team to train its own embedding models from scratch, producing Jina-Embeddings-V1 and later V2. V2 can handle sequences of up to 8,192 tokens at inference time while being trained on shorter sequences. To achieve this, it removes position embeddings and introduces Attention with Linear Biases (ALiBi) for dynamic context modeling, adapting ALiBi to bidirectional transformers and retraining BERT from scratch; the resulting JinaBERT serves as the backbone for V2. The model handles multilingual data well and consistently outperforms competitors such as Multilingual E5 and Cohere Embed V3.

When developing RAG applications with Jina-Embeddings-V2, it is essential to consider document length and the positioning of relevant information within documents. The team at Jina AI is already working on Jina-Embeddings-V3, which promises improvements in speed, efficiency, multilingual support, real-world problem solving, task-specific enhancements, and chunk and schema awareness.
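As a rough illustration of the ALiBi idea mentioned above, the sketch below computes a distance-based attention bias in a symmetric, encoder-style form; this is one common way to adapt ALiBi to bidirectional transformers, and the exact formulation used in JinaBERT may differ. The slope schedule follows the original ALiBi paper, and the function names are illustrative, not taken from any Jina codebase.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Head-specific slopes from the ALiBi paper: a geometric sequence
    # starting at 2^(-8 / num_heads). Assumes num_heads is a power of two.
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def bidirectional_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Symmetric (encoder-style) ALiBi: penalize attention by the absolute
    # token distance |i - j|, scaled by a per-head slope. The result has
    # shape (num_heads, seq_len, seq_len) and is added to raw attention
    # scores before softmax, in place of learned position embeddings.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()  # (L, L)
    slopes = alibi_slopes(num_heads)                            # (H,)
    return -slopes[:, None, None] * distance[None, :, :]

# Usage: scores = (q @ k.transpose(-1, -2)) / d_k**0.5 + bias
bias = bidirectional_alibi_bias(seq_len=16, num_heads=8)
print(bias.shape)  # torch.Size([8, 16, 16])
```

Because the bias depends only on token distance, it extrapolates to sequences longer than those seen in training, which is what lets V2 serve 8,192-token inputs after training on shorter ones. For the RAG angle, here is a minimal usage sketch; the model id jinaai/jina-embeddings-v2-base-en and its encode() helper are taken from the public Hugging Face model card, not from the talk summary itself.

```python
from transformers import AutoModel

# trust_remote_code=True is needed because the pooling/encoding logic
# ships with the model repository (assumption based on the model card).
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

# Long documents benefit from the 8,192-token window, but chunking and
# where the relevant passage sits inside each chunk still matter for RAG.
docs = [
    "First passage of a long report ...",
    "Second passage of the same report ...",
]
embeddings = model.encode(docs, max_length=8192)
print(embeddings.shape)  # (2, 768) for the base model
```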

Company
Zilliz

Date published
June 9, 2024

Author(s)
Denis Kuria

Word count
1913

Hacker News points
None found.

Language
English

