Company
Date Published
Oct. 3, 2023
Author
Yujian Tang
Word count
1846
Language
English
Hacker News points
None

Summary

Vector embeddings are central to working with semantic similarity. They represent input data as a series of numbers, so that data can be compared with mathematical operations instead of qualitative judgments. The embeddings must come from a model that matches the data: using an image model for text, or vice versa, is likely to produce poor results. Vector embeddings are useful for many tasks, particularly semantic search. They are created by removing the last layer of a deep learning model (an embedding model or other deep neural network) and taking the output of the second-to-last layer. The dimensionality of a vector embedding therefore equals the size of that second-to-last layer. Common dimensionalities include 384, 768, 1,536, and 2,048. A single dimension of a vector embedding carries no meaning on its own; taken together, the dimensions encode the semantic meaning of the input data. They represent high-level, abstract attributes that depend on the training data and on the model itself, so different models generate different embeddings for the same input. To obtain suitable vector embeddings, identify the type of data you wish to embed (images, text, audio, video, or multimodal data) and choose an appropriate open-source embedding model, such as those available through Hugging Face or PyTorch. For example, ResNet-50 is a popular image recognition model, while MiniLM-L6-v2 and MPNet-Base-V2 are text embedding models. Vector databases such as Milvus and Zilliz Cloud store, index, and search massive datasets of unstructured data through vector embeddings. They use Approximate Nearest Neighbor (ANN) search, a family of algorithms rather than a single one, to find the stored vectors closest to a query vector by spatial distance without comparing the query against every vector in the database.
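
To make the idea of "mathematical operations on the data" concrete, here is a minimal sketch of comparing embeddings by cosine similarity and ranking stored vectors against a query. It is not taken from the original article: the 4-dimensional vectors, the labels, and the helper names `cosine_similarity` and `top_k` are all hypothetical, and real embedding models produce hundreds or thousands of dimensions. The search shown is an exact, brute-force scan; an ANN index in a vector database approximates this same ranking while avoiding the full scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Exact nearest-neighbor search: score every stored vector, keep the best k."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Hypothetical toy embeddings, invented for illustration only.
store = {
    "cat": [0.9, 0.1, 0.0, 0.2],
    "kitten": [0.8, 0.2, 0.1, 0.3],
    "car": [0.1, 0.9, 0.7, 0.0],
}
query = [0.85, 0.15, 0.05, 0.25]

results = top_k(query, store, k=2)
for key, score in results:
    print(f"{key}: {score:.4f}")
```

With these made-up values, the two animal vectors score far higher than the vehicle vector, which is the behavior semantic search relies on. Note that no individual coordinate explains the ranking; it emerges only from comparing all dimensions together.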