Building a High Performance Embeddings Engine at Tecton
Tecton has built a high-performance Embeddings Engine to complement its Aggregation Engine, using PyArrow, PyTorch, and the Tokenizers/Transformers libraries. The engine lets users batch-generate text embeddings quickly and efficiently with open-source models and use them in production applications. It meets several key requirements: top-tier performance with minimal configuration, distributed batch model inference on GPU instances, and automatically tuned model inference that takes full advantage of the available hardware.

The Embeddings Engine is built around Tecton's Rift engine, which supports diverse computational workloads through separate computational stages. On top of that foundation, the engine implements several performance optimizations: single-node parallelism using multi-threading, distributed inference, handling of larger datasets, optimized input processing, and fine-tuned batch operations. Techniques such as input-length sorting, dynamic token batching, automated token-budget selection, and CUDA OOM batch splitting yield significant improvements in throughput and resource utilization.

Because performance is auto-tuned to the local hardware, new open-source models and customer models can be adopted with little effort. Planned enhancements include greater flexibility, better performance, and further optimizations for both batch and real-time embedding generation.
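To illustrate the input-length sorting and dynamic token batching the post describes, here is a minimal sketch, assuming a Hugging Face tokenizer and a fixed per-batch token budget; the model name, budget value, and function names are illustrative assumptions, not Tecton's actual implementation.

```python
# Hypothetical sketch: group texts into batches whose padded size stays
# under a token budget. Sorting by token length first keeps sequences of
# similar length together, minimizing computation wasted on padding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def dynamic_token_batches(texts, token_budget=8192):
    lengths = [len(tokenizer.encode(t)) for t in texts]
    order = sorted(range(len(texts)), key=lambda i: lengths[i])

    batch, max_len = [], 0
    for i in order:
        # Padded batch cost is (batch size) x (longest sequence in batch).
        new_max = max(max_len, lengths[i])
        if batch and new_max * (len(batch) + 1) > token_budget:
            yield [texts[j] for j in batch]
            batch, max_len = [], 0
            new_max = lengths[i]
        batch.append(i)
        max_len = new_max
    if batch:
        yield [texts[j] for j in batch]

# Usage: batches = list(dynamic_token_batches(corpus, token_budget=4096))
```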
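CUDA OOM batch splitting can likewise be sketched as a recursive retry: if inference on a batch exhausts GPU memory, split it in half and embed each half. The embed_fn below is a hypothetical placeholder for a model forward pass that returns a Python list of embeddings; this is an assumption for illustration, not Tecton's code.

```python
# Hypothetical sketch: recursively halve a batch on CUDA out-of-memory
# errors so oversized batches degrade gracefully instead of failing.
import torch

def embed_with_oom_splitting(embed_fn, batch):
    try:
        return embed_fn(batch)
    except torch.cuda.OutOfMemoryError:
        if len(batch) <= 1:
            raise  # A single input that does not fit cannot be split further.
        torch.cuda.empty_cache()  # Release cached blocks before retrying.
        mid = len(batch) // 2
        left = embed_with_oom_splitting(embed_fn, batch[:mid])
        right = embed_with_oom_splitting(embed_fn, batch[mid:])
        return left + right
```

Trying the largest batch first and paying the split cost only when memory actually runs out keeps GPU utilization high without requiring a conservative fixed batch size.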
Company
Tecton
Date published
July 19, 2024
Author(s)
Brian Hart
Word count
1711
Language
English