Refining Vector Search Queries With Time Filters in pgvector: A Tutorial

Company

Timescale

Date Published

April 1, 2024

Author

John Pruitt

Word count

5003

Language

English

Hacker News points

URL

www.timescale.com/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial

Summary

In this tutorial, we will learn how to combine vector search with time-based filters in PostgreSQL using the pgvector extension and TimescaleDB's hypertables. We will create a table with embedding vectors, perform similarity searches, and filter the results by timestamp.

First, install the necessary extensions. Note that pgvector registers itself under the extension name `vector`, and hypertables require `timescaledb`:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS timescaledb;
```

Next, create a table with embeddings and timestamps. The embedding column uses pgvector's `VECTOR` type (not the full-text `TSVECTOR` type); three dimensions keep the example small. The primary key includes `time` because every unique index on a hypertable must contain the partitioning column:

```sql
CREATE TABLE wiki2 (
    id SERIAL,
    embedding VECTOR(3),
    content TEXT,
    time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (id, time)
);
```

Now insert some sample data. The rows below are synthetic: random three-dimensional vectors, placeholder text, and timestamps spaced 10 seconds apart starting at 2000-01-01 (about 11.5 days in total), so that the time filters used later match some rows:

```sql
INSERT INTO wiki2 (embedding, content, time)
SELECT
    ARRAY[random(), random(), random()]::VECTOR(3),  -- toy embedding
    'sample content ' || i,                          -- placeholder text
    '2000-01-01'::TIMESTAMPTZ + i * INTERVAL '10 seconds'
FROM generate_series(1, 100000) AS i;
```

To perform a similarity search on the embeddings, use the `<=>` operator provided by pgvector, which computes cosine distance (smaller means more similar):

```sql
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::VECTOR AS dist
FROM wiki2
ORDER BY dist
LIMIT 10;
```

This query returns the 10 rows closest to the query vector, but it does not consider time. To add a time filter, modify the query as follows:

```sql
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::VECTOR AS dist
FROM wiki2
WHERE '2000-01-04'::TIMESTAMPTZ <= time
  AND time < '2000-01-06'::TIMESTAMPTZ
ORDER BY dist
LIMIT 10;
```

This query returns the 10 most similar rows among those with timestamps from 2000-01-04 (inclusive) up to, but not including, 2000-01-06.

To improve performance on large datasets, we can use TimescaleDB's hypertables. A hypertable automatically partitions its data into chunks by time, which makes querying and storage management more efficient.

To create a hypertable from our existing table, run the following command. Because the table already contains rows, `migrate_data => true` is required to move them into chunks:

```sql
SELECT create_hypertable('wiki2', 'time', migrate_data => true);
```

Now run the same similarity search with a time filter against the hypertable:

```sql
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::VECTOR AS dist
FROM wiki2
WHERE '2000-01-04'::TIMESTAMPTZ <= time
  AND time < '2000-01-06'::TIMESTAMPTZ
ORDER BY dist
LIMIT 10;
```

Thanks to chunk exclusion, the planner only visits the chunks whose time range overlaps the filter. If the hypertable has a vector index, each chunk carries its own index, so the query can run an approximate nearest-neighbor search inside the relevant chunk(s) instead of computing exact distances for every row. As the dataset grows, the number of chunks touched by a fixed time window stays roughly constant, so query time depends on the size of the window rather than the size of the whole table.

In conclusion, by combining vector search with time-based filters in PostgreSQL using pgvector and TimescaleDB's hypertables, we can efficiently retrieve temporally relevant vectors while keeping query times low as the dataset grows. This technique is particularly useful for AI applications that need context based on both semantic similarity and temporal relevance.
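The summary refers to a vector index on each chunk but never shows one being created. A minimal sketch follows, assuming `wiki2` is a hypertable whose `embedding` column is a pgvector `vector` column and that the installed pgvector version supports HNSW (0.5.0 or later). The index name `wiki2_embedding_hnsw` is a hypothetical choice, and the source does not say which index type the author used.

```sql
-- Assumption: embedding is a pgvector vector column and wiki2 is a hypertable.
-- vector_cosine_ops matches the <=> (cosine distance) operator used in the queries.
-- An index created on a hypertable is also created on each of its chunks.
CREATE INDEX wiki2_embedding_hnsw
    ON wiki2
    USING hnsw (embedding vector_cosine_ops);

-- Inspect the plan: only chunks overlapping the time range should be listed,
-- and the planner may choose each chunk's vector index for the ORDER BY.
EXPLAIN
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::vector AS dist
FROM wiki2
WHERE '2000-01-04'::timestamptz <= time
  AND time < '2000-01-06'::timestamptz
ORDER BY dist
LIMIT 10;
```

Whether the planner actually picks the vector index depends on table size and cost estimates, so the `EXPLAIN` output is worth checking on real data rather than assuming.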