Refining Vector Search Queries With Time Filters in pgvector: A Tutorial
In this tutorial, we will learn how to combine vector search with time-based filters in PostgreSQL using the pgvector extension and TimescaleDB's hypertables. We will create a table with embedding vectors, perform similarity searches, and filter the results by timestamp.

First, let's install the necessary extensions. Note that pgvector registers itself under the extension name `vector`:

```sql
-- pgvector provides the vector type and the distance operators
CREATE EXTENSION IF NOT EXISTS vector;
-- TimescaleDB provides create_hypertable(), used later in this tutorial
CREATE EXTENSION IF NOT EXISTS timescaledb;
```

Next, we will create a table with an embedding column and a timestamp. The embedding column uses pgvector's `VECTOR` type (not PostgreSQL's full-text `TSVECTOR`), here with three dimensions to keep the example small. The primary key includes `time` because a hypertable requires every unique index to contain the partitioning column:

```sql
CREATE TABLE wiki2 (
    id SERIAL,
    embedding VECTOR(3),
    content TEXT,
    time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (id, time)
);
```

Now, let's insert some synthetic sample data into the table. The statement below generates random three-dimensional vectors, placeholder text, and timestamps one minute apart starting at 2000-01-01, so that the time filters used later match some rows:

```sql
INSERT INTO wiki2 (embedding, content, time)
SELECT
    ARRAY[random(), random(), random()]::vector,          -- random 3-d embedding
    md5(i::text),                                         -- placeholder content
    '2000-01-01'::TIMESTAMPTZ + i * INTERVAL '1 minute'   -- spread rows over time
FROM generate_series(1, 100000) AS i;
```

To perform a similarity search on the embeddings, we can use the `<=>` operator provided by pgvector, which computes cosine distance (smaller values mean more similar vectors):

```sql
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::vector AS dist
FROM wiki2
ORDER BY dist
LIMIT 10;
```

This query returns the 10 rows whose embeddings are closest to the query vector. However, it does not consider time. To add a time filter to our search, we can modify the query as follows:

```sql
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::vector AS dist
FROM wiki2
WHERE '2000-01-04'::TIMESTAMPTZ <= time
  AND time < '2000-01-06'::TIMESTAMPTZ
ORDER BY dist
LIMIT 10;
```

This query returns the 10 most similar rows, but only among rows with timestamps from 2000-01-04 (inclusive) to 2000-01-06 (exclusive).

To improve performance on large datasets, we can use TimescaleDB's hypertables. A hypertable automatically partitions its data into chunks based on time, allowing for more efficient querying and storage management.

To create a hypertable from our existing table, we can run the following command. Because the table already contains rows, `migrate_data => true` is required to move them into chunks:

```sql
SELECT create_hypertable('wiki2', 'time', migrate_data => true);
```

Now, let's perform the same similarity search with a time filter against the hypertable:

```sql
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::vector AS dist
FROM wiki2
WHERE '2000-01-04'::TIMESTAMPTZ <= time
  AND time < '2000-01-06'::TIMESTAMPTZ
ORDER BY dist
LIMIT 10;
```

The query text is unchanged, but the planner can now skip every chunk whose time range falls outside the filter (chunk exclusion) and search only the relevant chunk(s). If a vector index has been created on the hypertable, each chunk has its own copy of that index, so the query can run an approximate nearest-neighbor search within the matching chunks, which is typically faster than computing exact distances for every row. Without a vector index, the distances are still computed exactly, but only over the rows in the matching chunks. As the dataset grows, chunk exclusion keeps the amount of data scanned proportional to the time window rather than to the size of the whole table.

In conclusion, by combining vector search with time-based filters in PostgreSQL using the pgvector extension and TimescaleDB's hypertables, we can efficiently retrieve temporally relevant vectors while keeping query times low even as the dataset grows. This technique is particularly useful for AI applications that need context based on both semantic similarity and temporal relevance.
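
The walkthrough mentions a vector index on the chunks but does not show one being created. The following is a minimal sketch of how that step might look, assuming pgvector 0.5.0 or later (for HNSW support) and that `wiki2.embedding` is a pgvector `vector` column. The index name `wiki2_embedding_hnsw_idx` is hypothetical, and the `vector_cosine_ops` operator class is chosen to match the `<=>` cosine distance operator used in the queries:

```sql
-- Assumption: wiki2.embedding is a pgvector `vector` column.
-- The index name is hypothetical; choose any name you like.
-- vector_cosine_ops matches the <=> (cosine distance) operator.
CREATE INDEX IF NOT EXISTS wiki2_embedding_hnsw_idx
    ON wiki2 USING hnsw (embedding vector_cosine_ops);

-- Inspect the plan of the time-filtered search to see which
-- chunks are scanned and whether the vector index is used.
EXPLAIN (ANALYZE)
SELECT id, embedding <=> '[0.1, 0.2, 0.3]'::vector AS dist
FROM wiki2
WHERE time >= '2000-01-04'::TIMESTAMPTZ
  AND time <  '2000-01-06'::TIMESTAMPTZ
ORDER BY dist
LIMIT 10;
```

On a hypertable, the plan should list only the chunks that overlap the requested time range. Keep in mind that with an approximate index, pgvector applies `WHERE` conditions after the index scan, so a filter that is narrower than a chunk can leave fewer than `LIMIT` rows; raising `hnsw.ef_search` for the session is one way to widen the candidate set.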
Company
Timescale
Date published
April 1, 2024
Author(s)
John Pruitt
Word count
5003
Language
English
Hacker News points
4