pgvector is a powerful extension that is supported by ApsaraDB RDS for PostgreSQL to perform vector similarity searches. The extension enables the system to support a new data type and allows you to perform vector similarity searches in an efficient manner.
Background information
ApsaraDB RDS for PostgreSQL supports the pgvector extension. The extension can be used to store vector data, perform vector similarity searches, and enrich the underlying data for AI services.
The pgvector extension provides the following benefits:
It enables the system to support vector data. You can use the pgvector extension to store and search for vector data.
It supports exact and approximate nearest neighbor (ANN) searches. This allows you to calculate the distances or similarities between vectors in a vector space by L2 Euclidean distance, cosine similarity, or inner product. It supports the construction of Hierarchical Navigable Small World (HNSW) indexes and IVFFLAT parallel indexes, element-by-element multiplication of vectors, L1 distance functions, and sum aggregation.
It allows you to create vectors with up to 16,000 dimensions and create indexes for vectors with up to 2,000 dimensions.
Terms and implementation principles
embedding
Embedding is a technique that translates high-dimensional data into a low-dimensional space. In machine learning and natural language processing (NLP), embedding is a common method that is used to represent sparse symbols or objects as continuous vectors.
In NLP, word embedding is an approach that is used to represent words in the form of real-valued vectors. This allows computers to better understand text-based content. Word embedding is used to translate the semantics and grammatical relations of words into a vector space.
Implementation principles
Embedding is a technique that abstracts data such as words, images, audio content, and videos from multiple dimensions and translates the data into vectors.
The pgvector extension supports vector data, which allows ApsaraDB RDS for PostgreSQL to store vector data.
The pgvector extension can be used to perform exact and ANN searches on stored vector data.
Assume that you need to store three objects (an apple, a banana, and a cat) in an RDS instance and use the pgvector extension to calculate the similarities among the objects. You can perform the following steps:
Use embeddings to translate the apple, banana, and cat objects into vectors. The following section provides an example of two-dimensional embeddings:
Apple: embedding [1,1] Banana: embedding [1.2,0.8] Cat: embedding [6,0.4]
Store the vector data in the RDS instance. For more information, see Examples.
The following figure shows the distribution of the apple, banana, and cat objects in a two-dimensional space.
The apple and the banana are closer because they are both fruits and are more similar to each other. The cat is an animal and is located far from the fruits.
The attributes of fruits can be grouped in terms of dimensions such as color, place of origin, and taste. If you group the fruits from finer-grained dimensions, you can obtain more accurate results.
Scenarios
You want to store vector data.
You want to perform vector similarity searches.
Prerequisites
The RDS instance meets the following requirements:
The RDS instance runs PostgreSQL 14 or later.
The minor engine version of your RDS instance is 20230430 or later.
NoteThis extension is not supported by ApsaraDB RDS for PostgreSQL instances that run PostgreSQL 17.
For more information about how to upgrade the major engine version or update the minor engine version of the RDS instance, see Upgrade the major engine version of an ApsaraDB RDS for PostgreSQL instance or Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance.
Extension management
Create the extension.
CREATE EXTENSION IF NOT EXISTS vector;
Delete the extension.
DROP EXTENSION vector;
Update the extension.
ALTER EXTENSION vector UPDATE [ TO new_version ]
NoteIn the preceding statement, the new_version parameter indicates the version of the pgvector extension. For more information about the latest version and related features of the pgvector extension, see official documentation of pgvector.
Examples
This section provides an example on how to use the pgvector extension. For more information, see official documentation of pgvector.
Create a table named items of the vector type to store embeddings.
CREATE TABLE items ( id bigserial PRIMARY KEY, item text, embedding vector(2) );
NoteIn the preceding statements, two-dimensional vectors are used. The pgvector extension allows you to create vectors with up to 16,000 dimensions.
Insert the vectors into the created table.
INSERT INTO items (item, embedding) VALUES ('Apple', '[1,1]'), ('Banana', '[1.2,0.8]'), ('Cat', '[6,0.4]');
Use the cosine similarity operator (
<=>
) to calculate the similarities between the banana, apple, and cat objects.SELECT item, 1 - (embedding <=> '[1.2, 0.8]') AS cosine_similarity FROM items ORDER BY cosine_similarity DESC;
NoteIn the preceding statements, the
cosine_similarity = 1 - cosine_distance
formula is used to calculate similarities. The shorter the distance between vectors, the more similar the objects are.You can also use the Euclidean distance operator (
<->
) or the inner product operator (<#>
) to calculate similarities.
Sample output:
item | cosine_similarity ------+-------------------- Banana | 1 Apple | 0.980580680748848 Cat | 0.867105556566985
In the preceding output:
A value of 1 for the banana indicates an exact match.
A value of 0.98 for the apple indicates that the apple has a high similarity with the banana.
A value of 0.86 for the cat indicates that the cat has a low similarity with the banana.
NoteYou can specify a similarity threshold based on your business requirements to exclude results with low similarities.
Create an index for vectors to improve the efficiency of vector similarity searches by using the pgvector extension. You can execute the following statement to create an index for the embedding parameter:
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
The following table describes the parameters.
Parameter/Value
Description
items
The name of the table for which the index is created.
embedding
The name of the column for which the index is created.
vector_cosine_ops
The access mode that is specified in the vector index method.
For vector similarity searches by cosine similarity, set this parameter to
vector_cosine_ops
.For vector similarity searches by Euclidean distance, set this parameter to
vector_l2_ops
.For vector similarity searches by inner product, set this parameter to
vector_ip_ops
.
lists = 100
The lists parameter indicates the number of lists into which the dataset is divided. A larger value indicates that the dataset is divided into more lists, the size of each subset is smaller, and the index search speed is faster. As the value of the lists parameter increases, the search recall rate may decrease.
NoteThe recall rate is the ratio of the number of exactly searched samples or classified samples to the total number of relevant samples in an information search or classification task. The recall rate is a metric that is used to measure the capability of the system to identify all relevant samples.
Index creation requires a large number of memory resources. If the value of the lists parameter is larger than 2000, the
ERROR: memory required is xxx MB, maintenance_work_mem is xxx MB
error message is displayed. In this case, you must change the value of the maintenance_work_mem parameter to a larger value. However, an excessively large value may cause out of memory (OOM) errors. For more information about how to change the value of the parameter, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.You can change the value of the lists parameter to make a trade-off between the search speed and the recall rate. This helps meet your requirements in specific scenarios.
Use one of the following methods to configure the ivfflat.probes parameter. The parameter specifies the number of lists that you want to search in the index. To search more lists and improve the recall rate of the search result, you can increase the value of the ivfflat.probes parameter.
Session level
SET ivfflat.probes = 10;
Transaction level
BEGIN; SET LOCAL ivfflat.probes = 10; SELECT ... COMMIT;
A large value of the ivfflat.probes parameter indicates a high recall rate of the search result but a slow search speed. To ensure the optimal search performance and recall rate, you must adjust the values of the ivfflat.probes and lists parameters based on your business requirements and the characteristics of the dataset.
NoteIf the value of the ivfflat.probes parameter is the same as the value of the lists parameter that is specified during the index creation, vector indexes are ignored and a full table scan is performed for your search. In this case, indexes are not used, and the full table is scanned. This may degrade search performance.
Performance data
When you create indexes for vectors, you must make a trade-off between the search speed and the recall rate based on your business requirements. The following test results provide examples of performance optimization.
This section describes the storage that is separately occupied by vectors and indexes for different volumes of data. This section also describes the impacts on the search speed and the recall rate when you adjust the values of the lists and probes parameters.
Test data preparation
Create a test database.
CREATE DATABASE testdb;
Install the extension.
CREATE EXTENSION IF NOT EXISTS vector;
Generate random vectors with a fixed length and use the vectors as the test data.
CREATE OR REPLACE FUNCTION random_array(dim integer) RETURNS DOUBLE PRECISION[] AS $$ SELECT array_agg(random()) FROM generate_series(1, dim); $$ LANGUAGE SQL VOLATILE COST 1;
Create a table that can store vectors with 1,536 dimensions.
CREATE TABLE vtest(id BIGINT, v VECTOR(1536));
Insert data into a table.
INSERT INTO vtest SELECT i, random_array(1536)::VECTOR(1536) FROM generate_series(1, 100000) AS i;
Create an index.
CREATE INDEX ON vtest USING ivfflat(v vector_cosine_ops) WITH(lists = 100);
Procedure
To prevent the impacts on the test data due to factors such as network latency, we recommend that you use the internal endpoint of the RDS instance. In this example, the test is performed on an Elastic Compute Service (ECS) instance that resides in the same region and virtual private cloud (VPC) as the RDS instance.
Use a random vector to calculate the similarities between the data in the vtest table and obtain the most similar 50 records.
You must create an SQL file and write the following content to the file for subsequent stress testing:
WITH tmp AS ( SELECT random_array(1536)::VECTOR(1536) AS vec ) SELECT id FROM vtest ORDER BY v <=> (SELECT vec FROM tmp) LIMIT FLOOR(RANDOM() * 50);
Use pgbench to perform stress testing.
pgbench is a program that runs benchmark tests in PostgreSQL. You must run the following commands in the CLI. Make sure that the PostgreSQL client is installed. In this example, PostgreSQL 15.1 is used. For more information, see PostgreSQL official documentation.
pgbench -f ./test.sql -c6 -T60 -P5 -U testuser -h pgm-bp****.pg.rds.aliyuncs.com -p 5432 -d testdb
The following table describes the parameters.
Parameter/Value
Description
-f ./test.sql
The path and file name of the test script file.
./test.sql
is an example. You must modify the parameter based on your business requirements.-c6
The number of concurrent client connections. -c indicates that the number of concurrent client connections. In this example, 6 indicates that six concurrent client connections can be used for the test.
-T60
The testing time. -T indicates the time that is required to perform the test. In this example, 60 indicates that the test lasts for 60 seconds.
-P5
The script parameter. In this example, the progress report is displayed every 5 seconds.
-U testuser
The username of the database. You must replace testuser with your database username.
-h pgm-bp****.pg.rds.aliyuncs.com
The internal endpoint of the RDS instance.
-p 5432
The internal port of the RDS instance.
-d testdb
The database to which you want to connect. In this example, testdb is connected.
Test results
Conclusions
The value of the lists parameter has a slight impact on the storage that is occupied by indexes. The occupied storage is affected by the data volume in the table.
The values of the lists and probes parameters have opposite impacts on the search speed and the recall rate. Appropriate settings of the two parameters help make a trade-off between the search speed and the recall rate.
We recommend that you configure the lists and probes parameters based on the number of data rows in the table. The following section provides sample configuration settings.
If the number of data rows in the table is less than or equal to 1 million, you can calculate the values of the lists and probes parameters by using the following formulas:
lists = Number of data rows/1,000
andprobes = Value of the lists parameter/10
If the number of data rows in the table is greater than 1 million, you can calculate the values of the lists and probes parameters by using the following formulas:
lists = sqrt(Number of data rows)
andprobes = sqrt(Value of the lists parameter)
.Notesqrt indicates the square root function.
Best practices
Build an LLM-driven dedicated chatbot on top of ApsaraDB RDS for PostgreSQL