PGVector - PolarDB - Alibaba Cloud Documentation Center

PGVector is a high-performance database extension for vector similarity searches and supports multiple algorithms and data types. You can use this extension to efficiently store and query vector embeddings. This topic describes the background information, implementation principles, usage, and references of PGVector.

Background information

As data science and machine learning develop rapidly, vector computing has become one of the most common computing tasks in the big data field. PolarDB for PostgreSQL can be combined with the PGVector extension, which uses custom data types and storage methods to improve the performance of high-dimensional vector computing.

Note

PGVector can store high-dimensional vectors, such as the output of major text embedding models. It allows you to create vectors with up to 16000 dimensions.

Prerequisites

Only the open source PGVector is supported. For more information, see open source PGVector.

Precautions

- With the cross-node parallel execution feature, you can use an ORDER BY clause to traverse high-dimensional vectors.
- The cross-node parallel execution feature does not support index queries.

How it works

Like the PASE extension, PGVector supports Hierarchical Navigable Small World (HNSW) and IVFFlat indexes. The Inverted File with Flat Compression (IVFFlat) algorithm is based on inverted indexes and implements approximate nearest neighbor search. IVFFlat divides the vector space into regions, each of which represents a cluster of vectors, and creates an inverted index over these regions to perform vector similarity searches.

IVFFlat is a simplified version of the Inverted File System with Asymmetric Distance Computation (IVFADC) algorithm. IVFFlat is suitable for business scenarios that require high precision and can tolerate a query latency of up to 100 milliseconds. Compared with other algorithms, IVFFlat has the following advantages: high recall rate, high precision, simple algorithm and parameters, and low storage usage.

The PGVector extension is implemented based on the extension mechanism of PolarDB for PostgreSQL. The extension is written in the C programming language and supports a variety of vector computing algorithms and data types. The IVFFlat algorithm works as follows:

IVFFlat uses a clustering algorithm such as k-means to divide vectors in the high-dimensional space into clusters based on implicit clustering properties. This way, each cluster has a centroid.
IVFFlat traverses the centroids of all clusters to identify the n centroids that are nearest to the vector that you want to query.
IVFFlat traverses and sorts all vectors in the clusters to which the identified n centroids belong. Then, IVFFlat obtains the nearest k vectors.
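
The three steps above can be sketched in plain Python. This is an illustration of the idea only, not the C implementation that PGVector uses. The function names are hypothetical, the `probes` parameter plays the role of n (the name mirrors the `ivfflat.probes` setting of PGVector), and the k-means loop is deliberately naive.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vec):
    """Index of the centroid that is closest to vec."""
    return min(range(len(centroids)), key=lambda i: squared_l2(centroids[i], vec))


def build_ivfflat(vectors, lists, iterations=10, seed=0):
    """Step 1: cluster the vectors with k-means and build one inverted list per cluster."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, lists)]
    for _ in range(iterations):
        members = [[] for _ in range(lists)]
        for v in vectors:
            members[nearest(centroids, v)].append(v)
        for i, group in enumerate(members):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    inverted = [[] for _ in range(lists)]
    for row_id, v in enumerate(vectors):
        inverted[nearest(centroids, v)].append(row_id)
    return centroids, inverted


def search(vectors, centroids, inverted, query, k, probes):
    """Steps 2 and 3: pick the `probes` nearest centroids, then rank only their vectors."""
    order = sorted(range(len(centroids)), key=lambda i: squared_l2(centroids[i], query))
    candidates = [row_id for i in order[:probes] for row_id in inverted[i]]
    candidates.sort(key=lambda row_id: squared_l2(vectors[row_id], query))
    return candidates[:k]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.random(), rng.random()] for _ in range(200)]
    centroids, inverted = build_ivfflat(data, lists=8)
    # Probing 2 of 8 clusters is fast but approximate; probing all 8 is exact.
    print(search(data, centroids, inverted, [0.5, 0.5], k=5, probes=2))
    print(search(data, centroids, inverted, [0.5, 0.5], k=5, probes=8))
```

When `probes` equals the number of clusters, every vector is a candidate and the result matches an exact scan. Smaller values skip clusters, which is where both the speedup and the loss of recall come from.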

Usage notes

You can use the PGVector extension to perform sequential searches or index searches on high-dimensional vectors. For more information, see Example.

Recall rate and query performance

In versions earlier than 0.5.0, the PGVector extension supports only the IVFFlat indexing method. IVFFlat builds indexes quickly and uses relatively little memory, but its recall rate is moderate. The HNSW indexing method, which is available in 0.5.0 and later, provides a better recall rate and better query performance, but builds indexes more slowly and uses more memory. To query vector data efficiently by using vector indexes, you must balance query performance against recall rate. The following list describes how to configure the parameters of these indexing methods to improve the recall rate.
- HNSW
  m: the number of bidirectional links (or paths) connected to each index element. The value is in the range of 2 to 100. The default value is 16. To increase the recall rate, you can specify a large number for the parameter. However, a large number of bidirectional links significantly extends the index generation time and may negatively affect query performance.
  ef_construction: the number of neighbors that you want to check when an element is added to the index. The value is in the range of 4 to 1000. The default value is 64. You can increase the recall rate by increasing the value of this parameter. However, the index building time may be extended. The value of this parameter must be at least twice the value of the m parameter.
```
CREATE TABLE vecs (id int PRIMARY KEY, embedding vector(1536));
CREATE INDEX ON vecs USING hnsw(embedding vector_l2_ops) WITH (m=16, ef_construction=64);
```
  If you use the HNSW indexing method, you must specify an operator class. For example, if you want to use cosine similarity as the metric for HNSW indexes, execute the following statement:
```
CREATE INDEX ON vecs USING hnsw(embedding vector_cosine_ops);
```
  You can use the default index building parameters to keep the index building time short. If the expected recall rate is not achieved on your dataset, first increase the value of the ef_construction parameter, and then adjust the value of the m parameter. At query time, you can also increase the recall rate by specifying a larger value for the hnsw.ef_search parameter, for example, 100. A larger value yields a higher recall rate but slows down queries.
- IVFFlat
  lists: the number of clusters (inverted lists) into which PGVector divides the vectors that it samples from the table when the index is built.
```
CREATE INDEX ON vecs USING ivfflat(embedding) WITH (lists=100);
```
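
As a starting point for choosing lists, the README of the open source PGVector project suggests rows / 1000 for tables with up to 1 million rows and sqrt(rows) for larger tables, and sqrt(lists) for the query-time `ivfflat.probes` setting. The following Python sketch turns those rules of thumb into SQL statements. The helper names are hypothetical, and the values are only initial guesses that should be tuned against your own recall and latency measurements.

```python
import math


def suggest_lists(row_count):
    """Starting value for lists: rows / 1000 up to 1M rows, sqrt(rows) above that."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


def suggest_probes(lists):
    """Starting value for ivfflat.probes: sqrt(lists)."""
    return max(1, round(math.sqrt(lists)))


def ivfflat_statements(table, column, row_count, opclass="vector_l2_ops"):
    """Build the index DDL and the session setting as plain strings.

    The identifiers are interpolated directly, so pass only trusted names.
    """
    lists = suggest_lists(row_count)
    return [
        f"CREATE INDEX ON {table} USING ivfflat ({column} {opclass}) WITH (lists = {lists});",
        f"SET ivfflat.probes = {suggest_probes(lists)};",
    ]


if __name__ == "__main__":
    for statement in ivfflat_statements("vecs", "embedding", row_count=500_000):
        print(statement)
```

A larger number of probes raises the recall rate at the cost of query speed, in the same way that a larger hnsw.ef_search value does for HNSW indexes.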

For more information about the indexes and their parameters, see the README file of the open source project.
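
To judge whether a parameter change actually helps, it is useful to measure the recall rate directly. A common approach, sketched here in Python under the assumption that you have already collected the IDs returned by an exact sequential search and by the indexed search for the same query, is to compute the fraction of exact results that the index also found. The function names are hypothetical.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


def mean_recall(result_pairs):
    """Average recall over several queries; each pair is (exact_ids, approx_ids)."""
    scores = [recall_at_k(exact, approx) for exact, approx in result_pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Three of the four exact neighbors were found by the index.
    print(recall_at_k([1, 2, 3, 4], [1, 2, 9, 4]))
```

Averaging over a representative set of queries gives a more stable number than a single query, and the same set can be reused after each parameter change.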

Example

Add the PGVector extension to the database.
```
CREATE EXTENSION vector;
```
Create a table.
```
CREATE TABLE t (val vector(3));
```
Insert data into the table.
```
INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL);
```
Create a vector index.
```
CREATE INDEX ON t USING ivfflat (val vector_ip_ops) WITH (lists = 1);
```
Find similar vectors.
```
SELECT * FROM t ORDER BY val <#> '[3,3,3]';
```
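When the query uses the index, the following results are returned. The row that contains NULL is absent because NULL values are not indexed: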
```
   val
---------
 [1,2,3]
 [1,1,1]
 [0,0,0]
(3 rows)
```
Note
- In val vector_ip_ops, val is the column on which you want to create an index, and vector_ip_ops is the operator class that makes the index rank vectors by inner product, which matches the <#> operator in the query. The <#> operator returns the negative inner product, so the most similar vectors are returned first. PGVector also provides the vector_l2_ops operator class for Euclidean distance (the <-> operator) and the vector_cosine_ops operator class for cosine distance (the <=> operator).
- WITH (lists = 1) indicates that only one region is created, which means that all vectors are assigned to the same region. In actual vector query scenarios, specify the number of regions based on the amount of data and query performance.
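
To make the ordering in the example concrete, the following Python sketch computes what the three PGVector distance operators return and re-ranks the example rows by negative inner product. It illustrates the arithmetic only. The function names are hypothetical, and returning NaN for a zero vector in the cosine case is a choice made for this sketch.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the value of the <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Negative inner product, the value of the <#> operator."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the value of the <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return float("nan")  # a zero vector has no direction
    return 1 - dot / norm


rows = [[0, 0, 0], [1, 2, 3], [1, 1, 1]]
query = [3, 3, 3]

# ORDER BY val <#> '[3,3,3]' sorts ascending by negative inner product:
# [1,2,3] scores -18, [1,1,1] scores -9, and [0,0,0] scores 0.
ranked = sorted(rows, key=lambda v: negative_inner_product(v, query))
print(ranked)  # [[1, 2, 3], [1, 1, 1], [0, 0, 0]]
```

Because the smallest value sorts first, negating the inner product lets an ascending ORDER BY return the vectors with the largest inner product at the top.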

References

For more information about the embedding process of vectors, see the output of Chinese mainland and international text embedding models.