Background, principles, usage, and other information about PGVector - PolarDB

PGVector is an efficient vector database extension that supports multiple vector computing algorithms and data types. It also provides efficient storage and querying for AI embeddings that are represented as vectors. This topic describes the background, principles, and usage of PGVector.

Background information

Due to the rapid development of data science and machine learning, vector computing has become a common task in big data applications. PolarDB for PostgreSQL (Compatible with Oracle) is a widely used relational database. When combined with the PGVector extension, it uses custom data types and storage methods to accelerate high-dimensional vector computing.

Note

The PGVector extension supports vectors with up to 16,000 dimensions. This is ideal for scenarios that use high-dimensional storage for input and output, such as mainstream text embedding models.

Applicability

The following versions of PolarDB for PostgreSQL (Compatible with Oracle) are supported:

Oracle syntax compatibility 2.0 (minor engine version 2.0.14.7.9.0 or later)
Oracle syntax compatibility 1.0 (minor engine version 2.0.11.9.35.0 or later)

Note

You can view the minor engine version number in the console or by running the SHOW polardb_version; statement. If your minor engine version does not meet the requirement, upgrade the minor engine version.

Precautions

The cross-node parallel execution (PX) feature supports traversing high-dimensional vectors with `sort`.
PX does not support index queries.

How it works

PGVector uses the IVFFlat index algorithm, which is the same vector algorithm used by the PASE extension. IVFFlat is an approximate nearest neighbor (ANN) search algorithm based on an inverted index that efficiently queries for vector similarity. The algorithm divides the vector space into partitions, and each partition contains a set of vectors. Then, an inverted index is created to quickly find vectors that are similar to a given query vector.

IVFFlat is a simplified version of the Inverted File System with Asymmetric Distance Computation (IVFADC) algorithm. It is suitable for scenarios that require a high recall rate and can tolerate query response times in the 100 ms range. Compared to other algorithms, IVFFlat offers the advantages of a high recall rate, high precision, a simple algorithm and parameters, and low storage usage.

The PGVector extension is built on the extension mechanism of PolarDB for PostgreSQL (Compatible with Oracle). It is written in C and implements various vector computing algorithms and data types. The algorithm flow is as follows:

A clustering algorithm, such as K-Means, groups vectors in the high-dimensional space into clusters. A centroid is then calculated for each cluster.
To retrieve a vector, the algorithm first traverses all cluster centroids to find the `n` centroids that are closest to the target vector.
The algorithm then traverses all elements within the clusters that correspond to the `n` centroids. A global sort is performed to find the `k` closest vectors.

Usage guide

The PGVector extension can retrieve high-dimensional vectors sequentially or using an index. The Examples section provides a simple use case.
The supported extension versions vary slightly depending on the database engine version. For more information, see Supported extensions.
Recall rate and performance
PGVector versions earlier than 0.5.0 used IVFFlat indexes. IVFFlat indexes can be built quickly and improve query performance compared to an unindexed search. However, their recall rate is moderate and they consume some memory. The newer HNSW index offers a better recall rate and higher performance, but it takes longer to build and uses more memory. When you use vector indexes to query vector data, you often need to balance performance and recall rate. This section describes how to configure index parameters for both methods to improve the recall rate.
- HNSW
  m: The number of bidirectional links (or paths) between each index element. The default value is 16. The valid range is 2 to 100. A higher value increases the recall rate but also significantly increases the index build time and may negatively affect query performance.
  ef_construction: The number of nearest neighbors to check when adding an element to the index. The default value is 64. The valid range is 4 to 100. Increasing this value improves the recall rate but increases the index build time. This value must be at least twice the value of m.
```
CREATE TABLE vecs (id int PRIMARY KEY, embedding vector(1536));
CREATE INDEX ON vecs USING hnsw(embedding vector_l2_ops) WITH (m=16, ef_construction=64);
```
  Specify the operator class for the HNSW index. For example, to use cosine similarity for an HNSW index, run the following command:
```
CREATE INDEX ON vecs USING hnsw(embedding vector_cosine_ops);
```
  To optimize build time, start with the default index build configuration items. If the recall rate for your dataset does not meet your expectations, first increase the ef_construction value, and then adjust the m value. You can increase the query recall rate by setting a larger hnsw.ef_search value. For example, you can set it to 100. A larger value usually results in a higher recall rate.
- IVFFlat
  lists: The number of cluster centroids for the vectors in the PGVector sampling table.
```
CREATE INDEX ON vecs USING ivfflat(embedding) WITH (lists=100);
```

For more information about indexes and other parameters, see the README file in the open source code repository.

Examples

Create the extension.
```
CREATE EXTENSION vector;
```
Create a table.
```
CREATE TABLE t (val vector(3));
```

Insert data.

INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL);

Create a vector index.

CREATE INDEX ON t USING ivfflat (val vector_ip_ops) WITH (lists = 1);

Calculate approximate vectors.
```
SELECT * FROM t ORDER BY val <#> '[3,3,3]';
```
The following result is returned:
```
   val
---------
 [1,2,3]
 [1,1,1]
 [0,0,0]
(3 rows)
```
Note
- val vector_ip_ops indicates that an index is created on the `val` column. The vector_ip_ops vector operator from PolarDB for PostgreSQL (Compatible with Oracle) calculates the similarity between vectors. This operator supports calculations such as dot product, cosine similarity, and Euclidean distance.
- WITH (lists = 1) specifies that the number of partitions is 1. This means all vectors are assigned to the same partition. In a real-world application, you should adjust the number of partitions based on your data volume and query performance requirements.

References

To obtain vector embeddings, you can use text embedding models that are available for the Chinese mainland and international regions.