Use vector databases to perform vector search and integrated search - AnalyticDB

AnalyticDB for PostgreSQL provides the exact search method that compares the distance or similarity between all vectors and the approximate index search method that uses Hierarchical Navigable Small World (HNSW) indexes. The approximate index search method returns results faster than the exact search method.

Exact search

The exact search method is a brute-force search method that compares the distance or similarity between all vectors. This method results in low response speeds but can achieve a 100% recall rate.

You can perform exact search based on the Euclidean distance, inner product distance, or cosine similarity.

Euclidean distance

SELECT ID, l2_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) as score FROM <TABLE_NAME> ORDER BY l2_squared_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) LIMIT <TOPK>;

Inner product distance

SELECT ID, inner_product_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) as score FROM <TABLE_NAME> ORDER BY negative_inner_product_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) LIMIT <TOPK>;

Cosine similarity

SELECT ID, cosine_similarity(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) as score FROM <TABLE_NAME> ORDER BY 
cosine_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) LIMIT <TOPK>;

Parameters:

score: the Euclidean distance, inner product distance, or cosine similarity. Vectors are sorted by Euclidean distance in ascending order, or by inner product distance or cosine similarity in descending order.
<VECTOR_COLUMN_NAME>: the name of the vector column.
<TABLE_NAME>: the name of the vector table.
<TOPK>: the number of top result sets.

Approximate index search

The approximate index search method uses HNSW indexes to search data. This method works fast but returns approximate results at a recall rate of more than 99%.

Syntax

You can perform approximate index search based on the Euclidean distance, inner product distance, or cosine similarity.

Note

Only AnalyticDB for PostgreSQL instances of V6.3.10.18 or later support approximate index search based on the inner product distance or cosine similarity. Make sure that the minor version of your AnalyticDB for PostgreSQL instance is V6.3.10.18 or later. For information about how to view and update the minor version, see Update the minor engine version.

Euclidean distance

SELECT ID, l2_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) as score FROM <TABLE_NAME> ORDER BY <VECTOR_COLUMN_NAME> <-> array[1,2,3...N]::float4[] LIMIT <TOPK>;

Inner product distance

SELECT ID, inner_product_distance(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) as score FROM <TABLE_NAME> ORDER BY <VECTOR_COLUMN_NAME> <#> array[1,2,3...N]::float4[] LIMIT <TOPK>;

Cosine similarity

SELECT ID, cosine_similarity(<VECTOR_COLUMN_NAME>, array[1,2,3...N]::float4[]) as score FROM <TABLE_NAME> ORDER BY <VECTOR_COLUMN_NAME> <=> array[1,2,3...N]::float4[] LIMIT <TOPK>;

Note

The parameter descriptions in approximate index search are the same as those in exact search.
If no indexes are created or no parameters are configured for the FastANN vector search engine, the approximate index search is downgraded to exact search.
When you use a vector index, you must specify or omit ASC in the ORDER BY clause. To sort vectors by distance in descending order, or to sort the results of vector index search by using another column, specify the vector index search as a subquery and add an ORDER BY clause to the parent query.
When you use a vector index, you must specify an operator such as <->, <#>, or <=> in the ORDER BY clause. In addition, the <->, <#>, or <=> operator must work with a vector index that is created by using the corresponding distance measurement. This way, you can use the vector index to accelerate queries. For information about the supported operators and how to use the operators, see Create a vector index.

Examples

In this example, a text knowledge base is used. You can use the vector index search method to search for the source articles of a text. Sample statement:

SELECT id, chunk, intime, url FROM chunks
ORDER BY
    feature <=> array[10,2.0,..., 1536.0]::real[]
LIMIT 100;

You can execute an EXPLAIN statement to check the execution plan of the query.

In the following sample execution plan, Ann Index Scan indicates that a vector index is used.

QUERY PLAN
-------------------------------------------------
 Limit
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Merge Key: ((feature <=> $0))
          ->  Limit
               ->  Ann Index Scan using feature_idx on chunks
                Order By: (feature <=> $0)
 Optimizer: Postgres query optimizer

If you want to search for the source articles of a text within the last month, you can use the integrated search method. Sample statement:

SELECT id, chunk, intime, url FROM chunks WHERE
    intime > '2023-04-01' AND intime <= '2023-05-01'
ORDER BY
    feature <=> array[10,2.0,..., 1536.0]::real[]
LIMIT 100;

References

SQL optimization for vector search

This section describes SQL optimization for vector search. After you create a vector table, you can create multiple vector indexes for a vector column based on your business requirements. The SQL statement and the involved index must match. For example, the <-> operator can work with only indexes that are created by using the Euclidean distance, the <#> operator can work with only indexes that are created by using the inner product distance, and the <=> operator can work with only indexes that are created by using the cosine similarity.

CREATE TABLE test_table (
    id serial primary key,
    feature real[]
) distributed by (id);

CREATE INDEX idx_test_table_feature_l2 ON test_table USING ann(feature) WITH (dim=768, distancemeasure=l2, hnsw_m=64, pq_enable=1);
CREATE INDEX idx_test_table_feature_ip ON test_table USING ann(feature) WITH (dim=768, distancemeasure=ip, hnsw_m=64, pq_enable=1);
CREATE INDEX idx_test_table_feature_cosine ON test_table USING ann(feature) WITH (dim=768, distancemeasure=cosine, hnsw_m=64, pq_enable=1);

If you do not want to return the score value of the vector distance, you can execute the following SQL statements to query vector data:

-- Search vector data based on the Euclidean distance. 
SELECT id 
FROM test_table
ORDER BY feature <-> array[1,2,3 ... 768]::real[]
LIMIT topk;

-- Search vector data based on the inner product distance. 
SELECT id
FROM test_table
ORDER BY feature <#> array[1,2,3 ... 768]::real[]
LIMIT topk;

-- Search vector data based on the cosine similarity. 
SELECT id
FROM test_table
ORDER BY feature <=> array[1,2,3 ... 768]::real[]
LIMIT topk;

If you want to return the score value of the vector distance, you can use the sort result of the vector index to calculate the actual score value of the vector distance. This avoids a complete vector distance calculation process and reduces the amount of time required for calculation. Sample statements:

-- Search vector data based on the Euclidean distance. 
SELECT t.id as id, sqrt(t.score) as score 
FROM 
  (SELECT id, feature <-> array[1,2,3 ... 768]::real[] as score
   FROM test_table
   ORDER BY score LIMIT topk) t;

-- Search vector data based on the inner product distance. 
SELECT t.id as id, (-1 * t.score) as score
FROM
  (SELECT id, feature <#> array[1,2,3 ... 768]::real[] as score
   FROM test_table
   ORDER BY score LIMIT topk) t;

-- Search vector data based on the cosine similarity. 
SELECT t.id as id, (1.0 - t.score) as score
FROM 
  (SELECT id, feature <=> array[1,2,3 ... 768]::real[] as score
   FROM test_table
   ORDER BY score LIMIT topk) t;

If you want to return results that are filtered based on a score range, you can execute the following SQL statements to query vector data. The statements use the sort result of the vector index to calculate the final score value of the vector distance. This reduces the amount of time required for calculation.

-- Search vector data based on the Euclidean distance. 
SELECT t.id as id, sqrt(t.score) as score 
FROM 
  (SELECT id, feature <-> array[1,2,3 ... 768]::real[] as score
   FROM test_table
   ORDER BY score LIMIT topk) t
WHERE score < 100;

-- Search vector data based on the inner product distance. 
SELECT t.id as id, (-1 * t.score) as score
FROM
  (SELECT id, feature <#> array[1,2,3 ... 768]::real[] as score
   FROM test_table
   ORDER BY score LIMIT topk) t
WHERE score > 10;

-- Search vector data based on the cosine similarity. 
SELECT t.id as id, (1.0 - t.score) as score
FROM 
  (SELECT id, feature <=> array[1,2,3 ... 768]::real[] as score
   FROM test_table
   ORDER BY score LIMIT topk) t
WHERE score > 0.5;

Engine parameters related to vector search

Engine parameter	Description	Default value	Valid value
fastann.build_parallel_processes	The number of processes that are built in parallel by using vector indexes. The value of this parameter varies based on the database specifications.	4	[1, 64]
fastann.pq_amp	The amplification factor of vector search result sets when product quantization (PQ) is used to reduce the dimensionality of high-dimensional vectors. You can use this parameter to test the recall rate.	10	[1, 1000]
fastann.hnsw_max_scan_points	The maximum number of scan points when HNSW indexes are used for vector search. You can use this parameter to quickly end the search and test the recall rate.	6000	[1, 6000000]
fastann.hnsw_ef_search	The size of search candidate sets when HNSW indexes are used for vector search. You can use this parameter to test the recall rate.	400	[10, 10000]

You can configure the preceding parameters for sessions.