In search scenarios that involve large amounts of data or require fast data access, such as database query optimization, machine learning, data mining, image and video search, and spatial data query, you can create vector indexes to accelerate vector search. This way, you can accelerate data analysis, optimize search jobs, and improve the query performance and response efficiency of the system.
Background information
AnalyticDB for PostgreSQL vector databases use the FastANN vector search engine to implement the mainstream Hierarchical Navigable Small World (HNSW) algorithm. The FastANN vector search engine uses the segmented paging storage of PostgreSQL and stores only the vector column pointers in indexes to significantly reduce the storage that is occupied by vector indexes. The vector search engine uses the product quantization (PQ) feature to reduce the dimensionality of high-dimensional vectors. This way, low-dimensional vectors are stored in indexes to reduce index lookups when you insert or query vectors and improve vector search performance.
Syntax
CREATE INDEX [INDEX_NAME]
ON [SCHEMA_NAME].[TABLE_NAME]
USING ANN(COLUMN_NAME)
WITH (DIM=<DIMENSION>,
DISTANCEMEASURE=<MEASURE>,
HNSW_M=<M>,
HNSW_EF_CONSTRUCTION=<EF_CONSTURCTION>,
PQ_ENABLE=<PQ_ENABLE>,
PQ_SEGMENTS=<PQ_SEGMENTS>,
PQ_CENTERS=<PQ_CENTERS>,
EXTERNAL_STORAGE=<EXTERNAL_STORAGE>);
Parameters:
INDEX_NAME: the name of the index.
SCHEMA_NAME: the name of the schema or namespace.
TABLE_NAME: the name of the table.
COLUMN_NAME: the name of the vector index column.
The following table describes other vector index parameters.
Vector index parameter
Required
Description
Default
Value range
DIM
Yes
The number of vector dimensions. This parameter is used to check vectors when vectors are inserted. If the number of vector dimensions does not match, the system displays an error message.
0
[1, 8192]
DISTANCEMEASURE
No
The distance or similarity measurement. Valid values:
L2: uses the squared Euclidean distance function to create indexes. This measurement is suitable for image similarity search scenarios.
Formula:
IP: uses the negative inner product distance function to create indexes. This measurement is used to replace cosine similarity after vector normalization.
Formula:
COSINE: uses the cosine distance function to create indexes. This measurement is suitable for text similarity search scenarios.
Formula:
NoteIf you set the DISTANCEMEASURE parameter to IP or COSINE, make sure that the minor version of the AnalyticDB for PostgreSQL instance is 6.3.10.18 or later. For information about how to view and update the minor version, see Update the minor version of an instance.
To achieve the best performance, we recommend that you store the normalized vector data in databases and use the inner product distance measurement to create indexes.
When you use the inner product distance measure to create indexes, you can also query the Euclidean distance and cosine similarity. For more information, see Perform vector search.
For information about user-defined functions (UDFs) for vector search, see the "References" section of this topic.
L2
(L2, IP, COSINE)
HNSW_M
No
The maximum number of neighbors for the HNSW algorithm.
32
[10, 1000]
HNSW_EF_CONSTRUCTION
No
The size of the candidate set when the HNSW algorithm is used to create indexes.
600
[40, 4000]
PQ_ENABLE
No
Specifies whether to enable the PQ feature for dimensionality reduction. The PQ feature uses existing vector sample data for training. If the number of rows of data is less than 500,000, we recommend that you do not specify this parameter.
0
[0, 1]
PQ_SEGMENTS
No
The number of segments that are used in the K-means clustering algorithm after you enable the PQ feature. This parameter takes effect only if you set the PQ_ENABLE parameter to 1.
If the DIM parameter divided by eight is an integer, you do not need to specify this parameter. Otherwise, you must specify this parameter.
The value of the PQ_SEGMENTS parameter must be greater than 0. The DIM parameter divided by the PQ_SEGMENTS parameter must be an integer.
0
[1, 256]
PQ_CENTERS
No
The number of centroids that are used in the K-means clustering algorithm after you enable the PQ feature. This parameter takes effect only if you set the PQ_ENABLE parameter to 1.
2048
[256, 8192]
EXTERNAL_STORAGE
No
Specifies whether to use the memory mapping technology to create HNSW indexes. Valid values:
0: uses segmented paging storage to create indexes. This method uses the shared buffer of PostgreSQL for caching and supports the DELETE and UPDATE operations.
1: uses the memory mapping technology to create indexes. This method supports marked deletions of indexes and allows you to perform a small number of DELETE and UPDATE operations for AnalyticDB for PostgreSQL instances of V6.6.2.3 or later.
0
[0, 1]
Example
In this example, a text knowledge base is used. An article in the knowledge base is split into chunks, converted into embedding vectors, and then stored in a database. The following table describes the fields that are contained in a table named chunks.
Field | Data type | Description |
id | SERIAL | The serial number. |
chunk | VARCHAR(1024) | The text chunk. |
intime | TIMESTAMP | The time when the article is stored in the database. |
url | VARCHAR(1024) | The URL of the article to which the text chunk belongs. |
feature | REAL[] | The embedding vector of the text chunk. |
Create a table to store vectors.
CREATE TABLE chunks ( id SERIAL PRIMARY KEY, chunk VARCHAR(1024), intime TIMESTAMP, url VARCHAR(1024), feature REAL[] ) DISTRIBUTED BY (id);
Set the storage mode of the vector column to PLAIN to reduce data scanning costs and achieve higher performance.
ALTER TABLE chunks ALTER COLUMN feature SET STORAGE PLAIN;
Create a vector index for the vector column.
-- Create a vector index based on the Euclidean distance. CREATE INDEX idx_chunks_feature_l2 ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=l2, hnsw_m=64, pq_enable=1); -- Create a vector index based on the inner product distance. CREATE INDEX idx_chunks_feature_ip ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=ip, hnsw_m=64, pq_enable=1); -- Create a vector index based on the cosine similarity. CREATE INDEX idx_chunks_feature_cosine ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);
NoteYou can create multiple vector columns in each vector table and multiple vector indexes on each vector column. To avoid invalid indexes, you must create vector indexes based on your business requirements.
The vector search methods must match the vector indexes that are created by using the corresponding distance measurements. For example, the
<->
operator works with the vector index created by using the Euclidean distance, the<#>
operator works with the vector index created by using the inner product distance, and the<=>
operator works with the vector index created by using the cosine similarity. If the operators do not work with the corresponding vector indexes, the vector search method is downgraded to exact search (brute-force search).Use the
ANN
access method that is provided by the FastANN vector search extension. If you enable the vector search engine optimization feature for an AnalyticDB for PostgreSQL instance, the FastANN vector search extension is automatically created in the AnalyticDB for PostgreSQL instance.If the
You must specify an operator class or define a default operator class for the data type
error message is displayed, the vector search engine optimization feature is not enabled for the AnalyticDB for PostgreSQL instance. In this case, enable the feature and try again. For more information, see Enable or disable vector search engine optimization.
Create an index for structured data columns to accelerate hybrid search.
CREATE INDEX ON chunks(intime);
Index creation modes
You can create vector indexes in one of the following modes:
Streaming mode
Create an empty table and then create a vector index for the table. This way, the streaming index is created in real time when you import vector data. This mode is suitable for real-time vector search scenarios. However, you cannot quickly import vector data in this mode.
Asynchronous mode
Create an empty table, import vector data to the table, and then create a vector index for the imported vector data. This mode is suitable for scenarios in which you want to import large amounts of vector data.
AnalyticDB for PostgreSQL allows you to concurrently create vector indexes in asynchronous mode. You can use the Grand Unified Configuration (GUC) parameter fastann.build_parallel_processes to specify the number of concurrent index creation processes.
For example, you can create an index in asynchronous mode by using Data Management (DMS). Sample code:
-- For example, an AnalyticDB for PostgreSQL instance whose compute node specifications are 8 cores and 32 GB memory is used.
-- To prevent timeout errors, set the statement_timeout and idle_in_transaction_session_timeout parameters to 0.
-- Set the fastann.build_parallel_processes parameter to 8 to ensure that the CPU resources of compute nodes are fully utilized. The value of the fastann.build_parallel_processes parameter must be the same as the number of CPU cores in the compute node specifications.
-- In DMS, you must specify values for parameters in the same line as the index creation statement.
SET statement_timeout = 0;SET idle_in_transaction_session_timeout = 0;SET fastann.build_parallel_processes = 8;CREATE INDEX ON chunks USING ann(feature)
WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);
You can create an index in asynchronous mode by using the psql tool. Sample code:
-- For example, an AnalyticDB for PostgreSQL instance whose compute node specifications are 8 cores and 32 GB memory is used.
-- Step 1: Store the following content in the create_index.sql file:
SET statement_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET fastann.build_parallel_processes = 8;
CREATE INDEX ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);
-- Step 2: Run the following command to create an index in parallel:
psql -h gp-xxxx-master.gpdb.rds.aliyuncs.com -p 5432 -d dbname -U username -f create_index.sql
If other services are running on the AnalyticDB for PostgreSQL instance, you cannot set the number of concurrent index creation processes to the number of CPU cores in the compute node specifications.