This topic describes how to create a table for a MaxCompute data source.
Prerequisites
You are familiar with MaxCompute. For more information about MaxCompute, see What is MaxCompute?
The account that you use to log on to the OpenSearch Vector Search Edition console is granted the following permissions on the MaxCompute table that you want to configure: the DESCRIBE, SELECT, and DOWNLOAD permissions on the table and the LABEL permission on the fields of the table.
You can execute the following statements to grant the required permissions to an account:
-- Add an account.
add user ****@aliyun.com;
-- Grant the required permissions to the account.
GRANT describe,select,download ON TABLE table_xxx TO USER ****@aliyun.com;
GRANT describe,select,download ON TABLE table_xxx_done TO USER ****@aliyun.com;
-- If you enable field permission verification for your MaxCompute table, the system prevents you from accessing highly privileged fields when you pull data, and indexes cannot be created for the table. In this case, you must grant your account the permissions to access all fields.
-- Grant permissions on the entire project.
SET LABEL 3 TO USER ****@aliyun.com;
-- Grant permissions on a single table.
GRANT LABEL 3 ON TABLE table_xxx(col1, col2) TO USER ****@aliyun.com;
The fields in your MaxCompute table are of only the following data types: STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME.
For more information about the table creation statement and the parameters for adding a MaxCompute data source, see CREATE TABLE statement for creating a table in a MaxCompute data source.
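The grant statements in the prerequisites follow a fixed pattern, so they can be generated for any table and account. The following Python sketch only formats the statement text and does not connect to MaxCompute. The function name build_grant_statements and its parameters are hypothetical and not part of any MaxCompute or OpenSearch SDK. The sketch assumes that the done table is named <table>_done, as in the statements above.

```python
def build_grant_statements(table, account, label=None, columns=None):
    """Return the MaxCompute statements that grant the permissions listed above.

    Hypothetical helper: it only builds text; run the output in your own
    MaxCompute client after you review it.
    """
    statements = [
        f"ADD USER {account};",
        f"GRANT describe,select,download ON TABLE {table} TO USER {account};",
        # Assumption: the done table is named <table>_done.
        f"GRANT describe,select,download ON TABLE {table}_done TO USER {account};",
    ]
    if label is not None:
        if columns:
            # Label grant on specific fields of a single table.
            column_list = ", ".join(columns)
            statements.append(
                f"GRANT LABEL {label} ON TABLE {table}({column_list}) TO USER {account};"
            )
        else:
            # Access level label for the account across the project.
            statements.append(f"SET LABEL {label} TO USER {account};")
    return statements


for statement in build_grant_statements(
    "table_xxx", "****@aliyun.com", label=3, columns=["col1", "col2"]
):
    print(statement)
```

Omit the label argument if field permission verification is not enabled for the table.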
Create a table
Log on to the OpenSearch Vector Search Edition console. In the left-side navigation pane, click Instances. On the Instances page, find the instance for which you want to create a table and click the instance name or ID. On the instance details page, click Table Management in the left-side pane. On the page that appears, click Add Table.
In the Basic Table Information step of the Create wizard, configure the following parameters and click Next.
Parameters:
Table Name: the name of the table. You can customize the table name.
Data Shards: the number of data shards contained in the table. If you create multiple index tables in an OpenSearch instance, make sure that the index tables contain the same number of shards. Alternatively, make sure that at least one index table contains one shard and other index tables contain the same number of shards.
Number of Resources for Data Updates: the number of resources used for data updates. By default, OpenSearch provides a free quota of two resources for data updates for each data source in an OpenSearch Vector Search Edition instance. Each resource consists of 4 CPU cores and 8 GB of memory. You are charged for resources that exceed the free quota. For more information, see .
Scenario Template: the template that is used to create the table. Valid values: Common Template, Vector: Image Search, and Vector: Semantic Search for Text.
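The shard rule for the Data Shards parameter can be checked before you create a table. The following Python sketch reads the rule as follows: if tables with exactly one shard are ignored, all remaining tables must have the same number of shards. This reading and the function name shard_counts_are_compatible are assumptions made for illustration. The console remains the authority on which combinations are accepted.

```python
def shard_counts_are_compatible(shard_counts):
    """Check the shard counts of all index tables in one instance.

    Assumed reading of the rule: tables with one shard are always allowed,
    and all other tables must share a single shard count.
    """
    if any(count < 1 for count in shard_counts):
        raise ValueError("shard counts must be positive integers")
    # Collect the distinct shard counts of the multi-shard tables.
    multi_shard_counts = {count for count in shard_counts if count != 1}
    return len(multi_shard_counts) <= 1


print(shard_counts_are_compatible([4, 4, 4]))  # all tables have the same count
print(shard_counts_are_compatible([1, 4, 4]))  # one single-shard table
print(shard_counts_are_compatible([2, 4]))     # mixed multi-shard counts
```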
In the Data Synchronization step, configure the following parameters to add a data source, and then click Check to verify the data source information. If the check passes, click Next.
Parameters:
Full Data Source: the type of the data source. Valid values: MaxCompute + API, OSS + API, and API. In this example, MaxCompute + API is selected.
Project: the name of the MaxCompute project that you want to access.
AccessKey: the AccessKey ID of your Alibaba Cloud account or a Resource Access Management (RAM) user within the Alibaba Cloud account.
AccessKey Secret: the AccessKey secret that corresponds to the AccessKey ID.
Table: the name of the MaxCompute table that you want to access.
Partition Key: the partition key of the MaxCompute data source. This parameter is required. Example: ds=20170626.
Timestamp: If incremental data is pushed by using API operations, this parameter specifies the point in time from which the system synchronizes incremental data from the data source. You can select a date and time within the previous 72 hours.
Automatic Reindexing: specifies whether to enable the automatic reindexing feature. If the automatic reindexing feature is enabled, the system automatically performs reindexing for the index table that references the data source each time the system detects a data change in the data source.
If you enable automatic reindexing, you must create a done table. For more information, see the Configure automatic reindexing section of this topic.
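Two of the preceding parameters have a fixed shape that can be checked before you enter them in the console: the partition key in the <key>=<value> form and the timestamp within the previous 72 hours. The following Python sketch shows one way to do this. The function names are hypothetical, and the assumption that both ends of the 72-hour window are inclusive is not confirmed by this topic.

```python
from datetime import datetime, timedelta, timezone


def parse_partition_spec(spec):
    """Split a partition spec such as 'ds=20170626' into (key, value)."""
    key, separator, value = spec.partition("=")
    key, value = key.strip(), value.strip()
    if not separator or not key or not value:
        raise ValueError(f"expected '<key>=<value>', got {spec!r}")
    return key, value


def is_valid_start_time(start, now=None):
    """Return True if start lies within the 72 hours before now.

    Both arguments must be timezone-aware datetime objects.
    Assumption: the window includes both of its end points.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(hours=72) <= start <= now


print(parse_partition_spec("ds=20170626"))
```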
In the Field Configuration step, configure fields for the table and click Next.
The primary key field and vector field are required. The primary key field must be of the INT or STRING type. The vector field must be of the FLOAT type.
By default, the vector field is a multi-value field of the FLOAT type, and its values are separated by the HA3 multi-value delimiter (^]), which is encoded as the single byte \x1D in UTF-8. You can also specify a custom multi-value delimiter.
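To illustrate the default delimiter, the following Python sketch joins the components of a vector into one string with \x1D between the values and splits the string again. The function names are hypothetical, and the number formatting (Python's default float representation) is an assumption. Use the formatting that your own data pipeline writes to the MaxCompute table.

```python
# ^] is the ASCII group separator, which is the single byte 0x1D in UTF-8.
HA3_MULTI_VALUE_DELIMITER = "\x1d"


def encode_vector(values, delimiter=HA3_MULTI_VALUE_DELIMITER):
    """Join vector components into one delimited string."""
    return delimiter.join(repr(float(value)) for value in values)


def decode_vector(text, delimiter=HA3_MULTI_VALUE_DELIMITER):
    """Split a delimited string back into a list of floats."""
    if not text:
        return []
    return [float(part) for part in text.split(delimiter)]


encoded = encode_vector([0.1, 0.2, 0.3])
print(repr(encoded))           # '0.1\x1d0.2\x1d0.3'
print(decode_vector(encoded))  # [0.1, 0.2, 0.3]
```

If you enter a custom multi-value delimiter in the console, pass the same character as the delimiter argument.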
In the Index Schema step, configure indexes for the table and click Next.
Configure the following parameters in the Vector Index section:
The primary key field and vector field are required. The namespace field is optional and can be left empty.
You can configure only the three fixed fields for the Fields Contained parameter and cannot add fields.
Vector Dimension: the dimension of vectors. Specify a vector dimension based on the vector model that you select.
Distance Type: the type of vector distance. Valid values: SquareEuclidean and InnerProduct. Specify a distance type based on the vector model that you select.
Vector Index Algorithm: the algorithm that is used to create the vector index. Valid values: Qc, Linear, and HNSW. Specify an algorithm based on the vector model that you select.
Real-time Indexing: specifies whether to build real-time indexes for incremental data that is pushed by using API operations. Valid values: true and false. Default value: true.
You can also configure parameters for the advanced configurations of the vector index. For more information, see Common configurations of vector indexes.
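The two values of the Distance Type parameter correspond to standard vector measures. The following Python sketch computes both for a pair of vectors so that you can check which measure fits the vector model that you use. It illustrates the mathematical definitions only and makes no claim about how OpenSearch Vector Search Edition computes or ranks scores internally. The function names are hypothetical.

```python
def squared_euclidean(a, b):
    """Sum of squared component differences; smaller means closer."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Sum of component products; larger means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0]
document = [4.0, 6.0]
print(squared_euclidean(query, document))  # 25.0
print(inner_product(query, document))      # 16.0
```

The dimension check mirrors the Vector Dimension parameter: all vectors in one index must have the dimension that you configure.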
In the Confirm step, click Confirm. The table that you configure is automatically created.
To view the creation progress of the table, click Change History in the left-side pane on the instance details page, and then click the Data Source Changes tab.
After the table enters the In Use state, you can run query tests on the Query Test page.
Configure automatic reindexing
Description of a done table: If you enable automatic reindexing when you configure a data source, OpenSearch Vector Search Edition automatically performs reindexing based on the changes in the done table.
Example: When you configure a MaxCompute data source, you specify mytable as the MaxCompute table and ds=20220113 as the partition. After you configure reindexing for the data source for the first time, a new partition that contains the full data of the table is generated every day. Each time a new partition is generated, OpenSearch Vector Search Edition must scan the new partition and automatically perform reindexing based on the data in it. The automatic reindexing feature and the done table meet this requirement.
Procedure
Enable automatic reindexing when you add a data source.
Configure the corresponding done table in MaxCompute. If the name of the MaxCompute table is mytable and the name of the partition key of the mytable table is ds, the name of the done table is mytable_done and the name of the partition key of the done table is ds. The following code block shows how the two tables are displayed in MaxCompute:
odps:sql:xxx> show tables;
InstanceId: xxx
SQL: .
ALIYUN$****@aliyun.com:mytable # The table that stores the full data of the data source.
ALIYUN$****@aliyun.com:mytable_done # The done table that stores the semaphores that trigger automatic synchronization of the full data.
The following figure shows the done table.
You can execute the following statement to create the done table:
create table mytable_done (attribute string) partitioned by (ds string);
When the ds=20220114 partition of the mytable table is generated, configure the done table to trigger OpenSearch Vector Search Edition to perform reindexing.
-- Add a partition.
alter table mytable_done add if not exists partition (ds="20220114");
-- Insert a semaphore to enable automatic full data synchronization.
insert into table mytable_done partition (ds="20220114") select '{"swift_start_timestamp":1642003200}';
The done table contains the following content:
odps:sql:xxx> select * from mytable_done where ds="20220114" limit 1;
InstanceId: xxx
SQL: .
+--------------------------------------+----------+
| attribute                            | ds       |
+--------------------------------------+----------+
| {"swift_start_timestamp":1642003200} | 20220114 |
+--------------------------------------+----------+
After the semaphore for automatic full data synchronization is inserted into the done table, OpenSearch Vector Search Edition scans the semaphore of the done table and automatically triggers reindexing.
Make sure that you specify at least one partition key for the done table. The name of the partition key of the done table must be the same as the name of the partition key of the MaxCompute table. If the partition key of the MaxCompute table is ds, the partition key of the done table must be set to ds.
The done table must contain only one field. The field must be of the STRING type and must be named attribute.
The partition that you add to the done table must exist in the MaxCompute table. For example, if the MaxCompute table contains the ds=20220114, ds=20220115, and ds=20220116 partitions, the partition that you add to the done table must be one of these three partitions.
When you insert data into the done table, the value of the attribute field must be a JSON string, such as {"swift_start_timestamp":1642003200}. The timestamp specifies the start offset for real-time incremental data synchronization.
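The two statements that signal a new partition differ only in the table name, the partition value, and the timestamp, so they can be generated by a script. The following Python sketch builds the statement text from these inputs and does not connect to MaxCompute. The function name build_done_statements is hypothetical. The sketch assumes that swift_start_timestamp is a UNIX timestamp in seconds, which matches the ten-digit value in the example but is not stated explicitly in this topic.

```python
import json
from datetime import datetime, timedelta, timezone


def build_done_statements(table, partition_key, partition_value, start_timestamp):
    """Return the statements that add a done-table partition and its semaphore.

    Hypothetical helper: it only builds text. The done table is named
    <table>_done and uses the same partition key as the source table.
    """
    done_table = f"{table}_done"
    # Compact JSON, which is the form of the attribute value shown above.
    payload = json.dumps(
        {"swift_start_timestamp": int(start_timestamp)}, separators=(",", ":")
    )
    partition = f'{partition_key}="{partition_value}"'
    return [
        f"alter table {done_table} add if not exists partition ({partition});",
        f"insert into table {done_table} partition ({partition}) select '{payload}';",
    ]


# Assumption: the timestamp is in seconds. Interpreted that way, the value in
# the example, 1642003200, is 2022-01-13 00:00:00 in UTC+8.
start = datetime(2022, 1, 13, tzinfo=timezone(timedelta(hours=8)))
for statement in build_done_statements("mytable", "ds", "20220114", start.timestamp()):
    print(statement)
```

Run the generated statements only after the corresponding partition of the source table is completely written, because the semaphore triggers reindexing.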
Usage notes
MaxCompute data sources do not support MaxCompute external tables. You must use an internal table.
The MaxCompute table that you specify when you add a MaxCompute data source must be a partitioned table.
You can use the full data of MaxCompute tables as data sources to build indexes in OpenSearch Vector Search Edition and use API data sources to synchronize incremental data in real time.