Create a table for an OSS data source

Activate OSS

Create a table

Log on to the OpenSearch Vector Search Edition console. In the left-side navigation pane, click Instances. On the Instances page, find the instance for which you want to create a table and click Manage in the Actions column. On the instance details page, click Table Management in the left-side pane. On the page that appears, click Add Table.

In the Basic Table Information step of the Create wizard, configure the following parameters and click Next.

Parameter description:

Table Name: the name of the table. You can customize the table name.
Data Shards: the number of data shards contained in the table. If you create multiple index tables in an OpenSearch instance, make sure that the index tables contain the same number of shards. Alternatively, make sure that at least one index table contains one shard and other index tables contain the same number of shards.
Number of Resources for Data Updates: the number of resources used for data updates. By default, a free quota of two resources for data updates is provided for each index. Each resource consists of 4 CPU cores and 8 GB of memory. You are charged for resources that exceed the free quota. For more information, see Billing overview of OpenSearch Vector Search Edition for the international site (alibabacloud.com).
Scenario Template: the template that is used to create the table. Valid values: Common Template, Vector: Image Search, and Vector: Semantic Search for Text.
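
The shard rule for Data Shards can be checked before you create several index tables in one instance. The following is a minimal sketch of one reading of that rule: tables with exactly one shard are always allowed, and all remaining tables must share a single shard count. The function name `shard_counts_compatible` is hypothetical and is not part of any OpenSearch SDK.

```python
def shard_counts_compatible(shard_counts):
    """Return True if the per-table shard counts follow the rule described above.

    Assumed reading of the rule: single-shard tables may be mixed with other
    tables, and every table with more than one shard must use the same count.
    """
    if any(n < 1 for n in shard_counts):
        raise ValueError("each table needs at least one shard")
    # Collect the distinct shard counts of all tables that are not single-shard.
    multi_shard_counts = {n for n in shard_counts if n != 1}
    return len(multi_shard_counts) <= 1


print(shard_counts_compatible([4, 4, 4]))  # all tables use the same count
print(shard_counts_compatible([1, 4, 4]))  # one single-shard table, the rest equal
print(shard_counts_compatible([2, 4, 4]))  # mixed multi-shard counts
```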

In the Data Synchronization step, configure the following parameters to add a data source, and then click Check to check the data source information. If the check is passed, click Next.

Full Data Source: the type of the data source. Valid values: MaxCompute + API, OSS + API, and API. In this example, OSS + API is selected.
OSS Path: the path that is used to access an OSS object. The path must start with a forward slash (/) and cannot contain special characters, including question marks (?), equal signs (=), and ampersands (&).
Bucket: the name of the OSS bucket.
Timestamp: If incremental data is pushed by using API operations, this parameter specifies the point in time from which the system synchronizes incremental data from the data source. You can select a date and time within the previous 72 hours.

Note

To create an OSS path, perform the following operations: Go to the Buckets page of the OSS console, click the name of the created OSS bucket in the bucket list, and then click Create Directory. In the Create Directory panel, configure the Directory Name parameter. In this example, /opensearch_index_data/ is created.

To obtain the name of the created OSS bucket, perform the following operations: Go to the Buckets page of the OSS console, and view the bucket name in the Bucket Name column.
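
The OSS Path constraints can be checked locally before you click Check in the console. The following sketch covers only the rules stated above (a leading forward slash, and no question marks, equal signs, or ampersands). The console may reject other special characters that are not listed here, and `validate_oss_path` is a hypothetical helper name.

```python
FORBIDDEN_CHARS = "?=&"  # only the characters named in the parameter description


def validate_oss_path(path):
    """Return a list of problems found in the path; an empty list means no problem was found."""
    problems = []
    if not path.startswith("/"):
        problems.append("path must start with a forward slash (/)")
    found = sorted(set(path) & set(FORBIDDEN_CHARS))
    if found:
        problems.append("path contains forbidden characters: " + " ".join(found))
    return problems


print(validate_oss_path("/opensearch_index_data/"))
print(validate_oss_path("opensearch_index_data?v=1"))
```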

In the Field Configuration step, configure fields for the table and click Next.

In this example, the pk and embeddings fields are configured. For more information about sample data, see oss_test.txt.

CMD=add
pk=999000
embeddings=0.0^]0.003925714^]0.009814286^]0.003925714^]0.00

For more information about the index schema, see the Data files for indexing section of this topic.

Note

The primary key field and vector field are required. For the primary key field, you must set the Type parameter to an integer type or the STRING type and select the option button in the Primary Key column. For the vector field, you must set the Type parameter to FLOAT and select the check box in the Vector Field column.
By default, the vector field is a multi-value field of the FLOAT type. The values of the vector field are separated by the HA3 multi-value delimiter (^]), which is encoded as \x1D in UTF-8. You can also use a custom delimiter to separate the values.
If a field does not exist or is empty in the source data, the system automatically sets the field to the default value. By default, a field of the numeric type is set to 0 and a field of the STRING type is set to an empty string. You can also specify custom default values.
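
The two notes on vector encoding and default values can be illustrated with a short sketch. It joins vector components with the \x1D delimiter and fills missing or empty fields with 0 or an empty string unless a custom default is given. The helper names and the "numeric" and "string" labels are assumptions made for this example. The code only mirrors the behavior described above and is not the system's own implementation.

```python
MULTI_VALUE_SEP = "\x1d"  # HA3 multi-value delimiter, displayed as ^]


def encode_vector(values, sep=MULTI_VALUE_SEP):
    """Join vector components into a single multi-value field string."""
    return sep.join(repr(float(v)) for v in values)


def apply_defaults(record, field_types, custom_defaults=None):
    """Fill fields that are missing or empty, as described in the note above.

    field_types maps a field name to "numeric" or "string" (hypothetical labels).
    """
    custom_defaults = custom_defaults or {}
    filled = dict(record)
    for name, kind in field_types.items():
        if filled.get(name) in (None, ""):
            builtin_default = 0 if kind == "numeric" else ""
            filled[name] = custom_defaults.get(name, builtin_default)
    return filled


vector = encode_vector([0.0, 0.003925714, 0.009814286])
print(repr(vector))

types = {"pk": "numeric", "title": "string", "price": "numeric"}
print(apply_defaults({"pk": 999000, "title": ""}, types))
print(apply_defaults({"pk": 999000}, types, custom_defaults={"price": -1}))
```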

In the Index Schema step, configure indexes for the table and click Next.

Configure the following parameters for the vector index:

The primary key field and vector field are required. The namespace field is optional and can be left empty.
You can configure only the three fixed fields for the Fields Contained parameter. You cannot add fields.
Vector Dimension: the dimension of vectors. Specify a vector dimension based on the vector model that you select.
Distance Type: the type of vector distance. Valid values: SquareEuclidean and InnerProduct. Specify a distance type based on the vector model that you select.
Vector Index Algorithm: the algorithm that is used to create the vector index. Valid values: Qc, Linear, and HNSW. Specify an algorithm based on the vector model that you select.
Real-time Indexing: specifies whether to build real-time indexes for incremental data that is pushed by using API operations. Valid values: true and false. Default value: true.
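
For orientation, the two Distance Type values correspond to the following standard mathematical definitions. This is a plain-Python sketch of the formulas only. How the service uses these values internally for scoring or ranking is not described in this topic, and the function names are chosen for this example.

```python
def square_euclidean(a, b):
    """Squared Euclidean distance: the sum of squared component differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Inner (dot) product: the sum of component-wise products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0]
doc = [4.0, 6.0]
print(square_euclidean(query, doc))  # (1-4)^2 + (2-6)^2
print(inner_product(query, doc))     # 1*4 + 2*6
```

Both functions reject vectors of different lengths, which matches the requirement that every vector in the index uses the configured Vector Dimension.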

You can also configure parameters for the advanced configurations of the vector index. For more information, see Common configurations of vector indexes.

In the Confirm step, click Confirm. The table that you configure is automatically created.

To view the creation progress of the table, click Change History in the left-side pane on the instance details page.

After the table enters the In Use state, you can run query tests on the Query Test page.

Data files for indexing

A file serves as the data source of indexing. The file must be encoded in the UTF-8 format. This section describes the standard input format of a data file for indexing.

The following sample code provides an example of the content of a complete data file named standard_sample.data:

CMD=add^_
PK=12345321^_
url=http://www.aliyun.com/index.html^_
title=Alibaba Cloud Computing Co., Ltd.^_
body=xxxxxx xxx^_
time=3123423421^_
multi_value_field=1234^]324^]342^_
bidwords=mp3^\price=35.8^Ptime=13867236221^]mp4^\price=32.8^Ptime=13867236221^_
^^
CMD=delete^_
PK=12345321^_

This data file contains add and delete commands. Each command consists of multiple lines, and each line is a key-value pair. Commands are separated by '^^\n', key-value pairs are separated by '^_\n', and the values of a multi-value field are separated by '^]'. The following sections describe the file delimiters and command formats.

File delimiters

C++ encoding	ASCII in hexadecimal notation	Description	Display pattern in Emacs or Vi	Input method in Emacs	Input method in Vi
"\x1F\n"	1F0A	The key-value pair delimiter.	^_ (followed by a line break)	C-q C-7	C-v C-7
"\x1E\n"	1E0A	The command delimiter.	^^ (followed by a line break)	C-q C-6	C-v C-6
"\x1D"	1D	The multi-value delimiter.	^]	C-q C-5	C-v C-5
"\x1C"	1C	The section weight identifier.	^\	C-q C-4	C-v C-4
"\x1D"	1D	The section delimiter.	^]	C-q C-5	C-v C-5
"\x03"	03	The field delimiter of child documents.	^C	C-q C-c	C-v C-c

Command formats
- The add command is used to add a document to the index. The first line of the add command must be CMD=add, followed by the fields of the document. The fields do not need to appear in the same order as in the index schema, but every field that appears in the add command must be defined in the index schema.

CMD=add^_
PK=12345321^_
url=http://www.aliyun.com/index.html^_
title=Alibaba Cloud Computing Co., Ltd.^_
body=xxxxxx xxx^_
time=3123423421^_
multi_value_field=1234^]324^]342^_
bidwords=mp3^\price=35.8^Ptime=13867236221^]mp4^\price=32.8^Ptime=13867236221^_
^^
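
An add command in this layout can be generated programmatically. The following is a minimal sketch that serializes one document with the delimiter bytes from the File delimiters table and rejects fields that are not in the schema. The function name `format_add_command` and the schema set are hypothetical and serve only as an illustration.

```python
KV_SEP = "\x1f\n"         # key-value pair delimiter, displayed as ^_ plus a line break
CMD_SEP = "\x1e\n"        # command delimiter, displayed as ^^ plus a line break
MULTI_VALUE_SEP = "\x1d"  # multi-value delimiter, displayed as ^]


def format_add_command(fields, schema_fields):
    """Serialize one add command; fields maps field names to values."""
    unknown = [name for name in fields if name not in schema_fields]
    if unknown:
        raise ValueError(f"fields not defined in the index schema: {unknown}")
    lines = ["CMD=add"]  # the first line must be CMD=add
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            # Multi-value fields are joined with the multi-value delimiter.
            value = MULTI_VALUE_SEP.join(str(v) for v in value)
        lines.append(f"{name}={value}")
    return "".join(line + KV_SEP for line in lines) + CMD_SEP


schema = {"PK", "title", "multi_value_field"}  # hypothetical schema field names
record = {
    "PK": 12345321,
    "title": "Alibaba Cloud Computing Co., Ltd.",
    "multi_value_field": [1234, 324, 342],
}
encoded = format_add_command(record, schema)
print(repr(encoded))
```

When you write the result to a file in Python, open the file with encoding="utf-8" and newline="" so that the line breaks in the delimiters are written unchanged on every platform.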

- The delete command is used to remove a document from the index. The first line of the delete command must be CMD=delete, followed by the field that is defined as the primary key field in the index schema and the field that is used for hash partitioning. If the two fields are the same, you need to specify only one of them.

CMD=delete^_
PK=12345321^_
^^
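
Before you upload a data file, it can be useful to read it back and confirm that each command is well formed. The following sketch splits file text on the command and key-value delimiters and checks that every delete command carries the primary key field. The function name `parse_commands` and the default field name PK are assumptions taken from the samples in this topic. Field values are returned as raw strings, so multi-value fields still contain the \x1D delimiter.

```python
KV_SEP = "\x1f\n"   # key-value pair delimiter (^_ plus a line break)
CMD_SEP = "\x1e\n"  # command delimiter (^^ plus a line break)


def parse_commands(text, primary_key="PK"):
    """Split data-file text into a list of (command, fields) tuples."""
    commands = []
    for block in text.split(CMD_SEP):
        pairs = [pair for pair in block.split(KV_SEP) if pair.strip()]
        if not pairs:
            continue  # skip the empty remainder after the last delimiter
        if not pairs[0].startswith("CMD="):
            raise ValueError(f"command does not start with CMD=: {pairs[0]!r}")
        cmd = pairs[0][len("CMD="):]
        fields = {}
        for pair in pairs[1:]:
            # Split at the first "=" only, because values such as URLs may contain "=".
            key, _, value = pair.partition("=")
            fields[key] = value
        if cmd == "delete" and primary_key not in fields:
            raise ValueError("delete command is missing the primary key field")
        commands.append((cmd, fields))
    return commands


sample = (
    "CMD=add\x1f\nPK=12345321\x1f\nmulti_value_field=1234\x1d324\x1d342\x1f\n\x1e\n"
    "CMD=delete\x1f\nPK=12345321\x1f\n\x1e\n"
)
for cmd, fields in parse_commands(sample):
    print(cmd, fields)
```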

Usage notes:

You must activate OSS in the same region as the purchased OpenSearch Vector Search Edition instance.
OpenSearch Vector Search Edition does not support Anywhere OSS buckets.
When you configure an OSS data source, the system automatically creates a service-linked role named AliyunServiceRoleForSearchEngine. If the service-linked role already exists, the system does not create another role. OpenSearch uses this role to access your cloud resources to implement related features.