Configure a MaxCompute data source - OpenSearch

MaxCompute is an open computing platform. If you want to import data generated by MaxCompute to OpenSearch Industry Algorithm Edition, you can connect a MaxCompute data source to your application in OpenSearch Industry Algorithm Edition. After reindexing is triggered in the application, OpenSearch automatically obtains full data from tables in the MaxCompute data source. To obtain incremental data from the MaxCompute data source, you must use the APIs or SDKs of OpenSearch.

Configure the AccessKey pair for an Alibaba Cloud account

After you configure a MaxCompute data source in OpenSearch Industry Algorithm Edition, OpenSearch Industry Algorithm Edition downloads data from MaxCompute tables by using the AccessKey pair that you enter. Therefore, before you configure a MaxCompute data source, you must configure an AccessKey pair for your account.

Note

Make sure that the MaxCompute project is created within the Alibaba Cloud account that you use to log on to the OpenSearch console.

You can use the AccessKey pair of your Alibaba Cloud account to access tables in MaxCompute projects that are created within your Alibaba Cloud account.
To mitigate security risks, you can use the AccessKey pair of a Resource Access Management (RAM) user. To create a RAM user and grant permissions to the RAM user, perform the following steps:

Create a RAM user within your Alibaba Cloud account. For more information, see Create and authorize RAM users.
Log on to the MaxCompute console and add a member for the RAM user.

Assign a role to the added member based on your requirements.

Run the list users; command on the DataStudio page to view the account to which the added member belongs. For more information, see DataWorks.

Copy the account name and grant permissions to the account. In the following statements, xxx indicates the account name obtained in the previous step.

-- 1. Grant the CreateInstance and List permissions on the project.
GRANT CreateInstance, List ON PROJECT zy_ts_test TO USER xxx;

-- 2. Grant the SELECT, DESCRIBE, and DOWNLOAD permissions on MaxCompute tables.
GRANT Select, Describe, Download ON TABLE people_info TO USER xxx;

-- 3. Optional. Grant label-based permissions on MaxCompute tables.
SET LABEL 2 TO USER xxx;

-- Query the permissions of a specific user and information about the role that is assigned to the user.
show grants for xxx;

After you create a RAM user and grant permissions to the RAM user, you can configure a MaxCompute data source in the OpenSearch Industry Algorithm Edition console.

Configure the MaxCompute data source

On the Configure Application page, click Use Data Source in the Application Schema Creation Method section.

In the Select Data Source panel, select MaxCompute as the data source.

Click Connect to Database. In the Connect to Database dialog box, configure the Project Name, AccessKey ID, and AccessKey Secret parameters.

Click Connect. Then, select one or more tables that you want to configure.

The system automatically maps corresponding fields. You can fine-tune the fields based on your business requirements. Click Next.

Important

When you configure the application schema, you must create a primary table and a unique primary key field for each table.

Configure the index schema. You can select an appropriate analyzer based on your search requirements. For more information, see Index schema. Then, click Next.

Configure a data source. In this step, you can configure field mappings, partition information, and concurrency control for data synchronization.

5.1. Configure field mappings: Click Edit in the Actions column. OpenSearch Industry Algorithm Edition provides multiple data source plug-ins for MaxCompute data. If you need to use a plug-in, click the plus sign (+) in the Content Conversion column when you configure a field mapping. This way, the source field is converted before it is synchronized to OpenSearch Industry Algorithm Edition. If the plug-in does not work due to errors such as configuration errors or connection failures, the source field is synchronized to the destination field without conversion.

Configure the plug-in.

Important

The following types of MaxCompute data are supported: BIGINT, DOUBLE, BOOLEAN, DATETIME, STRING, and DECIMAL.
The system automatically converts data of the DATETIME type in MaxCompute tables to milliseconds. You must set the data type to INT for the corresponding OpenSearch Industry Algorithm Edition fields.
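
The DATETIME conversion described above yields a timestamp in milliseconds. As a rough illustration of the arithmetic only (not the converter that OpenSearch uses), the following Python sketch converts a DATETIME string to epoch milliseconds. The function name `datetime_to_millis` is hypothetical, and interpreting the value as UTC is an assumption, because the text does not state which time zone the system applies.

```python
from datetime import datetime, timezone


def datetime_to_millis(value: str) -> int:
    """Convert a 'YYYY-MM-DD HH:MM:SS' string to epoch milliseconds.

    Assumption: the value is interpreted as UTC.
    """
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) * 1000


print(datetime_to_millis("2015-05-10 00:00:00"))  # 1431216000000
```

A value of this size is what the destination INT field needs to hold.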

5.2. Configure partition information: OpenSearch Industry Algorithm Edition allows you to specify the partitions whose data you want to import based on the characteristics of MaxCompute data. Regular expressions are supported. You can click Reindex on the Instance Details page to create a scheduled reindexing task. This way, incremental partition data can be imported every day.

Regular expression: Equal signs (=), commas (,), semicolons (;), and double vertical bars (||) are reserved characters of the system. For example, ds=%Y%m%d || -1 days specifies automatic import of the full data of the specified partition of the previous day.

Note

ds specifies the name of the partition field. No invisible characters, such as spaces, are allowed on either side of the equal sign (=).

The following section describes how to configure partition conditions of MaxCompute:

1: You can specify multiple partition filter rules by separating them with semicolons (;). For example, pt=1;pt=2 matches all partitions that meet the partition filter rule pt=1 or pt=2.
2: You can set multiple partition fields in a partition filter rule by separating them with commas (,). For example, pt1=1,pt2=2,pt3=3 matches all partitions that meet all the partition filter conditions pt1=1, pt2=2, and pt3=3. Functions such as %Y%m%d || -1 days do not support multiple partition fields, but support a single partition field.

Example: The pt partitions in a MaxCompute table contain ds child partitions.

Specify multiple partitions: pt=1;pt=2 specifies synchronization of all data in pt=1 and pt=2 partitions.
Set multiple partition fields: pt=1,ds=1 specifies synchronization of the data in the ds=1 child partition of the pt=1 partition.
Rules such as pt=1,ds=%Y%m%d || -1 days and pt=1;pt=%Y%m%d || -1 days are not supported.
3: The value of a partition field can be an asterisk (*), which indicates that the value of the partition field can be an arbitrary value. In this case, this field is optional in the filter rule.
4: The value of a partition field can contain a regular expression. For example, pt=[0-9]* matches all partitions whose pt value is a number.
5: The value of a partition field supports time matching. The filter rule is in the following format: pt=Partition field value that contains formatted time || Time interval expression. For example, in ds=%Y%m%d || -1 days, the partition field is ds, %Y%m%d is the formatted time that produces a value such as 20150510, and -1 days indicates that the data of the previous day is required.
5.1 Formatted time parameters can be standard time format parameters.
5.2 The time interval expression can be in the following format: +/- n week|weeks|day|days|hour|hours|minute|minutes|second|seconds|microsecond|microseconds. The plus sign (+) indicates n weeks, days, hours, minutes, seconds, or microseconds after a scheduled reindexing task is created. The minus sign (-) indicates n weeks, days, hours, minutes, seconds, or microseconds before a scheduled reindexing task is created.
5.3 By default, the system converts time parameters in all filter rules by using the +0 days condition. Therefore, the field values that are used for filtering cannot contain time format parameters as literal strings. For example, for tasks that are created on a Wednesday, pt=%abc matches the partitions whose pt value is Wedbc instead of %abc, because %a is converted to the abbreviated weekday name.
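
The static parts of the rules above (semicolons, commas, asterisks, and regular expressions) can be sketched in Python as follows. This is only an illustration of the described semantics, not the matcher that OpenSearch runs. The function names are hypothetical, time expressions (item 5) are not handled, and matching the regular expression against the whole partition value is an assumption.

```python
import re


def parse_rules(expression: str) -> list[dict[str, str]]:
    """Split 'pt=1,ds=2;pt=3' into [{'pt': '1', 'ds': '2'}, {'pt': '3'}]."""
    rules = []
    for rule in expression.split(";"):        # semicolons separate rules (OR)
        conditions = {}
        for condition in rule.split(","):     # commas separate fields (AND)
            field, _, pattern = condition.partition("=")
            conditions[field] = pattern
        rules.append(conditions)
    return rules


def _condition_holds(partition: dict[str, str], field: str, pattern: str) -> bool:
    if pattern == "*":                        # asterisk: any value is accepted
        return True
    value = partition.get(field)
    # Assumption: the pattern must match the entire partition value.
    return value is not None and re.fullmatch(pattern, value) is not None


def matches(partition: dict[str, str], expression: str) -> bool:
    """Return True if the partition satisfies at least one rule."""
    return any(
        all(_condition_holds(partition, f, p) for f, p in conditions.items())
        for conditions in parse_rules(expression)
    )


print(matches({"pt": "2"}, "pt=1;pt=2"))             # True
print(matches({"pt": "1", "ds": "2"}, "pt=1,ds=1"))  # False
print(matches({"pt": "123"}, "pt=[0-9]*"))           # True
```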

The following list describes the time format parameters that can be contained in filter rules:

%d: the sequence number of the day in the month. Valid values: [01, 31].
%H: the hour in a 24-hour system. Valid values: [00, 23].
%m: the sequence number of the month in the year. Valid values: [01, 12].
%M: the minute. Valid values: [00, 59].
%S: the second. Valid values: [00, 61].
%y: the year represented by two digits.
%Y: the year represented by four digits.
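
To make the time matching rule more concrete, the following Python sketch expands a single-field rule such as ds=%Y%m%d || -1 days into a concrete partition value. It is a minimal illustration under stated assumptions, not the OpenSearch implementation: the function name `resolve_time_rule` is hypothetical, the format parameters are assumed to behave like Python's `strftime` directives, and `task_time` stands in for the time at which the scheduled reindexing task is created.

```python
import re
from datetime import datetime, timedelta

# Matches expressions such as "-1 days" or "+2 hours".
_OFFSET = re.compile(
    r"([+-])\s*(\d+)\s*(week|day|hour|minute|second|microsecond)s?"
)


def resolve_time_rule(rule: str, task_time: datetime) -> str:
    """Expand 'ds=%Y%m%d || -1 days' relative to task_time."""
    left, separator, right = rule.partition("||")
    offset = timedelta(0)  # same effect as the default "+0 days"
    if separator:
        match = _OFFSET.fullmatch(right.strip())
        if match is None:
            raise ValueError(f"unsupported time interval expression: {right!r}")
        sign, amount, unit = match.groups()
        delta = timedelta(**{unit + "s": int(amount)})
        offset = delta if sign == "+" else -delta
    field, _, pattern = left.strip().partition("=")
    # Assumption: format parameters follow strftime semantics.
    return f"{field}={(task_time + offset).strftime(pattern)}"


print(resolve_time_rule("ds=%Y%m%d || -1 days", datetime(2015, 5, 11)))
# ds=20150510
```

Because the conversion is applied even when no interval is given, a literal value that happens to contain a format parameter is rewritten as well, which is the behavior that the pt=%abc example describes.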

5.3. Configure concurrency control for data synchronization:

If you select Use DONE File, you can upload a DONE file to control when OpenSearch pulls full data. This ensures data integrity. Before OpenSearch pulls full data from MaxCompute, OpenSearch checks whether the DONE file of the current day exists. If the file does not exist, OpenSearch waits for the DONE file to appear. The default timeout period is 1 hour.
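
The waiting behavior described above follows a common poll-until-timeout pattern. The Python sketch below illustrates that pattern only. It is not OpenSearch code: the function name, the `done_file_exists` callback, and the 60-second polling interval are all assumptions, and the text does not say what OpenSearch does after the timeout expires.

```python
import time
from typing import Callable


def wait_for_done_file(
    done_file_exists: Callable[[], bool],
    timeout_seconds: float = 3600.0,  # the 1-hour default mentioned above
    poll_seconds: float = 60.0,       # assumption: interval is not documented
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the DONE file appears, or False after the timeout."""
    deadline = clock() + timeout_seconds
    while True:
        if done_file_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The `clock` and `sleep` parameters exist only so that the loop can be exercised without actually waiting.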

You must download the installation package of the MaxCompute client from the official website of MaxCompute. The file name of the package is odps_clt_release_64.tar.gz.
You must have the CreateResource permission on the required MaxCompute project.
After you install the MaxCompute client, run the following command on your MaxCompute client. The DONE file is named in the $prefix_%Y-%m-%d format. $prefix specifies the prefix of the name of the DONE file. By default, the prefix of the name of the DONE file is the table name. %Y-%m-%d specifies the date of a scheduled reindexing task. The minimum interval for scheduled reindexing tasks is one day.
```
odpscmd -u accessid -p accesskey --project=<prj_name> -e "add file <done file>;"
```
For more information about how to use the MaxCompute client, see MaxCompute client (odpscmd).
The content of a DONE file is in the JSON format. A DONE file needs to contain only the timestamp, in milliseconds, of the current full data.
The timestamp in a DONE file indicates the point in time from which incremental data is pulled. If you do not specify the timestamp, incremental data is appended from the start time of the scheduled reindexing task. OpenSearch retains only the incremental data of the most recent three days. Therefore, the point in time that is specified by the timestamp must be within the previous three days.
For example, full data is generated at 09:00 on the current day, MaxCompute finishes processing the full data at 10:00, and the scheduled reindexing task in OpenSearch starts at 10:30. To ensure data integrity, the incremental data generated after 09:00 on the current day must be appended, so you must specify in the DONE file the timestamp, in milliseconds, that corresponds to 09:00 on the current day. Otherwise, only the incremental data that is generated after 10:30, which is the default start time of the scheduled reindexing task, is appended, and the incremental data from 09:00 to 10:30 is lost. Proceed with caution. If no incremental data is generated, you do not need to specify the timestamp.
The following sample code shows an example of the content of a DONE file for an advanced application. The timestamp in the DONE file is used to append incremental data. You can use a similar method to specify the timestamp in DONE files for standard applications.

{
"timestamp":"1234567890000"
}
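
As a minimal sketch of how such a file could be prepared, the following Python code builds a DONE file name in the $prefix_%Y-%m-%d format and the JSON content with a millisecond timestamp, and rejects timestamps older than three days. The function names are hypothetical, the table name people_info is reused from the earlier grant example as the prefix, and the use of UTC is an assumption because the text does not specify a time zone.

```python
import json
from datetime import datetime, timedelta, timezone


def done_file_name(prefix: str, task_date: datetime) -> str:
    """Build the '$prefix_%Y-%m-%d' file name for a reindexing task date."""
    return f"{prefix}_{task_date.strftime('%Y-%m-%d')}"


def done_file_content(data_time: datetime, now: datetime) -> str:
    """Return the JSON body whose timestamp marks where incremental data starts."""
    age = now - data_time
    if not timedelta(0) <= age <= timedelta(days=3):
        raise ValueError("timestamp must fall within the previous three days")
    millis = round(data_time.timestamp() * 1000)
    # The sample above stores the timestamp as a string, so do the same here.
    return json.dumps({"timestamp": str(millis)})


# Full data generated at 09:00, reindexing task starts at 10:30 (UTC assumed).
generated = datetime(2015, 5, 10, 9, 0, tzinfo=timezone.utc)
task_start = datetime(2015, 5, 10, 10, 30, tzinfo=timezone.utc)
print(done_file_name("people_info", task_start))  # people_info_2015-05-10
print(done_file_content(generated, task_start))   # {"timestamp": "1431248400000"}
```

The resulting content can be saved to a local file with the generated name and then uploaded by using the add file command shown earlier.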

Priorities of a DONE file and the data time:

The data time of MaxCompute data sources is required and takes precedence over a DONE file.
If you create only one version for an application, you need to specify only the data time. In this case, you cannot use a DONE file alone.
If you need to use a scheduled reindexing task, you must specify both the data time and a DONE file. The data time takes precedence over the DONE file for the first version. The DONE file takes precedence over the data time for subsequent versions.

Usage notes:

Important

MaxCompute data sources support only full synchronization, but do not support incremental synchronization.