OpenSearch data source - DataWorks - Alibaba Cloud Documentation Center

DataWorks provides OpenSearch Writer for you to write data to OpenSearch data sources. This topic describes the capabilities of writing data to OpenSearch data sources in offline mode.

Supported OpenSearch versions

OpenSearch V3 support relies on a second-party package with the Maven coordinates com.aliyun.opensearch:aliyun-sdk-opensearch:2.1.3.
To use OpenSearch Writer, you must install JDK 1.6-32 or later. You can run the `java -version` command to view the JDK version.
The following Alibaba Cloud OpenSearch editions are supported: Industry Algorithm Edition, LLM-Based Conversational Search Edition, High-performance Search Edition, Vector Search Edition, and Retrieval Engine Edition.

Limits

OpenSearch Writer supports only serverless resource groups and exclusive resource groups for Data Integration. OpenSearch Writer does not support custom resource groups. We recommend that you use serverless resource groups.
The columns in OpenSearch are unordered. OpenSearch Writer writes data in strict accordance with the order of the specified columns. If you specify fewer columns than the OpenSearch table contains, the remaining columns in OpenSearch are set to the default value or null.
For example, an OpenSearch table contains Columns a, b, and c, and you want to write data to Columns b and c. You can set the column parameter to ["c","b"]. In this case, OpenSearch Writer imports the first and second columns of the source data that is obtained from a reader to Columns c and b in the OpenSearch table. Column a in the OpenSearch table is set to the default value or null.
You can use only the code editor to configure a batch synchronization task to synchronize data to OpenSearch data sources.
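
The column-order behavior described in the limits above can be sketched in Python. This only illustrates the documented positional mapping and is not the OpenSearch Writer implementation; the function name `map_record` and the use of `None` to stand in for "default value or null" are assumptions made for this example.

```python
def map_record(source_values, column, table_columns):
    """Map the i-th source value to the i-th configured column.

    Columns that exist in the table but are not listed in `column`
    are filled with None (standing in for "default value or null").
    """
    row = {name: None for name in table_columns}
    for name, value in zip(column, source_values):
        row[name] = value
    return row


# The table has Columns a, b, and c; the task writes only c and b, in that order.
row = map_record(["first", "second"], ["c", "b"], ["a", "b", "c"])
print(row)
```

The first source value lands in Column c and the second in Column b, while Column a is left unset.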

Data type mappings

OpenSearch Writer supports most OpenSearch data types. Make sure that the data types of your database are supported. The following table lists the data type mappings based on which OpenSearch Writer converts data types.

Category	OpenSearch data type
Integer	INT
Floating point	DOUBLE and FLOAT
String	TEXT, LITERAL, and SHORT_TEXT
Date and time	INT
Boolean	LITERAL
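
If you want to check a destination schema against this table before you run a task, the mappings can be transcribed into a small lookup. The following is a minimal sketch; the names `OPENSEARCH_TYPES` and `is_supported` are hypothetical helpers and are not part of DataWorks or OpenSearch.

```python
# Lookup transcribed from the data type mapping table above.
OPENSEARCH_TYPES = {
    "Integer": ["INT"],
    "Floating point": ["DOUBLE", "FLOAT"],
    "String": ["TEXT", "LITERAL", "SHORT_TEXT"],
    "Date and time": ["INT"],
    "Boolean": ["LITERAL"],
}


def is_supported(category, opensearch_type):
    """Return True if the table lists this OpenSearch type for the category."""
    return opensearch_type in OPENSEARCH_TYPES.get(category, [])


print(is_supported("Date and time", "INT"))
print(is_supported("Boolean", "INT"))
```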

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

For more information about the configuration procedure, see Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.

Additional information

Handling of column configuration errors

To prevent data loss caused by redundant columns and ensure high data reliability, OpenSearch Writer returns an error if the number of columns that are to be written is more than that in the destination table. For example, an OpenSearch table contains Columns a, b, and c. If more than three columns need to be written to the table, OpenSearch Writer returns an error.
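
The check described above amounts to comparing two column counts. The following Python sketch shows the idea under that assumption; `check_column_count` is a hypothetical name, and the error message is illustrative rather than the message that OpenSearch Writer actually returns.

```python
def check_column_count(column, table_columns):
    """Raise if more columns are configured than the destination table has."""
    if len(column) > len(table_columns):
        raise ValueError(
            f"{len(column)} columns configured, but the table has only "
            f"{len(table_columns)}"
        )


table_columns = ["a", "b", "c"]
check_column_count(["c", "b"], table_columns)  # accepted: fewer columns than the table
try:
    check_column_count(["a", "b", "c", "d"], table_columns)
except ValueError as err:
    print("rejected:", err)
```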

Table configuration

OpenSearch Writer can write data to only one table at a time.

Task rerunning

After a node is rerun, data is overwritten based on IDs. Therefore, the data written to OpenSearch must contain an ID column. An ID is a unique identifier of a row in OpenSearch. The existing data that has the same ID as the new data is overwritten.
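
The overwrite-by-ID behavior on a rerun can be modeled with a dictionary keyed by ID. This is a simplified sketch, not the OpenSearch push API: the in-memory `table` dictionary, the `write_records` function, and the `id` field name are assumptions chosen for illustration.

```python
def write_records(table, records, id_field="id"):
    """Write records into `table`, overwriting any row that has the same ID."""
    for record in records:
        if id_field not in record:
            raise KeyError(f"record has no {id_field!r} column: {record}")
        table[record[id_field]] = dict(record)


table = {}
# First run writes two rows.
write_records(table, [{"id": 1, "title": "old"}, {"id": 2, "title": "kept"}])
# The rerun sends a row with an existing ID, which replaces the earlier row.
write_records(table, [{"id": 1, "title": "new"}])
print(table)
```

Because rows are keyed by ID, a rerun does not create duplicates. A record without an ID cannot be matched to an existing row, which is why the ID column is required.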

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for OpenSearch Writer of Industry Algorithm Edition, LLM-Based Conversational Search Edition, and High-performance Search Edition

{
    "type": "job",
    "version": "1.0",
    "configuration": {
        "reader": {},
        "writer": {
            "plugin": "opensearch",
            "parameter": {
                "accessId": "*********",
                "accessKey": "********",
                "host": "http://yyyy.aliyuncs.com",
                "indexName": "datax_xxx",
                "table": "datax_yyy",
                "column": [
                    "appkey",
                    "id",
                    "title",
                    "gmt_create",
                    "pic_default"
                ],
                "batchSize": 500,
                "writeMode": "add",
                "version": "v2",
                "ignoreWriteError": false
            }
        }
    }
}

Parameters in code for OpenSearch Writer of Industry Algorithm Edition, LLM-Based Conversational Search Edition, and High-performance Search Edition

Parameter	Description	Required	Default value
accessId	The AccessKey ID of the account that you use to connect to OpenSearch.	Yes	No default value
accessKey	The AccessKey secret of the account that you use to connect to OpenSearch.	Yes	No default value
host	The endpoint of OpenSearch. You can obtain the endpoint in the Alibaba Cloud Management Console.	Yes	No default value
indexName	The name of the OpenSearch project.	Yes	No default value
table	The name of the table to which you want to write data. You can specify only one table because Data Integration cannot import data to multiple tables at a time.	Yes	No default value
column	The names of the columns to which you want to write data. If you want to write data to all the columns in the destination table, set this parameter to an asterisk (*), such as `"column":["*"]`. If you want to write data only to specific columns in the destination table, set this parameter to the column names. Separate the column names with commas (,), such as `"column":["id","name"]`. OpenSearch Writer can filter columns and change the order of columns. For example, an OpenSearch table contains three columns: a, b, and c. If you want to write data only to Columns c and b, you can set the column parameter to `["c","b"]`. During data synchronization, Column a is automatically set to null.	Yes	No default value
batchSize	The number of data records to write at a time. OpenSearch Writer writes multiple data records to OpenSearch at a time. OpenSearch provides the data query feature. In most cases, the transactions per second (TPS) of OpenSearch is not high. Set this parameter based on the resources available for the account that is used to connect to OpenSearch. In most cases, the size of a data record must be less than 1 MB, and the total size of the data records to write at a time must be less than 2 MB.	Required only for writing data to a partitioned table	300
writeMode	The write mode. To ensure that write operations are idempotent, set this parameter to add or update. add: If a failure occurs and the synchronization task is rerun, OpenSearch Writer deletes the existing data records and inserts the new data records into OpenSearch. This is an atomic operation. update: OpenSearch Writer updates the existing data records based on the new data records. This is also an atomic operation. Note: Writing multiple data records to OpenSearch at a time is not an atomic operation, so some of the data records in a batch may fail to be written. Exercise caution when you configure the writeMode parameter. OpenSearch V3 does not support the update mode.	Yes	No default value
ignoreWriteError	Specifies whether to ignore the write operations that fail. Example: `"ignoreWriteError":true`. If OpenSearch Writer writes multiple data records to OpenSearch at a time, this parameter specifies whether to ignore write operations that fail in the current batch. If you set this parameter to true, OpenSearch Writer continues to perform other write operations. If you set this parameter to false, the synchronization task ends, and OpenSearch Writer returns an error. We recommend that you use the default value.	No	false
version	The version of OpenSearch, such as `"version":"v3"`. We recommend that you use OpenSearch V3 because the push operation has many limits in OpenSearch V2.	No	v2
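
Before you submit a script, it can help to check the writer parameters against the preceding table. The following Python sketch applies the documented required flags, default values, and the rule that OpenSearch V3 does not support the update mode. The function name `normalize_parameter` and the error messages are hypothetical, and the sample values are placeholders rather than real endpoints or credentials.

```python
REQUIRED = ("accessId", "accessKey", "host", "indexName", "table", "column", "writeMode")
DEFAULTS = {"batchSize": 300, "ignoreWriteError": False, "version": "v2"}


def normalize_parameter(parameter):
    """Check required keys and fill in the documented default values."""
    missing = [key for key in REQUIRED if key not in parameter]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    merged = {**DEFAULTS, **parameter}
    if merged["writeMode"] not in ("add", "update"):
        raise ValueError("writeMode must be add or update")
    if merged["version"] == "v3" and merged["writeMode"] == "update":
        raise ValueError("OpenSearch V3 does not support the update mode")
    return merged


parameter = normalize_parameter({
    "accessId": "<AccessKey ID>",          # placeholder
    "accessKey": "<AccessKey secret>",     # placeholder
    "host": "http://yyyy.aliyuncs.com",
    "indexName": "datax_xxx",
    "table": "datax_yyy",
    "column": ["id", "title"],
    "writeMode": "add",
})
print(parameter["batchSize"], parameter["version"], parameter["ignoreWriteError"])
```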

Code for OpenSearch Writer of Vector Search Edition and Retrieval Engine Edition

{
  "stepType": "opensearch",
  "parameter": {
    "indexName": "",
    "column": [
      {
        "name": "col3double",
        "type": "DOUBLE"
      },
      {
        "name": "col2vector",
        "type": "MULTI_FLOAT"
      }
    ],
    "datasource": "zm_test_vector_01",
    "batchSize": "500",
    "table": "demotable"
  },
  "name": "Writer",
  "category": "writer"
}

Parameters in code for OpenSearch Writer of Vector Search Edition and Retrieval Engine Edition

Parameter	Description	Required	Default value
table	The name of the table to which you want to write data. You can specify only one table because Data Integration cannot import data to multiple tables at a time.	Yes	No default value
column	The names of the columns to which you want to write data. If you want to write data to all the columns in the destination table, set this parameter to an asterisk (*), such as `"column":["*"]`. If you want to write data only to specific columns in the destination table, set this parameter to the column names. Separate the column names with commas (,), such as `"column":["id","name"]`. OpenSearch Writer can filter columns and change the order of columns. For example, an OpenSearch table contains three columns: a, b, and c. If you want to write data only to Columns c and b, you can set the column parameter to `["c","b"]`. During data synchronization, Column a is automatically set to null.	Yes	No default value
batchSize	The number of data records to write at a time. OpenSearch Writer writes multiple data records to OpenSearch at a time. OpenSearch provides the data query feature. In most cases, the transactions per second (TPS) of OpenSearch is not high. Set this parameter based on the resources available for the account that is used to connect to OpenSearch. In most cases, the size of a data record must be less than 1 MB, and the total size of the data records to write at a time must be less than 2 MB.	Required only for writing data to a partitioned table	300
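
To see how batchSize interacts with the size limits in the preceding description, the following Python sketch groups records into batches. It assumes that record size is measured as the UTF-8 length of the JSON-serialized record and that 1 MB equals 1024 × 1024 bytes; neither detail is stated in this topic, and `make_batches` is a hypothetical helper rather than part of OpenSearch Writer.

```python
import json

MAX_RECORD_BYTES = 1024 * 1024      # assumed meaning of "1 MB"
MAX_BATCH_BYTES = 2 * 1024 * 1024   # assumed meaning of "2 MB"


def make_batches(records, batch_size=300):
    """Yield lists of records that respect batch_size and both size limits."""
    batch, batch_bytes = [], 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        if size >= MAX_RECORD_BYTES:
            raise ValueError("a single record must be smaller than 1 MB")
        # Start a new batch when the count or total size limit would be reached.
        if batch and (len(batch) >= batch_size or batch_bytes + size >= MAX_BATCH_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


records = [{"id": i} for i in range(7)]
print([len(batch) for batch in make_batches(records, batch_size=3)])
```

With large records, the 2 MB limit closes a batch before batchSize is reached, so a smaller batchSize may be appropriate when individual records are big.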