Import data to an ApsaraDB for SelectDB instance by using DataWorks - ApsaraDB for SelectDB

ApsaraDB for SelectDB allows you to use the Data Integration feature of DataWorks to import data from a table by using SelectDB Writer. This topic describes how to use DataWorks to synchronize data from a MySQL data source to an ApsaraDB for SelectDB instance.

Usage notes

If you use the Data Integration feature of DataWorks to import data to an ApsaraDB for SelectDB instance, the data must be written offline.
You cannot write fields of the BITMAP, HyperLogLog (HLL), or QUANTILE_STATE data type.

Develop a data synchronization task

Add data sources

Before you configure a data synchronization task, you must add a MySQL data source and an ApsaraDB for SelectDB data source to DataWorks.

For more information about how to add a MySQL data source, see MySQL data source.

For more information about how to add an ApsaraDB for SelectDB data source, see Add and manage data sources. The following table describes the parameters that are used to add an ApsaraDB for SelectDB data source.

Parameter	Description
Data Source Name	The name of the data source.
JDBC URL	The Java Database Connectivity (JDBC) URL in the `jdbc:mysql://<ip>:<port>/<dbname>` format. To obtain the virtual private cloud (VPC) endpoint or public endpoint and MySQL port of an ApsaraDB for SelectDB instance, perform the following operations: Log on to the ApsaraDB for SelectDB console and go to the Instance Details page of the instance whose information you want to view. In the Network Information section of the Basic Information tab, view the values of the VPC Endpoint or Public Endpoint parameter and the MySQL Port parameter. Example: `jdbc:mysql://selectdb-cn-4xl3jv1**.selectdbfe.rds.aliyuncs.com:9030/test_db` Note** If the MySQL instance and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint for this parameter. If the MySQL instance and the ApsaraDB for SelectDB instance reside in different VPCs, use the public endpoint for this parameter.
HTTP address	The IP address and port number that are used to access the ApsaraDB for SelectDB instance over HTTP. The value is in the `<ip>:<port>` format. To obtain the VPC endpoint or public endpoint and HTTP port of an ApsaraDB for SelectDB instance, perform the following operations: Log on to the ApsaraDB for SelectDB console and go to the Instance Details page of the instance whose information you want to view. In the Network Information section of the Basic Information tab, view the values of the VPC Endpoint or Public Endpoint parameter and the HTTP Port parameter. Example: `selectdb-cn-4xl3jv1**.selectdbfe.rds.aliyuncs.com:8080` Note** If the MySQL instance and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint for this parameter. If the MySQL instance and the ApsaraDB for SelectDB instance reside in different VPCs, use the public endpoint for this parameter.
Username	The username of the owner account of the ApsaraDB for SelectDB instance.
Password	The password of the owner account of the ApsaraDB for SelectDB instance.

Important

To ensure that the data source can be added, you must add the IP address of the resource group in DataWorks to the IP address whitelist of the data source such as ApsaraDB for SelectDB. For more information, see Configure an IP address whitelist. When you configure IP address whitelists, you must add the IP addresses of different types of resource groups to different IP address whitelists. For example, the IP addresses of an exclusive resource group for Data Integration and a shared resource group for Data Integration must be added to different IP address whitelists.

Configure a batch synchronization task to synchronize data from a single table

You can configure a batch synchronization task by using the codeless UI or the code editor. For more information, see the following topics:

Configure a batch synchronization task by using the codeless UI
Configure a batch synchronization task by using the code editor
For more information about the example on how to configure a batch synchronization task by using the code editor and the related parameters, see the Sample code and parameters section of this topic.

Sample code and parameters

Configure a batch synchronization task by using the code editor

If you configure a batch synchronization task by using the code editor, you must write the script based on the format requirements in the code editor. For more information, see Configure a batch synchronization task by using the code editor.

Code for MySQL Reader and SelectDB Writer

{
	"type":"job",
	"version":"2.0",
	"steps":[
		{
			"stepType":"mysql",
			"parameter":{
				"column":[
          "<id>",
          "<table_id>",
          "<table_no>",
          "<table_name>",
          "<table_status>"
				],
				"connection":[
					{
						"datasource":"<mysql_datasource>",
						"table":[
							"<mysql_table_name>"
						]
					}
				],
				"where":"",
				"splitPk":"",
				"encoding":"UTF-8"
			},
			"name":"Reader",
			"category":"reader"
		},
		{
			"stepType":"selectdb",
			"parameter":{
        "postSql":[
        ],
        "preSql":[
        ],
        "username": "<selectdb_username>",
        "password": "<selectdb_password>",
        "loadUrl":[
          "<ip:port>"
        ],
        "column":[
          "<id>",
          "<table_id>",
          "<table_no>",
          "<table_name>",
          "<table_status>"
				],
        "connection":[
					{
						"datasource":"<selectdb_datasource>",
						"table":[
							"<selectdb_table_name>"
						]
					}
				],
        "maxBatchRows":1000000,
        "loadProps":{
          "format":"csv",
          "column_separator": "\\x01",
          "line_delimiter": "\\x02"
        }
      },
			"name":"Writer",
			"category":"writer"
		}
	],
	"setting":{
		"errorLimit":{
			"record":"0"
		},
		"speed":{
			"throttle":false,
			"concurrent":1
		}
	},
	"order":{
		"hops":[
			{
				"from":"Reader",
				"to":"Writer"
			}
		]
	}
}

Parameters

Parameter	Description
datasource	The name of the data source. This parameter is required. You can add a data source when you configure a batch synchronization task by using the code editor. The value of this parameter must be the same as the name of the data source to be added.
table	The name of the table from which data is to be synchronized. This parameter is required.
column	The names of the columns to which you want to write data in the destination table. Separate the column names with commas (,). This parameter is required. Example: `"column":["id","name","age"]`. If you want to write data to all columns in the destination table, set this parameter to an asterisk `()`. Example: `"column":[""]`.
loadUrl	The URL that is used to access the ApsaraDB for SelectDB instance. This parameter is required. Format: `ip:port`. `ip` indicates the VPC endpoint of the ApsaraDB for SelectDB instance, and `port` indicates the HTTP port of the ApsaraDB for SelectDB instance. Example: `selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080`.
username	The username that is used to access the ApsaraDB for SelectDB database. This parameter is required.
password	The password that is used to access the ApsaraDB for SelectDB database. This parameter is required.
preSql	The SQL statement that you want to execute before the batch synchronization task is run. For example, you can set this parameter to the SQL statement that is used to delete outdated data. You can execute only one SQL statement in the codeless UI and multiple SQL statements in the code editor.
postSql	The SQL statement that you want to execute after the synchronization task is run. For example, you can set this parameter to the SQL statement that is used to add a timestamp. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor.
maxBatchRows	The maximum number of rows that you can write to the destination table at a time. Default value: 500000.
loadProps	The parameters of the COPY INTO request, which are used to specify the format of the data to be imported. Default value: json. If the loadProps parameter is left empty or is set to `{}`, the JSON format is used by default. If the JSON format is used, you can configure only the following settings for the ApsaraDB for SelectDB instance. The `strip_outer_array` parameter is set to true. `"loadProps": { "format": "json", "strip_outer_array":true }` If you want to import data in the CSV format, you can configure the following settings to specify the CSV format and configure the row and column delimiters. If you do not specify the row and column delimiters, the imported data is converted into strings by default. `\t` is used as the column delimiter and `\n` is used as the row delimiter to form a CSV file for importing data to the ApsaraDB for SelectDB instance. `"loadProps": { "format":"csv", "column_separator": "\\x01", "line_delimiter": "\\x02" }`

Supported data types

The following table describes the data types that are supported by ApsaraDB for SelectDB.

Data type	Offline data writing of SelectDB Writer
INT	Supported
BIGINT	Supported
LARGEINT	Supported
SMALLINT	Supported
TINYINT	Supported
BOOLEAN	Supported
DECIMAL	Supported
DOUBLE	Supported
FLOAT	Supported
CHAR	Supported
VARCHAR	Supported
STRING	Supported
DATE	Supported
DATEV2	Supported
DATETIME	Supported
DATETIMEV2	Supported
ARRAY	Supported
JSONB	Supported
BITMAP	Not supported
HLL	Not supported
QUANTILE_STATE	Not supported

References

What is DataWorks?