All Products
Search
Document Center

ApsaraDB for SelectDB:Import data by using DataWorks

Last Updated:May 09, 2024

ApsaraDB for SelectDB allows you to use the Data Integration feature of DataWorks to import data from a table by using SelectDB Writer. This topic describes how to use DataWorks to synchronize data from a MySQL data source to an ApsaraDB for SelectDB instance.

Usage notes

  • If you use the Data Integration feature of DataWorks to import data to an ApsaraDB for SelectDB instance, the data must be written offline.

  • You cannot write fields of the BITMAP, HyperLogLog (HLL), or QUANTILE_STATE data type.

Develop a data synchronization task

Add data sources

Before you configure a data synchronization task, you must add a MySQL data source and an ApsaraDB for SelectDB data source to DataWorks.

  1. For more information about how to add a MySQL data source, see MySQL data source.

  2. For more information about how to add an ApsaraDB for SelectDB data source, see Add and manage data sources. The following table describes the parameters that are used to add an ApsaraDB for SelectDB data source.

    Parameter

    Description

    Data Source Name

    The name of the data source.

    JDBC URL

    The Java Database Connectivity (JDBC) URL in the jdbc:mysql://<ip>:<port>/<dbname> format.

    To obtain the virtual private cloud (VPC) endpoint or public endpoint and MySQL port of an ApsaraDB for SelectDB instance, perform the following operations: Log on to the ApsaraDB for SelectDB console and go to the Instance Details page of the instance whose information you want to view. In the Network Information section of the Basic Information tab, view the values of the VPC Endpoint or Public Endpoint parameter and the MySQL Port parameter.

    Example: jdbc:mysql://selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:9030/test_db

    Note

    If the MySQL instance and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint for this parameter. If the MySQL instance and the ApsaraDB for SelectDB instance reside in different VPCs, use the public endpoint for this parameter.

    HTTP address

    The IP address and port number that are used to access the ApsaraDB for SelectDB instance over HTTP. The value is in the <ip>:<port> format.

    To obtain the VPC endpoint or public endpoint and HTTP port of an ApsaraDB for SelectDB instance, perform the following operations: Log on to the ApsaraDB for SelectDB console and go to the Instance Details page of the instance whose information you want to view. In the Network Information section of the Basic Information tab, view the values of the VPC Endpoint or Public Endpoint parameter and the HTTP Port parameter.

    Example: selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080

    Note

    If the MySQL instance and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint for this parameter. If the MySQL instance and the ApsaraDB for SelectDB instance reside in different VPCs, use the public endpoint for this parameter.

    Username

    The username of the owner account of the ApsaraDB for SelectDB instance.

    Password

    The password of the owner account of the ApsaraDB for SelectDB instance.

    Important

    To ensure that the data source can be added, you must add the IP address of the resource group in DataWorks to the IP address whitelist of the data source such as ApsaraDB for SelectDB. For more information, see Configure an IP address whitelist. When you configure IP address whitelists, you must add the IP addresses of different types of resource groups to different IP address whitelists. For example, the IP addresses of an exclusive resource group for Data Integration and a shared resource group for Data Integration must be added to different IP address whitelists.

Configure a batch synchronization task to synchronize data from a single table

You can configure a batch synchronization task by using the codeless UI or the code editor. For more information, see the following topics:

Sample code and parameters

Configure a batch synchronization task by using the code editor

If you configure a batch synchronization task by using the code editor, you must write the script based on the format requirements in the code editor. For more information, see Configure a batch synchronization task by using the code editor.

Code for MySQL Reader and SelectDB Writer

{
	"type":"job",
	"version":"2.0",
	"steps":[
		{
			"stepType":"mysql",
			"parameter":{
				"column":[
          "<id>",
          "<table_id>",
          "<table_no>",
          "<table_name>",
          "<table_status>"
				],
				"connection":[
					{
						"datasource":"<mysql_datasource>",
						"table":[
							"<mysql_table_name>"
						]
					}
				],
				"where":"",
				"splitPk":"",
				"encoding":"UTF-8"
			},
			"name":"Reader",
			"category":"reader"
		},
		{
			"stepType":"selectdb",
			"parameter":{
        "postSql":[
        ],
        "preSql":[
        ],
        "username": "<selectdb_username>",
        "password": "<selectdb_password>",
        "loadUrl":[
          "<ip:port>"
        ],
        "column":[
          "<id>",
          "<table_id>",
          "<table_no>",
          "<table_name>",
          "<table_status>"
				],
        "connection":[
					{
						"datasource":"<selectdb_datasource>",
						"table":[
							"<selectdb_table_name>"
						]
					}
				],
        "maxBatchRows":1000000,
        "loadProps":{
          "format":"csv",
          "column_separator": "\\x01",
          "line_delimiter": "\\x02"
        }
      },
			"name":"Writer",
			"category":"writer"
		}
	],
	"setting":{
		"errorLimit":{
			"record":"0"
		},
		"speed":{
			"throttle":false,
			"concurrent":1
		}
	},
	"order":{
		"hops":[
			{
				"from":"Reader",
				"to":"Writer"
			}
		]
	}
}

Parameters

Parameter

Description

datasource

The name of the data source. This parameter is required. You can add a data source when you configure a batch synchronization task by using the code editor. The value of this parameter must be the same as the name of the data source to be added.

table

The name of the table from which data is to be synchronized. This parameter is required.

column

The names of the columns to which you want to write data in the destination table. Separate the column names with commas (,). This parameter is required. Example: "column":["id","name","age"]. If you want to write data to all columns in the destination table, set this parameter to an asterisk (*). Example: "column":["*"].

loadUrl

The URL that is used to access the ApsaraDB for SelectDB instance. This parameter is required. Format: ip:port. ip indicates the VPC endpoint of the ApsaraDB for SelectDB instance, and port indicates the HTTP port of the ApsaraDB for SelectDB instance. Example: selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080.

username

The username that is used to access the ApsaraDB for SelectDB database. This parameter is required.

password

The password that is used to access the ApsaraDB for SelectDB database. This parameter is required.

preSql

The SQL statement that you want to execute before the batch synchronization task is run. For example, you can set this parameter to the SQL statement that is used to delete outdated data. You can execute only one SQL statement in the codeless UI and multiple SQL statements in the code editor.

postSql

The SQL statement that you want to execute after the synchronization task is run. For example, you can set this parameter to the SQL statement that is used to add a timestamp. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor.

maxBatchRows

The maximum number of rows that you can write to the destination table at a time. Default value: 500000.

loadProps

The parameters of the COPY INTO request, which are used to specify the format of the data to be imported. Default value: json. If the loadProps parameter is left empty or is set to {}, the JSON format is used by default. If the JSON format is used, you can configure only the following settings for the ApsaraDB for SelectDB instance. The strip_outer_array parameter is set to true.

"loadProps": {
   "format": "json",
   "strip_outer_array":true
}

If you want to import data in the CSV format, you can configure the following settings to specify the CSV format and configure the row and column delimiters. If you do not specify the row and column delimiters, the imported data is converted into strings by default. \t is used as the column delimiter and \n is used as the row delimiter to form a CSV file for importing data to the ApsaraDB for SelectDB instance.

"loadProps": {
   "format":"csv",
   "column_separator": "\\x01",
   "line_delimiter": "\\x02"
}

Supported data types

The following table describes the data types that are supported by ApsaraDB for SelectDB.

Data type

Offline data writing of SelectDB Writer

INT

Supported

BIGINT

Supported

LARGEINT

Supported

SMALLINT

Supported

TINYINT

Supported

BOOLEAN

Supported

DECIMAL

Supported

DOUBLE

Supported

FLOAT

Supported

CHAR

Supported

VARCHAR

Supported

STRING

Supported

DATE

Supported

DATEV2

Supported

DATETIME

Supported

DATETIMEV2

Supported

ARRAY

Supported

JSONB

Supported

BITMAP

Not supported

HLL

Not supported

QUANTILE_STATE

Not supported

References

What is DataWorks?