By default, AnalyticDB for MySQL allows you to access and import MaxCompute data by using external tables in tunnel record API mode. You can also use the tunnel arrow API mode. Compared with the tunnel record API mode, the tunnel arrow API mode reads MaxCompute data in a column-oriented manner, which improves the efficiency of data access and processing.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A MaxCompute project is created in the same region as the AnalyticDB for MySQL cluster.
Elastic Network Interface (ENI) is enabled for the AnalyticDB for MySQL cluster.
Note: You can log on to the AnalyticDB for MySQL console, choose in the left-side navigation pane of the cluster details page, and then turn on ENI in the Network Information section.
The CIDR blocks of the virtual private cloud (VPC) in which the AnalyticDB for MySQL cluster resides are added to an IP address whitelist of the MaxCompute project.
Note: You can log on to the AnalyticDB for MySQL console to view the VPC ID on the Cluster Information page. Then, you can log on to the VPC console and find the VPC ID on the VPC page to view the CIDR blocks. For information about how to configure a MaxCompute whitelist, see Manage IP address whitelists.
The minor version of the AnalyticDB for MySQL cluster is 3.2.2.1 or later if you use the tunnel arrow API mode to access and import MaxCompute data.
Note: To query the minor version of an AnalyticDB for MySQL Data Lakehouse Edition cluster, execute the following statement:
SELECT adb_version();
To update the minor version of a cluster, contact technical support.
Sample Data
In this example, a MaxCompute project named test_adb and a MaxCompute table named person are used. Execute the following statement to create the person table:
CREATE TABLE IF NOT EXISTS person (
id int,
name varchar(1023),
age int)
partitioned by (dt string);
Execute the following statement to create a partition in the person table:
ALTER TABLE person
ADD
PARTITION (dt='202207');
Execute the following statement to insert data into the partition:
INSERT INTO test_adb.person
partition (dt='202207')
VALUES (1,'james',10),(2,'bond',20),(3,'jack',30),(4,'lucy',40);
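To confirm that the sample rows are written to the partition, you can optionally run a query such as the following against the MaxCompute project:
-- Verify the sample data in the dt='202207' partition.
SELECT * FROM test_adb.person WHERE dt = '202207';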
(Optional) Enable the arrow API feature
By default, AnalyticDB for MySQL allows you to access and import MaxCompute data in tunnel record API mode. If you want to use the tunnel arrow API mode to access and import MaxCompute data, you must first enable the arrow API feature. After you enable the arrow API feature, AnalyticDB for MySQL allows you to access and import MaxCompute data in tunnel arrow API mode.
Enable the arrow API feature
You can use one of the following methods to enable the arrow API feature:
Execute the following SET statement to enable the arrow API feature at the cluster level:
SET ADB_CONFIG <config_name>=<value>;
Use the following hint to enable the arrow API feature at the query level:
/*<config_name>=<value>*/ SELECT * FROM table;
Parameters
config_name | Description
ODPS_TUNNEL_ARROW_ENABLED | Specifies whether to enable the arrow API feature. Valid values: true and false.
ODPS_TUNNEL_SPLIT_BY_SIZE_ENABLED | Specifies whether to enable the dynamic splitting feature. Valid values: true and false.
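For example, based on the syntax above and the ODPS_TUNNEL_ARROW_ENABLED parameter, you can enable the arrow API feature as follows. This is only a sketch; the query-level hint assumes the adb_external_db.test_adb external table that is created later in this topic:
-- Enable the arrow API feature at the cluster level.
SET ADB_CONFIG ODPS_TUNNEL_ARROW_ENABLED=true;
-- Enable the arrow API feature only for a single query by using a hint.
/*ODPS_TUNNEL_ARROW_ENABLED=true*/ SELECT * FROM adb_external_db.test_adb;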
Procedure
You can import data by using the regular import or elastic import method. By default, the regular import method is used. If you use the regular import method, data is read from compute nodes and indexes are created on storage nodes. The regular import method consumes computing and storage resources. If you use the elastic import method, data is read and indexes are created by Serverless Spark jobs. The elastic import method consumes resources of job resource groups. Only AnalyticDB for MySQL clusters of V3.1.10.0 or later that have job resource groups support the elastic import method. Compared with the regular import method, the elastic import method consumes fewer resources. This reduces the impact on real-time data reads and writes, and improves resource isolation and data import efficiency. For more information, see Data import methods.
Regular import
Go to the SQL editor.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose .
Create an external database.
CREATE EXTERNAL DATABASE adb_external_db;
Create an external table. In this example, an external table named test_adb is created.
CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_db.test_adb (
    id int,
    name varchar(1023),
    age int,
    dt string
) ENGINE='ODPS'
TABLE_PROPERTIES='{
    "accessid":"LTAILd4****",
    "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun.com/api",
    "accesskey":"4A5Q7ZVzcYnWMQPysX****",
    "partition_column":"dt",
    "project_name":"test_adb",
    "table_name":"person"
}';
Note: The AnalyticDB for MySQL external table must use the same names, number, and order of fields as the MaxCompute table. The data types of the fields must be compatible between the two tables.
For information about the parameters that are used to create an external table, see CREATE EXTERNAL TABLE.
Query data.
SELECT * FROM adb_external_db.test_adb;
Sample result:
+------+-------+------+--------+
| id   | name  | age  | dt     |
+------+-------+------+--------+
|    1 | james |   10 | 202207 |
|    2 | bond  |   20 | 202207 |
|    3 | jack  |   30 | 202207 |
|    4 | lucy  |   40 | 202207 |
+------+-------+------+--------+
4 rows in set (0.35 sec)
Perform the following steps to import data from MaxCompute to AnalyticDB for MySQL:
Create a database in the AnalyticDB for MySQL cluster.
CREATE DATABASE adb_demo;
Create a table in the AnalyticDB for MySQL database to store data that is imported from MaxCompute.
Note: The created table must use the same number and order of fields as the external table that is created in Step 3. The data types of the fields must be compatible between the two tables.
CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test (
    id int,
    name string,
    age int,
    dt string,
    PRIMARY KEY(id,dt)
) DISTRIBUTED BY HASH(id) PARTITION BY VALUE('dt');
Write data to the table.
Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO.
INSERT INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.test_adb;
If you want to import data from a specific partition to the adb_demo.adb_import_test table, you can execute the following statement:
INSERT INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.test_adb
WHERE dt = '202207';
Method 2: Execute the INSERT OVERWRITE INTO statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value.
INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.test_adb;
Method 3: Execute the INSERT OVERWRITE INTO statement to asynchronously import data. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job. You can add a hint (/*+ direct_batch_load=true*/) before the data import statement to accelerate the job. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE SELECT topic.
SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.test_adb;
Sample result:
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
For information about how to asynchronously submit a job, see Asynchronously submit an import job.
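As a quick check, you can also look up the returned job ID in the INFORMATION_SCHEMA.kepler_meta_async_jobs view that is used later in this topic. This is only a sketch; replace the job ID with the value returned by your SUBMIT JOB statement, and see Asynchronously submit an import job for the documented way to query or terminate the job:
-- Look up the asynchronous import job by its job ID.
SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = '2020112122202917203100908203303******';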
Elastic import
Go to the SQL editor.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose .
Create a database. If you have already created a database, skip this step.
CREATE DATABASE adb_demo;
Create an external table.
Note: The name of the AnalyticDB for MySQL external table must be the same as that of the MaxCompute project. Otherwise, the external table fails to be created.
The AnalyticDB for MySQL external table must use the same names, number, and order of fields as the MaxCompute table. The data types of the fields must be compatible between the two tables.
The elastic import method allows you to create external tables only by using the CREATE TABLE statement.
CREATE TABLE IF NOT EXISTS test_adb (
    id int,
    name string,
    age int,
    dt string
) ENGINE='ODPS'
TABLE_PROPERTIES='{
    "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun-inc.com/api",
    "accessid":"LTAILd4****",
    "accesskey":"4A5Q7ZVzcYnWMQPysX****",
    "partition_column":"dt",
    "project_name":"test_adb",
    "table_name":"person"
}';
For information about the external table parameters, see the parameter table in the Use external tables to import data to Data Warehouse Edition topic.
Query data.
SELECT * FROM adb_demo.test_adb;
Sample result:
+------+-------+------+--------+
| id   | name  | age  | dt     |
+------+-------+------+--------+
|    1 | james |   10 | 202207 |
|    2 | bond  |   20 | 202207 |
|    3 | jack  |   30 | 202207 |
|    4 | lucy  |   40 | 202207 |
+------+-------+------+--------+
4 rows in set (0.35 sec)
Create a table in the AnalyticDB for MySQL database to store data that is imported from MaxCompute.
Note: The created internal table must use the same names, number, order, and data types of fields as the external table that is created in Step 3.
CREATE TABLE IF NOT EXISTS adb_import_test (
    id int,
    name string,
    age int,
    dt string,
    PRIMARY KEY(id,dt)
) DISTRIBUTED BY HASH(id) PARTITION BY VALUE('dt') LIFECYCLE 30;
Import data.
Important: The elastic import method allows you to import data only by using the INSERT OVERWRITE INTO statement.
Method 1: Execute the INSERT OVERWRITE INTO statement to elastically import data. If the primary key has duplicate values, the original value is overwritten by the new value.
/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.adb.eni.vswitchId=vsw-bp12ldm83z4zu9k4d****]*/ INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;
Method 2: Asynchronously execute the INSERT OVERWRITE INTO statement to elastically import data. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job.
/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.adb.eni.vswitchId=vsw-bp12ldm83z4zu9k4d****]*/ SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;
Important: When you asynchronously submit an elastic import job, you cannot configure priority queues.
Sample result:
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2023081517192220291720310090151****** |
+---------------------------------------+
After you use the SUBMIT JOB statement to submit an asynchronous job, only a job ID is returned, which indicates that the asynchronous job is successfully submitted. You can use the returned job ID to terminate the asynchronous job or query the status of the asynchronous job. For more information, see Asynchronously submit an import job.
Hint parameters:
elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.
elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following table describes the parameters, and a combined example is provided after the table.
Parameter | Required | Description
adb.load.resource.group.name | Yes | The name of the job resource group that runs the elastic import job.
adb.load.job.max.acu | No | The maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: the number of shards plus 1. To query the number of shards in the cluster, execute the SELECT count(1) FROM information_schema.kepler_meta_shards; statement.
spark.driver.resourceSpec | No | The resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.
spark.executor.resourceSpec | No | The resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.
spark.adb.executorDiskSize | No | The disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic.
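For example, the following statement combines several of these parameters in a single hint. This is only a sketch: the resource group name, vSwitch ID, and the adb.load.job.max.acu and spark.executor.resourceSpec values are placeholders that you must replace with values that match your cluster.
/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.adb.eni.vswitchId=vsw-bp12ldm83z4zu9k4d****|adb.load.job.max.acu=16|spark.executor.resourceSpec=large]*/ SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;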
(Optional) Check whether the submitted job is an elastic import job.
SELECT job_name, (job_type = 3) AS is_elastic_load FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = "2023081818010602101701907303151******";
Sample result:
+---------------------------------------+-----------------+
| job_name                              | is_elastic_load |
+---------------------------------------+-----------------+
| 2023081517195203101701907203151****** |               1 |
+---------------------------------------+-----------------+
If an elastic import job is submitted, 1 is returned for the is_elastic_load parameter. If a regular import job is submitted, 0 is returned.