AnalyticDB: Use external tables to import data to Data Lakehouse Edition

Last Updated: Nov 05, 2024

AnalyticDB for MySQL Data Lakehouse Edition allows you to access and import data by using external tables. You can import data by using the regular import or elastic import method. Compared with the regular import method, the elastic import method consumes fewer resources and has less impact on real-time data reads and writes. This topic describes how to use external tables to query Object Storage Service (OSS) data and import data from OSS to AnalyticDB for MySQL.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • An OSS bucket is created in the same region as the AnalyticDB for MySQL cluster.

  • Data is uploaded to an OSS directory. For more information, see Upload objects.

Sample data

In this example, the person object is uploaded to the testBucketName/adb/dt=2023-06-15 directory in OSS. Line feeds are used as row delimiters, and commas (,) are used as column delimiters. Sample data in the person object:

1,james,10,2023-06-15
2,bond,20,2023-06-15
3,jack,30,2023-06-15
4,lucy,40,2023-06-15       

Procedure

  1. Go to the SQL editor.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development.

  2. Import data.

    You can import data by using the regular import or elastic import method. By default, the regular import method is used. With regular import, data is read on compute nodes and indexes are created on storage nodes, which consumes computing and storage resources. With elastic import, data is read and indexes are created by Serverless Spark jobs, which consumes resources of job resource groups. Only AnalyticDB for MySQL Data Lakehouse Edition clusters of V3.1.10.0 or later that have job resource groups support the elastic import method. For more information, see Data import methods.
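
    If you are not sure whether your cluster meets the version requirement, you can query the minor version first. A minimal check, assuming the adb_version() function is available on your cluster; it returns the kernel version of the cluster:

      SELECT adb_version();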

    Regular import

    1. Create an external database.

      CREATE EXTERNAL DATABASE adb_external_db;
    2. Use the CREATE EXTERNAL TABLE statement to create an OSS external table in the adb_external_db database. In this example, the external table is named adb_external_db.person.

      Note

      The fields of the AnalyticDB for MySQL external table must match those of the OSS object in name, number, order, and data type.

      Create a non-partitioned OSS external table

      CREATE EXTERNAL TABLE adb_external_db.person
      (
       id INT,
       name VARCHAR(1023),
       age INT,
       dt VARCHAR(1023)
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION 'oss://testBucketName/adb/dt=2023-06-15/';

      Create a partitioned OSS external table

      Create a partitioned OSS external table and add partitions to the table.

      1. Create a partitioned OSS external table.

        CREATE EXTERNAL TABLE adb_external_db.person
        (
         id INT,
         name VARCHAR(1023),
         age INT
        )
        PARTITIONED BY (dt STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 'oss://testBucketName/adb/';
      2. Add partitions to the OSS external table. You can use the ALTER TABLE ADD PARTITION statement to manually add partitions, or use the MSCK REPAIR TABLE statement to automatically identify and add partitions, as shown after the note below.

        ALTER TABLE adb_external_db.person ADD PARTITION (dt='2023-06-15') LOCATION 'oss://testBucketName/adb/dt=2023-06-15/';
        Note
        • For information about how to add partitions to or remove partitions from an OSS external table, see ALTER TABLE (external tables).

        • For information about how to synchronize partitions of an OSS external table, see MSCK REPAIR TABLE.
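
        Instead of adding each partition manually, you can let the cluster scan the LOCATION directory and register every partition that follows the dt=... naming convention. A minimal sketch based on the MSCK REPAIR TABLE statement mentioned above:

          MSCK REPAIR TABLE adb_external_db.person;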

      For information about the syntax, see CREATE EXTERNAL TABLE.

    3. Query data.

      After you create an external table, you can execute the SELECT statement in AnalyticDB for MySQL to query data of the external table.

      SELECT * FROM adb_external_db.person;

      Sample result:

      +------+-------+------+-----------+
      | id   | name  | age  | dt        |
      +------+-------+------+-----------+
      |    1 | james |   10 |2023-06-15 |
      |    2 | bond  |   20 |2023-06-15 |
      |    3 | jack  |   30 |2023-06-15 |
      |    4 | lucy  |   40 |2023-06-15 |
      +------+-------+------+-----------+
      4 rows in set (0.35 sec)
    4. Create a database in the AnalyticDB for MySQL cluster. For more information, see CREATE DATABASE. If you have already created a database, skip this step.

      CREATE DATABASE adb_demo; 
    5. Create a table in the AnalyticDB for MySQL cluster to store data that is imported from OSS.

      Note

      The internal table must have the same field names, number, order, and data types as the external table that is created in Step 2.

      CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test(
          id INT,
          name VARCHAR(1023),
          age INT,
          dt VARCHAR(1023)
      )
      DISTRIBUTED BY HASH(id);
    6. Import data to the table.

      • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted, and the INSERT INTO statement is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO.

        INSERT INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.person;
      • Method 2: Execute the INSERT OVERWRITE INTO statement to synchronously import data. If the primary key has duplicate values, the original value is overwritten by the new value.

        INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.person;
      • Method 3: Execute the INSERT OVERWRITE INTO statement to asynchronously import data. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE SELECT topic.

        SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.person;
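
      After you submit the job, only a job ID is returned. As a quick check, you can look up the job in the INFORMATION_SCHEMA.kepler_meta_async_jobs table, which this topic also uses later to identify elastic import jobs. The job ID below is a placeholder:

        SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = '<job_id>';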

    Elastic import

    1. Create a database. If you have created a database, skip this step.

      CREATE DATABASE adb_demo; 
    2. Create an OSS external table.

      Note
      • The fields of the AnalyticDB for MySQL external table must match those of the OSS object in name, number, order, and data type.

      • The elastic import method allows you to create external tables only by using the CREATE TABLE statement.

      CREATE TABLE oss_import_test_external_table
      (
        id INT,
        name VARCHAR(1023),
        age INT,
        dt VARCHAR(1023)
      )
      ENGINE='OSS'
      TABLE_PROPERTIES='{
          "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
          "url":"oss://<bucket-name>/adb/oss_import_test_data.csv",
          "accessid":"LTAI5t8sqJn5GhpBVtN8****",
          "accesskey":"HlClegbiV5mJjBYBJHEZQOnRF7****",
          "delimiter":","
      }';
      Important

      When you create an external table, you can configure the following TABLE_PROPERTIES parameters for the CSV, Parquet, or ORC external table:

      • CSV: endpoint, url, accessid, accesskey, format, delimiter, null_value, and partition_column.

      • Parquet: endpoint, url, accessid, accesskey, format, and partition_column.

      • ORC: endpoint, url, accessid, accesskey, format, and partition_column.

      For more information about external table parameters, see the "Non-partitioned OSS external tables" and "Partitioned OSS external tables" sections of the Use external tables to import data to Data Warehouse Edition topic.
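
      For example, the following sketch creates a partitioned CSV external table for the sample directory layout used in this topic. The table name is illustrative, and the partition_column property is taken from the CSV parameter list above:

      CREATE TABLE oss_import_test_external_table_par
      (
        id INT,
        name VARCHAR(1023),
        age INT,
        dt VARCHAR(1023)
      )
      ENGINE='OSS'
      TABLE_PROPERTIES='{
          "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
          "url":"oss://<bucket-name>/adb/",
          "accessid":"LTAI5t8sqJn5GhpBVtN8****",
          "accesskey":"HlClegbiV5mJjBYBJHEZQOnRF7****",
          "delimiter":",",
          "partition_column":"dt"
      }';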

    3. Query data.

      After you create an external table, you can execute the SELECT statement in AnalyticDB for MySQL to query data of the external table.

      SELECT * FROM oss_import_test_external_table;

      Sample result:

      +------+-------+------+-----------+
      | id   | name  | age  | dt        |
      +------+-------+------+-----------+
      |    1 | james |   10 |2023-06-15 |
      |    2 | bond  |   20 |2023-06-15 |
      |    3 | jack  |   30 |2023-06-15 |
      |    4 | lucy  |   40 |2023-06-15 |
      +------+-------+------+-----------+
      4 rows in set (0.35 sec)
    4. Create a table in the AnalyticDB for MySQL cluster to store data that is imported from OSS.

      Note

      The internal table must have the same field names, number, order, and data types as the external table that is created in Step 2.

      CREATE TABLE adb_demo.adb_import_test
      (
        id INT,
        name VARCHAR(1023),
        age INT,
        dt VARCHAR(1023),
        PRIMARY KEY(id)
      )
      DISTRIBUTED BY HASH(id);
    5. Import data.

      Important

      The elastic import method allows you to import data only by using the INSERT OVERWRITE INTO statement.

      • Method 1: Execute the INSERT OVERWRITE INTO statement to elastically import data. If the primary key has duplicate values, the original value is overwritten by the new value.

        /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
        INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.oss_import_test_external_table;
      • Method 2: Asynchronously execute the INSERT OVERWRITE INTO statement to elastically import data. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job.

        /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
        SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.oss_import_test_external_table;
        Important

        When you asynchronously submit an elastic import job, you cannot configure priority queues.

        Sample result:

        +---------------------------------------+
        | job_id                                |
        +---------------------------------------+
        | 2023081517195102101701907203151****** |
        +---------------------------------------+

      After you use the SUBMIT JOB statement to submit an asynchronous job, only a job ID is returned, which indicates that the asynchronous job is successfully submitted. You can use the returned job ID to terminate the asynchronous job or query the status of the asynchronous job. For more information, see Asynchronously submit an import job.
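
      For example, to terminate the job, you can use a statement like the following. The CANCEL JOB syntax is assumed here based on the asynchronous import documentation referenced above, and the job ID is the one returned by SUBMIT JOB:

        CANCEL JOB '2023081517195102101701907203151******';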

      Hint parameters:

      • elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.

      • elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following parameters are supported:

        • adb.load.resource.group.name (required): the name of the job resource group that runs the elastic import job.

        • adb.load.job.max.acu (optional): the maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: the number of shards plus 1. Execute the following statement to query the number of shards in the cluster:

          SELECT count(1) FROM information_schema.kepler_meta_shards;

        • spark.driver.resourceSpec (optional): the resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

        • spark.executor.resourceSpec (optional): the resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

        • spark.adb.executorDiskSize (optional): the disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic.
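
      For example, a hint that combines several of these parameters, separated by vertical bars. The resource group name and resource specifications below are placeholders for illustration:

        /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.driver.resourceSpec=small|spark.executor.resourceSpec=medium]*/
        INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.oss_import_test_external_table;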

    6. (Optional) Check whether the submitted job is an elastic import job.

      SELECT job_name, (job_type = 3) AS is_elastic_load FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = '2023081517195102101701907203151******';

      Sample result:

      +---------------------------------------+------------------+
      | job_name                              | is_elastic_load  |
      +---------------------------------------+------------------+
      | 2023081517195102101701907203151****** |       1          |
      +---------------------------------------+------------------+

      If the job is an elastic import job, the is_elastic_load column returns 1. If the job is a regular import job, the column returns 0.