AnalyticDB for MySQL allows you to use external tables to import and export data. This topic describes how to use external tables to export data from AnalyticDB for MySQL Data Warehouse Edition (V3.0) to Apsara File Storage for HDFS.
Prerequisites
- The AnalyticDB for MySQL cluster runs V3.1.4.4 or later.
  Note For information about how to query the minor version of a cluster, see How do I query the version of an AnalyticDB for MySQL cluster? To update the minor version of a cluster, contact technical support.
- An Apsara File Storage for HDFS cluster is created, and a folder is created in the cluster to store the data exported from AnalyticDB for MySQL. In this example, the folder is named hdfs_output_test_csv_data.
  Note When you use the INSERT OVERWRITE statement to write data to the external table, the existing files in the destination folder are overwritten. To prevent existing files from being overwritten, we recommend that you create a separate destination folder for each export.
- The following service access ports for the AnalyticDB for MySQL cluster are configured in the Apsara File Storage for HDFS cluster (see the configuration sketch after this list):
  - namenode: used to read and write the metadata of the file system. You can configure the port number by using the fs.defaultFS parameter. The default port number is 8020. For detailed configurations, see core-default.xml.
  - datanode: used to read and write data. You can configure the port number by using the dfs.datanode.address parameter. The default port number is 50010. For detailed configurations, see hdfs-default.xml.
- If the AnalyticDB for MySQL cluster is in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.
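The following snippet is a minimal sketch of where the two port settings above are typically defined on the Hadoop side, assuming a standard configuration layout in which fs.defaultFS is set in core-site.xml and dfs.datanode.address is set in hdfs-site.xml. The host name is a placeholder, and your values may differ.

<!-- core-site.xml: namenode address that clients use to read and write file system metadata -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
</property>

<!-- hdfs-site.xml: datanode address that clients use to read and write data -->
<property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
</property>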
Usage notes
- Only CSV and Parquet files can be exported from an AnalyticDB for MySQL cluster to an Apsara File Storage for HDFS cluster. ORC files cannot be exported.
- AnalyticDB for MySQL does not support the INSERT syntax that inserts data into external tables by row, such as INSERT INTO VALUES and REPLACE INTO VALUES.
- You cannot export a single file from AnalyticDB for MySQL to Apsara File Storage for HDFS by using a partitioned external table.
- When you export data by using a partitioned external table, the data files do not contain the values of the partition columns. The partition column values are represented as Apsara File Storage for HDFS directories (see the sketch after this list).
  For example, assume that a partitioned external table defines three ordinary columns and two partition columns. The first-level partition column is named p1 and its value is 1, and the second-level partition column is named p2 and its value is a. You use this partitioned external table to export data to the adb_data/ directory of an Apsara File Storage for HDFS cluster. Because p1 is set to 1 and p2 is set to a, the data is exported to the relative path adb_data/p1=1/p2=a/. The exported CSV or Parquet files contain only the values of the three ordinary columns and do not contain the values of the p1 and p2 columns.
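Under these assumptions, the resulting layout on Apsara File Storage for HDFS looks similar to the following sketch. The file name is a placeholder; actual file names are generated during the export.

adb_data/
    p1=1/
        p2=a/
            <exported_file>.csv    (contains only the values of the three ordinary columns)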
Procedure
- Connect to the AnalyticDB for MySQL cluster. For more information, see Connect to an AnalyticDB for MySQL cluster.
- Create a source database. For more information, see Create a database.
  In this example, a database named adb_demo is used as the source database in the AnalyticDB for MySQL cluster.
- Create a source table and insert data into the source table.
Execute the following statement to create a source table named adb_hdfs_import_source in the adb_demo database:

CREATE TABLE IF NOT EXISTS adb_hdfs_import_source
(
    uid string,
    other string
)
DISTRIBUTED BY HASH(uid);

Execute the following statement to insert test data into the adb_hdfs_import_source table:

INSERT INTO adb_hdfs_import_source VALUES ("1", "a"), ("2", "b"), ("3", "c");
- Create an external table.
You can use the following syntax to create an external table in the adb_demo database. The external table is used to export AnalyticDB for MySQL data to Apsara File Storage for HDFS.
Execute the following statement to create a standard external table. In this example, the destination table is named hdfs_import_external:

CREATE TABLE IF NOT EXISTS hdfs_import_external
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_output_test_csv_data"
}';
Execute the following statement to create a partitioned external table. You must define ordinary columns (such as uid and other) and partition columns (such as p1, p2, and p3) in the statement. In this example, the destination table is named hdfs_import_external_par:

CREATE TABLE IF NOT EXISTS hdfs_import_external_par
(
    uid string,
    other string,
    p1 date,
    p2 int,
    p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_output_test_csv_data",
    "partition_column":"p1, p2, p3"
}';
Note Only CSV and Parquet files can be exported from an AnalyticDB for MySQL cluster to an Apsara File Storage for HDFS cluster. ORC files cannot be exported.
For the detailed syntax to create an external table, see the "Create an Apsara File Storage for HDFS external table" and "Create a partitioned Apsara File Storage for HDFS external table" sections of the Use external tables to import data to Data Warehouse Edition topic.
- Export the source data from the AnalyticDB for MySQL cluster to Apsara File Storage for HDFS.
For the detailed syntax to export data by using a standard external table, see the "Appendix 1: Syntax to export data by using a standard external table" section of this topic.
For the detailed syntax to export data by using a partitioned external table, see the "Appendix 2: Syntax to export data by using a partitioned external table" section of this topic.
What to do next
After the data is exported, you can view the exported data in the hdfs_output_test_csv_data destination folder by using the Hadoop client. You can also connect to the AnalyticDB for MySQL cluster and execute the following statement to query the data exported by using an external table. The query syntax is the same for partitioned external tables and standard external tables. In this example, the hdfs_import_external standard external table is used.
SELECT * FROM hdfs_import_external LIMIT 100;
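To check the exported files on the Apsara File Storage for HDFS side with the Hadoop client, you can list the destination folder and inspect a file. The following commands are a minimal sketch that reuses the example path from this topic; the namenode address, port, and generated file name are placeholders and will differ in your environment.

hadoop fs -ls hdfs://172.17.***.***:9000/adb/hdfs_output_test_csv_data/
hadoop fs -cat hdfs://172.17.***.***:9000/adb/hdfs_output_test_csv_data/<exported_file>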
Appendix 1: Syntax to export data by using a standard external table
If partition columns are not specified when you create an external table, you can use one of the following methods to export data:
Method 1: Use the INSERT INTO statement to import data into the external table. If the destination folder already contains data, the existing files are not modified. Each time you use this statement to write data from the source table to the Apsara File Storage for HDFS folder, a new Apsara File Storage for HDFS file is generated.
Note All columns that you want to export must be written to the external table. INSERT INTO generates new files without overwriting existing files.
Syntax:
INSERT INTO <target_table> SELECT <col_name> FROM <source_table>;
Example:
INSERT INTO hdfs_import_external SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Note col1, col2, and col3 indicate all columns of the external table.
Method 2: Use the REPLACE INTO statement to import data into the external table. You cannot define primary keys in external tables for Apsara File Storage for HDFS, so REPLACE INTO and INSERT INTO are completely equivalent. If the destination table already contains data, the existing data remains unchanged and the new data is appended when you execute the REPLACE INTO statement.
Note All columns that you want to export must be written to the external table. You cannot specify specific columns to be written. REPLACE INTO generates new files without overwriting existing files.
Syntax:
REPLACE INTO <target_table> SELECT <col_name> FROM <source_table>;
Example:
REPLACE INTO hdfs_import_external SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Method 3: Use the INSERT OVERWRITE statement to batch insert data into the external table. If the destination external table already contains data, all files in the external table path are deleted before new Apsara File Storage for HDFS files are generated.
Important All columns that you want to export must be written to the external table. You cannot specify specific columns to be written. INSERT OVERWRITE overwrites the existing data in the destination directory. Proceed with caution.
Syntax:
INSERT OVERWRITE <target_table> SELECT <col_name> FROM <source_table>;
Example:
INSERT OVERWRITE hdfs_import_external SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Method 4: Export data by asynchronously executing the INSERT OVERWRITE statement.
Syntax:
SUBMIT JOB INSERT OVERWRITE <target_table> SELECT <col_name> FROM <source_table>;
Example:
SUBMIT JOB INSERT OVERWRITE hdfs_import_external SELECT col1, col2, col3 FROM adb_hdfs_import_source;
The following result is returned:
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
You can also check the state of the asynchronous task based on the job_id value. For more information, see Asynchronously submit an import task.
Appendix 2: Syntax to export data by using a partitioned external table
When you export data by using a partitioned external table, you can add the PARTITION field to the statement. You can specify partition columns and partition values in the PARTITION field to determine whether static or dynamic partitions are used.
Method 1: Use the INSERT INTO PARTITION statement to batch insert data into the partitioned external table.
Note When you write data, the data is appended to the corresponding partition. Each time you write data, a new Apsara File Storage for HDFS file is generated, and existing data is not overwritten. All columns that you want to export must be written to the external table. You cannot specify specific columns to be written.
Fully static partition
Syntax:
INSERT INTO <target_table> PARTITION(par1=val1,par2=val2,...) SELECT <col_name> FROM <source_table>;
Example:
INSERT INTO hdfs_import_external_par PARTITION(p1='2021-05-06',p2=1,p3='test') SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Static-dynamic partition
Note Static partitions must be placed before dynamic partitions. You cannot change the sequence.
Syntax:
INSERT INTO <target_table> PARTITION(par1=val1,par2,...) SELECT <col_name> FROM <source_table>;
Example:
INSERT INTO hdfs_import_external_par PARTITION(p1='2021-05-27',p2,p3) SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Fully dynamic partition (the PARTITION field is not required)
Syntax:
INSERT INTO <target_table> SELECT <col_name> FROM <source_table>;
Example:
INSERT INTO hdfs_import_external_par SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Method 2: Use the REPLACE INTO PARTITION statement to batch insert data into the partitioned external table. You cannot define primary keys in external tables for Apsara File Storage for HDFS, so REPLACE INTO PARTITION and INSERT INTO PARTITION are completely equivalent.
Note All columns that you want to export must be written to the external table. You cannot specify specific columns to be written. REPLACE INTO PARTITION generates new files without overwriting existing files.
Fully static partition
Syntax:
REPLACE INTO <target_table> PARTITION(par1=val1,par2=val2,...) SELECT <col_name> FROM <source_table>;
Example:
REPLACE INTO hdfs_import_external_par PARTITION(p1='2021-05-06',p2=1,p3='test') SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Static-dynamic partition
Note Static partitions must be placed before dynamic partitions. You cannot change the sequence.
Syntax:
REPLACE INTO <target_table> PARTITION(par1=val1,par2,...) SELECT <col_name> FROM <source_table>;
Example:
REPLACE INTO hdfs_import_external_par PARTITION(p1='2021-05-06',p2,p3) SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Fully dynamic partition (the PARTITION field is not required)
Syntax:
REPLACE INTO <target_table> SELECT <col_name> FROM <source_table>;
Example:
REPLACE INTO hdfs_import_external_par SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Method 3: Use the INSERT OVERWRITE PARTITION statement in the same manner as the INSERT INTO PARTITION statement. However, when you execute the INSERT OVERWRITE PARTITION statement, the existing data in the partitions to which new data is written is overwritten. Partitions to which no new data is written are not cleared.
Syntax:
INSERT OVERWRITE <target_table> PARTITION(par1=val1,par2=val2,...) [IF NOT EXISTS] SELECT <col_name> FROM <source_table>;
Important All columns that you want to export must be written to the external table. You cannot specify specific columns to be written. INSERT OVERWRITE PARTITION overwrites the existing data in the destination partitions. Proceed with caution.
IF NOT EXISTS: indicates that if a partition already exists in the external table, data is not imported into that partition.
Example:
INSERT OVERWRITE hdfs_import_external_par PARTITION(p1='2021-05-06',p2=1,p3='test') IF NOT EXISTS SELECT col1, col2, col3 FROM adb_hdfs_import_source;
Method 4: Export data by asynchronously executing the INSERT OVERWRITE statement.
Syntax:
SUBMIT JOB INSERT OVERWRITE <target_table> SELECT <col_name> FROM <source_table>;
Example:
SUBMIT JOB INSERT OVERWRITE hdfs_import_external_par PARTITION(p1='2021-05-06',p2=1,p3='test') IF NOT EXISTS SELECT col1, col2, col3 FROM adb_hdfs_import_source;
The following result is returned:
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
You can also check the state of the asynchronous task based on the job_id value. For more information, see Asynchronously submit an import task.