AnalyticDB for MySQL allows you to use external tables to import and export data. This topic describes how to use an external table to query Hadoop Distributed File System (HDFS) data and import the data to AnalyticDB for MySQL.
Prerequisites
An AnalyticDB for MySQL cluster of V3.1.4.4 or later is created.
An HDFS data file is in the CSV, Parquet, or ORC format.
An HDFS cluster is created, and the data that you want to import is stored in an HDFS folder. In this example, the data file hdfs_import_test_data.csv is stored in the hdfs_import_test_csv_data folder.
The following service access ports for the AnalyticDB for MySQL cluster are configured in the HDFS cluster:
namenode: used to read and write the metadata of the file system. You can configure the port number by using the fs.defaultFS parameter. The default port number is 8020. For information about the detailed configurations, see core-default.xml.
datanode: used to read and write data. You can configure the port number by using the dfs.datanode.address parameter. The default port number is 50010. For information about the detailed configurations, see hdfs-default.xml.
If the AnalyticDB for MySQL cluster is in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.
Procedure
Connect to the AnalyticDB for MySQL cluster. For more information, see Connect to an AnalyticDB for MySQL cluster.
Create a database. For more information, see Create a database.
In this example, a database named adb_demo is created in the AnalyticDB for MySQL cluster.
Create an external table in the CSV, Parquet, or ORC format in the destination database adb_demo by executing a CREATE TABLE statement.
To create a standard external table, see the "Create an HDFS external table" section of this topic.
To create a partitioned external table, see the "Create a partitioned HDFS external table" section of this topic.
Create a destination table.
You can execute one of the following statements to create a destination table in the destination database adb_demo to store data imported from HDFS.
The following statement shows how to create a destination table that corresponds to the standard external table. In this example, the destination table is named adb_hdfs_import_test.
CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
(
    uid string,
    other string
)
DISTRIBUTED BY HASH(uid);
The following statement shows how to create a destination table that corresponds to the partitioned external table. In this example, the destination table is named adb_hdfs_import_parquet_partition. To create the table, you must define ordinary columns (such as uid and other) and partition columns (such as p1, p2, and p3) in the statement.
CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
(
    uid string,
    other string,
    p1 date,
    p2 int,
    p3 varchar
)
DISTRIBUTED BY HASH(uid);
Import data from HDFS to the destination AnalyticDB for MySQL cluster.
You can use one of the following methods to import data. The syntax to import data by using a partitioned table is the same as the syntax to import data by using a standard table.
Method 1 (recommended): Execute an INSERT OVERWRITE statement to import data. This method batch imports data and provides good performance. If the import succeeds, the data is available for queries. If the import fails, the data is rolled back to its previous state. Example:
INSERT OVERWRITE adb_hdfs_import_test SELECT * FROM hdfs_import_test_external_table;
Method 2: Execute an INSERT INTO statement to import data. This method allows you to query the imported data in real time and is suitable for importing small amounts of data. Example:
INSERT INTO adb_hdfs_import_test SELECT * FROM hdfs_import_test_external_table;
Method 3: Submit an asynchronous task to import data. Example:
SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test SELECT * FROM hdfs_import_test_external_table;
The following information is returned:
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
You can check the state of the asynchronous task based on the job_id value. For more information, see Asynchronously submit an import task.
What to do next
After the data is imported, you can log on to the destination database adb_demo in the AnalyticDB for MySQL cluster and execute the following statement to check whether the data is imported from the source table to the destination table adb_hdfs_import_test:
SELECT * FROM adb_hdfs_import_test LIMIT 100;
Create an HDFS external table
Create an external table in the CSV format
The following statement shows how to create an external table in the CSV format:
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
}';
ENGINE='HDFS' (required): The storage engine used for the external table. In this example, HDFS is used as the storage engine.
TABLE_PROPERTIES (required): The connection information used by AnalyticDB for MySQL to access HDFS data.
format (required): The format of the data file. When you create an external table in the CSV format, you must set this parameter to csv.
delimiter (required): The column delimiter of the CSV file. In this example, a comma (,) is used as the delimiter.
hdfs_url (required): The absolute URL of the destination data file or folder in the HDFS cluster. The URL must start with hdfs://. Example: hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv
partition_column (optional): The partition columns of the external table. Separate multiple columns with commas (,). For more information about how to define partition columns, see the "Create a partitioned HDFS external table" section of this topic.
compress_type (optional): The compression type of the data file. Only the GZIP compression type is supported for CSV files.
skip_header_line_count (optional): The number of header rows to skip when you import data. For example, if the first row of a CSV file is the table header and you set this parameter to 1, the first row is skipped during import. The default value is 0, which indicates that no rows are skipped.
hdfs_ha_host_port (optional): If the high availability (HA) feature is configured for the HDFS cluster, you must specify this parameter in the ip1:port1,ip2:port2 format, which provides the IP addresses and port numbers of the primary and secondary NameNodes. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021
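The delimiter and skip_header_line_count parameters together determine how rows of a CSV file are parsed. The following Python sketch is purely illustrative (it is not part of AnalyticDB for MySQL) and mimics that behavior for a small in-memory file:

```python
import csv
import io

def parse_csv(text, delimiter=",", skip_header_line_count=0):
    """Mimic how a CSV external table reads rows: split each line on the
    configured delimiter, then skip the first N header rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return rows[skip_header_line_count:]

sample = "uid,other\n1,hello\n2,world\n"

# With skip_header_line_count=1, the header row "uid,other" is dropped.
rows = parse_csv(sample, delimiter=",", skip_header_line_count=1)
print(rows)  # [['1', 'hello'], ['2', 'world']]
```

With skip_header_line_count left at its default of 0, the header row would be imported as data, which is usually not what you want for CSV files that contain a header.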
Create an external table in the Parquet or ORC format
The following statement shows how to create an external table in the Parquet format:
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';
ENGINE='HDFS' (required): The storage engine used for the external table. In this example, HDFS is used as the storage engine.
TABLE_PROPERTIES (required): The connection information used by AnalyticDB for MySQL to access HDFS data.
format (required): The format of the data file. When you create an external table in the Parquet format, you must set this parameter to parquet. When you create an external table in the ORC format, you must set this parameter to orc.
hdfs_url (required): The absolute URL of the destination data file or folder in the HDFS cluster. The URL must start with hdfs://.
partition_column (optional): The partition columns of the external table. Separate multiple columns with commas (,). For more information about how to define partition columns, see the "Create a partitioned HDFS external table" section of this topic.
hdfs_ha_host_port (optional): If the high availability (HA) feature is configured for the HDFS cluster, you must specify this parameter in the ip1:port1,ip2:port2 format, which provides the IP addresses and port numbers of the primary and secondary NameNodes. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021
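The ip1:port1,ip2:port2 format of hdfs_ha_host_port can be made concrete with a small Python sketch. The helper below is hypothetical (not part of any AnalyticDB SDK) and only illustrates how such a value decomposes into (host, port) pairs for the primary and secondary NameNodes:

```python
def parse_ha_host_port(value):
    """Split an hdfs_ha_host_port value such as
    'host1:8020,host2:8021' into (host, port) tuples."""
    pairs = []
    for item in value.split(","):
        host, port = item.rsplit(":", 1)
        pairs.append((host, int(port)))
    return pairs

print(parse_ha_host_port("192.168.1.10:8020,192.168.1.11:8021"))
# [('192.168.1.10', 8020), ('192.168.1.11', 8021)]
```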
Note: The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.
When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.
If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
Data type mappings between Parquet and AnalyticDB for MySQL
Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL
BOOLEAN | N/A | BOOLEAN
INT32 | INT_8 | TINYINT
INT32 | INT_16 | SMALLINT
INT32 | N/A | INT or INTEGER
INT64 | N/A | BIGINT
FLOAT | N/A | FLOAT
DOUBLE | N/A | DOUBLE
FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL
BINARY | UTF-8 | VARCHAR, STRING, or JSON (JSON is available if the Parquet object contains a column of the JSON type)
INT32 | DATE | DATE
INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME
INT96 | N/A | TIMESTAMP or DATETIME
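The mapping above is keyed by the combination of basic type and logical type. A small Python dictionary (illustrative only; the actual mapping is applied inside AnalyticDB for MySQL, and None here stands for a Parquet column with no logical type annotation) makes the two-key lookup explicit:

```python
# (basic type, logical type) -> AnalyticDB for MySQL type.
PARQUET_TO_ADB = {
    ("BOOLEAN", None): "BOOLEAN",
    ("INT32", "INT_8"): "TINYINT",
    ("INT32", "INT_16"): "SMALLINT",
    ("INT32", None): "INT",
    ("INT64", None): "BIGINT",
    ("FLOAT", None): "FLOAT",
    ("DOUBLE", None): "DOUBLE",
    ("FIXED_LEN_BYTE_ARRAY", "DECIMAL"): "DECIMAL",
    ("BINARY", "DECIMAL"): "DECIMAL",
    ("INT64", "DECIMAL"): "DECIMAL",
    ("INT32", "DECIMAL"): "DECIMAL",
    ("BINARY", "UTF-8"): "VARCHAR",
    ("INT32", "DATE"): "DATE",
    ("INT64", "TIMESTAMP_MILLIS"): "TIMESTAMP",
    ("INT96", None): "TIMESTAMP",
}

# The same basic type can map to different AnalyticDB types depending
# on the logical type annotation.
print(PARQUET_TO_ADB[("INT32", "INT_8")])     # TINYINT
print(PARQUET_TO_ADB[("INT32", "DECIMAL")])   # DECIMAL
```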
Important: Parquet external tables that use columns of the STRUCT type cannot be created.
Data type mappings between ORC and AnalyticDB for MySQL
Data type in ORC | Data type in AnalyticDB for MySQL
BOOLEAN | BOOLEAN
BYTE | TINYINT
SHORT | SMALLINT
INT | INT or INTEGER
LONG | BIGINT
DECIMAL | DECIMAL
FLOAT | FLOAT
DOUBLE | DOUBLE
BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (JSON is available if the ORC object contains a column of the JSON type)
TIMESTAMP | TIMESTAMP or DATETIME
DATE | DATE
Important: ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type can be created but cannot be queried.
Create a partitioned HDFS external table
HDFS can partition data that is in the Parquet, CSV, or ORC format. It generates a hierarchical directory for data that contains partitions. In the following example, p1
indicates the level-1 partition, p2
indicates the level-2 partition, and p3
indicates the level-3 partition:
parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0
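The key=value directory names in this layout are what the partition_column parameter relies on. The following Python sketch (illustrative only, not AnalyticDB code) shows how partition column values are recovered from a file path in such a hierarchy:

```python
def partition_values(path, partition_columns):
    """Extract partition column values from a Hive-style path such as
    'parquet_partition_classic/p1=2020-01-01/p2=4/p3=SHANGHAI/000000_0'."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            if name in partition_columns:
                values[name] = value
    return values

path = "parquet_partition_classic/p1=2020-01-01/p2=4/p3=SHANGHAI/000000_0"
print(partition_values(path, ["p1", "p2", "p3"]))
# {'p1': '2020-01-01', 'p2': '4', 'p3': 'SHANGHAI'}
```

This is why the partition columns declared in the external table must match the directory levels: each level of the path supplies the value for one partition column.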
The following statement shows how to create an external table and specify partition columns in the external table. In this example, a Parquet file is used.
CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
uid varchar,
other varchar,
p1 date,
p2 int,
p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
"hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
"format":"parquet", //Specify the format of the file. To create a CSV or ORC file, change the format value to csv or orc.
"partition_column":"p1, p2, p3" //If you want to query the HDFS data by partition, you must specify the partition_column parameter in the statement used to create an external table when you import data to AnalyticDB for MySQL.
}';
The partition_column property in the TABLE_PROPERTIES parameter specifies the partition columns (p1, p2, and p3 in this example). The partition columns specified by the partition_column property must conform to the partition levels of the sample data.
When you define columns in the statement, you must include the partition columns (p1, p2, and p3 in this example) and their data types. The partition columns must be placed at the end of the column definitions.
The partition columns defined in the statement must be in the same order as the partition columns specified by the partition_column property.
Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
When you query data, partition columns can be displayed and used in the same manner as other columns.
If you leave the format parameter empty, the CSV format is used.
For information about other parameters, see the parameter descriptions in the "Create an HDFS external table" section of this topic.
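Because TABLE_PROPERTIES is a JSON string embedded in a SQL statement, generating it programmatically avoids quoting mistakes. The helper below is a hypothetical sketch (the function name and the URL value are placeholders, not part of any AnalyticDB tooling):

```python
import json

def table_properties(hdfs_url, fmt="parquet", partition_columns=None):
    """Build the TABLE_PROPERTIES JSON string for an HDFS external table."""
    props = {"format": fmt, "hdfs_url": hdfs_url}
    if partition_columns:
        # partition_column is a single comma-separated string, not a list.
        props["partition_column"] = ", ".join(partition_columns)
    return json.dumps(props)

props = table_properties(
    "hdfs://192.168.0.1:9000/adb/parquet_partition_classic/",
    fmt="parquet",
    partition_columns=["p1", "p2", "p3"],
)
print(f"TABLE_PROPERTIES='{props}'")
```

The resulting string can then be pasted into (or interpolated into) the CREATE TABLE statement shown earlier in this section.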