
AnalyticDB: Use external tables to import data to Data Warehouse Edition

Last Updated: Jun 20, 2023

AnalyticDB for MySQL allows you to use external tables to import and export data. This topic describes how to use an external table to query Apsara File Storage for HDFS data and import the data to AnalyticDB for MySQL Data Warehouse Edition (V3.0).

Prerequisites

  • The AnalyticDB for MySQL cluster runs V3.1.4.4 or later.

    Note For information about how to query the minor version of a cluster, see How do I query the version of an AnalyticDB for MySQL cluster? To update the minor version of a cluster, contact technical support.
  • The Apsara File Storage for HDFS data file is in the CSV, Parquet, or ORC format.

  • An Apsara File Storage for HDFS cluster is created, and the data that you want to import is stored in an Apsara File Storage for HDFS folder. In this example, the data is stored in the hdfs_import_test_data.csv file.

  • The following service access ports for the AnalyticDB for MySQL cluster are configured in the Apsara File Storage for HDFS cluster:

    • namenode: used to read and write metadata of a file system. You can configure the port number by using the fs.defaultFS parameter. The default port number is 8020.

      For information about the detailed configurations, see core-default.xml.

    • datanode: used to read and write data. You can configure the port number by using the dfs.datanode.address parameter. The default port number is 50010.

      For information about the detailed configurations, see hdfs-default.xml.

  • If the AnalyticDB for MySQL cluster is in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.
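
The two ports described above map to standard Hadoop configuration entries. The following sketch shows the relevant properties with the default port numbers mentioned above; the host addresses are placeholders that you must replace with the addresses of your own cluster:

```xml
<!-- core-site.xml: NameNode endpoint (fs.defaultFS). Default port: 8020. -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.xx.xx:8020</value>
</property>

<!-- hdfs-site.xml: DataNode data transfer address (dfs.datanode.address). Default port: 50010. -->
<property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
</property>
```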

Procedure

  1. Connect to the AnalyticDB for MySQL cluster. For more information, see Connect to an AnalyticDB for MySQL cluster.
  2. Create a destination database. For more information, see Create a database.

    In this example, the database named adb_demo is used as the destination database in the AnalyticDB for MySQL cluster.

  3. Create an external table in the CSV, Parquet, or ORC format in the destination database adb_demo by executing a CREATE TABLE statement. For more information, see the "Create an Apsara File Storage for HDFS external table" section of this topic.

  4. Create a destination table.

    You can execute one of the following statements to create a destination table in the destination database adb_demo to store data imported from Apsara File Storage for HDFS.

    • The following statement shows how to create a destination table that corresponds to a standard (non-partitioned) external table. In this example, the destination table is named adb_hdfs_import_test.

      CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
      (
          uid string,
          other string
      )
      DISTRIBUTED BY HASH(uid);
    • The following statement shows how to create a destination table that corresponds to a partitioned external table. In this example, the destination table is named adb_hdfs_import_parquet_partition. To create the table, you must define ordinary columns (such as uid and other) and partition columns (such as p1, p2, and p3) in the statement.

      CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
      (
          uid string,
          other string,
          p1 date,
          p2 int,
          p3 varchar
      )
      DISTRIBUTED BY HASH(uid);
  5. Import data from Apsara File Storage for HDFS to the destination AnalyticDB for MySQL cluster.

    You can use one of the following methods to import data. The syntax to import data by using a partitioned table is the same as the syntax to import data by using a standard table.

    • Method 1 (recommended): Execute an INSERT OVERWRITE statement to import data. This method allows you to batch import data and provides good performance. If the import succeeds, the data is available for query. Otherwise, the data is rolled back to its previous state. Example:

      INSERT OVERWRITE adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
    • Method 2: Execute an INSERT INTO statement to import data. This method allows you to query imported data in real time. This method is suitable when you want to import a small amount of data. Example:

      INSERT INTO adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
    • Method 3: Submit an asynchronous task to import data. Example:

      SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;

      The following information is returned:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      You can check the state of the asynchronous task based on the job_id value. For more information, see Asynchronously submit an import task.

What to do next

After data is imported, you can log on to the destination database adb_demo in the AnalyticDB for MySQL cluster and then execute the following statement to check whether the data is imported from the source table to the destination table adb_hdfs_import_test:

SELECT * FROM adb_hdfs_import_test LIMIT 100;

Create an Apsara File Storage for HDFS external table

  • Create an external table in the CSV format

    The following statement shows how to create an external table in the CSV format:

    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"csv",
        "delimiter":",",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
    }';

    | Parameter | Required | Description |
    |---|---|---|
    | ENGINE='HDFS' | Yes | The storage engine used for the external table. In this example, HDFS is used as the storage engine. |
    | TABLE_PROPERTIES | Yes | The connection information used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data. |
    | format | Yes | The format of the data file. When you create an external table in the CSV format, you must set this parameter to csv. |
    | delimiter | Yes | The column delimiter of the CSV file. In this example, a comma (,) is used as the delimiter. |
    | hdfs_url | Yes | The absolute URL of the destination data file or folder in the Apsara File Storage for HDFS cluster. The URL must start with hdfs://. Example: hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv |
    | partition_column | No | The partition column of the external table. Separate multiple columns with commas (,). For more information about how to define partition columns, see the "Create a partitioned Apsara File Storage for HDFS external table" section of this topic. |
    | compress_type | No | The compression type of the data file. Only the GZIP compression type is supported for CSV files. |
    | skip_header_line_count | No | The number of header rows to skip when you import data. If the first row of a CSV file is the table header, set this parameter to 1 to skip the first row. Default value: 0, which indicates that no rows are skipped. |
    | hdfs_ha_host_port | No | If the high availability (HA) feature is configured for the Apsara File Storage for HDFS cluster, you must specify this parameter in the ip1:port1,ip2:port2 format, where the IP addresses and port numbers of the primary and secondary NameNodes are specified. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021 |
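
The delimiter and skip_header_line_count properties behave like ordinary CSV parsing. The following plain-Python sketch (an illustration only, not AnalyticDB code) shows the effect of "delimiter":"," and "skip_header_line_count":1 on a small file:

```python
import csv
import io

# Sample file content: one header row followed by two data rows.
data = "uid,other\n1,foo\n2,bar\n"

delimiter = ","             # corresponds to the "delimiter" property
skip_header_line_count = 1  # corresponds to the "skip_header_line_count" property

reader = csv.reader(io.StringIO(data), delimiter=delimiter)
rows = list(reader)[skip_header_line_count:]  # skip the header row
print(rows)  # [['1', 'foo'], ['2', 'bar']]
```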

  • Create an external table in the Parquet or ORC format

    The following statement shows how to create an external table in the Parquet format:

    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"parquet",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
    }';

    | Parameter | Required | Description |
    |---|---|---|
    | ENGINE='HDFS' | Yes | The storage engine used for the external table. In this example, HDFS is used as the storage engine. |
    | TABLE_PROPERTIES | Yes | The connection information used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data. |
    | format | Yes | The format of the data file. Set this parameter to parquet when you create an external table in the Parquet format, or to orc when you create an external table in the ORC format. |
    | hdfs_url | Yes | The absolute URL of the destination data file or folder in the Apsara File Storage for HDFS cluster. The URL must start with hdfs://. |
    | partition_column | No | The partition column of the external table. Separate multiple columns with commas (,). For more information about how to define partition columns, see the "Create a partitioned Apsara File Storage for HDFS external table" section of this topic. |
    | hdfs_ha_host_port | No | If the high availability (HA) feature is configured for the Apsara File Storage for HDFS cluster, you must specify this parameter in the ip1:port1,ip2:port2 format, where the IP addresses and port numbers of the primary and secondary NameNodes are specified. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021 |

    Note
    • The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.

    • When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.

    • If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
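
The matching rules above can be illustrated with a small plain-Python sketch (a hypothetical helper, not AnalyticDB code): column names compare case-insensitively, file columns that are not selected are dropped, and columns absent from the file return NULL. The column-sequence requirement is not modeled here.

```python
def map_columns(table_columns, file_row):
    """Map a row read from a Parquet/ORC file onto external table columns.

    table_columns: column names declared in the CREATE TABLE statement.
    file_row: dict that maps file column names to values.
    """
    lowered = {name.lower(): value for name, value in file_row.items()}
    # Case-insensitive lookup; columns missing from the file become None (NULL).
    return {column: lowered.get(column.lower()) for column in table_columns}

# "extra" exists only in the file and is not imported; "missing" exists
# only in the table definition and comes back as NULL.
row = {"UID": "1", "Other": "foo", "extra": "ignored"}
print(map_columns(["uid", "other", "missing"], row))
# {'uid': '1', 'other': 'foo', 'missing': None}
```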

    The following table describes the mappings between data types in Parquet and AnalyticDB for MySQL V3.0.

    | Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL V3.0 |
    |---|---|---|
    | BOOLEAN | None | BOOLEAN |
    | INT32 | INT_8 | TINYINT |
    | INT32 | INT_16 | SMALLINT |
    | INT32 | None | INT or INTEGER |
    | INT64 | None | BIGINT |
    | FLOAT | None | FLOAT |
    | DOUBLE | None | DOUBLE |
    | FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL |
    | BINARY | UTF-8 | VARCHAR, STRING, or JSON (JSON is available if an object in the Parquet format contains a column in the JSON format.) |
    | INT32 | DATE | DATE |
    | INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
    | INT96 | None | TIMESTAMP or DATETIME |

    Important

    Parquet external tables that use columns of the STRUCT data type cannot be created.

    The following table describes the mappings between data types in ORC and AnalyticDB for MySQL V3.0.

    | Data type in ORC | Data type in AnalyticDB for MySQL V3.0 |
    |---|---|
    | BOOLEAN | BOOLEAN |
    | BYTE | TINYINT |
    | SHORT | SMALLINT |
    | INT | INT or INTEGER |
    | LONG | BIGINT |
    | DECIMAL | DECIMAL |
    | FLOAT | FLOAT |
    | DOUBLE | DOUBLE |
    | BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (JSON is available if an object in the ORC format contains a column in the JSON format.) |
    | TIMESTAMP | TIMESTAMP or DATETIME |
    | DATE | DATE |

    Important

    ORC external tables that use columns of the LIST, STRUCT, or UNION data type cannot be created. ORC external tables that use columns of the MAP data type can be created, but you cannot perform queries on the tables.

Create a partitioned Apsara File Storage for HDFS external table

Data in the Parquet, CSV, or ORC format in Apsara File Storage for HDFS can be partitioned. Partitioned data is stored in a hierarchical directory structure in which each directory level corresponds to one partition column. In the following example, p1 indicates the level-1 partition, p2 indicates the level-2 partition, and p3 indicates the level-3 partition:

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0
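
Each directory level in this layout encodes one partition value in the name=value form. The following plain-Python sketch (a hypothetical helper, not part of AnalyticDB) shows how the path of a data file maps to partition column values:

```python
def partition_values(path):
    """Extract name=value partition pairs from an HDFS-style file path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values

# A data file from the sample directory tree above.
path = "parquet_partition_classic/p1=2020-01-01/p2=4/p3=SHANGHAI/000000_0"
print(partition_values(path))
# {'p1': '2020-01-01', 'p2': '4', 'p3': 'SHANGHAI'}
```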

The following statement shows how to create an external table and specify partition columns in the external table. In this example, a Parquet file is used. To create an external table for a CSV or ORC file, change the value of the format parameter to csv or orc. If you want to query the Apsara File Storage for HDFS data by partition, you must specify the partition_column parameter when you create the external table.

CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
  "hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
  "format":"parquet",
  "partition_column":"p1, p2, p3"
}';
Note
  • The partition_column property of TABLE_PROPERTIES must declare the partition columns (such as p1, p2, and p3 in the example).

  • When you define partition columns in an external table, you must specify the columns that contain partitions (such as p1, p2, and p3 in the example) and their data types, and you must place these columns at the end of the column list. The sequence of the partition columns defined in the statement must be the same as the sequence specified by the partition_column property.

  • Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.

  • The syntax used to query partition columns and the way in which the query results are displayed are the same as those for other columns.

  • If the format parameter is not specified, the CSV format is used.