AnalyticDB:Use external tables to import data to Data Warehouse Edition

Last Updated: Oct 29, 2024

If you want to import data from Object Storage Service (OSS) to AnalyticDB for MySQL, you can use external tables to import OSS objects in the Parquet, CSV, or ORC format to Data Warehouse Edition.

Prerequisites

  • An AnalyticDB for MySQL Data Warehouse Edition cluster is created.

    Note

    If the AnalyticDB for MySQL cluster is of the Data Warehouse Edition in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.

  • An OSS bucket is created in the same region as the AnalyticDB for MySQL cluster. For more information, see Activate OSS.

Sample data

In this example, the oss_import_test_data.csv object is uploaded to the testBucketname/adb/ directory in OSS. The row delimiter is a line feed, and the column delimiter is a semicolon (;). The following example shows part of the data contained in the oss_import_test_data.csv object:

uid;other
12;hello_world_1
27;hello_world_2
28;hello_world_3
33;hello_world_4
37;hello_world_5
40;hello_world_6        

Procedure

  1. Create a database in the AnalyticDB for MySQL cluster.

    CREATE DATABASE adb_demo;
  2. Create an external table. You can execute the CREATE TABLE statement to create an OSS external table in the CSV, Parquet, or ORC format. For more information, see the "Syntax for creating OSS external tables" section of this topic.

    In this example, a non-partitioned external table named oss_import_test_external_table in the CSV format is used.

    CREATE TABLE IF NOT EXISTS oss_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='OSS'
    TABLE_PROPERTIES='{
        "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
        "url":"oss://testBucketname/adb/oss_import_test_data.csv",
        "accessid":"LTAIF****5FsE",
        "accesskey":"Ccw****iWjv",
        "delimiter":";",
        "skip_header_line_count":1,
        "charset":"utf-8"
    }'; 
  3. Query the data of the oss_import_test_external_table external table.

    Note

    For queries on CSV, Parquet, and ORC objects, performance overhead increases with the amount of data. To improve query efficiency, we recommend that you perform Steps 4 and 5 to import the data of the OSS external table into AnalyticDB for MySQL before you query it.

    SELECT uid, other FROM oss_import_test_external_table WHERE uid < 100 LIMIT 10;
  4. Create a table in the AnalyticDB for MySQL cluster to store data that is imported from the OSS external table.

    CREATE TABLE IF NOT EXISTS adb_oss_import_test
    (
        uid string,
        other string
    )
    DISTRIBUTED BY HASH(uid);
  5. Execute an INSERT statement to import data from the OSS external table to AnalyticDB for MySQL.

    Important

    By default, the INSERT INTO and INSERT OVERWRITE SELECT statements import data synchronously. If hundreds of gigabytes of data is imported, the client must maintain a connection with the AnalyticDB for MySQL server for an extended period of time. During this process, the import may fail due to a network disconnection. Therefore, if you want to import a large amount of data, we recommend that you execute the SUBMIT JOB INSERT OVERWRITE SELECT statement to import data asynchronously.

    • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO.

      INSERT INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 2: Execute the INSERT OVERWRITE statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value.

      INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 3: Execute the INSERT OVERWRITE statement to import data asynchronously. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job. You can add a hint (/*+ direct_batch_load=true*/) before the data import statement to accelerate the job, as shown in the sketch after the sample result. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE SELECT topic.

      SUBMIT JOB INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;

      Sample result:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      For information about how to submit asynchronous jobs, see Asynchronously submit an import job.
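
      Based on the preceding description of Method 3, the direct_batch_load hint is placed before the data import statement. A minimal sketch of the accelerated variant, using the same tables as above:

      -- Same job as Method 3, accelerated with the documented hint.
      SUBMIT JOB /*+ direct_batch_load=true*/
      INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;

      After you submit an asynchronous job, you can use the returned job_id to check the job state. The following is a sketch that assumes the SHOW JOB STATUS syntax; verify the exact statement in the "Asynchronously submit an import job" topic:

      -- Check the state of the asynchronous job by its job_id (syntax assumed; see the linked topic).
      SHOW JOB STATUS WHERE job='2020112122202917203100908203303******';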

Syntax for creating OSS external tables

Non-partitioned OSS external tables

CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, ...])
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"endpoint",
    "url":"OSS_LOCATION",
    "accessid":"accesskey_id",
    "accesskey":"accesskey_secret",
    "format":"txt|orc|parquet",
    "delimiter":";",
    "skip_header_line_count":1,
    "charset":"utf-8"
}';
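
For reference, the following is a minimal sketch that instantiates the preceding syntax for ORC objects stored in a directory. The table name, directory, and column names are illustrative and assume ORC objects that contain uid and other columns:

CREATE TABLE IF NOT EXISTS oss_orc_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb_data/orc/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"orc"
}';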

The following list describes the parameters in the preceding statement. The parameters are grouped by the table formats to which they apply.

Required parameters (CSV, Parquet, and ORC)

  • ENGINE='OSS': The table engine. Set the value to OSS.

  • endpoint: The endpoint of the OSS bucket. AnalyticDB for MySQL can access OSS only by using a virtual private cloud (VPC). The examples in this topic use the internal endpoint oss-cn-hangzhou-internal.aliyuncs.com.

    Note

    You can log on to the OSS console, find the bucket, and then obtain the endpoint on the Overview page.

  • url: The path of the OSS object or directory. Valid values:

    • The absolute path of an OSS object. Example: oss://testBucketname/adb/oss_import_test_data.csv.

    • The path of an OSS directory, which ends with a forward slash (/). Example: oss://testBucketname/adb/.

      Note

      If you set this parameter to the path of an OSS directory, the created external table contains all data in the directory.

    • A path that ends with an asterisk (*) wildcard, which matches all objects and directories that have the specified prefix. Example: oss://testBucketname/adb/list_file_with_prefix/test*.

      Note

      The preceding sample path matches all objects and directories that meet the specified prefix criterion, such as oss://testBucketname/adb/list_file_with_prefix/testfile1 and oss://testBucketname/adb/list_file_with_prefix/test1/file2.

  • accessid: The AccessKey ID of the Alibaba Cloud account or the Resource Access Management (RAM) user that has permissions on OSS. For information about how to obtain an AccessKey ID, see Accounts and permissions.

  • accesskey: The AccessKey secret of the Alibaba Cloud account or the RAM user that has permissions on OSS. For information about how to obtain an AccessKey secret, see Accounts and permissions.

Required parameter (CSV only)

  • delimiter: The column delimiter of the CSV object.

Required parameter (Parquet and ORC only)

  • format: The format of the OSS object.

    • When you create an external table in the Parquet format, you must set this parameter to parquet.

    • When you create an external table in the ORC format, you must set this parameter to orc.

    Note

    If you leave the format parameter empty, the CSV format is used.

Optional parameters (CSV only)

  • null_value: The NULL value of the CSV object. By default, an empty string is defined as NULL, which is equivalent to "null_value": "".

    Important

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later.

  • ossnull: The rule for defining the NULL value of the CSV object. Valid values:

    • 1 (default): EMPTY_SEPARATORS. Only empty strings are defined as NULL.

      For example, a,"",,c is interpreted as "a","",NULL,"c" based on this rule.

    • 2: EMPTY_QUOTES. Only quotation marks ("") are defined as NULL.

      For example, a,"",,c is interpreted as "a",NULL,"","c" based on this rule.

    • 3: BOTH. Both empty strings and quotation marks ("") are defined as NULL.

      For example, a,"",,c is interpreted as "a",NULL,NULL,"c" based on this rule.

    • 4: NEITHER. Neither empty strings nor quotation marks ("") are defined as NULL.

      For example, a,"",,c is interpreted as "a","","","c" based on this rule.

    Note

    The preceding examples assume "null_value": "".

  • skip_header_line_count: The number of header rows to skip when you import data. For example, if the first row of a CSV object is the table header and you set this parameter to 1, the first row is skipped during the import. Default value: 0, which specifies that no rows are skipped.

  • oss_ignore_quote_and_escape: Specifies whether to ignore quotation marks (") and escape characters. Default value: false, which specifies that quotation marks (") and escape characters are not ignored.

    Important

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later.

  • charset: The character set that is used by the OSS external table. Valid values:

    • utf-8 (default)

    • gbk

    Important

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.10.4 or later.

Note
  • The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.

  • When you create an external table, you can choose only specific columns of a Parquet or ORC file as the columns of the external table, as shown in the sketch after this note. Columns that are not selected in the Parquet or ORC file are not imported.

  • If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
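
For example, the following sketch selects a subset of columns. It assumes Parquet objects that contain the columns uid, other, and ts in that order; the table name and directory are illustrative. Only uid and ts are imported, and the other column is ignored:

CREATE TABLE IF NOT EXISTS oss_parquet_column_subset
(
    uid string,
    ts timestamp
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb_data/parquet/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet"
}';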

Data type mappings between Parquet, ORC, and AnalyticDB for MySQL

Data type mappings between Parquet and AnalyticDB for MySQL

Basic type in Parquet

Logical type in Parquet

Data type in AnalyticDB for MySQL

BOOLEAN

N/A

BOOLEAN

INT32

INT_8

TINYINT

INT32

INT_16

SMALLINT

INT32

N/A

INT or INTEGER

INT64

N/A

BIGINT

FLOAT

N/A

FLOAT

DOUBLE

N/A

DOUBLE

  • FIXED_LEN_BYTE_ARRAY

  • BINARY

  • INT64

  • INT32

DECIMAL

DECIMAL

BINARY

UTF-8

  • VARCHAR

  • STRING

  • JSON (available if the Parquet object contains a column of the JSON type)

INT32

DATE

DATE

INT64

TIMESTAMP_MILLIS

TIMESTAMP or DATETIME

INT96

N/A

TIMESTAMP or DATETIME

Important

Parquet external tables that use columns of the STRUCT type cannot be created.

Data type mappings between ORC and AnalyticDB for MySQL

Data type in ORC           | Data type in AnalyticDB for MySQL
---------------------------|----------------------------------
BOOLEAN                    | BOOLEAN
BYTE                       | TINYINT
SHORT                      | SMALLINT
INT                        | INT or INTEGER
LONG                       | BIGINT
DECIMAL                    | DECIMAL
FLOAT                      | FLOAT
DOUBLE                     | DOUBLE
BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (if the ORC object contains a column of the JSON type)
TIMESTAMP                  | TIMESTAMP or DATETIME
DATE                       | DATE

Important

ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type can be created but cannot be queried.

AnalyticDB for MySQL allows you to read and write data of TEXT files in Hive by using OSS external tables in the CSV format. The following statement can be used to create an external table:

CREATE TABLE adb_csv_hive_format_oss (
  a tinyint,
  b smallint,
  c int,
  d bigint,
  e boolean,
  f float,
  g double,
  h varchar,
  i varchar, -- binary
  j timestamp,
  k DECIMAL(10, 4),
  l varchar, -- char(10)
  m varchar, -- varchar(100)
  n date
) ENGINE = 'OSS' TABLE_PROPERTIES='{
    "format": "csv",
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "url":"oss://testBucketname/adb_data/",
    "delimiter": "\\1",
    "null_value": "\\\\N",
    "oss_ignore_quote_and_escape": "true",
    "ossnull": 2
}';
Note

When you create an OSS external table in the CSV format to read and write data of a TEXT file in Hive, take note of the following items:

  • The default column delimiter for the TEXT file in Hive is \1. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \1 must be escaped to \\1 for the delimiter parameter.

  • By default, the NULL value of the TEXT file in Hive is \N. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \N must be escaped to \\\\N for the null_value parameter.

  • The BINARY, CHAR(N), and VARCHAR(N) types in Hive all correspond to the VARCHAR type in AnalyticDB for MySQL. Other basic data types in Hive such as BOOLEAN are the same as those in AnalyticDB for MySQL.
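
After the table is created, it can be read and written like any other OSS external table. A brief sketch (adb_internal_table is a hypothetical internal table with a matching schema):

-- Read Hive TEXT data from OSS.
SELECT a, b, h FROM adb_csv_hive_format_oss LIMIT 10;

-- Write data to the OSS directory in the Hive TEXT format.
INSERT INTO adb_csv_hive_format_oss
SELECT * FROM adb_internal_table;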

Partitioned OSS external tables

A hierarchical directory is generated for OSS data that contains partitions. Example:

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

In the preceding example, p1 indicates the level-1 partition, p2 indicates the level-2 partition, and p3 indicates the level-3 partition. If you want to query the data by partition, you must specify partition key columns in the statement used to create an OSS external table. The following statement shows how to create an OSS external table and specify partition key columns in the external table. In this example, a Parquet object is used.

CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, ...])
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"endpoint",
    "url":"OSS_LOCATION",
    "accessid":"accesskey_id",
    "accesskey":"accesskey_secret",
    "format":"parquet",
    "partition_column":"p1, p2, p3"
}';
Note
  • The partition_column property in the TABLE_PROPERTIES parameter specifies the partition key columns (p1, p2, and p3 in the example). The partition key columns specified by the partition_column property must conform to the partition levels of the sample data.

  • When you define columns in the statement, you must include the partition key columns (p1, p2, and p3 in the example) and their data types. The partition key columns must be placed at the end of the column definition.

  • The partition key columns defined in the statement must be in the same order as the partition key columns specified by the partition_column property.

  • Partition key columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.

  • When you query data, partition key columns can be displayed and used in the same manner as other columns.

  • If you leave the format parameter empty, the CSV format is used.

  • For information about other parameters, see the parameter table in the "Non-partitioned OSS external tables" section of this topic.
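
For reference, the following sketch applies the preceding syntax to the sample directory shown above. It assumes that the Parquet data files contain uid and other columns; the table name is illustrative:

CREATE TABLE IF NOT EXISTS oss_parquet_partition_table
(
    uid string,
    other string,
    p1 date,
    p2 int,
    p3 varchar
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb/parquet_partition_classic/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet",
    "partition_column":"p1, p2, p3"
}';

Partition key columns can then be queried like regular columns. For example:

SELECT uid, other FROM oss_parquet_partition_table
WHERE p1 = '2020-01-01' AND p3 = 'SHANGHAI';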