AnalyticDB:Use external tables to import data to Data Warehouse Edition

Last Updated: Oct 29, 2024

If you want to import data from Object Storage Service (OSS) to AnalyticDB for MySQL, you can use external tables to import OSS objects in the Parquet, CSV, or ORC format to Data Warehouse Edition.

Prerequisites

  • An AnalyticDB for MySQL Data Warehouse Edition cluster is created.

    Note

    If the AnalyticDB for MySQL cluster is of the Data Warehouse Edition in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.

  • An OSS bucket is created in the same region as the AnalyticDB for MySQL cluster. For more information, see Activate OSS.

Sample data

In this example, the oss_import_test_data.csv object is uploaded to the testBucketname/adb/ directory in OSS. The row delimiter is a line feed, and the column delimiter is a semicolon (;). The following example shows part of the data contained in the oss_import_test_data.csv object:

uid;other
12;hello_world_1
27;hello_world_2
28;hello_world_3
33;hello_world_4
37;hello_world_5
40;hello_world_6        

Procedure

  1. Create a database in the AnalyticDB for MySQL cluster.

    CREATE DATABASE adb_demo;
  2. Create an external table. You can execute the CREATE TABLE statement to create an OSS external table in the CSV, Parquet, or ORC format. For more information, see the "Syntax for creating OSS external tables" section of this topic.

    In this example, a non-partitioned external table named oss_import_test_external_table in the CSV format is used.

    CREATE TABLE IF NOT EXISTS oss_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='OSS'
    TABLE_PROPERTIES='{
        "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
        "url":"oss://testBucketname/adb/oss_import_test_data.csv",
        "accessid":"LTAIF****5FsE",
        "accesskey":"Ccw****iWjv",
        "delimiter":";",
        "skip_header_line_count":1,
        "charset":"utf-8"
    }'; 
  3. Query the data of the oss_import_test_external_table external table.

    Note

    For queries on CSV, Parquet, and ORC objects, performance overhead increases with the amount of data. To improve query efficiency, we recommend that you perform Steps 4 and 5 to import the data of the OSS external table into AnalyticDB for MySQL before you query it.

    SELECT uid, other FROM oss_import_test_external_table WHERE uid < 100 LIMIT 10;
  4. Create a table in the AnalyticDB for MySQL cluster to store data that is imported from the OSS external table.

    CREATE TABLE IF NOT EXISTS adb_oss_import_test
    (
        uid string,
        other string
    )
    DISTRIBUTED BY HASH(uid);
  5. Execute an INSERT statement to import data from the OSS external table to AnalyticDB for MySQL.

    Important

    By default, the INSERT INTO and INSERT OVERWRITE SELECT statements import data synchronously. If hundreds of gigabytes of data is imported, the client must maintain a connection with the AnalyticDB for MySQL server for an extended period of time. During this process, the import may fail due to a network disconnection. Therefore, if you want to import a large amount of data, we recommend that you execute the SUBMIT JOB INSERT OVERWRITE SELECT statement to import data asynchronously.

    • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO.

      INSERT INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 2: Execute the INSERT OVERWRITE statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value.

      INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 3: Execute the INSERT OVERWRITE statement to import data asynchronously. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job. You can add a hint (/*+ direct_batch_load=true*/) before the data import statement to accelerate the job, as shown in the sketch after the sample result. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE SELECT topic.

      SUBMIT JOB INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;

      Sample result:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      For information about how to submit asynchronous jobs, see Asynchronously submit an import job.
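
      Based on the preceding description of Method 3, the direct_batch_load hint is placed before the data import statement. A minimal sketch of the accelerated variant, using the same tables as above:

      -- Same job as Method 3, accelerated with the documented hint.
      SUBMIT JOB /*+ direct_batch_load=true*/
      INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;

      After you submit an asynchronous job, you can use the returned job_id to check the job state. The following is a sketch that assumes the SHOW JOB STATUS syntax; verify the exact statement in the "Asynchronously submit an import job" topic:

      -- Check the state of the asynchronous job by its job_id (syntax assumed; see the linked topic).
      SHOW JOB STATUS WHERE job='2020112122202917203100908203303******';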

Syntax for creating OSS external tables

Non-partitioned OSS external tables

CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, ...])
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"endpoint",
    "url":"OSS_LOCATION",
    "accessid":"accesskey_id",
    "accesskey":"accesskey_secret",
    "format":"txt|orc|parquet",
    "delimiter":";",
    "skip_header_line_count":1,
    "charset":"utf-8"
}';
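
For reference, the following is a minimal sketch that instantiates the preceding syntax for ORC objects stored in a directory. The table name, directory, and column names are illustrative and assume ORC objects that contain uid and other columns:

CREATE TABLE IF NOT EXISTS oss_orc_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb_data/orc/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"orc"
}';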

The following list describes the parameters in the preceding statement. The parameters are grouped by the table formats to which they apply.

Required parameters (CSV, Parquet, and ORC)

  • ENGINE='OSS': The table engine. Set the value to OSS.

  • endpoint: The endpoint of the OSS bucket. AnalyticDB for MySQL can access OSS only by using a virtual private cloud (VPC). The examples in this topic use the internal endpoint oss-cn-hangzhou-internal.aliyuncs.com.

    Note

    You can log on to the OSS console, find the bucket, and then obtain the endpoint on the Overview page.

  • url: The path of the OSS object or directory. Valid values:

    • The absolute path of an OSS object. Example: oss://testBucketname/adb/oss_import_test_data.csv.

    • The path of an OSS directory, which ends with a forward slash (/). Example: oss://testBucketname/adb/.

      Note

      If you set this parameter to the path of an OSS directory, the created external table contains all data in the directory.

    • A path that ends with an asterisk (*) wildcard, which matches all objects and directories that have the specified prefix. Example: oss://testBucketname/adb/list_file_with_prefix/test*.

      Note

      The preceding sample path matches all objects and directories that meet the specified prefix criterion, such as oss://testBucketname/adb/list_file_with_prefix/testfile1 and oss://testBucketname/adb/list_file_with_prefix/test1/file2.

  • accessid: The AccessKey ID of the Alibaba Cloud account or the Resource Access Management (RAM) user that has permissions on OSS. For information about how to obtain an AccessKey ID, see Accounts and permissions.

  • accesskey: The AccessKey secret of the Alibaba Cloud account or the RAM user that has permissions on OSS. For information about how to obtain an AccessKey secret, see Accounts and permissions.

Required parameter (CSV only)

  • delimiter: The column delimiter of the CSV object.

Required parameter (Parquet and ORC only)

  • format: The format of the OSS object.

    • When you create an external table in the Parquet format, you must set this parameter to parquet.

    • When you create an external table in the ORC format, you must set this parameter to orc.

    Note

    If you leave the format parameter empty, the CSV format is used.

Optional parameters (CSV only)

  • null_value: The NULL value of the CSV object. By default, an empty string is defined as NULL, which is equivalent to "null_value": "".

    Important

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later.

  • ossnull: The rule for defining the NULL value of the CSV object. Valid values:

    • 1 (default): EMPTY_SEPARATORS. Only empty strings are defined as NULL.

      For example, a,"",,c is interpreted as "a","",NULL,"c" based on this rule.

    • 2: EMPTY_QUOTES. Only quotation marks ("") are defined as NULL.

      For example, a,"",,c is interpreted as "a",NULL,"","c" based on this rule.

    • 3: BOTH. Both empty strings and quotation marks ("") are defined as NULL.

      For example, a,"",,c is interpreted as "a",NULL,NULL,"c" based on this rule.

    • 4: NEITHER. Neither empty strings nor quotation marks ("") are defined as NULL.

      For example, a,"",,c is interpreted as "a","","","c" based on this rule.

    Note

    The preceding examples assume "null_value": "".

  • skip_header_line_count: The number of header rows to skip when you import data. For example, if the first row of a CSV object is the table header and you set this parameter to 1, the first row is skipped during the import. Default value: 0, which specifies that no rows are skipped.

  • oss_ignore_quote_and_escape: Specifies whether to ignore quotation marks (") and escape characters. Default value: false, which specifies that quotation marks (") and escape characters are not ignored.

    Important

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later.

  • charset: The character set that is used by the OSS external table. Valid values:

    • utf-8 (default)

    • gbk

    Important

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.10.4 or later.

Note
  • The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.

  • When you create an external table, you can choose only specific columns of a Parquet or ORC file as the columns of the external table, as shown in the sketch after this note. Columns that are not selected in the Parquet or ORC file are not imported.

  • If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
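
For example, the following sketch selects a subset of columns. It assumes Parquet objects that contain the columns uid, other, and ts in that order; the table name and directory are illustrative. Only uid and ts are imported, and the other column is ignored:

CREATE TABLE IF NOT EXISTS oss_parquet_column_subset
(
    uid string,
    ts timestamp
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb_data/parquet/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet"
}';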

Data type mappings between Parquet, ORC, and AnalyticDB for MySQL

Data type mappings between Parquet and AnalyticDB for MySQL

Basic type in Parquet

Logical type in Parquet

Data type in AnalyticDB for MySQL

BOOLEAN

N/A

BOOLEAN

INT32

INT_8

TINYINT

INT32

INT_16

SMALLINT

INT32

N/A

INT or INTEGER

INT64

N/A

BIGINT

FLOAT

N/A

FLOAT

DOUBLE

N/A

DOUBLE

  • FIXED_LEN_BYTE_ARRAY

  • BINARY

  • INT64

  • INT32

DECIMAL

DECIMAL

BINARY

UTF-8

  • VARCHAR

  • STRING

  • JSON (available if the Parquet object contains a column of the JSON type)

INT32

DATE

DATE

INT64

TIMESTAMP_MILLIS

TIMESTAMP or DATETIME

INT96

N/A

TIMESTAMP or DATETIME

Important

Parquet external tables that use columns of the STRUCT type cannot be created.

Data type mappings between ORC and AnalyticDB for MySQL

Data type in ORC           | Data type in AnalyticDB for MySQL
---------------------------|----------------------------------
BOOLEAN                    | BOOLEAN
BYTE                       | TINYINT
SHORT                      | SMALLINT
INT                        | INT or INTEGER
LONG                       | BIGINT
DECIMAL                    | DECIMAL
FLOAT                      | FLOAT
DOUBLE                     | DOUBLE
BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (if the ORC object contains a column of the JSON type)
TIMESTAMP                  | TIMESTAMP or DATETIME
DATE                       | DATE

Important

ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type can be created but cannot be queried.

AnalyticDB for MySQL allows you to read and write data of TEXT files in Hive by using OSS external tables in the CSV format. The following statement can be used to create an external table:

CREATE TABLE adb_csv_hive_format_oss (
  a tinyint,
  b smallint,
  c int,
  d bigint,
  e boolean,
  f float,
  g double,
  h varchar,
  i varchar, -- binary
  j timestamp,
  k DECIMAL(10, 4),
  l varchar, -- char(10)
  m varchar, -- varchar(100)
  n date
) ENGINE = 'OSS' TABLE_PROPERTIES='{
    "format": "csv",
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "url":"oss://testBucketname/adb_data/",
    "delimiter": "\\1",
    "null_value": "\\\\N",
    "oss_ignore_quote_and_escape": "true",
    "ossnull": 2
}';
Note

When you create an OSS external table in the CSV format to read and write data of a TEXT file in Hive, take note of the following items:

  • The default column delimiter for the TEXT file in Hive is \1. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \1 must be escaped to \\1 for the delimiter parameter.

  • By default, the NULL value of the TEXT file in Hive is \N. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \N must be escaped to \\\\N for the null_value parameter.

  • The BINARY, CHAR(N), and VARCHAR(N) types in Hive all correspond to the VARCHAR type in AnalyticDB for MySQL. Other basic data types in Hive such as BOOLEAN are the same as those in AnalyticDB for MySQL.
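
After the table is created, it can be read and written like any other OSS external table. A brief sketch (adb_internal_table is a hypothetical internal table with a matching schema):

-- Read Hive TEXT data from OSS.
SELECT a, b, h FROM adb_csv_hive_format_oss LIMIT 10;

-- Write data to the OSS directory in the Hive TEXT format.
INSERT INTO adb_csv_hive_format_oss
SELECT * FROM adb_internal_table;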

Partitioned OSS external tables

A hierarchical directory is generated for OSS data that contains partitions. Example:

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

In the preceding example, p1 indicates the level-1 partition, p2 indicates the level-2 partition, and p3 indicates the level-3 partition. If you want to query the data by partition, you must specify partition key columns in the statement used to create an OSS external table. The following statement shows how to create an OSS external table and specify partition key columns in the external table. In this example, a Parquet object is used.

CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, ...])
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"endpoint",
    "url":"OSS_LOCATION",
    "accessid":"accesskey_id",
    "accesskey":"accesskey_secret",
    "format":"parquet",
    "partition_column":"p1, p2, p3"
}';
Note
  • The partition_column property in the TABLE_PROPERTIES parameter specifies the partition key columns (p1, p2, and p3 in the example). The partition key columns specified by the partition_column property must conform to the partition levels of the sample data.

  • When you define columns in the statement, you must include the partition key columns (p1, p2, and p3 in the example) and their data types. The partition key columns must be placed at the end of the column definition.

  • The partition key columns defined in the statement must be in the same order as the partition key columns specified by the partition_column property.

  • Partition key columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.

  • When you query data, partition key columns can be displayed and used in the same manner as other columns.

  • If you leave the format parameter empty, the CSV format is used.

  • For information about other parameters, see the parameter table in the "Non-partitioned OSS external tables" section of this topic.
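
For reference, the following sketch applies the preceding syntax to the sample directory shown above. It assumes that the Parquet data files contain uid and other columns; the table name is illustrative:

CREATE TABLE IF NOT EXISTS oss_parquet_partition_table
(
    uid string,
    other string,
    p1 date,
    p2 int,
    p3 varchar
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb/parquet_partition_classic/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet",
    "partition_column":"p1, p2, p3"
}';

Partition key columns can then be queried like regular columns. For example:

SELECT uid, other FROM oss_parquet_partition_table
WHERE p1 = '2020-01-01' AND p3 = 'SHANGHAI';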