If you want to import data from Object Storage Service (OSS) to AnalyticDB for MySQL, you can use external tables to import OSS objects in the Parquet, CSV, or ORC format to an AnalyticDB for MySQL Data Warehouse Edition cluster.
Prerequisites
An AnalyticDB for MySQL Data Warehouse Edition cluster is created.
Note: If the AnalyticDB for MySQL cluster is of the Data Warehouse Edition in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.
An OSS bucket is created in the same region as the AnalyticDB for MySQL cluster. For more information, see Activate OSS.
Sample data
In this example, the oss_import_test_data.csv object is uploaded to the testBucketname/adb/ directory in OSS. The row delimiter is a line feed, and the column delimiter is a semicolon (;). The following code shows part of the data contained in the oss_import_test_data.csv object:
uid;other
12;hello_world_1
27;hello_world_2
28;hello_world_3
33;hello_world_4
37;hello_world_5
40;hello_world_6
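The sample object above can be parsed locally to see how the semicolon delimiter and the header row behave, which mirrors the `delimiter` and `skip_header_line_count` properties used later in this topic. This is a minimal Python sketch with inlined sample data; it does not access OSS.

```python
import csv
import io

# A few rows of the sample object, as shown above.
sample = """uid;other
12;hello_world_1
27;hello_world_2
28;hello_world_3
"""

# delimiter=";" mirrors the "delimiter" table property.
reader = csv.reader(io.StringIO(sample), delimiter=";")
rows = list(reader)

# skip_header_line_count=1 means the first row (the header) is skipped on import.
header, data = rows[0], rows[1:]
print(header)   # ['uid', 'other']
print(data[0])  # ['12', 'hello_world_1']
```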
Procedure
Create a database in the AnalyticDB for MySQL cluster.
CREATE DATABASE adb_demo;
Create an external table. You can execute the CREATE TABLE statement to create an OSS external table in the CSV, Parquet, or ORC format. For more information, see the "Syntax for creating OSS external tables" section of this topic.
In this example, a non-partitioned external table named oss_import_test_external_table in the CSV format is used.
CREATE TABLE IF NOT EXISTS oss_import_test_external_table (
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://testBucketname/adb/oss_import_test_data.csv",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "delimiter":";",
    "skip_header_line_count":1,
    "charset":"utf-8"
}';
Query the data of the oss_import_test_external_table external table.
Note: For queries on CSV, Parquet, and ORC objects, a larger amount of data results in greater performance loss. To improve query efficiency, we recommend that you perform Steps 4 and 5 to import the data of the OSS external table to AnalyticDB for MySQL before you query the data.
SELECT uid, other FROM oss_import_test_external_table WHERE uid < 100 LIMIT 10;
Create a table in the AnalyticDB for MySQL cluster to store data that is imported from the OSS external table.
CREATE TABLE IF NOT EXISTS adb_oss_import_test ( uid string, other string ) DISTRIBUTED BY HASH(uid);
Execute an INSERT statement to import data from the OSS external table to AnalyticDB for MySQL.
Important: By default, the INSERT INTO and INSERT OVERWRITE SELECT statements import data synchronously. If hundreds of gigabytes of data is imported, the client must maintain a connection with the AnalyticDB for MySQL server for an extended period of time. During this process, the import may fail due to a network disconnection. Therefore, if you want to import a large amount of data, we recommend that you execute the SUBMIT JOB INSERT OVERWRITE SELECT statement to import data asynchronously.
Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO.
INSERT INTO adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
Method 2: Execute the INSERT OVERWRITE statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value.
INSERT OVERWRITE adb_oss_import_test SELECT * FROM oss_import_test_external_table;
Method 3: Execute the INSERT OVERWRITE statement to import data asynchronously. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job. You can add a hint (/*+ direct_batch_load=true*/) before the data import statement to accelerate the job. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE SELECT topic.
SUBMIT JOB INSERT OVERWRITE adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
Sample result:
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
For information about how to submit asynchronous jobs, see Asynchronously submit an import job.
Syntax for creating OSS external tables
Non-partitioned OSS external tables
CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, ...])
ENGINE='OSS'
TABLE_PROPERTIES='{
"endpoint":"endpoint",
"url":"OSS_LOCATION",
"accessid":"accesskey_id",
"accesskey":"accesskey_secret",
"format":"txt|orc|parquet",
"delimiter":";",
"skip_header_line_count":1,
"charset":"utf-8"
}';
| Table format | Parameter | Required | Description |
| --- | --- | --- | --- |
| CSV, Parquet, or ORC | ENGINE='OSS' | Yes | The table engine. Set the value to OSS. |
| | endpoint | Yes | The endpoint of the OSS bucket. AnalyticDB for MySQL can access OSS only by using a virtual private cloud (VPC). Note: You can log on to the OSS console, find the bucket, and then obtain the endpoint on the Overview page. |
| | url | Yes | The path of the OSS object or directory. You can specify the absolute path of a single object, or the path of a directory that ends with a forward slash (/). |
| | accessid | Yes | The AccessKey ID of the Alibaba Cloud account or the Resource Access Management (RAM) user that has permissions on OSS. For information about how to obtain an AccessKey ID, see Accounts and permissions. |
| | accesskey | Yes | The AccessKey secret of the Alibaba Cloud account or the RAM user that has permissions on OSS. For information about how to obtain an AccessKey secret, see Accounts and permissions. |
| CSV | delimiter | Yes | The column delimiter of the CSV object. |
| Parquet or ORC | format | Yes | The format of the OSS object. Set the value to parquet or orc. Note: If you leave this parameter empty, the CSV format is used. |
| CSV | null_value | No | The string that represents NULL values in the CSV object. Important: This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. |
| | ossnull | No | The rule for defining the NULL value. |
| | skip_header_line_count | No | The number of header rows to skip when you import data. The first row of a CSV object is the table header. If you set this parameter to 1, the first row of the object is skipped when you import data. The default value of this parameter is 0, which specifies that no rows are skipped. |
| | oss_ignore_quote_and_escape | No | Specifies whether to ignore quotation marks (") and escape characters. The default value of this parameter is false, which specifies that quotation marks (") and escape characters are not ignored. Important: This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. |
| | charset | No | The character set that is used by the OSS external table. Important: This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.10.4 or later. |
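The TABLE_PROPERTIES value is a single JSON object embedded in a single-quoted SQL string. One way to avoid quoting mistakes is to assemble it with a JSON serializer before building the DDL statement. The helper below is a hypothetical convenience, not part of any AnalyticDB client library; the parameter names mirror the table above.

```python
import json

# Hypothetical helper: build the TABLE_PROPERTIES JSON for an OSS external
# table. Required keys come first; optional keys (delimiter, charset, ...)
# are passed as keyword arguments.
def build_table_properties(endpoint, url, accessid, accesskey, **optional):
    props = {
        "endpoint": endpoint,
        "url": url,
        "accessid": accessid,
        "accesskey": accesskey,
    }
    props.update(optional)
    # json.dumps emits double quotes only, so the result is safe to embed
    # inside the single-quoted TABLE_PROPERTIES string literal.
    return json.dumps(props)

props = build_table_properties(
    "oss-cn-hangzhou-internal.aliyuncs.com",
    "oss://testBucketname/adb/oss_import_test_data.csv",
    "LTAIF****5FsE",
    "Ccw****iWjv",
    delimiter=";",
    skip_header_line_count=1,
)
ddl = f"CREATE TABLE t (uid string, other string) ENGINE='OSS' TABLE_PROPERTIES='{props}';"
```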
The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.
When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.
If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
Data type mappings between Parquet, ORC, and AnalyticDB for MySQL
Data type mappings between Parquet and AnalyticDB for MySQL
| Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL |
| --- | --- | --- |
| BOOLEAN | N/A | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | N/A | INT or INTEGER |
| INT64 | N/A | BIGINT |
| FLOAT | N/A | FLOAT |
| DOUBLE | N/A | DOUBLE |
| FIXED_LEN_BYTE_ARRAY | DECIMAL | DECIMAL |
| BINARY | UTF-8 | VARCHAR |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | N/A | TIMESTAMP or DATETIME |
Parquet external tables that use columns of the STRUCT type cannot be created.
Data type mappings between ORC and AnalyticDB for MySQL
| Data type in ORC | Data type in AnalyticDB for MySQL |
| --- | --- |
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| STRING, CHAR, or VARCHAR | VARCHAR |
| BINARY | VARCHAR |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |
ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type can be created but cannot be queried.
AnalyticDB for MySQL allows you to read and write data of TEXT files in Hive by using OSS external tables in the CSV format. The following statement can be used to create an external table:
CREATE TABLE adb_csv_hive_format_oss (
a tinyint,
b smallint,
c int,
d bigint,
e boolean,
f float,
g double,
h varchar,
i varchar, -- binary
j timestamp,
k DECIMAL(10, 4),
l varchar, -- char(10)
m varchar, -- varchar(100)
n date
) ENGINE = 'OSS' TABLE_PROPERTIES='{
"format": "csv",
"endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
"accessid":"LTAIF****5FsE",
"accesskey":"Ccw****iWjv",
"url":"oss://testBucketname/adb_data/",
"delimiter": "\\1",
"null_value": "\\\\N",
"oss_ignore_quote_and_escape": "true",
"ossnull": 2
}';
When you create an OSS external table in the CSV format to read and write data of a TEXT file in Hive, take note of the following items:
- The default column delimiter of a TEXT file in Hive is \1. If you want to use the OSS external table to read and write data of the TEXT file, \1 must be escaped to \\1 for the delimiter parameter.
- By default, the NULL value of a TEXT file in Hive is \N. If you want to use the OSS external table to read and write data of the TEXT file, \N must be escaped to \\\\N for the null_value parameter.
- The BINARY, CHAR(N), and VARCHAR(N) types in Hive all correspond to the VARCHAR type in AnalyticDB for MySQL. Other basic data types in Hive, such as BOOLEAN, are the same as those in AnalyticDB for MySQL.
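The double escaping above is easier to follow by running the JSON layer locally. The sketch below is plain Python with no AnalyticDB involved; it shows what the JSON parser inside TABLE_PROPERTIES yields, under the assumption that the server then applies one more backslash-unescaping pass of its own to recover Hive's actual \x01 delimiter and \N marker.

```python
import json

# The delimiter and null_value from the Hive example above, exactly as they
# appear inside the TABLE_PROPERTIES string (a raw string keeps Python from
# adding yet another escaping layer).
table_properties = r'{"delimiter": "\\1", "null_value": "\\\\N"}'

props = json.loads(table_properties)

# The JSON parser consumes one level of backslashes:
assert props["delimiter"] == "\\" + "1"          # 2 chars: \1
assert props["null_value"] == "\\" + "\\" + "N"  # 3 chars: \\N

# Assumption: the server unescapes these values once more, which yields
# Hive's defaults: the \x01 control-character column delimiter and the
# literal two-character NULL marker \N.
```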
Partitioned OSS external tables
A hierarchical directory is generated for OSS data that contains partitions. Example:
parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0
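As a quick illustration of this Hive-style layout, the sketch below extracts partition values from an object key. The helper and the sample key are made up for this example; AnalyticDB performs this mapping internally when it reads a partitioned directory.

```python
# Parse Hive-style "name=value" partition directories from an OSS object key.
def partition_values(object_key: str) -> dict:
    parts = object_key.split("/")[:-1]  # drop the data file name (000000_0)
    return dict(p.split("=", 1) for p in parts if "=" in p)

key = "parquet_partition_classic/p1=2020-01-01/p2=4/p3=SHANGHAI/000000_0"
print(partition_values(key))
# {'p1': '2020-01-01', 'p2': '4', 'p3': 'SHANGHAI'}
```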
In the preceding example, p1 indicates the level-1 partition, p2 indicates the level-2 partition, and p3 indicates the level-3 partition. If you want to query the data by partition, you must specify partition key columns in the statement used to create an OSS external table. The following statement shows how to create an OSS external table and specify partition key columns in the external table. In this example, a Parquet object is used.
CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, ...])
ENGINE='OSS'
TABLE_PROPERTIES='{
"endpoint":"endpoint",
"url":"OSS_LOCATION",
"accessid":"accesskey_id",
"accesskey":"accesskey_secret",
"format":"parquet",
"partition_column":"p1, p2, p3"
}';
- The partition_column property in the TABLE_PROPERTIES parameter specifies the partition key columns (p1, p2, and p3 in the example). The partition key columns specified by the partition_column property must conform to the partition levels of the sample data.
- When you define columns in the statement, you must include the partition key columns (p1, p2, and p3 in the example) and their data types. The partition key columns must be placed at the end of the column definitions.
- The partition key columns defined in the statement must be in the same order as the partition key columns specified by the partition_column property.
- Partition key columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
- When you query data, partition key columns can be displayed and used in the same manner as other columns.
- If you leave the format parameter empty, the CSV format is used.
- For information about other parameters, see the parameter table in the "Non-partitioned OSS external tables" section of this topic.