In most cases, tables that contain large amounts of data in a relational database can be partitioned to improve the data query performance. Similarly, you can partition Object Storage Service (OSS) and Tablestore data in Data Lake Analytics (DLA) to perform fine-grained data processing. This reduces the time that is required to respond to a query. This topic describes how to create, use, and manage partitioned tables to process OSS data in DLA.
Mapping between partitioned tables and OSS directories
DLA maps directories or files stored in OSS to a partitioned table. Data in the table is the OSS file content. A partition key column corresponds to a directory in OSS. The OSS directories must comply with the following naming conventions:
If a partition key column corresponds to an OSS subdirectory specified in LOCATION, the OSS subdirectory is named in the format of Name of the partition key column=Value of the partition key column.
If multiple partition key columns are included in the partitioned table, the OSS subdirectories that correspond to these columns are nested in the same order as the columns listed in the table creation statement.
The following examples show the structure of OSS directories:
https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/y=2018/m=12/kv2.txt
https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/y=2019/m=2/kv1.txt
The OSS directories in the preceding examples indicate that two partition key columns y (year) and m (month) are included in the partitioned table that you created in DLA.
CREATE EXTERNAL TABLE dla (
prod_name string,
cnt bigint)
PARTITIONED BY (y string, m string)
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla/';
Usage notes
When you create and use OSS partitioned tables in DLA, take note of the following items:
The OSS directories that correspond to partition key columns must be nested in the same order as the columns listed in the partitioned table.
For the preceding OSS directory structure, the following statement used to create a partitioned table is invalid:
CREATE EXTERNAL TABLE dla ( prod_name string, cnt bigint) PARTITIONED BY (m string, y string) STORED AS TEXTFILE LOCATION 'oss://bucket-name/dla/';
DLA reads only data from the OSS directories that correspond to the partition key columns of the partitioned table.
If the partition key columns specified in the table creation statement are y and m, you can query only data in the kv3.txt file from the partitioned table. However, you cannot query data in the kv4.txt file.
https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/y=2019/m=3/kv3.txt https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/kv4.txt
Prerequisites
Before you create and use OSS partitioned tables in DLA, make sure that the following operations are performed to prepare test data in OSS:
OSS is activated. For more information, see Activate OSS.
A bucket is created. For more information, see Create a bucket.
A directory that is named in the format of Name of partition key column=Value of partition key column is created. For more information, see Create a directory.
A data file is uploaded. For more information, see Upload file data.
Step 1: Create a schema
Log on to the DLA console.
In the left-side navigation pane, click SQL access point. On the page that appears, find your VPC and click Log on in DMS in the Actions column. Then, execute the following SQL statement to create an OSS schema.
You can also use a MySQL client or program code to connect to DLA, and execute the following SQL statement to create an OSS schema.
CREATE SCHEMA dla_oss_db with DBPROPERTIES( catalog='oss', location= 'oss://oss-bucket-name/dla/' );
catalog: the type of the schema that you want to create. In this example, the OSS schema is to be created.
location: the directory of the OSS bucket in which the file is saved. This directory must end with a forward slash (
/
).
Step 2: Create a partitioned table
Execute the following SQL statement in DLA to create a partitioned table for the OSS file:
CREATE EXTERNAL TABLE dla (
prod_name string,
cnt bigint)
PARTITIONED BY (y string, m string)
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla/';
Step 3: Run the MSCK command to update partition information
After the partitioned table is created, you must run the MSCK REPAIR TABLE
command to synchronize the partition information to the table in DLA.
MSCK REPAIR TABLE dla;
The MSCK command can recognize only the information of partitions whose name complies with the naming rule Name of partition key column=Value of partition key column.
If an OSS directory that corresponds to a partition key column in the partitioned table changes, you must run the MSCK command again. Then, DLA automatically synchronizes the partition information based on the current values of the partition key columns in the partitioned table.
After the MSCK command is successfully run, the following information is returned:
Repair: Added partition to metastore dla_oss_db.dla:y=2018/m=12
Repair: Added partition to metastore dla_oss_db.dla:y=2019/m=2
Step 4: Use the SHOW PARTITIONS command to view partition information
After the MSCK command is successfully run, you can run the SHOW PARTITIONS
command to view the partition information of the partitioned table.
show partitions dla;
+-----------+
| y=2018/m=12|
| y=2019/m=2 |
Step 5: Query data from the partitioned table
select count(*) from dla;
+-------+
| _col0 |
+-------+
| 4 |
select * from dla;
+---------------------------+
|prod_name | cnt | y | m |
+---------------------------+
|Monitor | 2 |2019 | 2 |
|Hard disk | 6 |2018 | 12 |
|Keyboard | 3 |2018 | 12 |
|Mouse | 1 |2019 | 2 |
You can filter the query results based on a partition key column:
select * from dla where y='2019';
+---------------------------+
|prod_name | cnt | y | m |
+---------------------------+
|Monitor | 2 |2019 | 2 |
|Mouse | 1 |2019 | 2 |
Manage partitions
Synchronize partition information
If an OSS directory that corresponds to a partition key column in the partitioned table changes, for example, an OSS subdirectory is added, you must run the MSCK command to make the new partition information take effect.
MSCK REPAIR TABLE table_name;
Add partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...]; partition_spec: : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
You can add multiple partitions at the same time. The partition names are separated by commas (,).
ALTER TABLE order_part ADD PARTITION (dt='2008-08-08', status='ready') location '/path/to/ready/part080808', PARTITION (dt='2008-08-09', status='new') location '/path/to/new/part080809';
NoticeWhen you add a partition, you can use IF NOT EXISTS to check whether the partition exists. If it exists, the directories specified in the original LOCATION that corresponds to the partition are overwritten by the directories specified in the new LOCATION.
Delete partitions
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...] ;
You can delete multiple partitions at the same time. The partition names are separated by commas (,).
ALTER TABLE order_part DROP PARTITION (dt='2008-08-08', status='ready');
NoticeYou can delete only partitions specified by Name of partition key column=Value of partition key column. However, you cannot delete the partitions specified by expressions, for example,
partitionCol > 100
.If the partition that you deleted comply with the naming rule Name of partition key column=Value of partition key column, the deleted partition is automatically added when you run the MSCK command.