MaxCompute allows you to create Object Storage Service (OSS) external tables as partitioned tables to access OSS data that is stored in partitions. This method reduces the volume of data that must be read and improves the data processing efficiency. This topic describes the standard and custom formats of OSS partition directories that are supported by MaxCompute.
Background information
After you create an OSS external table, MaxCompute performs a full table scan on all data in the OSS directory, including data files in subdirectories. If the volume of data in the OSS directory is large, a full table scan causes extra I/O operations and prolongs data processing. You can address this issue by using one of the following methods:
Method 1 (recommended): Store data in a standard or custom partition directory in OSS. When you create an OSS external table by using MaxCompute, you must specify the partitions and oss_location information in the statement that is used to create the table. We recommend that you use standard partition directories. For more information, see Standard partition directories and Custom partition directories.
Method 2: Plan multiple directories to store data. You can create multiple OSS external tables to read the data that is stored in each OSS directory. This way, each OSS external table points to a subset of OSS data. This method is complex and has poor data management performance. We recommend that you do not use this method.
Standard partition directories
The following example shows the format of a standard partition directory.
oss://<oss_endpoint>/<Bucket name>/<Directory name>/<partitionKey1=value1>/<partitionKey2=value2>/...
Data is uploaded to OSS in the OSS console or by using other OSS tools. Therefore, the format of the directory where the data is saved varies based on the format of the data that is uploaded.
Parameter | Required | Description |
oss_endpoint | Yes | The OSS endpoint. We recommend that you use an internal endpoint of OSS to prevent extra fees that are incurred by Internet traffic. For more information about the internal endpoints of OSS, see Regions and endpoints. Note We recommend that the OSS bucket for storing data is deployed in the same region as the MaxCompute project. MaxCompute can be deployed only in some regions. Therefore, cross-region data connectivity issues may occur. |
Bucket Name | Yes | The name of the OSS bucket. For more information about how to view the bucket name, see List buckets. |
Directory Name | Yes | The name of the OSS directory. You do not need to include file names in directory names. |
partitionKey | Yes | The name of the partition key column in the OSS external table. |
value | Yes | The value of the partition key column in the OSS external table. |
For example, a company stores log files in the CSV format in OSS and uses MaxCompute to process data in these log files on a daily basis. Specify the following standard partition directories for OSS data:
oss://oss-odps-test/log_data/year=2016/month=06/day=01/logfile
oss://oss-odps-test/log_data/year=2016/month=06/day=02/logfile
oss://oss-odps-test/log_data/year=2016/month=07/day=10/logfile
oss://oss-odps-test/log_data/year=2016/month=08/day=08/logfile
...
The following statements show how to create an OSS external table and import and analyze partition data based on the preceding partition directories:
-- Create an OSS external table.
CREATE EXTERNAL TABLE log_table_external (
click STRING,
ip STRING,
url STRING
)
partitioned BY (
year STRING,
month STRING,
day STRING
)
stored BY 'com.aliyun.odps.CsvStorageHandler'
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket_name/oss-odps-test/log_data/';
-- Import OSS partition data.
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '01');
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '02');
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '07', day = '10');
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '08', day = '08');
...
-- Analyze the data. MaxCompute accesses only the file (logfile) in the log_data/year=2016/month=06/day=01 subdirectory. MaxCompute does not scan all data in the log_data directory.
SELECT count(DISTINCT(ip)) FROM log_table_external WHERE year = '2016' AND month = '06' AND day = '01';
For more information about how to read data in a standard partition directory, see Example: Create a partitioned table as an OSS external table by using a built-in text extractor.
Custom partition directories
If data in OSS is stored in partitions in a non-standard partition directory, MaxCompute allows you to associate different subdirectories with different partitions.
The following example shows the format of custom partition directories. A custom partition directory contains only values of partition key columns and does not contain names of partition key columns.
oss://oss-odps-test/log_data_customized/2016/06/01/logfile
oss://oss-odps-test/log_data_customized/2016/06/02/logfile
oss://oss-odps-test/log_data_customized/2016/07/10/logfile
oss://oss-odps-test/log_data_customized/2016/08/08/logfile
...
After you create an OSS external table, execute the alter table ... add partition ... location ...
statement to specify subdirectories and associate these subdirectories with different partitions. Sample statements:
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '01')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket_name/oss-odps-test/log_data_customized/2016/06/01/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '02')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket_name/oss-odps-test/log_data_customized/2016/06/02/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '07', day = '10')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket_name/oss-odps-test/log_data_customized/2016/07/10/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '08', day = '08')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket_name/oss-odps-test/log_data_customized/2016/08/08/';
After you specify custom partition directories, you can access data in the subdirectories regardless of whether the data is saved in standard partition directories.