MaxCompute allows you to create an Object Storage Service (OSS) external table in your MaxCompute project. The OSS external table maps to an OSS directory. You can use the OSS external table to access unstructured data in data files in the OSS directory or write data from your MaxCompute project to the OSS directory. This topic describes the syntax and parameters that are used to create an OSS external table. This topic also provides examples on how to create an OSS external table.
Background information
OSS is a secure, cost-effective, and highly reliable cloud storage service that can store a large number of data files in any format. If you want to use MaxCompute to read data from an OSS directory or write data from a MaxCompute project to an OSS directory, you can create an OSS external table in your MaxCompute project. The OSS external table maps to the OSS directory.
You can create a partitioned table or a non-partitioned table as an OSS external table. The type of the table that you must create varies based on the format of the OSS directory in which data files are stored. If data files are stored in a partition directory, you must create a partitioned table. Otherwise, you must create a non-partitioned table. For more information about how to read partition data, see Read OSS data stored in partitions.
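For example, the following sketch contrasts the two cases by using the built-in text extractor described later in this topic. The bucket, directory, table, and column names are hypothetical:

```sql
-- Data files stored directly under a directory, for example
--   oss://.../mybucket/logs/vehicle.csv
-- map to a non-partitioned external table.
CREATE EXTERNAL TABLE logs_plain (id INT, msg STRING)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mybucket/logs/';

-- Data files stored in partition subdirectories, for example
--   oss://.../mybucket/logs_pt/ds=20240101/vehicle.csv
-- map to a partitioned external table whose partition key matches the subdirectory name.
CREATE EXTERNAL TABLE logs_pt (id INT, msg STRING)
PARTITIONED BY (ds STRING)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mybucket/logs_pt/';
```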
Prerequisites
Before you create an OSS external table, make sure that the Alibaba Cloud account or RAM user that you use meets the following requirements:
Access to OSS has been granted.
OSS external tables can be accessed by using an Alibaba Cloud account, a RAM user, or a RAM role. For more information about authorization, see STS authorization for OSS.
Permissions to create tables in a MaxCompute project have been granted.
For more information about operation permissions on tables in a MaxCompute project, see MaxCompute permissions.
Precautions
When you use OSS external tables, take note of the following items:
The statements that are used to create OSS external tables differ only in specific parameters, depending on the format of the data files. We recommend that you read the syntax and the parameter descriptions before you create a table, so that the table meets your business requirements. If you configure an invalid parameter when you create an OSS external table, you may fail to read data from or write data to OSS by using the table.
OSS external tables record only the mappings between these tables and OSS directories. If you delete an OSS external table, data files in the OSS directory that is mapped to the table are not deleted.
If a data file stored in OSS is an object of the Archive storage class, you must first restore the file. For more information about the restoration operation, see Restore objects.
You must use the classic network endpoint of OSS. MaxCompute does not ensure network connectivity for public endpoints.
Limits
OSS external tables do not support the clustering attribute.
Platforms
The following table lists platforms in which you can create OSS external tables.
Creation method | Tool |
MaxCompute SQL statements | |
Web UI | |
Syntax used to create an OSS external table
The following table describes the syntax that is used to create an OSS external table in each scenario and provides references to examples for each scenario. For more information about syntax parameters and properties, see Reference: Parameters in the syntax, Reference: WITH SERDEPROPERTIES properties, and Reference: TBLPROPERTIES properties.
Scenario | Syntax | Supported format of OSS data files | Example |
Create an external table by using a built-in text extractor |
|
| |
Create an external table by using a built-in open source data extractor |
|
Note Only Hudi data that is generated by Data Lake Formation (DLF) can be read. | Example: Create an OSS external table by using a built-in open source data extractor |
Create an external table by using a custom extractor |
| Data files except the preceding formats |
If the OSS external table that you created is a partitioned table, you must execute the MSCK or ALTER statement before you can use the OSS external table. For more information, see Syntax used to add partition data to OSS external tables.
For more information about examples of creating OSS external tables, see Example: Create an OSS external table and specify the first row as the table header and Example: Create an OSS external table in which the number of columns is different from the number of columns in the OSS data file.
Syntax used to add partition data to OSS external tables
If the OSS external table that you created is a partitioned table, partition data needs to be added to the OSS external table.
Method 1 (recommended): MaxCompute automatically parses the OSS directory structure, identifies partitions, and then adds partition data to the OSS external table.
With this method, MaxCompute automatically adds partitions to the OSS external table based on the partition directories under the location that you specified when you created the table. You do not need to add partitions one by one based on partition key column names and partition values. This method is suitable for adding all missing historical partitions at a time.
MSCK REPAIR TABLE <mc_oss_extable_name> ADD PARTITIONS [WITH PROPERTIES (key:VALUE, key:VALUE ...)];
Note This method is not suitable for scenarios in which partitions are continuously added, especially when the OSS directory contains a large number of partitions. For example, if the OSS directory contains more than 1,000 partitions, we recommend that you do not use this method. If the number of new partitions is far less than the number of existing partitions, frequent use of the msck command results in a large number of repeated scans and metadata update requests for the OSS directory. This significantly reduces the command execution efficiency. Therefore, we recommend that you use Method 2 if you want to update data in new partitions.
Method 2: Manually run the following commands to add partition data to the OSS external table:
If specific partitions have been created and you need to periodically add partitions, we recommend that you use this method to create partitions before you run tasks for data writing. After a partition is created, the external table can read the latest data from the OSS directory without the need to update the related partition even if new data is written to the OSS directory.
ALTER TABLE <mc_oss_extable_name> ADD PARTITION (<col_name>=<col_value>) [ADD PARTITION (<col_name>=<col_value>)...] [LOCATION URL];
The values of col_name and col_value must be consistent with the names of the OSS subdirectories in which the partition data files are stored. For the OSS directory structure of the partition data files shown in the following figure, the value of col_name is direction, and the values of col_value are N, NE, S, SW, and W. One ADD PARTITION clause in the ALTER TABLE statement adds a partition that is mapped to one OSS subdirectory. The number of ADD PARTITION clauses must be the same as the number of OSS subdirectories.
Sample statement
Create a directory named demo8 in OSS and create two partition folders under the directory to store the related files.
The file demo8-pt1.txt is stored in the partition folder $pt1=1/$pt2=2.
The file demo8-pt2.txt is stored in the partition folder $pt1=3/$pt2=4.
Create an external table and specify partition fields.
-- Create an OSS external table.
CREATE EXTERNAL TABLE mf_oss_spe_pt (id int, name string)
PARTITIONED BY (pt1 string, pt2 string)
STORED AS TEXTFILE
LOCATION "oss://oss-cn-beijing-internal.aliyuncs.com/mfoss*******/demo8/";
-- Specify partition fields.
MSCK REPAIR TABLE mf_oss_spe_pt ADD PARTITIONS
WITH PROPERTIES ('odps.msck.partition.column.mapping'='pt1:$pt1,pt2:$pt2');
-- Query data from the external table.
SELECT * FROM mf_oss_spe_pt WHERE pt1=1 AND pt2=2;
-- The following result is returned:
+------------+------------+------------+------------+
| id         | name       | pt1        | pt2        |
+------------+------------+------------+------------+
| 1          | kyle       | 1          | 2          |
| 2          | nicole     | 1          | 2          |
+------------+------------+------------+------------+
-- Query data from the external table.
SELECT * FROM mf_oss_spe_pt WHERE pt1=3 AND pt2=4;
+------------+------------+------------+------------+
| id         | name       | pt1        | pt2        |
+------------+------------+------------+------------+
| 3          | john       | 3          | 4          |
| 4          | lily       | 3          | 4          |
+------------+------------+------------+------------+
If the names of partition key columns in the OSS external table are inconsistent with the OSS directory structure, specify OSS directories.
-- Mappings between MaxCompute partitions and OSS directories:
-- pt1=8 --> 8
-- pt2=8 --> $pt2=8
-- Add partitions.
ALTER TABLE mf_oss_spe_pt ADD PARTITION (pt1=8,pt2=8)
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo8/8/$pt2=8/';
-- Disable the commit mode.
-- Insert data into the external table.
SET odps.sql.unstructured.oss.commit.mode=false;
INSERT INTO mf_oss_spe_pt PARTITION (pt1=8,pt2=8) VALUES (1,'tere');
-- Query data from the external table.
SET odps.sql.unstructured.oss.commit.mode=false;
SELECT * FROM mf_oss_spe_pt WHERE pt1=8 AND pt2=8;
+------+------+-----+-----+
| id   | name | pt1 | pt2 |
+------+------+-----+-----+
| 1    | tere | 8   | 8   |
+------+------+-----+-----+
Example: Prepare data
Sample data is provided for you to better understand the examples in this topic.
oss_endpoint: oss-cn-hangzhou-internal.aliyuncs.com, which indicates that OSS is deployed in the China (Hangzhou) region.
Bucket name: oss-mc-test.
Directory names: Demo1/, Demo2/, Demo3/, and SampleData/.
.The following data files are uploaded to the preceding directories:
The vehicle.csv file is uploaded to the Demo1/ directory. The vehicle.csv file is used to create a mapping between the Demo1/ directory and the non-partitioned external table that is created by using a built-in text extractor. The file contains the following data:
1,1,51,1,46.81006,-92.08174,9/14/2014 0:00,S
1,2,13,1,46.81006,-92.08174,9/14/2014 0:00,NE
1,3,48,1,46.81006,-92.08174,9/14/2014 0:00,NE
1,4,30,1,46.81006,-92.08174,9/14/2014 0:00,W
1,5,47,1,46.81006,-92.08174,9/14/2014 0:00,S
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00,S
1,7,53,1,46.81006,-92.08174,9/15/2014 0:00,N
1,8,63,1,46.81006,-92.08174,9/15/2014 0:00,SW
1,9,4,1,46.81006,-92.08174,9/15/2014 0:00,NE
1,10,31,1,46.81006,-92.08174,9/15/2014 0:00,N
The Demo2/ directory contains the following subdirectories: direction=N/, direction=NE/, direction=S/, direction=SW/, and direction=W/. The vehicle1.csv, vehicle2.csv, vehicle3.csv, vehicle4.csv, and vehicle5.csv files are separately uploaded to the five subdirectories. The files are used to create mappings between the Demo2/ directory and the partitioned external table that is created by using a built-in text extractor. The files contain the following data:
--vehicle1.csv
1,7,53,1,46.81006,-92.08174,9/15/2014 0:00
1,10,31,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle2.csv
1,2,13,1,46.81006,-92.08174,9/14/2014 0:00
1,3,48,1,46.81006,-92.08174,9/14/2014 0:00
1,9,4,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle3.csv
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
1,5,47,1,46.81006,-92.08174,9/14/2014 0:00
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle4.csv
1,8,63,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle5.csv
1,4,30,1,46.81006,-92.08174,9/14/2014 0:00
The vehicle.csv.gz file is uploaded to the Demo3/ directory. The vehicle.csv.gz file contains a vehicle.csv file whose content is the same as that of the vehicle.csv file in the Demo1/ directory. The file is used to create a mapping between the Demo3/ directory and the OSS external table whose data is compressed.
The vehicle6.csv file is uploaded to the SampleData/ directory. The vehicle6.csv file is used to create a mapping between the SampleData/ directory and the OSS external table that is created by using a built-in open source data extractor. The file contains the following data:
1|1|51|1|46.81006|-92.08174|9/14/2014 0:00|S
1|2|13|1|46.81006|-92.08174|9/14/2014 0:00|NE
1|3|48|1|46.81006|-92.08174|9/14/2014 0:00|NE
1|4|30|1|46.81006|-92.08174|9/14/2014 0:00|W
1|5|47|1|46.81006|-92.08174|9/14/2014 0:00|S
1|6|9|1|46.81006|-92.08174|9/14/2014 0:00|S
1|7|53|1|46.81006|-92.08174|9/14/2014 0:00|N
1|8|63|1|46.81006|-92.08174|9/14/2014 0:00|SW
1|9|4|1|46.81006|-92.08174|9/14/2014 0:00|NE
1|10|31|1|46.81006|-92.08174|9/14/2014 0:00|N
Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor
Create a mapping between the non-partitioned external table and the Demo1/ directory in Example: Prepare data. Sample statement used to create an OSS external table:
create external table if not exists mc_oss_csv_external1
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string,
direction string
)
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
)
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';
You can execute the DESC EXTENDED mc_oss_csv_external1; statement to view the schema of the OSS external table that you created.
Example: Create a partitioned table as an OSS external table by using a built-in text extractor
Create a mapping between the partitioned external table and the Demo2/ directory in Example: Prepare data. Sample statements used to create an OSS external table and add partition data to the OSS external table:
create external table if not exists mc_oss_csv_external2
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string
)
partitioned by (
direction string
)
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
)
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/';
-- Add partition data.
msck repair table mc_oss_csv_external2 add partitions;
-- The preceding statement is equivalent to the following statement:
alter table mc_oss_csv_external2 add partition (direction = 'N') partition (direction = 'NE') partition (direction = 'S') partition (direction = 'SW') partition (direction = 'W');
You can execute the DESC EXTENDED mc_oss_csv_external2; statement to view the schema of the OSS external table that you created.
Example: Create an OSS external table that is used to read or write compressed data by using a built-in text extractor
Create a mapping between the OSS external table and the Demo3/ directory in Example: Prepare data. Sample statement used to create an OSS external table:
create external table if not exists mc_oss_csv_external3
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string,
direction string
)
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole',
'odps.text.option.gzip.input.enabled'='true',
'odps.text.option.gzip.output.enabled'='true'
)
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/';
You can execute the DESC EXTENDED mc_oss_csv_external3; statement to view the schema of the OSS external table that you created.
Example: Create an OSS external table by using a built-in open source data extractor
TEXTFILE data files are mapped to the OSS external table.
Example of creating an OSS external table that is associated with OSS data in the TEXT format
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS TEXTFILE
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
CREATE EXTERNAL TABLE statements do not support custom row format settings. The default row format settings are as follows:
FIELDS TERMINATED BY: '\001'
ESCAPED BY: '\'
COLLECTION ITEMS TERMINATED BY: '\002'
MAP KEYS TERMINATED BY: '\003'
LINES TERMINATED BY: '\n'
NULL DEFINED AS: '\N'
Example of creating an OSS external table that is associated with OSS data in the CSV format
-- Disable the native text reader.
SET odps.ext.hive.lazy.simple.serde.native=false;
-- Create an OSS external table.
CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = '"',
  "escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION '<oss_location>'
TBLPROPERTIES (
  "skip.header.line.count"="1",
  "skip.footer.line.count"="1"
);
SELECT ... FROM <mc_oss_extable_name> ...;
Note OpenCSVSerde supports only data of the STRING type.
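Because OpenCSVSerde returns every column as a STRING, a common pattern is to declare all external table columns as STRING and cast them at query time. A minimal sketch; the table and column names are hypothetical:

```sql
-- Disable the native text reader, as required for OpenCSVSerde.
SET odps.ext.hive.lazy.simple.serde.native=false;
-- Declare every column as STRING.
CREATE EXTERNAL TABLE csv_raw
(
  id string,
  amount string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '<oss_location>';

-- Cast to the desired types at query time.
SELECT CAST(id AS BIGINT) AS id, CAST(amount AS DOUBLE) AS amount
FROM csv_raw;
```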
Example of creating an OSS external table that maps to JSON data files
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
PARQUET data files are mapped to the OSS external table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS PARQUET
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
ORC data files are mapped to the OSS external table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS ORC
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
Note For more information about the examples, see Use OSS external tables to perform data mapping based on field names.
RCFILE data files are mapped to the OSS external table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_rcfile_extable>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
STORED AS RCFILE
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_rcfile_extable> ...;
AVRO data files are mapped to the OSS external table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS AVRO
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
SEQUENCEFILE data files are mapped to the OSS external table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS SEQUENCEFILE
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
Hudi data files are mapped to the OSS external table.
Note Hudi external tables support only reading data from all files that are mapped to the external table. They do not support automatically hiding system columns, incremental reads, snapshot reads, or write operations. We recommend that you use features such as MaxCompute Delta tables or Paimon external tables to achieve ACID-compliant read and write operations.
The default version of the Hudi SDK integrated into MaxCompute is org.apache.hudi:hudi-hadoop-mr-bundle:0.12.2-emr-1.0.6. MaxCompute does not guarantee forward or backward compatibility of the Hudi SDK; compatibility is ensured by the open source community. When you create an external table, MaxCompute allows you to upload JAR files and specify the SDK compatibility version, so that data compatible with specific versions can be read.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
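As an illustration of specifying a different SDK version, a compatible Hudi bundle JAR that has already been added to the project as a resource can be referenced with a USING clause, in the same way as the other JAR-based examples in this topic. This is a sketch; the JAR file name below is hypothetical:

```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<oss_location>'
USING 'hudi-hadoop-mr-bundle-custom.jar';  -- hypothetical resource name
```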
Delta Lake data files are mapped to the OSS external table.
Note Delta Lake external tables support only reading data from all files that are mapped to the external table. They do not support automatically hiding system columns, incremental reads, snapshot reads, or write operations. We recommend that you use features such as MaxCompute Delta tables or Paimon external tables to achieve ACID-compliant read and write operations.
The default version of the Delta Lake SDK integrated into MaxCompute is io.delta:delta-core_2.11:0.2.0.5. MaxCompute does not guarantee forward or backward compatibility of the Delta Lake SDK; compatibility is ensured by the open source community. When you create an external table, MaxCompute allows you to upload JAR files and specify the SDK compatibility version, so that data compatible with specific versions can be read.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
  OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION '<oss_location>';
SELECT ... FROM <mc_oss_extable_name> ...;
Apache Paimon data files are mapped to the OSS external table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
  'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
)
LOCATION '<oss_location>'
USING 'paimon_maxcompute_connector.jar';
You must upload the JAR file paimon_maxcompute_connector.jar to the MaxCompute project in advance. For more information, see Apache Paimon external tables.
Example: Create an OSS external table by using a custom extractor
Create a mapping between the OSS external table and the SampleData/ directory in Example: Prepare data.
Perform the following steps:
Use Java programs TextExtractor.java, TextOutputer.java, SplitReader.java, and TextStorageHandler.java that are developed by using MaxCompute Studio.
For more information about how to develop a Java program, see Develop a UDF.
MaxCompute Studio allows you to package the TextStorageHandler.java program into a JAR file and upload the file to your MaxCompute project as a resource.
In this example, the JAR file is named javatest-1.0-SNAPSHOT.jar. For more information about how to package resources into files and upload the files to your MaxCompute project, see Package a Java program, upload the package, and create a MaxCompute UDF.
Note If multiple dependencies are required, separately package them and upload them as MaxCompute resources.
Execute the following statement to create an OSS external table:
CREATE EXTERNAL TABLE [IF NOT EXISTS] ambulance_data_txt_external
(
  vehicleId int,
  recordId int,
  patientId int,
  calls int,
  locationLatitute double,
  locationLongtitue double,
  recordTime string,
  direction string
)
STORED BY 'com.aliyun.odps.udf.example.text.TextStorageHandler'
WITH SERDEPROPERTIES (
  'delimiter'='|',
  'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SampleData/'
USING 'javatest-1.0-SNAPSHOT.jar';
The delimiter property specifies the delimiter of the OSS data.
You can execute the DESC EXTENDED ambulance_data_txt_external; statement to view the schema of the OSS external table that you created.
Note If you create an OSS external table by using a custom extractor, data in the external table is not sharded. This prevents data accuracy issues when the extractor is used to read data from OSS. If data in shards can be processed, you must run the following command to enable data sharding. When data sharding is enabled, multiple mappers are started to process data.
set odps.sql.unstructured.data.single.file.split.enabled=true;
Example: Create an OSS external table that is used to read or write non-text data by using a custom extractor
Create a mapping between the OSS external table and the SpeechSentence/ directory in Example: Prepare data.
Perform the following steps:
Use Java programs SpeechSentenceSnrExtractor.java and SpeechStorageHandler.java that are developed by using MaxCompute Studio.
For more information about how to develop a Java program, see Develop a UDF.
MaxCompute Studio allows you to package the SpeechStorageHandler.java program into a JAR file and upload the file to your MaxCompute project as a resource.
In this example, the JAR file is named speechtest-1.0-SNAPSHOT.jar. For more information about how to package resources into files and upload the files to your MaxCompute project, see Package a Java program, upload the package, and create a MaxCompute UDF.
Execute the following statement to create an OSS external table:
CREATE EXTERNAL TABLE [IF NOT EXISTS] speech_sentence_snr_external
(
  sentence_snr double,
  id string
)
STORED BY 'com.aliyun.odps.udf.example.speech.SpeechStorageHandler'
WITH SERDEPROPERTIES (
  'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole',
  'mlfFileName'='sm_random_5_utterance.text.label',
  'speechSampleRateInKHz' = '16'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SpeechSentence/'
USING 'speechtest-1.0-SNAPSHOT.jar,sm_random_5_utterance.text.label';
You can execute the DESC EXTENDED speech_sentence_snr_external; statement to view the schema of the OSS external table that you created.
Example: Create an OSS external table and specify the first row as the table header
-- Create an OSS external table.
CREATE EXTERNAL TABLE mf_oss_wtt
(
id bigint,
name string,
tran_amt double
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
'odps.text.option.header.lines.count' = '1',
'odps.sql.text.option.flush.header' = 'true'
)
location 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo11/';
-- Insert data into the external table.
INSERT overwrite TABLE mf_oss_wtt VALUES (1, 'val1', 1.1),(2, 'value2', 1.3);
-- Query data from the external table.
-- When you create the table, you can set the data types of all fields to STRING. Otherwise, an error is reported when the table header is read.
-- When you create the table, you can also configure the following settings to skip the table header: 'odps.text.option.header.lines.count' = '1'.
SELECT * FROM mf_oss_wtt;
+------------+------+------------+
| id | name | tran_amt |
+------------+------+------------+
| 1 | val1 | 1.1 |
| 2 | value2 | 1.3 |
+------------+------+------------+
The following figure shows the data in the OSS file that corresponds to the external table.
Example: Create an OSS external table in which the number of columns is different from the number of columns in the OSS data file
Create a data file in the CSV format and upload the data file to the OSS directory doc-test-01/demo.
1,kyle1,this is desc1
2,kyle2,this is desc2,this is two
3,kyle3,this is desc3,this is three, I have 4 columns
Create an OSS external table.
Perform the truncate operation to process the rows in the external table in which the number of columns is different from the number of columns in the data file.
-- Delete the table.
DROP TABLE test_mismatch;
-- Create an OSS external table.
CREATE EXTERNAL TABLE IF NOT EXISTS test_mismatch
(
  id string,
  name string,
  dect string,
  col4 string
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH SERDEPROPERTIES ('odps.sql.text.schema.mismatch.mode' = 'truncate')
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/doc-test-01/demo/';
Perform the ignore operation to process the rows in the table in which the number of columns is different from the number of columns in the data file.
-- Delete the table.
DROP TABLE test_mismatch01;
-- Create an OSS external table.
CREATE EXTERNAL TABLE IF NOT EXISTS test_mismatch01
(
  id string,
  name string,
  dect string,
  col4 string
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH SERDEPROPERTIES ('odps.sql.text.schema.mismatch.mode' = 'ignore')
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/doc-test-01/demo/';
Query data in the table.
SELECT * FROM test_mismatch;
-- The following result is returned:
+----+-------+---------------+---------------+
| id | name  | dect          | col4          |
+----+-------+---------------+---------------+
| 1  | kyle1 | this is desc1 | NULL          |
| 2  | kyle2 | this is desc2 | this is two   |
| 3  | kyle3 | this is desc3 | this is three |
+----+-------+---------------+---------------+
SELECT * FROM test_mismatch01;
-- The following result is returned:
+----+-------+---------------+-------------+
| id | name  | dect          | col4        |
+----+-------+---------------+-------------+
| 2  | kyle2 | this is desc2 | this is two |
+----+-------+---------------+-------------+
Reference: Parameters in the syntax
This section describes the parameters in the syntax.
Parameter | Required | Description |
mc_oss_extable_name | Yes | The name of the OSS external table that you want to create. Table names are not case-sensitive, and forced uppercase or lowercase conversion is not supported when you query external tables. |
col_name | Yes | The name of a column in the OSS external table. Before you read OSS data, make sure that the schema of the OSS external table is the same as the schema of the OSS data files. Otherwise, OSS data cannot be read. |
data_type | Yes | The data type of a column in the OSS external table. Before you read OSS data, make sure that the data types of columns in the OSS external table are the same as the data types of columns in the OSS data files. Otherwise, OSS data cannot be read. |
partitioned by (<col_name> <data_type>, ...) | Required if specific conditions are met | If data files in OSS are stored in partitioned directories, this parameter is required to create a partitioned external table.
|
StorageHandler | Required if specific conditions are met | If you use a built-in text extractor or a custom extractor to create an OSS external table, this parameter is required. MaxCompute extractors are categorized into the following types:
|
'<property_name>'='<property_value>' | Yes | The extended property of the external table. For more information about the properties, see Reference: WITH SERDEPROPERTIES properties. |
oss_location | Yes | The OSS directory where data files are stored. The OSS directory is in the
|
jar_name | Required if specific conditions are met | If you use a custom extractor to create an OSS external table, this parameter is required. This parameter specifies the JAR package that corresponds to the custom extractor code. The JAR package must be added to your MaxCompute project as a resource. For more information about how to add resources, see Add resources. |
serde_class | No | This parameter specifies the built-in open source data extractor of MaxCompute. If the data file is in the TEXTFILE format, this parameter is required. You do not need to configure this parameter in other scenarios. Mappings between open source data formats that are supported by MaxCompute and SerDe classes:
|
file_format | Required if specific conditions are met | If OSS data files are in an open source format, this parameter is required. This parameter specifies the format of the OSS data files. Note The size of a single file cannot exceed 3 GB. If the file size exceeds 3 GB, we recommend that you split the file into multiple files. |
resource_name | No | If you use a custom SerDe class, you must use this parameter to specify the resource that contains the custom SerDe class. JAR packages that are related to the SerDe class must be added to your MaxCompute project as resources. For more information about how to add resources, see Add resources. |
'<tbproperty_name>'='<tbproperty_value>' | No | The extended property of the external table. For more information about the properties, see Reference: TBLPROPERTIES properties. |
Reference: WITH SERDEPROPERTIES properties
| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| odps.properties.rolearn | If you use Security Token Service (STS) authorization, add this property. | Specifies the Alibaba Cloud Resource Name (ARN) of the RAM role that is granted permissions to access OSS. | You can obtain the ARN from the role details page in the RAM console. | None |
| odps.text.option.gzip.input.enabled | If you need to read data from a CSV.GZ or TSV.GZ data file, add this property. | CSV or TSV compression property. If you set this property to True, MaxCompute can read data from CSV.GZ or TSV.GZ data files. Otherwise, the read operation fails. | True, False | False |
| odps.text.option.gzip.output.enabled | If you need to write data to OSS by using the GZIP compression algorithm, add this property. | CSV or TSV compression property. If you set this property to True, MaxCompute writes data to OSS by using the GZIP compression algorithm. Otherwise, the data is not compressed. | True, False | False |
| odps.text.option.header.lines.count | If the OSS data file is in the CSV or TSV format and you need to ignore the first N rows of the file, add this property. | Specifies the number of leading rows that MaxCompute ignores when it reads the OSS data file. | A non-negative integer | 0 |
| odps.text.option.null.indicator | If the OSS data file is in the CSV or TSV format and you need to define the parsing rule for NULL values, add this property. | Specifies the string that is parsed as NULL in an SQL statement. | A string | Empty string |
| odps.text.option.ignore.empty.lines | If the OSS data file is in the CSV or TSV format and you need to define the processing rule for empty rows, add this property. | If you set this property to True, MaxCompute ignores the empty rows in the data file. Otherwise, MaxCompute reads the empty rows. | True, False | True |
| odps.text.option.encoding | If the OSS data file is in the CSV or TSV format and its encoding format is not the default encoding format, add this property. | Make sure that the encoding format configured here is the same as the encoding format of the OSS data file. Otherwise, MaxCompute cannot read the data. | A valid encoding format | UTF-8 |
| odps.text.option.delimiter | If you need to specify the column delimiter of a CSV or TSV data file, add this property. | Make sure that the column delimiter specified here can be used to correctly split each column of the OSS data file. Otherwise, the data that MaxCompute reads is misaligned. | A single character | Comma (,) |
| odps.text.option.use.quote | If fields in a CSV or TSV data file contain carriage return and line feed (CRLF) pairs, double quotation marks ("), or commas (,), add this property. | Specifies whether double quotation marks (") are recognized as the quote character of a CSV file. If fields are enclosed in double quotation marks, set this property to True so that column delimiters inside quoted fields are not misinterpreted. | True, False | False |
| mcfed.parquet.compression | If you need to write data in a PARQUET data file to OSS by using a compression algorithm, add this property. | PARQUET compression property. By default, data in a PARQUET data file is not compressed. | ZSTD, SNAPPY, GZIP | None |
| mcfed.parquet.compression.codec.zstd.level | If mcfed.parquet.compression is set to ZSTD, you can add this property to specify the compression level. | A higher level results in a higher compression ratio. However, in practice, a high level reduces the volume of written data only slightly while the time and resource consumption increases rapidly, which significantly reduces cost-effectiveness. For scenarios that involve large-scale reads and writes of compressed PARQUET files, a lower zstd compression level (level 3 to level 5) usually delivers the best results. | 1 to 22 | 3 |
| parquet.file.cache.size | If you need to improve the performance of reading PARQUET data files, add this property. | Specifies the maximum amount of data that can be cached when you read OSS data files. Unit: KB. | 1024 | None |
| parquet.io.buffer.size | If you need to improve the performance of reading PARQUET data files, add this property. | Specifies the maximum amount of data that can be cached when the size of the OSS data file exceeds 1,024 KB. Unit: KB. | 4096 | None |
| separatorChar | If you need to specify the column delimiter of a CSV data file in the TEXTFILE format, add this property. | Specifies the column delimiter of the CSV data file. | A single string | Comma (,) |
| quoteChar | If fields in a CSV data file in the TEXTFILE format contain CRLF pairs, double quotation marks ("), or commas (,), add this property. | Specifies the quote character of the CSV data file. | A single string | None |
| escapeChar | If you need to specify the escape rule of a CSV data file in the TEXTFILE format, add this property. | Specifies the escape character of the CSV data file. | A single string | None |
| mcfed.orc.schema.resolution | If the schemas of the data in the same OSS external table are different, add this property. | Specifies the parsing mode of ORC files. If this property is set to name, ORC files are parsed based on column names. | name | By default, ORC files are parsed based on column IDs, which is equivalent to setting this property to id. |
| odps.sql.text.option.flush.header | If you need to use the first row of the file block as the table header when you write data to OSS, add this property. | This property takes effect only for CSV files. | True, False | False |
| odps.sql.text.schema.mismatch.mode | If the number of columns in the OSS data file is different from the number of columns in the schema of the external table, add this property. | Specifies how to process rows whose number of columns is different from the number of columns in the external table schema. Note: If the odps.text.option.use.quote property is set to True, this feature does not take effect. | error, truncate, ignore | error |
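To see how several of these properties combine in one statement, the following sketch creates an external table that reads GZIP-compressed CSV files, skips one header row, and parses \N as NULL. All object names and the OSS path are placeholders.

```sql
-- Sketch only: read CSV.GZ files, skip the header row, treat \N as NULL.
-- Table name, columns, and the OSS path are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_gz_demo
(
    id    BIGINT,
    name  STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH SERDEPROPERTIES (
    'odps.text.option.gzip.input.enabled'='true',   -- read .csv.gz files
    'odps.text.option.header.lines.count'='1',      -- ignore the first row
    'odps.text.option.null.indicator'='\\N'         -- parse \N as NULL
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/csv_gz/';
```

Because each property is passed as a string key-value pair, misspelled property names are silently ignored; double-check the names against the table above if a setting appears to have no effect.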
Reference: TBLPROPERTIES properties
| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| skip.header.line.count | If you need to ignore the first N rows of a CSV data file in the TEXTFILE format, add this property. | When MaxCompute reads the OSS data, the specified number of rows from the beginning of the file are ignored. | A non-negative integer | N/A |
| skip.footer.line.count | If you need to ignore the last N rows of a CSV data file in the TEXTFILE format, add this property. | When MaxCompute reads the OSS data, the specified number of rows from the end of the file are ignored. | A non-negative integer | N/A |
| mcfed.orc.compress | If you need to write data in an ORC data file to OSS by using a compression algorithm, add this property. | ORC compression property. Specifies the compression algorithm of ORC data files. | SNAPPY, ZLIB | N/A |
| mcfed.mapreduce.output.fileoutputformat.compress | If you need to write data in a TEXTFILE data file to OSS by using a compression algorithm, add this property. | TEXTFILE compression property. If you set this property to True, MaxCompute writes TEXTFILE data files to OSS by using a compression algorithm. Otherwise, the data is not compressed. | True, False | False |
| mcfed.mapreduce.output.fileoutputformat.compress.codec | If you need to write data in a TEXTFILE data file to OSS by using a compression algorithm, add this property. | TEXTFILE compression property. Specifies the compression codec of TEXTFILE data files. Note: Only specific compression codecs are supported. | A supported compression codec class name | N/A |
| io.compression.codecs | If the OSS data file is in the raw Snappy format, add this property. | If you set this property to com.aliyun.odps.io.compress.SnappyRawCodec, MaxCompute can read data from raw Snappy compressed files. Otherwise, the read operation fails. | com.aliyun.odps.io.compress.SnappyRawCodec | N/A |
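As a sketch of how TBLPROPERTIES are attached to an open-format external table, the following statement creates a CSV external table in the TEXTFILE format that ignores its header row. The SerDe class shown is the open source OpenCSVSerde; the table name, columns, and OSS path are placeholders.

```sql
-- Sketch: a CSV external table in the TEXTFILE format that skips the header row.
-- The OSS path and table name are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_textfile_demo
(
    id    BIGINT,
    name  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar'=',',    -- column delimiter
    'quoteChar'='"'         -- quote character for fields that contain commas
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/textfile/'
TBLPROPERTIES ('skip.header.line.count'='1');
```

Note that TBLPROPERTIES appears after the LOCATION clause, whereas WITH SERDEPROPERTIES follows the SerDe or StorageHandler declaration.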
References
After you create an OSS external table, you can use the external table to read the data that is stored in an OSS directory. For more information, see Read OSS data.