
MaxCompute: Create an OSS external table

Last Updated: Nov 22, 2024

MaxCompute allows you to create an Object Storage Service (OSS) external table in your MaxCompute project. The OSS external table maps to an OSS directory. You can use the external table to read unstructured data from files in the OSS directory or to write data from your MaxCompute project to the directory. This topic describes the syntax and parameters that are used to create an OSS external table and provides examples of how to create one.

Background information

OSS is a secure, cost-effective, and highly reliable cloud storage service that can store a large number of data files in any format. If you want to use MaxCompute to read data from an OSS directory or write data from a MaxCompute project to an OSS directory, you can create an OSS external table in your MaxCompute project. The OSS external table maps to the OSS directory.

You can create a partitioned table or a non-partitioned table as an OSS external table. The type of table that you create depends on the structure of the OSS directory in which the data files are stored: if the data files are stored in partition directories, you must create a partitioned table; otherwise, create a non-partitioned table. For more information about how to read partition data, see Read OSS data stored in partitions.
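
For example, the following sketch (with hypothetical endpoint, bucket, and file names) shows the two directory layouts and the table type that each layout calls for:

    -- Flat directory: create a non-partitioned table.
    --   oss://<oss_endpoint>/<bucket>/data/vehicle1.csv
    --   oss://<oss_endpoint>/<bucket>/data/vehicle2.csv

    -- Partition directories: create a partitioned table with partition key column direction.
    --   oss://<oss_endpoint>/<bucket>/data/direction=N/vehicle1.csv
    --   oss://<oss_endpoint>/<bucket>/data/direction=S/vehicle2.csv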

Prerequisites

Before you create an OSS external table, make sure that the Alibaba Cloud account or RAM user that you use meets the following requirements:

  • Access to OSS has been granted.

    OSS external tables can be accessed by using an Alibaba Cloud account, a RAM user, or a RAM role. For more information about authorization, see STS authorization for OSS.

  • Permissions to create tables in a MaxCompute project have been granted.

    For more information about operation permissions on tables in a MaxCompute project, see MaxCompute permissions.

Precautions

When you use OSS external tables, take note of the following items:

  • For data files in different formats, the statements that are used to create OSS external tables differ only in specific parameters. We recommend that you read the syntax that is used to create an OSS external table and the parameter descriptions before you create a table, so that the table meets your business requirements. If a parameter is misconfigured, you may fail to read OSS data or write data to OSS by using the external table.

  • OSS external tables record only the mappings between these tables and OSS directories. If you delete an OSS external table, data files in the OSS directory that is mapped to the table are not deleted.

  • If a data file stored in OSS is an object of the Archive storage class, you must first restore the file. For more information about the restoration operation, see Restore objects.

  • You must use the classic network endpoint of OSS. MaxCompute does not ensure network connectivity for public endpoints.

Limits

OSS external tables do not support the clustering attribute.

Platforms

You can create OSS external tables by using the following methods and tools:

  • MaxCompute SQL statements: on the MaxCompute client, in ODPS SQL nodes in the DataWorks console, or in SQL scripts in MaxCompute Studio.

  • Web UI: in MaxCompute Studio or in the DataWorks console.

Syntax used to create an OSS external table

The following items describe the syntax that is used to create an OSS external table in each scenario and provide references to the examples for each scenario. For more information about syntax parameters and properties, see Reference: Parameters in the syntax, Reference: WITH SERDEPROPERTIES properties, and Reference: TBLPROPERTIES properties.

  • Create an external table by using a built-in text extractor

    Syntax:

      CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
      (
      <col_name> <data_type>,
      ...
      )
      [PARTITIONED BY (<col_name> <data_type>, ...)]
      STORED BY '<StorageHandler>'
      WITH SERDEPROPERTIES (
       ['<property_name>'='<property_value>',...]
      )
      LOCATION '<oss_location>';

    Supported formats of OSS data files:

    • CSV

    • CSV data files compressed in the GZIP format

    • TSV

    • TSV data files compressed in the GZIP format

    Examples: Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor, Example: Create a partitioned table as an OSS external table by using a built-in text extractor, and Example: Create an OSS external table that is used to read or write compressed data by using a built-in text extractor.

  • Create an external table by using a built-in open source data extractor

    Syntax:

      CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
      (
      <col_name> <data_type>,
      ...
      )
      [PARTITIONED BY (<col_name> <data_type>, ...)]
      [ROW FORMAT SERDE '<serde_class>'
        [WITH SERDEPROPERTIES (
          ['<property_name>'='<property_value>',...])
        ]
      ]
      STORED AS <file_format>
      LOCATION '<oss_location>'
      [USING '<resource_name>']
      [TBLPROPERTIES ('<tbproperty_name>'='<tbproperty_value>',...)];

    Supported formats of OSS data files:

    • PARQUET

    • PARQUET data files compressed in the ZSTD, SNAPPY, or GZIP format

    • TEXTFILE (JSON or TEXT) data files

    • TEXTFILE data files compressed in the SNAPPY, LZO, BZ2, GZ, or DEFLATE format

    • ORC

    • ORC data files compressed in the SNAPPY or ZLIB format

    • RCFILE

    • AVRO

    • SEQUENCEFILE

    Note

    Only Hudi data that is generated by Data Lake Formation (DLF) can be read.

    Example: Example: Create an OSS external table by using a built-in open source data extractor.

  • Create an external table by using a custom extractor

    Syntax:

      CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
      (
      <col_name> <data_type>,
      ...
      )
      [PARTITIONED BY (<col_name> <data_type>, ...)]
      STORED BY '<StorageHandler>'
      WITH SERDEPROPERTIES (
       ['<property_name>'='<property_value>',...]
      )
      LOCATION '<oss_location>'
      USING '<jar_name>';

    Supported formats of OSS data files: data files in formats other than the preceding formats.

    Example: Example: Create an OSS external table by using a custom extractor.

Syntax used to add partition data to OSS external tables

If the OSS external table that you created is a partitioned table, you must also add partition data to the table.

  • Method 1 (recommended): MaxCompute automatically parses the OSS directory structure, identifies partitions, and then adds partition data to the OSS external table.

    With this method, MaxCompute automatically adds all partitions to the external table based on the partition directories that you specified when you created the table, instead of requiring you to add partitions one by one based on the names of partition key columns and partitions. This method is suitable for adding all missing historical partitions at a time.

    MSCK REPAIR TABLE <mc_oss_extable_name> ADD PARTITIONS [WITH PROPERTIES ('<key>'='<value>'[, '<key>'='<value>' ...])];

    Note

    This method is not suitable for scenarios in which partitions are continuously added, especially if the OSS directory contains a large number of partitions. For example, if the OSS directory contains more than 1,000 partitions, we recommend that you do not use this method. If the number of new partitions is far less than the number of existing partitions, frequent use of the MSCK command causes a large number of repeated scans of the OSS directory and metadata update requests, which significantly reduces execution efficiency. Therefore, we recommend that you use Method 2 if you want to update data in new partitions.

  • Method 2: Manually run the following commands to add partition data to the OSS external table.

    If specific partitions have been created and you need to periodically add partitions, we recommend that you use this method to create the partitions before you run the tasks that write data. After a partition is created, the external table can read the latest data from the OSS directory without the need to update the partition, even if new data is written to the OSS directory.

    ALTER TABLE <mc_oss_extable_name>
        ADD PARTITION (<col_name>=<col_value>)
        [ADD PARTITION (<col_name>=<col_value>) ...]
        [LOCATION '<oss_location>'];

    The values of col_name and col_value must be consistent with the names of the OSS subdirectories in which the partition data files are stored. For example, if the partition data files are stored in the subdirectories direction=N/, direction=NE/, direction=S/, direction=SW/, and direction=W/, as in the Demo2/ directory in Example: Prepare data, col_name is direction and the values of col_value are N, NE, S, SW, and W. Each ADD PARTITION clause in the ALTER TABLE statement adds one partition that maps to one OSS subdirectory, so the number of ADD PARTITION clauses must be the same as the number of OSS subdirectories.

  • Sample statement

    1. Create a directory named demo8 in OSS and create two partition folders under the directory to store the related files.

      • The file demo8-pt1.txt is stored in the partition folder $pt1=1/$pt2=2.

      • The file demo8-pt2.txt is stored in the partition folder $pt1=3/$pt2=4.

    2. Create an external table and specify partition fields.

      -- Create an OSS external table.
      CREATE EXTERNAL TABLE mf_oss_spe_pt (id int, name string)
      PARTITIONED BY (pt1 string, pt2 string)
      STORED AS TEXTFILE
      LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/mfoss*******/demo8/';
      
      -- Map the partition key columns pt1 and pt2 to the OSS directory names $pt1 and $pt2.
      MSCK REPAIR TABLE mf_oss_spe_pt ADD PARTITIONS
      WITH PROPERTIES ('odps.msck.partition.column.mapping'='pt1:$pt1,pt2:$pt2');
      
      -- Query data from the external table.
      SELECT * FROM mf_oss_spe_pt WHERE pt1=1 AND pt2=2;
      -- The following result is returned:
      +------------+------------+------------+------------+
      | id         | name       | pt1        | pt2        |
      +------------+------------+------------+------------+
      | 1          | kyle       | 1          | 2          |
      | 2          | nicole     | 1          | 2          |
      +------------+------------+------------+------------+
      -- Query data from the external table.
      SELECT * FROM mf_oss_spe_pt WHERE pt1=3 AND pt2=4;
      +------------+------------+------------+------------+
      | id         | name       | pt1        | pt2        |
      +------------+------------+------------+------------+
      | 3          | john       | 3          | 4          |
      | 4          | lily       | 3          | 4          |
      +------------+------------+------------+------------+                               
    3. If the names of partition key columns in the OSS external table are inconsistent with the OSS directory structure, specify OSS directories.

      -- Mappings between MaxCompute partitions and OSS subdirectories:
      --   Partition pt1=8 maps to the OSS subdirectory 8.
      --   Partition pt2=8 maps to the OSS subdirectory $pt2=8.
      -- Add the partition.
      ALTER TABLE mf_oss_spe_pt ADD PARTITION (pt1=8,pt2=8)
            LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo8/8/$pt2=8/';
      -- Disable the commit mode and insert data into the external table.
      SET odps.sql.unstructured.oss.commit.mode=false;
      INSERT INTO mf_oss_spe_pt PARTITION (pt1=8,pt2=8) VALUES (1,'tere');
      -- Disable the commit mode and query data from the external table.
      SET odps.sql.unstructured.oss.commit.mode=false;
      SELECT * FROM mf_oss_spe_pt WHERE pt1=8 AND pt2=8;
      +------+------+-----+-----+
      | id   | name | pt1 | pt2 |
      +------+------+-----+-----+
      | 1    | tere | 8   | 8   |
      +------+------+-----+-----+
                                      

Example: Prepare data

Sample data is provided for you to better understand the examples in this topic.

  • oss_endpoint: oss-cn-hangzhou-internal.aliyuncs.com, which indicates that OSS is deployed in the China (Hangzhou) region.

  • Bucket name: oss-mc-test.

  • Directory name: Demo1/, Demo2/, Demo3/, and SampleData/.

    The following data files are uploaded to the preceding directories:

    • The vehicle.csv file is uploaded to the Demo1/ directory. The vehicle.csv file is used to create a mapping between the Demo1/ directory and the non-partitioned external table that is created by using a built-in text extractor. The file contains the following data:

      1,1,51,1,46.81006,-92.08174,9/14/2014 0:00,S
      1,2,13,1,46.81006,-92.08174,9/14/2014 0:00,NE
      1,3,48,1,46.81006,-92.08174,9/14/2014 0:00,NE
      1,4,30,1,46.81006,-92.08174,9/14/2014 0:00,W
      1,5,47,1,46.81006,-92.08174,9/14/2014 0:00,S
      1,6,9,1,46.81006,-92.08174,9/15/2014 0:00,S
      1,7,53,1,46.81006,-92.08174,9/15/2014 0:00,N
      1,8,63,1,46.81006,-92.08174,9/15/2014 0:00,SW
      1,9,4,1,46.81006,-92.08174,9/15/2014 0:00,NE
      1,10,31,1,46.81006,-92.08174,9/15/2014 0:00,N
    • The Demo2/ directory contains the following subdirectories: direction=N/, direction=NE/, direction=S/, direction=SW/, and direction=W/. The vehicle1.csv, vehicle2.csv, vehicle3.csv, vehicle4.csv, and vehicle5.csv files are separately uploaded to the five subdirectories. The files are used to create mappings between the Demo2/ directory and the partitioned external table that is created by using a built-in text extractor. The files contain the following data:

      --vehicle1.csv
      1,7,53,1,46.81006,-92.08174,9/15/2014 0:00
      1,10,31,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle2.csv
      1,2,13,1,46.81006,-92.08174,9/14/2014 0:00
      1,3,48,1,46.81006,-92.08174,9/14/2014 0:00
      1,9,4,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle3.csv
      1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
      1,5,47,1,46.81006,-92.08174,9/14/2014 0:00
      1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle4.csv
      1,8,63,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle5.csv
      1,4,30,1,46.81006,-92.08174,9/14/2014 0:00


    • The vehicle.csv.gz file is uploaded to the Demo3/ directory. The vehicle.csv.gz file contains a vehicle.csv file whose content is the same as that of the vehicle.csv file in the Demo1/ directory. The file is used to create a mapping between the Demo3/ directory and the OSS external table whose data is compressed.

    • The vehicle6.csv file is uploaded to the SampleData/ directory. The vehicle6.csv file is used to create a mapping between the SampleData/ directory and the OSS external table that is created by using a built-in open source data extractor. The file contains the following data:

      1|1|51|1|46.81006|-92.08174|9/14/2014 0:00|S
      1|2|13|1|46.81006|-92.08174|9/14/2014 0:00|NE
      1|3|48|1|46.81006|-92.08174|9/14/2014 0:00|NE
      1|4|30|1|46.81006|-92.08174|9/14/2014 0:00|W
      1|5|47|1|46.81006|-92.08174|9/14/2014 0:00|S
      1|6|9|1|46.81006|-92.08174|9/14/2014 0:00|S
      1|7|53|1|46.81006|-92.08174|9/14/2014 0:00|N
      1|8|63|1|46.81006|-92.08174|9/14/2014 0:00|SW
      1|9|4|1|46.81006|-92.08174|9/14/2014 0:00|NE
      1|10|31|1|46.81006|-92.08174|9/14/2014 0:00|N

Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor

Create a mapping between the non-partitioned external table and the Demo1/ directory in Example: Prepare data. Sample statement used to create an OSS external table:

create external table if not exists mc_oss_csv_external1
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string,
direction string
)
stored by 'com.aliyun.odps.CsvStorageHandler' 
with serdeproperties (
 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
) 
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';

You can execute the DESC EXTENDED mc_oss_csv_external1; statement to view the schema of the OSS external table that you created.

Example: Create a partitioned table as an OSS external table by using a built-in text extractor

Create a mapping between the partitioned external table and the Demo2/ directory in Example: Prepare data. Sample statements used to create an OSS external table and add partition data to the OSS external table:

create external table if not exists mc_oss_csv_external2
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string
)
partitioned by (
direction string
)
stored by 'com.aliyun.odps.CsvStorageHandler' 
with serdeproperties (
 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
) 
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/';
-- Add partition data. 
msck repair table mc_oss_csv_external2 add partitions;
-- The preceding statement is equivalent to the following statement: 
alter table mc_oss_csv_external2 add partition (direction = 'N') partition (direction = 'NE') partition (direction = 'S') partition (direction = 'SW') partition (direction = 'W');

You can execute the DESC EXTENDED mc_oss_csv_external2; statement to view the schema of the OSS external table that you created.

Example: Create an OSS external table that is used to read or write compressed data by using a built-in text extractor

Create a mapping between the OSS external table and the Demo3/ directory in Example: Prepare data. Sample statement used to create an OSS external table:

create external table if not exists mc_oss_csv_external3
(
vehicleId int,
recordId int,
patientId int,
calls int,
locationLatitute double,
locationLongtitue double,
recordTime string,
direction string
)
stored by 'com.aliyun.odps.CsvStorageHandler' 
with serdeproperties (
 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole',
 'odps.text.option.gzip.input.enabled'='true',
 'odps.text.option.gzip.output.enabled'='true' 
) 
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/';

You can execute the DESC EXTENDED mc_oss_csv_external3; statement to view the schema of the OSS external table that you created.

Example: Create an OSS external table by using a built-in open source data extractor

  • TEXTFILE data files are mapped to the OSS external table.

    • Example of creating an OSS external table that is associated with OSS data in the TEXT format

      CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
      (
      <col_name> <data_type>,
      ...
      )
      [PARTITIONED BY (<col_name> <data_type>, ...)]
      ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      STORED AS TEXTFILE
      LOCATION '<oss_location>';
      SELECT ... FROM <mc_oss_extable_name> ...;

      CREATE EXTERNAL TABLE statements do not support custom row format settings. The following default row format settings are used:

      FIELDS TERMINATED BY: '\001'
      ESCAPED BY: '\'
      COLLECTION ITEMS TERMINATED BY: '\002'
      MAP KEYS TERMINATED BY: '\003'
      LINES TERMINATED BY: '\n'
      NULL DEFINED AS: '\N'
    • Example of creating an OSS external table that is associated with OSS data in the CSV format

      -- Disable the native text reader. 
      SET odps.ext.hive.lazy.simple.serde.native=false;
      -- Create an OSS external table. 
      CREATE EXTERNAL TABLE <mc_oss_extable_name>
      (
      <col_name> <data_type>,
      ...
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
      WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar" = '"',
      "escapeChar" = "\\"
      )
      STORED AS TEXTFILE
      LOCATION '<oss_location>'
      TBLPROPERTIES (
      "skip.header.line.count" = "1",
      "skip.footer.line.count" = "1"
      );
      SELECT ... FROM <mc_oss_extable_name> ...;

      Note

      OpenCSVSerde supports only data of the STRING type. Declare all columns as STRING and cast the values in your queries if other types are required.

    • Example of creating an OSS external table that maps to JSON data files (a concrete sketch follows this list)

      CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
      (
      <col_name> <data_type>,
      ...
      )
      [PARTITIONED BY (<col_name> <data_type>, ...)]
      ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      STORED AS TEXTFILE
      LOCATION '<oss_location>';
      SELECT ... FROM <mc_oss_extable_name> ...;
  • PARQUET data files are mapped to the OSS external table.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
    <col_name> <data_type>,
    ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    stored AS parquet  
    location '<oss_location>';
    SELECT ... FROM <mc_oss_extable_name> ...;
  • ORC data files are mapped to the OSS external table.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
    <col_name> <data_type>,
    ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    stored AS orc
    location '<oss_location>';
    SELECT ... FROM <mc_oss_extable_name> ...;
    Note

    For more information about the examples, see Use OSS external tables to perform data mapping based on field names.

  • RCFILE data files are mapped to the OSS external table.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_rcfile_extable>
    (
    <col_name> <data_type>,
    ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    row format serde 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
    stored AS rcfile
    location '<oss_location>';
    SELECT ... FROM <mc_oss_rcfile_extable> ...;
  • AVRO data files are mapped to the OSS external table.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
    <col_name> <data_type>,
    ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    stored AS avro
    location '<oss_location>';
    SELECT ... FROM <mc_oss_extable_name> ...;
  • SEQUENCEFILE data files are mapped to the OSS external table.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
    <col_name> <data_type>,
    ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    stored AS sequencefile
    location '<oss_location>';
    SELECT ... FROM <mc_oss_extable_name> ...;
  • Hudi data files are mapped to the OSS external table.

    Note
    • Hudi external tables support only reads of all data files that are mapped to the external table. They do not support automatic hiding of system columns, incremental reads, snapshot reads, or write operations. For ACID-compliant reads and writes, we recommend that you use features such as MaxCompute Delta tables or Paimon external tables.

    • The Hudi SDK that is integrated into MaxCompute is org.apache.hudi:hudi-hadoop-mr-bundle:0.12.2-emr-1.0.6 by default. MaxCompute does not guarantee forward or backward compatibility of the Hudi SDK. Compatibility is subject to the open source community.

    • When you create an external table, you can upload a JAR file and specify a compatible SDK version to read data of a specific version.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
    <col_name> <data_type>,
    ...
    )
    [PARTITIONED BY (<col_name> <data_type>, ...)]
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
    INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION '<oss_location>';
    SELECT ... FROM <mc_oss_extable_name> ...;
  • Delta Lake data files are mapped to the OSS external table.

    Note
    • Delta Lake external tables support only reads of all data files that are mapped to the external table. They do not support automatic hiding of system columns, incremental reads, snapshot reads, or write operations. For ACID-compliant reads and writes, we recommend that you use features such as MaxCompute Delta tables or Paimon external tables.

    • The Delta Lake SDK that is integrated into MaxCompute is io.delta:delta-core_2.11:0.2.0.5 by default. MaxCompute does not guarantee forward or backward compatibility of the Delta Lake SDK. Compatibility is subject to the open source community.

    • When you create an external table, you can upload a JAR file and specify a compatible SDK version to read data of a specific version.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name> (
       <col_name> <data_type>,
       ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
    INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
    OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
    LOCATION '<oss_location>';
    SELECT ... FROM <mc_oss_extable_name> ...;
  • Apache Paimon data files are mapped to the OSS external table.

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
    <col_name> <data_type>,
    ...
    )
    [partitioned BY (<col_name> <data_type>, ...)]
    stored BY 'org.apache.paimon.hive.PaimonStorageHandler'
    WITH serdeproperties (
        'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
    )
    location '<oss_location>'
    USING 'paimon_maxcompute_connector.jar';

    You must upload the JAR file paimon_maxcompute_connector.jar to the MaxCompute project in advance. For more information, see Apache Paimon external tables.
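
As referenced in the JSON item above, the following is a minimal sketch of a JSON-backed external table. The table name, OSS path, and sample data are hypothetical; the directory is assumed to contain files with one JSON object per line, such as {"id":1,"name":"kyle"}.

    -- Hypothetical table and directory names; JsonSerDe maps top-level JSON keys to columns by name.
    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_json_sketch
    (
      id BIGINT,    -- maps to the "id" key in each JSON object
      name STRING   -- maps to the "name" key in each JSON object
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/json-demo/';

    SELECT id, name FROM mc_oss_json_sketch;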

Example: Create an OSS external table by using a custom extractor

Create a mapping between the OSS external table and the SampleData/ directory in Example: Prepare data.

Perform the following steps:

  1. Develop the Java programs TextExtractor.java, TextOutputer.java, SplitReader.java, and TextStorageHandler.java by using MaxCompute Studio.

    For more information about how to develop a Java program, see Develop a UDF.

  2. Use MaxCompute Studio to package the TextStorageHandler.java program into a JAR file and upload the file to your MaxCompute project as a resource.

    In this example, the JAR file is named javatest-1.0-SNAPSHOT.jar. For more information about how to package resources into files and upload the files to your MaxCompute project, see Package a Java program, upload the package, and create a MaxCompute UDF.

    Note

    If multiple dependencies are required, separately package them and upload them as MaxCompute resources.

  3. Execute the following statement to create an OSS external table:

    CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_txt_external
    (
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitute double,
    locationLongtitue double,
    recordTime string,
    direction string
    )
    stored BY 'com.aliyun.odps.udf.example.text.TextStorageHandler' 
      WITH serdeproperties (
    'delimiter'='|',  
    'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
    )
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SampleData/'
    USING 'javatest-1.0-SNAPSHOT.jar'; 

    The delimiter property specifies the delimiter that is used to separate columns in the OSS data.

    You can execute the DESC EXTENDED ambulance_data_txt_external; statement to view the schema of the OSS external table that you created.

    Note

    If you create an OSS external table by using a custom extractor, data in the external table is not sharded by default. This prevents data accuracy issues when the extractor reads data from OSS. If your extractor can correctly process sharded data, run the following command to enable data sharding. After data sharding is enabled, multiple mappers are started to process the data.

    set odps.sql.unstructured.data.single.file.split.enabled=true;

Example: Create an OSS external table that is used to read or write non-text data by using a custom extractor

Create a mapping between the OSS external table and the SpeechSentence/ directory in Example: Prepare data.

Perform the following steps:

  1. Develop the Java programs SpeechSentenceSnrExtractor.java and SpeechStorageHandler.java by using MaxCompute Studio.

    For more information about how to develop a Java program, see Develop a UDF.

  2. Use MaxCompute Studio to package the SpeechStorageHandler.java program into a JAR file and upload the file to your MaxCompute project as a resource.

    In this example, the JAR file is named speechtest-1.0-SNAPSHOT.jar. For more information about how to package resources into files and upload the files to your MaxCompute project, see Package a Java program, upload the package, and create a MaxCompute UDF.

  3. Execute the following statement to create an OSS external table:

    CREATE EXTERNAL TABLE IF NOT EXISTS speech_sentence_snr_external
    (
    sentence_snr double,
    id string
    )
    stored BY 'com.aliyun.odps.udf.example.speech.SpeechStorageHandler'
    WITH serdeproperties (
        'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole',   
        'mlfFileName'='sm_random_5_utterance.text.label',
        'speechSampleRateInKHz' = '16'
    )
    location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SpeechSentence/'
    USING 'speechtest-1.0-SNAPSHOT.jar,sm_random_5_utterance.text.label';

    You can execute the DESC EXTENDED speech_sentence_snr_external; statement to view the schema of the OSS external table that you created.

Example: Create an OSS external table and specify the first row as the table header

-- Create an OSS external table.
CREATE EXTERNAL TABLE mf_oss_wtt
(
  id bigint,
  name string,
  tran_amt double
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.text.option.header.lines.count' = '1',
  'odps.sql.text.option.flush.header' = 'true'
)
location 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo11/';

-- Insert data into the external table.
INSERT OVERWRITE TABLE mf_oss_wtt VALUES (1, 'val1', 1.1),(2, 'value2', 1.3);
-- Query data from the external table.
-- The file contains a header row. Either declare all columns as STRING when you create the table
-- (otherwise, an error is reported when the header row is parsed), or skip the header row by setting
-- 'odps.text.option.header.lines.count' = '1', as in this example.
SELECT * FROM mf_oss_wtt;
+------------+------+------------+
| id         | name | tran_amt   |
+------------+------+------------+
| 1          | val1 | 1.1        |
| 2          | value2 | 1.3        |
+------------+------+------------+

In the OSS file that corresponds to the external table, the first row is the table header, followed by the inserted data rows.

Example: Create an OSS external table in which the number of columns is different from the number of columns in the OSS data file

  1. Create a data file in the CSV format and upload the data file to the OSS directory doc-test-01/demo.

    1,kyle1,this is desc1
    2,kyle2,this is desc2,this is two
    3,kyle3,this is desc3,this is three, I have 4 columns
  2. Create an OSS external table.

    • Use the truncate mode to process rows in the data file in which the number of columns is different from the number of columns in the external table.

      -- Delete the table.
      DROP TABLE test_mismatch;
      -- Create an OSS external table.
      CREATE EXTERNAL TABLE IF NOT EXISTS test_mismatch
      (
        id string,
        name string,
        dect string,
        col4 string
      )
      stored BY 'com.aliyun.odps.CsvStorageHandler'
      WITH serdeproperties ('odps.sql.text.schema.mismatch.mode' = 'truncate')
      location 'oss://oss-cn-shanghai-internal.aliyuncs.com/doc-test-01/demo/';
    • Use the ignore mode to process rows in the data file in which the number of columns is different from the number of columns in the external table.

      -- Delete the table.
      DROP TABLE test_mismatch01;
      -- Create an OSS external table.
      CREATE EXTERNAL TABLE IF NOT EXISTS test_mismatch01
      (
        id string,
        name string,
        dect string,
        col4 string
      )
      stored BY 'com.aliyun.odps.CsvStorageHandler'
      WITH serdeproperties ('odps.sql.text.schema.mismatch.mode' = 'ignore')
      location 'oss://oss-cn-shanghai-internal.aliyuncs.com/doc-test-01/demo/';
  3. Query data in the table.

    • SELECT * FROM test_mismatch;
      
      -- The following result is returned:
      +----+------+------+------+
      | id | name | dect | col4 |
      +----+------+------+------+
      | 1  | kyle1 | this is desc1 | NULL |
      | 2  | kyle2 | this is desc2 | this is two |
      | 3  | kyle3 | this is desc3 | this is three |
      +----+------+------+------+
    • SELECT * FROM test_mismatch01;
      
      -- The following result is returned:
      +----+------+------+------+
      | id | name | dect | col4 |
      +----+------+------+------+
      | 2  | kyle2 | this is desc2 | this is two |
      +----+------+------+------+

Reference: Parameters in the syntax

This section describes the parameters in the syntax.

• mc_oss_extable_name
  Required: Yes.
  Description: The name of the OSS external table that you want to create. Table names are not case-sensitive: queries match table names regardless of case, and forced uppercase or lowercase conversion is not supported.

• col_name
  Required: Yes.
  Description: The name of a column in the OSS external table. Before you read OSS data, make sure that the schema of the OSS external table is the same as the schema of the OSS data files. Otherwise, the OSS data cannot be read.

• data_type
  Required: Yes.
  Description: The data type of a column in the OSS external table. Before you read OSS data, make sure that the data types of columns in the OSS external table are the same as the data types of columns in the OSS data files. Otherwise, the OSS data cannot be read.

• partitioned by (<col_name> <data_type>, ...)
  Required: Required if specific conditions are met.
  Description: If data files in OSS are stored in partition directories, this clause is required to create a partitioned external table.
  • col_name: the name of a partition key column.
  • data_type: the data type of a partition key column.

• StorageHandler
  Required: Required if specific conditions are met.
  Description: Required if you use a built-in text extractor or a custom extractor to create the OSS external table. MaxCompute extractors are categorized into the following types:
  • Built-in text extractors:
    • com.aliyun.odps.CsvStorageHandler: defines how to read data from and write data to CSV or CSV.GZ files.
    • com.aliyun.odps.TsvStorageHandler: defines how to read data from and write data to TSV or TSV.GZ files.
  • Custom extractors: You can write MaxCompute user-defined functions (UDFs) to create a custom extractor. For more information, see Develop a UDF.

• '<property_name>'='<property_value>'
  Required: Yes.
  Description: The extended properties of the external table. For more information about the properties, see Reference: WITH SERDEPROPERTIES properties.

• oss_location
  Required: Yes.
  Description: The OSS directory in which the data files are stored, in the oss://<oss_endpoint>/<Bucket name>/<OSS directory name>/ format. MaxCompute automatically reads data from all files in the specified OSS directory.
  • oss_endpoint: the OSS endpoint. You must use the classic network endpoint of OSS, such as oss://oss-cn-beijing-internal.aliyuncs.com/xxx, to prevent extra fees that are incurred by OSS traffic. For more information about the classic network endpoints of OSS, see OSS regions and endpoints.

    Note

    We recommend that the OSS bucket that stores the data files reside in the same region as your MaxCompute project. MaxCompute is available only in specific regions, so cross-region access may involve data connectivity issues.

  • Bucket name: the name of the OSS bucket. For more information about how to view bucket names, see List buckets.
  • Directory name: the name of the OSS directory. Do not include file names in directory names.

    -- Example of a valid OSS directory: 
    oss://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/
    -- Examples of invalid OSS directories: 
    http://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/                -- HTTP connections are not supported. 
    https://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/               -- HTTPS connections are not supported. 
    oss://oss-cn-shanghai-internal.aliyuncs.com/Demo1                              -- The endpoint is invalid. 
    oss://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/vehicle.csv      -- The directory name must not include a file name.

• jar_name
  Required: Required if specific conditions are met.
  Description: Required if you use a custom extractor to create the OSS external table. This parameter specifies the JAR package that contains the custom extractor code. The JAR package must be added to your MaxCompute project as a resource. For more information about how to add resources, see Add resources.

• serde_class
  Required: No.
  Description: Specifies the built-in open source data extractor of MaxCompute. This parameter is required if the data files are in the TEXTFILE format and can be omitted in other scenarios. The open source data formats that are supported by MaxCompute map to the following SerDe classes:
  • PARQUET: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
  • TEXTFILE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, org.apache.hive.hcatalog.data.JsonSerDe, or org.apache.hadoop.hive.serde2.OpenCSVSerde
  • ORC: org.apache.hadoop.hive.ql.io.orc.OrcSerde
  • AVRO: org.apache.hadoop.hive.serde2.avro.AvroSerDe
  • SEQUENCEFILE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • RCFILE: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
  • ORCFILE: org.apache.hadoop.hive.ql.io.orc.OrcSerde

• file_format
  Required: Required if specific conditions are met.
  Description: Required if the OSS data files are in an open source format. This parameter specifies the format of the OSS data files.

  Note

  The size of a single file cannot exceed 3 GB. If a file is larger than 3 GB, we recommend that you split it into multiple files.

• resource_name
  Required: No.
  Description: If you use a custom SerDe class, use this parameter to specify the resource that contains the custom SerDe class. JAR packages that are related to the SerDe class must be added to your MaxCompute project as resources. For more information about how to add resources, see Add resources.

• '<tbproperty_name>'='<tbproperty_value>'
  Required: No.
  Description: The extended properties of the external table. For more information about the properties, see Reference: TBLPROPERTIES properties.
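
To show how these parameters fit together, the following is a minimal annotated sketch. The table name, role ARN, and OSS path are hypothetical placeholders rather than values from the examples above:

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_param_demo       -- mc_oss_extable_name
    (
      id BIGINT,                                                -- col_name and data_type
      name STRING
    )
    PARTITIONED BY (pt STRING)                                  -- partitioned by clause
    STORED BY 'com.aliyun.odps.CsvStorageHandler'               -- StorageHandler
    WITH SERDEPROPERTIES (
      'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'  -- '<property_name>'='<property_value>'
    )
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/demo/';    -- oss_location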

Reference: WITH SERDEPROPERTIES properties

• odps.properties.rolearn
  Scenario: Add this property if you use Security Token Service (STS) authorization.
  Description: Specifies the Alibaba Cloud Resource Name (ARN) of the RAM role that is granted permissions to access OSS. You can obtain the ARN from the role details page in the RAM console.
  Default value: None.

• odps.text.option.gzip.input.enabled
  Scenario: Add this property if you need to read data from CSV.GZ or TSV.GZ data files.
  Description: CSV or TSV compression property. If you set this property to True, MaxCompute can read data from CSV.GZ or TSV.GZ data files. Otherwise, the read operation fails.
  Valid values: True and False.
  Default value: False.

• odps.text.option.gzip.output.enabled
  Scenario: Add this property if you need to write data to OSS by using the GZIP compression algorithm.
  Description: CSV or TSV compression property. If you set this property to True, MaxCompute writes data to OSS in the GZIP format. Otherwise, the data is not compressed.
  Valid values: True and False.
  Default value: False.

• odps.text.option.header.lines.count
  Scenario: Add this property if the OSS data file is in the CSV or TSV format and you need to ignore the first N rows of the file.
  Description: Specifies the number of rows that are ignored when MaxCompute reads the OSS data files.
  Valid values: non-negative integers.
  Default value: 0.

• odps.text.option.null.indicator
  Scenario: Add this property if the OSS data file is in the CSV or TSV format and you need to define how NULL is parsed in the file.
  Description: Specifies the string that is parsed as NULL in an SQL statement. For example, if you use \N to represent NULL, a,\N,b is parsed as a, NULL, b.
  Valid values: strings.
  Default value: an empty string.

• odps.text.option.ignore.empty.lines
  Scenario: Add this property if the OSS data file is in the CSV or TSV format and you need to define how empty rows in the file are processed.
  Description: If you set this property to True, MaxCompute ignores the empty rows in the data file. Otherwise, MaxCompute reads the empty rows.
  Valid values: True and False.
  Default value: True.

• odps.text.option.encoding
  Scenario: Add this property if the OSS data file is in the CSV or TSV format and is not in the default encoding format.
  Description: Make sure that the encoding format that you specify is the same as the encoding format of the OSS data file. Otherwise, MaxCompute cannot read the data.
  Valid values: UTF-8, UTF-16, US-ASCII, and GBK.
  Default value: UTF-8.

• odps.text.option.delimiter
  Scenario: Add this property if you need to specify the column delimiter of a CSV or TSV data file.
  Description: Make sure that the specified column delimiter can be used to correctly read each column of the OSS data file. Otherwise, the data that MaxCompute reads is misaligned.
  Valid values: a single character.
  Default value: a comma (,).

• odps.text.option.use.quote
  Scenario: Add this property if fields in a CSV or TSV data file contain carriage return and line feed (CRLF) pairs, double quotation marks ("), or commas (,).
  Description: Specifies whether fields that are enclosed in double quotation marks (") are recognized in a CSV file. If a field contains CRLF pairs, double quotation marks, or commas, the field must be enclosed in double quotation marks ("). If a field contains a double quotation mark ("), escape it as two double quotation marks ("").
  Valid values: True and False.
  Default value: False.

• mcfed.parquet.compression
  Scenario: Add this property if you need to write PARQUET data to OSS by using a compression algorithm.
  Description: PARQUET compression property. By default, PARQUET data is not compressed.
  Valid values: ZSTD, SNAPPY, and GZIP.
  Default value: None.

• mcfed.parquet.compression.codec.zstd.level
  Scenario: Add this property if 'mcfed.parquet.compression'='zstd' is specified. If this property is not specified, the default compression level 3 is used.
  Description: A higher level results in a higher compression ratio. However, practical tests show that at high levels the reduction in the amount of written data is very limited, whereas the time and resource consumption increases rapidly, which significantly reduces cost-effectiveness. For large-scale reads and writes of compressed PARQUET files, a low ZSTD compression level (3 to 5) yields the best results. Example: 'mcfed.parquet.compression.codec.zstd.level'='5'.
  Valid values: 1 to 22.
  Default value: 3.

• parquet.file.cache.size
  Scenario: Add this property if you need to improve the performance of reading PARQUET data files.
  Description: Specifies the maximum amount of data that can be cached when MaxCompute reads the OSS data files. Unit: KB.
  Valid value: 1024.
  Default value: None.

• parquet.io.buffer.size
  Scenario: Add this property if you need to improve the performance of reading PARQUET data files.
  Description: Specifies the maximum amount of data that can be cached when the size of the OSS data file exceeds 1,024 KB. Unit: KB.
  Valid value: 4096.
  Default value: None.

• separatorChar
  Scenario: Add this property if you need to specify the column delimiter of a CSV data file in the TEXTFILE format.
  Description: Specifies the column delimiter of the CSV data file.
  Valid values: a single string.
  Default value: a comma (,).

• quoteChar
  Scenario: Add this property if fields in a CSV data file in the TEXTFILE format contain CRLF pairs, double quotation marks ("), or commas (,).
  Description: Specifies the quote character of the CSV data file.
  Valid values: a single string.
  Default value: None.

• escapeChar
  Scenario: Add this property if you need to specify the escape rules of a CSV data file in the TEXTFILE format.
  Description: Specifies the escape character of the CSV data file.
  Valid values: a single string.
  Default value: None.

• mcfed.orc.schema.resolution
  Scenario: Add this property if different data in the same OSS external table has different schemas.
  Description: Specifies the parsing mode of ORC files. If you set this property to name, ORC files are parsed based on column names.
  Valid value: name.
  Default value: By default, ORC files are parsed based on column IDs, which is equivalent to 'mcfed.orc.schema.resolution'='position'.

• odps.sql.text.option.flush.header
  Scenario: Add this property if you need to use the first row of each file block as the table header when you write data to OSS.
  Description: Specifies whether the first row of a file block is used as the table header. This property takes effect only for CSV files.
  Valid values: True and False.
  Default value: False.

• odps.sql.text.schema.mismatch.mode
  Scenario: Add this property if the number of columns in the OSS data file is different from the number of columns in the schema of the external table.
  Description: Specifies how to process rows in the data file whose number of columns is different from the number of columns in the external table schema.

  Note

  If the odps.text.option.use.quote property is set to True, this feature does not take effect.

  Valid values:
  • error: An error is reported.
  • truncate: If the number of columns in a row exceeds the number of columns in the external table, the excess columns are truncated. If the number of columns in a row is less than the number of columns in the external table, the missing columns are filled with null values.
  • ignore: Rows whose number of columns is different from the number of columns in the external table are discarded.
  Default value: error.
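
To illustrate several of the preceding text options together, here is a minimal sketch. The table name, role ARN, and OSS path are hypothetical, and the directory is assumed to contain semicolon-delimited CSV files that start with one header row:

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_options_demo
    (
      id BIGINT,
      name STRING
    )
    STORED BY 'com.aliyun.odps.CsvStorageHandler'
    WITH SERDEPROPERTIES (
      'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole',
      'odps.text.option.header.lines.count'='1',  -- ignore the header row
      'odps.text.option.delimiter'=';',           -- columns are separated by semicolons
      'odps.text.option.null.indicator'='\N',     -- parse \N as NULL
      'odps.text.option.use.quote'='true'         -- recognize double-quoted fields
    )
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/csv-options-demo/';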

Reference: TBLPROPERTIES properties

• skip.header.line.count
  Scenario: Add this property if you need to ignore the first N rows of a CSV data file in the TEXTFILE format.
  Description: When MaxCompute reads the OSS data, the specified number of rows, counted from the first row, is ignored.
  Valid values: non-negative integers.
  Default value: None.

• skip.footer.line.count
  Scenario: Add this property if you need to ignore the last N rows of a CSV data file in the TEXTFILE format.
  Description: When MaxCompute reads the OSS data, the specified number of rows, counted from the last row, is ignored.
  Valid values: non-negative integers.
  Default value: None.

• mcfed.orc.compress
  Scenario: Add this property if you need to write ORC data to OSS by using a compression algorithm.
  Description: ORC compression property. Specifies the compression algorithm of the ORC data files.
  Valid values: SNAPPY and ZLIB.
  Default value: None.

• mcfed.mapreduce.output.fileoutputformat.compress
  Scenario: Add this property if you need to write TEXTFILE data to OSS by using a compression algorithm.
  Description: TEXTFILE compression property. If you set this property to True, MaxCompute writes TEXTFILE data files to OSS in a compressed format. Otherwise, the data is not compressed.
  Valid values: True and False.
  Default value: False.

• mcfed.mapreduce.output.fileoutputformat.compress.codec
  Scenario: Add this property if you need to write TEXTFILE data to OSS by using a compression algorithm.
  Description: TEXTFILE compression property. Specifies the compression algorithm of the TEXTFILE data files.

  Note

  Only the compression codecs that are listed in the valid values are supported.

  Valid values: com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec, and com.aliyun.odps.io.compress.SnappyRawCodec.
  Default value: None.

• io.compression.codecs
  Scenario: Add this property if the OSS data files are in the raw Snappy format.
  Description: If you set this property to com.aliyun.odps.io.compress.SnappyRawCodec, MaxCompute can read raw Snappy-compressed files. Otherwise, the read operation fails.
  Valid value: com.aliyun.odps.io.compress.SnappyRawCodec.
  Default value: None.
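
For example, the following minimal sketch (with a hypothetical table name and OSS path) creates an ORC external table that writes SNAPPY-compressed files and, through the WITH SERDEPROPERTIES property described earlier, parses ORC files by column name:

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_demo
    (
      id BIGINT,
      name STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    WITH SERDEPROPERTIES (
      'mcfed.orc.schema.resolution'='name'          -- parse ORC files by column name instead of column ID
    )
    STORED AS ORC
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/orc-demo/'
    TBLPROPERTIES ('mcfed.orc.compress'='SNAPPY');  -- write SNAPPY-compressed ORC files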

References

After you create an OSS external table, you can use the external table to read the data that is stored in an OSS directory. For more information, see Read OSS data.