This topic describes how to create, read, and write ORC-formatted OSS foreign tables.
Applicability
OSS external tables do not support the cluster property.
The size of a single file cannot exceed 2 GB. You must split files that are larger than 2 GB.
MaxCompute and OSS must be in the same region.
Create Foreign Tables
Syntax
If the schema of an ORC file is inconsistent with the schema of the foreign table:
Inconsistent number of columns: If an ORC file has fewer columns than the foreign table, the system fills the missing columns with NULL values when reading the data. If the file has more columns, the system discards the extra columns.
Inconsistent column types: MaxCompute can use the STRING type to read INT data from ORC files, but this is not recommended. When you use the INT type to read STRING data, numeric strings are read as numbers, and values that cannot be parsed as numbers are converted to NULL.
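For example, the following minimal sketch (the table name, column names, and location are hypothetical) declares an INT column over ORC files that store the field as STRING:
-- Hypothetical example: col2 is stored as STRING in the ORC files.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_mismatch
(
    col1 BIGINT,
    col2 INT  -- numeric strings such as '42' are read as 42; 'abc' becomes NULL
)
STORED AS orc
LOCATION '<oss_location>';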
Simplified Syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
<col_name> <data_type>,
...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS orc
LOCATION '<oss_location>';
Full Syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
<col_name> <data_type>,
...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH serdeproperties(
'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
STORED AS orc
LOCATION '<oss_location>'
tblproperties (
'<xxx>'='<yyy>'
);
Common parameters
For more information about common parameters, see Basic syntax parameters.
Unique Parameters
with serdeproperties Properties
Property Name | Scenario | Description | Property Value | Default Value |
mcfed.orc.schema.resolution | Add this property if the schemas of data in the same OSS foreign table are inconsistent. | Sets the ORC file parsing method. | name or position | Parses by column position by default, which is equivalent to 'mcfed.orc.schema.resolution'='position'. |
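For example, a minimal sketch (the table name, columns, and location are placeholders) that enables name-based parsing:
-- Hypothetical example: map ORC file fields to table columns by name.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_by_name
(
    id   BIGINT,
    name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH serdeproperties (
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
    'mcfed.orc.schema.resolution'='name'  -- map columns by name instead of position
)
STORED AS orc
LOCATION '<oss_location>';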
tblproperties Properties
Property Name | Scenario | Description | Property Value | Default Value |
mcfed.orc.compress | Add this property to write ORC data to OSS in a compressed format. | ORC compression property. Specifies the compression method for ORC data. | SNAPPY or ZLIB | None |
io.compression.codecs | Add this property if OSS data files are in Raw-Snappy format. | Set this parameter to the Snappy raw codec class so that MaxCompute can read the compressed data. Otherwise, MaxCompute cannot read the data. | com.aliyun.odps.io.compress.SnappyRawCodec | None |
odps.external.data.output.prefix (compatible with odps.external.data.prefix) | Add this property to add a custom prefix to output files. | Adds a custom prefix to output file names. | A valid combination of characters, such as 'mc_'. | None |
odps.external.data.enable.extension | Add this property to display the extension of output files. | True displays the extension of output files. False hides it. | True or False | False |
odps.external.data.output.suffix | Add this property to add a custom suffix to output files. | The suffix can contain only numbers, letters, and underscores (a-z, A-Z, 0-9, _). | A valid combination of characters, such as '_hangzhou'. | None |
odps.external.data.output.explicit.extension | Add this property to add a custom extension to output files. | Adds a custom extension to output file names. | A valid combination of characters, such as 'jsonl'. | None |
mcfed.orc.batch.size | Add this property to control the number of records processed per batch; this affects memory usage and processing efficiency. | ORC tuning property. Defines the default ORC batch size in rows. | Non-negative integer | 1000 |
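For example, a minimal sketch (the table name, columns, and location are placeholders) that combines several tblproperties for output files:
-- Hypothetical example: write SNAPPY-compressed ORC files with a custom prefix.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_output
(
    id  BIGINT,
    val STRING
)
STORED AS orc
LOCATION '<oss_location>'
tblproperties (
    'mcfed.orc.compress'='SNAPPY',                -- write SNAPPY-compressed ORC files
    'odps.external.data.output.prefix'='mc_',     -- prefix output file names with mc_
    'odps.external.data.enable.extension'='true'  -- keep the file extension visible
);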
Write data
For more information about the MaxCompute write syntax, see Write syntax.
Query analysis
For more information about the SELECT syntax, see Query syntax.
For more information about query plan optimization, see Query optimization.
You can optimize queries on ORC foreign tables by enabling Predicate Pushdown (PPD). To do this, add the following parameters before your SQL statement:
The PPD parameters must be used in Native mode. This means the Native switch must be set to true.
-- Enable the ORC native reader.
SET odps.ext.oss.orc.native=true;
-- Enable ORC PPD.
SET odps.storage.orc.use.predicate.pushdown=true;
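For example, a sketch that applies both settings to a query on the orc_data_type_snappy table from the scenario example below; the recordid filter can then be pushed down to the ORC reader:
-- Both flags take effect for the query that follows in the same script.
SET odps.ext.oss.orc.native=true;
SET odps.storage.orc.use.predicate.pushdown=true;
SELECT vehicleid, recordtime
FROM   orc_data_type_snappy
WHERE  dt = '20250526' AND recordid > 10;  -- filter is evaluated in the ORC reader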
Scenario example
This example shows how to create a foreign ORC table that uses SNAPPY compression, and then read data from and write data to the table.
Prerequisites
You have created a MaxCompute project.
You have prepared an OSS bucket and OSS directory. For more information, see Create a bucket and Manage folders.
Because MaxCompute is available only in specific regions, cross-region connectivity issues may occur. We recommend using an OSS bucket in the same region as your MaxCompute project.
Authorization
You have permissions to access OSS. An Alibaba Cloud account (primary account), Resource Access Management (RAM) user, or RAM role can access OSS external tables. For authorization details, see STS-mode authorization for OSS.
You have CreateTable permissions in the MaxCompute project. For details about table operation permissions, see MaxCompute permissions.
Prepare a SNAPPY-formatted data file.
Using the provided sample data, create the folder path orc_snappy/dt=20250526 in the oss-mc-test bucket, and upload the SNAPPY file to the dt=20250526 partition folder.
Create a foreign ORC table with SNAPPY compression.
CREATE EXTERNAL TABLE orc_data_type_snappy
(
    vehicleId INT,
    recordId INT,
    patientId INT,
    calls INT,
    locationLatitute DOUBLE,
    locationLongitude DOUBLE,
    recordTime STRING,
    direction STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS ORC
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/orc_snappy/'
tblproperties ('mcfed.orc.compress'='SNAPPY');
If your OSS foreign table is partitioned, run the following command to add existing partitions. For more information, see Syntax for adding partitions to an OSS external table.
-- Add partitions to the external table
MSCK REPAIR TABLE orc_data_type_snappy ADD PARTITIONS;
Read data from the ORC foreign table.
SELECT * FROM orc_data_type_snappy WHERE dt = '20250526' LIMIT 10;
The query returns the following result:
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongitude | recordtime     | direction  | dt         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| 1          | 12         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20250526   |
| 1          | 1          | 51         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | S          | 20250526   |
| 1          | 2          | 13         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:01 | NE         | 20250526   |
| 1          | 3          | 48         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:02 | NE         | 20250526   |
| 1          | 4          | 30         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:03 | W          | 20250526   |
| 1          | 5          | 47         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:04 | S          | 20250526   |
| 1          | 6          | 9          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:05 | S          | 20250526   |
| 1          | 7          | 53         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:06 | N          | 20250526   |
| 1          | 8          | 63         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:07 | SW         | 20250526   |
| 1          | 9          | 4          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:08 | NE         | 20250526   |
| 1          | 10         | 31         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:09 | N          | 20250526   |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
Write data to the ORC foreign table and query the data.
INSERT INTO orc_data_type_snappy PARTITION (dt='20250526') VALUES (1,16,76,1,46.81006,-92.08174,'9/14/2014 0:10','SW');
-- Query the newly inserted data
SELECT * FROM orc_data_type_snappy WHERE dt = '20250526' AND recordid = 16;
The query returns the following result:
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongitude | recordtime     | direction  | dt         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| 1          | 16         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20250526   |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
Supported data types
For more information about MaxCompute data types, see Data type version 1.0 and Data type version 2.0.
JNI mode (set odps.ext.oss.orc.native=false;): This mode does not use the native ORC reader when reading tables. It supports both read and write operations.
Native mode (set odps.ext.oss.orc.native=true;): This mode uses the native ORC reader when reading tables. It supports only read operations.
Data type | JNI mode (read and write) | Native mode (read-only) |
TINYINT | ||
SMALLINT | ||
INT | ||
BIGINT | ||
BINARY | ||
FLOAT | ||
DOUBLE | ||
DECIMAL(precision,scale) | ||
VARCHAR(n) | ||
CHAR(n) | ||
STRING | ||
DATE | ||
DATETIME | ||
TIMESTAMP | ||
TIMESTAMP_NTZ | ||
BOOLEAN | ||
ARRAY | ||
MAP | ||
STRUCT | ||
JSON ||
Supported Compression Formats
To read or write compressed OSS files, you must add the corresponding compression property to the table creation statement. For more information, see the tblproperties property parameters above. MaxCompute supports reading and writing ORC files that are compressed with SNAPPY or ZLIB.
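For example, a minimal sketch (the table name, columns, and location are placeholders) that writes ZLIB-compressed ORC data:
-- Hypothetical example: write ZLIB-compressed ORC files to OSS.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_zlib
(
    id  BIGINT,
    val STRING
)
STORED AS orc
LOCATION '<oss_location>'
tblproperties ('mcfed.orc.compress'='ZLIB');  -- or 'SNAPPY'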
Schema evolution support
ORC foreign tables support two methods for mapping the table schema to the columns in the data file: position-based mapping and name-based mapping.
Position-based mapping: Set 'mcfed.orc.schema.resolution'='position', or omit the property to use the default setting. The system maps columns based on their position, so the column order in the table must exactly match the field order in the file.
Name-based mapping: Set 'mcfed.orc.schema.resolution'='name' when you create the foreign table. The system then maps columns by name instead of by position.
The Data compatibility issues column in the following table describes whether a foreign table can correctly read data after a schema evolution operation, including new data that conforms to the modified schema and historical data that uses the old schema. A worked example follows the table.
Operation type | Mapping method | Supported | Description | Data compatibility issues |
Add column | Position-based mapping | | | |
Add column | Name-based mapping | | | |
Delete column | Position-based mapping | Not recommended | ORC foreign tables map column values by position, so the column order of the table must match the field order in the file. If the file and table fields no longer match after a column is deleted, an error occurs when reading the table. | |
Delete column | Name-based mapping | | With name-based mapping, the system matches columns by name and no longer relies on order. | Compatible |
Modify column order | Position-based mapping | Not recommended | ORC foreign tables map column values by position, so the column order of the table must match the field order in the file. If the file and table fields no longer match after the column order changes, an error occurs when reading the table. | |
Modify column order | Name-based mapping | | With name-based mapping, the system matches columns by name and no longer relies on order. | Compatible |
Change column data type | Position-based mapping | | For more information about the data type conversion table, see Change column data type. | Compatible |
Change column data type | Name-based mapping | | For more information about the data type conversion table, see Change column data type. | Compatible |
Modify column name | Position-based mapping | | | Compatible |
Modify column name | Name-based mapping | Not recommended | With name-based mapping, the system matches columns by name. After a column is renamed, the original name may no longer be found in the file. | |
Modify column comment | Position-based mapping | | The comment must be a valid string no longer than 1024 bytes; otherwise, an error is reported. | Compatible |
Modify column comment | Name-based mapping | | The comment must be a valid string no longer than 1024 bytes; otherwise, an error is reported. | Compatible |
Modify column nullability | Position-based mapping | Not supported | Columns are nullable by default. | Not applicable |
Modify column nullability | Name-based mapping | Not supported | Columns are nullable by default. | Not applicable |
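As a worked example, the following sketch (it assumes the hypothetical name-mapped table mc_oss_orc_by_name from the earlier example) adds a column to a foreign table. Per the schema-mismatch rules above, historical files that lack the new column return NULL for it, while new files that contain it are matched by name:
-- Hypothetical example: add a column to a name-mapped ORC foreign table.
ALTER TABLE mc_oss_orc_by_name ADD COLUMNS (new_col STRING);
-- Historical files without new_col return NULL for it.
SELECT id, name, new_col FROM mc_oss_orc_by_name;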