MaxCompute supports creating Paimon external tables that map to Paimon table directories stored in OSS buckets and access their data. This topic describes how to create a Paimon external table and query it from MaxCompute.
Function introduction
Apache Paimon is a unified lake storage format that supports both batch and streaming workloads, offering high-throughput writes and low-latency queries. Common compute engines in Realtime Compute for Apache Flink and E-MapReduce, such as Spark, Hive, and Trino, integrate seamlessly with Paimon. With Apache Paimon, you can quickly build a data lake on OSS and connect it to MaxCompute for analytics. During reads, metadata-based filtering skips unnecessary files in the OSS directory, which further improves query performance.
Limitations
Paimon external tables do not automatically update their schema when the underlying Paimon file schema changes.
You cannot set cluster attributes or primary keys on Paimon external tables.
Paimon external tables do not support querying historical versions of data.
You can use INSERT INTO or INSERT OVERWRITE statements to write data into Paimon external tables. However, writing to Dynamic Bucket tables and Cross Partition tables is not supported. If you only need to export MaxCompute data to OSS, use methods such as UNLOAD instead of writing through a Paimon external table.
UPDATE and DELETE operations are not supported on Paimon external tables.
MaxCompute and OSS must be in the same region.
Create a Paimon external table
Syntax structure
For details about the syntax of external tables in various formats, see OSS external tables.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
<col_name> <data_type>,
...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH serdeproperties (
'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION '<oss_location>';

Common parameters
For more information about common parameters, see Basic syntax parameters.
Write data
For more information about the MaxCompute write syntax, see Write syntax.
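As a sketch, the following statements append rows to the partitioned Paimon external table oss_extable_paimon_pt that is created later in this topic. This example assumes that the table is not a Dynamic Bucket table or a Cross Partition table, and it sets the same Hive-compatibility flags that the query example in this topic uses; the inserted values are illustrative:

```sql
-- Enable Hive-compatible processing for the external table
-- (the same flags as in the read example in this topic).
SET odps.sql.common.table.planner.ext.hive.bridge = true;
SET odps.sql.hive.compatible = true;

-- Append two illustrative rows into the dt='2024-07-18' partition.
INSERT INTO oss_extable_paimon_pt PARTITION (dt='2024-07-18')
VALUES (3, 'EEE'), (4, 'FFF');
```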
Query analysis
Paimon table data splitting (Split) logic differs from that of native MaxCompute tables. Paimon uses its own internal file organization and sharding mechanism, which does not fully align with MaxCompute parameters.
For details about SELECT syntax, see Query syntax description.
For details about optimizing query plans, see Query optimization.
For details about BadRowSkipping, see BadRowSkipping.
Usage example
Step 1: Prerequisites
You have created a MaxCompute project.
You have prepared an OSS bucket and OSS directory. For more information, see Create a bucket and Manage folders.
Because MaxCompute is available only in specific regions, cross-region connectivity issues may occur. We recommend using an OSS bucket in the same region as your MaxCompute project.
Authorization
You have permissions to access OSS. An Alibaba Cloud account (primary account), Resource Access Management (RAM) user, or RAM role can access OSS external tables. For authorization details, see STS-mode authorization for OSS.
You have CreateTable permissions in the MaxCompute project. For details about table operation permissions, see MaxCompute permissions.
Step 2: Prepare data in Flink
Create a Paimon catalog and a Paimon table, then insert data into the table as shown in the following example.
If Paimon table data already exists in OSS, skip this step.
Create a Paimon Filesystem Catalog
Log on to the Flink console. In the upper-left corner of the page, select a region.
Click the name of the target workspace. In the navigation pane on the left, select Catalogs.
On the Catalog List page, click Create Catalog. In the Create Catalog dialog box, select Apache Paimon, click Next, and then configure the following parameters:
Parameter | Required | Description |
metastore | Required | The metastore type. In this example, select filesystem. |
catalog name | Required | A custom catalog name, such as paimon-catalog. |
warehouse | Required | The data warehouse directory in OSS. In this example, use oss://paimon-fs/paimon-test/. |
fs.oss.endpoint | Required | The OSS endpoint. For example, the endpoint for the China (Hangzhou) region is oss-cn-hangzhou-internal.aliyuncs.com. |
fs.oss.accessKeyId | Required | The AccessKey ID required to access OSS. |
fs.oss.accessKeySecret | Required | The AccessKey secret required to access OSS. |
Create a Paimon table
Log on to the Flink console. In the upper-left corner of the page, select a region.
Click the name of the target workspace. In the navigation pane on the left, open the SQL development page.
On the New Script tab, create a new query script. Enter the following commands and click Run.
CREATE TABLE `paimon-catalog`.`default`.test_tbl (
    id BIGINT,
    data STRING,
    dt STRING,
    PRIMARY KEY (dt, id) NOT ENFORCED
) PARTITIONED BY (dt);

INSERT INTO `paimon-catalog`.`default`.test_tbl VALUES
    (1, 'CCC', '2024-07-18'),
    (2, 'DDD', '2024-07-18');
If your SQL job is rate-limited, for example, when running an INSERT INTO ... VALUES ... statement, perform the following steps:
Click the name of the target workspace. In the navigation pane on the left, select Deployments.
On the Deployments page, click the name of the target job to open the Configuration page.
In the Runtime Parameter Settings area, click Edit. Add the execution.checkpointing.checkpoints-after-tasks-finish.enabled: true configuration in the Other Configuration section and save your changes.
For details about configuring job runtime parameters, see Configure job deployment information.
Step 3: Create a Paimon external table in MaxCompute
Run the following SQL code in MaxCompute to create a Paimon external table.
CREATE EXTERNAL TABLE oss_extable_paimon_pt
(
id BIGINT,
data STRING
)
PARTITIONED BY (dt STRING )
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH serdeproperties (
'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-<your region>-internal.aliyuncs.com/<table_path>'
;

In the preceding code, table_path is the path of the Paimon table created in Flink, such as paimon-fs/paimon-test/default.db/test_tbl. To obtain this path:
Log on to the Flink console. In the upper-left corner of the page, select a region.
Click the name of the target workspace. In the navigation pane on the left, select Catalogs.
On the Metadata page, click default under the target catalog. On the default page, click View in the Actions column for the target table.
On the Table Schema tab, in the Properties area, obtain the value of the path parameter. For table_path, enter only the path that follows oss://.
Step 4: Load partition data
If the OSS external table you created is partitioned, you must load partition data separately. For more information, see Syntax for loading partition data for OSS external tables.
MSCK REPAIR TABLE oss_extable_paimon_pt ADD PARTITIONS;

Step 5: Read the Paimon external table from MaxCompute
Run the following commands in MaxCompute to query the Paimon external table oss_extable_paimon_pt.
SET odps.sql.common.table.planner.ext.hive.bridge = true;
SET odps.sql.hive.compatible = true;
SELECT * FROM oss_extable_paimon_pt WHERE dt='2024-07-18';

The result is as follows:
+------------+------------+------------+
| id | data | dt |
+------------+------------+------------+
| 1 | CCC | 2024-07-18 |
| 2 | DDD | 2024-07-18 |
+------------+------------+------------+

If the schema in the Paimon file differs from the external table schema:
Column count mismatch: If the Paimon file has fewer columns than defined in the external table DDL, missing column values are filled with NULL during reads. If it has more columns, extra columns are discarded.
Data type mismatch: MaxCompute does not support reading STRING data from Paimon files into INT columns. It supports reading INT data into STRING columns, although this is not recommended.
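To illustrate the column-count behavior, the following hypothetical external table maps the same OSS path as oss_extable_paimon_pt but declares one extra column. The table name and the note column are illustrative, not part of the example above:

```sql
-- Hypothetical external table over the same Paimon directory,
-- declaring one column that the Paimon files do not contain.
CREATE EXTERNAL TABLE oss_extable_paimon_extra
(
    id   BIGINT,
    data STRING,
    note STRING  -- not present in the Paimon files; reads return NULL
)
PARTITIONED BY (dt STRING)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH serdeproperties (
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-<your region>-internal.aliyuncs.com/<table_path>';
```

After you load partitions, querying note from this table returns NULL for every row, because the Paimon files contain only the id, data, and dt columns.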
Supported data types
For MaxCompute data types, see Data type edition 1.0 and Data type edition 2.0.
Open source Paimon data type | MaxCompute 2.0 data type | Description |
TINYINT | TINYINT | 8-bit signed integer. |
SMALLINT | SMALLINT | 16-bit signed integer. |
INT | INT | 32-bit signed integer. |
BIGINT | BIGINT | 64-bit signed integer. |
BINARY(MAX_LENGTH) | BINARY | Binary data type. Maximum length is 8 MB. |
FLOAT | FLOAT | 32-bit binary floating-point number. |
DOUBLE | DOUBLE | 64-bit binary floating-point number. |
DECIMAL(precision,scale) | DECIMAL(precision,scale) | Exact decimal number. |
VARCHAR(n) | VARCHAR(n) | Variable-length character string. n is the length, ranging from 1 to 65535. |
CHAR(n) | CHAR(n) | Fixed-length character string. n is the length, ranging from 1 to 255. |
VARCHAR(MAX_LENGTH) | STRING | String type. Maximum length is 8 MB. |
DATE | DATE | Date type. |
TIME, TIME(p) | Not supported | The Paimon TIME type represents a time without a time zone, composed of hours, minutes, and seconds, with nanosecond precision. TIME(p) specifies a fractional-second precision from 0 to 9 (default is 0). MaxCompute has no corresponding data type. |
TIMESTAMP, TIMESTAMP(p) | TIMESTAMP_NTZ | Timestamp without time zone, precise to nanoseconds. Reading this type requires the native mode switch to be enabled. |
TIMESTAMP WITH LOCAL TIME_ZONE(9) | TIMESTAMP | |
TIMESTAMP WITH LOCAL TIME_ZONE(9) | DATETIME | Timestamp type precise to nanoseconds. |
BOOLEAN | BOOLEAN | Boolean type. |
ARRAY | ARRAY | Complex type. |
MAP | MAP | Complex type. |
ROW | STRUCT | Complex type. |
MULTISET<t> | Not supported | MaxCompute has no corresponding data type. |
VARBINARY, VARBINARY(n), BYTES | BINARY | Variable-length binary string. |
FAQ
Error kSIGABRT when reading a Paimon external table
Error message:
ODPS-0123144: Fuxi job failed - kSIGABRT(errCode:6) at Odps/*****_SQL_0_1_0_job_0/M1@f01b17437.cloud.eo166#3. Detail error msg: CRASH_CORE, maybe caused by jvm crash, please check your java udf/udaf/udtf. | fatalInstance: Odps/*****_SQL_0_1_0_job_0/M1#0_0

Cause:
This error occurs when reading TIMESTAMP_NTZ data in JNI mode.
Solution:
Before you read from the table, disable the native feature by running the command SET odps.sql.common.table.jni.disable.native=true;.
References
You can create a MaxCompute Paimon external table in Flink as a custom catalog, write data to it, and then query and consume the Paimon data from MaxCompute. For more information, see Create a MaxCompute Paimon external table based on Flink.