MaxCompute allows you to create an Apache Paimon external table and establish a mapping between the external table and the directory of the Apache Paimon table that is stored in Object Storage Service (OSS). You can then use the Apache Paimon external table to access data in the Apache Paimon table that is stored in OSS. This topic describes how to create an Apache Paimon external table and access the Apache Paimon external table by using MaxCompute.
Background information
Apache Paimon is a lake storage format for integrated streaming and batch processing. Apache Paimon supports high-throughput writes and low-latency queries. Common compute engines, such as Flink of Alibaba Cloud Realtime Compute for Apache Flink and Spark, Hive, and Trino of E-MapReduce, are seamlessly integrated with Apache Paimon. Apache Paimon helps you quickly build your own data lake storage service on OSS and connect the service to MaxCompute to implement data lake analytics. For more information about Apache Paimon, see Apache Paimon.
Prerequisites
The Alibaba Cloud account that you use to perform operations has the CreateTable permission to create MaxCompute tables. For more information about table permissions, see MaxCompute permissions.
A MaxCompute project is created. For more information, see Create a MaxCompute project.
An OSS bucket and the related directory are created. For more information, see Create a bucket.
Note: MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket in the same region as your MaxCompute project.
Fully managed Flink is activated. For more information, see Activate Realtime Compute for Apache Flink.
Usage notes
MaxCompute can only read data from Apache Paimon external tables. MaxCompute cannot write data to Apache Paimon external tables or automatically synchronize schema changes of Apache Paimon external tables.
Apache Paimon does not support MaxCompute projects for which the schema feature is enabled.
Apache Paimon external tables do not support the clustering attribute.
Apache Paimon external tables do not support features such as querying historical versions of data (time travel).
Syntax
The following sample code shows the syntax for creating an Apache Paimon external table in MaxCompute.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
    <col_name> <data_type>,
    ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
    'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
)
LOCATION '<oss_location>'
USING 'paimon_maxcompute_connector.jar';
If the Apache Paimon external table that you create is a partitioned table, you must perform additional operations to import partition data. For more information, see Create an OSS external table.
Method 1 (recommended): Automatically parse the OSS directory structure to identify partitions and add the partition information to the OSS external table.
MSCK REPAIR TABLE <mc_oss_extable_name> ADD PARTITIONS;
Method 2: Manually execute the following statement to add partition information to the OSS external table.
ALTER TABLE <mc_oss_extable_name> ADD PARTITION (<col_name>=<col_value>);
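For example, for a hypothetical partitioned external table named oss_extable_demo with a dt string partition column (both names are illustrative, not part of this topic's walkthrough), the two methods look as follows:

```sql
-- Method 1: let MaxCompute scan the OSS directory layout
-- (for example, subdirectories such as dt=2023-04-21/) and
-- register all discovered partitions in one statement.
MSCK REPAIR TABLE oss_extable_demo ADD PARTITIONS;

-- Method 2: register one known partition manually.
ALTER TABLE oss_extable_demo ADD PARTITION (dt='2023-04-21');
```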
The following table describes the parameters in the preceding code.
| Parameter | Required | Description |
| --- | --- | --- |
| mc_oss_extable_name | Yes | The name of the Apache Paimon external table that you want to create. Table names are not case-sensitive. When you query an external table, the table name is not case-sensitive, and forced case conversion is not supported. |
| col_name | Yes | The name of a column in the Apache Paimon external table. To read data from an Apache Paimon table, you must make sure that the schema of the Apache Paimon external table is the same as the schema of the Apache Paimon table. Otherwise, data reads fail. |
| data_type | Yes | The data type of a column in the Apache Paimon external table. To read data from an Apache Paimon table, you must make sure that the column data types of the Apache Paimon external table are the same as those of the Apache Paimon table. Otherwise, data reads fail. |
| odps.properties.rolearn | Yes | The Alibaba Cloud Resource Name (ARN) of a RAM role that is granted permissions to access OSS. You can obtain the ARN from the role details page in the Resource Access Management (RAM) console. |
| oss_location | Yes | The OSS directory in which the data files are stored, in the oss://<oss_endpoint>/<bucket_name>/<directory>/ format. Example: oss://oss-cn-beijing-internal.aliyuncs.com/paimon_flink/test_db_y.db/test_tbl/. |
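As an illustration of the schema-matching requirement, consider a Paimon table that is defined in Flink with the columns dt STRING, id BIGINT, and data STRING and that is partitioned by dt. The table, role, and directory names below are placeholders. A matching MaxCompute external table declares the non-partition columns in the same order and with the same types, and moves the partition column into the PARTITIONED BY clause:

```sql
-- The column list contains only the non-partition columns, in the
-- same order and with the same types as in the Paimon table schema.
CREATE EXTERNAL TABLE IF NOT EXISTS oss_extable_demo
(
    id   BIGINT,
    data STRING
)
-- The Paimon partition column is declared here, not in the column list.
PARTITIONED BY (dt STRING)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
    'odps.properties.rolearn'='acs:ram::<account_id>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://<oss_endpoint>/<bucket_name>/<db_name>.db/<table_name>/'
USING 'paimon_maxcompute_connector.jar';
```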
Procedure
Step 1: Prepare data in Realtime Compute for Apache Flink
Create an Apache Paimon catalog and an Apache Paimon table, and insert data into the table. This section describes the procedure. If an Apache Paimon table and data are available in Realtime Compute for Apache Flink, skip this step.
Log on to the Realtime Compute for Apache Flink console and create an Apache Paimon catalog. For more information, see Create an Apache Paimon catalog.
Create an Apache Paimon table. For more information, see Manage Apache Paimon catalogs.
In the left-side navigation pane of the Realtime Compute for Apache Flink console, click Catalogs. In the Catalogs pane, find the Apache Paimon catalog that you created and click the default database. On the page that appears, click Create Table.
On the Built-in tab of the Create Table dialog box, click Apache Paimon and click Next. In the Create Table step, enter the following table creation statement and click Confirm. In this example, a table named test_tbl is created.
CREATE TABLE `catalogname`.`default`.test_tbl (
    dt STRING,
    id BIGINT,
    data STRING,
    PRIMARY KEY (dt, id) NOT ENFORCED
) PARTITIONED BY (dt);
On the SQL Editor page, create an SQL draft that contains the following statement, deploy the draft, and run the deployment for the draft. For more information about how to develop and run an SQL draft, see Develop an SQL draft.
INSERT INTO `catalogname`.`default`.test_tbl VALUES
    ('2023-04-21', 1, 'AAA'),
    ('2023-04-21', 2, 'BBB'),
    ('2023-04-22', 1, 'CCC'),
    ('2023-04-22', 2, 'DDD');
Note: Make sure that the engine version of the SQL draft is vvr-8.0.1-flink-1.17 or later.
If the SQL deployment is a bounded streaming deployment, you must go to the Deployments page, click Edit in the upper-right corner of the Parameters section on the Configuration tab, and then add execution.checkpointing.checkpoints-after-tasks-finish.enabled: true to the Other Configuration field. For example, if the INSERT INTO ... VALUES ... statement is included in an SQL deployment, the deployment is a bounded streaming deployment. For more information about how to configure the runtime parameters of a deployment, see Configure a deployment.
Step 2: Upload the Apache Paimon plug-in to your MaxCompute project
You can use one of the following methods to upload the Apache Paimon plug-in to the MaxCompute project.
Use the MaxCompute client
Access the MaxCompute project on the MaxCompute client and run the following command to upload the paimon_maxcompute_connector.jar package to the MaxCompute project:
ADD JAR <path_to_paimon_maxcompute_connector.jar>;
Use the DataWorks console
Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the desired workspace and choose Shortcuts > Data Development in the Actions column.
On the DataStudio page, click Create and choose Create Resource > JAR.
In the Create Resource dialog box, configure the parameters, upload the paimon_maxcompute_connector.jar package, and then click Create. For more information about how to create a resource, see Step 1: Create a resource or upload an existing resource. After the resource is created, click the icon on the toolbar of the configuration tab of the resource to commit the resource to the development environment.
Step 3: Create an Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements
Create an Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements. In this example, an external table named oss_extable_paimon_1pt is created.
CREATE EXTERNAL TABLE oss_extable_paimon_1pt
(
    id BIGINT,
    data STRING
)
PARTITIONED BY (dt STRING)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
    'odps.properties.rolearn'='acs:ram::124*********:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/paimon_flink/test_db_y.db/test_tbl/'
USING 'paimon_maxcompute_connector.jar';
Step 4: Read data from the Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements
Run the following commands on the MaxCompute client (odpscmd) or other tools that can run MaxCompute SQL:
SET odps.sql.common.table.planner.ext.hive.bridge = true;
SET odps.sql.hive.compatible = true;
Run the following command to query data from the oss_extable_paimon_1pt table:
SELECT * FROM oss_extable_paimon_1pt;
The following result is returned:
+------------+------------+------------+
| id         | data       | dt         |
+------------+------------+------------+
| 1          | AAA        | 2023-04-21 |
| 2          | BBB        | 2023-04-21 |
+------------+------------+------------+
Note: If the partition information is not displayed in the command output, you can run the following command to add the partition information:
MSCK REPAIR TABLE oss_extable_paimon_1pt ADD PARTITIONS;
References
You can also create a MaxCompute Apache Paimon external table by using a custom catalog in Flink. After data is written to the table, you can use MaxCompute to query and consume data in the table. For more information, see Create an Apache Paimon external table by using Realtime Compute for Apache Flink.