MaxCompute allows you to create an Apache Paimon external table and establish a mapping between the external table and the directory of the Apache Paimon table that is stored in Object Storage Service (OSS). You can then use the Apache Paimon external table to access data in the Apache Paimon table that is stored in OSS. This topic describes how to create an Apache Paimon external table and access the Apache Paimon external table by using MaxCompute.
Background information
Apache Paimon is a lake storage format for integrated streaming and batch processing. Apache Paimon supports high-throughput writes and low-latency queries. Common compute engines, such as Flink of Alibaba Cloud Realtime Compute for Apache Flink and Spark, Hive, and Trino of E-MapReduce, are seamlessly integrated with Apache Paimon. Apache Paimon helps you quickly build your own data lake storage service on OSS and connect the service to MaxCompute to implement data lake analytics. For more information about Apache Paimon, see Apache Paimon.
Prerequisites
The Alibaba Cloud account that you use to perform operations has the CreateTable permission to create MaxCompute tables. For more information about table permissions, see MaxCompute permissions.
A MaxCompute project is created. For more information, see Create a MaxCompute project.
An OSS bucket and the related directory are created. For more information, see Create a bucket.
Note: MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket in the same region as your MaxCompute project.
Fully managed Flink is activated. For more information, see Activate Realtime Compute for Apache Flink.
Usage notes
MaxCompute can only read data from Apache Paimon external tables. MaxCompute cannot write data to Apache Paimon external tables or automatically synchronize schema changes of Apache Paimon external tables.
Apache Paimon does not support MaxCompute projects for which the schema feature is enabled.
Apache Paimon external tables do not support the clustering attribute.
Apache Paimon external tables do not support features such as querying historical versions of data (time travel).
Syntax
The following sample code shows the syntax for creating an Apache Paimon external table in MaxCompute.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
    <col_name> <data_type>,
    ...
)
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
    'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
)
LOCATION '<oss_location>'
USING 'paimon_maxcompute_connector.jar';
If the Apache Paimon external table that you create is a partitioned table, you must perform additional operations to import partition data. For more information, see Create an OSS external table.
Method 1 (recommended): Automatically parse the OSS directory structure to identify partitions and add the partition information to the OSS external table.
MSCK REPAIR TABLE <mc_oss_extable_name> ADD PARTITIONS;
Method 2: Manually execute the following statement to add partition information to the OSS external table.
ALTER TABLE <mc_oss_extable_name> ADD PARTITION (<col_name>=<col_value>);
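For example, for a hypothetical partitioned external table named oss_extable_demo with a dt string partition column (both names are illustrative, not part of this topic's walkthrough), the two methods look as follows:

```sql
-- Method 1: let MaxCompute scan the OSS directory layout
-- (for example, subdirectories such as dt=2023-04-21/) and
-- register all discovered partitions in one statement.
MSCK REPAIR TABLE oss_extable_demo ADD PARTITIONS;

-- Method 2: register one known partition manually.
ALTER TABLE oss_extable_demo ADD PARTITION (dt='2023-04-21');
```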
The following table describes the parameters in the preceding code.
| Parameter | Required | Description |
| --- | --- | --- |
| mc_oss_extable_name | Yes | The name of the Apache Paimon external table that you want to create. Table names are not case-sensitive. When you query an external table, the table name is not case-sensitive, and forced case conversion is not supported. |
| col_name | Yes | The name of a column in the Apache Paimon external table. To read data from an Apache Paimon table, you must make sure that the schema of the Apache Paimon external table is the same as the schema of the Apache Paimon table. Otherwise, data reads fail. |
| data_type | Yes | The data type of a column in the Apache Paimon external table. To read data from an Apache Paimon table, you must make sure that the column data types of the Apache Paimon external table are the same as those of the Apache Paimon table. Otherwise, data reads fail. |
| odps.properties.rolearn | Yes | The Alibaba Cloud Resource Name (ARN) of a RAM role that is granted permissions to access OSS. You can obtain the ARN from the role details page in the Resource Access Management (RAM) console. |
| oss_location | Yes | The OSS directory in which the data files are stored, in the oss://<oss_endpoint>/<bucket_name>/<directory>/ format. Example: oss://oss-cn-beijing-internal.aliyuncs.com/paimon_flink/test_db_y.db/test_tbl/. |
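As an illustration of the schema-matching requirement, consider a Paimon table that is defined in Flink with the columns dt STRING, id BIGINT, and data STRING and that is partitioned by dt. The table, role, and directory names below are placeholders. A matching MaxCompute external table declares the non-partition columns in the same order and with the same types, and moves the partition column into the PARTITIONED BY clause:

```sql
-- The column list contains only the non-partition columns, in the
-- same order and with the same types as in the Paimon table schema.
CREATE EXTERNAL TABLE IF NOT EXISTS oss_extable_demo
(
    id   BIGINT,
    data STRING
)
-- The Paimon partition column is declared here, not in the column list.
PARTITIONED BY (dt STRING)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
    'odps.properties.rolearn'='acs:ram::<account_id>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://<oss_endpoint>/<bucket_name>/<db_name>.db/<table_name>/'
USING 'paimon_maxcompute_connector.jar';
```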
Procedure
Step 1: Prepare data in Realtime Compute for Apache Flink
Create an Apache Paimon catalog and an Apache Paimon table, and insert data into the table. This section describes the procedure. If an Apache Paimon table and data are available in Realtime Compute for Apache Flink, skip this step.
Log on to the Realtime Compute for Apache Flink console and create an Apache Paimon catalog. For more information, see Create an Apache Paimon catalog.
Create an Apache Paimon table. For more information, see Manage Apache Paimon catalogs.
In the left-side navigation pane of the Realtime Compute for Apache Flink console, click Catalogs. In the Catalogs pane, find the Apache Paimon catalog that you created and click the default database. On the page that appears, click Create Table.
On the Built-in tab of the Create Table dialog box, click Apache Paimon and click Next. In the Create Table step, enter the following table creation statement and click Confirm. In this example, a table named test_tbl is created.
CREATE TABLE `catalogname`.`default`.test_tbl (
    dt STRING,
    id BIGINT,
    data STRING,
    PRIMARY KEY (dt, id) NOT ENFORCED
) PARTITIONED BY (dt);
On the SQL Editor page, create an SQL draft that contains the following statement, deploy the draft, and run the deployment for the draft. For more information about how to develop and run an SQL draft, see Develop an SQL draft.
INSERT INTO `catalogname`.`default`.test_tbl VALUES
    ('2023-04-21', 1, 'AAA'),
    ('2023-04-21', 2, 'BBB'),
    ('2023-04-22', 1, 'CCC'),
    ('2023-04-22', 2, 'DDD');
Note: Make sure that the engine version of the SQL draft is vvr-8.0.1-flink-1.17 or later.
If the SQL deployment is a bounded streaming deployment, you must go to the Deployments page, click Edit in the upper-right corner of the Parameters section on the Configuration tab, and then add execution.checkpointing.checkpoints-after-tasks-finish.enabled: true to the Other Configuration field. For example, if the INSERT INTO ... VALUES ... statement is included in an SQL deployment, the deployment is a bounded streaming deployment. For more information about how to configure the runtime parameters of a deployment, see Configure a deployment.
Step 2: Upload the Apache Paimon plug-in to your MaxCompute project
You can use one of the following methods to upload the Apache Paimon plug-in to the MaxCompute project.
Use the MaxCompute client
Access the MaxCompute project on the MaxCompute client and run the following command to upload the paimon_maxcompute_connector.jar package to the MaxCompute project:
ADD JAR <path_to_paimon_maxcompute_connector.jar>;
Use the DataWorks console
Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the desired workspace and choose Shortcuts > Data Development in the Actions column.
On the DataStudio page, click Create and choose Create Resource > JAR.
In the Create Resource dialog box, configure the parameters, upload the paimon_maxcompute_connector.jar package, and then click Create. For more information about how to create a resource, see Step 1: Create a resource or upload an existing resource. After the resource is created, click the icon on the toolbar of the configuration tab of the resource to commit the resource to the development environment.
Step 3: Create an Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements
Create an Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements. In this example, an external table named oss_extable_paimon_1pt is created.
CREATE EXTERNAL TABLE oss_extable_paimon_1pt
(
    id BIGINT,
    data STRING
)
PARTITIONED BY (dt STRING)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH SERDEPROPERTIES (
    'odps.properties.rolearn'='acs:ram::124*********:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/paimon_flink/test_db_y.db/test_tbl/'
USING 'paimon_maxcompute_connector.jar';
Step 4: Read data from the Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements
Run the following commands on the MaxCompute client (odpscmd) or other tools that can run MaxCompute SQL:
SET odps.sql.common.table.planner.ext.hive.bridge = true;
SET odps.sql.hive.compatible = true;
Run the following command to query data from the oss_extable_paimon_1pt table:
SELECT * FROM oss_extable_paimon_1pt;
The following result is returned:
+------------+------------+------------+
| id         | data       | dt         |
+------------+------------+------------+
| 1          | AAA        | 2023-04-21 |
| 2          | BBB        | 2023-04-21 |
+------------+------------+------------+
Note: If the partition information is not displayed in the command output, you can run the following command to add the partition information:
MSCK REPAIR TABLE oss_extable_paimon_1pt ADD PARTITIONS;
References
You can also create a MaxCompute Apache Paimon external table by using a custom catalog in Flink. After data is written to the table, you can use MaxCompute to query and consume data in the table. For more information, see Create an Apache Paimon external table by using Realtime Compute for Apache Flink.