Use a Paimon data source in ApsaraDB for SelectDB - ApsaraDB for SelectDB

This topic describes how to integrate a Paimon data source with ApsaraDB for SelectDB by using a catalog to perform federated analysis on the Paimon data source.

Prerequisites

All nodes in your Paimon cluster are connected to the SelectDB instance.
- All nodes in the Paimon cluster reside in the same virtual private cloud (VPC) as the SelectDB instance. If the nodes in the data source cluster reside in different VPCs, you must connect the nodes to the SelectDB instance. For more information, see What do I do if a connection fails to be established between an ApsaraDB for SelectDB instance and a data source?
- The IP addresses of all nodes in the Paimon cluster are added to the IP address whitelist of the SelectDB instance. For more information, see Configure an IP address whitelist.
- The IP addresses in the VPC in which the SelectDB instance resides are added to an IP address whitelist of the Paimon cluster if the whitelist mechanism is supported for the Paimon cluster.
  - To obtain the IP address of the SelectDB instance in the VPC to which the SelectDB instance belongs, you can perform the operations provided in How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs?
  - To obtain the public IP address of the SelectDB instance, you can run the ping command to ping the public IP address of the SelectDB instance.
You have basic knowledge of catalogs and understand the operations that you can perform on catalogs. For more information, see Data lakehouse.

Usage notes

The supported Paimon versions vary with the versions of SelectDB.
- SelectDB V3.0 supports Paimon 0.7 and earlier.
- SelectDB V4.0 supports Paimon 0.8 and earlier.
ApsaraDB for SelectDB can only read data in external catalogs.

Procedure

In this example, an external catalog named paimon_catalog is created. You can change the name based on your business requirements.

Step 1: Connect to an ApsaraDB for SelectDB instance

Connect to an SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance by using a MySQL client.

Step 2: Create a Paimon catalog

SelectDB integrates external data sources by using external catalogs. You can create a Paimon catalog based on the following types of metastores:

File system metastore (default): stores both Paimon metadata and Paimon data in a file system.
Hive metastore: stores Paimon metadata as tables in a Hive metastore. You can directly query these tables from Hive.

Create a Paimon catalog based on the metastore type you select.

Create a Paimon catalog based on a file system metastore

Create a Paimon catalog based on HDFS

CREATE CATALOG `paimon_catalog` PROPERTIES (
    "type" = "paimon",
    "warehouse" = "hdfs://HDFS8000871/user/paimon",
    "dfs.nameservices" = "HDFS8000871",
    "dfs.ha.namenodes.HDFS8000871" = "nn1,nn2",
    "dfs.namenode.rpc-address.HDFS8000871.nn1" = "172.21.0.1:4007",
    "dfs.namenode.rpc-address.HDFS8000871.nn2" = "172.21.0.2:4007",
    "dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "hadoop.username" = "hadoop"
);

Parameter description

Important

The parameters whose names start with dfs have a one-to-one mapping with the parameters in the Paimon configuration file hdfs-site.xml, with the same parameter names and descriptions. Therefore, the values of these parameters must be the same as those in the hdfs-site.xml file.

Parameter	Required	Default value	Description
type	Yes	No default value	The type of the catalog. Set the value to paimon.
warehouse	Yes	No default value	The Hadoop Distributed File System (HDFS) path of the warehouse.
hadoop.username	No	No default value	The username used to log on to the HDFS.
dfs.nameservices	No	No default value	The name of the nameservice.
dfs.ha.namenodes.[nameservice ID]	No	No default value	The IDs of the NameNodes.
dfs.namenode.rpc-address.[nameservice ID].[name node ID]	No	No default value	The Remote Procedure Call (RPC) URLs of the NameNodes. The number of URLs is the same as the number of NameNodes.
dfs.client.failover.proxy.provider.[nameservice ID]	No	No default value	The Java class that implements active NameNode connection for the HDFS client. Default value: `org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`.

Create a Paimon catalog based on OSS

Important

ApsaraDB for SelectDB V3.0.6 and later can integrate Paimon whose data is stored in Object Storage Service (OSS).

CREATE CATALOG `paimon_oss_catalog` PROPERTIES (
    "type" = "paimon",
    "warehouse" = "oss://paimon-zd/paimonoss",
    "oss.endpoint" = "oss-cn-beijing.aliyuncs.com",
    "oss.access_key" = "ak",
    "oss.secret_key" = "sk"
);

Parameter description

Parameter	Required	Default value	Description
type	Yes	No default value	The type of the catalog. Set the value to paimon.
warehouse	Yes	No default value	The OSS path of the warehouse.
oss.endpoint	Yes	No default value	The `endpoint` that is used to access OSS data. For more information about how to obtain the endpoint, see Regions and endpoints.
oss.access_key	Yes	No default value	The `AccessKey ID` that is used to access OSS data.
oss.secret_key	Yes	No default value	The `AccessKey secret` that is used to access OSS data.

Create a Paimon catalog based on a Hive metastore

If your Paimon metadata is stored in a Hive metastore, you can create a Hive metastore catalog to integrate Paimon with ApsaraDB for SelectDB. The following sample code is for reference only. For more information, see Hive data source.

CREATE CATALOG `paimon_hms` PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "hms",
    "warehouse" = "hdfs://HDFS8000871/user/zhangdong/paimon2",
    "hive.metastore.uris" = "thrift://172.21.0.44:7004",
    "dfs.nameservices" = "HDFS8000871",
    "dfs.ha.namenodes.HDFS8000871" = "nn1,nn2",
    "dfs.namenode.rpc-address.HDFS8000871.nn1" = "172.21.0.1:4007",
    "dfs.namenode.rpc-address.HDFS8000871.nn2" = "172.21.0.2:4007",
    "dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "hadoop.username" = "hadoop"
);

Step 3: View the catalog

You can execute the following statement to check whether the catalog is created.

SHOW CATALOGS; --Check whether the catalog is created.

+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName  | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | paimon_catalog | paimon      |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
|            0 | internal     | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+

Step 4: View data in the Paimon databases and tables

View the Paimon databases and tables in the external catalog that you created.
Note
After you connect to an ApsaraDB for SelectDB instance, the internal catalog is used by default.
1. Switch to the external catalog that you want to manage.
```
SWITCH paimon_catalog;
```
2. View the data.
  After you switch to the external catalog, you can access the data in the external catalog in the same way as you use the internal catalog. Sample commands:
  - View the database list: SHOW DATABASES;
  - Switch between databases: USE test_db;
  - View the table list: SHOW TABLES;
  - View the data in the specified table: SELECT * FROM test_t;

View the Paimon databases and tables in the internal catalog.

--View the data in the test_t table of the test_db database in the paimon_catalog catalog. 
SELECT * FROM iceberg_catalog.test_db.test_t;

Column data type mappings

Note

The following data types with complex structures can be nested:

Paimon data types: MapType and ArrayType
ApsaraDB for SelectDB data types: Map and Array

Paimon data type	ApsaraDB for SelectDB data type
BooleanType	Boolean
TinyIntType	TinyInt
SmallIntType	SmallInt
IntType	Int
FloatType	Float
BigIntType	BigInt
DoubleType	Double
VarCharType	VarChar
CharType	Char
DecimalType(precision, scale)	Decimal(precision, scale)
TimestampType,LocalZonedTimestampType	DateTime
DateType	Date
MapType	Map
ArrayType	Array
VarBinaryType, BinaryType	Binary