Use HMS to access data in OSS data lakes (beta) - Hologres

Hologres V2.2 and later allow you to use Hive Metastore Service (HMS) to access data in Object Storage Service (OSS) data lakes. If you use an E-MapReduce (EMR) cluster to build an OSS-based data lake, you can improve the speed of reading data from or writing data to OSS and OSS-HDFS by using simple configurations in Hologres.

Prerequisites

OSS is activated. For more information, see Get started by using the OSS console.
A DataLake cluster is created and test data is prepared. For more information, see Create a cluster. Make sure that the DataLake cluster meets the following requirements:
- The Hive version of the DataLake cluster is 3.1.3 or later.
- Kerberos authentication is not enabled.
- The Metadata parameter is set to Self-managed RDS or Built-in MySQL.
A Hologres instance is purchased, data lake acceleration is enabled for the instance, and a database is created on the instance. For more information, see Purchase a Hologres instance and Create a database.
Note
To enable data lake acceleration, log on to the Hologres console, find the desired instance on the Instances page, and click Data Lake Acceleration in the Actions column. In the message that appears, click Enable.

Network connections are established.

You must click here to apply for a request to establish network connections. After receiving your request, Alibaba Cloud Hologres engineers can help you create network connections. Procedure:

Log on to the virtual private cloud (VPC) console and create reverse endpoints. For more information, see Create and manage endpoints. When you create reverse endpoints, set the Endpoint Service parameter to Other Endpoint Services and enter the name of the endpoint service in the region where the EMR instance resides. The following table provides names of endpoint services in different regions.

Region	Endpoint service name
China (Beijing)	com.aliyuncs.privatelink.cn-beijing.epsrv-2zeokrydzjd6kx3cbwmb
China (Shanghai)	com.aliyuncs.privatelink.cn-shanghai.epsrv-uf61fvlfwta7f7dv9n3x
China (Zhangjiakou)	com.aliyuncs.privatelink.cn-zhangjiakou.epsrv-8vbno4k4wwvys0eg2swp

Note

If your region is not included in the preceding table, Hologres engineers will create an endpoint service in your region and provide you with the service name after you submit a request.
A VPC is an isolated network environment built on Alibaba Cloud. VPCs as well as a VPC and the classic network are logically isolated from each other and cannot access each other by default. The Hologres service was launched before VPC and is deployed in the classic network. Therefore, you must configure a reverse endpoint to implement network connectivity.
Network connections are established based on IP addresses. If the IP address of the DataLake cluster changes, you must reconfigure network connections.

Limits

The data lake acceleration feature is not supported by Hologres read-only secondary instances.
You cannot perform specific operations, such as UPDATE, DELETE, and TRUNCATE, on foreign tables.
You cannot use the Auto Load feature to create HMS foreign tables in Hologres.
Hive clusters for which Kerberos authentication is enabled are not supported.

Procedure

Execute the following SQL statement to install an extension.
You must be granted superuser permissions before you can install an extension. An extension is installed at the database level. For each database, you need to install an extension only once.
```
CREATE EXTENSION IF NOT EXISTS hive_fdw;
```

Create a foreign server based on hive_fdw and configure the endpoint information.

CREATE SERVER IF NOT EXISTS <server_name> FOREIGN DATA WRAPPER hive_fdw 
OPTIONS (
   hive_metastore_uris 'thrift://<IP address of HMS>:<Port number>',
   oss_endpoint 'oss-<nation>-<region>-internal.aliyuncs.com | <bucket>.oss-<nation>-<region>.oss-dls.aliyuncs.com' 
);

Parameter	Required	Description	Example
server_name	Yes	The name of the foreign server.	hive_server
hive_metastore_uris	Yes	The Uniform Resource Identifier (URI) of HMS. Specify the value in the following format: `thrift://<IP address of HMS>:<Port number>`. The default port number is 9083. Note You can log on to the EMR console, find the desired cluster, and click Nodes in the Actions column. On the Nodes tab, you can obtain the internal IP address of the master node, which is the IP address of HMS.	thrift://172.16.0.250:9083
oss_endpoint	Yes	The endpoint of OSS. Select an OSS endpoint based on your business requirements. Native OSS: To optimize access performance, we recommend that you use the OSS internal endpoint. OSS-HDFS: Only the internal endpoint can be used for access. Note You can log on to the OSS console and click the desired bucket. On the Overview page of the desired bucket, you can obtain the OSS endpoint in the Port section.	OSS `oss-cn-shanghai-internal.aliyuncs.com` OSS-HDFS `<bucket_name>.cn-beijing.oss-dls.aliyuncs.com`

Optional. Create user mappings.

Hologres allows you to execute the CREATE USER MAPPING statement to allow specific users to access the foreign server. For example, you can execute the CREATE USER MAPPING statement as the owner of the foreign server to allow the RAM user whose user ID (UID) is 123xxx to access OSS data. For more information about the CREATE USER MAPPING statement, see CREATE USER MAPPING.

CREATE USER mapping FOR <User account> server <server_name> options
(
    dlf_access_id 'accessid', 
    dlf_access_key 'accesskey',
    oss_access_id 'accessid', 
    oss_access_key 'accesskey'
);

Sample code:

-- Create a user mapping for your Alibaba Cloud account.
CREATE USER mapping FOR current_user server <server_name> options
(
    dlf_access_id 'LTAI5txxx', 
    dlf_access_key 'y8LUUyyy',
    oss_access_id 'LTAI5txxx', 
    oss_access_key 'y8LUUyyy'
);

-- Create a user mapping for the RAM user whose UID is 123xxx.
CREATE USER mapping FOR "p4_123xxx" server <server_name> options
(
    dlf_access_id 'LIlY5txxx', 
    dlf_access_key 'KsjkXKyyy',
    oss_access_id 'LIlY5txxx', 
    oss_access_key 'KsjkXKyyy'
);

-- Drop the user mappings.
Drop USER MAPPING FOR CURRENT_USER server <server_name>;
Drop USER MAPPING FOR "p4_123xxx" server <server_name>;

Create foreign tables.
Hologres provides the following methods to create foreign tables:
- CREATE FOREIGN TABLE: You can create only one foreign table at a time. You can specify the columns that you want to map in the foreign table. This method is suitable for scenarios in which you need to create a small number of foreign tables and only specific fields require mapping.
- IMPORT FOREIGN SCHEMA: You can create multiple foreign tables at the same time. This method is suitable for scenarios in which you need to create multiple foreign tables or all columns in source tables require mapping.
Note
- Hologres allows you to read data from partitioned tables in OSS and configure columns of the TEXT, VARCHAR, and INT types as partition key columns. If you use the CREATE FOREIGN TABLE statement, you must configure field mappings. In this method, no data is stored, and partition key columns are configured in the same manner as common fields. If you use the IMPORT FOREIGN SCHEMA statement, the system automatically creates field mappings.
- If you use the IMPORT FOREIGN SCHEMA statement and the name of an OSS foreign table that you want to create is the same as the name of an internal table in Hologres, the system does not create the foreign table but continues to create other foreign tables. In this case, we recommend that you use the CREATE FOREIGN TABLE statement to create the foreign table with a unique name.
```
-- Use the CREATE FOREIGN TABLE statement.
CREATE FOREIGN TABLE <holo_schema_name>.<table_name>
(
  { column_name data_type }
  [, ... ]
  ] )
)
SERVER <hive_server_name>
OPTIONS
(
  schema_name '<ext_db_name>',
  table_name '<ext_table_name>'
);


-- Use the IMPORT FOREIGN SCHEMA statement.
IMPORT FOREIGN SCHEMA <ext_db_name> 
[
  { limit TO | EXCEPT } 
  ( table_name [, ...] ) 
]
FROM server <hive_server_name>
INTO <holo_schema_name> 
options(
  if_table_exist 'update',
  if_unsupported_type 'error'
        );
```
Query data by using foreign tables.
After you create foreign tables, you can query data in OSS by using the foreign tables.
- Non-partitioned table
```
SELECT * FROM <holo_schema>.<hive_table>;
```
- Partitioned table
```
SELECT * FROM <holo_schema>.<hive_partition_table> WHERE <partition_key> = '<partition_value>';
```