The EMR+DLF data lake solution provides unified metadata and permission management for enterprise data lakes, and supports multi-source data ingestion and end-to-end data exploration. This solution supports migrating the metadata of existing E-MapReduce (EMR) clusters from self-managed ApsaraDB RDS or built-in MySQL databases to Data Lake Formation (DLF).
This topic describes how to migrate metadata stored by the Hive metastore service in MySQL databases or ApsaraDB RDS databases to DLF. This topic also describes how to configure and use DLF for unified storage of metadata of an EMR cluster.
Scenarios
You want to migrate metadata from third-party big data clusters to Alibaba Cloud EMR.
You want to migrate the data and metadata from an EMR cluster that uses a MySQL database for metadata storage to an EMR cluster that uses DLF for metadata storage.
You want to change the metadata storage for an EMR cluster from a MySQL database to DLF.
Note: The EMR version must be V3.33 or later in the EMR V3.X series, V4.6 or later in the EMR V4.X series, or V5.1 or later in the EMR V5.X series. If you want to migrate metadata from EMR clusters of earlier versions to DLF, join the DingTalk group 33719678.
Migrate metadata
Preparations
Before you migrate metadata, you must check the remote access permissions on the metadatabase.
Log on to your ApsaraDB RDS metadatabase or MySQL metadatabase and execute the following statement to grant remote access permissions (the root account and a database named hivemeta are used in this example):
GRANT ALL PRIVILEGES ON hivemeta.* TO 'root'@'%' IDENTIFIED BY 'xxxx' WITH GRANT OPTION;
FLUSH PRIVILEGES;
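Note that the GRANT ... IDENTIFIED BY syntax above applies to MySQL 5.x. In MySQL 8.0 and later, GRANT can no longer create accounts or set passwords; you must create the user first and then grant privileges. A sketch, assuming the same root account, hivemeta database, and placeholder password as above:

```sql
-- MySQL 8.0 and later: create the account first, then grant privileges on the metadatabase.
CREATE USER IF NOT EXISTS 'root'@'%' IDENTIFIED BY 'xxxx';
GRANT ALL PRIVILEGES ON hivemeta.* TO 'root'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;
```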
For an ApsaraDB RDS metadatabase, you can also view and modify the access permissions in the ApsaraDB RDS console.
Start migration
DLF provides a visualized metadata migration feature to quickly migrate the metadata in a Hive metastore to DLF.
Create a migration task
Log on to the DLF console, switch to the region of the EMR cluster, choose Metadata > Migrate Metadata in the left-side navigation tree, and click Create Migration Task. See the following figure.
Configure the source database
Database Type: Select MySQL.
MySQL Type: Select an option based on the Hive metadata type.
For a built-in MySQL database of the cluster, select Other MySQL Databases. In this case, JDBC URL, Username, and Password are required. We recommend that you enter an internal IP address for JDBC URL and select Alibaba Cloud VPC for Network Type. If you want to select Internet for Network Type, enter a public IP address for JDBC URL.
For an independent ApsaraDB RDS metadatabase, select Alibaba Cloud RDS. In this case, RDS Instance, Database Name, Username, and Password are required. ApsaraDB RDS metadatabases can be accessed only by using Alibaba Cloud VPCs.
Network Type: You can select Alibaba Cloud VPC or Internet. Select an option based on the setting of MySQL Type.
Alibaba Cloud VPC: Select the VPC where the EMR cluster or ApsaraDB RDS instance resides.
Internet: If you select Internet, you must add a rule in the EMR console to open the default port 3306 of the EMR cluster to the elastic IP addresses (EIPs) of DLF. For example, the DLF EIP in the China (Hangzhou) region is 121.41.166.235.
The following table lists the DLF EIPs in different regions.
| Region | EIP |
| --- | --- |
| China (Hangzhou) | 121.41.166.235 |
| China (Shanghai) | 47.103.63.0 |
| China (Beijing) | 47.94.234.203 |
| China (Shenzhen) | 39.108.114.206 |
| Singapore | 161.117.233.48 |
| Germany (Frankfurt) | 8.211.38.47 |
| China (Zhangjiakou) | 8.142.121.7 |
| China (Hong Kong) | 8.218.148.213 |
Configure the migration task
Task Name: Enter a name for the metadata migration task.
Task Description: This parameter is optional. You can enter a description for the task.
Conflict Resolution Strategy:
Update Original Metadata: updates the original metadata in DLF. This option is recommended.
Delete Original Metadata and Create Metadata: deletes the original metadata from DLF and then creates new metadata.
Log Storage Path: The system records each metadata object, its migration status, and error logs (if any) in the specified OSS path.
Object to Synchronize: The objects to be synchronized include databases, functions, tables, and partitions. In most cases, select Select All.
Location Replacement: You must set this parameter if you want to replace the location of a table or database during migration.
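Location Replacement rewrites the storage path prefix recorded in the migrated metadata. For a single table, the effect is comparable to the following Hive statement (the database, table, and OSS path below are hypothetical examples, not values from this solution):

```sql
-- Hypothetical example: point a migrated table at its new OSS location.
ALTER TABLE my_db.my_table SET LOCATION 'oss://my-bucket/warehouse/my_db.db/my_table';
```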
Run the migration task
The migration task that you created is displayed on the Migration Task tab. Click Run in the Actions column to run the task. See the following figure.
View running records and logs
Click the Execution History tab to view task running details.
Click View Logs in the Actions column of the task to view running logs. See the following figure.
Verify the metadata migration
In the left-side navigation tree of the DLF console, choose Metadata > Metadata. The metadatabase that you migrated is displayed on the Metadata page. See the following figure.
Use DLF for unified metadata storage for an EMR cluster
This section describes how to configure and use DLF for unified metadata storage for an EMR cluster.
Change metadata storage for compute engines
Hive
In the EMR console, add the following configurations to the hive-site.xml file, enable Save and Deliver Configuration, click Save, and then restart the Hive service.
<!-- Configure the URL of the DLF metadata service. Replace regionId with the region ID of the desired cluster, such as cn-hangzhou. -->
dlf.catalog.endpoint=dlf-vpc.{regionId}.aliyuncs.com
<!-- Note: Verify the configuration after you copy and paste it. The value must not contain extra spaces. -->
hive.imetastoreclient.factory.class=com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory
dlf.catalog.akMode=EMR_AUTO
dlf.catalog.proxyMode=DLF_ONLY
<!-- Configuration for Hive 3 -->
hive.notification.event.poll.interval=0s
<!-- Configuration for versions earlier than EMR V3.33 and versions earlier than EMR V4.6.0 -->
dlf.catalog.sts.isNewMode=false
Presto
In the EMR console, add the following configurations to the hive.properties file, enable Save and Deliver Configuration, click Save, and then restart the Presto service.
hive.metastore=dlf
<!-- Configure the URL of the DLF metadata service. Replace regionId with the region ID of the desired cluster, such as cn-hangzhou. -->
dlf.catalog.endpoint=dlf-vpc.{regionId}.aliyuncs.com
dlf.catalog.akMode=EMR_AUTO
dlf.catalog.proxyMode=DLF_ONLY
<!-- See the value of hive.metastore.warehouse.dir configured in the hive-site.xml file of Hive. -->
dlf.catalog.default-warehouse-dir= <!-- Set it to the same value as hive.metastore.warehouse.dir. -->
<!-- Configuration for versions earlier than EMR V3.33 and versions earlier than EMR V4.6.0 -->
dlf.catalog.sts.isNewMode=false
Spark
In the EMR console, click Deploy Client Configuration and restart the Spark service.
Impala
In the EMR console, click Deploy Client Configuration and restart Impala.
Verify the change of metadata storage for the compute engines
Hive is used in the following example. You can use a similar method for the other engines.
1. Log on to the cluster and run the hive command.
2. Create a database by executing the following statement: create database dlf_test_db;
3. Log on to the DLF console and check whether the database exists.
4. Delete the database by executing the following statement: drop database dlf_test_db;
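The statements in the steps above can be run together in the hive CLI; dlf_test_db is a throwaway database name used only for this test:

```sql
-- Run in the hive CLI. After the CREATE statement, the database should also
-- appear on the Metadata page of the DLF console.
CREATE DATABASE dlf_test_db;
SHOW DATABASES LIKE 'dlf_test_db';
DROP DATABASE dlf_test_db;
```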
FAQ
Why should I run a metadata migration task multiple times?
Metadata migration tasks read the metadata that is currently stored in the ApsaraDB RDS or MySQL database. If the source metadata changes after a run, you can run the task multiple times to ensure eventual consistency between the metadata in the source database and the metadata in DLF.
References
For more information, see Best practices for migrating EMR metadata to DLF.