This topic describes how to create and configure a batch synchronization task to synchronize data in an E-MapReduce (EMR) Hive database to MaxCompute.
Background information
Hive is a Hadoop-based data warehouse tool that is used to extract, transform, and load data. You can use Hive to store, query, and analyze large-scale data that is stored in Hadoop. Hive maps structured data files to database tables, supports SQL queries, and converts SQL statements into MapReduce tasks. Data Integration supports the synchronization of data between Hive and other types of data sources.
Prepare data sources
Add a MaxCompute data source
Add a MaxCompute data source. For more information, see Add a MaxCompute data source.
Add a Hive data source
On the Data Source page in the DataWorks console, click Add Data Source. In the Add Data Source dialog box, add a Hive data source as prompted. You can add a Hive data source in one of the following modes: Alibaba Cloud Instance Mode, Connection String Mode, and Built-in Mode of CDH Cluster. If the Hive data source belongs to Alibaba Cloud EMR, we recommend that you add the data source in Alibaba Cloud instance mode. If the Hive data source does not belong to Alibaba Cloud EMR, we recommend that you add the data source in connection string mode. If your workspace is associated with a CDH compute engine, we recommend that you add the Hive data source in built-in mode of CDH. For more information, see Add a Hive data source.
Create an exclusive resource group for Data Integration and establish a network connection between the resource group and the EMR Hive data source
Before you synchronize data, you must establish a network connection between your exclusive resource group for Data Integration and the EMR Hive data source. For more information, see Configure network connectivity.
If your exclusive resource group for Data Integration and the EMR Hive data source reside in the same region, you can use a virtual private cloud (VPC) that resides in the region to establish a network connection between the resource group and the data source. To establish a network connection, perform the operations in Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group.
If your exclusive resource group for Data Integration and the EMR Hive data source reside in different regions, you can establish a network connection between the resource group and the data source over the Internet. In this case, you must configure a security group rule to allow access from the exclusive resource group for Data Integration to the EMR Hive data source. In most cases, ports 10000, 9093, and 8020 of the related EMR cluster need to be enabled. For more information, see Step 2: Configure a security group rule to allow access to the EMR Hive data source.
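Before you run the task, you can verify from a machine on the same network path as the resource group that the required EMR ports are reachable. The following minimal Python sketch checks TCP connectivity; the host name is a hypothetical placeholder that you would replace with the address of your EMR master node.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    emr_master = "emr-header-1.cluster-12345"  # hypothetical EMR master host name
    for port in (10000, 9093, 8020):  # ports that typically need to be enabled
        status = "reachable" if port_open(emr_master, port) else "blocked"
        print(f"{emr_master}:{port} {status}")
```

If a port is reported as blocked, review the security group rules of the EMR cluster before you continue.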
Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group
If you establish a network connection between the exclusive resource group for Data Integration and a data source over the Internet, you can skip this step.
Associate the exclusive resource group for Data Integration with a VPC.
Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, configure the parameters to associate the resource group with a VPC.
VPC: Select the VPC in which the EMR Hive data source resides.
Zone and VSwitch: We recommend that you select the zone and vSwitch in which the EMR Hive data source resides. If the zone in which the data source resides is not displayed in the drop-down list, you can select another zone and vSwitch. Make sure that the vSwitch that you select provides a connection to the data source.
Security Groups: Select a security group to which your EMR cluster belongs. The selected security group must meet the following requirements:
The related ports of the EMR cluster are enabled in an inbound rule of the security group. In most cases, the ports that need to be enabled include ports 10000, 9093, and 8020. You can view the information about the security group on the Security Groups page in the ECS console.
The CIDR block of the vSwitch with which the exclusive resource group for Data Integration is associated is specified as an authorization object in the security group rule.
Add a custom route for the exclusive resource group for Data Integration.
Note: If you select the zone and vSwitch in which the data source resides in the preceding substep, you can skip this substep. If you select a different zone or vSwitch, you must perform the operations in this substep to add a custom route for the exclusive resource group for Data Integration.
Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
On the VPC Binding tab of the page that appears, find the VPC association record and click Custom Route in the Actions column.
In the Custom Route panel, click Add Route. In the Add Route dialog box, configure the parameters to add a custom route for the exclusive resource group for Data Integration.
Destination VPC: Select the region and VPC in which the EMR Hive data source resides.
Destination VSwitch: Select the vSwitch in which the EMR Hive data source resides.
Step 2: Configure a security group rule to allow access to the EMR Hive data source
If you establish a network connection between the exclusive resource group for Data Integration and the EMR Hive data source over the Internet, you must configure an inbound security group rule to allow access from the elastic IP address (EIP) of the resource group to the EMR Hive data source. In most cases, ports 10000, 9093, and 8020 of the related EMR cluster need to be enabled.
Find the resource group on the Exclusive Resource Groups tab of the Resource Groups page and click View Information in the Actions column to view the EIP of the exclusive resource group for Data Integration.
Add the obtained EIP to the security group in the EMR console.
On the Nodes tab in the EMR console, click the Plus sign to the left of the desired node group in the Node Group Name / ID column, find the desired node, and then click the icon. On the page that appears, add the obtained EIP to the security group.
Create a batch synchronization task
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization task.
You can use one of the following methods to create a batch synchronization task:
Method 1: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization task and select Data Development from the Shortcuts drop-down list in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click its name. Right-click Data Integration and choose the option to create an offline synchronization node.
Method 2: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization task and select Data Development from the Shortcuts drop-down list in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and double-click its name. On the configuration tab of the workflow, click Create Node. In the Data Integration section, drag Offline synchronization to the canvas on the right.
In the Create Node dialog box, configure the parameters to create a batch synchronization task.
Establish network connections
Select the source, destination, and exclusive resource group for Data Integration, and establish network connections between the resource group and the data sources.
If a network connection cannot be established, you can configure the network connectivity as prompted or by referring to the related topic. For more information, see Establish a network connection between a resource group and a data source.
Configure the source and destination
Configure the source
Configure parameters related to the source. In this example, a batch synchronization task is created to synchronize data in an EMR Hive database to MaxCompute. The following table describes the parameters.
Parameter | Description |
Data Source | The Hive data source that you added. |
Table | Select the Hive table whose data you want to synchronize. The table of the data source in the production environment and the table in the development environment must have the same schema. Note: The tables and table schemas of the Hive data source in the development environment are displayed. If the table that you select has a different schema from the table in the production environment, an error indicating that the table or a column does not exist is reported after you submit the task to the production environment. |
Method to Read Data of Hive Data Source | The method that is used to read data from the Hive data source: based on HDFS files or by using Hive JDBC. Note: If you read data based on HDFS files, data synchronization is more efficient, but you cannot specify filter conditions or read views. If you read data by using Hive JDBC, MapReduce programs are generated, which lowers the data synchronization efficiency. Select a read method based on your business requirements. |
parquet schema | If the Hive table is stored in the Parquet format, you must configure the parquet schema parameter to describe the schema of the table. |
Retain the default values for other parameters.
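The practical difference between the two read methods shows up when you need only a subset of rows: in Hive JDBC mode, the reader can push a filter down to Hive as part of a SELECT statement, whereas HDFS-file mode reads the underlying files as they are. The following sketch, with made-up table and column names, illustrates the kind of statement a JDBC-based read could issue.

```python
def build_hive_query(table, columns, where=None):
    """Build the SELECT statement that a JDBC-based read could push down to Hive.

    In HDFS-file mode no such statement is issued, so a WHERE filter cannot be applied.
    """
    sql = "SELECT {} FROM {}".format(", ".join(columns), table)
    if where:
        sql += " WHERE {}".format(where)
    return sql

# Hypothetical example: only synchronize one day of data.
print(build_hive_query("ods_log", ["uid", "event"], "ds = '20240101'"))
# → SELECT uid, event FROM ods_log WHERE ds = '20240101'
```

If your source table is large and you only need a partition or a filtered slice, this trade-off may outweigh the higher throughput of HDFS-file reads.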
Configure the destination
Configure parameters related to the destination. In this example, a batch synchronization task is created to synchronize data in an EMR Hive database to MaxCompute. The following table describes the parameters.
Parameter | Description |
Data Source | The MaxCompute data source that you added. |
Table | The name of the MaxCompute table to which you want to write data. |
Partition Key Column | The partition information. If the destination table is a partitioned table, set this parameter to the value of the partition key column. The value can be a constant or a scheduling parameter variable. Note: Whether the Partition Key Column parameter is displayed, and the number of times it is displayed, vary based on the destination MaxCompute table that you select. If you select a non-partitioned table, this parameter is not displayed. If you select a partitioned table, the system displays one Partition Key Column parameter for each partition key column in the table. |
Retain the default values for other parameters.
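If the destination table is partitioned by date, the partition value is typically written as a scheduling parameter variable such as ${bizdate}, which DataWorks replaces at run time with the data timestamp (by default, the day before the scheduled run) in the yyyymmdd format. The following rough sketch mimics that substitution; it is an illustration of the convention, not DataWorks code.

```python
import datetime

def resolve_bizdate(scheduled_day):
    """Mimic how a ${bizdate}-style variable resolves: the day before the run, as yyyymmdd."""
    return (scheduled_day - datetime.timedelta(days=1)).strftime("%Y%m%d")

# A task scheduled to run on 2024-01-02 writes to the 2024-01-01 partition.
partition = "pt=" + resolve_bizdate(datetime.date(2024, 1, 2))
print(partition)  # → pt=20240101
```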
Configure field mappings
After you configure the source and destination, you must configure the mappings between the fields in the source and the fields in the destination. You can click Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout to perform the related operation.
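The two automatic mapping modes differ in what they match on: Map Fields with the Same Name pairs columns by name, and Map Fields in the Same Line pairs them by position. A simplified sketch of the distinction, with hypothetical field lists:

```python
def map_by_name(src_fields, dst_fields):
    """Map Fields with the Same Name: pair source and destination columns whose names match."""
    dst = set(dst_fields)
    return [(f, f) for f in src_fields if f in dst]

def map_by_position(src_fields, dst_fields):
    """Map Fields in the Same Line: pair columns by position, ignoring names."""
    return list(zip(src_fields, dst_fields))

src = ["id", "name", "gmt_create"]
dst = ["id", "user_name", "gmt_create"]
print(map_by_name(src, dst))      # → [('id', 'id'), ('gmt_create', 'gmt_create')]
print(map_by_position(src, dst))  # → [('id', 'id'), ('name', 'user_name'), ('gmt_create', 'gmt_create')]
```

In this example, mapping by name silently drops the name column because the destination calls it user_name, whereas mapping by position pairs them; always review the resulting mappings before you proceed.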
Configure channel control policies
You can configure settings, such as the maximum number of parallel threads and the maximum number of dirty data records that are allowed.
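In script mode, these channel settings correspond to a setting block similar to the following sketch. The keys shown are the commonly documented ones for Data Integration script mode, and the values are examples only: concurrent caps the number of parallel threads, and errorLimit.record caps the number of dirty data records allowed before the task fails.

```json
{
  "setting": {
    "speed": {
      "concurrent": 2
    },
    "errorLimit": {
      "record": 0
    }
  }
}
```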
Configure scheduling properties
Click Next. In the Configure Scheduling Settings step, configure the scheduling properties that are required to run the synchronization task. For more information about scheduling parameters, see Description for using scheduling parameters in data synchronization.
Configure the rerun property.
You can configure a rerun policy based on your business requirements. Task rerunning can prevent task failures that are caused by occasional issues such as network jitters.
Configure scheduling dependencies.
You can configure scheduling dependencies for the task based on your business requirements. You can configure the instance that is generated for the batch synchronization task in the current cycle to depend on the instance that is generated for the same task in the previous cycle. This way, you can ensure that instances generated for the task in different scheduling cycles can finish running in sequence and prevent multiple instances from running at the same time.
DataWorks uses resource groups for scheduling to issue batch synchronization tasks in Data Integration to resource groups for Data Integration and uses the resource groups for Data Integration to run the tasks. You are charged for using the resource groups for scheduling to schedule batch synchronization tasks. For information about the task issuing mechanism, see Mechanism for issuing tasks.
Test the batch synchronization task and commit and deploy the task for running
Test the batch synchronization task
In the top toolbar of the configuration tab of the batch synchronization task, you can click Run or Run with Parameters to test the batch synchronization task and check whether the task can run as expected. You can click Run with Parameters to test whether the scheduling parameters that you configured for the task are replaced as expected.
Commit and deploy the batch synchronization task
If the batch synchronization task runs as expected, you can click the Save icon to save the task configurations and the Commit icon to commit the task to the production environment. If the workspace is in standard mode, you must click the Deploy icon to deploy the task to the production environment. The synchronization task writes data from Hive to a MaxCompute table by day, hour, or minute. For more information, see Deploy tasks.
After you deploy the batch synchronization task, you can view the running results of the task in Operation Center and perform operations on the task, such as data backfilling. For more information, see Perform basic O&M operations on auto triggered tasks.