This topic describes how to create and configure a batch synchronization task to synchronize data in an E-MapReduce (EMR) Hive database to MaxCompute.
Background information
Hive is a Hadoop-based data warehouse tool that is used to extract, transform, and load data. You can use Hive to store, query, and analyze large-scale data that is stored in Hadoop. Hive maps structured data files to database tables, supports SQL queries, and converts SQL statements into MapReduce tasks. Data Integration supports the synchronization of data between Hive and other types of data sources.
Prepare data sources
Add a MaxCompute data source
Add a MaxCompute data source. For more information, see Add a MaxCompute data source.
Add a Hive data source
On the Data Source page in the DataWorks console, click Add Data Source. In the Add Data Source dialog box, add a Hive data source as prompted. You can add a Hive data source in one of the following modes: Alibaba Cloud Instance Mode, Connection String Mode, and Built-in Mode of CDH Cluster. If the Hive data source belongs to Alibaba Cloud EMR, we recommend that you add the data source in Alibaba Cloud instance mode. If the Hive data source does not belong to Alibaba Cloud EMR, we recommend that you add the data source in connection string mode. If your workspace is associated with a CDH compute engine, we recommend that you add the Hive data source in built-in mode of CDH. For more information, see Add a Hive data source.
Create an exclusive resource group for Data Integration and establish a network connection between the resource group and the EMR Hive data source
Before you synchronize data, you must establish a network connection between your exclusive resource group for Data Integration and the EMR Hive data source. For more information, see Configure network connectivity.
If your exclusive resource group for Data Integration and the EMR Hive data source reside in the same region, you can use a virtual private cloud (VPC) that resides in the region to establish a network connection between the resource group and the data source. To establish a network connection, perform the operations in Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group.
If your exclusive resource group for Data Integration and the EMR Hive data source reside in different regions, you can establish a network connection between the resource group and the data source over the Internet. In this case, you must configure a security group rule to allow access from the exclusive resource group for Data Integration to the EMR Hive data source. In most cases, ports 10000, 9093, and 8020 of the related EMR cluster need to be enabled. For more information, see Step 2: Configure a security group rule to allow access to the EMR Hive data source.
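Before you run the task, you can verify from a machine on the same network path as the resource group that the required EMR ports are reachable. The following minimal Python sketch checks TCP connectivity; the host name is a hypothetical placeholder that you would replace with the address of your EMR master node.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    emr_master = "emr-header-1.cluster-12345"  # hypothetical EMR master host name
    for port in (10000, 9093, 8020):  # ports that typically need to be enabled
        status = "reachable" if port_open(emr_master, port) else "blocked"
        print(f"{emr_master}:{port} {status}")
```

If a port is reported as blocked, review the security group rules of the EMR cluster before you continue.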
Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group
If you establish a network connection between the exclusive resource group for Data Integration and a data source over the Internet, you can skip this step.
Associate the exclusive resource group for Data Integration with a VPC.
Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, configure the parameters to associate the resource group with a VPC.
VPC: Select the VPC in which the EMR Hive data source resides.
Zone and VSwitch: We recommend that you select the zone and vSwitch in which the EMR Hive data source resides. If the zone in which the data source resides is not displayed in the drop-down list, you can select another zone and vSwitch. Make sure that the vSwitch that you select provides a connection to the data source.
Security Groups: Select a security group to which your EMR cluster belongs. The selected security group must meet the following requirements:
The related ports of the EMR cluster are enabled in an inbound rule of the security group. In most cases, the ports that need to be enabled include ports 10000, 9093, and 8020. You can view the information about the security group on the Security Groups page in the ECS console.
The CIDR block of the vSwitch with which the exclusive resource group for Data Integration is associated is specified as an authorization object in the security group rule.
Add a custom route for the exclusive resource group for Data Integration.
Note: If you select the zone and vSwitch in which the data source resides in the preceding substep, you can skip this substep. If you select a different zone or vSwitch, you must perform the operations in this substep to add a custom route for the exclusive resource group for Data Integration.
Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
On the VPC Binding tab of the page that appears, find the VPC association record and click Custom Route in the Actions column.
In the Custom Route panel, click Add Route. In the Add Route dialog box, configure the parameters to add a custom route for the exclusive resource group for Data Integration.
Destination VPC: Select the region and VPC in which the EMR Hive data source resides.
Destination VSwitch: Select the vSwitch in which the EMR Hive data source resides.
Step 2: Configure a security group rule to allow access to the EMR Hive data source
If you establish a network connection between the exclusive resource group for Data Integration and the EMR Hive data source over the Internet, you must configure an inbound security group rule to allow access from the elastic IP address (EIP) of the resource group to the EMR Hive data source. In most cases, ports 10000, 9093, and 8020 of the related EMR cluster need to be enabled.
Find the resource group on the Exclusive Resource Groups tab of the Resource Groups page and click View Information in the Actions column to view the EIP of the exclusive resource group for Data Integration.
Add the obtained EIP to the security group in the EMR console.
On the Nodes tab in the EMR console, click the Plus sign to the left of the desired node group in the Node Group Name / ID column, find the desired node, and then click the icon. On the page that appears, add the obtained EIP to the security group.
Create a batch synchronization task
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization task.
You can use one of the following methods to create a batch synchronization task:
Method 1: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization task and select Data Development from the Shortcuts drop-down list in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click its name. Right-click Data Integration and choose the option to create an offline synchronization node.
Method 2: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization task and select Data Development from the Shortcuts drop-down list in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and double-click its name. On the configuration tab of the workflow, click Create Node. In the Data Integration section, drag Offline synchronization to the canvas on the right.
In the Create Node dialog box, configure the parameters to create a batch synchronization task.
Establish network connections
Select the source, destination, and exclusive resource group for Data Integration, and establish network connections between the resource group and the data sources.
If a network connection cannot be established, you can configure the network connectivity as prompted or by referring to the related topic. For more information, see Establish a network connection between a resource group and a data source.
Configure the source and destination
Configure the source
Configure parameters related to the source. In this example, a batch synchronization task is created to synchronize data in an EMR Hive database to MaxCompute. The following table describes the parameters.
Parameter | Description |
Data Source | The Hive data source that you added. |
Table | Select the Hive table whose data you want to synchronize. The table of the data source in the production environment and the table in the development environment must have the same schema. Note: The tables and table schemas of the Hive data source in the development environment are displayed. If the table that you select has a different schema from the table in the production environment, an error indicating that the table or a column does not exist is reported after you submit the task to the production environment. |
Method to Read Data of Hive Data Source | The method that is used to read data from the Hive data source: based on HDFS files or by using Hive JDBC. Note: If you read data based on HDFS files, data synchronization is more efficient, but you cannot specify filter conditions or read views. If you read data by using Hive JDBC, MapReduce programs are generated, which lowers the data synchronization efficiency. Select a read method based on your business requirements. |
parquet schema | If the Hive table is stored in the Parquet format, you must configure the parquet schema parameter to describe the schema of the table. |
Retain the default values for other parameters.
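The practical difference between the two read methods shows up when you need only a subset of rows: in Hive JDBC mode, the reader can push a filter down to Hive as part of a SELECT statement, whereas HDFS-file mode reads the underlying files as they are. The following sketch, with made-up table and column names, illustrates the kind of statement a JDBC-based read could issue.

```python
def build_hive_query(table, columns, where=None):
    """Build the SELECT statement that a JDBC-based read could push down to Hive.

    In HDFS-file mode no such statement is issued, so a WHERE filter cannot be applied.
    """
    sql = "SELECT {} FROM {}".format(", ".join(columns), table)
    if where:
        sql += " WHERE {}".format(where)
    return sql

# Hypothetical example: only synchronize one day of data.
print(build_hive_query("ods_log", ["uid", "event"], "ds = '20240101'"))
# → SELECT uid, event FROM ods_log WHERE ds = '20240101'
```

If your source table is large and you only need a partition or a filtered slice, this trade-off may outweigh the higher throughput of HDFS-file reads.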
Configure the destination
Configure parameters related to the destination. In this example, a batch synchronization task is created to synchronize data in an EMR Hive database to MaxCompute. The following table describes the parameters.
Parameter | Description |
Data Source | The MaxCompute data source that you added. |
Table | The name of the MaxCompute table to which you want to write data. |
Partition Key Column | The partition information. If the destination table is a partitioned table, set this parameter to the value of the partition key column. The value can be a constant or a scheduling parameter variable. Note: Whether the Partition Key Column parameter is displayed, and the number of times it is displayed, vary based on the destination MaxCompute table that you select. If you select a non-partitioned table, this parameter is not displayed. If you select a partitioned table, the system displays one Partition Key Column parameter for each partition key column in the table. |
Retain the default values for other parameters.
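If the destination table is partitioned by date, the partition value is typically written as a scheduling parameter variable such as ${bizdate}, which DataWorks replaces at run time with the data timestamp (by default, the day before the scheduled run) in the yyyymmdd format. The following rough sketch mimics that substitution; it is an illustration of the convention, not DataWorks code.

```python
import datetime

def resolve_bizdate(scheduled_day):
    """Mimic how a ${bizdate}-style variable resolves: the day before the run, as yyyymmdd."""
    return (scheduled_day - datetime.timedelta(days=1)).strftime("%Y%m%d")

# A task scheduled to run on 2024-01-02 writes to the 2024-01-01 partition.
partition = "pt=" + resolve_bizdate(datetime.date(2024, 1, 2))
print(partition)  # → pt=20240101
```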
Configure field mappings
After you configure the source and destination, you must configure the mappings between the fields in the source and the fields in the destination. You can click Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout to perform the related operation.
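The two automatic mapping modes differ in what they match on: Map Fields with the Same Name pairs columns by name, and Map Fields in the Same Line pairs them by position. A simplified sketch of the distinction, with hypothetical field lists:

```python
def map_by_name(src_fields, dst_fields):
    """Map Fields with the Same Name: pair source and destination columns whose names match."""
    dst = set(dst_fields)
    return [(f, f) for f in src_fields if f in dst]

def map_by_position(src_fields, dst_fields):
    """Map Fields in the Same Line: pair columns by position, ignoring names."""
    return list(zip(src_fields, dst_fields))

src = ["id", "name", "gmt_create"]
dst = ["id", "user_name", "gmt_create"]
print(map_by_name(src, dst))      # → [('id', 'id'), ('gmt_create', 'gmt_create')]
print(map_by_position(src, dst))  # → [('id', 'id'), ('name', 'user_name'), ('gmt_create', 'gmt_create')]
```

In this example, mapping by name silently drops the name column because the destination calls it user_name, whereas mapping by position pairs them; always review the resulting mappings before you proceed.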
Configure channel control policies
You can configure settings, such as the maximum number of parallel threads and the maximum number of dirty data records that are allowed.
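In script mode, these channel settings correspond to a setting block similar to the following sketch. The keys shown are the commonly documented ones for Data Integration script mode, and the values are examples only: concurrent caps the number of parallel threads, and errorLimit.record caps the number of dirty data records allowed before the task fails.

```json
{
  "setting": {
    "speed": {
      "concurrent": 2
    },
    "errorLimit": {
      "record": 0
    }
  }
}
```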
Configure scheduling properties
Click Next. In the Configure Scheduling Settings step, configure the scheduling properties that are required to run the synchronization task. For more information about scheduling parameters, see Description for using scheduling parameters in data synchronization.
Configure the rerun property.
You can configure a rerun policy based on your business requirements. Task rerunning can prevent task failures that are caused by occasional issues such as network jitters.
Configure scheduling dependencies.
You can configure scheduling dependencies for the task based on your business requirements. You can configure the instance that is generated for the batch synchronization task in the current cycle to depend on the instance that is generated for the same task in the previous cycle. This way, you can ensure that instances generated for the task in different scheduling cycles can finish running in sequence and prevent multiple instances from running at the same time.
DataWorks uses resource groups for scheduling to issue batch synchronization tasks in Data Integration to resource groups for Data Integration and uses the resource groups for Data Integration to run the tasks. You are charged for using the resource groups for scheduling to schedule batch synchronization tasks. For information about the task issuing mechanism, see Mechanism for issuing tasks.
Test the batch synchronization task and commit and deploy the task for running
Test the batch synchronization task
In the top toolbar of the configuration tab of the batch synchronization task, you can click Run or Run with Parameters to test the batch synchronization task and check whether the task can run as expected. You can click Run with Parameters to test whether the scheduling parameters that you configured for the task are replaced as expected.
Commit and deploy the batch synchronization task
If the batch synchronization task runs as expected, you can click the Save icon to save the task configurations and the Commit icon to commit the task to the production environment. If the workspace is in standard mode, you must click the Deploy icon to deploy the task to the production environment. The synchronization task writes data from Hive to a MaxCompute table by day, hour, or minute. For more information, see Deploy tasks.
After you deploy the batch synchronization task, you can view the running results of the task in Operation Center and perform operations on the task, such as data backfilling. For more information, see Perform basic O&M operations on auto triggered tasks.