This topic describes how to synchronize a single E-MapReduce (EMR) Hive table to MaxCompute in offline mode. It provides best practices for data source configuration, network connectivity, and sync task configuration.
Background information
Hive is a data warehouse tool built on Hadoop that is used to extract, transform, and load data. It can store, query, and analyze large datasets in Hadoop. Hive maps structured data files to database tables and provides SQL query capabilities by converting SQL statements into MapReduce tasks. DataWorks Data Integration lets you synchronize data from Hive to other destinations and from other sources into Hive.
Preparations
Purchase a Serverless resource group.
Create a Hive data source and a MaxCompute data source. For more information, see Data Source Configuration.
Establish a network connection between the resource group and the data sources. For more information, see Overview of network connection solutions.
Note: If you use a public network connection between the exclusive resource group and EMR, you must add rules to the security group of the EMR cluster that allow access from the Elastic IP address (EIP) of the exclusive resource group. The inbound rules must allow traffic on the EMR cluster access ports, such as 10000, 9093, and 8020.
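Before running the sync task, it can help to confirm that the required ports are actually reachable. The following is a minimal sketch of such a check; `emr-header-1` is a hypothetical hostname and must be replaced with your cluster's actual address.

```python
import socket

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical EMR master hostname; replace with your cluster's address.
# The ports are the ones listed in the security-group note above.
for port in (10000, 9093, 8020):
    reachable = check_port("emr-header-1", port)
    print(f"port {port}: {'open' if reachable else 'unreachable'}")
```

If a port reports unreachable, review the security group inbound rules before retrying the sync task.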
Limits
Syncing source data to MaxCompute foreign tables is not supported.
Procedure
1. Create a node and configure the task
This topic does not describe the general steps for creating a node and configuring the task in the codeless UI. For more information, see Configure a sync node in the codeless UI.
2. Configure the data source and destination
Configure the data source (Hive)
In this example, the data source is a single Hive table. The key configuration parameters are described below.
| Configuration item | Configuration notes |
| --- | --- |
| Hive read methods | Note: The HDFS method is more efficient. The JDBC method generates a MapReduce program, which results in lower synchronization efficiency. The HDFS method does not support data filtering or reading data from views. Select a synchronization method as needed. |
| Table | Select the Hive table to synchronize. Ensure that the table has the same schema in the development and production environments of the Hive data source. Note: The table list and table schemas shown here come from the development environment of the Hive data source. If the table definitions in your development and production environments differ, a task that is configured correctly in the development environment may fail after you submit it to the production environment, with errors such as "table not found" or "column not found". |
| Parquet Schema | If the Hive table is stored in the Parquet format, configure the corresponding Parquet schema. |
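To catch "column not found" failures before publishing, you can compare the column listings of the development and production tables (for example, from the output of `DESCRIBE`). The sketch below is illustrative only; the schema dictionaries are hypothetical examples, not output from any DataWorks API.

```python
def schema_diff(dev: dict, prod: dict) -> list[str]:
    """Compare {column: type} mappings from two environments and report differences."""
    problems = []
    for col, typ in dev.items():
        if col not in prod:
            problems.append(f"column '{col}' missing in production")
        elif prod[col] != typ:
            problems.append(f"column '{col}' type mismatch: dev={typ}, prod={prod[col]}")
    for col in prod:
        if col not in dev:
            problems.append(f"column '{col}' missing in development")
    return problems

# Hypothetical schemas for illustration.
dev_schema = {"id": "bigint", "name": "string", "pt": "string"}
prod_schema = {"id": "bigint", "name": "varchar(255)"}
print(schema_diff(dev_schema, prod_schema))
```

An empty result means the two environments agree; any reported difference should be resolved before the task is submitted to production.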
Configure the data destination (MaxCompute)
In this example, the data destination is a MaxCompute table. The key configuration notes are described below.
Keep the default settings for parameters that are not described in the following table.
| Configuration item | Configuration notes |
| --- | --- |
| Tunnel Resource Group | The Tunnel quota used for MaxCompute data transmission. By default, Public transport resources (the free MaxCompute quota) is selected. If your exclusive Tunnel quota is unavailable because of overdue payments or expiration, the task automatically switches to the public transport resources at runtime. |
| Table | Select the MaxCompute table to write to. If you use a standard DataWorks workspace, ensure that a MaxCompute table with the same name and schema exists in both the development and production environments. You can also click Generate Target Table Schema With One Click to have the system automatically create a table to receive the data, and then manually adjust the generated table creation statement. |
| Partition Information | If the destination table is a partitioned table, enter a value for the partition key column. |
| Write Mode | Choose whether to clear or keep existing data in the destination table when writing. |
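Partition values are commonly tied to the business date of the scheduled run, formatted as yyyymmdd (the convention used by the DataWorks `$bizdate` scheduling parameter, which resolves to the day before the run). The sketch below shows that computation; it is an illustration of the convention, not part of the task configuration itself.

```python
from datetime import date, timedelta

def biz_date(run_day: date) -> str:
    """Business date: the day before the scheduled run, formatted yyyymmdd
    (the convention of the DataWorks $bizdate scheduling parameter)."""
    return (run_day - timedelta(days=1)).strftime("%Y%m%d")

# A task scheduled on 2024-05-01 writes to the partition for the previous day.
print(biz_date(date(2024, 5, 1)))  # → 20240430
```

In the Partition Information field you would typically enter a parameter expression rather than a literal value, so that each scheduled run writes to its own daily partition.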
3. Configure and verify the task
Field mapping: In most cases, mapping by Map Fields with Same Name or Map Fields in Same Line is sufficient. If the order or names of the fields in the source and destination are different, you can manually adjust the mapping.
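The two mapping modes can be pictured as follows. This is a minimal sketch of the pairing logic only, with hypothetical column names; it is not the DataWorks implementation.

```python
def map_fields(src: list[str], dst: list[str], by_name: bool = True) -> list[tuple[str, str]]:
    """Pair source and destination columns either by matching names
    ('Map Fields with Same Name') or by position ('Map Fields in Same Line')."""
    if by_name:
        return [(c, c) for c in src if c in dst]
    return list(zip(src, dst))

# Hypothetical column lists for illustration.
src_cols = ["id", "name", "ts"]
dst_cols = ["name", "id", "created_at"]
print(map_fields(src_cols, dst_cols, by_name=True))   # pairs by matching name
print(map_fields(src_cols, dst_cols, by_name=False))  # pairs by position
```

Note how positional mapping would silently pair `id` with `name` here, which is why manual adjustment is needed whenever source and destination column orders differ.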
Channel control: Set the Policy for Dirty Data Records to reject dirty data to ensure data quality. You can keep the default settings for other parameters.
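Conceptually, a record is "dirty" when it cannot be converted to the destination schema, and the policy controls how many such records are tolerated before the task fails. The sketch below simulates that behavior under an assumed schema with an integer `id` column; it is illustrative only and not the Data Integration implementation.

```python
def write_batch(records: list[dict], max_dirty: int = 0):
    """Simulate the dirty-data policy: records whose 'id' cannot be cast to int
    count as dirty; abort once the allowed number of dirty records is exceeded."""
    dirty = 0
    written = []
    for rec in records:
        try:
            written.append({"id": int(rec["id"]), "name": rec["name"]})
        except (KeyError, ValueError, TypeError):
            dirty += 1
            if dirty > max_dirty:
                raise RuntimeError(f"dirty records exceeded limit ({max_dirty})")
    return written, dirty

# With max_dirty=0 (reject dirty data), the second record fails the whole batch.
rows = [{"id": "1", "name": "a"}, {"id": "x", "name": "b"}]
ok, dirty = write_batch(rows, max_dirty=1)
print(ok, dirty)
```

Setting the limit to 0 mirrors the "reject dirty data" policy recommended above: any unconvertible record stops the task instead of being silently dropped.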
4. Configure and run a test
On the right side of the offline sync node editor page, click Run Configuration. Set the Resource Group and Script Parameters for the test run. Then, click Run in the top toolbar to test whether the synchronization task runs successfully.
In the navigation pane on the left, click the icon to the right of Personal Directory to create a file with the .sql extension. You can then execute the following SQL statement to query the data in the destination table and check whether it is as expected. Note: This query method requires you to attach the destination MaxCompute instance to DataWorks as a computing resource.
On the right side of the .sql file editor page, click Run Configuration. Specify the data source Type, Computing Resource, and Resource Group. Then, click Run in the top toolbar.
SELECT * FROM <MaxCompute_destination_table_name> WHERE pt=<specified_partition> LIMIT 20;
5. Configure scheduling and publish the task
On the right side of the offline sync node, click Scheduling. Set the scheduling configuration parameters for periodic runs. Then, click Publish in the top toolbar. In the panel that appears, follow the on-screen instructions to publish the task.