
DataWorks:Batch synchronization from a single EMR Hive table to MaxCompute

Last Updated: Feb 26, 2026

This topic describes how to synchronize a single E-MapReduce (EMR) Hive table to MaxCompute in offline mode. It provides best practices for data source configuration, network connectivity, and sync task configuration.

Background information

Hive is a data warehouse tool built on Hadoop that can extract, transform, and load (ETL) data, and can store, query, and analyze large datasets stored in Hadoop. It maps structured data files to database tables and provides SQL query capabilities by converting SQL statements into MapReduce tasks. DataWorks Data Integration lets you synchronize data from Hive to other destinations and from other sources into Hive.

Preparations

  • Purchase a Serverless resource group.

  • Create a Hive data source and a MaxCompute data source. For more information, see Data Source Configuration.

  • Establish a network connection between the resource group and the data sources. For more information, see Overview of network connection solutions.

    Note

    If you use a public network connection for the exclusive resource group and EMR, you must add rules to the security group of the EMR cluster. The rules must allow access from the Elastic IP (EIP) address of the exclusive resource group. The inbound rules must allow traffic on the EMR cluster access ports, such as 10000, 9093, and 8020.
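As an illustration, the inbound rules described above can be scripted against the ECS `AuthorizeSecurityGroup` API through the Alibaba Cloud CLI. The sketch below only builds the CLI command strings; the security group ID and EIP are hypothetical placeholders, and it assumes the `aliyun` CLI is installed and configured before you run the generated commands.

```python
# Sketch: build one Alibaba Cloud CLI command per EMR access port to allow
# inbound traffic from the resource group's EIP. The security group ID and
# EIP used below are hypothetical placeholders.
EMR_PORTS = [10000, 9093, 8020]  # ports listed in this topic; 10000 is HiveServer2's default

def authorize_commands(security_group_id: str, eip: str) -> list[str]:
    """Return one AuthorizeSecurityGroup command per EMR access port."""
    return [
        (f"aliyun ecs AuthorizeSecurityGroup "
         f"--SecurityGroupId {security_group_id} "
         f"--IpProtocol tcp "
         f"--PortRange {port}/{port} "
         f"--SourceCidrIp {eip}/32")
        for port in EMR_PORTS
    ]

for cmd in authorize_commands("sg-xxxxxxxx", "203.0.113.10"):
    print(cmd)
```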

Limits

Syncing source data to MaxCompute foreign tables is not supported.

Procedure

1. Create a node and configure the task

The general steps for creating a node and configuring a task in the codeless UI are not described in this topic. For more information, see Configure a sync node in the codeless UI.

2. Configure the data source and destination

Configure the data source (Hive)

In this example, the data source is a single Hive table that is synchronized to MaxCompute in offline mode. The key configuration parameters are described below.

Configuration item

Configuration notes

Hive read methods

  • Read data from HDFS files: The Hive Reader plugin accesses the Hive Metastore service to parse information. This information includes the Hadoop Distributed File System (HDFS) storage path, file format, and separator of the configured data table. The plugin then reads the table data from the HDFS files.

  • Read data using Hive JDBC (supports conditional filtering): The Hive Reader plugin connects to the HiveServer2 service through a Java Database Connectivity (JDBC) client to read data. This method supports data filtering with a `WHERE` clause and reads data directly using SQL.

Note

The HDFS method is more efficient. The JDBC method generates a MapReduce program, which results in lower synchronization efficiency. Note that the HDFS method does not support data filtering or reading data from views. Select a synchronization method as needed.
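The trade-off above can be captured as a small decision rule. The helper below is purely illustrative (it is not part of DataWorks): it encodes the guidance that HDFS reads are faster but cannot filter rows or read views, while JDBC supports both.

```python
def choose_hive_read_method(is_view: bool, needs_filter: bool) -> str:
    """Pick a Hive Reader method per the trade-offs described above:
    the HDFS method is more efficient but does not support data
    filtering or reading from views; the JDBC method runs through
    HiveServer2 as a MapReduce job but supports both."""
    if is_view or needs_filter:
        return "jdbc"
    return "hdfs"

print(choose_hive_read_method(is_view=False, needs_filter=False))  # hdfs
print(choose_hive_read_method(is_view=True, needs_filter=False))   # jdbc
```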

Table

Select the Hive table to synchronize. Ensure the table has the same schema in the development and production environments of the Hive data source.

Note

The tables and table schemas displayed here come from the development environment of the Hive data source. If the table definitions in your development and production environments differ, a task that is configured correctly in the development environment may fail after you submit it to the production environment, with errors such as 'table not found' or 'column not found'.

Parquet Schema

If the Hive table is stored in the Parquet format, configure the corresponding Parquet schema.

Configure the data destination (MaxCompute)

In this example, the data destination is a MaxCompute table. The key configuration notes are described below.

Note

Keep the default settings for parameters that are not described in the following table.

Configuration item

Configuration notes

Tunnel Resource Group

This is the Tunnel Quota for MaxCompute data transmission. By default, Public transport resources is selected, which is the free quota for MaxCompute. If your exclusive Tunnel Quota is unavailable due to overdue payments or expiration, the task automatically switches to the public transport resources during runtime.

Table

Select the MaxCompute table to synchronize. If you use a standard DataWorks workspace, ensure that a MaxCompute table with the same name and schema exists in both the development and production environments.

You can also click Generate Target Table Schema With One Click for the system to automatically create a table to receive the data. You can then manually adjust the table creation statement.

Note


  • If the destination MaxCompute table does not exist in the development environment, you cannot find it in the destination table drop-down list when you configure the offline sync node.

  • If the destination MaxCompute table does not exist in the production environment, the sync task fails during its scheduled runtime after you submit and publish it. This is because the destination table cannot be found.

  • If the table schemas in the development and production environments are inconsistent, the actual column mapping during the scheduled runtime may differ from the mapping configured for the offline sync node. This can lead to incorrect data being written.
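For reference, a generated table creation statement typically mirrors the Hive source schema with MaxCompute column types and keeps the partition column. The DDL below is a hypothetical example for a simple id/name table partitioned by ds; adjust the column types, table name, and lifecycle to your actual schema.

```sql
-- Hypothetical destination table mirroring a simple Hive source schema.
CREATE TABLE IF NOT EXISTS hive_sync_dest (
    id   BIGINT,
    name STRING
)
PARTITIONED BY (ds STRING)
LIFECYCLE 365;
```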

Partition Information

If the table is a partitioned table, enter a value for the partition key column.

  • The value can be a static value, such as ds=20220101.

  • The value can be a scheduling system parameter, such as ds=${bizdate}. The system parameter is automatically replaced with its actual value when the task runs.
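To illustrate how a scheduling system parameter resolves, the sketch below simulates replacing ${bizdate} (the data timestamp, by default the day before the run date, in yyyymmdd format) in a partition expression. In practice this substitution is performed by the DataWorks scheduler at run time; the code is only a model of that behavior.

```python
from datetime import date, timedelta
import re

def resolve_partition(expr: str, run_date: date) -> str:
    """Simulate scheduler-side substitution of ${bizdate}: the data
    timestamp is the day before the run date, formatted as yyyymmdd."""
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return re.sub(r"\$\{bizdate\}", bizdate, expr)

# A task that runs on 2022-01-02 writes to the 2022-01-01 partition.
print(resolve_partition("ds=${bizdate}", date(2022, 1, 2)))  # ds=20220101
```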

Write Mode

Choose whether to clear or keep existing data when writing to the destination table.

3. Configure and verify the task

  • Field mapping: In most cases, mapping by Map Fields with Same Name or Map Fields in Same Line is sufficient. If the order or names of the fields in the source and destination are different, you can manually adjust the mapping.

  • Channel control: Set the Policy for Dirty Data Records to reject dirty data to ensure data quality. You can keep the default settings for other parameters.

4. Configure and run a test

  1. On the right side of the offline sync node editor page, click Run Configuration. Set the Resource Group and Script Parameters for the test run. Then, click Run in the top toolbar to test whether the synchronization task runs successfully.

  2. In the left-side navigation pane, click the icon to the right of Personal Directory to create a file with the .sql extension. You can then execute the following SQL statement to query the data in the destination table and check whether it is as expected.

    SELECT * FROM <MaxCompute_destination_table_name> WHERE pt=<specified_partition> LIMIT 20;

5. Configure scheduling and publish the task

On the right side of the offline sync node, click Scheduling. Set the scheduling configuration parameters for periodic runs. Then, click Publish in the top toolbar. In the panel that appears, follow the on-screen instructions to publish the task.