DataWorks provides Hive Reader and Hive Writer for you to read data from and write data to Hive data sources. This topic describes the capabilities of synchronizing data from or to Hive data sources.
Background information
Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured logs. Hive maps structured data files to a table and allows you to execute SQL statements to query data in the table. Essentially, Hive is an engine that is used to parse SQL statements. It uses MapReduce to analyze data at the underlying layer and uses Hadoop Distributed File System (HDFS) to store the data. Hive converts Hive Query Language (HQL) statements to MapReduce programs and runs the programs on YARN.
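The following HiveQL sketch illustrates this model. The table name, columns, and delimiter are hypothetical and are used only for illustration: the table is mapped to delimited text files in HDFS, and Hive compiles the SELECT statement into a MapReduce job that runs on YARN.
-- Hypothetical example: map structured files in HDFS to a Hive table.
CREATE TABLE IF NOT EXISTS access_log (
    id   BIGINT,
    name STRING,
    age  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- column delimiter of the underlying files
STORED AS TEXTFILE;                            -- text files stored in HDFS
-- Hive converts this HQL statement to a MapReduce program and runs it on YARN.
SELECT name, COUNT(*) AS cnt
FROM access_log
GROUP BY name;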
Hive Reader connects to a Hive metastore to obtain the metadata of your Hive table. Hive Reader can read data based on HDFS files or by using Hive Java Database Connectivity (JDBC):
Read data based on HDFS files
Hive Reader connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file that corresponds to your Hive table. Then, Hive Reader reads data from the HDFS file.
Read data by using Hive JDBC
Hive Reader connects to HiveServer2 by using Hive JDBC and reads data. Hive Reader allows you to specify WHERE clauses to filter data and execute SQL statements to read data.
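For example, in JDBC mode, Hive Reader can execute a filtered query such as the following statement, which corresponds to the querySql setting in the code example later in this topic. The table and column names are illustrative.
-- Illustrative query that Hive Reader executes through HiveServer2 in JDBC mode.
SELECT id, name, age
FROM part_table_1
WHERE pt2 = 'B';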
Hive Writer connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file to which you want to write data. Then, Hive Writer writes data to the HDFS file and executes the LOAD DATA statement in the Hive client to load the data in the HDFS file to the destination Hive table by using JDBC.
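The following statement is a hedged sketch of this load step. The staging path is a placeholder for the temporary HDFS file that Hive Writer generates, and the table name and partition values are taken from the Hive Writer code example later in this topic.
-- Illustrative LOAD DATA statement; the staging path is a placeholder.
LOAD DATA INPATH '/tmp/hive_writer/staging_file'
INTO TABLE partitiontable2
PARTITION (year = 'a', month = 'b', day = 'c');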
The underlying logic of Hive Writer is the same as that of HDFS Writer. You can configure parameters for HDFS Writer in the parameters of Hive Writer. Data Integration transparently transmits the configured parameters to HDFS Writer.
Supported Hive versions
Limits
Hive data sources support only exclusive resource groups for Data Integration.
Hive Reader can read files only in the text, Optimized Row Columnar (ORC), and Parquet formats.
When you run a batch synchronization task to synchronize data to a Hive data source, temporary files are generated on the server and are automatically deleted after the task finishes running. DataWorks cannot ensure that the number of generated temporary files stays below the upper limit for the number of files that are allowed in an HDFS directory. Pay attention to this upper limit and make sure that the number of generated temporary files does not exceed it. Otherwise, the HDFS service may become unavailable due to excessive temporary files.
Note You can change the value of the dfs.namenode.fs-limits.max-directory-items parameter on the server to define the maximum number of non-recursive directories or files that a single HDFS directory can contain. Valid values: 1 to 6400000. Default value: 1048576. To prevent the number of generated temporary files from exceeding this upper limit, you can specify a larger value for the dfs.namenode.fs-limits.max-directory-items parameter or delete files that are no longer required from the directory.
Hive data sources support only Kerberos authentication as the identity authentication method. If you do not need to perform identity authentication for a Hive data source, you can set the Special Authentication Method parameter to None when you add the data source.
If Kerberos authentication is enabled for both HiveServer2 and a metastore for a Hive data source that is accessed by using a Kerberos-authenticated identity in DataWorks, and the principals that are used for the authentication are different, you must add the following configuration to the extended parameters:
{ "hive.metastore.kerberos.principal": "your metastore principal" }
Data type mappings
Batch data read
Category | Hive data type |
String | CHAR, VARCHAR, and STRING |
Integer | TINYINT, SMALLINT, INT, INTEGER, and BIGINT |
Floating point | FLOAT, DOUBLE, and DECIMAL |
Date and time | TIMESTAMP and DATE |
Boolean | BOOLEAN |
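As a hedged illustration, the following hypothetical table definition uses only Hive data types from the preceding list, so all of its columns can be read during batch synchronization:
-- Hypothetical table whose columns all use data types supported by Hive Reader.
CREATE TABLE IF NOT EXISTS example_types (
    c_char      CHAR(10),
    c_varchar   VARCHAR(255),
    c_string    STRING,
    c_tinyint   TINYINT,
    c_smallint  SMALLINT,
    c_int       INT,
    c_bigint    BIGINT,
    c_float     FLOAT,
    c_double    DOUBLE,
    c_decimal   DECIMAL(18, 4),
    c_timestamp TIMESTAMP,
    c_date      DATE,
    c_boolean   BOOLEAN
);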
Preparations before data synchronization
The preparations vary based on the data source configuration mode.
Alibaba Cloud Instance Mode
If you select this mode and want to synchronize Object Storage Service (OSS) tables in the selected instance, you must select an access identity, which can be an Alibaba Cloud account, a RAM user, or a RAM role. Make sure that the selected access identity has the required OSS permissions. Otherwise, data synchronization fails due to insufficient read and write permissions.
A connectivity test does not verify the data read and write permissions.
Connection String Mode
Configuration for the use of DLF to manage Hive metadata
If you use Data Lake Formation (DLF) to manage metadata of Hive data sources, you need to add the following configuration in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source:
{"dlf.catalog.id" : "my_catalog_xxxx"}
my_catalog_xxxx is the value of the dlf.catalog.id parameter that is configured for Hive in EMR.
High availability configuration
If High Service Availability is turned on for the E-MapReduce (EMR) Hive cluster, you must configure items related to the high availability feature in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example. You can obtain the values of the configuration items on the Services tab of the EMR Hive cluster in the EMR console.
{
// Advanced parameters related to high availability.
"dfs.nameservices":"testDfs",
"dfs.ha.namenodes.testDfs":"namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
// If you use OSS as the underlying storage, you can configure the following parameters that are required for connecting to OSS in the advanced parameters. You can also use another service as the underlying storage.
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
Configuration of an OSS external table
If you use OSS as the underlying storage, you must take note of the following items:
The value of the defaultFS parameter must start with oss://. For example, the value can be in the oss://bucketName format.
If you use the Hive data source as the source of a data synchronization task and you want to synchronize data from an OSS external table, you also need to configure items related to OSS in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example:
{ "fs.oss.accessKeyId":"<yourAccessKeyId>", "fs.oss.accessKeySecret":"<yourAccessKeySecret>", "fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com" }
If you use the Hive data source as the source of a data synchronization task and you want to synchronize data from an OSS-HDFS external table, you also need to configure items related to OSS-HDFS in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example:
{ "fs.oss.accessKeyId":"<yourAccessKeyId>", "fs.oss.accessKeySecret":"<yourAccessKeySecret>", "fs.oss.endpoint":"cn-<yourRegion>.oss-dls.aliyuncs.com" }
Develop a data synchronization task
For information about the entry point and the procedure for configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Configure synchronization settings to implement batch synchronization of all data in a database
For more information about the configuration procedure, see Configure a synchronization task in Data Integration.
Appendix: Code and parameters
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for Hive Reader
Hive Reader can read data based on HDFS files or by using Hive JDBC.
Read data based on HDFS files
{ "type": "job", "steps": [ { "stepType": "hive", "parameter": { "partition": "pt1=a,pt2=b,pt3=c", // The partitions. "datasource": "hive_not_ha_****", // The name of the data source. "column": [ // The names of the columns. "id", "pt2", "pt1" ], "readMode": "hdfs", // The mode in which Hive Reader reads data. "table": "part_table_1", "hdfsUsername" : "hdfs", "hivePartitionColumn": [ { "type": "string", "value": "The name of the first partition." }, { "type": "string", "value": "The name of the second partition." } ] }, "name": "Reader", "category": "reader" }, { "stepType": "hive", "parameter": { }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" // The maximum number of dirty data records allowed. }, "speed": { "concurrent": 2, // The maximum number of parallel threads. "throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. "mbps":"12"// The maximum transmission rate. } } }
Read data by using Hive JDBC
{ "type": "job", "steps": [ { "stepType": "hive", "parameter": { "querySql": "select id,name,age from part_table_1 where pt2='B'", "datasource": "hive_not_ha_****", // The name of the data source. "column": [ // The names of the columns. "id", "name", "age" ], "where": "", "table": "part_table_1", "readMode": "jdbc" // The mode in which Hive Reader reads data. }, "name": "Reader", "category": "reader" }, { "stepType": "hive", "parameter": { }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" }, "speed": { "concurrent": 2, // The maximum number of parallel threads. "throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. "mbps":"12"// The maximum transmission rate. } } }
Parameters in code for Hive Reader
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | No default value |
table | The name of the table from which you want to read data. Note The name is case-sensitive. | Yes | No default value |
readMode | The mode in which Hive Reader reads data. Valid values: hdfs (Hive Reader reads data based on HDFS files) and jdbc (Hive Reader reads data by using Hive JDBC). | No | No default value |
partition | The partitions from which you want to read data, such as pt1=a,pt2=b,pt3=c in the preceding code example. | No | No default value |
column | The names of the columns from which you want to read data. Separate the names with commas (,), such as "id","name","age". | Yes | No default value |
querySql | If Hive Reader reads data by using Hive JDBC, you can configure the querySql parameter for Hive Reader to read data. | No | No default value |
where | If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause for Hive Reader to filter data. | No | No default value |
hdfsUsername | The account that is used when Hive Reader reads data from the Hive table. By default, if Hive Reader reads data based on HDFS files, the account that you configured when you add the Hive data source to DataWorks on the Data Source page is used to read data from the HDFS file. If you set the HIVE Login parameter to Anonymous when you add the Hive data source on the Data Source page, the admin account is automatically used when Hive Reader reads data from the HDFS file. If Hive Reader does not have permissions to read data during data synchronization, you must configure the hdfsUsername parameter for the related synchronization task in the code editor. | No | No default value |
hivePartitionColumn | If you want to synchronize the values of the partition fields to the destination, you can configure the hivePartitionColumn parameter for the related synchronization task in the code editor. | No | No default value |
Code for Hive Writer
{
    "type": "job",
    "steps": [
        {
            "stepType": "hive",
            "parameter": {
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hive",
            "parameter": {
                "partition": "year=a,month=b,day=c", // The partition.
                "datasource": "hive_ha_shanghai", // The name of the data source.
                "table": "partitiontable2", // The name of the table.
                "column": [ // The names of the columns.
                    "id",
                    "name",
                    "age"
                ],
                "writeMode": "append", // The write mode.
                "hdfsUsername": "hdfs"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": true, // Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
            "concurrent": 2, // The maximum number of parallel threads.
            "mbps": "12" // The maximum transmission rate.
        }
    }
}
Parameters in code for Hive Writer
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | No default value |
column | The names of the columns to which you want to write data. Separate the names with commas (,), such as "id","name","age". | Yes | No default value |
table | The name of the Hive table to which you want to write data. Note The name is case-sensitive. | Yes | No default value |
partition | The partition to which you want to write data, such as year=a,month=b,day=c in the preceding code example. | No | No default value |
writeMode | The mode in which Hive Writer loads data to the Hive table. After data is written to the HDFS file, Hive Writer executes the LOAD DATA statement to load the data from the HDFS file to the Hive table. This parameter specifies the load mode, such as append in the preceding code example. Note Before you configure this parameter, make sure that the value that you specify does not affect your business. Pay attention to the destination directory and the value of this parameter to prevent data from being unexpectedly deleted. This parameter and the hiveConfig parameter must be used in pairs. | Yes | No default value |
hiveConfig | The extended parameters for Hive. You can configure the following parameters as extended parameters: hiveCommand, jdbcUrl, username, and password. | Yes | No default value |
hdfsUsername | The account that is used when Hive Writer writes data to the Hive table. By default, when Hive Writer writes data to the Hive table, the account that you configured when you add the Hive data source to DataWorks on the Data Source page is used. If you set the HIVE Login parameter to Anonymous when you add the Hive data source on the Data Source page, the admin account is automatically used when Hive Writer writes data to the Hive table. If Hive Writer does not have permissions to write data during data synchronization, you must configure the hdfsUsername parameter for the related data synchronization task in the code editor. | No | No default value |