DataWorks provides Hive Reader and Hive Writer for you to read data from and write data to Hive data sources. This topic describes the capabilities of synchronizing data from or to Hive data sources.
Background information
Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured logs. Hive maps structured data files to a table and allows you to execute SQL statements to query data in the table. Essentially, Hive is an engine that is used to parse SQL statements. It uses MapReduce to analyze data at the underlying layer and uses Hadoop Distributed File System (HDFS) to store the data. Hive converts Hive Query Language (HQL) statements to MapReduce programs and runs the programs on YARN.
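The following HiveQL sketch illustrates this model. The table name, columns, and delimiter are hypothetical and are used only for illustration: the table is mapped to delimited text files in HDFS, and Hive compiles the SELECT statement into a MapReduce job that runs on YARN.
-- Hypothetical example: map structured files in HDFS to a Hive table.
CREATE TABLE IF NOT EXISTS access_log (
    id   BIGINT,
    name STRING,
    age  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- column delimiter of the underlying files
STORED AS TEXTFILE;                            -- text files stored in HDFS
-- Hive converts this HQL statement to a MapReduce program and runs it on YARN.
SELECT name, COUNT(*) AS cnt
FROM access_log
GROUP BY name;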
Hive Reader connects to a Hive metastore to obtain the metadata of your Hive table. Hive Reader can read data based on HDFS files or by using Hive Java Database Connectivity (JDBC):
Read data based on HDFS files
Hive Reader connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file that corresponds to your Hive table. Then, Hive Reader reads data from the HDFS file.
Read data by using Hive JDBC
Hive Reader connects to HiveServer2 by using Hive JDBC and reads data. Hive Reader allows you to specify WHERE clauses to filter data and execute SQL statements to read data.
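For example, in JDBC mode, Hive Reader can execute a filtered query such as the following statement, which corresponds to the querySql setting in the code example later in this topic. The table and column names are illustrative.
-- Illustrative query that Hive Reader executes through HiveServer2 in JDBC mode.
SELECT id, name, age
FROM part_table_1
WHERE pt2 = 'B';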
Hive Writer connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file to which you want to write data. Then, Hive Writer writes data to the HDFS file and executes the LOAD DATA statement in the Hive client to load the data in the HDFS file to the destination Hive table by using JDBC.
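The following statement is a hedged sketch of this load step. The staging path is a placeholder for the temporary HDFS file that Hive Writer generates, and the table name and partition values are taken from the Hive Writer code example later in this topic.
-- Illustrative LOAD DATA statement; the staging path is a placeholder.
LOAD DATA INPATH '/tmp/hive_writer/staging_file'
INTO TABLE partitiontable2
PARTITION (year = 'a', month = 'b', day = 'c');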
The underlying logic of Hive Writer is the same as that of HDFS Writer. You can configure parameters for HDFS Writer in the parameters of Hive Writer. Data Integration transparently transmits the configured parameters to HDFS Writer.
Supported Hive versions
Limits
Hive data sources support only exclusive resource groups for Data Integration.
Hive Reader can read files only in the text, Optimized Row Columnar (ORC), and Parquet formats.
When you run a batch synchronization task to synchronize data to a Hive data source, temporary files are generated on the server and are automatically deleted after the task finishes running. DataWorks cannot ensure that the number of generated temporary files stays below the upper limit for the number of files that are allowed in an HDFS directory. Pay attention to this upper limit and make sure that the number of generated temporary files does not exceed it. Otherwise, the HDFS service may become unavailable due to excessive temporary files.
Note You can change the value of the dfs.namenode.fs-limits.max-directory-items parameter on the server to define the maximum number of non-recursive directories or files that a single HDFS directory can contain. Valid values: 1 to 6400000. Default value: 1048576. To prevent the number of generated temporary files from exceeding this upper limit, you can specify a larger value for the dfs.namenode.fs-limits.max-directory-items parameter or delete files that are no longer required from the directory.
Hive data sources support only Kerberos authentication as the identity authentication method. If you do not need to perform identity authentication for a Hive data source, you can set the Special Authentication Method parameter to None when you add the data source.
If Kerberos authentication is enabled for both HiveServer2 and a metastore for a Hive data source that is accessed by using a Kerberos-authenticated identity in DataWorks, and the principals that are used for the authentication are different, you must add the following configuration to the extended parameters:
{ "hive.metastore.kerberos.principal": "your metastore principal" }
Data type mappings
Batch data read
Category | Hive data type |
String | CHAR, VARCHAR, and STRING |
Integer | TINYINT, SMALLINT, INT, INTEGER, and BIGINT |
Floating point | FLOAT, DOUBLE, and DECIMAL |
Date and time | TIMESTAMP and DATE |
Boolean | BOOLEAN |
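As a hedged illustration, the following hypothetical table definition uses only Hive data types from the preceding list, so all of its columns can be read during batch synchronization:
-- Hypothetical table whose columns all use data types supported by Hive Reader.
CREATE TABLE IF NOT EXISTS example_types (
    c_char      CHAR(10),
    c_varchar   VARCHAR(255),
    c_string    STRING,
    c_tinyint   TINYINT,
    c_smallint  SMALLINT,
    c_int       INT,
    c_bigint    BIGINT,
    c_float     FLOAT,
    c_double    DOUBLE,
    c_decimal   DECIMAL(18, 4),
    c_timestamp TIMESTAMP,
    c_date      DATE,
    c_boolean   BOOLEAN
);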
Preparations before data synchronization
The preparations vary based on the data source configuration mode.
Alibaba Cloud Instance Mode
If you select this mode and want to synchronize Object Storage Service (OSS) tables in the selected instance, you must select an access identity, which can be an Alibaba Cloud account, a RAM user, or a RAM role. Make sure that the selected access identity has the required OSS permissions. Otherwise, data synchronization fails due to insufficient read and write permissions.
A connectivity test does not verify the data read and write permissions.
Connection String Mode
Configuration for the use of DLF to manage Hive metadata
If you use Data Lake Formation (DLF) to manage metadata of Hive data sources, you need to add the following configuration in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source:
{"dlf.catalog.id" : "my_catalog_xxxx"}
my_catalog_xxxx is the value of the dlf.catalog.id parameter that is configured for Hive in EMR.
High availability configuration
If High Service Availability is turned on for the E-MapReduce (EMR) Hive cluster, you must configure items related to the high availability feature in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example. You can obtain the values of the configuration items on the Services tab of the EMR Hive cluster in the EMR console.
{
// Advanced parameters related to high availability.
"dfs.nameservices":"testDfs",
"dfs.ha.namenodes.testDfs":"namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
// If you use OSS as the underlying storage, you can configure the following parameters that are required for connecting to OSS in the advanced parameters. You can also use another service as the underlying storage.
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
Configuration of an OSS external table
If you use OSS as the underlying storage, you must take note of the following items:
The value of the defaultFS parameter must start with oss://. For example, the value can be in the oss://bucketName format.
If you use the Hive data source as the source of a data synchronization task and you want to synchronize data from an OSS external table, you also need to configure items related to OSS in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example:
{ "fs.oss.accessKeyId":"<yourAccessKeyId>", "fs.oss.accessKeySecret":"<yourAccessKeySecret>", "fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com" }
If you use the Hive data source as the source of a data synchronization task and you want to synchronize data from an OSS-HDFS external table, you also need to configure items related to OSS-HDFS in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example:
{ "fs.oss.accessKeyId":"<yourAccessKeyId>", "fs.oss.accessKeySecret":"<yourAccessKeySecret>", "fs.oss.endpoint":"cn-<yourRegion>.oss-dls.aliyuncs.com" }
Develop a data synchronization task
For information about the entry point and the procedure for configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Configure synchronization settings to implement batch synchronization of all data in a database
For more information about the configuration procedure, see Configure a synchronization task in Data Integration.
Appendix: Code and parameters
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for Hive Reader
Hive Reader can read data based on HDFS files or by using Hive JDBC.
Read data based on HDFS files
{ "type": "job", "steps": [ { "stepType": "hive", "parameter": { "partition": "pt1=a,pt2=b,pt3=c", // The partitions. "datasource": "hive_not_ha_****", // The name of the data source. "column": [ // The names of the columns. "id", "pt2", "pt1" ], "readMode": "hdfs", // The mode in which Hive Reader reads data. "table": "part_table_1", "hdfsUsername" : "hdfs", "hivePartitionColumn": [ { "type": "string", "value": "The name of the first partition." }, { "type": "string", "value": "The name of the second partition." } ] }, "name": "Reader", "category": "reader" }, { "stepType": "hive", "parameter": { }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" // The maximum number of dirty data records allowed. }, "speed": { "concurrent": 2, // The maximum number of parallel threads. "throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. "mbps":"12"// The maximum transmission rate. } } }
Read data by using Hive JDBC
{ "type": "job", "steps": [ { "stepType": "hive", "parameter": { "querySql": "select id,name,age from part_table_1 where pt2='B'", "datasource": "hive_not_ha_****", // The name of the data source. "column": [ // The names of the columns. "id", "name", "age" ], "where": "", "table": "part_table_1", "readMode": "jdbc" // The mode in which Hive Reader reads data. }, "name": "Reader", "category": "reader" }, { "stepType": "hive", "parameter": { }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" }, "speed": { "concurrent": 2, // The maximum number of parallel threads. "throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. "mbps":"12"// The maximum transmission rate. } } }
Parameters in code for Hive Reader
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | No default value |
table | The name of the table from which you want to read data. Note The name is case-sensitive. | Yes | No default value |
readMode | The mode in which Hive Reader reads data. Valid values: hdfs (Hive Reader reads data based on HDFS files) and jdbc (Hive Reader reads data by using Hive JDBC). | No | No default value |
partition | The partitions from which you want to read data, such as pt1=a,pt2=b,pt3=c in the preceding code example. | No | No default value |
column | The names of the columns from which you want to read data. Separate the names with commas (,), such as "id","name","age". | Yes | No default value |
querySql | If Hive Reader reads data by using Hive JDBC, you can configure the querySql parameter for Hive Reader to read data. | No | No default value |
where | If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause for Hive Reader to filter data. | No | No default value |
hdfsUsername | The account that is used when Hive Reader reads data from the Hive table. By default, if Hive Reader reads data based on HDFS files, the account that you configured when you add the Hive data source to DataWorks on the Data Source page is used to read data from the HDFS file. If you set the HIVE Login parameter to Anonymous when you add the Hive data source on the Data Source page, the admin account is automatically used when Hive Reader reads data from the HDFS file. If Hive Reader does not have permissions to read data during data synchronization, you must configure the hdfsUsername parameter for the related synchronization task in the code editor. | No | No default value |
hivePartitionColumn | If you want to synchronize the values of the partition fields to the destination, you can configure the hivePartitionColumn parameter for the related synchronization task in the code editor. | No | No default value |
Code for Hive Writer
{
    "type": "job",
    "steps": [
        {
            "stepType": "hive",
            "parameter": {
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hive",
            "parameter": {
                "partition": "year=a,month=b,day=c", // The partition.
                "datasource": "hive_ha_shanghai", // The name of the data source.
                "table": "partitiontable2", // The name of the table.
                "column": [ // The names of the columns.
                    "id",
                    "name",
                    "age"
                ],
                "writeMode": "append", // The write mode.
                "hdfsUsername": "hdfs"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": true, // Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
            "concurrent": 2, // The maximum number of parallel threads.
            "mbps": "12" // The maximum transmission rate.
        }
    }
}
Parameters in code for Hive Writer
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | No default value |
column | The names of the columns to which you want to write data. Separate the names with commas (,), such as "id","name","age". | Yes | No default value |
table | The name of the Hive table to which you want to write data. Note The name is case-sensitive. | Yes | No default value |
partition | The partition to which you want to write data, such as year=a,month=b,day=c in the preceding code example. | No | No default value |
writeMode | The mode in which Hive Writer loads data to the Hive table. After data is written to the HDFS file, Hive Writer executes the LOAD DATA statement to load the data from the HDFS file to the Hive table. This parameter specifies the load mode, such as append in the preceding code example. Note Before you configure this parameter, make sure that the value that you specify does not affect your business. Pay attention to the destination directory and the value of this parameter to prevent data from being unexpectedly deleted. This parameter and the hiveConfig parameter must be used in pairs. | Yes | No default value |
hiveConfig | The extended parameters for Hive. You can configure the following parameters as extended parameters: hiveCommand, jdbcUrl, username, and password. | Yes | No default value |
hdfsUsername | The account that is used when Hive Writer writes data to the Hive table. By default, when Hive Writer writes data to the Hive table, the account that you configured when you add the Hive data source to DataWorks on the Data Source page is used. If you set the HIVE Login parameter to Anonymous when you add the Hive data source on the Data Source page, the admin account is automatically used when Hive Writer writes data to the Hive table. If Hive Writer does not have permissions to write data during data synchronization, you must configure the hdfsUsername parameter for the related data synchronization task in the code editor. | No | No default value |