The HBase data source supports both reading from and writing to HBase. This topic describes the synchronization capabilities of the DataWorks HBase data source.
Supported versions
The HBase plugin comes in two types: the standard HBase plugin and the HBase{xx}xsql plugin. The HBase{xx}xsql plugin requires both HBase and Phoenix.
HBase plugin:
This plugin supports HBase 0.94.x, HBase 1.1.x, and HBase 2.x, in both wizard mode and script mode. Use the hbaseVersion parameter to specify the version.
If you are using HBase 0.94.x, set hbaseVersion to 094x for both the Reader and Writer plugins:
"reader": { "hbaseVersion": "094x" }
"writer": { "hbaseVersion": "094x" }
If you are using HBase 1.1.x or HBase 2.x, set hbaseVersion to 11x for both the Reader and Writer plugins:
"reader": { "hbaseVersion": "11x" }
"writer": { "hbaseVersion": "11x" }
The HBase 1.1.x plugin is compatible with HBase 2.0.
HBase{xx}xsql plugin:
HBase20xsql plugin: Supports HBase 2.x and Phoenix 5.x. Supports script mode only.
HBase11xsql plugin: Supports HBase 1.1.x and Phoenix 5.x. Supports script mode only.
The HBase{xx}xsql Writer plugin lets you import data in bulk into an SQL table (Phoenix) in HBase. Phoenix applies data encoding to rowkeys, so writing data directly with the HBase API requires manual data conversion, a complex and error-prone process. The HBase{xx}xsql Writer plugin simplifies this process and offers a straightforward way to import data into an SQL table.
Note: The plugin uses the Phoenix JDBC driver to execute UPSERT statements and write data to the table in batches. Because it operates through this high-level interface, it also synchronously updates the corresponding index table.
Limitations
| HBase Reader | HBase20xsql Reader | HBase11xsql Writer |
| --- | --- | --- |
Supported features
HBase Reader
HBase Reader supports normal mode and multiVersionFixedColumn mode.
normal mode: Treats an HBase table as a standard two-dimensional table and retrieves the latest version of data. For example:
```
hbase(main):017:0> scan 'users'
ROW        COLUMN+CELL
 lisi      column=address:city, timestamp=1457101972764, value=beijing
 lisi      column=address:contry, timestamp=1457102773908, value=china
 lisi      column=address:province, timestamp=1457101972736, value=beijing
 lisi      column=info:age, timestamp=1457101972548, value=27
 lisi      column=info:birthday, timestamp=1457101972604, value=1987-06-17
 lisi      column=info:company, timestamp=1457101972653, value=baidu
 xiaoming  column=address:city, timestamp=1457082196082, value=hangzhou
 xiaoming  column=address:contry, timestamp=1457082195729, value=china
 xiaoming  column=address:province, timestamp=1457082195773, value=zhejiang
 xiaoming  column=info:age, timestamp=1457082218735, value=29
 xiaoming  column=info:birthday, timestamp=1457082186830, value=1987-06-17
 xiaoming  column=info:company, timestamp=1457082189826, value=alibaba
2 row(s) in 0.0580 seconds
```
The following table shows the output; a matching reader column sketch follows it.
| rowKey | address:city | address:contry | address:province | info:age | info:birthday | info:company |
| --- | --- | --- | --- | --- | --- | --- |
| lisi | beijing | china | beijing | 27 | 1987-06-17 | baidu |
| xiaoming | hangzhou | china | zhejiang | 29 | 1987-06-17 | alibaba |
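To read this table in normal mode, list the rowkey and each family:qualifier column in the reader's column parameter. The following fragment is a minimal sketch using the column format from the script demo later in this topic; the names come from the sample users table above:
```json
"parameter": {
  "mode": "normal",                 // read the latest version of each cell
  "table": "users",
  "column": [
    { "name": "rowkey",        "type": "string" },
    { "name": "address:city",  "type": "string" },
    { "name": "info:age",      "type": "long" },
    { "name": "info:birthday", "type": "date", "format": "yyyy-MM-dd" }
  ]
}
```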
multiVersionFixedColumn mode: Treats the HBase table as a vertical table. Each record consists of four columns: rowKey, family:qualifier, timestamp, and value. You must specify the columns to read. This mode treats each cell value as a separate record. If a cell has multiple versions, the mode generates a separate record for each version. For example:
```
hbase(main):018:0> scan 'users',{VERSIONS=>5}
ROW        COLUMN+CELL
 lisi      column=address:city, timestamp=1457101972764, value=beijing
 lisi      column=address:contry, timestamp=1457102773908, value=china
 lisi      column=address:province, timestamp=1457101972736, value=beijing
 lisi      column=info:age, timestamp=1457101972548, value=27
 lisi      column=info:birthday, timestamp=1457101972604, value=1987-06-17
 lisi      column=info:company, timestamp=1457101972653, value=baidu
 xiaoming  column=address:city, timestamp=1457082196082, value=hangzhou
 xiaoming  column=address:contry, timestamp=1457082195729, value=china
 xiaoming  column=address:province, timestamp=1457082195773, value=zhejiang
 xiaoming  column=info:age, timestamp=1457082218735, value=29
 xiaoming  column=info:age, timestamp=1457082178630, value=24
 xiaoming  column=info:birthday, timestamp=1457082186830, value=1987-06-17
 xiaoming  column=info:company, timestamp=1457082189826, value=alibaba
2 row(s) in 0.0260 seconds
```
The following table shows the output; a matching reader configuration sketch follows it.
| rowKey | family:qualifier | timestamp | value |
| --- | --- | --- | --- |
| lisi | address:city | 1457101972764 | beijing |
| lisi | address:contry | 1457102773908 | china |
| lisi | address:province | 1457101972736 | beijing |
| lisi | info:age | 1457101972548 | 27 |
| lisi | info:birthday | 1457101972604 | 1987-06-17 |
| lisi | info:company | 1457101972653 | baidu |
| xiaoming | address:city | 1457082196082 | hangzhou |
| xiaoming | address:contry | 1457082195729 | china |
| xiaoming | address:province | 1457082195773 | zhejiang |
| xiaoming | info:age | 1457082218735 | 29 |
| xiaoming | info:age | 1457082178630 | 24 |
| xiaoming | info:birthday | 1457082186830 | 1987-06-17 |
| xiaoming | info:company | 1457082189826 | alibaba |
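A minimal reader sketch for multiVersionFixedColumn mode, assembled from the mode, maxVersion, and column parameters in the script demo later in this topic. The table and columns come from the sample users table above; whether this mode's column list uses exactly this shape is an assumption:
```json
"parameter": {
  "mode": "multiVersionFixedColumn",  // one output record per cell version
  "table": "users",
  "maxVersion": "-1",                 // -1 reads all stored versions
  "column": [
    { "name": "rowkey",   "type": "string" },
    { "name": "info:age", "type": "long" }
  ]
}
```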
HBase Writer
The HBase Writer can generate a rowKey by concatenating multiple columns from the source (a rowkeyColumn sketch follows this list).
The HBase Writer can set the version (timestamp) of the data in the following ways:
Using the current time.
Using a value from a source column.
Using a user-specified time.
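The following fragment is a minimal rowkeyColumn sketch, using the same shape as the writer script demo later in this topic: each entry references a source column by index, and an index of -1 with a value inserts a fixed string between columns:
```json
"rowkeyColumn": [
  { "index": "0",  "type": "string" },               // first source column
  { "index": "-1", "type": "string", "value": "_" }, // fixed separator string
  { "index": "1",  "type": "string" }                // second source column
]
```
With this configuration, the resulting rowkey would be the first source column, an underscore, then the second source column.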
Supported field types
Batch read
The following table lists the data type mappings for HBase Reader.
| Type | Data Integration column type | Database data type |
| --- | --- | --- |
| Integer | long | short, int, and long |
| Floating-point | double | float and double |
| String | string | binary_string and string |
| Date and Time | date | date |
| Byte | bytes | bytes |
| Boolean | boolean | boolean |
HBase20xsql Reader supports most, but not all, Phoenix data types. Verify that your data types are supported before use.
The following table lists the type mappings used by HBase20xsql Reader for Phoenix data types.
| DataX internal type | Phoenix data type |
| --- | --- |
| long | INTEGER, TINYINT, SMALLINT, and BIGINT |
| double | FLOAT, DECIMAL, and DOUBLE |
| string | CHAR and VARCHAR |
| date | DATE, TIME, and TIMESTAMP |
| bytes | BINARY and VARBINARY |
| boolean | BOOLEAN |
Batch write
The following table lists the data type mappings for HBase Writer.
Ensure that the column configuration matches the corresponding column types in the HBase table. Only the data types listed in the following table are supported.
| Type | Database data type |
| --- | --- |
| Integer | INT, LONG, and SHORT |
| Floating-point | FLOAT and DOUBLE |
| Boolean | BOOLEAN |
| String | STRING |
Considerations
If you receive the error message "tried to access method com.google.common.base.Stopwatch" when testing connectivity, add the hbaseVersion property to the Data Source configuration and specify the HBase version.
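A minimal sketch of the workaround, assuming the data source's connection configuration accepts plugin properties as a JSON fragment; the value shown is an example for HBase 1.1.x or 2.x:
```json
{
  "hbaseVersion": "11x"  // use 094x for HBase 0.94.x
}
```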
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.
Data synchronization tasks
For information about where to configure a synchronization task and how to configure it, see the following configuration guides.
Single-table offline synchronization task
For instructions, see Configure a task in the codeless UI and Configure a task in the code editor.
By default, Wizard Mode does not display the Field Mapping section because HBase is a schemaless data source. You must configure the field mapping manually:
When HBase is the data source, configure Source Field in the following format: data_type|column_family:column_name.
When HBase is the data destination, configure both Destination Field and rowkey. For Destination Field, use the format source_field_index|data_type|column_family:column_name. For rowkey, use the format source_primary_key_index|data_type.
Note: Each field must be on a separate line. Examples follow this note.
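For example, with the sample users table used earlier in this topic (the indexes and column names are illustrative), the field mappings might look like the following:
```
Source Field (HBase as source):
string|info:company
long|info:age

Destination Field (HBase as destination):
0|string|info:company
1|long|info:age

rowkey:
0|string
```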
For a complete list of parameters and script examples in Script Mode, see Appendix: Script demos and parameters.
FAQ
Q: What is the recommended concurrency setting? Does increasing it help if the import is slow?
A: The default JVM Heap Size for the Data Import process is 2 GB. Concurrency (the number of channels) is implemented using multi-threading. However, creating excessive threads does not always improve import speed and can degrade performance due to frequent Garbage Collection (GC). As a best practice, we recommend setting Concurrency (the number of channels) to 5 to 10.
Q: What is the recommended batchSize setting?
A: The default value is 256, but you should calculate the optimal batchSize based on your average row size. As a best practice, aim for a total data size of 2 MB to 4 MB per batch: divide this target size by your average row size to determine the appropriate batchSize. For example, with an average row size of 1 KB, a batchSize between 2,048 and 4,096 keeps each batch in that range.
Appendix: Script demos and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a task in the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
HBase Reader script demo
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"hbase",// The plugin name.
"parameter":{
"mode":"normal",// The mode for reading data from HBase. Valid values: normal and multiVersionFixedColumn.
"scanCacheSize":"256",// The number of rows the client reads from the server per RPC.
"scanBatchSize":"100",// The number of columns the client reads from the server per RPC.
"hbaseVersion":"094x/11x",// The HBase version.
"column":[// The columns to read.
{
"name":"rowkey",// The column name.
"type":"string"// The data type.
},
{
"name":"columnFamilyName1:columnName1",
"type":"string"
},
{
"name":"columnFamilyName2:columnName2",
"format":"yyyy-MM-dd",
"type":"date"
},
{
"name":"columnFamilyName3:columnName3",
"type":"long"
}
],
"range":{// The rowkey range for the HBase Reader.
"endRowkey":"",// The end rowkey.
"isBinaryRowkey":true,// Specifies how to convert startRowkey and endRowkey to byte arrays. The default value is false.
"startRowkey":""// The start rowkey.
},
"maxVersion":"",// The number of versions to read in multi-version mode.
"encoding":"UTF-8",// The encoding format.
"table":"",// The table name.
"hbaseConfig":{// Connection configuration for the HBase cluster, in JSON format.
"hbase.zookeeper.quorum":"hostname",
"hbase.rootdir":"hdfs://ip:port/database",
"hbase.cluster.distributed":"true"
}
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of error records allowed.
},
"speed":{
"throttle":true,// Enables throttling. If true, throttling is enabled based on the mbps value. If false, the mbps parameter is ignored.
"concurrent":1,// The number of concurrent tasks for the job.
"mbps":"12"// The throttling rate. In this example, 1 mbps is equal to 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
HBase Reader parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| haveKerberos | Specifies whether the HBase cluster requires Kerberos authentication. If set to true, Kerberos authentication is used when connecting to the cluster. | No | false |
| hbaseConfig | The connection configuration for the HBase cluster, in JSON format. The hbase.zookeeper.quorum property, which specifies the ZooKeeper (ZK) address of the HBase cluster, is required. You can add other HBase client configurations, such as scan cache and batch settings, to optimize server interaction. Note: If you are connecting to an ApsaraDB for HBase database, use its private address. | Yes | None |
| mode | The mode for reading data from HBase. Valid values: normal and multiVersionFixedColumn. | Yes | None |
| table | The name of the HBase table to read from. This parameter is case-sensitive. | Yes | None |
| encoding | The encoding format used to convert the binary HBase byte[] array to a string. Valid values: UTF-8 and GBK. | No | UTF-8 |
| column | The columns to read from HBase. This parameter is required in both normal and multiVersionFixedColumn modes. | Yes | None |
| maxVersion | The number of versions the HBase Reader reads in multi-version mode. Set to -1 to read all versions, or to an integer greater than 1. | Required in multiVersionFixedColumn mode | None |
| range | Specifies the rowkey range for the HBase Reader: startRowkey, endRowkey, and isBinaryRowkey (see the range sketch after this table). | No | None |
| scanCacheSize | The number of rows the HBase Reader fetches from the server per RPC. | No | 256 |
| scanBatchSize | The number of columns the HBase Reader fetches from the server per RPC. A value of -1 indicates that all columns are returned. Note: To avoid potential data quality issues, set scanBatchSize to a value greater than the actual number of columns. | No | 100 |
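A minimal range sketch, mirroring the range block in the reader script demo above; the rowkey bounds are placeholders:
```json
"range": {
  "startRowkey": "row_000",  // hypothetical start rowkey
  "endRowkey": "row_999",    // hypothetical end rowkey
  "isBinaryRowkey": false    // how the bounds are converted to byte arrays; default false
}
```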
HBase Writer script demo
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"hbase",// The plugin name.
"parameter":{
"mode":"normal",// The mode for writing data to HBase.
"walFlag":"false",// Specifies whether to write to the Write-Ahead Log (WAL). A value of false disables it.
"hbaseVersion":"094x",// The HBase version.
"rowkeyColumn":[// The columns to use for the rowkey.
{
"index":"0",// The serial number.
"type":"string"// The data type.
},
{
"index":"-1",
"type":"string",
"value":"_"
}
],
"nullMode":"skip",// Specifies how to handle null values.
"column":[// The HBase columns to write to.
{
"name":"columnFamilyName1:columnName1",// The column name.
"index":"0",// The index number.
"type":"string"// The data type.
},
{
"name":"columnFamilyName2:columnName2",
"index":"1",
"type":"string"
},
{
"name":"columnFamilyName3:columnName3",
"index":"2",
"type":"string"
}
],
"encoding":"utf-8",// The encoding format.
"table":"",// The table name.
"hbaseConfig":{// Connection configuration for the HBase cluster, in JSON format.
"hbase.zookeeper.quorum":"hostname",
"hbase.rootdir":"hdfs://ip:port/database",
"hbase.cluster.distributed":"true"
}
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of error records allowed.
},
"speed":{
"throttle":true,// Enables throttling. If true, throttling is enabled based on the mbps value. If false, the mbps parameter is ignored.
"concurrent":1, // The number of concurrent tasks for the job.
"mbps":"12"// The throttling rate.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
HBase Writer parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| haveKerberos | Specifies whether the HBase cluster requires Kerberos authentication. If set to true, Kerberos authentication is used when connecting to the cluster. | No | false |
| hbaseConfig | The connection configuration for the HBase cluster, in JSON format. The hbase.zookeeper.quorum property, which specifies the ZooKeeper (ZK) address of the HBase cluster, is required. You can add other HBase client configurations, such as scan cache and batch settings, to optimize server interaction. Note: If you are connecting to an ApsaraDB for HBase database, use its private address. | Yes | None |
| mode | The mode for writing data to HBase. Currently, only normal mode is supported. | Yes | None |
| table | The name of the target HBase table. This parameter is case-sensitive. | Yes | None |
| encoding | The encoding format used to convert a STRING to an HBase byte[] array. Valid values: UTF-8 and GBK. | No | UTF-8 |
| column | The HBase columns to write to. For each column, specify the source field index (index), the HBase column name in column_family:column_name format (name), and the data type (type), as shown in the writer script demo above. | Yes | None |
| rowkeyColumn | Specifies the columns used to construct the rowkey for writing data to HBase. Each entry specifies a source field index and type; an index of -1 with a value inserts a fixed string, as shown in the writer script demo above. | Yes | None |
| versionColumn | Specifies the timestamp for the data written to HBase. You can use the current system time, a value from a source column, or a fixed value. If this parameter is not configured, the current time is used by default. A configuration sketch follows this table. | No | None |
| nullMode | Specifies how to handle null values from the source data. | No | skip |
| walFlag | When a client sends a Put or Delete operation, the data is first written to the Write-Ahead Log (WAL) before being stored in the MemStore. This process ensures data durability. Setting this parameter to false disables the WAL for faster writes at the cost of durability. | No | false |
| writeBufferSize | The size of the HBase client's write buffer, in bytes. This buffer works with the client's autoflush setting. | No | 8 MB |
| fileSystemUsername | If a data synchronization task fails due to Ranger permission issues, switch to Script Mode and set this parameter to a username with the required HBase access permissions. DataWorks then uses this username for the connection. | No | None |
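A hedged versionColumn sketch; the index/value shape is an assumption based on typical DataX writer configurations and is not confirmed by this topic:
```json
"versionColumn": {
  "index": "-1",            // -1: do not take the timestamp from a source column
  "value": "1457101972764"  // hypothetical fixed timestamp in milliseconds
}
```
To take the timestamp from a source column instead, index would presumably point to that column's position, with value omitted.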
HBase20xsql Reader script demo
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"hbase20xsql",// The plugin name.
"parameter":{
"queryServerAddress": "http://127.0.0.1:8765", // The Phoenix QueryServer address.
"serialization": "PROTOBUF", // The QueryServer serialization format.
"table": "TEST", // The table to read.
"column": ["ID", "NAME"], // The columns to read.
"splitKey": "ID" // The split column, which must be the primary key of the table.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of error records allowed.
},
"speed":{
"throttle":true,// Enables throttling. If true, throttling is enabled based on the mbps value. If false, the mbps parameter is ignored.
"concurrent":1,// The number of concurrent tasks for the job.
"mbps":"12"// The throttling rate. In this example, 1 mbps is equal to 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
HBase20xsql Reader parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| queryServerAddress | The HBase20xsql Reader uses a Phoenix lightweight client to connect to the Phoenix QueryServer. Specify the QueryServer address here. For ApsaraDB for HBase enhanced edition (Lindorm), additional connection parameters can be appended to this address. | Yes | None |
| serialization | The serialization protocol used by the QueryServer. | No | PROTOBUF |
| table | The name of the table to read. This parameter is case-sensitive. | Yes | None |
| schema | The schema that contains the table. | No | None |
| column | A JSON array that contains the names of the columns to synchronize. If left empty, all columns are read. | No | All columns |
| splitKey | Specifies a column to use for data sharding. Providing a splitKey enables parallel data synchronization, which improves performance. Two splitting methods are available: if splitPoints is empty, the data is split automatically based on the minimum and maximum values of the split column; otherwise, the data is split at the specified split points (see the sketch after this table). | Yes | None |
| splitPoints | Splitting a column based on its minimum and maximum values can create data hotspots. Therefore, we recommend setting split points based on the startkey and endkey of the regions. This approach aligns each query with a single region, preventing hotspots. | No | None |
| where | The filter condition. You can add a filter to the table query. The HBase20xsql Reader constructs an SQL query based on the specified column, table, and where conditions, and then extracts data based on that query. | No | None |
| querySql | In some use cases, the where parameter may not be sufficient to describe the desired filter conditions. You can use this parameter to define a custom filter SQL query. When querySql is configured, the HBase20xsql Reader ignores the table, column, where, and splitKey parameters. | No | None |
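A hedged sketch of a sharded read configuration, extending the HBase20xsql Reader demo above; the splitPoints values are placeholder region boundaries, and the array form of splitPoints is an assumption not confirmed by this topic:
```json
"parameter": {
  "queryServerAddress": "http://127.0.0.1:8765",
  "table": "TEST",
  "column": ["ID", "NAME"],
  "splitKey": "ID",                              // primary-key column used for sharding
  "splitPoints": ["region1end", "region2end"],   // hypothetical region startkey/endkey boundaries
  "where": "ID > 0"                              // optional filter added to the generated SQL
}
```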
HBase11xsql Writer script demo
{
"type": "job",
"version": "1.0",
"configuration": {
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle":true,// Enables throttling. If true, throttling is enabled based on the mbps value. If false, the mbps parameter is ignored.
"concurrent":1, // The number of concurrent tasks for the job.
"mbps":"1"// The throttling rate. In this example, 1 mbps is equal to 1 MB/s.
}
},
"reader": {
"plugin": "odps",
"parameter": {
"datasource": "",
"table": "",
"column": [],
"partition": ""
}
},
"plugin": "hbase11xsql",
"parameter": {
"table": "The target HBase table name, which is case-sensitive.",
"hbaseConfig": {
"hbase.zookeeper.quorum": "The ZooKeeper server address of the target HBase cluster.",
"zookeeper.znode.parent": "The znode of the target HBase cluster."
},
"column": [
"columnName"
],
"batchSize": 256,
"nullMode": "skip"
}
}
}
HBase11xsql Writer parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| plugin | The name of the plugin. Must be set to hbase11xsql. | Yes | None |
| table | The name of the target Phoenix table for the data import. This parameter is case-sensitive. Phoenix table names are typically in uppercase. | Yes | None |
| column | The column names. This parameter is case-sensitive. Phoenix column names are typically in uppercase. | Yes | None |
| hbaseConfig | The address of the HBase cluster. The ZooKeeper quorum is required. For the format, see the hbaseConfig block in the script demo above. | Yes | None |
| batchSize | The maximum number of rows in a batch write operation. | No | 256 |
| nullMode | Specifies how to handle null values from the source data. | No | skip |