This topic describes how to archive incremental data in HBase clusters to MaxCompute.
Usage notes
This feature is no longer available for LTS instances purchased after June 16, 2023. If your LTS instance was purchased before June 16, 2023, you can continue to use this feature.
Prerequisites
Lindorm Tunnel Service (LTS) is activated.
An HBase data source is added.
A MaxCompute data source is added.
Supported versions
Self-managed HBase V1.x and HBase V2.x
E-MapReduce HBase
ApsaraDB for HBase Standard Edition, ApsaraDB for HBase Performance-enhanced Edition that runs in cluster mode, and Lindorm
Limits
Incremental data is archived based on HBase write-ahead logs (WALs). Data that is imported by using bulk load bypasses the WAL and therefore cannot be exported.
Lifecycle of log data
If log data is not consumed after you enable the archiving feature, the log data is retained for 48 hours by default. After this period expires, the subscription is automatically canceled and the retained data is automatically deleted.
Log data may fail to be consumed if your LTS cluster is released while a task is still running, or if the synchronization task is suspended.
Submit an archiving task
Log on to the LTS web UI. In the left-side navigation pane, choose Data Export > Incremental Archive to MaxCompute.
Click Create New Job. On the page that appears, select a source HBase cluster and a destination MaxCompute resource package, and specify the HBase tables that you want to export. In this example, data from the wal-test HBase table is archived to MaxCompute in real time.
The columns to be archived are cf1:a, cf1:b, cf1:c, and cf1:d.
The mergeInterval parameter specifies the archiving interval in milliseconds. The default value is 86400000, which is one day.
Specify the mergeStartAt parameter in the yyyyMMddHHmmss format. The value in this example specifies 00:00 on September 30, 2019 as the start time. You can specify a past point in time.
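A mergeStartAt value in the yyyyMMddHHmmss format can be produced with a short Python sketch. The example time below (00:00 on September 30, 2019) matches the one described above; it is for illustration only.

```python
from datetime import datetime

# Build a mergeStartAt value in the yyyyMMddHHmmss format expected by LTS.
start = datetime(2019, 9, 30, 0, 0, 0)
merge_start_at = start.strftime("%Y%m%d%H%M%S")
print(merge_start_at)  # 20190930000000
```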
View the archiving progress of the tables. The Real-time Synchronization Channel section shows the latency and start offset of the log synchronization task. The Table Merge section shows table merging tasks. After the tables are merged, you can query the new partitioned tables in MaxCompute.
Query data in MaxCompute.
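After the tables are merged, you can query one archived partition at a time. The following is a minimal sketch of building such a query; the table name wal_test and the partition column ds are assumptions for illustration. Check the actual names of your merged table and its partition column in the MaxCompute console.

```python
# Hypothetical sketch: build a SQL statement to query one daily partition of
# the merged MaxCompute table. Names below are assumptions, not fixed values.
table = "wal_test"      # the hyphen in "wal-test" is converted to an underscore
partition = "20190930"  # one day's archive, per the default daily mergeInterval
sql = f"SELECT * FROM {table} WHERE ds = '{partition}';"
print(sql)
```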
Parameters
The following code provides an example of the format of exported tables:
hbaseTable/odpsTable {"cols": ["cf1:a|string", "cf1:b|int", "cf1:c|long", "cf1:d|short", "cf1:e|decimal", "cf1:f|double", "cf1:g|float", "cf1:h|boolean", "cf1:i"], "mergeInterval": 86400000, "mergeStartAt": "20191008100547"}
hbaseTable/odpsTable {"cols": ["cf1:a", "cf1:b", "cf1:c"], "mergeStartAt": "20191008000000"}
hbaseTable {"mergeEnabled": false} // No merge operation is performed on the tables.
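The expressions above follow the pattern hbaseTable[/odpsTable] {tbConf}. The following Python sketch parses that pattern; the parse_table_expression helper is illustrative only and is not part of LTS.

```python
import json
import re

def parse_table_expression(expr: str):
    """Parse an exported-table expression of the form
    'hbaseTable[/odpsTable] {tbConf}' into its three parts.
    Illustrative helper, not part of LTS itself."""
    match = re.match(r"^(\S+?)(?:/(\S+))?\s+(\{.*\})\s*$", expr)
    if not match:
        raise ValueError(f"invalid expression: {expr}")
    hbase_table, odps_table, conf = match.groups()
    if odps_table is None:
        # By default, the MaxCompute table reuses the HBase table name, with
        # unsupported characters such as hyphens converted into underscores.
        odps_table = hbase_table.replace("-", "_")
    return hbase_table, odps_table, json.loads(conf)

print(parse_table_expression('wal-test {"mergeStartAt": "20191008000000"}'))
```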
The expression for an exported table consists of three parts: {{hbaseTable}}, {{odpsTable}}, and {{tbConf}}.
{{hbaseTable}}: the name of the source HBase table.
{{odpsTable}}: the name of the destination MaxCompute table. This part is optional. By default, the MaxCompute table has the same name as the HBase table. Characters that are not supported in MaxCompute table names, such as hyphens (-), are converted into underscores (_).
{{tbConf}}: the archiving configuration of the table. The following table describes the parameters supported in the {{tbConf}} part.
Parameter | Description | Example |
cols | Specifies the columns that you want to export and their data types. If no data type is specified for a column, the data in the column is converted into the HexString format by default. | "cols": ["cf1:a", "cf1:b", "cf1:c"] |
mergeEnabled | Specifies whether to convert key-value (KV) tables into wide tables. Default value: true. | "mergeEnabled": false |
mergeStartAt | The start time for table merging. Specify the value of this parameter in the yyyyMMddHHmmss format. You can specify a past point in time. | "mergeStartAt": "20191008000000" |
mergeInterval | The interval at which table merging tasks are performed. Unit: milliseconds. The default value is one day. If the default value is used, data is archived on a daily basis. | "mergeInterval": 86400000 |
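The parameters above can be assembled into a tbConf object. The following is a minimal sketch with example values only; the column list and start time are illustrative, not required settings.

```python
import json

# A sketch of assembling the tbConf part for a daily archive, using the
# parameters described in the table above. All values are examples.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86400000 ms = 1 day, the default interval
tb_conf = {
    "cols": ["cf1:a|string", "cf1:b|int"],  # columns and data types to export
    "mergeEnabled": True,                   # convert KV tables into wide tables
    "mergeStartAt": "20191008000000",       # yyyyMMddHHmmss; may be in the past
    "mergeInterval": MS_PER_DAY,
}
print(json.dumps(tb_conf))
```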