Lindorm provides a compute engine service named Lindorm Distributed Processing System (LDPS). After LDPS is activated for a Lindorm instance, a Lindorm Change Data Capture (CDC) data source is assigned to the Lindorm instance. Changes in data stored in other engine services that are activated for the Lindorm instance are synchronized to the CDC data source. You can use Spark SQL to query these data changes from the CDC data source.
Prerequisites
- Lindorm Tunnel Service (LTS) is activated for your Lindorm instance. For more information, see Activate and log on to LTS.
- A subscription channel is created for LindormTable. For more information, see Create a Pull channel for data subscription.
  Note When you create a subscription channel, take note of the following points:
  - Do not select Ignore family prefix for column name in message.
  - Select json for the Serialize Type parameter.
  - One topic name corresponds to only one Lindorm table name.
- The LINDORM_HBASE_CATALOG attribute is configured for your HBase table. For more information, see Access data in wide tables.
  Note The LINDORM_HBASE_CATALOG attribute specifies the mapping between a Spark SQL schema and the schema of the HBase table. The Lindorm CDC data source extracts the schema of the HBase table based on the value of this attribute.
Limits
- Only HBase tables are supported. HBase tables are tables whose data is written to LindormTable by using HBase clients.
- The real-time change tracking feature allows you to consume only data in the JSON format.
How to submit a job
You can use one of the following methods to write and submit a Spark job for a Lindorm CDC data source:
Note For information about the syntax that is used to read data from and write data to a Lindorm CDC data source, see Configure a Lindorm CDC data source.
Configure a Lindorm CDC data source
Table schemas and database schemas of the Lindorm CDC data source
- The name of the Lindorm CDC data source provided by LDPS is lindorm_cdc.
- You cannot manage namespaces in the Lindorm CDC data source. You can manage only tables in the Lindorm CDC data source. The tables in the Lindorm CDC data source use the same names as the topics that you specified when you created data subscription channels.
Schemas of the Lindorm CDC data source
The Lindorm CDC data source extracts the schemas of HBase tables based on the LINDORM_HBASE_CATALOG attribute and uses the extracted schemas as the schemas of the Lindorm CDC data source. The Lindorm CDC data source reads data from Kafka. Each operation record is saved. The following table describes the meta fields that are supported in the schemas of the Lindorm CDC data source.
Field | Type | Description | Configuration |
---|---|---|---|
_cdc_timestamp_kafka | long | The timestamp when the operation record was written to Kafka. Unit: milliseconds. | No configuration is required. The field is included in the schema by default. |
_cdc_operation_type | string | The change type of the operation record. | No configuration is required. The field is included in the schema by default. |
_cdc_timestamp_lindorm | long | The timestamp when the operation record was processed by a Lindorm engine service other than LDPS. Unit: milliseconds. | spark.sql.catalog.lindorm_cdc.lindormTsEnabled |
_cdc_timestamp_lts | long | The timestamp when the operation record was processed by LTS. Unit: milliseconds. | spark.sql.catalog.lindorm_cdc.ltsTsEnabled |
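As a rough illustration of how these meta fields can be read, the following Spark SQL sketch selects the change type and timestamps of a few records. It assumes a table named test (a hypothetical topic name), assumes that the current catalog is already lindorm_cdc, and assumes that both optional timestamp fields have been enabled. The column aliases are illustrative only, and additional restrictions on SELECT may apply.

```sql
-- Assumptions: `test` is a hypothetical topic (table) name, and
-- lindormTsEnabled and ltsTsEnabled are both set to true so that
-- _cdc_timestamp_lindorm and _cdc_timestamp_lts exist in the schema.
SELECT
  _cdc_operation_type,
  -- The meta timestamps are in milliseconds; from_unixtime expects seconds.
  from_unixtime(_cdc_timestamp_kafka DIV 1000)   AS kafka_time,
  -- Differences between the processing timestamps, in milliseconds.
  _cdc_timestamp_lts - _cdc_timestamp_lindorm    AS lts_minus_lindorm_ms,
  _cdc_timestamp_kafka - _cdc_timestamp_lts      AS kafka_minus_lts_ms
FROM test
LIMIT 10;
```

If the optional timestamp fields are not enabled, drop the two difference columns and keep only the default meta fields.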
Configuration items of the Lindorm CDC data source
The following table describes the configuration items of the Lindorm CDC data source.
Configuration item | Required | Description | Example |
---|---|---|---|
spark.sql.catalog.lindorm_cdc.username | | The username that is used to connect to LindormTable. | root (default username) |
spark.sql.catalog.lindorm_cdc.password | | The password that is used to connect to LindormTable. | root (default password) |
spark.sql.catalog.lindorm_cdc.lindormTsEnabled | No | Specifies whether the schema includes the timestamp at which Lindorm processed the operation record. Default value: false. If you set this parameter to true, the _cdc_timestamp_lindorm field is added to the schema of the Lindorm CDC data source. | true |
spark.sql.catalog.lindorm_cdc.ltsTsEnabled | No | Specifies whether the schema includes the timestamp at which LTS processed the operation record. Default value: false. If you set this parameter to true, the _cdc_timestamp_lts field is added to the schema of the Lindorm CDC data source. | true |
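The configuration items above could be set as follows. This is a minimal sketch that assumes the Spark SQL session honors SET for catalog options before the lindorm_cdc data source is first accessed. If your job submission method does not support this, pass the same keys and values as job-level Spark configuration instead. The values shown are the example values from the table.

```sql
-- Assumption: session-level SET is applied before lindorm_cdc is first used.
-- Credentials for LindormTable (example values from the table above).
SET spark.sql.catalog.lindorm_cdc.username=root;
SET spark.sql.catalog.lindorm_cdc.password=root;

-- Optional: add _cdc_timestamp_lindorm and _cdc_timestamp_lts to the schema.
SET spark.sql.catalog.lindorm_cdc.lindormTsEnabled=true;
SET spark.sql.catalog.lindorm_cdc.ltsTsEnabled=true;
```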
Statements that are supported for the Lindorm CDC data source
The following table describes the statements that can be executed on the Lindorm CDC data source.
Statement | Description | Example |
---|---|---|
USE table_name | Uses a specified table. | USE test |
SHOW TABLES | Views all tables. | SHOW TABLES |
DESCRIBE table_name | Views the details of a specified table. | DESC test or DESCRIBE test |
SELECT | For more information about the SELECT statement, see Spark SQL. Note When you execute the SELECT statement, take note of the following items: |
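Put together, a short exploratory session might look like the following sketch. It assumes that the current catalog is already lindorm_cdc and that a subscription channel with the hypothetical topic name test exists. The aggregate query is only an illustration and makes no assumption about which change type values appear.

```sql
-- Assumption: `test` is a hypothetical topic (table) name.
-- List the tables, one per subscription topic.
SHOW TABLES;

-- Inspect the schema extracted from LINDORM_HBASE_CATALOG plus the meta fields.
DESCRIBE test;

-- Count the captured operation records per change type.
SELECT _cdc_operation_type, count(*) AS change_count
FROM test
GROUP BY _cdc_operation_type;
```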