Tablestore is a NoSQL database service that is built on top of the Apsara distributed operating system. DataWorks provides Tablestore Reader and Tablestore Writer for you to read data from and write data to Tablestore data sources. This topic describes the capabilities of synchronizing data from or to Tablestore data sources.
Limits
You can use Tablestore Reader to read data from and Tablestore Writer to write data to Tablestore data sources. Data can be read or written in row or column mode. You can read data from and write data to wide tables and time series tables in one of the two modes.
Column mode: If you set the mode parameter to multiVersion, data is read or written in column mode, and table data is organized in a three-level structure: row, column, and version. One row can have multiple columns, and the names of the columns are not fixed. Each column can have multiple versions, and each version has a specific timestamp, which is the version number. In column mode, data that you want to read is in the four-tuple format that includes the primary key value, column name, timestamp, and column value. Data that you want to write is also in the four-tuple format that includes the primary key value, column name, timestamp, and column value.
Row mode: If you set the mode parameter to normal, data is read or written in row mode. Tablestore Reader reads operation records as rows. Data in the rows is in the format that includes the primary key value and column value.
In row mode, each row of data is equivalent to a data record in a Tablestore table. Data that you want to write to a destination consists of two parts: values of primary key columns and values of common columns.
A table in Tablestore consists of primary key columns and common columns. For a data synchronization task that synchronizes data to Tablestore, the order of columns in the source table must be the same as the order of columns in the destination table in Tablestore. Otherwise, a column mapping error occurs.
Tablestore Reader evenly splits data that you want to read from a Tablestore table based on the data read range and the number of parallel threads that you specified for a data synchronization task. Each Tablestore Reader thread runs a thread of the synchronization task.
Data type mappings
Tablestore Reader and Tablestore Writer support all data types of Tablestore tables. The following table lists the data type mappings based on which Tablestore Reader or Tablestore Writer converts data types.
Category | Tablestore data type |
Integer | INTEGER |
Floating point | DOUBLE |
String | STRING |
Boolean | BOOLEAN |
Binary | BINARY |
Tablestore does not support the DATE data type. The application layer uses the LONG-type UNIX timestamp to indicate time.
To write data of the INTEGER type, set the data type to INT in the code editor. During data synchronization, Tablestore Writer converts the data from the INT data type to the INTEGER data type. If you set the data type to INTEGER, an error is reported in the log, and the synchronization task fails.
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Appendix: Code and parameters
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for Tablestore Reader
Tablestore Reader configurations for reading data from wide tables in row mode
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"ots",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"newVersion":"true",// Specifies whether to use the latest version of Tablestore Reader.
"mode": "normal",// The mode in which you want to read data. Set this parameter to normal.
"isTimeseriesTable":"false",// Specifies whether to configure the Tablestore table as a wide table.
"column":[// The names of the columns.
{
"name":"column1"// The name of the column.
},
{
"name":"column2"
},
{
"name":"column3"
},
{
"name":"column4"
},
{
"name":"column5"
}
],
"range":{
"split":[
{
"type":"STRING",
"value":"beginValue"
},
{
"type":"STRING",
"value":"splitPoint1"
},
{
"type":"STRING",
"value":"splitPoint2"
},
{
"type":"STRING",
"value":"splitPoint3"
},
{
"type":"STRING",
"value":"endValue"
}
],
"end":[
{
"type":"STRING",
"value":"endValue"
},
{
"type":"INT",
"value":"100"
},
{
"type":"INF_MAX"
},
{
"type":"INF_MAX"
}
],
"begin":[
{
"type":"STRING",
"value":"beginValue"
},
{
"type":"INT",
"value":"0"
},
{
"type":"INF_MIN"
},
{
"type":"INF_MIN"
}
]
},
"table":""// The name of the table.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1 // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Tablestore Reader configurations for reading data from time series tables in row mode
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"ots",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"table": "",// The name of the table.
// If you read data from a time series table, you must set the mode parameter to normal.
"mode": "normal",
// If you read data from a time series table, you must set the newVersion parameter to true.
"newVersion": "true",
// Set this parameter to true.
"isTimeseriesTable":"true",
// measurementName: The name of a physical quantity or metric for data in a time series table. This parameter is optional. If this parameter is left empty, all data in the table is read.
"measurementName":"measurement_1",
"column": [
{
"name": "_m_name"
},
{
"name": "tagA",
"is_timeseries_tag":"true"
},
{
"name": "double_0",
"type":"DOUBLE"
},
{
"name": "string_0",
"type":"STRING"
},
{
"name": "long_0",
"type":"INT"
},
{
"name": "binary_0",
"type":"BINARY"
},
{
"name": "bool_0",
"type":"BOOL"
},
{
"type":"STRING",
"value":"testString"
}
]
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1 // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Tablestore Reader configurations for reading data from wide tables in column mode
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"ots",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"table":"",// The name of the table.
"newVersion":"true",// Specifies whether to use the latest version of Tablestore Reader.
"mode": "multiversion",// The mode in which you want to read data. Set this parameter to multiVersion.
"column":[// The name of the column from which you want to read data. The column cannot be a primary key column.
{"name":"mobile"},
{"name":"name"},
{"name":"age"},
{"name":"salary"},
{"name":"marry"}
],
"range":{// The range of data that you want to read.
"begin":[
{"type":"INF_MIN"},
{"type":"INF_MAX"}
],
"end":[
{"type":"INF_MAX"},
{"type":"INF_MIN"}
],
"split":[
]
},
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1 // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Common parameters in code for Tablestore Reader
Parameter | Description | Required | Default value |
endpoint | The endpoint of the Tablestore server. For more information, see Endpoints. | Yes | No default value |
accessId | The AccessKey ID of the account that you use to connect to the Tablestore server. | Yes | No default value |
accessKey | The AccessKey secret of the account that you use to connect to the Tablestore server. | Yes | No default value |
instanceName | The name of the Tablestore instance. The instance is an entity for you to use and manage Tablestore. After you activate Tablestore, you must create an instance in the Tablestore console before you can create and manage tables. Tablestore instances are the basic units that you can use to manage your Tablestore resources. Access control and resource measurement for applications are performed at the instance level. | Yes | No default value |
table | The name of the table from which you want to read data. You can specify only one table name. Multi-table synchronization is not required for Tablestore. | Yes | No default value |
newVersion | Specifies whether to use the latest version of Tablestore Reader.
The latest version of Tablestore Reader provides new features and consumes less system resources. Therefore, we recommend that you use the latest version of Tablestore Reader. The settings of the latest version of Tablestore Reader are compatible with the settings of an earlier version of Tablestore Reader. If you add the newVersion=true setting to the configuration of a task in which an earlier version of Tablestore Reader is used, the task can be run as expected. | No | false |
mode | The mode in which you want to read data. Valid values:
This setting takes effect only for Tablestore Reader for which the newVersion parameter is set to true. An earlier version of Tablestore Reader ignores the setting of the mode parameter, and you can read data only in row mode. | No | normal |
isTimeseriesTable | Specifies whether the table from which you want to read data is a time series table.
This setting takes effect only for Tablestore Reader for which the newVersion parameter is set to true and the mode parameter is set to normal. An earlier version of Tablestore Reader does not support time series tables, and you cannot read data from time series tables in column mode. | No | false |
Additional parameters in code for Tablestore Reader
You can use Tablestore Reader to read data from wide tables in row mode, time series tables in row mode, and wide tables in column mode. The following table describes the additional parameters for data synchronization in different modes.
Parameters for reading data from wide tables in row mode
Parameter | Description | Required | Default value |
column | The names of the columns that you want to synchronize. Specify the names in a JSON array. Tablestore is a NoSQL database service. You must specify column names for Tablestore Reader to read data.
| Yes | No default value |
begin and end | The parameters specify the range of the data that you want to read. The begin and end parameters specify the data read range for primary key columns in the Tablestore table. If you do not need to limit a range, specify the begin and end parameters as Note
For example, you read data from a Tablestore table with three primary key columns
| No | (INF_MIN,INF_MAX) |
split | The custom rule for data sharding. This parameter is an advanced configuration item. We recommend that you do not configure this parameter in common scenarios. You can configure this parameter to customize the data range of shards. If data is unevenly distributed in a Tablestore table, you can customize a sharding rule. Sample configuration:
Data is split into six segments when a data synchronization task is run and the data synchronization task uses Tablestore Reader to read the data by using parallel threads. We recommend that the number of data segments is greater than the number of threads that can be run in parallel for the data synchronization task.
| No | If you do not configure the split parameter, the automatic splitting logic is used. The automatic splitting logic identifies the maximum value and minimum value of a partition key column, and evenly splits data into segments. The INTEGER and STRING data types are supported for data in a partition key column. Exact division is used to split data of the INTEGER data type in a partition key column into segments. The Unicode of the first character is used to split data of the STRING data type in a partition key column into segments. |
Parameters for reading data from time series tables in row mode
Parameter | Description | Required | Default value |
column | The columns are organized in an array structure. Each element in an array represents a column from which you want to read data. Constant columns and common columns can be configured. For constant columns, you must configure the following fields:
For common columns, you must configure the following fields:
Sample script for reading data from four columns:
| Yes | No default value |
measurementName | The name of a physical quantity or metric for data in a time series table. If this parameter is left empty, all data in the table is read. | No | No default value |
timeRange | The time range of the data that you want to read, which is indicated by [begin,end). It is a left-closed, right-open interval. The value of the begin parameter must be less than that of the end parameter. The timestamp is measured in milliseconds. Format:
| No | All versions |
Parameters for reading data from wide tables in column mode
Parameter | Description | Required | Default value |
column | The name of the column from which you want to read data. You can read data only from common columns in column mode. Format:
Note
| Yes | All columns |
range | The range of data that you want to read, which is indicated by [begin,end). It is a left-closed, right-open interval. Take note of the following items about the range:
Data types specified by the type parameter:
Format:
| No | All data |
timeRange | The time range of the data that you want to read, which is indicated by [begin,end). It is a left-closed, right-open interval. The value of the begin parameter must be less than that of the end parameter. The timestamp is measured in milliseconds. Format:
| No | All versions |
maxVersion | The maximum number of data versions that you can read. Valid values: 1 to INT32_MAX. | No | All versions |
Appendix: Code and parameters
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for Tablestore Writer
Tablestore Writer configurations for writing data to wide tables in row mode
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"ots",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"table":"",// The name of the table.
"newVersion":"true",// Specifies whether to use the latest version of Tablestore Writer.
"mode": "normal",// The mode in which you want to write data. Set this parameter to normal.
"isTimeseriesTable":"false",// Specifies whether to configure the Tablestore table as a wide table.
"primaryKey" : [// The primary keys of the destination table in Tablestore.
{"name":"gid", "type":"INT"},
{"name":"uid", "type":"STRING"}
],
"column" : [// The names of the columns.
{"name":"col1", "type":"INT"},
{"name":"col2", "type":"DOUBLE"},
{"name":"col3", "type":"STRING"},
{"name":"col4", "type":"STRING"},
{"name":"col5", "type":"BOOL"}
],
"writeMode" : "PutRow" // The write mode.
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1, // The maximum number of parallel threads.
"mbps":"12",// The maximum transmission rate. Unit: MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Tablestore Writer configurations for writing data to time series tables in row mode
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"ots",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"table": "testTimeseriesTableName01",
"mode": "normal",
"newVersion": "true",
"isTimeseriesTable":"true",
"timeunit":"microseconds",
"column": [
{
"name": "_m_name"
},
{
"name": "_data_source",
},
{
"name": "_tags",
},
{
"name": "_time",
},
{
"name": "string_1",
"type":"string"
},
{
"name":"tag3",
"is_timeseries_tag":"true",
}
]
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1, // The maximum number of parallel threads.
"mbps":"12",// The maximum transmission rate. Unit: MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Tablestore Writer configurations for writing data to wide tables in column mode
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"ots",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"table":"",
"newVersion":"true",
"mode":"multiVersion",
"primaryKey" : [
"gid",
"uid"
]
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},x`
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1, // The maximum number of parallel threads.
"mbps":"12",// The maximum transmission rate. Unit: MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Common parameters in code for Tablestore Writer
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
endPoint | The endpoint of the Tablestore server. For more information, see Endpoints. | Yes | No default value |
accessId | The AccessKey ID of the account that you use to connect to the Tablestore server. | Yes | No default value |
accessKey | The AccessKey secret of the account that you use to connect to the Tablestore server. | Yes | No default value |
instanceName | The name of the Tablestore instance. The instance is an entity for you to use and manage Tablestore. After you activate Tablestore, you must create an instance in the Tablestore console before you can create and manage tables. Tablestore instances are the basic units that you can use to manage your Tablestore resources. Access control and resource measurement for applications are performed at the instance level. | Yes | No default value |
table | The name of the table to which you want to write data. You can specify only one table name. Multi-table synchronization is not required for Tablestore. | Yes | No default value |
newVersion | Specifies whether to use the latest version of Tablestore Writer.
The latest version of Tablestore Writer provides new features and consumes less system resources. Therefore, we recommend that you use the latest version of Tablestore Writer. The settings of the latest version of Tablestore Writer are compatible with the settings of an earlier version of Tablestore Writer. If you add the newVersion=true setting to a task in which an earlier version of Tablestore Writer is used, the task can be run as expected. | Yes | false |
mode | The mode in which you want to write data. Valid values:
This setting takes effect only for Tablestore Writer for which the newVersion parameter is set to true. An earlier version of Tablestore Writer ignores the setting of the mode parameter, and you can write data only in row mode. | No | normal |
isTimeseriesTable | Specifies whether to write data to a time series table.
This setting takes effect only for Tablestore Writer for which the newVersion parameter is set to true and the mode parameter is set to normal. In column mode, time series tables are not compatible. | No | false |
Additional parameters in code for Tablestore Writer
You can use Tablestore Writer to write data to wide tables in row mode, write data to time series tables in row mode, and write data to wide tables in column mode. The following table describes the additional parameters for data synchronization in different modes.
Parameters for writing data to wide tables in row mode
Parameter | Description | Required | Default value |
primaryKey | The primary keys of the destination table in Tablestore. Specify the primary keys in a JSON array. Tablestore is a NoSQL database service. You must specify column names for Tablestore Writer to write data. Data Integration supports data type conversion. Tablestore Writer can convert data from a data type other than STRING or INT to the STRING or INT data type. Sample configuration:
Note The primary keys in Tablestore must be of the STRING or INT type. Therefore, you must set the data type of a primary key to STRING or INT in the code editor for Tablestore Writer. | Yes | No default value |
column | The names of the columns that you want to synchronize. Specify the names in a JSON array. Sample configuration:
The name parameter specifies the name of the column to which you want to write data. The type parameter specifies the data type of values in the column. Tablestore supports the following data types: STRING, INT, DOUBLE, BOOLEAN, and BINARY. Note Constants, functions, or custom statements are not supported during the data write process. | Yes | No default value |
writeMode | The write mode. Valid values:
| Yes | No default value |
enableAutoIncrement | Specifies whether you can write data to a Tablestore table that contains auto-increment primary key columns.
| No | false |
requestTotalSizeLimitation | The maximum size of data that can be written to a single row in Tablestore. The parameter value must be of a numeric data type. | No | 1MB |
attributeColumnSizeLimitation | The maximum size of data that can be written to a single attribute column in Tablestore. The parameter value must be of a numeric data type. | No | 2MB |
primaryKeyColumnSizeLimitation | The maximum size of data that can be written to a single primary key column in Tablestore. The parameter value must be of a numeric data type. | No | 1KB |
attributeColumnMaxCount | The maximum number of attribute columns to which data can be written in Tablestore. The parameter value must be of a numeric data type. | No | 1,024 |
Parameters for writing data to time series tables in row mode
Parameter | Description | Required | Default value |
column | Each element in the column parameter corresponds to a field in timer series data. You can configure the following parameters for each element:
The name of a physical quantity or metric and the timestamp for time series data cannot be left empty. You must configure the Example: The following sample code shows that a single data record that you want to write contains six fields:
Configuration details:
After the data record is written to Tablestore, check the result in the Tablestore console. | Yes | No default value |
timeunit | The unit of the timestamp specified by the _time field. Valid values: NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS, and MINUTES. | No | MICROSECONDS |
Parameters for writing data to wide tables in column mode
Parameter | Description | Required | Default value |
primaryKey | The primary key columns of the table. To save configuration costs, you do not need to specify the position of the primaryKey column for rows in the destination table. However, the format of rows must be fixed. The primaryKey column must be placed at the beginning for rows in the table and the primaryKey column is followed by the columnName column. Format of rows: For example, the following data records need to be written:
Sample configuration:
Result of data write to wide tables:
| Yes | No default value |
columnNamePrefixFilter | The column name prefix filter. The names of the columns for data that is read from HBase consist of cf and qulifier. However, Tablestore does not support cf. If you want to write data that is read from HBase to Tablestore, cf must be filtered out. Sample configuration: Note
| No | No default value |
FAQ
Question 1
Q: How do I configure Tablestore Writer to write data to a destination table that contains auto-increment primary key columns?
Take note of the following items when you configure Tablestore Writer:
"newVersion": "true", "enableAutoIncrement": "true",
When you configure Tablestore Writer, you do not need to specify the names of auto-increment primary key columns.
When you configure Tablestore Writer, the sum of the number of primary key columns and the number of column columns must be equal to the number of columns in the table from which you read data by using Tablestore Reader.
Question 2
Q: What do the _tags
and is_timeseries_tag
fields mean in a time series model?
Example: A data record has three tags: phone=xiaomi,RAM=8G,camera=LEICA.
Example for using Tablestore Reader to read data
If you want to combine the preceding tags and read the tags as one column, configure the following setting:
"column": [ { "name": "_tags", } ],
DataWorks reads the tags as one column. Example:
["phone=xiaomi","camera=LEICA","RAM=8G"]
If you want to read the
phone
tag and thecamera
tag as two separate columns, configure the following setting:"column": [ { "name": "phone", "is_timeseries_tag":"true", }, { "name": "camera", "is_timeseries_tag":"true", } ],
The following two columns of data are obtained:
xiaomi, LEICA
Example for using Tablestore Writer to write data
For example, the following two columns of data exist in the source:
["phone=xiaomi","camera=LEICA","RAM=8G"]
6499
You want the two columns of data to be written to the tag field in the following format.Configure the following setting:
"column": [ { "name": "_tags", }, { "name": "price", "is_timeseries_tag":"true", }, ],
The configuration for the first column enables
["phone=xiaomi","camera=LEICA","RAM=8G"]
to be written to the tag field.The configuration for the second column enables
price=6499
to be written to the tag field.