All Products
Search
Document Center

DataWorks:Tablestore data source

最終更新日:Apr 11, 2024

Tablestore is a NoSQL database service that is built on top of the Apsara distributed operating system. DataWorks provides Tablestore Reader and Tablestore Writer for you to read data from and write data to Tablestore data sources. This topic describes the capabilities of synchronizing data from or to Tablestore data sources.

Limits

  • You can use Tablestore Reader to read data from and Tablestore Writer to write data to Tablestore data sources. Data can be read or written in row or column mode. You can read data from and write data to wide tables and time series tables in one of the two modes.

    • Column mode: If you set the mode parameter to multiVersion, data is read or written in column mode, and table data is organized in a three-level structure: row, column, and version. One row can have multiple columns, and the names of the columns are not fixed. Each column can have multiple versions, and each version has a specific timestamp, which is the version number. In column mode, data that you want to read is in the four-tuple format that includes the primary key value, column name, timestamp, and column value. Data that you want to write is also in the four-tuple format that includes the primary key value, column name, timestamp, and column value.

    • Row mode: If you set the mode parameter to normal, data is read or written in row mode. Tablestore Reader reads operation records as rows. Data in the rows is in the format that includes the primary key value and column value.

      In row mode, each row of data is equivalent to a data record in a Tablestore table. Data that you want to write to a destination consists of two parts: values of primary key columns and values of common columns.

  • A table in Tablestore consists of primary key columns and common columns. For a data synchronization task that synchronizes data to Tablestore, the order of columns in the source table must be the same as the order of columns in the destination table in Tablestore. Otherwise, a column mapping error occurs.

  • Tablestore Reader evenly splits data that you want to read from a Tablestore table based on the data read range and the number of parallel threads that you specified for a data synchronization task. Each Tablestore Reader thread runs a thread of the synchronization task.

Data type mappings

Tablestore Reader and Tablestore Writer support all data types of Tablestore tables. The following table lists the data type mappings based on which Tablestore Reader or Tablestore Writer converts data types.

Category

Tablestore data type

Integer

INTEGER

Floating point

DOUBLE

String

STRING

Boolean

BOOLEAN

Binary

BINARY

Note
  • Tablestore does not support the DATE data type. The application layer uses the LONG-type UNIX timestamp to indicate time.

  • To write data of the INTEGER type, set the data type to INT in the code editor. During data synchronization, Tablestore Writer converts the data from the INT data type to the INTEGER data type. If you set the data type to INTEGER, an error is reported in the log, and the synchronization task fails.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.

Add a data source

Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.

Code for Tablestore Reader

Tablestore Reader configurations for reading data from wide tables in row mode

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"ots",// The plug-in name. 
            "parameter":{
                "datasource":"",// The name of the data source. 
                "newVersion":"true",// Specifies whether to use the latest version of Tablestore Reader.
                "mode": "normal",// The mode in which you want to read data. Set this parameter to normal.
                "isTimeseriesTable":"false",// Specifies whether to configure the Tablestore table as a wide table.
                "column":[// The names of the columns. 
                    {
                        "name":"column1"// The name of the column. 
                    },
                    {
                        "name":"column2"
                    },
                    {
                        "name":"column3"
                    },
                    {
                        "name":"column4"
                    },
                    {
                        "name":"column5"
                    }
                ],
                "range":{
                    "split":[
                        {
                            "type":"STRING",
                            "value":"beginValue"
                        },
                        {
                            "type":"STRING",
                            "value":"splitPoint1"
                        },
                        {
                            "type":"STRING",
                            "value":"splitPoint2"
                        },
                        {
                            "type":"STRING",
                            "value":"splitPoint3"
                        },
                        {
                            "type":"STRING",
                            "value":"endValue"
                        }
                    ],
                    "end":[
                        {
                            "type":"STRING",
                            "value":"endValue"
                        },
                        {
                            "type":"INT",
                            "value":"100"
                        },
                        {
                            "type":"INF_MAX"
                        },
                        {
                            "type":"INF_MAX"
                        }
                    ],
                    "begin":[
                        {
                            "type":"STRING",
                            "value":"beginValue"
                        },
                        {
                            "type":"INT",
                            "value":"0"
                        },
                        {
                            "type":"INF_MIN"
                        },
                        {
                            "type":"INF_MIN"
                        }
                    ]
                },
                "table":""// The name of the table. 
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1 // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Tablestore Reader configurations for reading data from time series tables in row mode

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"ots",// The plug-in name. 
            "parameter":{
                "datasource":"",// The name of the data source. 
                "table": "",// The name of the table.
                // If you read data from a time series table, you must set the mode parameter to normal.
                "mode": "normal",
                // If you read data from a time series table, you must set the newVersion parameter to true.
                "newVersion": "true",
                // Set this parameter to true.
                "isTimeseriesTable":"true",
                // measurementName: The name of a physical quantity or metric for data in a time series table. This parameter is optional. If this parameter is left empty, all data in the table is read.
                "measurementName":"measurement_1",
                "column": [
                  {
                    "name": "_m_name"
                  },
                  {
                    "name": "tagA",
                    "is_timeseries_tag":"true"
                  },
                  {
                    "name": "double_0",
                    "type":"DOUBLE"
                  },
                  {
                    "name": "string_0",
                    "type":"STRING"
                  },
                  {
                    "name": "long_0",
                    "type":"INT"
                  },
                  {
                    "name": "binary_0",
                    "type":"BINARY"
                  },
                  {
                    "name": "bool_0",
                    "type":"BOOL"
                  },
                  {
                    "type":"STRING",
                    "value":"testString"
                  }
                ]
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1 // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Tablestore Reader configurations for reading data from wide tables in column mode

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"ots",// The plug-in name. 
            "parameter":{
                "datasource":"",// The name of the data source. 
                "table":"",// The name of the table.
                "newVersion":"true",// Specifies whether to use the latest version of Tablestore Reader.
                "mode": "multiversion",// The mode in which you want to read data. Set this parameter to multiVersion.
                "column":[// The name of the column from which you want to read data. The column cannot be a primary key column.
                    {"name":"mobile"},
                    {"name":"name"},
                    {"name":"age"},
                    {"name":"salary"},
                    {"name":"marry"}
                ],
                "range":{// The range of data that you want to read.
                    "begin":[
                        {"type":"INF_MIN"},
                        {"type":"INF_MAX"}
                    ],
                    "end":[
                        {"type":"INF_MAX"},
                        {"type":"INF_MIN"}
                    ],
                    "split":[
                    ]
                },

            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1 // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Common parameters in code for Tablestore Reader

Parameter

Description

Required

Default value

endpoint

The endpoint of the Tablestore server. For more information, see Endpoints.

Yes

No default value

accessId

The AccessKey ID of the account that you use to connect to the Tablestore server.

Yes

No default value

accessKey

The AccessKey secret of the account that you use to connect to the Tablestore server.

Yes

No default value

instanceName

The name of the Tablestore instance. The instance is an entity for you to use and manage Tablestore.

After you activate Tablestore, you must create an instance in the Tablestore console before you can create and manage tables.

Tablestore instances are the basic units that you can use to manage your Tablestore resources. Access control and resource measurement for applications are performed at the instance level.

Yes

No default value

table

The name of the table from which you want to read data. You can specify only one table name. Multi-table synchronization is not required for Tablestore.

Yes

No default value

newVersion

Specifies whether to use the latest version of Tablestore Reader.

  • false: An earlier version of Tablestore Reader is used. In this case, if you want to read data from wide tables, you must set the mode parameter to normal.

  • true: The latest version of Tablestore Reader is used. In this case, you can read data from time series tables and wide tables in row or column mode.

The latest version of Tablestore Reader provides new features and consumes less system resources. Therefore, we recommend that you use the latest version of Tablestore Reader.

The settings of the latest version of Tablestore Reader are compatible with the settings of an earlier version of Tablestore Reader. If you add the newVersion=true setting to the configuration of a task in which an earlier version of Tablestore Reader is used, the task can be run as expected.

No

false

mode

The mode in which you want to read data. Valid values:

  • normal: You read data in row mode. The data format is {Primary key column value,Common column value}.

  • multiVersion: You read data in column mode. The data format is {Primary key value,Column name,Timestamp,Column value}.

This setting takes effect only for Tablestore Reader for which the newVersion parameter is set to true.

An earlier version of Tablestore Reader ignores the setting of the mode parameter, and you can read data only in row mode.

No

normal

isTimeseriesTable

Specifies whether the table from which you want to read data is a time series table.

  • false: The table is a wide table.

  • true: The table is a time series table.

This setting takes effect only for Tablestore Reader for which the newVersion parameter is set to true and the mode parameter is set to normal.

An earlier version of Tablestore Reader does not support time series tables, and you cannot read data from time series tables in column mode.

No

false

Additional parameters in code for Tablestore Reader

You can use Tablestore Reader to read data from wide tables in row mode, time series tables in row mode, and wide tables in column mode. The following table describes the additional parameters for data synchronization in different modes.

Parameters for reading data from wide tables in row mode

Parameter

Description

Required

Default value

column

The names of the columns that you want to synchronize. Specify the names in a JSON array. Tablestore is a NoSQL database service. You must specify column names for Tablestore Reader to read data.

  • You can specify common columns. For example, you can specify {"name":"col1"} for Tablestore Reader to read data from column 1.

  • You can specify specific columns. Tablestore Reader reads data only from the specified columns.

  • You can specify constant columns. For example, you can specify {"type":"STRING", "value":"DataX"} for Tablestore Reader to read data from the column in which data is of the STRING type and the value is DataX. The type parameter specifies the data type of the constant. The supported data types are STRING, INT, DOUBLE, BOOLEAN, BINARY, INF_MIN, and INF_MAX. If the data type of the constant is BINARY, the constant value must be Base64-encoded. INF_MIN indicates the minimum value specified by Tablestore, and INF_MAX indicates the maximum value specified by Tablestore. If you set the type to INF_MIN or INF_MAX, do not configure the value parameter. If you configure the value parameter, errors may occur.

  • You cannot specify a function or a custom expression. This is because Tablestore does not provide functions or expressions that are similar to those of SQL. Tablestore Reader cannot read data from columns that contain functions or expressions.

Yes

No default value

begin and end

The parameters specify the range of the data that you want to read.

The begin and end parameters specify the data read range for primary key columns in the Tablestore table. If you do not need to limit a range, specify the begin and end parameters as {"type":"INF_MIN"} and {"type":"INF_MAX"}. The type parameter specifies the data type of the data that you want to read.

Note
  • The default values for the begin and end parameters are [INF_MIN,INF_MAX). The setting indicates that all data in the Tablestore table is read.

  • If the number of ranges specified by the begin and end parameters is less than the number of primary key columns, the leftmost primary key column matching principle is used. For data in the primary key columns for which the begin and end parameters are not configured, the values [INF_MIN,INF_MAX) are used by default.

  • If only the begin parameter or the end parameter is configured for the Tablestore table from which you want to read data, the data read range is [begin,INF_MAX) or [INF_MIN,end).

  • If the Tablestore table from which you want to read data contains only one primary key column, the data read range specified by the begin and end parameters is a left-closed, right-open interval.

  • If the Tablestore table from which you want to read data contains multiple primary key columns, the data read range specified by the begin and end parameters for the last primary key column is a left-closed, right-open interval, and the data read range specified by the begin and end parameters for the rest of primary key columns is a left-closed, right-closed interval.

For example, you read data from a Tablestore table with three primary key columns [Hundreds,Tens,Ones]. The values of the primary key columns are (0,0,0)(0,0,1)(0,0,2)(0,0,3)......(9,9,8)(9,9,9), and the table consists of 1000 columns. In this case, you can specify the begin and end parameters in the following scenarios:

  • Example 1: You read data from the Tablestore table based on the data read range [3,5] specified for the Hundreds primary key column and the data read range [4,6] specified for the Tens primary key column. The values of the primary key columns for the data you read are (3,4,0)(3,4,1)...(4,4,0),(4,0,1)...(5,6,8)(5,6,9). Configuration details:

    "range": {
          "begin": [
            {"type":"INT", "value":"3"},  // The minimum value of the Hundreds column. 
            {"type":"INT", "value":"4"}  // The minimum value of the Tens column. 
          ],
          "end": [
            {"type":"INT", "value":"5"}, // The maximum value of the Hundreds column. 
            {"type":"INT", "value":"6"} // The maximum value of the Tens column. 
          ]
        }
  • Example 2: You read data from the Tablestore table based on the data read range [3,5] specified for the Hundreds primary key column, the data read range [4,6] specified for the Tens primary key column, and the data read range [5,7) specified for the Ones primary key column. The values of the primary key columns for the data you read are (3,4,5)(3,4,6)...(4,4,5),(4,4,6)...(5,6,5)(5,6,6). Configuration details:

    "range": {
          "begin": [
            {"type":"INT", "value":"3"},  // The minimum value of the Hundreds column. 
            {"type":"INT", "value":"4"},  // The minimum value of the Tens column. 
            {"type":"INT", "value":"5"}  // The minimum value of the Ones column. 
          ],
          "end": [
            {"type":"INT", "value":"5"}, // The maximum value of the Hundreds column. 
            {"type":"INT", "value":"6"}, // The maximum value of the Tens column. 
            {"type":"INT", "value":"7"}  // The maximum value of the Ones column. 
          ]
        }

No

(INF_MIN,INF_MAX)

split

The custom rule for data sharding. This parameter is an advanced configuration item. We recommend that you do not configure this parameter in common scenarios.

You can configure this parameter to customize the data range of shards. If data is unevenly distributed in a Tablestore table, you can customize a sharding rule. Sample configuration:

{
  "range": {
    "begin": [{"type": "INF_MIN"}],
    "end":   [{"type": "INF_MAX"}],
    "split": [
      {"type": "STRING","value": "1"},
      {"type": "STRING","value": "2"},
      {"type": "STRING","value": "3"},
      {"type": "STRING","value": "4"},
      {"type": "STRING","value": "5"}
    ]
  }

Data is split into six segments when a data synchronization task is run and the data synchronization task uses Tablestore Reader to read the data by using parallel threads. We recommend that the number of data segments is greater than the number of threads that can be run in parallel for the data synchronization task.

// Segment 1
[Minimum value,1)
// Segment 2
[1, 2)
// Segment 3
[2, 3)
// Segment 4
[3, 4)
// Segment 5
[4, 5)
// Segment 6
[5,Maximum value)

No

If you do not configure the split parameter, the automatic splitting logic is used.

The automatic splitting logic identifies the maximum value and minimum value of a partition key column, and evenly splits data into segments.

The INTEGER and STRING data types are supported for data in a partition key column. Exact division is used to split data of the INTEGER data type in a partition key column into segments. The Unicode of the first character is used to split data of the STRING data type in a partition key column into segments.

Parameters for reading data from time series tables in row mode

Parameter

Description

Required

Default value

column

The columns are organized in an array structure. Each element in an array represents a column from which you want to read data. Constant columns and common columns can be configured.

For constant columns, you must configure the following fields:

  • type: the data type of data in the constant column. This field is required. Supported data types are STRING, INT, DOUBLE, BOOLEAN, and BINARY.

  • value: the value of data in the constant column. This field is required.

For common columns, you must configure the following fields:

  • name: the name of the column. This field is required. The following are the predefined fields.

    • Use _m_name to identify the names of physical quantities or metrics for data in a time series table. The data type of data is STRING.

    • Use _data_source to identify data sources in a time series. The data type of data is STRING.

    • Use _tags to identify tags in a time series. The data type of data is STRING.

    • Use _time to identify timestamps in a time series. The data type of data is LONG.

  • is_timeseries_tag: Specifies whether the value of a key specified by the _tags field is read as a column. The default value of the is_timeseries_tag field is false. This field is optional.

  • type: the data type of data in the column. The default value of the field is string. This field is optional. Supported data types are STRING, INT, DOUBLE, BOOLEAN, and BINARY.

Sample script for reading data from four columns:

"column": [
  {
    "name": "_m_name"               // The column for the names of physical quantities or metrics for data in a time series table.
  },
  {
    "name": "tag_key",                // The value of the tag_key tag specified by the _tags field in a time series.
    "is_timeseries_tag":"true"
  },
  {
    "name": "string_column",        // The column whose name is string_column.
    "type":"string"                    // The data type of data in the column is STRING.
  },
  {
    "value": "constant_value",        // A constant column. The value of data in the constant column is constant_value.
    "type":"string"
  }
],

Yes

No default value

measurementName

The name of a physical quantity or metric for data in a time series table. If this parameter is left empty, all data in the table is read.

No

No default value

timeRange

The time range of the data that you want to read, which is indicated by [begin,end). It is a left-closed, right-open interval. The value of the begin parameter must be less than that of the end parameter. The timestamp is measured in milliseconds. Format:

"timeRange":{
    // begin: This parameter is optional. The default value is 0 and the value range is from 0 to LONG_MAX.
    "begin":1400000000000,
    // end: This parameter is optional. The default value is Long Max(9223372036854775807L) and the value range is from 0 to LONG_MAX.
    "end"  :1600000000000
},

No

All versions

Parameters for reading data from wide tables in column mode

Parameter

Description

Required

Default value

column

The name of the column from which you want to read data. You can read data only from common columns in column mode.

Format:

"column": [
    {"name1":"{your column name1}"},
    {"name2":"{your column name2}"}
],
Note
  • Constant columns are not supported in column mode.

  • You cannot configure the primaryKey parameter. The four-tuple data that you read contains all primary key columns by default.

  • You cannot specify a column repeatedly.

Yes

All columns

range

The range of data that you want to read, which is indicated by [begin,end). It is a left-closed, right-open interval. Take note of the following items about the range:

  • If the value of the begin parameter is less than that of the end parameter, data is read in normal order.

  • If the value of the begin parameter is greater than that of the end parameter, data is read in reverse order.

  • The value of the begin parameter cannot be the same as that of the end parameter.

Data types specified by the type parameter:

  • string

  • int

  • binary: The data must be read as Base64-encoded strings.

  • INF_MIN: indicates an infinitely small value.

  • INF_MAX: indicates an infinitely large value.

Format:

"range":{
    // This parameter is optional. The default value indicates that you start to read data from an infinitely small value.
    // You can enter an empty array, a primary key prefix, or a complete primary key for the range parameter. If data is read in normal order, the default primary key suffix is INF_MIN. If data is read in reverse order, the default primary key suffix is INF_MAX.
    // Example:
    If the table from which you want to read data has two primary key columns and the data types of the columns are STRING and INT respectively, you can configure the range parameter in any of the following formats:
    //1. Read data from the beginning to the end of the table.
    //"begin":[],"end":[],
    //2. For the first primary key column, Tablestore Reader reads data from the value a to the value b. For the second primary key column, Tablestore Reader reads data from the minimum value to the maximum value.
    //"begin":[{"type":"string", "value":"a"}],"end":[{"type":"string", "value":"b"}],
    //3. For the first primary key column, Tablestore Reader reads data from the value a. For the second primary key column, Tablestore Reader reads data from the minimum value. Tablestore Reader reads data to the end of all primary key columns.
    //"begin":[{"type":"string", "value":"a"},{"type":"INF_MIN"}],"end":[],    
    //
    // The JSON format does not support binary data. If the data type of values in a primary key column is BINARY, you must
    // first use the Java method Base64.encodeBase64String to convert binary data to a string, and then enter the string as the value of the parameter.
    // Example (Java):
    //   byte[] bytes = "hello".getBytes();  # constructs binary data, which is the byte value of the string hello.
    //   String inputValue = Base64.encodeBase64String(bytes) # calls the Base64.encodeBase64String method to convert the binary data to a string.
    //   After you run the preceding code, the string "aGVsbG8=" is returned for the inputValue parameter.
    //   Finally, set this parameter to {"type":"binary","value" : "aGVsbG8="}.

    "begin":[{"type":"string", "value":"a"},{"type":"INF_MIN"}],

    // The default value indicates that data read ends with an infinitely large value.
    // You can enter an empty array, a primary key prefix, or a complete primary key for the range parameter. If data is read in normal order, the default primary key suffix is INF_MAX. If data is read in reverse order, the default primary key suffix is INF_MIN.
    // This parameter is optional.
    "end":[{"type":"string", "value":"g"},{"type":"INF_MAX"}],

    // If you want to read large amounts of data, you can configure the split parameter to split the data that you want to read into shards based on splitting points and use parallel threads of a task to read data from the shards.
    // This parameter is optional.
    //   1. The values of the split parameter must correspond to the first column of the primary key, which is the shard key, and the value type must be the same as that of the partition key.
    //   2. The specified values must fall within the value range of the begin and end parameters.
    //   3. The values of the split parameter must be sorted in descending or ascending order based on the data read order that is determined by the values of the begin and end parameters.
    "split":[{"type":"string", "value":"b"}, {"type":"string", "value":"c"}]
},

No

All data

timeRange

The time range of the data that you want to read, which is indicated by [begin,end). It is a left-closed, right-open interval. The value of the begin parameter must be less than that of the end parameter. The timestamp is measured in milliseconds.

Format:

"timeRange":{
    // begin: This parameter is optional. The default value is 0 and the value range is from 0 to LONG_MAX.
    "begin":1400000000000,
    // end: This parameter is optional. The default value is Long Max(9223372036854775807L) and the value range is from 0 to LONG_MAX.
    "end"  :1600000000000
},

No

All versions

maxVersion

The maximum number of data versions that you can read. Valid values: 1 to INT32_MAX.

No

All versions

Appendix: Code and parameters

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.

Code for Tablestore Writer

Tablestore Writer configurations for writing data to wide tables in row mode

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ots",// The plug-in name. 
            "parameter":{
                "datasource":"",// The name of the data source. 
                "table":"",// The name of the table. 
                "newVersion":"true",// Specifies whether to use the latest version of Tablestore Writer.
                "mode": "normal",// The mode in which you want to write data. Set this parameter to normal.
                "isTimeseriesTable":"false",// Specifies whether to configure the Tablestore table as a wide table.
                "primaryKey" : [// The primary keys of the destination table in Tablestore. 
                    {"name":"gid", "type":"INT"},
                    {"name":"uid", "type":"STRING"}
                 ],
                "column" : [// The names of the columns. 
                      {"name":"col1", "type":"INT"},
                      {"name":"col2", "type":"DOUBLE"},
                      {"name":"col3", "type":"STRING"},
                      {"name":"col4", "type":"STRING"},
                      {"name":"col5", "type":"BOOL"}
                  ],
                "writeMode" : "PutRow"    // The write mode. 
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12",// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Tablestore Writer configurations for writing data to time series tables in row mode

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ots",// The plug-in name. 
            "parameter":{
                "datasource":"",// The name of the data source. 
                "table": "testTimeseriesTableName01",
                "mode": "normal",
                "newVersion": "true",
                "isTimeseriesTable":"true",
                "timeunit":"microseconds",
                "column": [
                      {
                        "name": "_m_name"
                      },
                      {
                        "name": "_data_source",
                      },
                      {
                        "name": "_tags",
                      },
                      {
                        "name": "_time",
                      },
                      {
                        "name": "string_1",
                        "type":"string"
                      },
                      {
                        "name":"tag3",
                        "is_timeseries_tag":"true",
                      }
                    ]
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12",// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Tablestore Writer configurations for writing data to wide tables in column mode

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ots",// The plug-in name. 
            "parameter":{
                "datasource":"",// The name of the data source. 
                "table":"",
                "newVersion":"true",
                "mode":"multiVersion",
                "primaryKey" : [
                    "gid",
                    "uid"
                    ]
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },x`
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12",// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Common parameters in code for Tablestore Writer

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

endPoint

The endpoint of the Tablestore server. For more information, see Endpoints.

Yes

No default value

accessId

The AccessKey ID of the account that you use to connect to the Tablestore server.

Yes

No default value

accessKey

The AccessKey secret of the account that you use to connect to the Tablestore server.

Yes

No default value

instanceName

The name of the Tablestore instance. The instance is an entity for you to use and manage Tablestore.

After you activate Tablestore, you must create an instance in the Tablestore console before you can create and manage tables. Tablestore instances are the basic units that you can use to manage your Tablestore resources. Access control and resource measurement for applications are performed at the instance level.

Yes

No default value

table

The name of the table to which you want to write data. You can specify only one table name. Multi-table synchronization is not required for Tablestore.

Yes

No default value

newVersion

Specifies whether to use the latest version of Tablestore Writer.

  • false: Uses an earlier version of Tablestore Writer. In this case, if you want to write data to wide tables, you must set the mode parameter to normal.

  • true: Uses the latest version of Tablestore Writer. In this case, you can write data to time series tables and wide tables in row or column mode, and the auto-increment primary key column feature is supported.

The latest version of Tablestore Writer provides new features and consumes less system resources. Therefore, we recommend that you use the latest version of Tablestore Writer.

The settings of the latest version of Tablestore Writer are compatible with the settings of an earlier version of Tablestore Writer. If you add the newVersion=true setting to a task in which an earlier version of Tablestore Writer is used, the task can be run as expected.

Yes

false

mode

The mode in which you want to write data. Valid values:

  • normal: You write data in row mode.

  • multiVersion: You write data in column mode.

This setting takes effect only for Tablestore Writer for which the newVersion parameter is set to true.

An earlier version of Tablestore Writer ignores the setting of the mode parameter, and you can write data only in row mode.

No

normal

isTimeseriesTable

Specifies whether to write data to a time series table.

  • false: The table is a wide table.

  • true: The table is a time series table.

This setting takes effect only for Tablestore Writer for which the newVersion parameter is set to true and the mode parameter is set to normal. In column mode, time series tables are not compatible.

No

false

Additional parameters in code for Tablestore Writer

You can use Tablestore Writer to write data to wide tables in row mode, write data to time series tables in row mode, and write data to wide tables in column mode. The following table describes the additional parameters for data synchronization in different modes.

Parameters for writing data to wide tables in row mode

Parameter

Description

Required

Default value

primaryKey

The primary keys of the destination table in Tablestore. Specify the primary keys in a JSON array. Tablestore is a NoSQL database service. You must specify column names for Tablestore Writer to write data.

Data Integration supports data type conversion. Tablestore Writer can convert data from a data type other than STRING or INT to the STRING or INT data type. Sample configuration:

"primaryKey" : [
    {"name":"gid", "type":"INT"},
    {"name":"uid", "type":"STRING"}
                 ],
Note

The primary keys in Tablestore must be of the STRING or INT type. Therefore, you must set the data type of a primary key to STRING or INT in the code editor for Tablestore Writer.

Yes

No default value

column

The names of the columns that you want to synchronize. Specify the names in a JSON array.

Sample configuration:

"column" : [
     {"name":"col1", "type":"INT"},
     {"name":"col2", "type":"DOUBLE"},
     {"name":"col3", "type":"STRING"},
     {"name":"col4", "type":"BINARY"},
     {"name":"col5", "type":"BOOL"}
              ],

The name parameter specifies the name of the column to which you want to write data. The type parameter specifies the data type of values in the column. Tablestore supports the following data types: STRING, INT, DOUBLE, BOOLEAN, and BINARY.

Note

Constants, functions, or custom statements are not supported during the data write process.

Yes

No default value

writeMode

The write mode. Valid values:

  • PutRow: the PutRow operation of Tablestore, which is used to insert data into a specified row. If a specified row does not exist, a new row is added. If a specified row exists, the row is overwritten.

  • UpdateRow: the UpdateRow operation of Tablestore, which is used to update the data in a specified row. If a specified row does not exist, a new row is added. If a specified row exists, the values of the specified columns in the row are added, modified, or removed based on the content in the request.

Yes

No default value

enableAutoIncrement

Specifies whether you can write data to a Tablestore table that contains auto-increment primary key columns.

  • true: Tablestore Writer scans for information about auto-increment primary key columns in the destination table and adds the auto-increment primary key columns during the data write process. Users do not need to specify the names of the auto-increment primary key columns.

  • false: The error is returned if you write data to a Tablestore table that contains auto-increment primary key columns.

No

false

requestTotalSizeLimitation

The maximum size of data that can be written to a single row in Tablestore. The parameter value must be of a numeric data type.

No

1MB

attributeColumnSizeLimitation

The maximum size of data that can be written to a single attribute column in Tablestore. The parameter value must be of a numeric data type.

No

2MB

primaryKeyColumnSizeLimitation

The maximum size of data that can be written to a single primary key column in Tablestore. The parameter value must be of a numeric data type.

No

1KB

attributeColumnMaxCount

The maximum number of attribute columns to which data can be written in Tablestore. The parameter value must be of a numeric data type.

No

1,024

Parameters for writing data to time series tables in row mode

Parameter

Description

Required

Default value

column

Each element in the column parameter corresponds to a field in timer series data. You can configure the following parameters for each element:

  • name: the name of the column. This field is required. The following are the predefined fields.

    • Use _m_name to identify the names of physical quantities or metrics for data in a time series table. The data type of data is STRING.

    • Use _data_source to identify data sources in a time series. The data type of data is STRING.

    • Use _tags to identify tags in a time series. The data type of data is STRING and the string format is ["tagKey1=value1","tagKey2=value2"].

    • Use _time to identify timestamps in a time series. The data type of data is LONG and the default unit is microseconds.

  • is_timeseries_tag: Specifies whether the value of a key specified by the _tags field is read as a column. The default value of the is_timeseries_tag field is false. This field is optional.

  • type: the data type of data in the column. The default value of the field is string. This field is optional. Supported data types are STRING, INT, DOUBLE, BOOLEAN, and BINARY.

The name of a physical quantity or metric and the timestamp for time series data cannot be left empty. You must configure the _m_name and _time fields.

Example: The following sample code shows that a single data record that you want to write contains six fields:

mName1    source1    ["tag1=A","tag2=B"]    1677763080000000    field_value     C

Configuration details:

"column": [
      {
        "name": "_m_name"
      },
      {
        "name": "_data_source",
      },
      {
        "name": "_tags",
      },
      {
        "name": "_time",
      },
      {
        "name": "string_1",
        "type":"string"
      },
      {
        "name":"tag3",
        "is_timeseries_tag":"true",
      }
    ],

After the data record is written to Tablestore, check the result in the Tablestore console.

Yes

No default value

timeunit

The unit of the timestamp specified by the _time field. Valid values: NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS, and MINUTES.

No

MICROSECONDS

Parameters for writing data to wide tables in column mode

Parameter

Description

Required

Default value

primaryKey

The primary key columns of the table.

To save configuration costs, you do not need to specify the position of the primaryKey column for rows in the destination table. However, the format of rows must be fixed. The primaryKey column must be placed at the beginning for rows in the table and the primaryKey column is followed by the columnName column. Format of rows: {pk0,pk1...}, {columnName}, {timestamp}, {value}.

For example, the following data records need to be written:

1,pk1,row1,1677699863871,value_0_0
1,pk1,row2,1677699863871,value_0_1
1,pk1,row3,1677699863871,value_0_2
2,pk2,row1,1677699863871,value_1_0
2,pk2,row2,1677699863871,value_1_1
2,pk2,row3,1677699863871,value_1_2
3,pk3,row1,1677699863871,value_2_0
3,pk3,row2,1677699863871,value_2_1
3,pk3,row3,1677699863871,value_2_2

Sample configuration:

"primaryKey" : [
    "gid",
    "uid"
    ],

Result of data write to wide tables:

gid     uid     row1        row2        row3
1        pk1        value_0_0    value_0_1    value_0_2
2        pk2        value_1_0    value_1_1    value_1_2
3        pk3        value_2_0    value_2_1    value_2_2

Yes

No default value

columnNamePrefixFilter

The column name prefix filter.

The names of the columns for data that is read from HBase consist of cf and qulifier. However, Tablestore does not support cf. If you want to write data that is read from HBase to Tablestore, cf must be filtered out.

Sample configuration: "columnNamePrefixFilter":"cf:"

Note
  • This parameter is optional. If you do not configure this parameter or the parameter value is an empty string, the filter operation is not performed on column names.

  • If the names of the columns for data that is added by Data Integration do not have prefixes, the data is stored in the dirty data collector.

No

No default value

FAQ

Question 1

Q: How do I configure Tablestore Writer to write data to a destination table that contains auto-increment primary key columns?

  1. Take note of the following items when you configure Tablestore Writer:

    "newVersion": "true",
    "enableAutoIncrement": "true",
  2. When you configure Tablestore Writer, you do not need to specify the names of auto-increment primary key columns.

  3. When you configure Tablestore Writer, the sum of the number of primary key columns and the number of column columns must be equal to the number of columns in the table from which you read data by using Tablestore Reader.

Question 2

Q: What do the _tags and is_timeseries_tag fields mean in a time series model?

Example: A data record has three tags: phone=xiaomi,RAM=8G,camera=LEICA.数据

  • Example for using Tablestore Reader to read data

    • If you want to combine the preceding tags and read the tags as one column, configure the following setting:

      "column": [
            {
              "name": "_tags",
            }
          ],

      DataWorks reads the tags as one column. Example:

      ["phone=xiaomi","camera=LEICA","RAM=8G"]
    • If you want to read the phone tag and the camera tag as two separate columns, configure the following setting:

      "column": [
            {
              "name": "phone",
              "is_timeseries_tag":"true",
            },
            {
              "name": "camera",
              "is_timeseries_tag":"true",
            }
          ],

      The following two columns of data are obtained:

      xiaomi, LEICA
  • Example for using Tablestore Writer to write data

    For example, the following two columns of data exist in the source:

    • ["phone=xiaomi","camera=LEICA","RAM=8G"]

    • 6499

    You want the two columns of data to be written to the tag field in the following format.格式Configure the following setting:

    "column": [
          {
            "name": "_tags",
          },
          {
            "name": "price",
            "is_timeseries_tag":"true",
          },
        ],
    • The configuration for the first column enables ["phone=xiaomi","camera=LEICA","RAM=8G"] to be written to the tag field.

    • The configuration for the second column enables price=6499 to be written to the tag field.