Tablestore: Export full data from Tablestore to MaxCompute

Last Updated: Nov 05, 2024

If you want to use MaxCompute to back up Tablestore data or migrate data from Tablestore to MaxCompute, you can create and configure batch synchronization tasks in the DataWorks console to export full data. After full data is exported to MaxCompute, you can use the DataAnalysis service of DataWorks to view and analyze the exported data.

Usage notes

The field names in Tablestore are case-sensitive. Make sure that the field names in MaxCompute are the same as the field names in Tablestore.
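For example, the following minimal sketch shows a matching pair of field definitions in the field mapping format that is used later in this topic. The column names UserId and OrderCount are hypothetical.

  # Hypothetical Tablestore source fields (names are case-sensitive):
  {"name":"UserId","type":"STRING"}
  {"name":"OrderCount","type":"INT"}

  # The MaxCompute destination table must define the columns with the
  # same case: UserId and OrderCount, not userid or ordercount.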

Step 1: Add a Tablestore data source

To add a Tablestore database as a data source, perform the following steps:

  1. Go to the Data Integration page.

    1. Log on to the DataWorks console as the project administrator.

    2. In the left-side navigation pane, click Workspaces. In the top navigation bar, select a region.

    3. On the Workspaces page, find the workspace that you want to manage and choose Shortcuts > Data Integration in the Actions column.

  2. In the left-side navigation pane, click Data Source.

  3. On the Data Source page, click Add Data Source.

  4. In the Add Data Source dialog box, click the Tablestore block.

  5. In the Add OTS data source dialog box, configure the parameters that are described in the following table.

    Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_), and must start with a letter.

    Data Source Description: The description of the data source. The description cannot exceed 80 characters in length.

    Endpoint: The endpoint of the Tablestore instance. For more information, see Endpoints. If the Tablestore instance and the resources of the destination data source are in the same region, enter a virtual private cloud (VPC) endpoint. Otherwise, enter a public endpoint.

    Table Store instance name: The name of the Tablestore instance. For more information, see Instance.

    AccessKey ID and AccessKey Secret: The AccessKey ID and AccessKey secret of your Alibaba Cloud account or RAM user. For more information about how to create an AccessKey pair, see Create an AccessKey pair.
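    For reference, the following sketch shows the general endpoint formats. The instance name myinstance and the cn-hangzhou region are hypothetical; replace them with your own values.

      # Public endpoint of a hypothetical instance in the China (Hangzhou) region:
      https://myinstance.cn-hangzhou.ots.aliyuncs.com

      # VPC endpoint of the same hypothetical instance:
      https://myinstance.cn-hangzhou.vpc.tablestore.aliyuncs.com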

  6. Test the network connectivity between the data source and the resource group that you select.

    To ensure that your synchronization nodes run as expected, you need to test the connectivity between the data source and all types of resource groups on which your synchronization nodes will run.

    Important

    A synchronization task can use only one type of resource group. By default, only shared resource groups for Data Integration are displayed in the resource group list. To ensure the stability and performance of data synchronization, we recommend that you use an exclusive resource group for Data Integration.

    1. Click Purchase to create a new resource group or click Associate Purchased Resource Group to associate an existing resource group. For more information, see Create and use an exclusive resource group for Data Integration.

    2. Find the resource group that you want to manage and click Test Network Connectivity in the Connection Status column.

      If Connected is displayed in the Connection Status column, the connectivity test is passed.

  7. If the data source passes the network connectivity test, click Complete.

    The newly created data source is displayed in the data source list.

Step 2: Add a MaxCompute data source

Perform operations similar to those in Step 1 to add MaxCompute as a data source.

  1. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, find and click the MaxCompute block.

  3. In the Add MaxCompute Data Source dialog box, configure the parameters, test the resource group connectivity, and then click Complete. For more information, see Add a MaxCompute data source.

    Important
    • A synchronization task can use only one type of resource group. By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.

    • If no resource group is available, click Purchase to create an exclusive resource group for Data Integration. For more information, see Create and use an exclusive resource group for Data Integration.

Step 3: Create a batch synchronization node

  1. Go to the DataStudio console.

    1. Log on to the DataWorks console as the project administrator.

    2. In the top navigation bar, select a region. In the left-side navigation pane, click Workspaces.

    3. On the Workspaces page, find the workspace that you want to manage and choose Shortcuts > Data Development in the Actions column.

  2. On the Scheduled Workflow page of the DataStudio console, click Business Flow and select a business flow.

    For information about how to create a workflow, see Create a workflow.

  3. Right-click the Data Integration node and choose Create Node > Offline synchronization.

  4. In the Create Node dialog box, select a path and enter a node name.

  5. Click Confirm.

    The newly created batch synchronization node is displayed under the Data Integration node.

Step 4: Configure and start a batch synchronization task

To create and configure a task to synchronize data from Tablestore to MaxCompute, perform the following steps:

  1. Double-click the created batch synchronization node in the Data Integration folder.

  2. Establish network connections between the resource group and the data source.

    Select the source and destination for the batch synchronization task and the resource group that is used to run the batch synchronization task. Establish network connections between the resource group and data sources and then test the connectivity.

    Important

    Data synchronization tasks are run by using resource groups. Select a resource group and make sure that network connections between the resource group and data sources are established.

    1. In the Configure Network Connections and Resource Group step, set the Source parameter to Tablestore and the Data Source Name parameter to the name of the Tablestore data source that you added in Step 1: Add a Tablestore data source.

    2. Select a resource group.

      After you select a resource group, the system displays the region and specifications of the resource group and automatically tests the connectivity between the resource group and the source data source.

      Important

      Make sure that the resource group is the same as the one that you selected when you added the data source.

    3. Set the Destination parameter to MaxCompute(ODPS) and the Data Source Name parameter to the name of the destination data source that you added in Step 2: Add a MaxCompute data source.

      The system automatically tests the connectivity between the resource group and the destination data source.

    4. Click Next.

  3. Configure the task.

    You can use the codeless user interface (UI) or the code editor to configure the task.

    (Recommended) Use the codeless UI

    1. In the Configure Source and Destination section of the Configure tasks step, configure the data source and destination based on your business requirements.

      Configure the data source

      Table: The name of the Tablestore data table.

      Range of Primary Key(begin) and Range of Primary Key(end): The start primary key and the end primary key that specify the range of the data that you want to read. Each value must be a JSON array.

      The start primary key and the end primary key must be valid primary keys or virtual points that consist of values of the INF_MIN and INF_MAX types. The number of columns in the virtual points must be the same as the number of primary key columns.

      The INF_MIN type specifies an infinitely small value. All values of other types are greater than the value of the INF_MIN type. The INF_MAX type specifies an infinitely large value. All values of other types are less than the value of the INF_MAX type.

      The rows in the data table are sorted in ascending order by primary key. The range of the data that you want to read is a left-closed, right-open interval. All rows whose primary key is greater than or equal to the start primary key and less than the end primary key are returned.

      For example, a table contains the pk1 and pk2 primary key columns. The pk1 column is of the String type and the pk2 column is of the Integer type.

      To export full data from the table, specify the following parameters:

      • Sample Range of Primary Key(begin) parameter

        [
          {
            "type": "INF_MIN"
          },
          {
            "type": "INF_MIN"
          }
        ]
      • Sample Range of Primary Key(end) parameter

        [
          {
            "type": "INF_MAX"
          },
          {
            "type": "INF_MAX"
          }
        ]

      To export the rows in which the value of the pk1 column is tablestore, specify the following parameters:

      • Sample Range of Primary Key(begin) parameter

        [
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MIN"
          }
        ]
      • Sample Range of Primary Key(end) parameter

        [  
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MAX"
          }
        ]

      Split configuration information: The custom rule that is used to split data. We recommend that you do not configure this parameter in common scenarios. If data is unevenly distributed in a Tablestore table and the automatic splitting feature of Tablestore Reader fails, you can specify a custom rule to split data. You can configure split points within the range between the start primary key and the end primary key. You do not need to specify all primary key columns. The value must be a JSON array, as shown in the following sketch.
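      The following minimal sketch assumes that the first primary key column is of the String type and that the value "m" roughly bisects the data. Both the format and the split point are illustrative and follow the split parameter in the code-editor sample later in this topic.

        [
          {
            "type": "STRING",
            "value": "m"
          }
        ]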

      Configure the data destination

      Tunnel Resource Group: The Tunnel quota. The default value is Common transmission resources, which specifies the free Tunnel quota of MaxCompute.

        Note: If the exclusive Tunnel quota is unavailable due to overdue payments or expiration, the running task automatically switches from the exclusive Tunnel quota to the free Tunnel quota.

      Table: The name of the table in MaxCompute.

      Partition information: If you want daily incremental data to be stored in the partition of the corresponding date, configure the pt parameter to allow the synchronization of daily incremental data. For example, you can set the pt parameter to ${bizdate}.

      Write Mode: The mode in which data is written to MaxCompute. Valid values:

      • Keep existing data before writing (Insert Into): The data is directly imported to the table or the static partition.

      • Clean up existing data before writing (Insert Overwrite): Existing data in the destination is cleared before the data is imported to the table or the static partition.

      Write by Converting Empty Strings into Null: Specifies whether to convert empty strings in the source to Null in the destination.

      Visible After Synchronization: Specifies whether data that is synchronized to MaxCompute can be queried after the synchronization is complete. This parameter is displayed only after you click Advanced configuration.
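      If you later switch to the code editor, the Write Mode setting corresponds to the truncate parameter of the writer in the sample script later in this topic. The following sketch shows the mapping.

        # Clean up existing data before writing (Insert Overwrite):
        "truncate": true

        # Keep existing data before writing (Insert Into):
        "truncate": false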

    2. In the Field Mapping section, click the edit icon next to Source field and Target field to modify the source and destination fields.

      Important

      Make sure that the number and types of the source fields match the number and types of the destination fields.

      You must specify a source field in the JSON format. Example: {"name":"id","type":"STRING"}. You must specify the name of each destination field in Target field. One row represents one field. A row in Source field is mapped to the same row in Target field, as shown in the following sketch.

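      For example, the following minimal sketch maps two hypothetical columns. The names id and gmt_create are placeholders.

        # Source field: one JSON object per row.
        {"name":"id","type":"STRING"}
        {"name":"gmt_create","type":"INT"}

        # Target field: one MaxCompute column name per row, in the same order.
        id
        gmt_create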

    3. In the Channel Control section, configure the parameters for task execution, such as the Task Expected Maximum Concurrency, Synchronization rate, Policy for Dirty Data Records, and Distributed Execution parameters. For more information about the parameters, see Configure channel control policies.

    4. Click the save icon to save the configurations.

      Note

      If you do not save the configurations, a message prompting you to save the configurations appears when you perform subsequent operations. Click OK to save the configurations.

    Use the code editor

    To synchronize full data from Tablestore to MaxCompute, use the Tablestore Reader and MaxCompute Writer plug-ins. For information about the script configuration rules, see Tablestore data source and MaxCompute data source.

    Important

    You cannot switch between the codeless UI and the code editor. Proceed with caution.

    1. In the Configure tasks step, click the icon for switching to the code editor. In the message that appears, click OK.

    2. In the code editor, specify the parameters based on the following sample code.

      Important
      • In most cases, a task that exports full data is executed only once. You do not need to configure scheduling parameters for the task.

      • If the script configurations contain variables, such as ${date}, set each variable to a specific value when you run the task to synchronize data.

      • Comments are provided in the sample code to help you understand the configurations. Delete all comments when you use the sample code.

      {
          "type": "job",
          "version": "2.0",
          "steps": [
              {
                  "stepType": "ots", # The type of the reader. You cannot modify this parameter. 
                  "parameter": {
                      "datasource": "",  # The name of the Tablestore data source. Configure this parameter based on your business requirements. 
                      "column": [ # The names of the columns that you want to export from the Tablestore source. 
                          {
                              "name": "column1"
                          },
                          {
                              "name": "column2"
                          },
                          {
                              "name": "column3"
                          },
                          {
                              "name": "column4"
                          },
                          {
                              "name": "column5"
                          }
                      ],
                      "range": {
                          "split": [ # The partition information about the table in Tablestore, which can accelerate the export. In most cases, you do not need to specify this parameter. 
                              {
                                  "type": "INF_MIN"
                              },
                              {
                                  "type": "STRING",
                                  "value": "splitPoint1"
                              },
                              {
                                  "type": "STRING",
                                  "value": "splitPoint2"
                              },
                              {
                                  "type": "STRING",
                                  "value": "splitPoint3"
                              },
                              {
                                  "type": "INF_MAX"
                              }
                          ],
                          "end": [ 
                              {
                                  "type": "INF_MAX" # The end position of the first primary key column in Tablestore. To export full data, set this parameter to INF_MAX. To export only part of the data, configure this parameter based on your business requirements. If the table contains multiple primary key columns, configure the end parameter for each primary key column. 
                              },
                              {
                                  "type": "INF_MAX"
                              },
                              {
                                  "type": "STRING",
                                  "value": "end1"
                              },
                              {
                                  "type": "INT",
                                  "value": "100"
                              }
                          ],
                          "begin": [ 
                              {
                                  "type": "INF_MIN" # The start position of the first primary key column in Tablestore. To export full data, set this parameter to INF_MIN. To export only part of the data, configure this parameter based on your business requirements. If the table contains multiple primary key columns, configure the begin parameter for each primary key column. 
                              },
                              {
                                  "type": "INF_MIN"
                              },
                              {
                                  "type": "STRING",
                                  "value": "begin1"
                              },
                              {
                                  "type": "INT",
                                  "value": "0"
                              }
                          ]
                      },
                      "table": "" # The name of the table in Tablestore. 
                  },
                  "name": "Reader",
                  "category": "reader"
              },
              {
                  "stepType": "odps", # The type of the writer. You cannot modify this parameter. 
                  "parameter": {
                      "partition": "",  # If the MaxCompute table is partitioned, this parameter is required. If the MaxCompute table is not partitioned, do not specify this parameter. The names of the partitions to which you want to write data. The last-level partition must be specified. 
                      "truncate": true, # Specifies whether to delete all historical data. 
                      "datasource": "", # The name of the MaxCompute data source. Configure this parameter based on your business requirements. 
                      "column": [ # The names of the columns in MaxCompute. Specify the column names in the same order as the column names in Tablestore. 
                          "*"
                      ],
                      "table": "" # The name of the table in MaxCompute. The table must be created in advance. Otherwise, the task may fail. 
                  },
                  "name": "Writer",
                  "category": "writer"
              },
              {
                  "name": "Processor",
                  "stepType": null,
                  "category": "processor",
                  "parameter": {}
              }
          ],
          "setting": {
              "executeMode": null,
              "errorLimit": {
                  "record": "0" # If the number of errors exceeds the number of records, the task fails. 
              },
              "speed": {
                  "throttle":true, # Specifies whether to enable throttling. A value of false specifies that throttling is disabled, and a value of true specifies that throttling is enabled. The mbps parameter takes effect only if you set the throttle parameter to true. 
                  "concurrent":1 # The maximum number of concurrent tasks. 
                  "mbps":"12" # The rate to which traffic is limited. 
              }
          },
          "order": {
              "hops": [
                  {
                      "from": "Reader",
                      "to": "Writer"
                  }
              ]
          }
      }

      You can configure the begin and end parameters to specify the range of data that you want to export. For example, a table contains the pk1 and pk2 primary key columns. The pk1 column is of the String type and the pk2 column is of the Integer type.

      • To export full data from the table, specify the following parameters:

        "begin": [ # The start position of the data that you want to export. 
          {
            "type": "INF_MIN"
          },
          {
            "type": "INF_MIN"
          }
        ],
        "end": [  # The end position of the data that you want to export. 
          {
            "type": "INF_MAX"
          },
          {
            "type": "INF_MAX"
          }
        ],
      • To export the rows in which the value of the pk1 column is tablestore, specify the following parameters:

        "begin": [ # The start position of the data that you want to export. 
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MIN"
          }
        ],
        "end": [  # The end position of the data that you want to export. 
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MAX"
          }
        ],
    3. Click the save icon to save the configurations.

      Note

      If you do not save the script, a message prompting you to save the script appears when you perform subsequent operations. Click OK to save the script.

  4. Run the synchronization task.

    Important

    In most cases, you need to synchronize full data only once and do not need to configure scheduling properties.

    1. Click the run icon.

    2. In the Parameters dialog box, select the name of the resource group from the drop-down list.

    3. Click Run.

      After the synchronization task is complete, click the URL of the run log on the Runtime Log tab to go to the details page of the run log. On the details page of the run log, check the value of the Current task status parameter.

      If the value of the Current task status parameter is FINISH, the task is complete.

Step 5: View the data imported to MaxCompute

You can develop a MaxCompute SQL task or create an ad hoc query to query data in a MaxCompute table by executing SQL statements. For more information, see Develop a MaxCompute SQL task, Create an ad hoc query, and Table operations.
