This topic describes how to synchronize data from one table to another table in Tablestore with Tunnel Service, DataWorks, or DataX.
Prerequisites
A destination table is created. The destination table must contain the columns that you want to synchronize from the source table. For more information, see the Step 3: Create a data table section of the "Use the Wide Column model in the Tablestore console" topic.
If you want to migrate data across accounts or regions, use DataX and connect to the virtual private cloud (VPC) over the Internet or over Cloud Enterprise Network (CEN). For information about how to use CEN, see Overview.
Use Tunnel Service to synchronize data
After the tunnel of the source table is created, you can use a Tablestore SDK to synchronize data from the source table to the destination table. You can specify custom logic to process data for your business during synchronization.
Prerequisites
The endpoint that you want to use is obtained. For more information, see Initialize an OTSClient instance.
An AccessKey pair is configured in environment variables. For more information, see Initialize an OTSClient instance.
The OTS_AK_ENV environment variable indicates the AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user. The OTS_SK_ENV environment variable indicates the AccessKey secret of an Alibaba Cloud account or a RAM user. Specify the AccessKey pair based on your business requirements.
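The sample code in this topic reads the AccessKey pair from these environment variables. The following sketch shows one way to set them in a shell session; the values are placeholders that you must replace with your own credentials:

```shell
# Placeholder values: replace them with the AccessKey pair of your
# Alibaba Cloud account or RAM user before you run the sample code.
export OTS_AK_ENV=your-access-key-id
export OTS_SK_ENV=your-access-key-secret
```

Do not hard-code the AccessKey pair in source code or commit it to version control.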
Procedure
Create a tunnel for the source table in the Tablestore console or by using a Tablestore SDK, and record the tunnel ID. For more information, see Quick start or Use Tunnel Service by using Tablestore SDKs.
Synchronize data by using a Tablestore SDK.
Sample code:
import java.util.List;

import com.alicloud.openservices.tablestore.TunnelClient;
import com.alicloud.openservices.tablestore.model.StreamRecord;
import com.alicloud.openservices.tablestore.tunnel.worker.IChannelProcessor;
import com.alicloud.openservices.tablestore.tunnel.worker.ProcessRecordsInput;
import com.alicloud.openservices.tablestore.tunnel.worker.TunnelWorker;
import com.alicloud.openservices.tablestore.tunnel.worker.TunnelWorkerConfig;

public class TunnelTest {
    public static void main(String[] args) {
        String accessKeyId = System.getenv("OTS_AK_ENV");
        String accessKeySecret = System.getenv("OTS_SK_ENV");
        TunnelClient tunnelClient = new TunnelClient("endpoint", accessKeyId, accessKeySecret, "instanceName");
        TunnelWorkerConfig config = new TunnelWorkerConfig(new SimpleProcessor());
        // You can view the tunnel ID on the Tunnels tab of the Tablestore console
        // or call the DescribeTunnel operation to query the tunnel ID.
        TunnelWorker worker = new TunnelWorker("tunnelId", tunnelClient, config);
        try {
            worker.connectAndWorking();
        } catch (Exception e) {
            e.printStackTrace();
            worker.shutdown();
            tunnelClient.shutdown();
        }
    }

    public static class SimpleProcessor implements IChannelProcessor {
        // Create a client that connects to the instance in which the destination table resides.
        TunnelClient tunnelClient = new TunnelClient("endpoint", "accessKeyId", "accessKeySecret", "instanceName");

        @Override
        public void process(ProcessRecordsInput processRecordsInput) {
            // Incremental data or full data of the source table is returned in ProcessRecordsInput.
            List<StreamRecord> list = processRecordsInput.getRecords();
            for (StreamRecord streamRecord : list) {
                switch (streamRecord.getRecordType()) {
                    case PUT:
                        // Specify the custom logic that writes the record to the destination table.
                        // putRow
                        break;
                    case UPDATE:
                        // updateRow
                        break;
                    case DELETE:
                        // deleteRow
                        break;
                }
                System.out.println(streamRecord.toString());
            }
        }

        @Override
        public void shutdown() {
        }
    }
}
Use DataWorks or DataX to synchronize data
You can use DataWorks or DataX to synchronize data from the source table to the destination table. This section describes how to synchronize data by using DataWorks.
Step 1: Add a Tablestore data source
Add the Tablestore instances of the source table and the destination table as data sources.
Go to the Data Integration page.
Log on to the DataWorks console as the project administrator.
In the left-side navigation pane, click Workspaces. In the top navigation bar, select a region.
On the Workspaces page, find the workspace that you want to manage and choose Shortcuts > Data Integration in the Actions column.
In the left-side navigation pane, click Data Source.
On the Data Source page, click Add Data Source.
In the Add Data Source dialog box, click Tablestore.
In the Add OTS data source dialog box, configure the parameters that are described in the following table.
Parameter
Description
Data Source Name
The name of the data source. The name can contain letters, digits, and underscores (_), and must start with a letter.
Data Source Description
The description of the data source. The description cannot exceed 80 characters in length.
Endpoint
The endpoint of the Tablestore instance. For more information, see Endpoints.
If the Tablestore instance and the resources of the destination data source are in the same region, enter a virtual private cloud (VPC) endpoint. Otherwise, enter a public endpoint.
Tablestore instance name
The name of the Tablestore instance. For more information, see Instance.
AccessKey ID
The AccessKey ID and AccessKey secret of your Alibaba Cloud account or RAM user. For more information about how to create an AccessKey pair, see Create an AccessKey pair.
AccessKey Secret
Test the network connectivity between the data source and the resource group that you select.
To ensure that your synchronization nodes run as expected, you need to test the connectivity between the data source and all types of resource groups on which your synchronization nodes will run.
Important: A synchronization task can use only one type of resource group. By default, only shared resource groups for Data Integration are displayed in the resource group list. To ensure the stability and performance of data synchronization, we recommend that you use an exclusive resource group for Data Integration.
Click Purchase to create a new resource group or click Associate Purchased Resource Group to associate an existing resource group. For more information, see Create and use an exclusive resource group for Data Integration.
Find the resource group that you want to manage and click Test Network Connectivity in the Connection Status column.
If Connected is displayed in the Connection Status column, the connectivity test is passed.
If the data source passes the network connectivity test, click Complete.
The newly created data source is displayed in the data source list.
Step 2: Create a synchronization node
Go to the DataStudio console.
Log on to the DataWorks console as the project administrator.
In the top navigation bar, select a region. In the left-side navigation pane, click Workspaces.
On the Workspaces page, find the workspace that you want to manage and choose Shortcuts > Data Development in the Actions column.
On the Scheduled Workflow page of the DataStudio console, click Business Flow and select a business flow.
For information about how to create a workflow, see Create a workflow.
Right-click the Data Integration node and choose Create Node > Offline synchronization.
In the Create Node dialog box, select a path and enter a node name.
Click Confirm.
The newly created offline synchronization node will be displayed under the Data Integration node.
Step 3: Configure and run an offline synchronization task
Double-click the new synchronization node under Data Integration.
Establish network connections between the resource group and data sources.
Select the source and destination data sources for the data synchronization task and the resource group that is used to run the data synchronization task. Establish network connections between the resource group and data sources and test the connectivity.
Important: Data synchronization tasks are run by using resource groups. Select a resource group and make sure that network connections between the resource group and data sources are established.
In the Configure Network Connections and Resource Group step, select Tablestore from the Source drop-down list and set the Data Source Name parameter to the source data source that you created.
Select a resource group from the Resource Group drop-down list.
After you select a resource group, the system displays the region and specifications of the resource group. The system automatically tests the connectivity between the resource group and the source data source.
Important: Make sure that the resource group is the same as that you selected when you created the data source.
Select Tablestore from the Destination drop-down list and set the Data Source Name parameter to the new destination data source.
The system automatically tests the connectivity between the resource group and the destination data source.
Click Next.
In the message that appears, click Use Script Mode.
Important: Tablestore supports only the script mode. If a data source cannot be configured by using the wizard mode, the script mode is used to configure the batch synchronization task.
After a task is switched to the script mode, you cannot switch back to the wizard mode.
Configure and save the task.
To synchronize full data, you need to use Tablestore Reader and Tablestore Writer. For more information about how to configure the script, see Tablestore data source.
Modify the script in the Configure tasks step.
Configure Tablestore Reader
Tablestore Reader reads data from Tablestore. You can specify a data range to extract incremental data from Tablestore. For more information, see Appendix: Code and parameters for Tablestore Reader.
Configure Tablestore Writer
Tablestore Writer uses Tablestore SDK for Java to connect to the Tablestore server and write data to it. Tablestore Writer provides features that optimize the write process, such as retries upon write timeouts, retries upon write exceptions, and batch submission. For more information, see Appendix: Code and parameters for Tablestore Writer.
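In script mode, the Reader and Writer are configured together in one JSON script. The following is a minimal sketch of a full-export script in the DataWorks script-mode format; the data source names, table names, primary key, and columns are placeholders that you must replace with your own values, and the authoritative parameter reference is the Tablestore data source topic.

```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "ots",
      "category": "reader",
      "name": "Reader",
      "parameter": {
        "datasource": "source_datasource",
        "table": "source_table",
        "column": [{"name": "col1"}, {"name": "col2"}],
        "range": {
          "begin": [{"type": "INF_MIN"}],
          "end": [{"type": "INF_MAX"}]
        }
      }
    },
    {
      "stepType": "ots",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "datasource": "destination_datasource",
        "table": "destination_table",
        "primaryKey": [{"name": "pk", "type": "string"}],
        "column": [{"name": "col1", "type": "string"}, {"name": "col2", "type": "int"}],
        "writeMode": "PutRow"
      }
    }
  ],
  "setting": {
    "errorLimit": {"record": "0"},
    "speed": {"throttle": false, "concurrent": 2}
  },
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]}
}
```

The range from INF_MIN to INF_MAX exports all rows of the source table. To synchronize only a subset of the data, narrow the begin and end values of the range parameter.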
Press Ctrl+S to save the script.
Note: If you do not save the script, a message appears when you perform subsequent operations. In this case, click OK in the message to save the script.
Run the synchronization task.
Note: In most cases, you need to synchronize full data only once and do not need to configure scheduling properties.
Click the run icon in the top toolbar.
In the Parameters dialog box, select the name of the resource group from the drop-down list.
Click Run.
After the script is run, click the link next to Detail log url on the Runtime Log tab. On the Detailed Runtime Logs page, check the value of Current task status. If the value of Current task status is FINISH, the task is complete.