To periodically synchronize incremental data from Tablestore to Object Storage Service (OSS) for backup or further use, you can create and configure batch synchronization tasks in the DataWorks console. This topic describes how to create, configure, schedule, and verify such a task.
Usage notes
This feature is applicable to the Wide Column model and TimeSeries model of Tablestore.
Wide Column model: You can use the codeless user interface (UI) or code editor to export data from a data table in Tablestore to OSS.
TimeSeries model: You can use only the code editor to export data from a time series table in Tablestore to OSS.
When you use Tablestore Stream to synchronize incremental data, make sure that a whole row of data is written to Tablestore each time. Whole-row writes are typical of time series data such as IoT data, which does not need to be modified after it is written.
Incremental data is synchronized every 5 minutes, and Tablestore plug-ins may introduce an additional latency of 5 minutes. The total latency for incremental synchronization therefore ranges from 5 to 10 minutes.
Prerequisites
OSS is activated, and an OSS bucket is created. For more information, see Activate OSS and Create buckets.
The information about the instances, data tables, or time series tables whose data you want to synchronize from Tablestore to OSS is confirmed and recorded.
DataWorks is activated, and a workspace is created. For more information, see Activate DataWorks and Create a workspace.
A Resource Access Management (RAM) user is created, and the OSS and Tablestore policies are attached to the RAM user. For more information, see Create a RAM user and Grant permissions to a RAM user.
Important: To prevent security risks caused by the leakage of the AccessKey pair of your Alibaba Cloud account, we recommend that you use the AccessKey pair of a RAM user.
An AccessKey pair is created for the RAM user. For more information, see Create an AccessKey pair.
A Tablestore data source and an OSS data source are added. For more information, see the Step 1: Add a Tablestore data source and Step 2: Add an OSS data source sections of the "Export full data to OSS" topic.
Step 1: Create a batch synchronization node
Go to the DataStudio console.
Log on to the DataWorks console as the project administrator.
In the top navigation bar, select a region. In the left-side navigation pane, click Workspaces.
On the Workspaces page, find the workspace that you want to manage and choose Shortcuts > Data Development in the Actions column.
On the Scheduled Workflow page of the DataStudio console, click Business Flow and select a business flow.
For information about how to create a workflow, see Create a workflow.
Right-click the Data Integration node and choose Create Node > Offline synchronization.
In the Create Node dialog box, select a path and enter a node name.
Click Confirm.
The new offline synchronization node is displayed under the Data Integration node.
Step 2: Configure and start a batch synchronization task
To configure a task to synchronize incremental data from Tablestore to OSS, select an appropriate configuration method based on the data storage model.
If you use the Wide Column model and store data in a data table, synchronize data from the data table. For more information, see the Configure a task to synchronize data from a data table section of this topic.
If you use the TimeSeries model and store data in a time series table, synchronize data from the time series table. For more information, see the Configure a task to synchronize data from a time series table section of this topic.
Configure a task to synchronize data from a data table
Configure a task to synchronize data from a time series table
Step 3: Configure scheduling properties
You can configure the scheduling properties of the batch synchronization task in the Properties panel, such as the time to run the task, rerun properties, and scheduling dependencies.
Click Properties in the right-side navigation pane of the task configuration tab.
In the Scheduling Parameter section of the Properties panel, click Add Parameter to add parameters. The following table describes the parameters. For more information, see Supported formats of scheduling parameters.
Parameter    Value
startTime    $[yyyymmddhh24-2/24]$[miss-10/24/60]
endTime      $[yyyymmddhh24-1/24]$[miss-10/24/60]
The following figure shows how to configure the parameters.
For example, if the task is scheduled to run at 19:00:00 on April 23, 2023, the startTime parameter is resolved to 20230423175000 and the endTime parameter is resolved to 20230423185000. In this case, the task synchronizes the data that is generated from 17:50 to 18:50.
In the Schedule section, configure the scheduling properties. For more information, see Configure time properties.
The following figure shows how to configure a task that is scheduled to run by the hour.
In the Dependencies section, select Add Root Node. The system automatically generates the information about the ancestor node of the current node.
If you select Add Root Node, the task that runs on the current node does not depend on the ancestor node of the current node.
After the configuration is complete, close the Properties panel.
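The way the startTime and endTime expressions in this step resolve can be reproduced locally. The following Python sketch is not DataWorks code. It assumes, based on the example above, that each expression combines the date and hour of the scheduled time shifted back by 2 hours or 1 hour with the minutes and seconds of the scheduled time shifted back by 10 minutes. The function name sync_window is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def sync_window(scheduled: datetime) -> Tuple[str, str]:
    """Return (startTime, endTime) as yyyyMMddHHmmss strings.

    Assumption: mirrors how the two scheduling parameter expressions
    appear to resolve in the documented example.
    """
    # $[miss-10/24/60]: minutes and seconds of (scheduled time - 10 minutes)
    min_sec = (scheduled - timedelta(minutes=10)).strftime("%M%S")
    # $[yyyymmddhh24-2/24]: date and hour of (scheduled time - 2 hours)
    start_hour = (scheduled - timedelta(hours=2)).strftime("%Y%m%d%H")
    # $[yyyymmddhh24-1/24]: date and hour of (scheduled time - 1 hour)
    end_hour = (scheduled - timedelta(hours=1)).strftime("%Y%m%d%H")
    return start_hour + min_sec, end_hour + min_sec


start, end = sync_window(datetime(2023, 4, 23, 19, 0, 0))
print(start, end)  # 20230423175000 20230423185000
```

Under this assumption, a task that runs every hour on the hour produces back-to-back windows: the run at 20:00 starts at 18:50, which is where the run at 19:00 ended. The window also ends 10 minutes before the scheduled time, which leaves room for the 5 to 10 minute synchronization latency described in the usage notes.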
Step 4: Debug the script and commit the task
Optional. Debug the script.
Debug the script to ensure that the synchronization task can synchronize incremental data from Tablestore to OSS.
Important: When you debug the script, data generated within the specified time range may be imported to OSS multiple times. If the same data rows are written to OSS multiple times, the relevant data rows in OSS are overwritten.
Click the icon.
In the Parameters dialog box, select a resource group and configure the custom parameters.
Specify the custom parameter values in the yyyyMMddHHmmss format. Example: 20230423175000.
Click Run.
Commit the synchronization task.
After a synchronization task is committed, the synchronization task is run based on the scheduling properties that you configured.
Click the icon.
In the Submit dialog box, specify the change description based on your business requirements.
Click Confirm.
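Before you enter custom parameter values for a debug run, it can help to check them locally. The following Python sketch checks that two values follow the yyyyMMddHHmmss format and that the start time precedes the end time. The function names parse_param and check_window are hypothetical and are not part of DataWorks.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # Python equivalent of yyyyMMddHHmmss


def parse_param(value: str) -> datetime:
    """Parse a 14-digit yyyyMMddHHmmss string or raise ValueError."""
    if len(value) != 14 or not value.isdigit():
        raise ValueError(f"expected 14 digits (yyyyMMddHHmmss), got {value!r}")
    return datetime.strptime(value, FMT)


def check_window(start_time: str, end_time: str) -> float:
    """Validate both values and return the window length in minutes."""
    start = parse_param(start_time)
    end = parse_param(end_time)
    if start >= end:
        raise ValueError("startTime must be earlier than endTime")
    return (end - start).total_seconds() / 60


print(check_window("20230423175000", "20230423185000"))  # 60.0
```

Keeping the debug window short limits how much data is written to OSS again when the same time range is run more than once.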
Step 5: View the result of the synchronization task
To view the status of the task, perform the following steps in the DataWorks console:
Click Operation Center in the upper-right corner of the task configuration tab.
In the left-side navigation pane, choose Cycle Task Maintenance > Cycle Instance. On the Instance Perspective tab of the Cycle Instance page, view the status of the instance.
To view the result of the task, perform the following steps in the OSS console:
Log on to the OSS console.
Click Buckets in the left-side navigation pane. On the Buckets page, find the bucket to which data is synchronized and click the name of the bucket.
On the Objects page, select an object and download the object to check whether the data is synchronized as expected.
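As an alternative to browsing the Objects page, the objects in the bucket can be listed from a script. The following Python sketch assumes the OSS SDK for Python (oss2) and relies only on the list_objects method of a bucket object and on the object_list, is_truncated, and next_marker fields of its result. The function names, the prefix, and the placeholder values in the usage comment are hypothetical.

```python
from typing import Iterator, List, Tuple


def iter_objects(bucket, prefix: str = "") -> Iterator[Tuple[str, int, int]]:
    """Yield (key, size, last_modified) for every object under a prefix.

    `bucket` is expected to behave like an oss2.Bucket instance.
    """
    marker = ""
    while True:
        result = bucket.list_objects(prefix=prefix, marker=marker, max_keys=100)
        for obj in result.object_list:
            yield obj.key, obj.size, obj.last_modified
        if not result.is_truncated:
            break
        # Continue listing from where the previous page ended.
        marker = result.next_marker


def objects_modified_since(bucket, prefix: str, since_epoch: int) -> List[str]:
    """Return keys whose last-modified time (Unix seconds) is >= since_epoch."""
    return [key for key, _size, mtime in iter_objects(bucket, prefix)
            if mtime >= since_epoch]


# Usage sketch (placeholder values, requires `pip install oss2`):
# import oss2
# auth = oss2.Auth("<AccessKey ID>", "<AccessKey secret>")
# bucket = oss2.Bucket(auth, "<OSS endpoint>", "<bucket name>")
# print(objects_modified_since(bucket, "<object prefix>", 1682247000))
```

Filtering by last-modified time makes it easier to confirm that a given scheduled run produced new objects, without downloading each object first.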
References
If you want to download the OSS objects that contain the exported Tablestore data to your local device, you can use the OSS console or ossutil. For more information, see Simple download.
To prevent important data from being unavailable due to accidental deletion or malicious tampering, you can use Cloud Backup to back up data in the wide tables of Tablestore instances on a regular basis and restore lost or damaged data at your earliest opportunity. For more information, see Overview.
If you want to implement tiered storage for the hot and cold data of Tablestore, full backup of Tablestore data, and large-scale real-time data analysis, you can use the data delivery feature of Tablestore. For more information, see Overview.