You can create a delivery task in the Tablestore console to deliver data from Tablestore to an Object Storage Service (OSS) bucket.
Prerequisites
OSS is activated, and a bucket is created in the region where the Tablestore instance resides. For more information, see Activate OSS.
Data delivery allows you to deliver data from a Tablestore instance to an OSS bucket within the same region. To deliver data to another warehouse such as MaxCompute, submit a ticket.
Usage notes
Data delivery is available in the China (Hangzhou), China (Shanghai), China (Beijing), and China (Zhangjiakou) regions.
Delete operations on Tablestore data are ignored during delivery. Data that has already been delivered to OSS is not deleted when you delete the corresponding data in Tablestore.
Initialization takes up to one minute after you create a delivery task.
When data is written at a steady rate, delivery latencies are within 3 minutes. The P99 latency of data synchronization is within 10 minutes.
Note: The P99 latency indicates the average latency of the slowest 1% of requests over the previous 10 seconds.
Create a data delivery task
Go to the Instance Management page.
Log on to the Tablestore console.
In the upper part of the page, select a resource group and a region. Find a Tablestore instance that you want to manage in the instance list. Then, click the instance name or Manage Instance in the Actions column.
On the Instance Management page, click the Deliver Data to OSS tab.
(Optional) Create the service-linked role AliyunServiceRoleForOTSDataDelivery.
When you configure data delivery for the first time, you must create the AliyunServiceRoleForOTSDataDelivery role that is used to authorize Tablestore to write data to an OSS bucket. For more information, see AliyunServiceRoleForOTSDataDelivery role.
Note: For more information about service-linked roles, see Service-linked roles.
On the Deliver Data to OSS tab, click Role for Delivery Service.
In the Role Details message, view related information. Click OK.
Create a delivery task.
On the Deliver Data to OSS tab, click Create Task.
In the Create Task dialog box, configure the following parameters.
Parameter
Description
Task Name
The name of the delivery task.
The name must be 3 to 16 characters in length and can contain only lowercase letters, digits, and hyphens (-). It must start and end with a lowercase letter or digit.
Destination Region
The region where the Tablestore instance and OSS bucket are located.
Source Table
The name of the Tablestore table.
Destination Bucket
The name of the OSS bucket to which you want to deliver data.
Important: Make sure that the bucket is in the same region as the Tablestore instance.
Destination Prefix
The prefix of the folder in the bucket. Data is delivered from Tablestore to the folder. The path of the destination folder supports the following time variables: $yyyy, $MM, $dd, $HH, and $mm. For more information, see the Partition data by time section of this topic.
When the path uses time variables, OSS folders are dynamically generated based on the time at which data is written. This way, the data in OSS is organized and partitioned by time, which follows the Hive partition naming convention.
When the path does not use time variables, all files are delivered to an OSS folder whose name contains this prefix.
Synchronization Mode
The type of the delivery task. Valid values:
Incremental: Only incremental data is synchronized.
Full: All data in tables is scanned and synchronized.
Differential: Full data is synchronized before incremental data is synchronized.
When Tablestore synchronizes incremental data, you can view the time when data is last delivered and the status of the delivery task.
Destination Object Format
The delivered data is stored in the Parquet format. By default, data delivery uses PLAIN encoding, which can encode data of all supported types.
Schema Generation Type
Select a schema generation type.
If you set the Schema Generation Type parameter to Manual, you must configure the source fields, destination field names, and destination field types for delivery.
If you set the Schema Generation Type parameter to Auto Generate, the system identifies and matches the fields for delivery.
Schema Configurations
Specify the columns to be delivered. The order of fields in the Tablestore table can differ from the order of fields in the schema. Parquet data stored in OSS is distributed based on the order of fields in the schema.
Important: The data types of the source and destination fields must be consistent. Otherwise, the fields are discarded as dirty data. For more information about field type mappings, see the Data type mapping section of this topic.
When you configure the schema, you can perform the following operations:
Click +Add Field to add a field for delivery.
Click the up or down arrow icon in the Actions column of a field to adjust the order of the field.
Click the delete icon in the Actions column of a field to delete the field.
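The Task Name rules above can be checked client-side before you submit the dialog. The following sketch is our own illustration (the console performs its own validation); the regex encodes the stated rules: 3 to 16 characters, lowercase letters, digits, and hyphens, starting and ending with a lowercase letter or digit.

```python
import re

# First and last characters must be a lowercase letter or digit; the
# 1-14 middle characters may also include hyphens, giving 3-16 total.
TASK_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,14}[a-z0-9]$")

def is_valid_task_name(name: str) -> bool:
    """Return True if name satisfies the delivery task naming rules."""
    return TASK_NAME_RE.fullmatch(name) is not None

print(is_valid_task_name("my-task-01"))  # valid
print(is_valid_task_name("My-Task"))     # invalid: uppercase letters
print(is_valid_task_name("-task"))       # invalid: starts with a hyphen
```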
Click OK.
In the View Statement to Create Table message, you can view the statement that is used to create an external table in E-MapReduce (EMR). You can copy this statement to create an external table that allows EMR to access the data in OSS.
After the delivery task is created, you can perform the following operations:
View the details of the delivery task, such as the task name, table name, destination bucket, destination prefix, status, and the time when data is last synchronized.
View or copy the statement to create a table.
Click View Statement to Create Table in the Actions column. You can view or copy the statement to create an external table by using computing engines such as EMR. For more information, see Use EMR.
View the error message returned after the delivery.
If the configurations for the OSS bucket or the delivery permissions are incorrect, data delivery cannot be completed. On the status page of the delivery task, you can view the related error messages. For more information about error handling, see the Error handling section of this topic.
Delete the delivery task.
Click Delete in the Actions column of the delivery task to delete it. If the delivery task is still being initialized, the system returns an error. In this case, delete the task after initialization is complete.
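As a rough illustration of the kind of statement the console generates, the following sketch assembles a HiveQL CREATE EXTERNAL TABLE statement for Parquet data stored in OSS. The table name, columns, and location here are hypothetical examples; copy the actual statement from the console rather than building your own.

```python
def build_external_table_ddl(table, columns, oss_location):
    """Assemble an illustrative HiveQL DDL for Parquet files in OSS.

    columns is a list of (name, hive_type) pairs; all values below are
    hypothetical examples, not output of the Tablestore console.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{oss_location}';"
    )

ddl = build_external_table_ddl(
    "orders_delivered",
    [("order_id", "BIGINT"), ("amount", "DOUBLE"), ("buyer", "STRING")],
    "oss://myBucket/myPrefix/",
)
print(ddl)
```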
View OSS data
After the delivery task is initialized and data is delivered, you can view the data delivered to OSS by using the OSS console, API, SDK, or computing engine EMR. For more information, see Overview.
Example of an OSS object URL:
oss://BucketName/TaskPrefix/TaskName_ConcurrentID_TaskPrefix__SequenceID
In the example, BucketName indicates the name of the bucket. The first TaskPrefix indicates the prefix of the destination folder, and the second TaskPrefix indicates the prefix information of the task. TaskName indicates the name of the task. ConcurrentID indicates the concurrency number determined by the system, which starts from 0 and increases as the throughput increases. SequenceID indicates the sequence ID of the delivered file and increases from 1.
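Because a task name cannot contain underscores and a double underscore separates the sequence ID, an object URL in this format can be split back into its parts. The following parser is our own sketch, not an SDK API:

```python
from urllib.parse import urlparse

def parse_delivery_object_url(url):
    """Split an OSS object URL produced by data delivery into its parts.

    Relies on two properties of the format: the task name cannot contain
    underscores, and a double underscore precedes the sequence ID.
    """
    parsed = urlparse(url)            # scheme "oss", netloc is the bucket
    path = parsed.path.lstrip("/")
    folder, _, filename = path.rpartition("/")
    body, _, sequence_id = filename.rpartition("__")
    task_name, concurrent_id, task_prefix = body.split("_", 2)
    return {
        "bucket": parsed.netloc,
        "folder": folder,
        "task_name": task_name,
        "concurrent_id": int(concurrent_id),
        "task_prefix": task_prefix,
        "sequence_id": int(sequence_id),
    }

info = parse_delivery_object_url(
    "oss://myBucket/myPrefix/testTask_0_myPrefix__1"
)
print(info)
```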
Partition data by time
Data delivery allows the system to record the time when data is written to Tablestore. The time consists of the following variables: $yyyy (four-digit year), $MM (two-digit month), $dd (two-digit day), $HH (two-digit hour), and $mm (two-digit minute). The time can be used as the prefix of the destination bucket after conversion.
We recommend that the size of an OSS object be at least 4 MB. When a computing engine loads OSS data, a larger number of partitions results in a longer time to load the partitions. Therefore, in most real-time data writing scenarios, we recommend that you partition data at the granularity of days or hours rather than minutes.
The following example uses a delivery task for data that was written to Tablestore at 16:03 on August 31, 2020. The following table describes the URL of the first object generated in OSS for different destination prefix configurations.
| OSS bucket | Task name | Destination prefix | OSS object URL |
| --- | --- | --- | --- |
| myBucket | testTask | myPrefix | oss://myBucket/myPrefix/testTask_0_myPrefix__1 |
| myBucket | testTaskTimeParitioned | myPrefix/$yyyy/$MM/$dd/$HH/$mm | oss://myBucket/myPrefix/2020/08/31/16/03/testTaskTimeParitioned_0_myPrefix_2020_08_31_16_03__1 |
| myBucket | testTaskTimeParitionedHiveNamingStyle | myPrefix/year=$yyyy/month=$MM/day=$dd | oss://myBucket/myPrefix/year=2020/month=08/day=31/testTaskTimeParitionedHiveNamingStyle_0_myPrefix_year=2020_month=08 |
| myBucket | testTaskDs | ds=$yyyy$MM$dd | oss://myBucket/ds=20200831/testTaskDs_0_ds=20200831__0 |
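The prefix expansion shown in the table above can be sketched as a simple substitution of the five time variables. This helper is our own illustration; Tablestore performs the expansion server-side when it writes each object.

```python
from datetime import datetime

def expand_prefix(prefix_template, ts):
    """Expand the $yyyy, $MM, $dd, $HH, and $mm time variables
    in a destination prefix, zero-padded as in the console examples."""
    return (
        prefix_template
        .replace("$yyyy", f"{ts.year:04d}")
        .replace("$MM", f"{ts.month:02d}")
        .replace("$dd", f"{ts.day:02d}")
        .replace("$HH", f"{ts.hour:02d}")
        .replace("$mm", f"{ts.minute:02d}")
    )

# Data written at 16:03 on August 31, 2020, as in the table above.
write_time = datetime(2020, 8, 31, 16, 3)
folder = expand_prefix("myPrefix/year=$yyyy/month=$MM/day=$dd", write_time)
print(folder)  # myPrefix/year=2020/month=08/day=31
```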
Data type mapping
| Parquet logical type | Data type in Tablestore |
| --- | --- |
| Boolean | Boolean |
| Int64 | Int64 |
| Double | Double |
| UTF8 | String |
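The mapping above can be applied client-side to anticipate which fields a Manual schema configuration would discard as dirty data because of mismatched types. The helper below is our own sketch, not an SDK API:

```python
# Tablestore type -> Parquet logical type, per the mapping table above.
TYPE_MAPPING = {
    "Boolean": "Boolean",
    "Int64": "Int64",
    "Double": "Double",
    "String": "UTF8",
}

def check_schema(fields):
    """Partition (name, tablestore_type, parquet_type) triples into
    deliverable fields and dirty fields whose types do not match."""
    ok, dirty = [], []
    for name, ts_type, parquet_type in fields:
        if TYPE_MAPPING.get(ts_type) == parquet_type:
            ok.append(name)
        else:
            dirty.append(name)
    return ok, dirty

ok, dirty = check_schema([
    ("id", "Int64", "Int64"),
    ("name", "String", "UTF8"),
    ("price", "Double", "Int64"),  # mismatched: would be discarded as dirty data
])
print(ok, dirty)
```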
Error handling
| Error message | Description | Solution |
| --- | --- | --- |
| UnAuthorized | Tablestore is not authorized to access OSS. | Check whether the service-linked role AliyunServiceRoleForOTSDataDelivery exists. If the role does not exist, create a delivery task to trigger Tablestore to create the role. |
| InvalidOssBucket | The specified OSS bucket does not exist. | Create the OSS bucket. After the bucket is created, all data is written to the OSS bucket again, and the delivery progress is updated. |