This topic describes how to configure a batch synchronization node to synchronize data from Object Storage Service (OSS) to MaxCompute. It covers best practices for data source configuration, network connectivity, and synchronization node configuration.
Background information
OSS is a secure, cost-effective, and highly reliable cloud storage service that allows you to store a large amount of data. OSS is designed to provide 99.9999999999% (twelve 9's) data durability and 99.995% data availability. OSS provides multiple storage classes to help you manage and reduce storage costs. Data Integration supports the synchronization of data between OSS and other types of data sources. This topic provides an example of how to synchronize data from OSS to MaxCompute in offline mode.
Obtain information about the OSS bucket
- A public endpoint is used for access over the Internet. You are charged no fees for inbound traffic that is generated when you write data to OSS over the Internet. You are charged fees for outbound traffic that is generated when you read data from OSS over the Internet. For information about the billable items of OSS, see Regions and endpoints.
- An internal endpoint is used for access over an internal network and supports communication between Alibaba Cloud services in the same region. For example, you can use a resource group for Data Integration in a region to access an OSS data source that is deployed in the same region. You are not charged for inbound or outbound traffic that is generated when you access OSS over an internal network. If the resource group for Data Integration and the OSS bucket reside in the same region, configure an internal endpoint for the OSS bucket. Otherwise, configure a public endpoint.
- For information about mappings between regions and endpoints, see Regions and endpoints.
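The endpoint rule above can be sketched in a few lines of Python. This is only an illustration: the function name `oss_endpoint` is hypothetical, and the host names assume the common `oss-<region>.aliyuncs.com` and `oss-<region>-internal.aliyuncs.com` pattern. Always confirm the actual endpoint for your region in Regions and endpoints.

```python
def oss_endpoint(bucket_region: str, resource_group_region: str) -> str:
    """Pick an OSS endpoint for a bucket (hypothetical helper).

    Assumes the common endpoint naming pattern; some regions or
    deployments may use different host names.
    """
    if bucket_region == resource_group_region:
        # Same region: the internal endpoint avoids Internet traffic fees.
        return f"oss-{bucket_region}-internal.aliyuncs.com"
    # Different regions: fall back to the public endpoint.
    return f"oss-{bucket_region}.aliyuncs.com"


same_region = oss_endpoint("cn-hangzhou", "cn-hangzhou")
cross_region = oss_endpoint("cn-hangzhou", "cn-shanghai")
```

In this sketch, `same_region` resolves to the internal host name and `cross_region` to the public one.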
Prepare data sources
Prepare an OSS data source
- AccessKey mode
  - Endpoint: the endpoint that is used to access the OSS data source. You can obtain the endpoint on the Overview page of the OSS bucket in the OSS console. For more information, see Obtain information about the OSS bucket.
  - AccessKey ID: the AccessKey ID that is used to access the OSS data source. You can obtain the AccessKey ID on the User Management page.
  - AccessKey Secret: the AccessKey secret that is used to access the OSS data source. The secret is equivalent to the password that is used to log on to the OSS console.
- RAM role authorization mode
  For information about how to access an OSS data source in RAM role authorization mode, see Use the RAM role-based authorization mode to add a data source.
After the OSS data source is prepared, select the exclusive resource group for Data Integration that you want to use to connect to the OSS data source and click Test connectivity. Make sure that the connectivity status is Connected. This status indicates that a network connection is established between the OSS data source and the exclusive resource group for Data Integration.
Prepare a MaxCompute data source
You can associate a MaxCompute compute engine with your DataWorks workspace to generate a MaxCompute data source. Alternatively, you can manually add a MaxCompute data source to your DataWorks workspace. For more information, see Associate a MaxCompute compute engine with a workspace and Add a MaxCompute data source.
Create a batch synchronization node
In the DataWorks console, find the required workflow in the Scheduled Workflow pane of the DataStudio page and create a batch synchronization node in the workflow. When you create the node, you must configure parameters such as Path and Name. For more information, see Configure a batch synchronization node by using the codeless UI.
Configure the source
Parameter | Description |
---|---|
Data source | The OSS data source that you prepared. |
Text type | The file format of the OSS object from which you want to read data. If you configure the batch synchronization node by using the codeless UI, the valid values of the parameter are csv and text. |
File Path | The path of the OSS object from which you want to read data. |
Column Delimiter | The column delimiter for the CSV or TXT file. |
Coding | The encoding format of the OSS object from which you want to read data. |
null value | The source strings that are written to the destination as NULL. For example, if nullFormat is set to null, Data Integration writes the source string null to the destination as NULL. |
Compression format | The compression format of the OSS object from which you want to read data. Valid values: Gzip, Bzip2, Zip, and None. The value None indicates that the object is not compressed. |
Skip Header | Specifies whether to skip the headers in a CSV-like object when the headers are used as titles. The default value of this parameter is No. Note: The headers in an object that is compressed cannot be skipped. |
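To make the effect of these source parameters concrete, the following Python sketch mimics them locally with the standard library. It is not how Data Integration is implemented. The function `read_object` and its argument names are hypothetical, only Gzip is handled, and the restriction on skipping headers in compressed objects is modeled as an error.

```python
import csv
import gzip
import io


def read_object(raw, delimiter=",", encoding="utf-8", null_format="null",
                compression=None, skip_header=False):
    """Parse object bytes into rows (hypothetical local stand-in)."""
    if compression is not None:
        if skip_header:
            # Mirrors the note above: headers in compressed objects cannot be skipped.
            raise ValueError("cannot skip the header of a compressed object")
        if compression.lower() != "gzip":
            raise NotImplementedError("this sketch only handles Gzip")
        raw = gzip.decompress(raw)
    text = raw.decode(encoding)                      # Coding
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if skip_header and rows:
        rows = rows[1:]                              # Skip Header
    # null value: cells equal to null_format become None (NULL at the destination).
    return [[None if cell == null_format else cell for cell in row]
            for row in rows]


sample = b"id,name\n1,alice\n2,null\n"
rows = read_object(sample, skip_header=True)
```

With the sample bytes above, the header line is dropped and the string `null` in the second data row is converted to `None`.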
Configure the destination
Parameter | Description |
---|---|
Data source | The MaxCompute data source that you prepared. A sample name of the automatically added MaxCompute data source is odps_first. If you use a DataWorks workspace in standard mode, the name of the MaxCompute project in the development environment and the name of the MaxCompute project in the production environment are displayed. |
Table | The name of the MaxCompute table to which you want to write data. If you use a DataWorks workspace in standard mode, you must make sure that the MaxCompute data source used in the production environment of the workspace contains a table whose name and schema are the same as those of a table in the MaxCompute data source used in the development environment of the workspace. Note: Before you configure the Table parameter, take note of the following items: |
Partition Info | The partition information. If the table is a partitioned table, you can set the value of this parameter to a value of the partition key column. |
Configure field mappings
After you configure the source and destination, you must configure the mappings between the fields in the source and the fields in the destination. You can click Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout to perform the related operation.
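The difference between the two automatic mapping options can be illustrated with a short Python sketch. The function names and field lists are hypothetical, and the logic only approximates the idea behind each button: same-name mapping pairs fields whose names match, and same-line mapping pairs fields by position.

```python
def map_same_name(source_fields, dest_fields):
    """Pair each source field with the destination field of the same name."""
    dest = set(dest_fields)
    return [(name, name) for name in source_fields if name in dest]


def map_same_line(source_fields, dest_fields):
    """Pair fields by position; extra fields on either side stay unmapped."""
    return list(zip(source_fields, dest_fields))


source = ["id", "name", "age"]
destination = ["name", "id", "city"]

by_name = map_same_name(source, destination)
by_line = map_same_line(source, destination)
```

Here `by_name` contains only the `id` and `name` pairs, whereas `by_line` pairs `id` with `name`, `name` with `id`, and `age` with `city`. The example shows why positional mapping should be reviewed when the column order differs between the source and the destination.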
Configure channel control policies
You can configure settings such as the maximum number of parallel threads and the maximum number of dirty data records allowed.
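The dirty data limit can be thought of as a threshold check during writing. The following Python sketch is a simplified model, not the Data Integration implementation: it assumes that a record counts as dirty when its conversion fails and that the job fails as soon as the dirty record count exceeds the configured maximum. The names `write_rows` and `DirtyDataLimitExceeded` are hypothetical.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when more dirty records are seen than the limit allows."""


def write_rows(rows, convert, max_dirty):
    """Convert rows one by one, tolerating at most max_dirty failures."""
    written, dirty = [], 0
    for row in rows:
        try:
            written.append(convert(row))
        except (ValueError, TypeError):
            dirty += 1
            if dirty > max_dirty:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {max_dirty}"
                )
    return written, dirty


# "x" cannot be converted to an integer, so it counts as one dirty record.
written, dirty = write_rows(["1", "x", "3"], int, max_dirty=1)
```

Under these assumptions, a limit of 1 lets the run finish with one skipped record, while a limit of 0 makes the same input fail on the first dirty record.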
Configure scheduling settings
- Configure the rerun attribute.
You can configure a rerun policy based on your business requirements. Rerunning a node can prevent node failures that are caused by occasional issues such as network jitter.
- Configure scheduling dependencies.
You can configure scheduling dependencies for the node based on your business requirements. You can configure the instance generated for the batch synchronization node in the current cycle to depend on the instance generated for the same node in the previous cycle. This way, you can ensure that instances generated for the node in different scheduling cycles can finish running in sequence and prevent multiple instances from running at the same time.
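The rerun attribute behaves much like a bounded retry loop. The following Python sketch shows the idea under stated assumptions: `run_with_rerun`, `max_reruns`, and `interval_seconds` are hypothetical names, and the flaky task only simulates an occasional network failure.

```python
import time


def run_with_rerun(task, max_reruns=3, interval_seconds=0.0):
    """Run task, rerunning it after a failure at most max_reruns times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_reruns:
                raise  # No reruns left: surface the failure.
            time.sleep(interval_seconds)


calls = {"n": 0}


def flaky_sync():
    """Simulated sync run that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network jitter")
    return "ok"


result, attempts = run_with_rerun(flaky_sync, max_reruns=3)
```

In this simulation the third attempt succeeds, so the transient failures never reach the caller. A failure that persists beyond the allowed reruns is still raised.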
Configure the exclusive resource group for Data Integration
In the right-side navigation pane of the configuration tab of the batch synchronization node, click Configure Resource Group for Data Integration to configure the exclusive resource group for Data Integration for the node. Select the exclusive resource group for Data Integration that is connected to the OSS data source and the MaxCompute data source.
Test, commit, and deploy the batch synchronization node
Test the batch synchronization node
In the top toolbar of the configuration tab of the batch synchronization node, click Run or Run with Parameters to test the node and check whether it runs as expected. Use Run with Parameters to verify that the scheduling parameters that you configured for the node are replaced as expected.
Commit and deploy the batch synchronization node
If the batch synchronization node runs as expected, save the node configurations, and then commit and deploy the node to Operation Center. After deployment, the node periodically reads data from the OSS object and writes the data to the MaxCompute table at intervals of minutes, hours, or days. For information about how to deploy a node, see Deploy nodes.
After the batch synchronization node is deployed, you can view the running results of the node in Operation Center and perform operations, such as data backfilling, on the node. For more information, see Basic O&M operations for auto triggered nodes.