Synchronize data from OSS to MaxCompute in offline mode - DataWorks

This topic describes how to configure a batch synchronization node to synchronize data from Object Storage Service (OSS) to MaxCompute. The example covers best practices for data source configuration, network connectivity, and synchronization node configuration.

Background information

OSS is a secure, cost-effective, and highly reliable cloud storage service that allows you to store large amounts of data. OSS is designed to provide 99.9999999999% (twelve 9's) data durability and 99.995% data availability. It provides multiple storage classes to help you manage and reduce storage costs. Data Integration supports data synchronization between OSS and other types of data sources. This topic provides an example of how to synchronize data from OSS to MaxCompute in offline mode.

Obtain information about the OSS bucket

Log on to the OSS console. On the Buckets page, find the OSS bucket from which you want to read data, click the name of the OSS bucket, and then click Overview. On the page that appears, you can view the public and internal endpoints of the OSS bucket. You can select different endpoints in different scenarios.

A public endpoint is used for access over the Internet. You are charged no fees for inbound traffic that is generated when you write data to OSS over the Internet. You are charged fees for outbound traffic that is generated when you read data from OSS over the Internet. For information about the billable items of OSS, see Regions and endpoints.
An internal endpoint is used for access over an internal network to support communications between Alibaba Cloud services in the same region. For example, you can use a resource group for Data Integration in a region to access an OSS data source that is deployed in the same region. You are charged no fees for inbound or outbound traffic that is generated when you access OSS over an internal network. If you want to use a resource group for Data Integration in a region to read data from or write data to an OSS bucket in the same region, you must configure an internal endpoint for the OSS bucket. Otherwise, you must configure a public endpoint.
For information about mappings between regions and endpoints, see Regions and endpoints.
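
The endpoint rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the region IDs are hypothetical, and the hostname pattern is the commonly documented OSS form, so confirm the actual endpoints on the Overview page of your bucket.

```python
def oss_endpoint(bucket_region: str, resource_group_region: str) -> str:
    """Pick an OSS endpoint for a resource group (illustrative helper, not a DataWorks API).

    Same region: internal endpoint. Different regions: public endpoint.
    """
    if bucket_region == resource_group_region:
        return f"oss-{bucket_region}-internal.aliyuncs.com"
    return f"oss-{bucket_region}.aliyuncs.com"


# Resource group and bucket in the same region: use the internal endpoint.
print(oss_endpoint("cn-hangzhou", "cn-hangzhou"))
# Resource group in another region: fall back to the public endpoint.
print(oss_endpoint("cn-hangzhou", "cn-shanghai"))
```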

Prepare data sources

Prepare an OSS data source

On the Data Source page in the DataWorks console, click Add data source. In the Add data source dialog box, select OSS. You can access an OSS data source in RAM role authorization mode or AccessKey mode:

AccessKey mode
- Endpoint: the endpoint that is used to access the OSS data source. You can obtain the endpoint on the Overview page of the OSS bucket in the OSS console. For more information, see Obtain information about the OSS bucket.
- AccessKey ID: the AccessKey ID that is used to access the OSS data source. You can obtain the AccessKey ID on the User Management page.
- AccessKey Secret: the AccessKey secret that is used to access the OSS data source. The secret is equivalent to the password that is used to log on to the OSS console.
RAM role authorization mode
For information about how to access an OSS data source in RAM role authorization mode, see Use the RAM role-based authorization mode to add a data source.

After the OSS data source is prepared, click Test connectivity. Select the exclusive resource group for Data Integration that you want to use to connect to the OSS data source, and make sure that the connectivity status is Connected. This status indicates that a network connection is established between the OSS data source and the exclusive resource group for Data Integration.

Prepare a MaxCompute data source

You can associate a MaxCompute compute engine with your DataWorks workspace to generate a MaxCompute data source. Alternatively, you can manually add a MaxCompute data source to your DataWorks workspace. For more information, see Associate a MaxCompute compute engine with a workspace and Add a MaxCompute data source.

Create a batch synchronization node

You can find the required workflow in the Scheduled Workflow pane of the DataStudio page in the DataWorks console and create a batch synchronization node in the workflow. When you create a batch synchronization node, you must configure the parameters such as Path and Name. For more information, see Configure a batch synchronization node by using the codeless UI.

Configure the source

In the Connections step, you can configure parameters related to the source. In this example, a batch synchronization node is created to synchronize incremental data from OSS to MaxCompute.


Parameter	Description
Data source	The OSS data source that you prepared.
Text type	The file format of the OSS object from which you want to read data. If you configure the batch synchronization node by using the codeless UI, the valid values are csv and text. text: a TXT file, which can store text in any format. csv: a comma-separated values (CSV) file, which stores tabular data such as numbers and text in plain text. A CSV file consists of any number of records that are separated by line breaks. Each record consists of fields that are separated by a delimiter, most commonly a comma or a tab character, although other characters or strings can be used. In most cases, the fields of each record in a CSV file are in the same order.
File Path	The path of the OSS object from which you want to read data. If you specify a single OSS object name, OSS Reader uses only a single thread to read data. If you specify multiple OSS object names, OSS Reader uses parallel threads to read data. You can configure the number of parallel threads based on your business requirements. If you specify a name that contains a wildcard, OSS Reader reads data from all objects that match the name. For example, if you set an OSS object name to abc*[0-9], OSS Reader reads data from objects such as abc0, abc1, abc2, and abc3. If you set an OSS object name to abc?.txt, OSS Reader reads data from objects whose names start with abc, end with .txt, and contain an arbitrary character between abc and .txt.
Column Delimiter	The column delimiter for the CSV or TXT file.
Coding	The encoding format of the OSS object from which you want to read data.
null value	The source strings that are written to the destination as NULL. For example, if nullFormat is set to null, Data Integration writes the source string null to the destination as NULL.
Compression format	The compression format of the OSS object from which you want to read data. Valid values: Gzip, Bzip2, Zip, and None. The value None indicates that the object is not compressed.
Skip Header	Specifies whether to skip the headers in a CSV-like object when the headers are used as titles. The default value of this parameter is No. Note: The headers in a compressed object cannot be skipped.
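
To make the File Path, Column Delimiter, null value, and Skip Header settings more concrete, the following Python sketch imitates their effect on a small in-memory sample. It is not the OSS Reader implementation: the function names are hypothetical, and Python's fnmatch glob rules are assumed to be close to, but not necessarily identical to, the wildcard rules that OSS Reader applies.

```python
import csv
import fnmatch
import io


def match_objects(pattern, object_names):
    """Return the object names that match a glob-style pattern (assumed semantics)."""
    return [name for name in object_names if fnmatch.fnmatchcase(name, pattern)]


def read_rows(text, delimiter=",", null_format="null", skip_header=False):
    """Parse delimited text; fields equal to null_format are returned as None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    if skip_header:
        next(reader, None)  # drop the title row
    return [[None if field == null_format else field for field in row]
            for row in reader]


names = ["abc0", "abc1", "abc1.txt", "abcd.txt", "xyz.txt"]
print(match_objects("abc*[0-9]", names))  # ['abc0', 'abc1']
print(match_objects("abc?.txt", names))   # ['abc1.txt', 'abcd.txt']
print(read_rows("id,name\n1,null\n2,bob\n", skip_header=True))
```

In the last call, the header row is dropped and the source string null in the first record is converted to None, which stands in for NULL in the destination.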

Configure the destination

In this example, a batch synchronization node is created to synchronize data from OSS to MaxCompute. The following table describes the parameters that can be configured for the destination.


Parameter	Description
Data source	The MaxCompute data source that you prepared. A sample name of the automatically added MaxCompute data source is odps_first. If you use a DataWorks workspace in standard mode, the name of the MaxCompute project in the development environment and the name of the MaxCompute project in the production environment are displayed.
Table	The name of the MaxCompute table to which you want to write data. If you use a DataWorks workspace in standard mode, you must make sure that the MaxCompute data source used in the production environment of the workspace contains a table whose name and schema are the same as those of a table in the MaxCompute data source used in the development environment of the workspace. Note: Before you configure the Table parameter, take note of the following items: (1) If the MaxCompute data source used in the development environment does not contain the table, the table is not displayed in the drop-down list when you configure the Table parameter. (2) If the MaxCompute data source used in the production environment does not contain the table, the batch synchronization node fails after the node is committed and deployed because the node cannot find the table in the production environment. (3) If the schema of the table used in the production environment is different from the schema of the table used in the development environment, the field mappings that are established when the batch synchronization node is run may differ from the field mappings that are configured in the Mappings step. As a result, data may be incorrectly written to the destination.
Partition Info	The partition information. If the table is a partitioned table, you can set the value of this parameter to a value of the partition key column. You can specify a constant as a value for this parameter. Example: `ds=20220101`. You can specify a scheduling parameter as a value for this parameter. Example: `ds=${bizdate}`. If you specify a scheduling parameter as a value for this parameter, the system automatically replaces the scheduling parameter with an actual value when the batch synchronization node is run.

Retain the default values for other parameters.
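
The replacement of a scheduling parameter in Partition Info can be pictured with the following Python sketch. It is a simplification rather than the DataWorks scheduler: the function name is hypothetical, and it assumes that ${bizdate} resolves to the day before the scheduled run date in yyyymmdd format.

```python
from datetime import date, timedelta


def render_partition(template: str, scheduled_date: date) -> str:
    """Replace ${bizdate} in a partition expression (illustrative helper only)."""
    # Assumption: bizdate is the day before the scheduled run date.
    bizdate = (scheduled_date - timedelta(days=1)).strftime("%Y%m%d")
    return template.replace("${bizdate}", bizdate)


# A scheduling parameter is resolved at run time.
print(render_partition("ds=${bizdate}", date(2022, 1, 2)))  # ds=20220101
# A constant value is left unchanged.
print(render_partition("ds=20220101", date(2022, 6, 30)))   # ds=20220101
```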

Configure field mappings

After you configure the source and destination, you must configure the mappings between the fields in the source and the fields in the destination. You can click Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout to perform the related operation.
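
The difference between mapping fields with the same name and mapping fields in the same line can be sketched as follows. The function names and field names are hypothetical and only illustrate the two strategies, not the codeless UI itself.

```python
def map_same_name(source_fields, destination_fields):
    """Pair each source field with the destination field of the same name."""
    destination = set(destination_fields)
    return [(field, field) for field in source_fields if field in destination]


def map_same_line(source_fields, destination_fields):
    """Pair fields by position: first with first, second with second, and so on."""
    return list(zip(source_fields, destination_fields))


source = ["id", "name", "age"]
destination = ["name", "id", "city"]
print(map_same_name(source, destination))  # [('id', 'id'), ('name', 'name')]
print(map_same_line(source, destination))  # [('id', 'name'), ('name', 'id'), ('age', 'city')]
```

The second result shows why the mapping mode matters: a positional mapping pairs id with name when the column order differs between the source and the destination.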

Configure channel control policies

You can configure settings such as the maximum number of parallel threads and the maximum number of dirty data records allowed.
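
As a rough illustration of a dirty data limit, the following Python sketch counts the records that fail conversion and stops once the count exceeds the allowed maximum. The names are hypothetical, and the exact point at which Data Integration fails a node is an assumption here, not documented behavior.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when more dirty records are seen than the configured maximum."""


def sync_rows(rows, convert, max_dirty_records):
    """Convert rows one by one and tolerate at most max_dirty_records failures."""
    written, dirty = [], 0
    for row in rows:
        try:
            written.append(convert(row))
        except (ValueError, TypeError):
            dirty += 1  # the record cannot be written to the destination type
            if dirty > max_dirty_records:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {max_dirty_records}"
                )
    return written, dirty


# One bad record is tolerated when the limit is 1.
print(sync_rows(["1", "x", "3"], int, max_dirty_records=1))  # ([1, 3], 1)
```

With a limit of 0, the same input raises DirtyDataLimitExceeded on the first bad record.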

Configure scheduling settings

In the right-side navigation pane of the configuration tab of the batch synchronization node, click Properties. You must configure the following important settings. For information about common scheduling settings and all scheduling settings that can be configured for a batch synchronization node, see the topics in the Schedule directory.

Configure the rerun attribute.
You can configure a rerun policy based on your business requirements. Rerunning a node helps it recover from failures caused by occasional issues such as network jitter.
Configure scheduling dependencies.
You can configure scheduling dependencies for the node based on your business requirements. You can configure the instance generated for the batch synchronization node in the current cycle to depend on the instance generated for the same node in the previous cycle. This way, you can ensure that instances generated for the node in different scheduling cycles can finish running in sequence and prevent multiple instances from running at the same time.
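
The effect of a rerun policy can be pictured with the following generic retry sketch. It does not use any DataWorks API: the function and parameter names (run_with_rerun, max_reruns, interval_seconds) are hypothetical, and a ConnectionError stands in for an occasional network issue.

```python
import time


def run_with_rerun(task, max_reruns=3, interval_seconds=0.0, sleep=time.sleep):
    """Run task and rerun it after a transient failure, up to max_reruns times.

    Returns the task result and the number of attempts that were made.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except ConnectionError:
            if attempts > max_reruns:
                raise  # all reruns are used up, so report the failure
            sleep(interval_seconds)


calls = {"count": 0}


def flaky_sync():
    """Fail twice with a simulated network error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network jitter")
    return "ok"


print(run_with_rerun(flaky_sync, max_reruns=3))  # ('ok', 3)
```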

Configure the exclusive resource group for Data Integration

In the right-side navigation pane of the configuration tab of the batch synchronization node, click Configure Resource Group for Data Integration to configure the exclusive resource group for Data Integration for the node. Select the exclusive resource group for Data Integration that is connected to the OSS data source and the MaxCompute data source.

Test, commit, and deploy the batch synchronization node

Test the batch synchronization node

In the top toolbar of the configuration tab of the batch synchronization node, you can click Run or Run with Parameters to test the batch synchronization node and check whether the node can run as expected. You can click Run with Parameters to test whether the scheduling parameters that you configured for the node are replaced as expected.

Commit and deploy the batch synchronization node

If the batch synchronization node runs as expected, save the node configurations, and then commit and deploy the node to Operation Center. The node can then periodically read data from the OSS object and write the data to the MaxCompute table at intervals of minutes, hours, or days. For information about how to deploy a node, see Deploy nodes.

After the batch synchronization node is deployed, you can view the running results of the node in Operation Center and perform operations, such as data backfilling, on the node. For more information, see Basic O&M operations for auto triggered nodes.