This topic describes how to use DataWorks Data Integration to synchronize data between heterogeneous data sources. You can use this method to synchronize data to data warehouses. In the example in this topic, a batch synchronization task in Data Integration is used to synchronize basic user information stored in an ApsaraDB RDS for MySQL table named ods_user_info_d to a MaxCompute table named ods_user_info_d and synchronize website access logs stored in an Object Storage Service (OSS) object named user_log.txt to another MaxCompute table named ods_raw_log_d.
Prerequisites
The basic user information and website access logs of users are prepared and are stored in an ApsaraDB RDS for MySQL instance and an OSS bucket. You can directly register and use the prepared data in DataWorks. You do not need to activate ApsaraDB RDS for MySQL and OSS or prepare test data. You only need to make sure that the following requirements are met:
A DataWorks workspace is created. In this example, a workspace in standard mode named WorkShop2024_01 is used. You can specify a name based on your business requirements.
A MaxCompute data source is added. In this example, a MaxCompute data source named odps_first is added. The name of the MaxCompute project used in the production environment is workshop2024_01, and the name of the MaxCompute project used in the development environment is workshop2024_01_dev.
Optional. If you perform operations in this example as a RAM user, make sure that the AliyunBSSOrderAccess and AliyunDataWorksFullAccess policies are attached to the RAM user. For information about authorization, see Grant permissions to a RAM user.
Quick start
In this experiment, tasks for data synchronization and data processing can be imported with one click through an extract, transform, and load (ETL) workflow template. After the template is imported, you can go to the desired workspace and complete subsequent data quality monitoring and data visualization operations.
Only users that are assigned the Workspace Administrator role can import an ETL workflow template to a desired workspace. For more information about how to assign a Workspace Administrator role, see Manage permissions on workspace-level services.
For information about quick access to an ETL workflow template, go to the Website User Behavior Analysis page.
Background information
Data Integration is a stable, efficient, and scalable data synchronization service. It can efficiently transmit and synchronize data between heterogeneous data sources in complex network environments. Data Integration provides different types of synchronization solutions such as batch synchronization, incremental data synchronization, and real-time full and incremental data synchronization.
In this example, a batch synchronization solution is used. DataWorks encapsulates the batch synchronization capabilities of Data Integration into batch synchronization nodes. Each batch synchronization node represents a synchronization task. You can configure a source and a destination for a node to define data transmission between the data sources and configure field mappings to define the read and write relationships between source fields and destination fields.
In this example, the test data and data sources that are required are prepared. To access the test data from your workspace, you only need to add the data source information to your workspace.
All the data in this experiment is manually mocked. The data can be used only for experimental operations in DataWorks and can only be read in Data Integration.
Information about source data and destination tables
The Data Integration service is used to synchronize basic user information stored in ApsaraDB RDS for MySQL and website access logs stored in OSS to MaxCompute. The following table describes information about source data and destination tables.
Source | Destination (MaxCompute)
MySQL table: ods_user_info_d | Table: ods_user_info_d
OSS object: user_log.txt | Table: ods_raw_log_d
Step 1: Purchase and configure a serverless resource group
In this example, synchronization tasks are used to synchronize data stored in OSS and ApsaraDB RDS for MySQL to MaxCompute. The synchronization tasks run on a serverless resource group. Therefore, you must purchase and configure a serverless resource group first.
Purchase a serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group.
On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of general-purpose resource groups.
Note: In this example, a serverless resource group that is deployed in the China (Shanghai) region is used. Note that serverless resource groups do not support cross-region operations.
Configure the serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group.
Find the serverless resource group that you purchased, click Associate Workspace in the Actions column, and then associate the resource group with the DataWorks workspace that you created as prompted.
Enable the serverless resource group to access the Internet.
Log on to the VPC console and go to the Internet NAT Gateway page. In the top navigation bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway. Configure the parameters that are described in the following table.
Region: Select China (Shanghai).
VPC and Associate vSwitch: Select the virtual private cloud (VPC) and vSwitch with which the resource group is associated. To obtain the VPC and vSwitch, perform the following steps: Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, click Resource Groups. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC?
Access Mode: Select SNAT-enabled Mode.
EIP: Select Purchase EIP.
Create Service-Linked Role: Click Create Service-Linked Role to create a service-linked role. This step is required only if this is the first time you create an Internet NAT gateway.
Note: Retain the default values for the parameters that are not described in the preceding list.
Click Buy Now. On the Confirm page, read the terms of service, select the Terms of Service check box, and then click Activate Now.
For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.
Step 2: Add data sources
In this example, you must add an HttpFile data source named user_behavior_analysis_httpfile and an ApsaraDB RDS for MySQL data source named user_behavior_analysis_mysql to your workspace so that you can access the test data. The basic information about the data sources used for the test is provided.
Before you configure a Data Integration synchronization task, you can add and configure the source and destination databases or data warehouses on the Data Source page in the DataWorks console. This allows you to search for the data sources by name when you configure the synchronization task to determine the source and destination databases or data warehouses that you want to use.
All the data in this experiment is manually mocked. The data can be used only for experimental operations in DataWorks and can only be read in Data Integration.
The test data for the HttpFile and ApsaraDB RDS for MySQL data sources that you want to add in this step is stored on the Internet. Make sure that an Internet NAT gateway is configured for your DataWorks resource group according to Step 1. Otherwise, the following errors are reported when you test the connectivity:
HttpFile:
ErrorMessage:[Connect to dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com:443 [dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com/106.14.XX.XX] failed: connect timed out]
MySQL:
ErrorMessage:[Exception:Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.<br><br>ExtraInfo:Resource Group IP:****,detail version info:mysql_all],Root Cause:[connect timed out]
Add the HttpFile data source named user_behavior_analysis_httpfile
Add the HttpFile data source to your workspace. Then, test whether a network connection is established between the data source and the resource group you want to use for data synchronization. The HttpFile data source is used to read the website access test data of users that is stored in OSS and can be accessed from DataWorks.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, choose
.
Add the HttpFile data source.
On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, click HttpFile.
On the Add HttpFile Data Source page, configure the parameters. The following table describes the parameters.
Data Source Name: The name of the data source, which identifies the data source in your workspace. In this example, enter user_behavior_analysis_httpfile.
Data Source Description: The description of the data source. The data source is provided exclusively for DataWorks use cases and is used as the source of a batch synchronization task to access the provided test data. The data source can only be read in data synchronization scenarios.
Environment: Select Development Environment and Production Environment.
Note: You must add the data source in both the development and production environments. Otherwise, an error is reported when the related task is run to produce data.
URL Domain: The URL of the OSS bucket. Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.
Connection Configuration: In the Connection Configuration section, find the serverless resource group that you purchased and click Test Network Connectivity in the Connection Status column. You need to separately test the network connections between the resource group and the data sources in the development and production environments. After the system returns a message indicating that the test is successful, the connectivity status changes to Connected.
Important: The test data for the HttpFile data source that you want to add in this step is stored on the Internet. Make sure that an Internet NAT gateway is configured for your DataWorks resource group according to Step 1. Otherwise, the following error is reported when you test the connectivity:
ErrorMessage:[Connect to dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com:443 [dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com/106.14.XX.XX] failed: connect timed out]
Add the ApsaraDB RDS for MySQL data source named user_behavior_analysis_mysql
Add the ApsaraDB RDS for MySQL data source to your workspace. Then, test whether a network connection is established between the data source and the resource group that you want to use for data synchronization. The ApsaraDB RDS for MySQL data source is used to read the basic user information that is stored in ApsaraDB RDS for MySQL and can be accessed from DataWorks.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, choose
.
Add the ApsaraDB RDS for MySQL data source.
On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, click MySQL.
On the Add MySQL Data Source page, configure the parameters. The following table describes the parameters.
Configuration Mode: Set this parameter to Connection String Mode.
Data Source Name: The name of the data source. Enter user_behavior_analysis_mysql.
Data Source Description: The description of the data source. The data source is provided exclusively for DataWorks use cases and is used as the source of a batch synchronization task to access the provided test data. The data source can only be read in data synchronization scenarios.
Environment: Select Development and Production.
Note: You must add the data source in both the development and production environments. Otherwise, an error is reported when the related task is run to produce data.
Connection Address: Host IP Address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. Port Number: 3306.
Database Name: workshop
Username: workshop
Password: workshop#2017
Authentication Method: Set this parameter to No Authentication.
Connection Configuration: In the Connection Configuration section, find the serverless resource group that you purchased and click Test Network Connectivity in the Connection Status column. You need to separately test the network connections between the resource group and the data sources in the development and production environments. After the system returns a message indicating that the test is successful, the connectivity status changes to Connected.
Important: The test data for the ApsaraDB RDS for MySQL data source that you want to add in this step is stored on the Internet. Make sure that an Internet NAT gateway is configured for your DataWorks resource group according to Step 1. Otherwise, the following error is reported when you test the connectivity:
ErrorMessage:[Exception:Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.<br><br>ExtraInfo:Resource Group IP:****,detail version info:mysql_all],Root Cause:[connect timed out]
Step 3: Create a workflow
Design a workflow based on requirement analysis. Create and use two batch synchronization nodes named ods_raw_log_d and ods_user_info_d to synchronize basic user information from ApsaraDB RDS for MySQL and website access logs from OSS. Then, create a zero load node named workshop_start to manage nodes in the workflow in a centralized manner. This topic describes only the data synchronization procedure. Specific task configurations are not included.
1. Create a workflow
By default, DataWorks provides a workflow named workflow. You can skip the workflow creation steps and directly use the provided workflow.
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create a workflow.
In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow. In the Create Workflow dialog box, configure the Workflow Name parameter based on your business requirements. In this example, WorkShop is used.
2. Design the workflow
Go to the configuration tab of the workflow.
Double-click the name of the workflow created in the previous step to go to the configuration tab of the workflow.
Create nodes.
You can design the workflow by performing drag-and-drop operations on the configuration tab of the workflow. Click Create Node, find the type of node that you want to create, and drag it to the workflow canvas on the right.
In this example, you need to create a zero load node named workshop_start and two batch synchronization nodes named ods_raw_log_d and ods_user_info_d. The ods_raw_log_d node is used to synchronize website access logs from OSS and the ods_user_info_d node is used to synchronize basic user information from ApsaraDB RDS for MySQL.
Configure scheduling dependencies for nodes.
Configure the workshop_start node as the ancestor node of the two batch synchronization nodes. In this example, no data lineage exists between the zero load node and batch synchronization nodes. Therefore, you can draw lines to configure the scheduling dependencies between the zero load node workshop_start and batch synchronization nodes ods_raw_log_d and ods_user_info_d in the workflow. For more information about the methods for configuring scheduling dependencies, see Scheduling dependency configuration guide.
Step 4: Create MaxCompute tables
You must create the MaxCompute tables that are used to store the data synchronized by using Data Integration in advance. In this example, the tables are quickly created by using DDL statements. For more information about MaxCompute table-related operations, see Create and manage MaxCompute tables.
Go to the entry point for creating tables.
Create a table named ods_raw_log_d.
In the Create Table dialog box, enter ods_raw_log_d in the Name field. In the upper part of the table configuration tab, click DDL, enter the following table creation statement, and then click Generate Table Schema. In the Confirm dialog box, click Confirmation to overwrite the original configurations.
CREATE TABLE IF NOT EXISTS ods_raw_log_d
(
    col STRING
)
PARTITIONED BY
(
    dt STRING
)
LIFECYCLE 7;
Create a table named ods_user_info_d.
In the Create Table dialog box, enter ods_user_info_d in the Name field. In the upper part of the table configuration tab, click DDL, enter the following table creation statement, and then click Generate Table Schema. In the Confirm dialog box, click Confirmation to overwrite the original configurations.
CREATE TABLE IF NOT EXISTS ods_user_info_d
(
    uid STRING COMMENT 'The user ID',
    gender STRING COMMENT 'The gender',
    age_range STRING COMMENT 'The age range',
    zodiac STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY
(
    dt STRING
)
LIFECYCLE 7;
Commit and deploy the tables.
After you confirm that the table information is valid, click Commit to Development Environment and Commit to Production Environment in sequence on the configuration tabs of the ods_user_info_d and ods_raw_log_d tables. The system then creates the related physical tables in the MaxCompute projects that are associated with the workspace in the development and production environments.
Note: After you define the schema of a table, you can commit the table to the development and production environments. After the table is committed, you can view the table in the MaxCompute project in a specific environment.
If you commit the tables to the development environment of the workspace, the tables are created in the MaxCompute project that is associated with the workspace in the development environment.
If you commit the tables to the production environment of the workspace, the tables are created in the MaxCompute project that is associated with the workspace in the production environment.
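Optionally, you can confirm that the physical tables were created by running DESC statements from an ad hoc query. The following is a minimal sketch that assumes the development project name workshop2024_01_dev from the prerequisites; replace the project name with your own if it differs.
-- Check that the tables exist in the development project (project name is an assumption from the prerequisites).
DESC workshop2024_01_dev.ods_raw_log_d;
DESC workshop2024_01_dev.ods_user_info_d;
-- After the tables are committed to the production environment, the same check can be run against the production project, for example workshop2024_01.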
Step 5: Configure a zero load node named workshop_start
In this example, the workshop_start
node is run at 00:15 every day to trigger the running of the user profile analysis workflow. The workshop_start
node is a management and control node. You do not need to write code for the node. The following content describes the scheduling configurations of the node:
Go to the configuration tab of the workshop_start node.
On the DataStudio page, double-click the name of the workflow created in the previous step to go to the configuration tab of the workflow. Double-click the zero load node workshop_start. On the configuration tab of the zero load node, click Properties in the right-side navigation pane.
Configure the scheduling properties for the workshop_start node.
Configure the following settings for the workshop_start node to allow the root node of the workspace to trigger the running of the workshop_start node at 00:15 every day:
Configure the scheduling time: Set Scheduling Cycle to Day and Scheduled time to 00:15.
Configure the rerun properties: Set Rerun to Allow Regardless of Running Status.
Configure scheduling dependencies: Configure the root node of the workspace as the ancestor node of the workshop_start node.
In this example, the workshop_start node is a management and control node and does not depend on other nodes. Therefore, you can configure the workshop_start node to depend on the root node of the workspace. This way, the root node of the workspace can be used to trigger the running of the current user profile analysis workflow.
Note: By default, the root node of a workspace is automatically generated after you create the workspace. In most cases, all nodes in a workspace depend on the root node of the workspace. By default, the root node of a workspace starts to trigger all its level-1 descendant nodes to run at 00:00. The configurations of the root node cannot be changed.
Step 6: Configure the ods_raw_log_d node
In this step, you can configure the ods_raw_log_d node to synchronize website access logs of users from the OSS object user_log.txt to the MaxCompute table ods_raw_log_d.
On the DataStudio page, double-click the ods_raw_log_d node in the workflow. On the configuration tab of the node, configure the node.
1. Establish network connections between the data sources and the resource group that you want to use
In this step, a serverless resource group is used. You need to test the network connectivity between the resource group and both the source user_behavior_analysis_httpfile and the destination MaxCompute data source.
Set Source to HttpFile and Data Source Name to user_behavior_analysis_httpfile, which is the data source that you added in Step 2: Add data sources.
In the Resource Group step, select the serverless resource group that you purchased from the drop-down list.
Set Destination to MaxCompute and select the data source that you added.
2. Configure a synchronization task
Configure basic settings of the synchronization task.
For the source, you need to configure the user_log.txt object as the object from which you want to read data.
For the destination, you need to configure the ods_raw_log_d table as the table to which you want to write data. You also need to configure the Partition information parameter. The bizdate variable is defined in the ${} format in the Partition information field. The variable value is assigned in the subsequent Step 3.
Configure field mappings and general settings.
DataWorks allows you to configure mappings between source fields and destination fields to read data from the specified source fields and write data to the specified destination fields. In the Channel Control section, you can also configure settings such as the parallelism for data reads and writes, the maximum transmission rate that prevents data synchronization from affecting database performance, the policy for dirty data records, and distributed execution. In this example, the default settings are used. For information about other configuration items for a synchronization task, see Configure a batch synchronization task by using the codeless UI.
3. Configure scheduling properties
If you configure the following scheduling properties for the synchronization task, DataWorks Data Integration synchronizes data from the OSS object user_log.txt to the time-based partition in the MaxCompute table ods_raw_log_d at 00:15 every day.
In the Parameters section, enter bizdate for Parameter Name and $bizdate for Parameter Value, which is used to query the date of the previous day. The format of the parameter values is yyyymmdd.
In the Schedule section, set Scheduling Cycle to Day. You do not need to separately configure the Scheduled time parameter for the current node. The time when the current node is scheduled to run every day is determined by the scheduling time of the zero load node workshop_start of the workflow. The current node is scheduled to run after 00:15 every day.
Configure the parameters in the Dependencies section:
Determine the ancestor nodes of the current node: Check whether the workshop_start node is displayed in Parent Nodes for the current node. The node that you specified as the ancestor node of the current node by drawing lines is displayed. If the workshop_start node is not displayed, check whether the workflow design in the business data synchronization phase has been completed by referring to 2. Design the workflow.
In this example, when the scheduling time of the workshop_start node arrives and the node finishes running, the current node is triggered to run.
Determine the output of the current node: Check whether an output named in the format Name of the MaxCompute project in the production environment.ods_raw_log_d exists for the current node. You can go to the Workspace page in SettingCenter to view the name of the MaxCompute project in the production environment.
Note: In DataWorks, the output of a node is used to configure scheduling dependencies between the node and its descendant nodes. If an SQL node depends on a synchronization node, DataWorks uses the automatic parsing feature to quickly configure the synchronization node as the ancestor node of the SQL node based on the table lineage when the SQL node starts to process the output table of the synchronization node. You need to confirm that a node output that has the same name as the output table ods_raw_log_d exists.
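After this node has run, you can optionally preview a few synchronized log lines with an ad hoc query. This is only a sketch: it assumes the node ran with the data timestamp 20230620, the example date used later in this tutorial; replace the partition value with the data timestamp of your own run.
-- Preview a few raw log records written to the example partition (date value is an assumption).
SELECT col FROM ods_raw_log_d WHERE dt = '20230620' LIMIT 5;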
Step 7: Configure the ods_user_info_d node
In this step, you can configure the ods_user_info_d node to synchronize basic user information from the ApsaraDB RDS for MySQL table ods_user_info_d to the MaxCompute table ods_user_info_d.
On the DataStudio page, double-click the ods_user_info_d node in the workflow. On the configuration tab of the node, configure the node.
1. Establish network connections between the data sources and the resource group that you want to use
In this step, the serverless resource group is used. You need to test the network connectivity between the resource group and both the source user_behavior_analysis_mysql and the destination MaxCompute data source.
Set Source to MySQL and Data Source Name to user_behavior_analysis_mysql.
In the Resource Group step, select the serverless resource group that you purchased from the drop-down list.
Set Destination to MaxCompute and select the data source that you added.
2. Configure a synchronization task
Configure basic settings of the synchronization task.
For the source, you need to configure the ods_user_info_d table as the table from which you want to read data.
For the destination, you need to configure the ods_user_info_d table as the table to which you want to write data. You also need to configure the Partition information parameter. The bizdate variable is defined in the ${} format in the Partition information field. The variable value is assigned in the subsequent Step 3.
Note: In this example, full data is read from the ApsaraDB RDS for MySQL table and written to the specified time-based partition of the MaxCompute table by default.
Configure field mappings and general settings.
DataWorks allows you to configure mappings between source fields and destination fields to read data from the specified source fields and write data to the specified destination fields. In the Channel Control section, you can also configure settings such as the parallelism for data reads and writes, the maximum transmission rate that prevents data synchronization from affecting database performance, the policy for dirty data records, and distributed execution. In this example, the default settings are used. For information about other configuration items for a synchronization task, see Configure a batch synchronization task by using the codeless UI.
3. Configure scheduling properties
If you configure the following scheduling properties for the synchronization task, DataWorks Data Integration synchronizes data from the ApsaraDB RDS for MySQL table ods_user_info_d to the time-based partition in the MaxCompute table ods_user_info_d at 00:15 every day.
In the Parameters section, enter bizdate for Parameter Name and $bizdate for Parameter Value, which is used to query the date of the previous day. The format of the parameter values is yyyymmdd.
In the Schedule section, set Scheduling Cycle to Day. You do not need to separately configure the Scheduled time parameter for the current node. The time when the current node is scheduled to run every day is determined by the scheduling time of the zero load node workshop_start of the workflow. The current node is scheduled to run after 00:15 every day.
Configure the parameters in the Dependencies section:
Determine the ancestor nodes of the current node: Check whether the workshop_start node is displayed in Parent Nodes for the current node. The node that you specified as the ancestor node of the current node by drawing lines is displayed. If the workshop_start node is not displayed, check whether the workflow design in the business data synchronization phase has been completed by referring to 2. Design the workflow.
In this example, when the scheduling time of the workshop_start node arrives and the node finishes running, the current node is triggered to run.
Determine the output of the current node: Check whether an output named in the format Name of the MaxCompute project in the production environment.ods_user_info_d exists for the current node. You can go to the Workspace page in SettingCenter to view the name of the MaxCompute project in the production environment.
Note: In DataWorks, the output of a node is used to configure scheduling dependencies between the node and its descendant nodes. If an SQL node depends on a synchronization node, DataWorks uses the automatic parsing feature to quickly add the synchronization node as the ancestor node of the SQL node based on the table lineage when the SQL node starts to process the output table of the synchronization node. You need to confirm that a node output that has the same name as the output table ods_user_info_d exists.
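If you also want to check that the field mappings for this node landed as expected after it runs, a small sample query against the written partition is enough. This sketch again assumes the example data timestamp 20230620; use the data timestamp of your own run instead.
-- Spot-check the mapped columns in the example partition (date value is an assumption).
SELECT uid, gender, age_range, zodiac FROM ods_user_info_d WHERE dt = '20230620' LIMIT 10;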
Step 8: Run the nodes in the WorkShop workflow and view the result
Run the nodes in the current workflow to write the basic user information in ApsaraDB RDS for MySQL and the website access logs of users in OSS to the related MaxCompute tables.
Run the workflow
On the DataStudio page, double-click the WorkShop workflow under Business Flow. On the configuration tab of the workflow, click the icon in the top toolbar to run the nodes in the workflow based on the scheduling dependencies between the nodes.
Confirm the status.
View the node status: If a node is in the successful state, the synchronization process is normal.
View the node run logs: For example, right-click the ods_user_info_d or ods_raw_log_d node and select View Logs. If the logs show that the node ran successfully, data has been synchronized.
Query synchronization results
If the nodes in the workflow are run as expected, all basic user information in the ApsaraDB RDS for MySQL table ods_user_info_d is synchronized to the partition of the previous day in the output table workshop2024_01_dev.ods_user_info_d, and all website access logs of users in the OSS object user_log.txt are synchronized to the partition of the previous day in the output table workshop2024_01_dev.ods_raw_log_d. You do not need to deploy query SQL statements to the production environment for execution. Therefore, you can query synchronization results by creating an ad hoc query.
Create an ad hoc query.
In the left-side navigation pane of the DataStudio page, click the icon. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose .
Query synchronization result tables.
Execute the following SQL statements to confirm the data write results. View the number of records that are imported into the ods_raw_log_d and ods_user_info_d tables.
-- You must specify the data timestamp of the data on which you perform read and write operations as the filter condition for partitions.
-- For example, if a node is scheduled to run on June 21, 2023, the data timestamp of the node is 20230620, which is one day earlier than the node running date.
select count(*) from ods_user_info_d where dt=Data timestamp;
select count(*) from ods_raw_log_d where dt=Data timestamp;
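For example, assuming the nodes ran on June 21, 2023, so that the data timestamp is 20230620, the queries would look like the following. Substitute the data timestamp of your own run.
-- Example only: replace 20230620 with the data timestamp of the run you want to check.
select count(*) from ods_user_info_d where dt='20230620';
select count(*) from ods_raw_log_d where dt='20230620';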
Note: In this example, nodes are run in DataStudio, which is the development environment. Therefore, data is written to the specified tables in the MaxCompute project workshop2024_01_dev that is associated with the workspace in the development environment by default.
Step 9: Commit the workflow
After the node code is debugged, commit the user profile analysis workflow to the scheduling system for periodic scheduling so that the raw business data is periodically synchronized to the MaxCompute destination tables.
Go to the configuration tab of the workflow.
On the DataStudio page, double-click the workflow name WorkShop to go to the configuration tab of the workflow.
Commit the workflow.
On the workflow canvas, click the icon in the top toolbar to commit the workflow.
Confirm the commit operation.
In the Commit dialog box, select all nodes in the current workflow and select Ignore I/O Inconsistency Alerts. Confirm the configurations and click Confirm to commit all nodes in the workflow. On the Deploy page, select and deploy the nodes to the production environment.
What to do next
Now that you understand how to synchronize data based on this tutorial, you can proceed to the next tutorial, in which you will learn how to compute and analyze the synchronized data. For more information, see Process data.