DataWorks is an end-to-end big data development and governance platform that provides data warehousing, data lake, and data lakehouse solutions based on big data compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), AnalyticDB, and Cloudera Data Platform (CDP). This topic describes how to use some core services and features of DataWorks to integrate data into DataWorks and perform business processing, periodic scheduling, and data visualization.
Introduction
You can quickly complete the following operations by referring to the instructions provided in this topic:
Data synchronization: You can use DataWorks Data Integration to create batch synchronization nodes to synchronize business data to a big data computing platform, such as MaxCompute.
Data cleansing: You can process, analyze, and mine business data in DataWorks DataStudio.
Data display: You can convert analysis results into graphs in DataWorks DataAnalysis for better understanding by business personnel.
Periodic scheduling: You can configure periodic scheduling settings for the data synchronization and data cleansing procedures so that they run on a recurring basis.
Prerequisites
Prepare an Alibaba Cloud account or a RAM user to which the AliyunDataWorksFullAccess policy is attached. For more information, see Prepare an Alibaba Cloud account or the topics in the Prepare a RAM user directory.
DataWorks provides a comprehensive permission management mechanism, which allows you to manage permissions at the product level and the module level. For information about how to perform fine-grained permission management, see Overview of the DataWorks permission management system.
Preparations
Create a workspace and associate a resource group with the workspace.
This tutorial describes how to get started with DataWorks in the China (Shanghai) region. You must log on to the DataWorks console and switch to the China (Shanghai) region to check whether DataWorks is activated in this region.
Note: In an actual business scenario, you can determine the region in which you want to activate DataWorks based on the location of your business data.
If your business data is stored in an Alibaba Cloud service other than DataWorks, you must select the region in which the Alibaba Cloud service is activated.
If your business data is stored on an on-premises machine, you must access the business data over the Internet. In this case, you must select a region that is close to your geographical location to reduce access latency.
If DataWorks is not activated in the China (Shanghai) region, you can click Purchase Product Portfolio for Free to enable the system to activate DataWorks, create a default workspace, and associate an automatically created resource group with the workspace at the same time.
If DataWorks is activated in the China (Shanghai) region, you must manually create a workspace and a resource group and associate the resource group with the workspace for this tutorial.
Configure an elastic IP address (EIP) for the VPC with which the resource group is associated.
In this tutorial, the public test business data of an e-commerce platform is used. You must obtain the business data over the Internet. By default, general-purpose resource groups cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the created resource group is associated and configure an EIP for the VPC to establish a network connection between the VPC and the network environment of the business data. This way, you can use the resource group to access the business data.
Procedure
This section walks you through specific features of DataWorks in the following scenario:
The commodity information and order information of your e-commerce platform are stored in a MySQL database. You want to periodically analyze the order data and view the ranking of the most popular commodity categories every day in a visualized manner.
Step 1: Synchronize data
Add data sources.
DataWorks allows you to connect to sources and destinations by adding data sources. In this step, a MySQL data source and a MaxCompute data source need to be added.
The MySQL data source is used to connect to the MySQL database that stores business data. The MySQL data source provides raw business data for this tutorial.
Note: You do not need to prepare raw business data for this tutorial. To facilitate testing and learning, DataWorks provides a test dataset. The required table data is stored in a MySQL database that can be accessed over the Internet. You only need to add a MySQL data source to DataWorks based on this database.
The MaxCompute data source is used to connect to the MaxCompute data warehouse. After you associate the MaxCompute data source with DataStudio, the MaxCompute data source can provide data storage and computing capabilities for this tutorial.
If a MaxCompute data source already exists in your workspace, you can skip the operations for adding a MaxCompute data source.
If no MaxCompute data source exists in your workspace, you can refer to the following steps to add a MaxCompute data source.
Associate the MaxCompute data source with DataStudio.
You can process data in the MaxCompute data source in DataStudio only after you associate the MaxCompute data source with DataStudio.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the DataStudio page. In the left-side navigation pane of the DataStudio page, click the Data Source icon. On the Data Source page, find the added MaxCompute data source and click Associate in the upper-right corner of the section that displays the information about the data source.
Note: If a MaxCompute data source is already associated with DataStudio, you can skip this operation.
Create a zero load node to manage the sales data analysis workflow for the e-commerce platform in a centralized manner. A zero load node is a dry-run node; you do not need to write code for it.
In the left-side navigation pane of the DataStudio page, click Scheduled Workflow. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose the option to create a zero load node. In the Create Node dialog box, configure the Name parameter based on your business requirements. In this example, the Name parameter is set to Workshop.
Create batch synchronization nodes.
The test data used in this tutorial is stored in two tables: the commodity information source table item_info and the order information source table trade_order. The two tables are stored in the MySQL database to which the MySQL data source corresponds. You must create two batch synchronization nodes to synchronize data in the tables from the MySQL database to the MaxCompute data warehouse to which the MaxCompute data source corresponds. Then, you can perform subsequent data development operations. In this tutorial, two batch synchronization nodes named ods_item_info and ods_trade_order are created. A sketch of the destination table schemas is shown below.
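The following DDL sketch shows what the two destination MaxCompute tables might look like after synchronization. The column names and data types are assumptions made for illustration; the actual schemas of the test dataset may differ. Each table is partitioned by pt so that each daily run writes to a separate partition.

-- Assumed schema of the synchronized commodity information table.
CREATE TABLE IF NOT EXISTS ods_item_info (
    item_id       STRING,
    item_name     STRING,
    category_name STRING
) PARTITIONED BY (pt STRING);

-- Assumed schema of the synchronized order information table.
CREATE TABLE IF NOT EXISTS ods_trade_order (
    order_id     STRING,
    item_id      STRING,
    buyer_id     STRING,
    order_amount DOUBLE,
    gmt_create   DATETIME
) PARTITIONED BY (pt STRING);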
Step 2: Cleanse data
After data is synchronized from MySQL to MaxCompute, the commodity information table ods_item_info and the order information table ods_trade_order are obtained. You can cleanse, process, and analyze data in the tables in DataStudio to obtain the daily ranking of the most popular commodity categories.
When you run each ODPS SQL node, the system displays the estimated cost of the SQL statements configured for the node, including the CREATE and INSERT statements. Because the table referenced by the INSERT statement has not been created at estimation time, the system may report an error indicating that the cost cannot be estimated. You can ignore the error and run the ODPS SQL node.
DataWorks provides scheduling parameters whose values are dynamically replaced in the code of a node based on the configurations of the scheduling parameters. You can define variables in the node code in the ${Variable} format and assign values to the variables in the Scheduling Parameter section of the Properties tab. For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters. In this example, the scheduling parameter $bizdate is used, which indicates the data timestamp of the node. The data timestamp of a node is one day earlier than the scheduling time of the node.
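For example, suppose the following query is configured for a node and the variable bizdate is assigned the value $bizdate on the Properties tab. If the node is scheduled to run on 20240801, ${bizdate} is replaced with 20240731 at run time:

-- ${bizdate} is replaced with the data timestamp, such as 20240731.
SELECT * FROM ods_trade_order WHERE pt = '${bizdate}';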
Create the dim_item_info node. This node processes commodity dimension data based on the ods_item_info table to generate the dimension table dim_item_info, which stores basic commodity information. A sketch of the node logic is shown below.
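A minimal ODPS SQL sketch of this node is shown below. The column names follow the assumed schemas above and are illustrative only; the actual processing logic of the tutorial dataset may differ.

-- Create the dimension table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS dim_item_info (
    item_id       STRING,
    item_name     STRING,
    category_name STRING
) PARTITIONED BY (pt STRING);

-- Load the daily partition of commodity dimension data.
INSERT OVERWRITE TABLE dim_item_info PARTITION (pt = '${bizdate}')
SELECT item_id, item_name, category_name
FROM ods_item_info
WHERE pt = '${bizdate}';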
Create the dwd_trade_order node. This node cleanses and transforms detailed order transaction data and applies business logic based on the ods_trade_order table to generate the fact table dwd_trade_order for order placement. A sketch of the node logic is shown below.
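A hedged sketch of this node follows. The cleansing rule shown (dropping records without an order ID) is an assumed example, not the tutorial's actual business logic.

-- Create the fact table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS dwd_trade_order (
    order_id     STRING,
    item_id      STRING,
    buyer_id     STRING,
    order_amount DOUBLE,
    gmt_create   DATETIME
) PARTITIONED BY (pt STRING);

-- Load the daily partition of cleansed order data.
INSERT OVERWRITE TABLE dwd_trade_order PARTITION (pt = '${bizdate}')
SELECT order_id, item_id, buyer_id, order_amount, gmt_create
FROM ods_trade_order
WHERE pt = '${bizdate}'
  AND order_id IS NOT NULL; -- assumed cleansing rule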
Create the dws_daily_category_sales node. This node aggregates the fact data that is cleansed and standardized at the data warehouse detail (DWD) layer based on the dwd_trade_order and dim_item_info tables to generate the aggregate table dws_daily_category_sales for daily commodity sales by category. A sketch of the node logic is shown below.
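A hedged sketch of the aggregation logic follows. The metric columns (order_cnt, sales_amount) are assumed names for illustration.

-- Create the aggregate table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS dws_daily_category_sales (
    category_name STRING,
    order_cnt     BIGINT,
    sales_amount  DOUBLE
) PARTITIONED BY (pt STRING);

-- Aggregate daily sales by commodity category.
INSERT OVERWRITE TABLE dws_daily_category_sales PARTITION (pt = '${bizdate}')
SELECT d.category_name,
       COUNT(o.order_id)   AS order_cnt,
       SUM(o.order_amount) AS sales_amount
FROM dwd_trade_order o
JOIN dim_item_info d
  ON o.item_id = d.item_id
 AND d.pt = '${bizdate}'
WHERE o.pt = '${bizdate}'
GROUP BY d.category_name;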
Create the ads_top_selling_categories node. This node generates the ads_top_selling_categories table based on the dws_daily_category_sales table. The ads_top_selling_categories table contains the daily ranking of the most popular commodity categories. A sketch of the node logic is shown below.
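A hedged sketch of the ranking logic follows. The use of RANK() and the column names are assumptions for illustration.

-- Create the application-layer table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS ads_top_selling_categories (
    category_name STRING,
    sales_amount  DOUBLE,
    sales_rank    BIGINT
) PARTITIONED BY (pt STRING);

-- Rank categories by daily sales amount.
INSERT OVERWRITE TABLE ads_top_selling_categories PARTITION (pt = '${bizdate}')
SELECT category_name,
       sales_amount,
       RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank
FROM dws_daily_category_sales
WHERE pt = '${bizdate}';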
Step 3: Display data
After the raw test data obtained from the MySQL database is processed in DataStudio and aggregated into the ads_top_selling_categories table, you can query the table data and view data analysis results.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the DataAnalysis page. In the left-side navigation tree of the DataAnalysis page, move the pointer over My Files, click the icon next to it, and then click Create File. In the Create File dialog box, enter a name in the File Name field based on your business requirements and click OK.
On the configuration tab that appears, enter the following SQL statement:
SELECT * FROM ads_top_selling_categories WHERE pt=${bizdate};
In the top toolbar of the configuration tab, click the Run icon. In the upper-right corner of the configuration tab, select a MaxCompute data source from the Data Source Name drop-down list and click OK. In the Estimate Costs dialog box, click Run.
On the tab that displays the execution result, click the icon to view the visualization result. You can click the icon in the upper-right corner of the tab to customize the graph pattern.
In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the DataAnalysis page, click the Cards icon to view the saved card.
Step 4: Enable the system to periodically schedule the batch synchronization nodes and the ODPS SQL nodes
After you perform the preceding operations, you have obtained the sales data of commodities of different categories on the previous day. If you want to obtain the latest sales data every day, you can configure scheduling settings for all the nodes that are created in DataStudio. This way, the system can periodically schedule the nodes.
To simplify operations, scheduling settings for the nodes in this tutorial are configured in a visualized manner. DataWorks also allows you to manually configure scheduling settings for nodes in a fine-grained manner, and you can use the automatic parsing feature to enable the system to automatically parse scheduling dependencies based on node code. For more information about scheduling settings, see the topics in the Schedule directory.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the DataStudio page. In the Scheduled Workflow pane, click Business Flow and double-click Workflow. On the canvas that appears on the right, move the nodes and draw lines to connect the nodes based on the instructions shown in the following figure.
In the right-side navigation pane, click Workflow Parameters. On the Workflow Parameters tab, set the Parameter Name parameter to bizdate and the Value/Expression parameter to $bizdate, and then click Save.
Double-click the Workshop zero load node and click Properties in the right-side navigation pane. On the Properties tab, configure the parameters by referring to the instructions shown in the following figure. In the top toolbar of the configuration tab of the zero load node, click the Save icon.
Note: Retain the default values for the other parameters.
Switch to the configuration tab of the Workflow workflow and click Run in the top toolbar. In the Enter parameters dialog box, set the value of bizdate to the date of the previous day and click OK to test whether all nodes in the workflow can be successfully run. For example, if the current day is 20240731, you can set the value of bizdate to 20240730.
If all nodes in the workflow are successfully run, click Submit in the top toolbar to commit all the nodes to Operation Center.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the Operation Center page. In the left-side navigation pane of the Operation Center page, choose the menu item for auto triggered nodes. On the page that appears, view the created auto triggered nodes.
Note: If you want to present all nodes that have scheduling dependencies as shown in the following figure, you can right-click the Workshop node in the directed acyclic graph (DAG) of the node and choose the option to show its descendant nodes.
What to do next
For information about details of operations performed in the modules involved in this tutorial and parameter descriptions, see Overview of Data Integration, Overview of DataStudio, Overview of DataAnalysis, and the topics in the Schedule directory.
In addition to the modules that are used in this tutorial, DataWorks also provides other modules, such as Data Modeling, Data Quality, Data Security Guard, and DataService Studio. These modules help you perform end-to-end data monitoring and O&M.
You can also experience more DataWorks tutorials. For more information, see Tutorials for different business scenarios.