DataWorks is an end-to-end big data development and governance platform that provides data warehousing, data lake, and data lakehouse solutions based on big data compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), AnalyticDB, and Cloudera Data Platform (CDP). This topic describes how to use some core services and features of DataWorks to integrate data into DataWorks and perform business processing, periodic scheduling, and data visualization.
Introduction
You can quickly complete the following operations by referring to the instructions provided in this topic:
Data synchronization: You can use DataWorks Data Integration to create batch synchronization nodes to synchronize business data to a big data computing platform, such as MaxCompute.
Data cleansing: You can process, analyze, and mine business data in DataWorks DataStudio.
Data display: You can convert analysis results into graphs in DataWorks DataAnalysis for better understanding by business personnel.
Periodic scheduling: You can configure periodic scheduling settings for the data synchronization and data cleansing procedures so that they run on a recurring basis.
Prerequisites
Prepare an Alibaba Cloud account or a RAM user to which the AliyunDataWorksFullAccess policy is attached. For more information, see Prepare an Alibaba Cloud account or the topics in the Prepare a RAM user directory.
DataWorks provides a comprehensive permission management mechanism, which allows you to manage permissions at the product level and the module level. For information about how to perform fine-grained permission management, see Overview of the DataWorks permission management system.
Preparations
Create a workspace and associate a resource group with the workspace.
This tutorial describes how to get started with DataWorks in the China (Shanghai) region. You must log on to the DataWorks console and switch to the China (Shanghai) region to check whether DataWorks is activated in this region.
Note: In an actual business scenario, you can determine the region in which you want to activate DataWorks based on the location of your business data.
If your business data is stored in an Alibaba Cloud service other than DataWorks, you must select the region in which the Alibaba Cloud service is activated.
If your business data is stored on an on-premises machine, you must access the business data over the Internet. In this case, you must select a region that is close to your geographical location to reduce access latency.
If DataWorks is not activated in the China (Shanghai) region, you can click Purchase Product Portfolio for Free to enable the system to activate DataWorks, create a default workspace, and associate an automatically created resource group with the workspace at the same time.
If DataWorks is activated in the China (Shanghai) region, you must manually create a workspace and a resource group and associate the resource group with the workspace for this tutorial.
Configure an elastic IP address (EIP) for the VPC with which the resource group is associated.
In this tutorial, the public test business data of an e-commerce platform is used. You must obtain the business data over the Internet. By default, general-purpose resource groups cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the created resource group is associated and configure an EIP for the VPC to establish a network connection between the VPC and the network environment of the business data. This way, you can use the resource group to access the business data.
Procedure
This section walks you through specific features of DataWorks in the following scenario:
The commodity information and order information of your e-commerce platform are stored in a MySQL database. You want to periodically analyze the order data and view the ranking of the most popular commodity categories every day in a visualized manner.
Step 1: Synchronize data
Add data sources.
DataWorks allows you to connect to sources and destinations by adding data sources. In this step, a MySQL data source and a MaxCompute data source need to be added.
The MySQL data source is used to connect to the MySQL database that stores business data. The MySQL data source provides raw business data for this tutorial.
Note: You do not need to prepare raw business data for this tutorial. To facilitate testing and learning, DataWorks provides a test dataset. The required table data is stored in a MySQL database that can be accessed over the Internet. You only need to add a MySQL data source to DataWorks based on this database.
The MaxCompute data source is used to connect to the MaxCompute data warehouse. After you associate the MaxCompute data source with DataStudio, the MaxCompute data source can provide data storage and computing capabilities for this tutorial.
If a MaxCompute data source already exists in your workspace, you can skip the operations for adding a MaxCompute data source.
If no MaxCompute data source exists in your workspace, you can refer to the following steps to add a MaxCompute data source.
Associate the MaxCompute data source with DataStudio.
You can process data in the MaxCompute data source in DataStudio only after you associate the MaxCompute data source with DataStudio.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the DataStudio page. In the left-side navigation pane of the DataStudio page, click the Data Source icon. On the Data Source page, find the added MaxCompute data source and click Associate in the upper-right corner of the section that displays the information about the data source.
Note: If a MaxCompute data source is already associated with DataStudio, you can skip this operation.
Create a zero load node to manage the sales data analysis workflow for the e-commerce platform in a centralized manner. A zero load node is a dry-run node; you do not need to write code for it.
In the left-side navigation pane of the DataStudio page, click Scheduled Workflow. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose the option to create a zero load node. In the Create Node dialog box, configure the Name parameter based on your business requirements. In this example, the Name parameter is set to Workshop.
Create batch synchronization nodes.
The test data used in this tutorial is stored in two tables: the commodity information source table item_info and the order information source table trade_order. The two tables are stored in the MySQL database to which the MySQL data source corresponds. You must create two batch synchronization nodes to synchronize data in the tables from the MySQL database to the MaxCompute data warehouse to which the MaxCompute data source corresponds. Then, you can perform subsequent data development operations. In this tutorial, two batch synchronization nodes named ods_item_info and ods_trade_order are created. A sketch of the destination table schemas is shown below.
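The following DDL sketch shows what the two destination MaxCompute tables might look like after synchronization. The column names and data types are assumptions made for illustration; the actual schemas of the test dataset may differ. Each table is partitioned by pt so that each daily run writes to a separate partition.

-- Assumed schema of the synchronized commodity information table.
CREATE TABLE IF NOT EXISTS ods_item_info (
    item_id       STRING,
    item_name     STRING,
    category_name STRING
) PARTITIONED BY (pt STRING);

-- Assumed schema of the synchronized order information table.
CREATE TABLE IF NOT EXISTS ods_trade_order (
    order_id     STRING,
    item_id      STRING,
    buyer_id     STRING,
    order_amount DOUBLE,
    gmt_create   DATETIME
) PARTITIONED BY (pt STRING);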
Step 2: Cleanse data
After data is synchronized from MySQL to MaxCompute, the commodity information table ods_item_info and the order information table ods_trade_order are obtained. You can cleanse, process, and analyze data in the tables in DataStudio to obtain the daily ranking of the most popular commodity categories.
When you run each ODPS SQL node, the system displays the estimated cost of the SQL statements configured for the node, including the CREATE and INSERT statements. Because the table referenced by the INSERT statement has not been created at estimation time, the system may report an error indicating that the cost cannot be estimated. You can ignore the error and run the ODPS SQL node.
DataWorks provides scheduling parameters whose values are dynamically replaced in the code of a node based on the configurations of the scheduling parameters. You can define variables in the node code in the ${Variable} format and assign values to the variables in the Scheduling Parameter section of the Properties tab. For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters. In this example, the scheduling parameter $bizdate is used, which indicates the data timestamp of the node. The data timestamp of a node is one day earlier than the scheduling time of the node.
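For example, suppose the following query is configured for a node and the variable bizdate is assigned the value $bizdate on the Properties tab. If the node is scheduled to run on 20240801, ${bizdate} is replaced with 20240731 at run time:

-- ${bizdate} is replaced with the data timestamp, such as 20240731.
SELECT * FROM ods_trade_order WHERE pt = '${bizdate}';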
Create the dim_item_info node. This node processes commodity dimension data based on the ods_item_info table to generate the dimension table dim_item_info, which stores basic commodity information. A sketch of the node logic is shown below.
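A minimal ODPS SQL sketch of this node is shown below. The column names follow the assumed schemas above and are illustrative only; the actual processing logic of the tutorial dataset may differ.

-- Create the dimension table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS dim_item_info (
    item_id       STRING,
    item_name     STRING,
    category_name STRING
) PARTITIONED BY (pt STRING);

-- Load the daily partition of commodity dimension data.
INSERT OVERWRITE TABLE dim_item_info PARTITION (pt = '${bizdate}')
SELECT item_id, item_name, category_name
FROM ods_item_info
WHERE pt = '${bizdate}';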
Create the dwd_trade_order node. This node cleanses and transforms detailed order transaction data and applies business logic based on the ods_trade_order table to generate the fact table dwd_trade_order for order placement. A sketch of the node logic is shown below.
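A hedged sketch of this node follows. The cleansing rule shown (dropping records without an order ID) is an assumed example, not the tutorial's actual business logic.

-- Create the fact table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS dwd_trade_order (
    order_id     STRING,
    item_id      STRING,
    buyer_id     STRING,
    order_amount DOUBLE,
    gmt_create   DATETIME
) PARTITIONED BY (pt STRING);

-- Load the daily partition of cleansed order data.
INSERT OVERWRITE TABLE dwd_trade_order PARTITION (pt = '${bizdate}')
SELECT order_id, item_id, buyer_id, order_amount, gmt_create
FROM ods_trade_order
WHERE pt = '${bizdate}'
  AND order_id IS NOT NULL; -- assumed cleansing rule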
Create the dws_daily_category_sales node. This node aggregates the fact data that is cleansed and standardized at the data warehouse detail (DWD) layer based on the dwd_trade_order and dim_item_info tables to generate the aggregate table dws_daily_category_sales for daily commodity sales by category. A sketch of the node logic is shown below.
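A hedged sketch of the aggregation logic follows. The metric columns (order_cnt, sales_amount) are assumed names for illustration.

-- Create the aggregate table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS dws_daily_category_sales (
    category_name STRING,
    order_cnt     BIGINT,
    sales_amount  DOUBLE
) PARTITIONED BY (pt STRING);

-- Aggregate daily sales by commodity category.
INSERT OVERWRITE TABLE dws_daily_category_sales PARTITION (pt = '${bizdate}')
SELECT d.category_name,
       COUNT(o.order_id)   AS order_cnt,
       SUM(o.order_amount) AS sales_amount
FROM dwd_trade_order o
JOIN dim_item_info d
  ON o.item_id = d.item_id
 AND d.pt = '${bizdate}'
WHERE o.pt = '${bizdate}'
GROUP BY d.category_name;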
Create the ads_top_selling_categories node. This node generates the ads_top_selling_categories table based on the dws_daily_category_sales table. The ads_top_selling_categories table contains the daily ranking of the most popular commodity categories. A sketch of the node logic is shown below.
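A hedged sketch of the ranking logic follows. The use of RANK() and the column names are assumptions for illustration.

-- Create the application-layer table if it does not exist (assumed columns).
CREATE TABLE IF NOT EXISTS ads_top_selling_categories (
    category_name STRING,
    sales_amount  DOUBLE,
    sales_rank    BIGINT
) PARTITIONED BY (pt STRING);

-- Rank categories by daily sales amount.
INSERT OVERWRITE TABLE ads_top_selling_categories PARTITION (pt = '${bizdate}')
SELECT category_name,
       sales_amount,
       RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank
FROM dws_daily_category_sales
WHERE pt = '${bizdate}';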
Step 3: Display data
After the raw test data obtained from the MySQL database is processed in DataStudio and aggregated into the ads_top_selling_categories table, you can query the table data and view data analysis results.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the DataAnalysis page. In the left-side navigation tree of the DataAnalysis page, move the pointer over My Files, click the icon next to it, and then click Create File. In the Create File dialog box, enter a name in the File Name field based on your business requirements and click OK.
On the configuration tab that appears, enter the following SQL statement:
SELECT * FROM ads_top_selling_categories WHERE pt=${bizdate};
In the top toolbar of the configuration tab, click the Run icon. In the upper-right corner of the configuration tab, select a MaxCompute data source from the Data Source Name drop-down list and click OK. In the Estimate Costs dialog box, click Run.
On the tab that displays the execution result, click the icon to view the visualization result. You can click the icon in the upper-right corner of the tab to customize the graph pattern.
In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the DataAnalysis page, click the Cards icon to view the saved card.
Step 4: Enable the system to periodically schedule the batch synchronization nodes and the ODPS SQL nodes
After you perform the preceding operations, you have obtained the sales data of commodities of different categories on the previous day. If you want to obtain the latest sales data every day, you can configure scheduling settings for all the nodes that are created in DataStudio. This way, the system can periodically schedule the nodes.
To simplify operations, scheduling settings for the nodes in this tutorial are configured in a visualized manner. DataWorks also allows you to manually configure scheduling settings for nodes in a fine-grained manner, and you can use the automatic parsing feature to enable the system to automatically parse scheduling dependencies based on node code. For more information about scheduling settings, see the topics in the Schedule directory.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the DataStudio page. In the Scheduled Workflow pane, click Business Flow and double-click Workflow. On the canvas that appears on the right, move the nodes and draw lines to connect the nodes based on the instructions shown in the following figure.
In the right-side navigation pane, click Workflow Parameters. On the Workflow Parameters tab, set the Parameter Name parameter to bizdate and the Value/Expression parameter to $bizdate, and then click Save.
Double-click the Workshop zero load node and click Properties in the right-side navigation pane. On the Properties tab, configure the parameters by referring to the instructions shown in the following figure. In the top toolbar of the configuration tab of the zero load node, click the Save icon.
Note: Retain the default values for the other parameters.
Switch to the configuration tab of the Workflow workflow and click Run in the top toolbar. In the Enter parameters dialog box, set the value of bizdate to the date of the previous day and click OK to test whether all nodes in the workflow can be successfully run. For example, if the current day is 20240731, you can set the value of bizdate to 20240730.
If all nodes in the workflow are successfully run, click Submit in the top toolbar to commit all the nodes to Operation Center.
In the upper-left corner of the DataWorks console, click the navigation menu icon and go to the Operation Center page. In the left-side navigation pane of the Operation Center page, choose the menu item for auto triggered nodes. On the page that appears, view the created auto triggered nodes.
Note: If you want to present all nodes that have scheduling dependencies as shown in the following figure, you can right-click the Workshop node in the directed acyclic graph (DAG) of the node and choose the option to show its descendant nodes.
What to do next
For information about details of operations performed in the modules involved in this tutorial and parameter descriptions, see Overview of Data Integration, Overview of DataStudio, Overview of DataAnalysis, and the topics in the Schedule directory.
In addition to the modules that are used in this tutorial, DataWorks also provides other modules, such as Data Modeling, Data Quality, Data Security Guard, and DataService Studio. These modules help you perform end-to-end data monitoring and O&M.
You can also experience more DataWorks tutorials. For more information, see Tutorials for different business scenarios.