DataWorks: Getting started with DataWorks

Last Updated: Nov 15, 2024

DataWorks is an end-to-end big data development and governance platform that provides data warehousing, data lake, and data lakehouse solutions based on big data compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), AnalyticDB, and Cloudera Data Platform (CDP). This topic describes how to use some core services and features of DataWorks to integrate data into DataWorks and perform business processing, periodic scheduling, and data visualization.

Introduction

You can quickly complete the following operations by referring to the instructions provided in this topic:

  1. Data synchronization: You can use DataWorks Data Integration to create batch synchronization nodes to synchronize business data to a big data computing platform, such as MaxCompute.

  2. Data cleansing: You can process, analyze, and mine business data in DataWorks DataStudio.

  3. Data display: You can convert analysis results into graphs in DataWorks DataAnalysis for better understanding by business personnel.

  4. Periodic scheduling: You can configure scheduling settings for the data synchronization and data cleansing procedures so that they run automatically on a recurring basis.

image

Prerequisites

An Alibaba Cloud account or a RAM user to which the AliyunDataWorksFullAccess policy is attached is prepared. For information about how to prepare an Alibaba Cloud account or a RAM user, see Prepare an Alibaba Cloud account or the topics in the Prepare a RAM user directory.

Note

DataWorks provides a comprehensive permission management mechanism, which allows you to manage permissions at the product level and the module level. For information about how to perform fine-grained permission management, see Overview of the DataWorks permission management system.

Preparations

  1. Create a workspace and associate a resource group with the workspace.

    This tutorial describes how to get started with DataWorks in the China (Shanghai) region. You must log on to the DataWorks console and switch to the China (Shanghai) region to check whether DataWorks is activated in this region.

    Note

    In an actual business scenario, you can determine the region in which you want to activate DataWorks based on the location of your business data.

    • If your business data is stored in an Alibaba Cloud service other than DataWorks, you must select the region in which the Alibaba Cloud service is activated.

    • If your business data is stored on an on-premises machine, you must access the business data over the Internet. In this case, you must select a region that is close to your geographical location to reduce access latency.

    • If DataWorks is not activated in the China (Shanghai) region, you can click Purchase Product Portfolio for Free to enable the system to activate DataWorks, create a default workspace, and associate an automatically created resource group with the workspace at the same time.

      Activate DataWorks

      1. On the buy page, configure the following parameters:

        • Region: Select the region in which you want to activate DataWorks. Example value: China (Shanghai).

        • DataWorks: Select the DataWorks edition that you want to purchase. Example value: Basic Edition.

          Note

          In this tutorial, Basic Edition is selected. You can experience the features that are used in this tutorial in any DataWorks edition. For information about how to select an appropriate DataWorks edition based on your business requirements, see the Feature comparison section in the "Differences among DataWorks editions" topic.

      2. Read the terms of service, select the check box for Terms of Service, and then click Confirm Order and Pay.

        Note

        After you complete the payment, the system automatically creates a workspace, creates a resource group, and then associates the resource group with the workspace to run various tasks, such as data synchronization, data development, and data scheduling tasks.

      3. Go to the Workspaces page and switch to the China (Shanghai) region to view the created workspace.

        image

    • If DataWorks is activated in the China (Shanghai) region, you must manually create a workspace and a resource group and associate the resource group with the workspace for this tutorial.

      Manually create a workspace and a resource group and associate the resource group with the workspace

      1. Create a workspace.

        1. In the left-side navigation pane of the DataWorks console, click Workspaces. On the Workspaces page, click Create Workspace.

        2. In the Create Workspace panel, enter a name in the Workspace Name field and click Commit.

      2. Create a resource group.

        1. In the left-side navigation pane of the DataWorks console, click Resource Groups. On the Resource Groups page, click Create Resource Group.

        2. On the DataWorks General Resource (PayAsYouGo) page, configure the following parameters:

          • Resource Group Name: Specify a name based on your business requirements.

          • VPC and vSwitch: Select an existing virtual private cloud (VPC) and an existing vSwitch. If no VPC or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console to create one.

          • Service-linked Role: Click Create Service-linked Role to create a service-linked role.

        3. Click Buy Now. Then, complete the payment.

      3. Associate the resource group with the workspace.

        1. In the left-side navigation pane of the DataWorks console, click Resource Groups. On the Resource Groups page, find the created resource group, click the image icon in the Actions column, and then click Change Workspace.

        2. In the Change Workspace panel, find the created workspace and click Associate in the Actions column.

  2. Configure an elastic IP address (EIP) for the VPC with which the resource group is associated.

    In this tutorial, the public test business data of an e-commerce platform is used. You must obtain the business data over the Internet. By default, general-purpose resource groups cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the created resource group is associated and configure an EIP for the VPC to establish a network connection between the VPC and the network environment of the business data. This way, you can use the resource group to access the business data.

    Procedure

    1. Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.

    2. Click Create NAT Gateway. On the Internet NAT Gateway page, configure the following parameters:

      • Region: Select China (Shanghai).

      • VPC and Associate vSwitch: Select the VPC and vSwitch with which the resource group is associated.

        To view the VPC and vSwitch with which the resource group is associated, perform the following operations: Log on to the DataWorks console. In the top navigation bar, select the region in which you activate DataWorks. In the left-side navigation pane, click Resource Groups. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC?

      • Access Mode: Select SNAT for All VPC Resources.

      • EIP: Select Purchase EIP.

      • Service-linked Role: Click Create Service-linked Role to create a service-linked role if this is the first time you create a NAT gateway.

      Note

      You can retain the default values for the parameters that are not described in the preceding list.

    3. Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Confirm.

      image

Procedure

This section guides you through some core features of DataWorks in the following scenario:

The commodity information and order information of your e-commerce platform are stored in a MySQL database. You want to periodically analyze the order data and view the ranking of the most popular commodity categories every day in a visualized manner.

Step 1: Synchronize data

  1. Add data sources.

    DataWorks allows you to connect to sources and destinations by adding data sources. In this step, a MySQL data source and a MaxCompute data source need to be added.

    • The MySQL data source is used to connect to the MySQL database that stores business data. The MySQL data source provides raw business data for this tutorial.

      Note

      You do not need to prepare raw business data for this tutorial. To facilitate testing and learning, DataWorks provides a test dataset. The required table data is stored in a MySQL database that can be accessed over the Internet. You only need to add a MySQL data source to DataWorks based on this MySQL database.

      Add a MySQL data source

      1. Go to the Management Center page.

        Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

      2. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click MySQL.

      3. On the Add MySQL Data Source page, configure the following parameters.

        Note
        • You can retain the default values for the parameters that are not described in the following list.

        • The first time you add a data source, you must perform cross-service authorization to allow DataWorks to assume the AliyunDIDefaultRole service-linked role.

        • Data Source Name: Enter a name for the data source. Example: MySQL_Source.

        • Configuration Mode: Select Connection String Mode.

        • Connection Address: Click Add Address.

          • Host IP Address: Enter rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com.

          • Port Number: Enter 3306.

          Important

          All data provided in this tutorial is test data that is used only for data application operations in DataWorks. The data can be read only in Data Integration.

        • Database Name: Enter a database name. Example: retail_e_commerce.

        • Username: Enter a username. Example: workshop.

        • Password: Enter a password. Example: workshop#2017.

      4. In the Connection Configuration section, find the resource group that is associated with the workspace and click Test Network Connectivity in the Connection Status column.

        Note

        If the network connectivity test on the MySQL data source fails, you must check whether an EIP is configured for the VPC with which the resource group is associated. You can use the resource group to access the MySQL data source only if the resource group can access the Internet.

      5. Click Complete Creation.

    • The MaxCompute data source is used to connect to the MaxCompute data warehouse. After you associate the MaxCompute data source with DataStudio, the MaxCompute data source can provide data storage and computing capabilities for this tutorial.

      • If a MaxCompute data source already exists in your workspace, you can skip the operations for adding a MaxCompute data source.

      • If no MaxCompute data source exists in your workspace, you can refer to the following steps to add a MaxCompute data source.

        Add a MaxCompute data source

        1. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click MaxCompute.

        2. On the Add MaxCompute Data Source page, configure the following parameters.

          Note

          You can retain the default values for the parameters that are not described in the following list.

          • Data Source Name: Enter a name for the data source. Example: MaxCompute_Source.

          • Creation Method: Select Create MaxCompute Project.

          • Project Name: Specify a name based on your business requirements.

            Note

            The MaxCompute project name that you specify must be globally unique. If the network connectivity test on the MaxCompute data source fails, the project name may already exist. In this case, you must change the MaxCompute project name and test the network connectivity of the MaxCompute data source again.

          • Billing Method of Computing Resources: Select Pay-as-you-go.

            Note

            If Pay-as-you-go cannot be selected, you must click Activate next to Pay-as-you-go to first activate the MaxCompute service that is charged based on the pay-as-you-go billing method.

          • Default Quota: Select the default quota.

          • Default Access Identity: Select Alibaba Cloud Account.

        3. In the Connection Configuration section, find the resource group that is associated with the workspace and click Test Network Connectivity in the Connection Status column.

          Note

          The MaxCompute project name that you specify must be globally unique. If the network connectivity test on the MaxCompute data source fails and an error message similar to ErrorMessage:[ODPS-0420095: Access Denied - Authorization Failed [4002], You don't exist in project xxx.] appears, the MaxCompute project name already exists. In this case, you must change the MaxCompute project name and test the network connectivity of the MaxCompute data source again.

        4. Click Complete Creation.

  2. Associate the MaxCompute data source with DataStudio.

    You can process data in the MaxCompute data source in DataStudio only after you associate the MaxCompute data source with DataStudio.

    1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Development And Task Operation > DataStudio.

    2. In the left-side navigation pane of the DataStudio page, click the image (Data Source) icon. On the Data Source page, find the added MaxCompute data source and click Associate in the upper-right corner of the section that displays the information about the data source.

      Note

      If a MaxCompute data source is associated with DataStudio, you can skip the operation for associating the MaxCompute data source with DataStudio.

      image

  3. Create a zero load node to manage the sales data analysis workflow for the e-commerce platform in a centralized manner. The zero load node is a dry-run node. You do not need to edit code for the zero load node.

    In the left-side navigation pane of the DataStudio page, click Scheduled Workflow. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > General > Zero-Load Node. In the Create Node dialog box, configure the Name parameter based on your business requirements. In this example, the Name parameter is set to Workshop.

  4. Create batch synchronization nodes.

    The test data used in this tutorial is stored in the following tables: the commodity information source table item_info and the order information source table trade_order. The two tables are stored in the MySQL database to which the MySQL data source corresponds. You must create two batch synchronization nodes to synchronize data in the tables from the MySQL database to the MaxCompute data warehouse to which the MaxCompute data source corresponds. Then, you can perform subsequent data development operations. In this tutorial, two batch synchronization nodes named ods_item_info and ods_trade_order are created.

    1. Create the ods_item_info batch synchronization node

      1. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > Data Integration > Offline synchronization.

      2. In the Create Node dialog box, set the Name parameter to ods_item_info.

      3. On the configuration tab of the batch synchronization node, select the added MySQL data source as the source and the MaxCompute data source that is associated with DataStudio as the destination, and select the resource group that is associated with the workspace. Then, click Test Connectivity of All Data Sources to test the network connectivity between the resource group and the data sources. If the result of the network connectivity test is connected, click Next.

        image

      4. Configure the parameters related to the source and destination.

        Note

        You can retain the default values for the parameters that are not described in the following list.

        • Source > Table: Select item_info.

        • Destination > Table: Click Generate Destination Table Schema. In the Create Table dialog box, modify the table creation statement to change the name of the table to be created to ods_item_info, and click Create Table.

        • Channel Control > Policy for Dirty Data Records: Select Disallow Dirty Data Records.

      5. In the top toolbar of the configuration tab, click the image (Save) icon and click the image (Run with Parameters) icon.

        Important

        In offline computing scenarios, bizdate indicates the date on which a business transaction is conducted. The date is also called the data timestamp. For example, if you collect statistical data on the turnover of the previous day on the current day, the previous day is the date on which the business transaction is conducted and represents the data timestamp. If you click Run with Parameters after you save the batch synchronization node, you can retain the default value of the custom parameter. The default value of the custom parameter is the date of the previous day.

        If you change the value of the bizdate custom parameter when you run this batch synchronization node, you must use the same value of the bizdate custom parameter when you run all other nodes involved in this example.

    2. Create the ods_trade_order batch synchronization node

      1. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > Data Integration > Offline synchronization.

      2. In the Create Node dialog box, set the Name parameter to ods_trade_order.

      3. On the configuration tab of the batch synchronization node, select the added MySQL data source as the source and the MaxCompute data source that is associated with DataStudio as the destination, and select the resource group that is associated with the workspace. Then, click Test Connectivity of All Data Sources to test the network connectivity between the resource group and the data sources. If the result of the network connectivity test is connected, click Next.

      4. Configure the parameters related to the source and destination.

        Note

        You can retain the default values for the parameters that are not described in the following list.

        • Source > Table: Select trade_order.

        • Destination > Table: Click Generate Destination Table Schema. In the Create Table dialog box, modify the table creation statement to change the name of the table to be created to ods_trade_order, and click Create Table.

        • Channel Control > Policy for Dirty Data Records: Select Disallow Dirty Data Records.

      5. In the top toolbar of the configuration tab, click the image (Save) icon and click the image (Run with Parameters) icon.

Step 2: Cleanse data

After data is synchronized from MySQL to MaxCompute, the commodity information table ods_item_info and the order information table ods_trade_order are obtained. You can cleanse, process, and analyze data in the tables in DataStudio to obtain the ranking of the most popular commodity categories every day.

Note
  • When you run each ODPS SQL node, the system displays the estimated cost of executing the SQL statements configured for the node, including the CREATE and INSERT statements. Because the table referenced by the INSERT statement has not been created when the cost is estimated, the system may report an error indicating that the cost fails to be estimated. You can ignore the error and directly run the ODPS SQL node.

  • DataWorks provides scheduling parameters whose values are dynamically replaced in the code of a node based on the configurations of the scheduling parameters. You can define variables in the node code in the ${Variable} format and assign values to the variables in the Scheduling Parameter section of the Properties tab. For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters. In this example, the scheduling parameter $bizdate is used, which indicates the data timestamp of the batch synchronization node. The data timestamp of a node is one day earlier than the scheduling time of the node.
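
    For example, assume that a node is scheduled to run on 20240801. The $bizdate scheduling parameter then resolves to the previous day, 20240731. The following minimal sketch, which reuses the ods_trade_order table from this tutorial, shows how the variable in the node code is replaced at run time:

    -- Code as written in the node:
    SELECT * FROM ods_trade_order WHERE pt = '${bizdate}';

    -- Code that is actually executed after the scheduling parameter is replaced:
    SELECT * FROM ods_trade_order WHERE pt = '20240731';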

  1. Create the dim_item_info node.

    This node is used to process data of the commodity dimension based on the ods_item_info table to generate the dimension table dim_item_info for basic information of commodities.

    Procedure

    1. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > MaxCompute > ODPS SQL. In the Create Node dialog box, set the Name parameter to dim_item_info and click Confirm. On the configuration tab of the ODPS SQL node, enter the following SQL statements:

      CREATE TABLE IF NOT EXISTS dim_item_info (
          gmt_modified                   STRING COMMENT 'Last modification date of a commodity',
          gmt_create                     STRING COMMENT 'Creation time of a commodity',
          item_id                        BIGINT COMMENT 'Numeric ID of a commodity',
          title                          STRING COMMENT 'Commodity title',
          sub_title                      STRING COMMENT 'Commodity subtitle',
          pict_url                       STRING COMMENT 'Picture URL',
          desc_path                      STRING COMMENT 'Path of the commodity description',
          item_status                    BIGINT COMMENT 'Commodity status. Valid values: 1 and 0. 1: Confirmation passed. 0: Confirmation failed',
          last_online_time               DATETIME COMMENT 'Last time when a commodity is put on the shelf',
          last_offline_time              DATETIME COMMENT 'Last time when a commodity is taken off the shelf, which indicates the end of a sales cycle and takes effect only for auction commodities',
          duration                       BIGINT COMMENT 'Validity period, which indicates a sales cycle. Valid values: 7 and 14. Unit: days',
          reserve_price                  DOUBLE COMMENT 'Current price',
          secure_trade_ordinary_post_fee DOUBLE COMMENT 'Regular mail fee',
          secure_trade_fast_post_fee     DOUBLE COMMENT 'Fast mail fee',
          secure_trade_ems_post_fee      DOUBLE COMMENT 'EMS fee',
          last_online_quantity           BIGINT COMMENT 'Inventories when a commodity is last put on the shelf',
          features                       STRING COMMENT 'Commodity characteristics',
          cate_id                        BIGINT COMMENT 'Leaf category ID of a commodity',
          cate_name                      STRING COMMENT 'Leaf category name of a commodity',
          commodity_id                   BIGINT COMMENT 'Commodity ID',
          commodity_name                 STRING COMMENT 'Commodity name',
          is_virtual                     STRING COMMENT 'Whether the commodity is a virtual commodity',
          shop_id                        BIGINT COMMENT 'Seller ID',
          shop_nick                      STRING COMMENT 'Seller nickname',
          is_deleted                     BIGINT COMMENT 'Whether a category is deleted'
      )
      COMMENT 'Dimension table for basic information of commodities'
      PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
      LIFECYCLE 365;
      
      
      -- Insert data into the dim_item_info table.
      INSERT OVERWRITE TABLE dim_item_info PARTITION(pt='${bizdate}')
      SELECT
          gmt_modified,
          gmt_create,
          item_id,
          title,
          sub_title,
          pict_url,
          desc_path,
          item_status,
          last_online_time,
          last_offline_time,
          duration,
          cast(reserve_price as DOUBLE),
          cast(secure_trade_ordinary_post_fee as DOUBLE),
          cast(secure_trade_fast_post_fee as DOUBLE),
          cast(secure_trade_ems_post_fee as DOUBLE),
          last_online_quantity,
          features,
          cate_id,
          cate_name,
          commodity_id,
          commodity_name,
          is_virtual,
          shop_id,
          shop_nick,
          is_deleted
      FROM ods_item_info
      WHERE pt = '${bizdate}';
    2. In the top toolbar of the configuration tab of the ODPS SQL node, separately click the image (Save) icon and the image (Run with Parameters) icon.
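
      Optionally, before you build the downstream nodes, you can verify that the target partition was written. The following query is a minimal sanity check rather than part of the tutorial; it assumes that you kept the default bizdate value, which is the date of the previous day. You can run it in a temporary ODPS SQL node or an ad hoc query:

      -- Count the rows written to the dim_item_info partition for the chosen data timestamp.
      SELECT COUNT(*) AS row_cnt
      FROM dim_item_info
      WHERE pt = '${bizdate}';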

  2. Create the dwd_trade_order node.

    This node is used to cleanse and transform the detailed order transaction data in the ods_trade_order table and apply business logic to generate the order-placing fact table dwd_trade_order.

    Procedure

    1. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > MaxCompute > ODPS SQL. In the Create Node dialog box, set the Name parameter to dwd_trade_order and click Confirm. On the configuration tab that appears, enter the following SQL statements:

      CREATE TABLE IF NOT EXISTS dwd_trade_order (
          id               BIGINT COMMENT 'Primary key, which is used as the latest ID after deduplication',
          gmt_create       DATETIME COMMENT 'Creation time',
          gmt_modified     DATETIME COMMENT 'Modification time',
          sub_order_id     BIGINT COMMENT 'Suborder ID',
          parent_order_id  BIGINT COMMENT 'Parent order ID',
          buyer_id         BIGINT COMMENT 'Numeric ID of a buyer',
          buyer_nick       STRING COMMENT 'Buyer nickname, which is used to filter out raw data that does not contain nicknames',
          item_id          BIGINT COMMENT 'Numeric ID of a commodity',
          item_price       DECIMAL(38,18) COMMENT 'Commodity price. Unit: cent',
          buy_amount             BIGINT COMMENT 'Quantity of purchased commodities',
          biz_type         BIGINT COMMENT 'Transaction type',
          memo             STRING COMMENT 'Comment, which is used to filter out raw data that does not contain comments',
          pay_status       BIGINT COMMENT 'Payment status',
          logistics_status BIGINT COMMENT 'Logistics status',
          status           BIGINT COMMENT 'Status',
          seller_memo      STRING COMMENT 'Transaction comment of a seller',
          buyer_memo       STRING COMMENT 'Transaction comment of a buyer',
          clean_ip         STRING COMMENT 'Buyer IP address obtained after cleansing with IP addresses in the invalid formats filtered out',
          end_time         DATETIME COMMENT 'Transaction end time',
          pay_time         DATETIME COMMENT 'Payment time',
          is_sub           BIGINT COMMENT 'Whether the order is a suborder. 1: yes',
          is_parent        BIGINT COMMENT 'Whether the order is a parent order. 1: yes',
          shop_id          BIGINT COMMENT 'Seller ID',
          total_fee        DECIMAL(38,18) COMMENT 'Suborder fee calculated after discounts are deducted and adjustments are made',
          is_large_order_flag BOOLEAN COMMENT 'Whether the order is a large order'
      )
      COMMENT 'Order-placing fact table obtained after cleansing and processing based on business logic are performed'
      PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
      LIFECYCLE 365; -- Set the lifecycle of data to 365 days.
      
      
      INSERT OVERWRITE TABLE dwd_trade_order PARTITION(pt='${bizdate}')
      SELECT
          MAX(id) AS id, -- Assume that the latest ID is used for deduplication.
          gmt_create,
          gmt_modified,
          sub_order_id,
          parent_order_id,
          buyer_id,
          COALESCE(buyer_nick, '') AS buyer_nick, -- Process the situation that raw data does not contain buyer nicknames.
          item_id,
          item_price,
          buy_amount,
          biz_type,
          COALESCE(memo, '') AS memo, -- Process the situation that raw data does not contain comments.
          pay_status,
          logistics_status,
          status,
          seller_memo,
          buyer_memo,
          CASE 
              WHEN ip LIKE '__.__.__.__' THEN NULL -- Filter out IP addresses in the invalid formats.
              ELSE ip 
          END AS clean_ip,
          end_time,
          pay_time,
          is_sub,
          is_parent,
          shop_id,
          total_fee,
          CASE 
              WHEN total_fee >= 10000 THEN TRUE -- Assume that orders whose transaction amount is greater than or equal to 10,000 cents are large orders.
              ELSE FALSE 
          END AS is_large_order_flag -- Add a business logic flag.
      FROM (
          SELECT
              *,
              ROW_NUMBER() OVER(PARTITION BY buyer_id, item_id, gmt_create ORDER BY id DESC) AS rn -- Specify the row number used for deduplication.
          FROM ods_trade_order
          WHERE pt = '${bizdate}'
      ) AS sub_query
      WHERE rn = 1 -- Retain only the first record of each deduplication group.
      GROUP BY 
          gmt_create,
          gmt_modified,
          sub_order_id,
          parent_order_id,
          buyer_id,
          buyer_nick,
          item_id,
          item_price,
          buy_amount,
          biz_type,
          memo,
          pay_status,
          logistics_status,
          status,
          seller_memo,
          buyer_memo,
          clean_ip,
          end_time,
          pay_time,
          is_sub,
          is_parent,
          shop_id,
          total_fee,
          is_large_order_flag;
    2. In the top toolbar of the configuration tab of the ODPS SQL node, separately click the image (Save) icon and the image (Run with Parameters) icon.

  3. Create the dws_daily_category_sales node.

    This node is used to aggregate the fact data that has been cleansed and standardized at the data warehouse detail (DWD) layer. It joins the dwd_trade_order and dim_item_info tables to generate the aggregate table dws_daily_category_sales for the categories of commodities sold each day.

    Procedure

    1. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > MaxCompute > ODPS SQL. In the Create Node dialog box, set the Name parameter to dws_daily_category_sales and click Confirm. On the configuration tab that appears, enter the following SQL statements:

      CREATE TABLE IF NOT EXISTS dws_daily_category_sales (
          cate_id             BIGINT COMMENT 'Leaf category ID of a commodity',
          cate_name           STRING COMMENT 'Leaf category name of a commodity',
          total_sales_amount  DECIMAL(38,18) COMMENT 'Total sales amount of different commodity categories. Unit: cent',
          order_count         BIGINT COMMENT 'Number of orders'
      )
      COMMENT 'Aggregate table for categories of daily sold commodities'
      PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
      LIFECYCLE 365;
      
      
      INSERT OVERWRITE TABLE dws_daily_category_sales PARTITION(pt='${bizdate}')
      SELECT
          i.cate_id,
          i.cate_name,
          SUM(t.total_fee) AS total_sales_amount,
          COUNT(DISTINCT t.id) AS order_count
      FROM dwd_trade_order t
      JOIN dim_item_info i ON t.item_id = i.item_id AND t.pt = i.pt
      WHERE t.pt = '${bizdate}'
      GROUP BY t.pt, i.cate_id, i.cate_name;
    2. In the top toolbar of the configuration tab of the ODPS SQL node, separately click the image (Save) icon and the image (Run with Parameters) icon.

  4. Create the ads_top_selling_categories node.

    This node is used to generate ads_top_selling_categories based on the dws_daily_category_sales table. The ads_top_selling_categories table contains the ranking of the most popular commodity categories every day.

    Procedure

    1. In the Scheduled Workflow pane, click Business Flow. Right-click Workflow and choose Create Node > MaxCompute > ODPS SQL. In the Create Node dialog box, set the Name parameter to ads_top_selling_categories and click Confirm. On the configuration tab that appears, enter the following SQL statements:

      CREATE TABLE IF NOT EXISTS ads_top_selling_categories (
          rank                BIGINT COMMENT 'Sales volume ranking',
          cate_id             BIGINT COMMENT 'Leaf category ID of a commodity',
          cate_name           STRING COMMENT 'Leaf category name of a commodity',
          total_sales_amount  DECIMAL(38,18) COMMENT 'Total sales amount of different commodity categories. Unit: cent',
          order_count         BIGINT COMMENT 'Number of orders'
      )
      COMMENT 'Table for the ranking of the most popular commodity categories every day'
      PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd');
      
      
      INSERT OVERWRITE TABLE ads_top_selling_categories PARTITION(pt='${bizdate}')
      SELECT
          rank,
          cate_id,
          cate_name,
          total_sales_amount,
          order_count
      FROM (
          SELECT
              DENSE_RANK() OVER(ORDER BY total_sales_amount DESC) AS rank,
              cate_id,
              cate_name,
              total_sales_amount,
              order_count
          FROM (
              SELECT
                  cate_id,
                  cate_name,
                  SUM(total_sales_amount) AS total_sales_amount,
                  SUM(order_count) AS order_count
              FROM dws_daily_category_sales
              WHERE pt = '${bizdate}'
              GROUP BY cate_id, cate_name
          ) agg_sub
      ) agg_outer
      WHERE rank <= 10;
    2. In the top toolbar of the configuration tab of the ODPS SQL node, separately click the image (Save) icon and the image (Run with Parameters) icon.

Step 3: Display data

After the raw test data obtained from the MySQL database is processed in DataStudio and aggregated into the ads_top_selling_categories table, you can query the table data and view data analysis results.

  1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Analysis > SQL Query.

  2. In the left-side navigation tree of the DataAnalysis page, move the pointer over or click the image icon next to My Files and click Create File. In the Create File dialog box, enter a name in the File Name field based on your business requirements and click OK.

    image

  3. On the configuration tab that appears, enter the following SQL statement:

    SELECT * FROM ads_top_selling_categories WHERE pt = '${bizdate}';
  4. In the top toolbar of the configuration tab, click the image (Run) icon. In the upper-right corner of the configuration tab, select a MaxCompute data source from the Data Source Name drop-down list that is displayed and click OK. In the Estimate Costs dialog box, click Run.

  5. On the tab that displays the execution result, click the image icon to view the visualization result. You can click the image icon in the upper-right corner of the tab to customize the graph pattern.

    image

  6. In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the DataAnalysis page, click the image (Cards) icon to view the saved card.

    image

Step 4: Enable the system to periodically schedule the batch synchronization nodes and the ODPS SQL nodes

After you perform the preceding operations, you have obtained the sales data of commodities of different categories on the previous day. If you want to obtain the latest sales data every day, you can configure scheduling settings for all the nodes that are created in DataStudio. This way, the system can periodically schedule the nodes.

Note

To simplify operations, scheduling settings are configured for the nodes involved in this tutorial in a visualized manner. DataWorks allows you to manually configure scheduling settings for nodes in a fine-grained manner. You can use the automatic parsing feature to enable the system to automatically parse scheduling dependencies for nodes based on node code. For more information about scheduling settings, see the topics in the Schedule directory.

  1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Development And Task Operation > DataStudio.

  2. In the Scheduled Workflow pane, click Business Flow and double-click Workflow. In the canvas that appears on the right, move the nodes and draw lines to connect the nodes based on the instructions shown in the following figure.

    image

  3. In the right-side navigation pane, click Workflow Parameters. On the Workflow Parameters tab, set the Parameter Name parameter to bizdate and the Value/Expression parameter to $bizdate and click Save.

    image

  4. Double-click the Workshop zero load node and click Properties in the right-side navigation pane. On the Properties tab, configure the parameters by referring to the instructions shown in the following figure. In the top toolbar of the configuration tab of the zero load node, click the image (Save) icon.

    Note

    Retain the default values for other parameters.

    image

  5. Switch to the configuration tab of the Workflow workflow and click Run in the top toolbar. In the Enter parameters dialog box, set the value of bizdate to the date of the previous day and click OK to test whether all nodes in the workflow can be successfully run. For example, if the current day is 20240731, you can set the value of bizdate to 20240730.

    image

  6. If all nodes in the workflow are successfully run, click Submit in the top toolbar to commit all the nodes to Operation Center.

    image

  7. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Development And Task Operation > Operation Center.

  8. In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Tasks. On the page that appears, view the created auto triggered nodes.

    Note

    If you want to present all nodes that have scheduling dependencies as shown in the following figure, you can right-click the Workshop node in the directed acyclic graph (DAG) of the node and choose Show Descendant Nodes > 4 Levels.

    image

What to do next

Appendix: Resource release and clearance

To release the resources that are created for this tutorial, perform the following operations:

  1. Stop the auto triggered nodes.

    1. Go to the Operation Center page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

    2. In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Tasks. On the page that appears, select all auto triggered nodes that are created for this tutorial, click Actions in the lower part of the page, and then click Undeploy.

  2. Delete the batch synchronization nodes and ODPS SQL nodes and disassociate the MaxCompute data source from DataStudio.

    1. Go to the DataStudio page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

    2. In the Scheduled Workflow pane, click Business Flow and click Workflow. Then, separately find the batch synchronization nodes and the ODPS SQL nodes, right-click each node name, and then select Delete.

    3. In the left-side navigation pane of the DataStudio page, click Data Source. On the Data Source page, find the MaxCompute data source and click Disassociate.

  3. Delete the MySQL and MaxCompute data sources.

    1. Go to the Management Center page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, separately find the MySQL data source and the MaxCompute data source and click Delete in the Actions column of the data sources.

  4. Delete the MaxCompute project.

    Log on to the MaxCompute console. On the Projects page, find the created MaxCompute project and click Delete in the Actions column. In the Delete Project dialog box, select the check box for confirming the deletion operation and click OK.

  5. Delete the Internet NAT gateway and release the EIP.

    1. Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.

    2. On the Internet NAT Gateway page, find the created Internet NAT gateway, click the image icon in the Actions column, and then click Delete. In the Delete Gateway dialog box, select Force Delete (Delete the NAT gateway and associated SNAT/DNAT entries) and click OK.

    3. In the left-side navigation pane of the VPC console, choose Access to Internet > Elastic IP Addresses. On the Elastic IP Addresses page, find the created EIP, click the image icon in the Actions column, and then click Release. In the dialog box that appears, click OK.