DataWorks: Process data

Last Updated: Jan 15, 2025

This topic describes how to use MaxCompute nodes in DataWorks to process the data that was synchronized to the ods_user_info_d_odps and ods_raw_log_d_odps tables in MaxCompute, which store basic user information and website access logs, and obtain user profile data. It also shows how to use DataWorks together with MaxCompute to compute and analyze the synchronized data and complete a simple data processing workflow in a data warehousing scenario.


Prerequisites

The required data is synchronized. For more information, see Synchronize data.

  • The basic user information stored in the ApsaraDB RDS for MySQL table ods_user_info_d is synchronized to the MaxCompute table ods_user_info_d_odps by using Data Integration.

  • The website access logs of users stored in the Object Storage Service (OSS) object user_log.txt are synchronized to the MaxCompute table ods_raw_log_d_odps by using Data Integration.
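
You can optionally run an ad hoc query in the development environment to confirm that the synchronized data exists before you process it. The following statements are a minimal sketch; replace <data timestamp> with the partition that your synchronization nodes wrote, in the yyyymmdd format.

    -- Optional check: confirm that the synchronized partitions contain data.
    SELECT COUNT(*) FROM ods_user_info_d_odps WHERE dt = '<data timestamp>';
    SELECT COUNT(*) FROM ods_raw_log_d_odps WHERE dt = '<data timestamp>';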

Objective

The objective is to process the ods_user_info_d_odps and ods_raw_log_d_odps tables to obtain a basic user profile table.

  1. Split the log information field in the ods_raw_log_d_odps table into multiple fields and generate the dwd_log_info_di_odps fact table.

  2. Join the dwd_log_info_di_odps fact table with the ods_user_info_d_odps table based on the uid field to generate the dws_user_info_all_di_odps aggregate table.

  3. Process the dws_user_info_all_di_odps table to generate a table named ads_user_info_1d_odps. The dws_user_info_all_di_odps table contains a large number of fields and a large amount of data, and consuming it directly can take a long time. Therefore, it is further processed into a compact user profile table.

Go to the DataStudio page

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

Step 1: Design a workflow

In the data synchronization phase, the required data is synchronized to MaxCompute tables. The next objective is to further process the data to generate the basic user profile data.

  • Nodes at different layers and the processing logic of the nodes

    In the upper part of the workflow canvas, click Create Node to create the following nodes for data processing. All of the nodes are MaxCompute nodes of the ODPS SQL type and are named after their output tables.

    • dwd_log_info_di_odps: Use built-in functions and user-defined functions (UDFs) to split data in the ods_raw_log_d_odps table and write the data to multiple fields in the dwd_log_info_di_odps table.

    • dws_user_info_all_di_odps: Join the table that stores the basic user information and the dwd_log_info_di_odps table that stores the preliminarily processed log data to generate an aggregate table.

    • ads_user_info_1d_odps: Further process the data to generate a basic user profile.

  • Directed acyclic graph (DAG) in the workflow

    Drag the nodes to the workflow canvas, and configure dependencies between the nodes by drawing lines to design the workflow for data processing.

Step 2: Configure a workflow

Create MaxCompute tables

Create the dwd_log_info_di_odps, dws_user_info_all_di_odps, and ads_user_info_1d_odps tables that are used to store the processed data at each layer. In this example, the tables are quickly created by using DDL statements. For more information about MaxCompute table-related operations, see Create and manage MaxCompute tables.

  1. Go to the entry point for creating tables.

    On the DataStudio page, open the WorkShop workflow that you created in the data synchronization phase. Right-click MaxCompute and select Create Table.

  2. Define the schemas of MaxCompute tables.

    In the Create Table dialog box, enter a table name and click Create. You need to create three tables named dwd_log_info_di_odps, dws_user_info_all_di_odps, and ads_user_info_1d_odps. Then, go to the DDL tab of each table and execute the corresponding CREATE TABLE statement. The statements that are used to create the preceding tables are provided in the following content.

  3. Commit the tables to the compute engine.

    After you define the schemas of the tables, click Commit to Development Environment and Commit to Production Environment in sequence on the configuration tab of each table. The system then creates the related physical tables in the MaxCompute projects in the development and production environments based on the configurations.

    • If you commit the tables to the development environment of the workspace, the tables are created in the MaxCompute project in the development environment.

    • If you commit the tables to the production environment of the workspace, the tables are created in the MaxCompute project in the production environment.
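
    If you want to confirm where a physical table was created, you can reference it with a project-qualified name in an ad hoc query. The following statements are a minimal sketch that assumes the MaxCompute project in the development environment is named workshop2024_01_dev, the name that is used later in this tutorial; replace the project names with your own.

    -- Check the physical table in the MaxCompute project in the development environment.
    DESC workshop2024_01_dev.dwd_log_info_di_odps;
    -- Check the physical table in the MaxCompute project in the production environment. Replace the placeholder with your production project name.
    -- DESC <production project name>.dwd_log_info_di_odps;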

Create the dwd_log_info_di_odps table

Double-click the dwd_log_info_di_odps table. On the table configuration tab that appears, click DDL and enter the following CREATE TABLE statement:

CREATE TABLE IF NOT EXISTS dwd_log_info_di_odps (
 ip STRING COMMENT 'The IP address of the client that is used to send the request',
 uid STRING COMMENT 'The ID of the user',
 time STRING COMMENT 'The time in the format of yyyymmddhh:mi:ss',
 status STRING COMMENT 'The status code that is returned by the server',
 bytes STRING COMMENT 'The number of bytes that are returned to the client',
 region STRING COMMENT 'The region, which is obtained based on the IP address',
 method STRING COMMENT 'The type of the HTTP request',
 url STRING COMMENT 'url',
 protocol STRING COMMENT 'The version number of HTTP',
 referer STRING COMMENT 'The source URL',
 device STRING COMMENT 'The terminal type',
 identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
)
PARTITIONED BY (
 dt STRING
)
LIFECYCLE 14;

Create the dws_user_info_all_di_odps table

Double-click the dws_user_info_all_di_odps table. On the table configuration tab that appears, click DDL and enter the following CREATE TABLE statement:

CREATE TABLE IF NOT EXISTS dws_user_info_all_di_odps (
 uid STRING COMMENT 'The ID of the user',
 gender STRING COMMENT 'The gender',
 age_range STRING COMMENT 'The age range of the user',
 zodiac STRING COMMENT 'The zodiac sign',
 region STRING COMMENT 'The region, which is obtained based on the IP address',
 device STRING COMMENT 'The terminal type',
 identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
 method STRING COMMENT 'The type of the HTTP request',
 url STRING COMMENT 'url',
 referer STRING COMMENT 'The source URL',
 time STRING COMMENT 'The time in the format of yyyymmddhh:mi:ss'
)
PARTITIONED BY (
 dt STRING
)
LIFECYCLE 14;

Create the ads_user_info_1d_odps table

Double-click the ads_user_info_1d_odps table. On the table configuration tab that appears, click DDL and enter the following CREATE TABLE statement:

CREATE TABLE IF NOT EXISTS ads_user_info_1d_odps (
 uid STRING COMMENT 'The ID of the user',
 region STRING COMMENT 'The region, which is obtained based on the IP address',
 device STRING COMMENT 'The terminal type',
 pv BIGINT COMMENT 'pv',
 gender STRING COMMENT 'The gender',
 age_range STRING COMMENT 'The age range of the user',
 zodiac STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (
 dt STRING
)
LIFECYCLE 14;    

Create a function named getregion

In this example, a function is used to convert the IP addresses in the log data into regions so that the log data can be processed into table fields. The resources required by the function are provided. You need to only download the resources to your on-premises machine and upload them to your workspace before you register the function in DataWorks.

Important

The IP address resource used by this function is intended only for this tutorial. If you need to map IP addresses to geographical locations in production business scenarios, use a professional IP address lookup service.

  1. Upload the resource file ip2region.jar

    1. Download the ip2region.jar file.

      Note

      The ip2region.jar file is only for the use of this tutorial.

    2. On the DataStudio page, open the WorkShop workflow. Right-click MaxCompute and choose Create Resource > JAR.

    3. Click Upload, select the ip2region.jar file that is downloaded to your on-premises machine, and then click Open.

      Note
      • Select Upload to MaxCompute.

      • The resource name can be different from the name of the uploaded file.

    4. Click the commit icon in the top toolbar to commit the resource to the MaxCompute project in the development environment.

  2. Register the function getregion

    1. Go to the function registration page.

      On the DataStudio page, open the WorkShop workflow, right-click MaxCompute, and then select Create Function.

    2. Enter a function name.

      In the Create Function dialog box, set the Name parameter to getregion and click Create.

    3. In the Register Function section of the configuration tab that appears, configure the following parameters:

      • Function Type: The type of the function.

      • MaxCompute Engine Instance: The MaxCompute compute engine. The value of this parameter cannot be changed.

      • Function Name: The name of the function.

      • Owner: The owner of the function.

      • Class Name: Set the parameter to org.alidata.odps.udf.Ip2Region.

      • Resources: Set the parameter to ip2region.jar.

      • Description: Set the parameter to Region conversion based on the IP address.

      • Expression Syntax: Set the parameter to getregion('ip').

      • Parameter Description: Set the parameter to IP address.

    4. Click the commit icon in the top toolbar to commit the function to the compute engine in the development environment.
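
After you commit the function, you can optionally verify it by running an ad hoc query in the development environment. The following statement is a minimal sketch that applies getregion to the sample IP address from the appendix of this topic.

    -- Optional smoke test for the registered getregion UDF.
    SELECT getregion('58.246.10.82');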

Configure MaxCompute nodes

In this example, you need to use ODPS SQL nodes to implement data processing logic at each layer. Strong data lineages exist between ODPS SQL nodes at different layers. In the data synchronization phase, the output tables of the synchronization nodes have been manually added to the Output section on the Properties tab for the synchronization nodes. Therefore, the scheduling dependencies of the ODPS SQL nodes used to process data in this example can be automatically configured based on data lineages by using the automatic parsing feature.

Configure the dwd_log_info_di_odps node

On the configuration tab of the workflow, double-click the dwd_log_info_di_odps node. On the configuration tab of the node, enter the SQL code that processes fields in the ods_raw_log_d_odps ancestor table by using the created function and writes the processed data to the dwd_log_info_di_odps table. For more information about how to process data, see Appendix: Data processing example.

  1. Edit node code.

    -- Scenario: The following SQL statement uses the getregion function to convert IP addresses in the raw log data into regions, uses regular expressions to split the raw data into analyzable fields, and then writes the fields to the dwd_log_info_di_odps table. 
    --      In this example, the getregion function that is used to convert IP addresses into regions is prepared. 
    -- Note:
    --     1. Before you can use a function in a DataWorks node, you must upload the resources required for registering the function to DataWorks and then register the function by using the resources in a visualized manner. 
    --        In this example, the resource used to register the getregion function is ip2region.jar. 
    --     2. You can configure scheduling parameters for nodes in DataWorks to write incremental data to the related partition in the desired table every day in the scheduling scenario. 
    --        In actual development scenarios, you can define variables in the code of a node in the ${Variable name} format and assign scheduling parameters to the variables on the Properties tab of the configuration tab of the node. This way, the values of scheduling parameters can be dynamically replaced in the node code based on the configurations of the scheduling parameters. 
    INSERT OVERWRITE TABLE dwd_log_info_di_odps PARTITION (dt='${bizdate}')
    SELECT ip 
      , uid
      , time
      , status
      , bytes 
      , getregion(ip) AS region -- Obtain the region based on the IP address by using the UDF. 
      , regexp_substr(request, '(^[^ ]+ )') AS method -- Use the regular expression to extract three fields from the request. 
      , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url
      , regexp_substr(request, '([^ ]+$)') AS protocol 
      , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer -- Use the regular expression to extract the domain from the referer. 
      , CASE
        WHEN TOLOWER(agent) RLIKE 'android' THEN 'android' -- Obtain the terminal information and access type based on the agent parameter. 
        WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
        WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
        WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
        WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
        WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
        ELSE 'unknown'
      END AS device
      , CASE
        WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
        WHEN TOLOWER(agent) RLIKE 'feed'
        OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
        WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
        AND agent RLIKE '^[Mozilla|Opera]'
        AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
        ELSE 'unknown'
      END AS identity
      FROM (
        SELECT SPLIT(col, '##@@')[0] AS ip
        , SPLIT(col, '##@@')[1] AS uid
        , SPLIT(col, '##@@')[2] AS time
        , SPLIT(col, '##@@')[3] AS request
        , SPLIT(col, '##@@')[4] AS status
        , SPLIT(col, '##@@')[5] AS bytes
        , SPLIT(col, '##@@')[6] AS referer
        , SPLIT(col, '##@@')[7] AS agent
      FROM ods_raw_log_d_odps  
      WHERE dt ='${bizdate}'
    ) a;
  2. Configure scheduling properties.

    On the configuration tab of the node, click Properties in the right-side navigation pane. On the Properties tab, configure the scheduling properties and basic information for the node. For more information, see Scheduling properties of a node. Configure the following sections:

    • Scheduling Parameter: Configure the following parameters:

      • Parameter Name: Set the value to bizdate.

      • Parameter Value: Set the value to $[yyyymmdd-1].

    • Schedule: Configure the following parameters and use the default values for the other parameters:

      • Scheduling Cycle: Set the value to Day.

      • Scheduled time: Set the value to 00:30.

      • Rerun: Set the value to Allow Regardless of Running Status.

      Note: The time when the current node is scheduled to run every day is determined by the scheduling time of the zero load node workshop_start of the workflow. The current node is scheduled to run after 00:30 every day.

    • Resource Group: Select the serverless resource group that you purchased in the environment preparation phase.

    • Dependencies: Set Automatic Parsing From Code Before Node Committing to Yes. This way, the system configures the ods_raw_log_d_odps node that generates the ods_raw_log_d_odps table as the ancestor node of the dwd_log_info_di_odps node and uses the dwd_log_info_di_odps table as the output of the dwd_log_info_di_odps node. Other nodes that query the table data generated by the dwd_log_info_di_odps node can then automatically have the dwd_log_info_di_odps node configured as their ancestor node.

      Note: When you click Parse Input and Output from Code, each table name in the code is parsed into an output name in the format of MaxCompute project name in the production environment.Table name. The output names parsed from the input tables are used to determine the ancestor nodes of the current node, and the output name parsed from the output table is used as the output of the current node for its descendant nodes.

  3. Configure other required settings based on your business requirements. After the configuration is complete, click the save icon in the top toolbar on the configuration tab of the node to save the node configurations. A query for spot-checking the node output after the node runs is provided below.
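
After the node runs, for example in Step 3 or in the data backfill phase, you can spot-check its output with an ad hoc query. Because the Parameter Value of bizdate is $[yyyymmdd-1], the ${bizdate} variable resolves to the day before the run date. The following statement is a minimal sketch; replace <data timestamp> with that value.

    -- Optional spot check of the processed log data in a partition.
    SELECT ip, uid, region, method, url, device, identity
    FROM dwd_log_info_di_odps
    WHERE dt = '<data timestamp>'
    LIMIT 10;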

Configure the dws_user_info_all_di_odps node

On the configuration tab of the workflow, double-click the dws_user_info_all_di_odps node. On the configuration tab of the node, enter the SQL code that merges the dwd_log_info_di_odps and ods_user_info_d_odps ancestor tables and writes the merged data to the dws_user_info_all_di_odps table.

  1. Edit node code.

    -- Scenario: Aggregate the processed log data in the dwd_log_info_di_odps table and the basic user information in the ods_user_info_d_odps table and write the aggregated data to the dws_user_info_all_di_odps table. 
    -- Note: You can configure scheduling parameters for nodes in DataWorks to write incremental data to the related partition in the desired table every day in the scheduling scenario. 
    --      In actual development scenarios, you can define variables in the code of a node in the format of ${Variable name} and assign scheduling parameters to the variables on the Properties tab of the configuration tab of the node. This way, the values of scheduling parameters can be dynamically replaced in the node code based on the configurations of the scheduling parameters. 
    INSERT OVERWRITE TABLE dws_user_info_all_di_odps  PARTITION (dt='${bizdate}')
    SELECT COALESCE(a.uid, b.uid) AS uid
      , b.gender
      , b.age_range
      , b.zodiac
      , a.region
      , a.device
      , a.identity
      , a.method
      , a.url
      , a.referer
      , a.time
    FROM (
      SELECT *
      FROM dwd_log_info_di_odps  
      WHERE dt = '${bizdate}'
    ) a
    LEFT OUTER JOIN (
      SELECT *
      FROM ods_user_info_d_odps 
      WHERE dt = '${bizdate}'
    ) b
    ON a.uid = b.uid;
  2. Configure scheduling properties.

    On the configuration tab of the node, click Properties in the right-side navigation pane. On the Properties tab, configure the scheduling properties and basic information for the node. For more information, see Scheduling properties of a node. Configure the following sections:

    • Scheduling Parameter: Configure the following parameters:

      • Parameter Name: Set the value to bizdate.

      • Parameter Value: Set the value to $[yyyymmdd-1].

    • Schedule: Configure the following parameters and use the default values for the other parameters:

      • Scheduling Cycle: Set the value to Day.

      • Scheduled time: Set the value to 00:30.

      • Rerun: Set the value to Allow Regardless of Running Status.

      Note: The time when the current node is scheduled to run every day is determined by the scheduling time of the zero load node workshop_start of the workflow. The current node is scheduled to run after 00:30 every day.

    • Resource Group: Select the serverless resource group that you purchased in the environment preparation phase.

    • Dependencies: Set Automatic Parsing From Code Before Node Committing to Yes. This way, the system configures the dwd_log_info_di_odps and ods_user_info_d_odps nodes that generate the dwd_log_info_di_odps and ods_user_info_d_odps tables as the ancestor nodes of the dws_user_info_all_di_odps node and uses the dws_user_info_all_di_odps table as the output of the dws_user_info_all_di_odps node. Other nodes that query the table data generated by the dws_user_info_all_di_odps node can then automatically have the dws_user_info_all_di_odps node configured as their ancestor node.

      Note: When you click Parse Input and Output from Code, each table name in the code is parsed into an output name in the format of MaxCompute project name in the production environment.Table name. The output names parsed from the input tables are used to determine the ancestor nodes of the current node, and the output name parsed from the output table is used as the output of the current node for its descendant nodes.

  3. Configure other required settings based on your business requirements. After the configuration is complete, click the save icon in the top toolbar on the configuration tab of the node to save the node configurations. A query for checking the join result after the node runs is provided below.
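
After the node runs, you can optionally check how many log records in a partition have no matching basic user information, which appear with a NULL gender after the LEFT OUTER JOIN. The following statement is a minimal sketch; replace <data timestamp> with the partition that you want to check.

    -- Optional check: count log records without matching user information.
    SELECT COUNT(*) AS unmatched_records
    FROM dws_user_info_all_di_odps
    WHERE dt = '<data timestamp>' AND gender IS NULL;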

Configure the ads_user_info_1d_odps node

On the configuration tab of the workflow, double-click the ads_user_info_1d_odps node. On the configuration tab of the node, enter the SQL code that processes the dws_user_info_all_di_odps ancestor table and writes the processed data to the ads_user_info_1d_odps table.

  1. Edit node code.

    -- Scenario: The following SQL statement is used to further process the dws_user_info_all_di_odps wide table that is used to store user access information into the ads_user_info_1d_odps table that is used to store basic user profile data. 
    -- Note: You can configure scheduling parameters for nodes in DataWorks to write incremental data to the related partition in the desired table every day in the scheduling scenario. 
    --      In actual development scenarios, you can define variables in the code of a node in the format of ${Variable name} and assign scheduling parameters to the variables on the Properties tab of the configuration tab of the node. This way, the values of scheduling parameters can be dynamically replaced in the node code based on the configurations of the scheduling parameters. 
    INSERT OVERWRITE TABLE ads_user_info_1d_odps  PARTITION (dt='${bizdate}')
    SELECT uid
      , MAX(region)
      , MAX(device)
      , COUNT(0) AS pv
      , MAX(gender)
      , MAX(age_range)
      , MAX(zodiac)
    FROM dws_user_info_all_di_odps 
    WHERE dt = '${bizdate}'
    GROUP BY uid; 
  2. Configure scheduling properties.

    On the configuration tab of the node, click Properties in the right-side navigation pane. On the Properties tab, configure the scheduling properties and basic information for the node. For more information, see Scheduling properties of a node. Configure the following sections:

    • Scheduling Parameter: Configure the following parameters:

      • Parameter Name: Set the value to bizdate.

      • Parameter Value: Set the value to $[yyyymmdd-1].

    • Schedule: Configure the following parameters and use the default values for the other parameters:

      • Scheduling Cycle: Set the value to Day.

      • Scheduled time: Set the value to 00:30.

      • Rerun: Set the value to Allow Regardless of Running Status.

      Note: The time when the current node is scheduled to run every day is determined by the scheduling time of the zero load node workshop_start of the workflow. The current node is scheduled to run after 00:30 every day.

    • Resource Group: Select the serverless resource group that you purchased in the environment preparation phase.

    • Dependencies: Set Automatic Parsing From Code Before Node Committing to Yes. This way, the system configures the dws_user_info_all_di_odps node that generates the dws_user_info_all_di_odps table as the ancestor node of the ads_user_info_1d_odps node and uses the ads_user_info_1d_odps table as the output of the ads_user_info_1d_odps node. Other nodes that query the table data generated by the ads_user_info_1d_odps node can then automatically have the ads_user_info_1d_odps node configured as their ancestor node.

      Note: When you click Parse Input and Output from Code, each table name in the code is parsed into an output name in the format of MaxCompute project name in the production environment.Table name. The output names parsed from the input tables are used to determine the ancestor nodes of the current node, and the output name parsed from the output table is used as the output of the current node for its descendant nodes.

  3. Configure other required settings based on your business requirements. After the configuration is complete, click the save icon in the top toolbar on the configuration tab of the node to save the node configurations. A query for previewing the generated profile data after the node runs is provided below.
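
After the node runs, you can optionally preview the generated profile data, for example the users with the most page views in a partition. The following statement is a minimal sketch; replace <data timestamp> with the partition that you want to check.

    -- Optional preview: users with the most page views in a partition.
    SELECT uid, region, device, pv, gender, age_range, zodiac
    FROM ads_user_info_1d_odps
    WHERE dt = '<data timestamp>'
    ORDER BY pv DESC
    LIMIT 10;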

Step 3: Run a workflow

Run the workflow

  1. On the DataStudio page, double-click the User profile analysis_MaxCompute workflow under Business Flow. On the configuration tab of the workflow, click the Run icon in the top toolbar to run the nodes in the workflow based on the scheduling dependencies between the nodes.

  2. Confirm the status.

    • View the node status: If a node is in the successful state, the node runs as expected.

    • View the node running logs: Right-click the node and select View Logs to view the logs of each node in the entire user profile analysis process in the development environment.

View the data processing result

After all nodes in the workflow are successfully run, you can query the final result table.

  1. In the left-side navigation pane of the DataStudio page, click Ad Hoc Query.

  2. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > ODPS SQL.

    Execute the following SQL statement in the ODPS SQL node to confirm the final result table in this example.

    -- You must specify the data timestamp of the data that you want to query as the partition filter condition. For example, if a node is scheduled to run on February 22, 2023, the data timestamp of the node is 20230221, which is one day earlier than the scheduled run date. 
    select count(*) from ads_user_info_1d_odps where dt='Data timestamp';


    Note

    In this example, nodes are run in DataStudio, which is the development environment. Therefore, data is written to the specified tables in the MaxCompute project workshop2024_01_dev that is associated with the workspace in the development environment by default.

Step 4: Deploy the workflow

The nodes can be automatically scheduled and run only after the nodes are deployed to the production environment. For more information, see the following content.

Commit the workflow to the development environment

In the top toolbar of the configuration tab of the workflow, click the commit icon to commit all nodes in the workflow. In the Commit dialog box, configure the parameters as required and click Confirm.


Commit the workflow to the production environment

After you commit the workflow, the nodes in the workflow enter the development environment. You must deploy the configured nodes to the production environment because nodes in the development environment cannot be automatically scheduled.

  1. In the top toolbar of the configuration tab of the workflow, click the deploy icon. Alternatively, go to the configuration tab of one of the nodes in DataStudio and click the Deploy icon in the upper-right corner to go to the Create Deploy Task page.

  2. Deploy the desired nodes at the same time. The deployed content includes the resources and functions that are involved in the workflow.

Step 5: Perform O&M and scheduling operations on the workflow

In actual development scenarios, you can use the data backfill feature in the production environment to backfill data for a historical period of time or a period of time in the future. This step describes how to backfill data and perform O&M operations on the generated data backfill instances. For more information about the capabilities of Operation Center, see Operation Center.

  1. Go to the Operation Center page.

    After the nodes are deployed, click Operation Center in the upper-right corner of the configuration tab of the node.

    You can also click Operation Center in the upper part of the DataStudio page to go to the Operation Center page.

  2. Backfill data for auto triggered tasks.

    1. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. On the page that appears, find the root node workshop_start_odps of the workflow.

    2. Right-click the workshop_start_odps node and choose Run > Current and Descendant Nodes Retroactively.

    3. Select all descendant nodes of the workshop_start_odps node, enter data timestamps, and then click OK. The Patch Data page appears.


  3. Click Refresh until all SQL nodes are successfully run.

Note

After the test is complete, you can configure the Validity Period parameter for the nodes or freeze the root node of the workflow to which the nodes belong to prevent fees incurred by long-term node scheduling. The root node is the zero load node named workshop_start_odps.

What to do next

After you deploy the nodes, the entire workflow is complete. You can view the details of the created tables, consume data in the related table, and configure data quality monitoring. For more information, see Manage data, Use an API to provide data services, Visualize data on a dashboard, and Configure rules to monitor data quality.

Appendix: Data processing example

  • Before data processing

    58.246.10.82##@@2d24d94f14784##@@2014-02-12 13:12:25##@@GET /wp-content/themes/inove/img/feeds.gif HTTP/1.1##@@200##@@2572##@@http://coolshell.cn/articles/10975.html##@@Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36
  • After data processing

    The log line is split into separate fields, such as ip, uid, time, method, url, protocol, referer, device, and identity, and written to a partition of the dwd_log_info_di_odps table. For an illustration of the split logic, see the sketch below.
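
    The following ad hoc query is a minimal sketch that illustrates the split logic in isolation. It extracts a few fields from the col column of the ods_raw_log_d_odps table; replace <data timestamp> with the partition that your synchronization node wrote.

    -- Illustration of the SPLIT logic that is used by the dwd_log_info_di_odps node.
    SELECT SPLIT(col, '##@@')[0] AS ip
      , SPLIT(col, '##@@')[1] AS uid
      , SPLIT(col, '##@@')[3] AS request
      , SPLIT(col, '##@@')[7] AS agent
    FROM ods_raw_log_d_odps
    WHERE dt = '<data timestamp>'
    LIMIT 10;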