
DataWorks: Data development: Developers

Last Updated: Nov 13, 2024

This topic describes how developers can create an auto triggered node in DataStudio. It uses a MaxCompute data source to run MaxCompute jobs in DataWorks as an example, which helps you quickly understand the basic usage of the DataStudio module.

Prerequisites

The environments required for data development are prepared. For more information, see Data development: Developers.

Note
  • In this example, an ODPS SQL node needs to be created. Therefore, you must add a MaxCompute data source to your workspace.

  • You must prepare an account that has data development permissions. The account can be an Alibaba Cloud account or a RAM user that is assigned the Workspace Administrator or Develop role.

Background information

DataStudio provides a visualized development interface for nodes of various types of compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), and CDH. You can use the interface to perform intelligent code development, data cleansing and processing, and standardized node development and deployment. This helps ensure efficient and stable data development. For more information about how to use DataStudio, see Overview.

The procedure that is used to write raw business data to DataWorks and obtain a final result table consists of the following steps:

  1. Create multiple tables in DataWorks. Example:

    • Source table: stores data that is synchronized from other data sources.

    • Result table: stores data that is cleansed and processed in DataWorks.

  2. Create a data synchronization node to synchronize business data to the preceding source table.

  3. Create a compute node to cleanse the data in the source table, process the data at each layer, and then write the results of each layer to the result table.

You can also upload data from your on-premises machine to the source table. Then, you can use a compute node to cleanse and process the data, and store the processed data in the result table. In this example, data is uploaded from an on-premises machine to a source table and a compute node is used to cleanse and process the data.

Go to the DataStudio page

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

Procedure

  1. Step 1: Create a workflow

    Code is developed based on a workflow in DataStudio. Before you perform development operations, you must create a workflow.

  2. Step 2: Create tables

    DataWorks allows you to create tables in a visualized manner and displays tables in a directory structure. Before data development, you must create a table in the MaxCompute compute engine to store the data processing results.

  3. Step 3: Create a node

    Data development in DataWorks is based on nodes. Operations of different types of compute engines are encapsulated into different node types in DataWorks. You can select a suitable node type to develop a compute engine node based on your business requirements.

  4. Step 4: Configure the node

    You can write code for the node on the node configuration tab based on the syntax that is supported by the related database.

  5. Step 5: Configure scheduling properties for the node

    You can configure scheduling properties for the node to enable the system to periodically schedule and run the node.

  6. Step 6: Debug the code of the node

    You can use the quick run feature for code snippets, or the Run feature or Run with Parameters feature to debug and check the logic of the code of the node.

  7. Step 7: Save and commit the node

    After the node is debugged, you must save and commit the node.

  8. Step 8: Perform smoke testing

    To ensure that a node runs efficiently in the production environment and does not waste computing resources, you can commit the node to the development environment and perform smoke testing there before you deploy the node. This helps verify the correctness of the node code.

  9. Step 9: Deploy the node

    DataWorks can automatically schedule only nodes that are deployed to the production environment. After the node passes the smoke testing, you must deploy the node to the production environment to enable DataWorks to periodically schedule the node.

Step 1: Create a workflow

DataWorks organizes data development processes by using workflows. In each workflow, DataWorks provides dashboards for different types of nodes, on which you can use tools to optimize and manage the nodes. This facilitates data development and management. You can place nodes of the same business type in one workflow based on your business requirements.

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create a workflow.

    You can use one of the following methods to create a workflow:

    • Method 1: Move the pointer over the Create icon and click Create Workflow.

    • Method 2: Right-click Business Flow in the Scheduled Workflow pane and select Create Workflow.

  3. In the Create Workflow dialog box, configure the Workflow Name and Description parameters for the workflow, and click Create.

    In this example, the Workflow Name parameter is set to Create the first auto triggered node. You can configure the Workflow Name parameter based on your business requirements in actual data development scenarios.

    Note

    For more information about how to use workflows, see Create a workflow.

Step 2: Create tables

A data development node of DataWorks cleanses and processes source data. Before data development, you must create a table in the required compute engine to store the data cleansing results and define the table schema.

  1. Create tables.

    1. Click Business Flow in the Scheduled Workflow pane. Find the workflow that is created in Step 1, click the workflow name, right-click MaxCompute, and then select Create Table.

    2. In the Create Table dialog box, configure parameters such as Engine Instance and Name.

    In this example, the following tables are created:

    • bank_data: stores raw business data.

    • result_table: stores the data cleansing results.

    Note
    • For information about table creation statements, see Table creation statements.

    • For information about how to create tables in different compute engines in a visualized manner, such as creating a MaxCompute table or an EMR table, see Create tables.

  2. Generate table schemas.

    Go to the configuration tabs of the tables, switch to the DDL mode, and use DDL statements to generate schemas for the tables. After the table schemas are generated, configure the Display Name parameter in the General section, click Commit to Development Environment, and then click Commit to Production Environment in the top toolbar. After the tables are committed, you can view the tables in the MaxCompute data source in the related environment. For information about how to view the data sources that are added to workspaces in different environments, see Add a MaxCompute data source.

    Note
    • Operations such as table creation and table update can take effect in the related compute engines only after they are committed to the required environment.

    • You can also follow the on-screen instructions that are displayed in the DataWorks console to configure the table schemas in a visualized manner based on your business requirements. For more information about how to create a table in a visualized manner, see Create and manage MaxCompute tables.

    In this example, the following statement is used to generate the schema of the bank_data table:

    CREATE TABLE IF NOT EXISTS bank_data
    (
     age             BIGINT COMMENT 'Age',
     job             STRING COMMENT 'Job type',
     marital         STRING COMMENT 'Marital status',
     education       STRING COMMENT 'Education level',
     default         STRING COMMENT 'Credit card',
     housing         STRING COMMENT 'Mortgage',
     loan            STRING COMMENT 'Loan',
     contact         STRING COMMENT 'Contact information',
     month           STRING COMMENT 'Month',
     day_of_week     STRING COMMENT 'Day of the week',
     duration        STRING COMMENT 'Duration',
     campaign        BIGINT COMMENT 'Number of contacts during the campaign',
     pdays           DOUBLE COMMENT 'Interval from the last contact',
     previous        DOUBLE COMMENT 'Number of contacts with the customer',
     poutcome        STRING COMMENT 'Result of the previous marketing campaign',
     emp_var_rate    DOUBLE COMMENT 'Employment change rate',
     cons_price_idx  DOUBLE COMMENT 'Consumer price index',
     cons_conf_idx   DOUBLE COMMENT 'Consumer confidence index',
     euribor3m       DOUBLE COMMENT 'Euro deposit rate',
     nr_employed     DOUBLE COMMENT 'Number of employees',
     y               BIGINT COMMENT 'Time deposit available or not'
    );

    In this example, the following statement is used to generate the schema of the result_table table:

    CREATE TABLE IF NOT EXISTS result_table
    (
     education       STRING COMMENT 'Education level',
     num             BIGINT COMMENT 'Number of persons'
    )
    PARTITIONED BY
    (
     day             STRING,
     hour            STRING
    );
  3. Upload data.

    Upload raw business data to a table in DataWorks. In this example, a file named banking.txt is uploaded from an on-premises machine to the bank_data table. For more information about how to upload data, see Upload a file from your on-premises machine to the bank_data table.
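
    After the upload is complete, you can run a quick query on the node configuration tab to confirm that the data is available in the bank_data table. The following statement is a minimal sketch; the column names come from the bank_data schema that is created in this example:

    -- Spot-check a few columns of the uploaded data in the bank_data table.
    SELECT age, job, marital, education, y
    FROM bank_data
    LIMIT 10;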

Step 3: Create a node

Select a suitable node type for node development based on your business requirements.

Note

Nodes in DataWorks can be classified into data synchronization nodes and compute nodes. In most data development scenarios, you need to use a batch synchronization node to synchronize data from a business database to a data warehouse, and then use a compute node to cleanse and process the data in the data warehouse.

  1. Create a node.

    You can use one of the following methods to create a node:

    • Method 1: Create a node in the Scheduled Workflow pane

      1. In the Scheduled Workflow pane of the DataStudio page, click Business Flow, find the workflow that you created, and then click the name of the workflow.

      2. Right-click the compute engine that you want to use, move the pointer over Create Node, and then select a suitable node type to create a node of that type.

    • Method 2: Create a node on the configuration tab of the workflow

      1. In the Scheduled Workflow pane of the DataStudio page, click Business Flow and find the workflow that you created.

      2. Double-click the name of the workflow to go to the configuration tab of the workflow.

      3. In the left-side section of the configuration tab, click the required node type or drag the required node type to the canvas on the right side.

  2. In the Create Node dialog box, configure parameters such as Engine Instance and Name.

    In this example, an ODPS SQL node named result_table is created. The name of the node is the same as the name of the result table that is created in Step 2.

    Note

    When you use DataWorks for data development, you need to use a compute node to cleanse the data and then store the cleansing results in a result table. We recommend that you use the name of the result table as the name of the node to quickly locate the table data that is generated by the node.


Step 4: Configure the node

Find the node that you created in Step 3, and double-click the name of the node to go to the node configuration tab. On the node configuration tab, write the code of the node based on the syntax that is supported by the related database.

In this example, the result_table node is used to write the data in the bank_data table to the specified partition in the result_table table. The destination partition is defined by the day and hour variables.

Note
  • If you want to use variables to dynamically replace parameters in scheduling scenarios during code development, you can define the variables in the code in the ${Custom variable name} format and assign values to the variables when you configure scheduling properties for the node in Step 5.

  • For more information about scheduling parameters, see Supported formats of scheduling parameters.

  • For more information about the code syntax for different types of nodes, see Create and use nodes.

Sample code:

INSERT OVERWRITE TABLE result_table partition (day='${day}', hour='${hour}')
SELECT education
, COUNT(marital) AS num
FROM bank_data
GROUP BY education;
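
If you want to confirm what the statement writes, you can query the destination partition after a test run. The following query is a minimal sketch, and the partition values are hypothetical examples of the values that the day and hour variables may resolve to:

-- Check the data that the node wrote to one hourly partition.
-- The partition values 20220906 and 14 are hypothetical examples.
SELECT education, num
FROM result_table
WHERE day = '20220906' AND hour = '14';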

Step 5: Configure scheduling properties for the node

You can configure scheduling properties for a node to enable periodic scheduling for the node. In the right-side navigation pane of the node configuration tab, click the Properties tab. You can configure scheduling properties in different sections of the tab for the node based on your business requirements.


General

In this section, the node name, node ID, node type, and owner of the node are automatically displayed. You do not need to configure additional settings.

Note
  • By default, the owner of the node is the current user. You can modify the owner of the node based on your business requirements. You can select only a member in the current workspace as the owner of the node.

  • An ID is automatically generated after the node is committed.

Scheduling Parameter

In this section, you can configure the scheduling parameters that the node uses when it is scheduled to run.

DataWorks scheduling parameters are classified into built-in variables and custom parameters based on how their values are assigned. Scheduling parameters allow parameter values to be dynamically replaced when the node is scheduled. If a variable is defined in the code of the node in Step 4, you can assign a value to the variable in this section.

In this example, the day and hour variables that are defined in Step 4 are assigned the following values so that the data that is generated in the 24 hours of the previous day in the bank_data table is written to the related partitions in the result_table table. A sketch of the substituted statement follows the list.

  • Assign ${yyyymmdd} to the day variable.

  • Assign $[hh24] to the hour variable.

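To illustrate how the values are substituted at run time, the following sketch shows what the statement from Step 4 resolves to for a hypothetical run that is scheduled at 14:00 on September 7, 2022, where the data timestamp is the previous day:

-- Hypothetical substitution: ${yyyymmdd} resolves to the data timestamp 20220906,
-- and $[hh24] resolves to the scheduled hour 14.
INSERT OVERWRITE TABLE result_table PARTITION (day='20220906', hour='14')
SELECT education
, COUNT(marital) AS num
FROM bank_data
GROUP BY education;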

Schedule

In this section, you can configure time properties for the node, such as the instance generation mode, the scheduling cycle, the point in time when you want to schedule the node to start, the rerun settings, and the timeout period.

Note
  • You can commit the node only after you configure the rerun settings.

  • The scheduling time that you specify for a node takes effect only on the node. The point in time when the node starts to run is related to the scheduling time of the ancestor node of the node. The node can start to run only if the scheduling time of the ancestor node arrives and the ancestor node is successfully run, even if the scheduling time of the node is earlier than the scheduling time of the ancestor node.

In this example, the result_table node is scheduled to run at an interval of 1 hour starting from 00:00. In each hourly run, the data that was generated in the bank_data table during the previous day is written to the related hourly partition in the result_table table.

Resource Group

In this section, you can select the resource group for scheduling that you want to use to deploy the node to the production environment. When you activate DataWorks, a serverless resource group is provided. In this example, the serverless resource group is used. For more information about how to create and use a serverless resource group, see Create and use a serverless resource group.

Dependencies

In this section, you can configure scheduling dependencies for the node. We recommend that you configure scheduling dependencies for the node based on the lineage of the node. If the ancestor node of the current node is successfully run, the table data that the current node needs to use is generated. This way, the current node can obtain the table data.

Note
  • If a SELECT statement is specified in the code of the current node to query the table data that is not generated by an auto triggered node, you can disable the automatic parsing feature and use the root node of the workspace to schedule the current node.

  • If a SELECT statement is specified in the code of the current node to query table data that is generated by other nodes, you can use the following methods to configure the node that generates the table data as an ancestor node of the current node. The ancestor node is then used to schedule the current node.

    • If the ancestor node does not belong to the current workflow or workspace, enter the output name of the ancestor node in the Parent Nodes table.

    • If the ancestor node belongs to the current workflow, configure scheduling dependencies for the current node by drawing lines in the canvas of the workflow.

In this example, the result_table node queries the data in the bank_data table, which is not generated by a node in the current workflow. Therefore, configure the root node of the workspace as the ancestor node of the result_table node, and use the root node to schedule the result_table node.

(Optional) Input and Output Parameters

In this section, you can configure input parameters and output parameters for the node. The configurations in this section are optional. A node can obtain the values of the output parameters of its ancestor nodes by using input parameters.

Note

In most cases, this process requires assignment nodes or scheduling parameters.
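
For example, a common pattern, which is not required in this tutorial, is to use an assignment node whose last query result is exposed as an output parameter that a descendant node reads through an input parameter. The following ODPS SQL statement is a minimal sketch of such an assignment node; the table and partition values are hypothetical examples that reuse the tables in this tutorial:

-- Sketch of an assignment node written in ODPS SQL: the result set of the last
-- query becomes this node's output parameter, which a descendant node can read
-- through an input parameter. The partition values are hypothetical examples.
SELECT education, num
FROM result_table
WHERE day = '20220906' AND hour = '14'
LIMIT 10;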

Step 6: Debug the code of the node

You can use one of the following features to debug the code logic to ensure that the code you write is correct.

  • Quick run (used to debug a code snippet)

    Description: You can quickly run a code snippet that you select on the configuration tab of the node.

    Suggestion: Use this feature to quickly run a code snippet of a node.

  • Run (in the top toolbar)

    Description: You can assign constants to the variables that are defined in the code in specific test scenarios.

    Note: The first time you click the Run icon to run a new node, you must manually assign constants to the variables that are defined in the code of the node in the dialog box that appears. The assignment is recorded by the system. You do not need to repeat the operation when you run the node again.

    Suggestion: Use this feature when you frequently debug the full code of a node.

  • Run with Parameters (in the top toolbar)

    Description: You must assign constants to the variables that are defined in the code in specific test scenarios each time you click this icon.

    Suggestion: Use this feature when you want to modify the values that are assigned to the variables in the code.

In this example, the node is run at 14:00 on September 7, 2022 by using the Run with Parameters feature.

Step 7: Save and commit the node

After node configuration and testing are complete, save the node configuration, and then commit the node to the development environment.

Note

You can commit the node to the development environment only after you configure rerun settings and ancestor nodes for the node in Step 5.

  1. Click the Save icon in the top toolbar to save the node.

  2. Click the Commit icon in the top toolbar to commit the node to the development environment.

Step 8: Perform smoke testing

To ensure that the node that you developed runs efficiently and makes full use of computing resources, we recommend that you perform smoke testing on the node before you deploy it. Smoke testing is performed in the development environment, so you must commit the node to the development environment before you perform smoke testing on it.

  1. Click the Smoke Testing icon in the top toolbar. In the dialog box that appears, specify the data timestamp of the node.

  2. After the smoke testing is complete, click the View Smoke Testing Results icon in the top toolbar to view the test results.

In this example, smoke testing is performed to check whether the configured scheduling parameters meet the requirements. The result_table node is scheduled to run at an interval of 1 hour from 00:00 to 23:59. When the smoke testing is performed on the node, two instances are generated, and their scheduling times are 00:00 and 01:00.

Note
  • Auto triggered instances are snapshots that are generated for an auto triggered node when the node is scheduled to run based on the specified scheduling cycle.

  • The result_table node is scheduled by hour. You must specify the data timestamp of the node for the smoke testing. You must also select the start time and end time of the test.

  • For more information about how to perform smoke testing in the development environment, see Perform smoke testing.


Step 9: Deploy the node

If the workspace is in basic mode, the node can be periodically scheduled after the node is committed. If the workspace is in standard mode, the node is in the pending state after the node is committed. You must refer to the operations that are described in this step to deploy the node. The node can be periodically scheduled only after the node is deployed.

Note
  • DataWorks can automatically schedule only the nodes that are deployed to the production environment. After smoke testing is complete, commit and deploy the node to the production environment to enable DataWorks to periodically schedule the node.

  • For more information about workspaces in basic mode and workspaces in standard mode, see Differences between workspaces in basic mode and workspaces in standard mode.

In a workspace in standard mode, the operations that are committed on the DataStudio page, including addition, update, and deletion of data development nodes, resources, and functions, are in the pending state on the Create Deploy Task page. You can click Deploy to go to the Create Deploy Task page, and deploy the related operations to the production environment. The operations take effect only after they are deployed to the production environment. For more information, see Deploy nodes.

The following items are related to the deployment procedure.

Deployment control

Whether the deployment operation succeeds depends on the permissions of the role of the user that performs the operation and on the deployment procedure that is used.

Note
  • After you deploy a node, you can view the deployment record and status of the node on the Deployment Packages page.

  • Developers can only create deployment packages. Deploying deployment packages requires O&M permissions.

Instance generation mode

If you create or update a node and deploy the node in the time range of 23:30 to 24:00, instances that are generated for the node take effect on the third day.

Note

This limit takes effect on nodes for which the Instance Generation Mode parameter is set to Next Day or Immediately After Deployment. For more information about the instance generation mode, see Configure immediate instance generation for a task.

What to do next

You can go to Operation Center, view the auto triggered node that is deployed to the production environment on the Auto Triggered Tasks page, and perform the related O&M operations on the node. For more information, see Perform basic O&M operations on auto triggered nodes.