This topic describes how developers can create an auto triggered node in DataStudio. An example in which a MaxCompute data source is used to run MaxCompute jobs in DataWorks is provided to help you quickly understand the basic usage of the DataStudio module.
Prerequisites
The environments required for data development are prepared. For more information, see Data development: Developers.
In this example, an ODPS SQL node needs to be created. Therefore, you must add a MaxCompute data source to your workspace.
You must prepare an account that has data development permissions. The account can be an Alibaba Cloud account or a RAM user that is assigned the Workspace Administrator or Develop role.
Background information
DataStudio provides a visualized development interface for nodes of various types of compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), and CDH. You can use the visualized development interface to configure settings to perform intelligent code development, data cleansing and processing, and standardized node development and deployment. This helps ensure efficient and stable data development. For more information about how to use DataStudio, see Overview.
The procedure that is used to write raw business data to DataWorks and obtain a final result table consists of the following steps:
Create tables in DataWorks. In this example, the following tables are used:
Source table: stores data that is synchronized from other data sources.
Result table: stores data that is cleansed and processed in DataWorks.
Create a data synchronization node to synchronize business data to the preceding source table.
Create a compute node to cleanse the data in the source table, process the data at each layer, and then write the results of each layer to the result table.
You can also upload data from your on-premises machine to the source table. Then, you can use a compute node to cleanse and process the data, and store the processed data in the result table. In this example, data is uploaded from an on-premises machine to a source table and a compute node is used to cleanse and process the data.
Procedure
| Step | Description |
| --- | --- |
| Step 1: Create a workflow | Code is developed based on workflows in DataStudio. Before you perform development operations, you must create a workflow. |
| Step 2: Create tables | DataWorks allows you to create tables in a visualized manner and displays tables in a directory structure. Before data development, you must create tables in the MaxCompute compute engine to store the raw data and the data processing results. |
| Step 3: Create a node | Data development in DataWorks is based on nodes. Nodes of different types of compute engines are encapsulated into different types of nodes in DataWorks. Select a suitable node type based on your business requirements. |
| Step 4: Configure the node | Write code for the node on the node configuration tab based on the syntax that is supported by the related database. |
| Step 5: Configure scheduling properties for the node | Configure scheduling properties for the node to enable the system to periodically schedule and run the node. |
| Step 6: Debug the code of the node | Use the quick run feature for code snippets, or the Run or Run with Parameters feature, to debug and check the logic of the code of the node. |
| Step 7: Save and commit the node | After the node is debugged, save and commit the node. |
| Step 8: Perform smoke testing | To ensure that the node runs efficiently in the production environment and does not waste computing resources, commit the node to the development environment and perform smoke testing before you deploy the node. This helps ensure the correctness of the code of the node. |
| Step 9: Deploy the node | DataWorks can automatically schedule only nodes that are deployed to the production environment. After the node passes the smoke testing, deploy the node to the production environment to enable DataWorks to periodically schedule the node. |
Step 1: Create a workflow
DataWorks organizes data development processes by using workflows. For each workflow, DataWorks provides dashboards for different types of nodes and allows you to use tools on the dashboards to optimize and manage the nodes. This facilitates data development and management. You can place nodes that belong to the same business type in one workflow based on your business requirements.
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create a workflow.
You can use one of the following methods to create a workflow:
Method 1: Move the pointer over the Create icon and click Create Workflow.
Method 2: Right-click Business Flow in the Scheduled Workflow pane and select Create Workflow.
In the Create Workflow dialog box, configure the Workflow Name and Description parameters for the workflow, and click Create.
In this example, the Workflow Name parameter is set to Create the first auto triggered node. You can configure the Workflow Name parameter based on your business requirements in actual data development scenarios.
Note: For more information about how to use workflows, see Create a workflow.
Step 2: Create tables
A data development node of DataWorks cleanses and processes source data. Before data development, you must create a table in the required compute engine to store the data cleansing results and define the table schema.
Create tables.
Click Business Flow in the Scheduled Workflow pane. Find the workflow that is created in Step 1, click the workflow name, right-click MaxCompute, and then select Create Table.
In the Create Table dialog box, configure parameters such as Engine Instance and Name.
In this example, the following tables are created:

| Table name | Description |
| --- | --- |
| bank_data | Stores raw business data. |
| result_table | Stores the data cleansing results. |
Note: For information about table creation statements, see Table creation statements.
For information about how to create tables in different compute engines in a visualized manner, such as creating a MaxCompute table or an EMR table, see Create tables.
Generate table schemas.
Go to the configuration tabs of the tables, switch to the DDL mode, and use DDL statements to generate schemas for the tables. After the table schemas are generated, configure the Display Name parameter in the General section, click Commit to Development Environment, and then click Commit to Production Environment in the top toolbar. After the tables are committed, you can view the tables in the MaxCompute data source in the related environment. For information about how to view the data sources that are added to workspaces in different environments, see Add a MaxCompute data source.
Note: Operations such as table creation and table update can take effect in the related compute engines only after they are committed to the required environment.
You can also follow the on-screen instructions that are displayed in the DataWorks console to configure the table schemas in a visualized manner based on your business requirements. For more information about how to create a table in a visualized manner, see Create and manage MaxCompute tables.
In this example, the following statement is used to generate the schema of the bank_data table:
CREATE TABLE IF NOT EXISTS bank_data (
    age            BIGINT COMMENT 'Age',
    job            STRING COMMENT 'Job type',
    marital        STRING COMMENT 'Marital status',
    education      STRING COMMENT 'Education level',
    default        STRING COMMENT 'Credit card',
    housing        STRING COMMENT 'Mortgage',
    loan           STRING COMMENT 'Loan',
    contact        STRING COMMENT 'Contact information',
    month          STRING COMMENT 'Month',
    day_of_week    STRING COMMENT 'Day of the week',
    duration       STRING COMMENT 'Duration',
    campaign       BIGINT COMMENT 'Number of contacts during the campaign',
    pdays          DOUBLE COMMENT 'Interval from the last contact',
    previous       DOUBLE COMMENT 'Number of contacts with the customer',
    poutcome       STRING COMMENT 'Result of the previous marketing campaign',
    emp_var_rate   DOUBLE COMMENT 'Employment change rate',
    cons_price_idx DOUBLE COMMENT 'Consumer price index',
    cons_conf_idx  DOUBLE COMMENT 'Consumer confidence index',
    euribor3m      DOUBLE COMMENT 'Euro deposit rate',
    nr_employed    DOUBLE COMMENT 'Number of employees',
    y              BIGINT COMMENT 'Time deposit available or not'
);
In this example, the following statement is used to generate the schema of the result_table table:
CREATE TABLE IF NOT EXISTS result_table (
    education STRING COMMENT 'Education level',
    num       BIGINT COMMENT 'Number of persons'
)
PARTITIONED BY (
    day  STRING,
    hour STRING
);
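After you commit the tables, you can verify the generated schemas before you move on. The following is a quick sanity check, assuming you run the statements in the environment to which the tables were committed:

-- Verify that the committed table schemas match the DDL statements above.
DESC bank_data;
DESC result_table;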
Upload data.
Upload raw business data to a table in DataWorks. In this example, a file named banking.txt is uploaded from an on-premises machine to the bank_data table. For more information about how to upload data, see Upload a file from your on-premises machine to the bank_data table.
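After the upload is complete, you can run a simple query to confirm that the data was imported as expected. A minimal check, assuming the banking.txt data was uploaded to the bank_data table:

-- Confirm that the upload succeeded and inspect a few sample rows.
SELECT COUNT(*) FROM bank_data;
SELECT * FROM bank_data LIMIT 10;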
Step 3: Create a node
Select a suitable node type for node development based on your business requirements.
Nodes in DataWorks can be classified into data synchronization nodes and compute nodes. In most data development scenarios, you need to use a batch synchronization node to synchronize data from a business database to a data warehouse, and then use a compute node to cleanse and process the data in the data warehouse.
Create a node.
You can use one of the following methods to create a node:
Method 1: Create a node in the Scheduled Workflow pane
In the Scheduled Workflow pane of the DataStudio page, click Business Flow, find the workflow that you created, and then click the name of the workflow.
Right-click the compute engine that you want to use, move the pointer over Create Node, and then select a suitable node type to create a node of the selected type.
Method 2: Create a node on the configuration tab of the workflow
In the Scheduled Workflow pane of the DataStudio page, click Business Flow and find the workflow that you created.
Double-click the name of the workflow to go to the configuration tab of the workflow.
In the left-side section of the configuration tab, click the required node type or drag the required node type to the canvas on the right side.
In the Create Node dialog box, configure parameters such as Engine Instance and Name.
In this example, an ODPS SQL node named result_table is created. The name of the node is the same as the name of the result table that is created in Step 2.
Note: When you use DataWorks for data development, you typically use a compute node to cleanse data and then store the cleansing results in a result table. We recommend that you use the name of the result table as the name of the node so that you can quickly locate the table data that is generated by the node.
Step 4: Configure the node
Find the node that you created in Step 3, and double-click the name of the node to go to the node configuration tab. On the node configuration tab, write the code of the node based on the syntax that is supported by the related database.
In this example, the result_table node writes the data in the specified partition of the bank_data table to the specified partition of the result_table table. The partition to which the data is written is defined by the day and hour variables.
If you want to use variables to dynamically replace parameters in scheduling scenarios during code development, define the variables in the code in the ${Custom variable name} format and assign values to the variables when you configure scheduling properties for the node in Step 5. For more information about scheduling parameters, see Supported formats of scheduling parameters.
For more information about the code syntax for different types of nodes, see Create and use nodes.
Sample code:
INSERT OVERWRITE TABLE result_table partition (day='${day}', hour='${hour}')
SELECT education
, COUNT(marital) AS num
FROM bank_data
GROUP BY education;
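To make the variable substitution concrete: if you assign the scheduling parameter expressions day=$[yyyymmdd-1] and hour=$[hh24] in Step 5 (one possible assignment that matches the hourly schedule in this example), an instance scheduled at 14:00 on September 7, 2022 would run a statement roughly equivalent to the following:

-- Hypothetical resolved statement for an instance scheduled at 2022-09-07 14:00,
-- assuming day=$[yyyymmdd-1] and hour=$[hh24] in the Parameters section:
INSERT OVERWRITE TABLE result_table PARTITION (day='20220906', hour='14')
SELECT education
    , COUNT(marital) AS num
FROM bank_data
GROUP BY education;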
Step 5: Configure scheduling properties for the node
You can configure scheduling properties for a node to enable periodic scheduling for the node. In the right-side navigation pane of the node configuration tab, click the Properties tab. You can configure scheduling properties in different sections of the tab for the node based on your business requirements.
| Section | Description |
| --- | --- |
| General | The node name, node ID, node type, and owner of the node are automatically displayed in this section. You do not need to configure additional settings. |
| Parameters | In this section, you can configure the scheduling parameters that define how the node is scheduled. DataWorks provides scheduling parameters that are classified into custom parameters and built-in variables based on their value assignment methods. Scheduling parameters support dynamic parameter settings for node scheduling. If a variable is defined in the code of the node in Step 4, you can assign a value to the variable in this section. In this example, values are assigned to the day and hour variables that are defined in Step 4 so that the node writes the data that is generated in the 24 hours of the previous day in the bank_data table to the corresponding partitions of the result_table table. |
| Schedule | In this section, you can configure time properties for the node, such as the instance generation mode, the scheduling cycle, the point in time at which the node starts to run, the rerun settings, and the timeout period. In this example, the node is scheduled to run at an interval of 1 hour from 00:00 to 23:59. |
| Resource Group | In this section, you can select the resource group for scheduling that you want to use after the node is deployed to the production environment. When you activate DataWorks, a serverless resource group is provided. In this example, the serverless resource group is used. For more information about how to create and use a serverless resource group, see Create and use a serverless resource group. |
| Dependencies | In this section, you can configure scheduling dependencies for the node. We recommend that you configure scheduling dependencies based on the lineage of the node. If the ancestor node of the current node is successfully run, the table data that the current node needs to use is generated, and the current node can obtain the table data. In this example, the data in the bank_data table is uploaded from an on-premises machine instead of being generated by an ancestor node. Therefore, you can use the root node of the workspace as the ancestor node of the current node. |
| Input and Output Parameters | In this section, you can configure input parameters and output parameters for the node. The configurations in this section are optional. A node can obtain the values of parameters that are configured for its ancestor node by using specific parameters. Note: In most cases, this process requires assignment nodes or scheduling parameters. |
Step 6: Debug the code of the node
You can use one of the following features to debug the code logic to ensure that the code you write is correct.
| Feature | Description | Suggestion |
| --- | --- | --- |
| Quick run for code snippets | You can quickly run a code snippet that you select on the configuration tab of the node. | Use this feature to quickly run a code snippet of a node. |
| Top toolbar: Run | You can assign constants to the variables that are defined in the code in specific test scenarios. Note: The first time you click the Run icon to run a new node, you must manually assign constants to the variables in the dialog box that appears. The assignments are recorded by the system, and you do not need to repeat the operation for subsequent runs. | Use this feature to frequently debug the full code of a node. |
| Top toolbar: Run with Parameters | You must assign constants to the variables that are defined in the code each time you click this icon. | Use this feature when you want to modify the values that are assigned to the variables in the code. |
In this example, the node is run at 2022.09.07 14:00 in a Run with Parameters test.
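For example, instead of running the full INSERT statement, you can select only the aggregation query on the node configuration tab and use the quick run feature to inspect the intermediate result. A sample debugging snippet:

-- Debugging snippet: preview the aggregation result before writing it to result_table.
SELECT education
    , COUNT(marital) AS num
FROM bank_data
GROUP BY education
LIMIT 10;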
Step 7: Save and commit the node
After node configuration and testing are complete, save the node configuration, and then commit the node to the development environment.
You can commit the node to the development environment only after you configure rerun settings and ancestor nodes for the node in Step 5.
Click the Save icon in the top toolbar to save the node.
Click the Submit icon in the top toolbar to commit the node to the development environment.
Step 8: Perform smoke testing
To ensure that the node you developed runs efficiently and does not waste computing resources, we recommend that you perform smoke testing on the node before you commit and deploy it to the production environment. Smoke testing is performed in the development environment, so you must commit the node to the development environment before you perform smoke testing.
Click the icon in the top toolbar. In the smoke testing dialog box, specify the data timestamp of the node.
After the smoke testing is complete, click the icon in the top toolbar to view the test results.
In this example, smoke testing is performed to check whether the configured scheduling parameters meet requirements. The result_table node is scheduled to run at an interval of 1 hour from 00:00 to 23:59. When the smoke testing is performed on the node, two instances are generated. The scheduled points in time of the instances are 00:00 and 01:00.
Auto triggered instances are snapshots that are generated for an auto triggered node when the node is scheduled to run based on the specified scheduling cycle.
The result_table node is scheduled by hour. You must specify the data timestamp of the node for the smoke testing, and select the start time and end time of the test. For more information about how to perform smoke testing in the development environment, see Perform smoke testing.
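After the smoke testing instances finish, you can check whether the expected partitions were produced in the development environment. The following is a minimal check; the sample partition values are hypothetical and depend on the data timestamp and the parameter expressions that you configured in Step 5:

-- List the partitions of result_table and inspect the data written by the test instances.
SHOW PARTITIONS result_table;
-- The partition values below are placeholders; replace them with the values
-- that your scheduling parameters resolved to during the test.
SELECT * FROM result_table WHERE day='20220906' AND hour='00' LIMIT 10;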
Step 9: Deploy the node
If the workspace is in basic mode, the node can be periodically scheduled after the node is committed. If the workspace is in standard mode, the node is in the pending state after the node is committed. You must refer to the operations that are described in this step to deploy the node. The node can be periodically scheduled only after the node is deployed.
DataWorks can automatically schedule only the nodes that are deployed to the production environment. After smoke testing is complete, commit and deploy the node to the production environment to enable DataWorks to periodically schedule the node.
For more information about workspaces in basic mode and workspaces in standard mode, see Differences between workspaces in basic mode and workspaces in standard mode.
In a workspace in standard mode, the operations that are committed on the DataStudio page, including addition, update, and deletion of data development nodes, resources, and functions, are in the pending state on the Create Deploy Task page. You can click Deploy to go to the Create Deploy Task page, and deploy the related operations to the production environment. The operations take effect only after they are deployed to the production environment. For more information, see Deploy nodes.
The following table describes the items that are related to the deployment procedure.
| Item | Description |
| --- | --- |
| Deployment control | Whether the deployment operation succeeds depends on the permissions of the role of the user who performs the operation and on the deployment procedure that is used. |
| Instance generation mode | If you create or update a node and deploy the node in the time range of 23:30 to 24:00, auto triggered instances may fail to be generated for the node as expected. Note: This limit takes effect on nodes for which the Instance Generation Mode parameter is set to Next Day or Immediately After Deployment. For more information about the instance generation mode, see Configure immediate instance generation for a task. |
What to do next
You can go to the Auto Triggered Tasks page in Operation Center to view the auto triggered node that is deployed to the production environment and perform related O&M operations on the node. For more information, see Perform basic O&M operations on auto triggered nodes.