DataWorks: General development process

Last Updated: Nov 14, 2024

DataWorks encapsulates different types of compute engine tasks into different types of nodes and allows you to create nodes to generate data development tasks. DataWorks also allows you to use resources, functions, and different logic processing nodes to develop complex tasks. This topic describes the general development process of data development tasks.

Prerequisites

Go to the DataStudio page

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

Then, you can create the desired nodes by performing the steps that are described in the following section.

Development process

The following figure and table show the general development process of a data development task.

[Figure: Script development process]

Step

Description

References

Step 1: Create a workflow

Data development in DataWorks is performed based on workflows and code. Before you perform development operations, you must create a workflow.

Create a workflow

Step 2: Create a table

DataWorks allows you to create tables in the DataWorks console and displays tables in a directory structure. You can manage the tables in the DataWorks console.

Before you develop data in your workspace, you must create tables to store raw data and tables to receive data cleansing results in the compute engines that are associated with your workspace. You can determine which types of tables are required based on the compute engines that you use.

Step 3: (Optional) Create and upload resources

DataWorks allows you to upload different types of resources, such as text files and JAR packages, to a specified compute engine and to use those resources when you develop data. To use existing resources for data development, upload them in the DataWorks console. You can then manage them in the console.

Note

You can view the compute engines for which you can create resources and the types of resources that are supported by compute engines in the DataWorks console.

Step 4: Create a scheduling node

Data development in DataWorks is based on nodes, and tasks of different types of compute engines are encapsulated into different types of nodes in DataWorks. You can select a node type to develop nodes based on your business requirements.

You can also manage nodes with ease. For example, you can use a node group to clone multiple nodes at a time, and you can quickly restore deleted nodes from the recycle bin.

DataWorks supports the following types of compute engines:

You can select different types of nodes for tasks of different types of compute engines. For information about different types of DataWorks nodes, see DataWorks nodes.

For information about the management operations that you can perform on nodes, see the following topics:

Step 5: (Optional) Reference resources in nodes

Before you can use resources in a DataWorks node, you must load the resources to the development environment of the node.

Step 6: (Optional) Register a function

Before you can use a function to develop data, you must register the function in the DataWorks console. Before you register a function, you must upload the resources that are required by the function to DataWorks.

Note

You can view the compute engines for which you can register functions in the DataWorks console.
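The ordering constraints stated in the steps so far (create a workflow first, create tables before you develop data, upload resources before you reference them in a node or register a function) can be sketched as a small dependency graph. The following Python sketch is illustrative only: the step names are shorthand labels chosen for this example, not DataWorks API identifiers, and the dependency map is an interpretation of the descriptions in this topic.

```python
from graphlib import TopologicalSorter

# Hypothetical shorthand names for the steps in this topic.
# Each key maps to the set of steps that must be finished before it.
PREREQUISITES = {
    "create_workflow": set(),
    "create_table": {"create_workflow"},
    "upload_resource": {"create_workflow"},        # optional step
    "create_node": {"create_workflow"},
    "reference_resource_in_node": {"upload_resource", "create_node"},  # optional step
    "register_function": {"upload_resource"},      # optional step
    "write_node_code": {"create_node", "create_table"},
}


def development_order(prerequisites):
    """Return one valid order in which the steps can be performed."""
    return list(TopologicalSorter(prerequisites).static_order())


order = development_order(PREREQUISITES)
print(order)
```

Any order that the sorter returns starts with the workflow and places each step after the steps it depends on. If you skip the optional steps, you can remove them from the map.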

Step 7: Write the node code

On the node configuration tab, you can write the code for a node by using the syntax that is supported by the corresponding compute engine and the related database. The syntax varies based on the node type.

Note

After you write the code, click the Save icon promptly to save the code and prevent code loss.

For information about different types of DataWorks nodes, see DataWorks nodes.

Usage notes of common compute engines:

Subsequent steps: Debug code and configure scheduling properties

After the node code is developed, you can perform the following operations based on your business requirements:

  • Debug code: Debug and run a single node or the entire workflow to which the node belongs based on your business requirements. You can view the debugging result after the debugging is complete. For more information, see Debugging procedure.

  • Configure scheduling properties: Configure scheduling properties for the node. The node is periodically scheduled based on these configurations. For more information, see Configure basic properties.

  • Commit and deploy the node: After the node is developed, you must commit it to the related environment for scheduling and running. If you use a workspace in standard mode, after you commit the node, you must click Deploy in the upper-right corner of the configuration tab of the node to deploy the node. For more information, see Deploy a node.

  • Perform O&M operations on the node: After the node is deployed, the node is displayed in Operation Center in the production environment by default. You can go to Operation Center in the production environment to view the running status of the node and perform O&M operations on the node. For more information, see Overview.
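The subsequent steps above can be summarized as a short checklist that depends on the workspace mode. The following Python sketch is a minimal illustration, not a DataWorks API: the function name and step labels are hypothetical. It also assumes that a commit alone is sufficient in workspaces that are not in standard mode, which this topic implies but does not state.

```python
def post_development_steps(standard_mode: bool) -> list[str]:
    """List the operations that follow code development, in order.

    Debugging and scheduling configuration are performed based on
    business requirements; they are listed here for completeness.
    """
    steps = ["debug_code", "configure_scheduling", "commit_node"]
    if standard_mode:
        # In standard mode, a committed node must also be deployed.
        steps.append("deploy_node")
    # Assumption: O&M in Operation Center is the last step in every mode.
    steps.append("perform_om_in_operation_center")
    return steps


for mode in (True, False):
    print(mode, post_development_steps(standard_mode=mode))
```

The only branch in the sketch is the extra deploy operation in standard mode.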