DataWorks encapsulates tasks of different compute engine types into different types of nodes and allows you to create nodes to build data development tasks. You can also use resources, functions, and different logic processing nodes to develop complex tasks. This topic describes the general development process of a data development task.
Prerequisites
The desired data sources are associated with DataStudio. For more information, see Preparations before data development: Associate a data source or a cluster with DataStudio.
You are granted the permissions of the Development role. For more information, see Add a RAM user to a workspace as a member and assign roles to the member.
Go to the DataStudio page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Then, you can create the desired nodes by performing the steps that are described in the following section.
Development process
The following table describes the general development process of a data development task.
| Step | Description | References |
| --- | --- | --- |
| Step 1: Create a workflow | Data development in DataWorks is performed based on workflows and code. Before you perform development operations, you must create a workflow. | |
| Step 2: Create a table | DataWorks allows you to create tables in the DataWorks console, displays them in a directory structure, and lets you manage them in the console. Before you develop data in your workspace, create the tables that store raw data and the tables that receive data cleansing results in the compute engines that are associated with your workspace. The table types that you need depend on the compute engines that you use. A hedged PyODPS sketch of creating a table follows this table. | Create and use tables. View and manage tables. |
| Step 3: (Optional) Create and upload resources | DataWorks allows you to upload resources of different types, such as text files and JAR packages, to the specified compute engines and use these resources during data development. If you want to use existing resources for data development, upload them in the DataWorks console and then manage them in the console. Note: You can view the compute engines for which you can create resources, and the resource types that each compute engine supports, in the DataWorks console. A sketch of uploading resources programmatically follows this table. | |
| Step 4: Create a scheduling node | Data development in DataWorks is based on nodes, and tasks of different compute engine types are encapsulated into different node types. Select a node type based on your business requirements. You can also manage nodes with ease. For example, you can use a node group to clone multiple nodes at a time, and you can quickly restore deleted nodes from the recycle bin. | You can select different node types for tasks of different compute engine types. For information about the node types that DataWorks supports, see DataWorks nodes. Also see the topics about node management operations, such as node groups and the recycle bin. |
| Step 5: (Optional) Reference resources in nodes | Before you can use resources in a DataWorks node, you must load the resources into the development environment of the node. A sketch of reading an uploaded resource from node code follows this table. | |
| Step 6: (Optional) Register a function | Before you can use a function to develop data, you must register the function in the DataWorks console. Before you register a function, upload the resources that the function requires to DataWorks. Note: You can view the compute engines for which you can register functions in the DataWorks console. A sketch of registering a function programmatically follows this table. | |
| Step 7: Write the node code | On the configuration tab of a node, write the node code based on the syntax that is supported by the node's compute engine and the related database. The syntax varies based on the node type. Note: After you write the code, click the Save icon at the earliest opportunity to prevent code loss. A sketch of a minimal node body follows this table. | For information about the node types that DataWorks supports, see DataWorks nodes. Also see the usage notes for common compute engines. |
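To make Step 2 concrete, the following minimal sketch creates a partitioned MaxCompute table with PyODPS instead of through the console table editor that the step describes. The connection values, table name, and columns are placeholders, not values from this topic.

```python
# Minimal sketch: create a partitioned MaxCompute table with PyODPS.
# The console table editor described in Step 2 achieves the same result
# interactively; all names and credentials below are placeholders.
from odps import ODPS

o = ODPS(
    access_id="<AccessKey ID>",
    secret_access_key="<AccessKey Secret>",
    project="<your_project>",
    endpoint="<your_maxcompute_endpoint>",
)

# A table for raw data, partitioned by business date (hypothetical schema).
o.create_table(
    "ods_user_log",
    ("uid STRING, event STRING, ts BIGINT", "ds STRING"),  # (columns, partition columns)
    if_not_exists=True,
)
```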
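For Step 3, which uploads resources through the console, the sketch below shows a rough programmatic equivalent with PyODPS: a text file resource created from an in-memory string and a JAR resource read from a local file. The resource names and the local path are hypothetical.

```python
# Minimal sketch: upload a file resource and a JAR resource to MaxCompute
# with PyODPS (Step 3 performs the same task in the DataWorks console).
from odps import ODPS

o = ODPS("<AccessKey ID>", "<AccessKey Secret>", "<your_project>",
         endpoint="<your_maxcompute_endpoint>")

# A plain-text file resource created from an in-memory string.
o.create_resource("ip_blacklist.txt", "file", file_obj="127.0.0.1\n10.0.0.2\n")

# A JAR resource read from a local build artifact (path is hypothetical).
with open("target/udf-bundle.jar", "rb") as f:
    o.create_resource("udf-bundle.jar", "jar", file_obj=f)
```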
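Step 5 loads resources into the development environment of a node. As one illustration of how an uploaded resource can then be consumed, the sketch below reads the hypothetical ip_blacklist.txt file resource from the earlier sketch inside a PyODPS node, where DataWorks provides the entry object `o`.

```python
# Minimal sketch for a PyODPS node: read a previously uploaded file resource.
# In a DataWorks PyODPS node, the MaxCompute entry object `o` is provided by
# the platform, so no connection setup appears here. The resource name is
# the hypothetical one used in the upload sketch above.
with o.open_resource("ip_blacklist.txt", mode="r") as fp:
    blacklist = {line.strip() for line in fp.read().splitlines() if line.strip()}

print("loaded %d blacklisted IPs" % len(blacklist))
```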
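Step 6 registers functions in the console after the required resources are uploaded; the sketch below shows a rough PyODPS equivalent. The Python resource name and the UDF class name are hypothetical.

```python
# Minimal sketch: register a MaxCompute UDF with PyODPS after its Python
# resource has been uploaded (Step 6 does this in the DataWorks console).
from odps import ODPS

o = ODPS("<AccessKey ID>", "<AccessKey Secret>", "<your_project>",
         endpoint="<your_maxcompute_endpoint>")

# The Python resource that contains the UDF implementation must already exist.
udf_res = o.get_resource("my_udf.py")

# class_type is "<module name in the resource>.<class name>".
o.create_function("my_lower", class_type="my_udf.Lower", resources=[udf_res])
```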
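For Step 7, the sketch below is one possible body of a PyODPS node that runs a MaxCompute SQL statement. The syntax you actually write depends on the node type and compute engine, and the table names and partition value here are examples only.

```python
# Minimal sketch of PyODPS node code: run a MaxCompute SQL statement.
# `o` is provided by DataWorks in a PyODPS node; table names, columns,
# and the partition value are examples.
sql = """
INSERT OVERWRITE TABLE dwd_user_log PARTITION (ds='20240101')
SELECT uid, event, ts
FROM ods_user_log
WHERE ds = '20240101'
"""

# execute_sql runs the statement synchronously and raises an error if it fails.
o.execute_sql(sql)
```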
Subsequent steps: Debug code and configure scheduling properties
After the node code is developed, you can perform the following operations based on your business requirements:
Debug code: Debug and run a single node or the entire workflow to which the node belongs based on your business requirements. You can view the debugging result after the debugging is complete. For more information, see Debugging procedure.
Configure scheduling parameters: Configure scheduling parameters for the node. The node is then scheduled periodically based on these configurations. A sketch that shows how a scheduling parameter is consumed in node code follows this list. For more information, see Configure basic properties.
Commit and deploy the node: After the node is developed, you must commit it to the related environment for scheduling and running. If you use a workspace in standard mode, after you commit the node, you must click Deploy in the upper-right corner of the configuration tab of the node to deploy the node. For more information, see Deploy a node.
Perform O&M operations on the node: After the node is deployed, the node is displayed in Operation Center in the production environment by default. You can go to Operation Center in the production environment to view the running status of the node and perform O&M operations on the node. For more information, see Overview.
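As a rough illustration of the scheduling-parameter item above, the sketch below assumes a PyODPS node whose scheduling properties define a parameter named ds with the value $bizdate; at run time, DataWorks is expected to pass the resolved value to the node code through the built-in args dictionary. The table names are the hypothetical ones used in the earlier sketches.

```python
# Minimal sketch for a PyODPS node: consume a scheduling parameter.
# Assumes the node's scheduling properties define `ds=$bizdate`; DataWorks
# exposes assigned parameters to PyODPS node code through the `args` dict.
bizdate = args["ds"]  # for example, "20240101" for that day's scheduled instance

o.execute_sql(
    "INSERT OVERWRITE TABLE dwd_user_log PARTITION (ds='%s') "
    "SELECT uid, event, ts FROM ods_user_log WHERE ds = '%s'" % (bizdate, bizdate)
)
```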