The DataStudio service of DataWorks allows you to define the development and scheduling properties of auto triggered tasks. DataStudio works with Operation Center to provide a visualized development interface for tasks of various types of compute engines, such as MaxCompute, Hologres, and E-MapReduce (EMR). You can configure settings on the visualized development interface to perform intelligent code development, multi-engine task orchestration in workflows, and standardized task deployment. This way, you can build offline data warehouses, real-time data warehouses, and ad hoc analysis systems to ensure efficient and stable data production.
Go to the DataStudio page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Main features of DataStudio
The following table describes the main features provided by DataStudio. For more information, see the Appendix: Terms related to data development section in this topic.
| Feature | Description |
| --- | --- |
| Object organization and management | DataStudio provides a mechanism to organize and manage objects in DataWorks. For more information, see Create a workflow and the Task organization and management modes section in this topic. Note: Limits are imposed on the maximum numbers of workflows and objects that you can create in DataStudio in each workspace. If these numbers reach the upper limits, you can no longer create workflows or objects. |
| Task development | For information about the node types that are supported by DataWorks, see DataWorks nodes. |
| Task scheduling | For more information about task scheduling, see Configure time properties and Scheduling dependency configuration guide. |
| Task debugging | You can debug a task or a workflow. For more information, see Debugging procedure. |
| Process control | DataStudio provides a standardized task deployment mechanism and various methods to perform process control. Operations include but are not limited to code review and smoke testing. |
| Other features | |
Introduction to the DataStudio page
For instructions on how to use each module on the DataStudio page, see Features on the DataStudio page.
Task development process
DataWorks allows you to create real-time synchronization tasks, batch synchronization tasks, batch processing tasks, and manually triggered tasks for various compute engine types on the DataStudio page. For more information about data synchronization, see Overview of Data Integration. The configuration requirements vary based on the compute engine type. Take note of the precautions and instructions for the related compute engine type before you develop tasks.
Instructions on the development of tasks for different compute engine types: You can add different types of data sources to DataWorks and develop tasks for them in DataWorks. For more information, see the following topics:
Common development process: Two workspace modes are available: standard mode and basic mode. The task development process varies based on the workspace mode.
Task development process in a workspace in standard mode
Task development process in a workspace in basic mode
Basic process: For example, if you develop tasks in a workspace in standard mode, the development process includes the following stages: development, debugging, configuration of scheduling settings, task committing, task deployment, and O&M. For more information, see General development process.
Process control: During task development, you can perform operations such as code review and smoke testing in DataStudio, use check items preset in Data Governance Center, and apply verification logic customized based on extensions in Open Platform to ensure that task development meets the specified standards and requirements.
Note: The available process control operations vary based on the workspace mode. The operations that are actually displayed shall prevail.
Task organization and management modes
A workflow is a basic unit for code development and resource management. It is an abstract business entity that allows you to organize code based on your business requirements. Workflows and nodes in different workspaces are developed separately. For more information about workflows, see Create a workflow.
Workflows can be displayed in a directory tree or in a panel. These display modes enable you to organize code from a business perspective and present resource classification and business logic more clearly.
The directory tree allows you to organize your code by task type.
The panel shows the business logic in a workflow.
Appendix: Node types supported by DataStudio
The DataStudio service of DataWorks allows you to create various types of nodes. You can enable DataWorks to periodically schedule instances that are generated for nodes. You can also select a specific type of node to develop data based on your business requirements. For more information about the node types that are supported by DataWorks, see DataWorks nodes.
Appendix: Terms related to data development
Terms related to task development
| Term | Description |
| --- | --- |
| Solution | A collection of workflows that are dedicated to a specific business goal. A workflow can be added to multiple solutions. After you develop a solution and add a workflow to it, other users can reference and modify the workflow in their own solutions for collaborative development. |
| Workflow | An abstract business entity and a collection of tasks, tables, resources, and functions for a specific business requirement. Tasks in this type of workflow are triggered to run as scheduled. |
| Manually triggered workflow | A collection of tasks, tables, resources, and functions for a specific business requirement. Nodes in this type of workflow are manually triggered to run. |
| DAG | The abbreviation of directed acyclic graph. A DAG is used to display nodes and their dependencies. In DataStudio, all tasks in a workflow are displayed in the same DAG, which facilitates task development and dependency configuration. A minimal sketch of dependency-based execution appears after this table. |
| Task | A basic execution unit of DataWorks. DataWorks runs tasks in sequence based on the dependencies between the tasks. |
| Node | A task in a DAG. DataWorks runs nodes in sequence based on the dependencies between the nodes. |
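The following Python sketch illustrates the scheduling model that these terms describe: nodes expose output names, dependencies are derived by matching a node's inputs to the output names of other nodes, and the resulting DAG is executed in topological order. This is a conceptual illustration only, not the DataWorks implementation; the node names and output names are hypothetical.

```python
from collections import defaultdict, deque

# Conceptual sketch: each node declares the output names it produces and
# the output names it consumes. The scheduler derives a DAG from these
# links and runs the nodes in topological order. All names are illustrative.
nodes = {
    "ods_sync":   {"outputs": ["proj.ods_orders"], "inputs": []},
    "dwd_clean":  {"outputs": ["proj.dwd_orders"], "inputs": ["proj.ods_orders"]},
    "ads_report": {"outputs": ["proj.ads_daily"],  "inputs": ["proj.dwd_orders"]},
}

# Map each output name to the node that produces it.
producer = {out: name for name, n in nodes.items() for out in n["outputs"]}

# Build ancestor-to-descendant edges and count each node's unmet ancestors.
children = defaultdict(list)
indegree = {name: 0 for name in nodes}
for name, n in nodes.items():
    for inp in n["inputs"]:
        children[producer[inp]].append(name)
        indegree[name] += 1

# Kahn's algorithm: a node runs only after all of its ancestors finish.
ready = deque(name for name, d in indegree.items() if d == 0)
while ready:
    current = ready.popleft()
    print(f"run {current}")
    for child in children[current]:
        indegree[child] -= 1
        if indegree[child] == 0:
            ready.append(child)
```

Running the sketch prints the nodes in dependency order: ods_sync, then dwd_clean, then ads_report. DataWorks applies the same principle at scale: a descendant node starts only after all of its ancestor nodes finish running.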
Terms related to task scheduling
| Term | Description |
| --- | --- |
| Dependency | Used to define the sequence in which tasks are run. If Node B can run only after Node A finishes running, Node A is the ancestor node of Node B, and Node B depends on Node A. In a DAG, dependencies are represented by arrows between nodes. |
| Output name | The identifier that distinguishes the current node from other nodes. An output name is globally unique. A node can have multiple output names. Scheduling dependencies between nodes are configured based on output names. |
| Output table name | We recommend that you use the name of the table generated by the current task as the output table name. A properly configured output table name helps you check whether data comes from the expected ancestor table when you configure dependencies for a descendant node. We recommend that you do not manually modify an output table name that is generated by automatic parsing. The output table name serves only as an identifier. Modifying it does not affect the name of the table that is actually generated by the SQL statements; that name is subject to the SQL logic. Note: An output name must be globally unique, but no such limit is imposed on output table names. |
| Resource group for scheduling | A resource group that is used for task scheduling. For more information about resource groups, see Overview. |
| Scheduling parameter | A parameter that is configured for a node and whose value is dynamically replaced at the scheduling time of the node. If your code needs runtime information, such as the date and time, each time it runs, you can dynamically assign values to variables in the code based on the scheduling parameters defined in DataWorks. A sketch of this substitution appears after this table. |
| Data timestamp | The day before the scheduling time. In offline computing scenarios, a data timestamp represents the date on which a business transaction is conducted, accurate to the day. For example, if you collect statistics on the previous day's turnover on the current day, the previous day is the date on which the business transaction was conducted and is the data timestamp. |
| Scheduling time | The time at which you want the task to be scheduled to process business data, accurate to the second. The scheduling time can differ from the actual time at which the task runs, because the actual run time is affected by multiple factors. |
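The following Python sketch illustrates how scheduling parameters, the scheduling time, and the data timestamp relate to one another. It is a conceptual illustration, not DataWorks code: the render function and the exact placeholder syntax are assumptions made for this example. The key points are that placeholders are replaced with concrete values at the scheduling time and that the data timestamp equals the scheduling time minus one day.

```python
from datetime import datetime, timedelta

def render(code: str, scheduling_time: datetime) -> str:
    """Replace illustrative placeholders with values derived from the
    scheduling time, mirroring how scheduling parameters work conceptually."""
    # The data timestamp is the day before the scheduling time.
    data_timestamp = scheduling_time - timedelta(days=1)
    values = {
        "${bizdate}": data_timestamp.strftime("%Y%m%d"),        # data timestamp
        "${cyctime}": scheduling_time.strftime("%Y%m%d%H%M%S"), # scheduling time
    }
    for placeholder, value in values.items():
        code = code.replace(placeholder, value)
    return code

# Hypothetical task code that partitions its output by data timestamp.
sql = "INSERT OVERWRITE TABLE ads_daily PARTITION (ds='${bizdate}') ..."
print(render(sql, datetime(2024, 5, 2, 0, 30)))
# -> ... PARTITION (ds='20240501') ...
```

In this example, the run scheduled for 00:30 on May 2, 2024 writes to the partition for May 1, 2024: the current day's run processes the previous day's business data, which matches the turnover example in the table above.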