The DataStudio service of DataWorks allows you to define the development and scheduling properties of auto triggered tasks. DataStudio works with Operation Center to provide a visualized development interface for tasks of various types of compute engines, such as MaxCompute, Hologres, and E-MapReduce (EMR). You can configure settings on the visualized development interface to perform intelligent code development, multi-engine task orchestration in workflows, and standardized task deployment. This way, you can build offline data warehouses, real-time data warehouses, and ad hoc analysis systems to ensure efficient and stable data production. This topic describes the terms that are used in DataStudio, the capabilities provided by DataStudio, and preparations before data development in DataStudio.
Go to the DataStudio page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Module introduction
Capability overview
The following figure shows the main features provided by DataStudio. For more information, see the Appendix: Terms related to data development section in this topic.

Feature | Description |
Object organization and management | DataStudio provides a mechanism to organize and manage objects in DataWorks. Object organization: A two-level management mode is provided. You can organize objects in the directory tree of a workflow or on the configuration tab of a workflow. You can create the required objects in the directory tree of a workflow, or drag components on the configuration tab of the workflow to build a data processing workflow. You can use solutions to manage workflows. Object management: You can create and manage nodes, tables, resources, and functions in a visualized manner.
For more information, see Create a workflow and the Management modes section in this topic. Note: Limits apply to the numbers of workflows and objects that you can create in DataStudio in each workspace. Workflows: You can create a maximum of 10,000 workflows. Objects (nodes, files, tables, resources, and functions): You can create a maximum of 200,000 objects in DataWorks Enterprise Edition, and a maximum of 100,000 objects in DataWorks Professional Edition, Standard Edition, or Basic Edition.
If the number of workflows or objects in the current workspace reaches the upper limit, you can no longer create workflows or objects. |
Task development | Various capabilities: Provides nodes for a wide range of compute engine types to fully encapsulate compute engine capabilities, and also provides general nodes. You can combine general nodes and nodes of a specific compute engine type in DataWorks to process complex business logic. For example, you can enable external systems to trigger the scheduling of nodes in DataWorks, check whether files exist, route results based on logical conditions, execute code of specific nodes in loops, and pass output between nodes.
Simple operations: Allows you to develop data on the configuration tab of a workflow in a visualized manner. You can drag components to orchestrate tasks of different compute engine types in a hybrid manner. Provides an intelligent SQL editor with features such as code hinting, display of the code structure by using SQL operators, and permission verification.
For information about the node types that are supported by DataWorks, see DataWorks nodes. |
Task scheduling | Trigger methods: The scheduling of tasks can be triggered by external systems, events, or output of ancestor tasks. The output of ancestor tasks triggers task scheduling based on inner lineage parsing. Dependencies: You can configure same-cycle and cross-cycle dependencies. You can also configure dependencies between different types of tasks whose scheduling frequencies are different. Execution control: You can determine whether to rerun a task and manage the scheduling time of a task based on the output of its ancestor task. You can specify a validity period during which a task is automatically run as scheduled and the scheduling type of a task. For example, you can specify a task as a dry-run task or freeze a task. After you specify a task as a dry-run task, the system returns a success response for the task without running the task. The scheduling of descendant tasks of the task is not blocked. After you freeze a task, the system does not run the task, and the scheduling of descendant tasks of the task is blocked. Idempotence: DataStudio provides a rerun mechanism that you can use to configure custom rerun conditions and rerun times.
For more information about task scheduling, see Configure time properties and Scheduling dependency configuration guide. |
Task debugging | You can debug a task or a workflow. For more information, see Debugging procedure. |
Process control | DataStudio provides a standardized task deployment mechanism and various methods for process control. The operations that you can perform include but are not limited to the following ones: Review code and perform smoke testing before a task is deployed, which helps prevent faulty code from being deployed to and run in the production environment. For information about code review, see Code review. Customize process control over task committing and deployment to the production environment by using governance items provided by Data Governance Center and verification logic customized based on extensions.
|
Other features | Openness: DataWorks Open Platform provides various API operations and a large number of built-in extension points. You can subscribe to event messages related to data development on DataWorks Open Platform. Permission control: You can manage the permissions on service modules of DataWorks and the data access permissions. For more information, see Manage permissions on workspace-level services. Viewing of operation records: DataWorks is integrated with ActionTrail. This allows you to query recent DataWorks behavior events of your Alibaba Cloud account in ActionTrail. For more information, see View operation records on the DataStudio page.
|
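The dry-run, freeze, and rerun behaviors described in the Task scheduling row can be summarized in a short sketch. This is illustrative Python only: the `Task` class and function names are hypothetical and are not DataWorks APIs; the sketch only mirrors the semantics stated above (a dry-run task reports success without running and does not block descendants, a frozen task is not run and blocks descendants, and a failed task can be rerun a configured number of times).

```python
from dataclasses import dataclass

# Hypothetical scheduling modes; names are inventions for this sketch.
NORMAL = "normal"
DRY_RUN = "dry_run"
FROZEN = "frozen"


@dataclass
class Task:
    name: str
    mode: str = NORMAL      # normal, dry_run, or frozen
    max_reruns: int = 0     # custom rerun count from the rerun mechanism


def run_once(task: Task) -> str:
    """Return the terminal status of one scheduling attempt."""
    if task.mode == DRY_RUN:
        # A dry-run task is not executed but still reports success,
        # so the scheduling of its descendant tasks is not blocked.
        return "success"
    if task.mode == FROZEN:
        # A frozen task is not executed and blocks its descendants.
        return "blocked"
    return "success"  # placeholder for real task execution


def run_with_reruns(task: Task, attempt) -> str:
    """Apply the rerun mechanism: retry a failed attempt up to max_reruns times."""
    status = attempt()
    reruns = 0
    while status == "failed" and reruns < task.max_reruns:
        reruns += 1
        status = attempt()
    return status


def descendants_can_run(status: str) -> bool:
    """Descendant tasks are scheduled only when the ancestor ends in success."""
    return status == "success"
```

For example, `descendants_can_run(run_once(Task("t", mode=FROZEN)))` is `False`, which matches the documented behavior that freezing a task blocks the scheduling of its descendant tasks.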
Introduction to the DataStudio page
For instructions on how to use each module on the DataStudio page, see Features on the DataStudio page.
Development process
DataWorks DataStudio allows you to create real-time synchronization tasks, batch synchronization tasks, batch processing tasks, and manually triggered tasks for different compute engine types. For more information about data synchronization, see Overview of Data Integration. The configuration requirements for tasks of different compute engine types vary. Before you develop a task, take note of the precautions and instructions for the related compute engine type.
Instructions on the development of tasks of different compute engine types: You can add different data sources to DataWorks to develop tasks in DataWorks. The configuration requirements on tasks of different compute engine types vary. For more information, see the following topics:
Common development process: The following two workspace modes are available: standard mode and basic mode. The node development process varies based on the workspace mode.
Task development process in a workspace in standard mode
Task development process in a workspace in basic mode
Basic process: For example, you want to develop tasks in a workspace in standard mode. The development process includes the following stages: development, debugging, configuration of scheduling settings, task committing, task deployment, and O&M. For more information, see General development process.
Process control: During task development, you can perform operations such as code review and smoke testing provided by DataStudio and use check items preset in Data Governance Center and verification logic customized based on extensions in Open Platform to ensure that specified standards and requirements on task development are met.
Note
The process control operations vary based on the workspace mode. The operations that are displayed in the console shall prevail.
Management modes
A workflow is a basic unit for code development and resource management. A workflow is an abstract business entity that allows you to develop code based on your business requirements. Workflows and nodes in different workspaces are separately developed. For more information about workflows, see Create a workflow.
Workflows can be displayed in a directory tree or in a panel. These display modes enable you to organize code from a business perspective and present resource classification and business logic more clearly.

Get started with DataStudio
Environment preparation
If you want to perform data modeling or data development, or periodically schedule tasks in Operation Center in DataWorks, you must associate your data source or cluster with DataStudio. This way, you can read data in the data source or cluster and perform data development operations.
Add a data source or cluster of a specific type based on the type of tasks that you want to develop and schedule.
Data source or cluster type | Description |
MaxCompute | The first time you add a MaxCompute data source to DataWorks, DataWorks automatically associates the data source with DataStudio. You do not need to follow the instructions that are described in this topic to manually associate the data source with DataStudio. For MaxCompute data sources that are added later, you must manually associate the data sources with DataStudio. |
Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, or ClickHouse | After you add a data source of one of these types, you must follow the instructions that are described in this topic to manually associate the data source with DataStudio. |
EMR, Cloudera's Distribution Including Apache Hadoop (CDH), or Cloudera Data Platform (CDP) | After you register a cluster to DataWorks, DataWorks automatically associates the cluster with DataStudio. You do not need to follow the instructions that are described in this topic to manually associate the cluster with DataStudio. |
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the left-side navigation pane, click Computing Resource.
If the Computing Resource module is not displayed in the left-side navigation pane, go to the Personal Settings tab and select Computing Resource in the DataStudio Modules section. The module is then displayed in the left-side navigation pane of the DataStudio page. For more information, see Configure settings in the DataStudio Modules section.
Associate a data source or cluster.
On the Computing Resource page, search for the desired data source or cluster by computing resource name or computing resource type and click Associate. After you associate the data source or cluster with DataStudio, you can read data from the data source or cluster based on the connection information and perform relevant data development operations.
Note
If data source or cluster information changes, but the data on the current page is not updated in time, refresh the current page to update the cached data.

A data source or a cluster may fail to be associated with DataStudio in the following scenarios:
Data sources or clusters of specific types or with specific configurations cannot be associated with DataStudio. For example, you cannot associate a data source that is added by using an AccessKey pair with DataStudio. For more information about the limits, see the descriptions that are displayed in the DataWorks console when you attempt to associate a data source or a cluster with DataStudio.
The configurations in the development or production environment are missing.
A MaxCompute data source cannot be associated with multiple DataWorks workspaces at the same time.
Note
The reason why a data source or cluster cannot be associated with DataStudio varies based on the type of the data source or cluster. You can troubleshoot issues based on the reason that is displayed when you try to associate the data source or cluster with DataStudio.
Only the following types of data sources or clusters can be associated with DataStudio: MaxCompute, EMR, Hologres, AnalyticDB for MySQL, ClickHouse, CDH, CDP, and AnalyticDB for PostgreSQL.
The types and number of data sources or clusters that can be associated with DataStudio vary based on the DataWorks edition. For more information, see the Feature comparison section of the "Differences among DataWorks editions" topic.
Node types supported by DataStudio
The DataStudio service of DataWorks allows you to create various types of nodes. You can enable DataWorks to periodically schedule instances that are generated for nodes. You can also select a specific type of node to develop data based on your business requirements. For more information about the node types that are supported by DataWorks, see DataWorks nodes.
Appendix: Terms related to data development