Data Studio is an intelligent, lakehouse-based data development platform built by Alibaba Group on 15 years of big data experience. Data Studio is compatible with multiple Alibaba Cloud computing services and provides intelligent extract, transform, and load (ETL), data catalog management, and cross-engine workflow orchestration capabilities. Data Studio provides personal development environment instances that support data development in Python, notebook-based data analysis, and Git integration. Data Studio also supports various plug-in ecosystems to integrate real-time and offline data processing, unify big data and AI, and implement the lakehouse architecture. This facilitates data management throughout the data lifecycle in "Data+AI" mode.
Overview
Data Studio is an intelligent, lakehouse-based data development platform that draws on Alibaba Group's 15 years of big data development experience. Data Studio is deeply compatible with dozens of big data and AI computing services provided by Alibaba Cloud, such as MaxCompute, E-MapReduce (EMR), Hologres, Realtime Compute for Apache Flink, and Platform for AI (PAI). Data Studio provides intelligent ETL services for data warehouses, data lakes, and the OpenLake lakehouse architecture, and supports the following features:
Data catalog: manages metadata in the lakehouse architecture.
Workflow: orchestrates real-time and offline development nodes and AI nodes across dozens of engine types.
Personal development environment instance: lets you run and debug node code in Python, supports notebook-based interactive analysis, and integrates with Git repositories for code management and with Apsara File Storage NAS for storage.
Notebook: an intelligent, interactive tool for data development and analysis. You can write engine-specific SQL or Python code, run or debug it in real time, and view visualized data processing results. For an illustration of this kind of interactive analysis, see the sketch that follows this list.
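For illustration, the following minimal Python sketch shows the kind of interactive analysis a notebook cell can perform. The DataFrame and its column names are hypothetical stand-ins for the result set of an engine-specific SQL query; this is not a Data Studio API.

```python
# Illustrative only: the kind of cell you might run in a Data Studio notebook.
# The DataFrame stands in for the result set of an upstream SQL query; the
# "region" and "order_amount" column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "north", "south"],
    "order_amount": [120.0, 80.5, 200.0, 95.25],
})

# Interactive aggregation: total and average order amount per region.
summary = df.groupby("region")["order_amount"].agg(["sum", "mean"])
print(summary)
```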
Public preview description
You can turn on Participate in Public Preview of DataStudio of New Version only when you create a workspace.
Note: Existing workspaces cannot participate in the public preview of Data Studio.
Data in the new-version Data Studio service is independent of and does not communicate with data in the old-version DataStudio service. A feature for migrating existing nodes from the old-version DataStudio service to the new-version Data Studio service is under development.
Data Studio is available for public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), and China (Shenzhen).
Go to the Data Studio page
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose the Data Studio entry in the Actions column.
This entry point is available only for workspaces that have enabled the public preview of Data Studio.
Main features of Data Studio
This section describes the main features provided by Data Studio. For information about the related terms, see the Appendix: Terms related to data development section in this topic.
| Feature | Description |
| --- | --- |
| Workflow management | DataWorks Data Studio provides a workflow-based development mode, a new R&D method that allows you to manage complex tasks with ease in a directed acyclic graph (DAG) from the business perspective. Note: Limits are imposed on the maximum numbers of workflows and objects that you can create in new-version Data Studio and old-version DataStudio in each workspace. If the number of workflows or objects in the current workspace reaches the upper limit, you can no longer create a workflow or an object. |
| Task development | |
| Task scheduling | |
| Quality management | Data Studio provides a standard task deployment mechanism and various methods to perform quality management. |
| Other features | |
Task development process
DataWorks Data Studio allows you to create real-time synchronization tasks, batch synchronization tasks, batch processing tasks, and manually triggered tasks for different compute engine types. For more information about data synchronization, see Data Integration overview.
The following two workspace modes are available: standard mode and basic mode. The task development process varies based on the workspace mode.
Task development process in a workspace in standard mode
Task development process in a workspace in basic mode
Basic process: Take a workspace in standard mode as an example. The development process includes the following stages: development, debugging, scheduling configuration, deployment, and O&M.
Process control: During task development, you can use the code review capability provided by Data Studio, the check items preset in Data Governance Center, and verification logic customized based on extensions in Open Platform to ensure that task development meets the specified standards and requirements.
Data development methods
Data Studio allows you to specify a custom development process. You can use the workflow feature to quickly build a data processing pipeline, or manually create different types of nodes and configure scheduling dependencies between them.
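To make the dependency model concrete, the following minimal Python sketch shows how a DAG of scheduling dependencies determines a valid run order. The node names are hypothetical, and the standard-library TopologicalSorter merely stands in for the scheduling that Data Studio performs for you.

```python
# A toy sketch (not a Data Studio API) of how scheduling dependencies in a DAG
# determine run order. The node names are hypothetical.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each node maps to the set of ancestor nodes it depends on.
dependencies = {
    "sync_orders": set(),               # a data synchronization node
    "clean_orders": {"sync_orders"},    # runs only after sync_orders succeeds
    "daily_report": {"clean_orders"},   # runs only after clean_orders succeeds
}

# One valid execution order that respects every dependency.
print(list(TopologicalSorter(dependencies).static_order()))
# ['sync_orders', 'clean_orders', 'daily_report']
```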
Node types supported by Data Studio
The Data Studio service of DataWorks allows you to create various types of nodes, such as data synchronization, MaxCompute, Hologres, EMR, Realtime Compute for Apache Flink, Python, notebook, and AnalyticDB nodes. DataWorks can periodically schedule the instances that are generated for these nodes. You can also select specific node types to develop data based on your business requirements. For more information about the node types that are supported by DataWorks, see DataWorks nodes.
Appendix: Terms related to data development
Terms related to task development
| Term | Description |
| --- | --- |
| Auto triggered workflow | A new R&D method that lets you manage complex tasks with ease in a DAG from the business perspective. You can create various types of nodes in a workflow, such as data synchronization, MaxCompute, Hologres, EMR, Realtime Compute for Apache Flink, Python, notebook, and AnalyticDB nodes. You can configure scheduling settings at the workflow level. |
| Manually triggered workflow | A collection of tasks, tables, resources, and functions for a specific business requirement. Nodes in this type of workflow are manually triggered to run. In contrast, nodes in an auto triggered workflow are triggered to run as scheduled. |
| Node | A basic execution unit of DataWorks. Data Studio allows you to create various types of nodes, such as Data Integration nodes used for data synchronization, compute engine nodes used for data cleansing, and general nodes used together with compute engine nodes to process complex logic. Compute engine nodes include MaxCompute SQL nodes, Hologres SQL nodes, and EMR Hive nodes. General nodes include zero load nodes, which can be used to manage multiple other nodes, and do-while nodes, which can run node code in loops. You can combine multiple types of nodes in your business to meet different data processing requirements. |
Terms related to task scheduling
| Term | Description |
| --- | --- |
| Dependency | Defines the sequence in which tasks run. If Node B can run only after Node A finishes running, Node A is the ancestor node of Node B, and Node B depends on Node A. In a DAG, dependencies are represented by arrows between nodes. |
| Output name | The output name of each task. When you configure dependencies between tasks within an Alibaba Cloud account, the output name of a task is used to connect the task to its descendant tasks. When you configure dependencies for a task, you must use the output name of the task instead of the node name or node ID. After you configure the dependencies, the output name of the task serves as the input name of its descendant tasks. |
| Output table name | We recommend that you use the name of the table generated by the current task as the output table name. A properly configured output table name helps you check whether data comes from the expected ancestor table when you configure dependencies for a descendant node. We recommend that you do not manually modify an output table name that is generated by automatic parsing. The output table name serves only as an identifier: modifying it does not affect the name of the table that is actually generated when the SQL statements are executed, which is determined by the SQL logic. Note: An output name must be globally unique, but no such limit is imposed on an output table name. |
| Resource group for scheduling | A resource group used for task scheduling. For more information about resource groups, see Overview. |
| Scheduling parameter | A parameter configured for a node that is scheduled to run. The values of scheduling parameters are dynamically replaced at the scheduling time of the node. If your code needs runtime information, such as the date and time, each time it runs, you can dynamically assign values to variables in the code based on the scheduling parameters defined in DataWorks. For a conceptual illustration, see the sketch after this table. |
| Data timestamp | The date that is directly related to a business activity and reflects the actual time when a business transaction is conducted. This term is especially important in offline computing scenarios. For example, if you want to collect statistics on the turnover generated on October 10, 2024 in retail business, the calculation starts in the early morning of October 11, 2024. October 10, 2024 is the date on which the business transactions are conducted and therefore is the data timestamp. |
| Scheduling time | The point in time at which an auto triggered task is scheduled to run. The scheduling time can be accurate to the minute. Important: Task running is affected by various factors, so a task may not start to run the moment its scheduling time arrives. Before a task runs, DataWorks checks whether the ancestor tasks of the task have run successfully, whether the scheduling time has arrived, and whether scheduling resources are sufficient. The task is triggered to run only if all the preceding conditions are met. |
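To make the scheduling parameter and data timestamp terms concrete, the following minimal Python sketch mimics how a placeholder such as ${bizdate} could be replaced at scheduling time. The substitution logic, the T-1 rule, and the table and partition-column names are illustrative assumptions, not DataWorks internals.

```python
# A conceptual sketch (not a DataWorks API) of how a scheduling parameter such
# as ${bizdate} is replaced at the scheduling time of a node. The T-1 rule
# below mirrors the data timestamp example above: a task scheduled in the
# early morning of October 11, 2024 processes the data of October 10, 2024.
# The "retail_orders" table and "ds" partition column are hypothetical.
from datetime import date, timedelta

def render(code: str, scheduling_date: date) -> str:
    """Replace ${bizdate} with the data timestamp: scheduling date minus one day."""
    bizdate = (scheduling_date - timedelta(days=1)).strftime("%Y%m%d")
    return code.replace("${bizdate}", bizdate)

sql = "SELECT SUM(turnover) FROM retail_orders WHERE ds = '${bizdate}';"
print(render(sql, date(2024, 10, 11)))
# SELECT SUM(turnover) FROM retail_orders WHERE ds = '20241010';
```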