
Overview of Data Studio

Last Updated: Nov 19, 2024

Data Studio is an intelligent, lakehouse-based data development platform built on Alibaba Group's 15 years of big data experience. Data Studio is compatible with multiple computing services of Alibaba Cloud and provides capabilities such as intelligent extract, transform, and load (ETL), data catalog management, and cross-engine workflow orchestration. Data Studio provides personal development environment instances to support data development in Python, notebook-based data analysis, and Git integration. Data Studio also supports a rich plug-in ecosystem that integrates real-time and offline data processing, big data and AI, and the lakehouse architecture. This facilitates data management throughout the data lifecycle in "Data+AI" mode.

Overview

Data Studio is an intelligent, lakehouse-based data development platform that distills Alibaba Group's 15 years of big data development experience. Data Studio is deeply compatible with dozens of big data and AI computing services provided by Alibaba Cloud, such as MaxCompute, E-MapReduce (EMR), Hologres, Realtime Compute for Apache Flink, and Platform for AI (PAI). Data Studio provides intelligent ETL services for data warehouses, data lakes, and the OpenLake lakehouse architecture, and supports the following features:

  • Data catalog: manages metadata in the lakehouse architecture.

  • Workflow: orchestrates real-time and offline development nodes and AI nodes of dozens of engine types.

  • Personal development environment instance: allows you to run and debug node code in Python, and supports notebook-based interactive analysis and the integration with the Git repository for code management and Apsara File Storage NAS for storage.

  • Notebook: an intelligent, interactive data development and analysis tool that can be used to perform engine-specific SQL or Python code analysis and run or debug code in real time. This way, you can obtain visualized data processing results.
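The following minimal sketch shows the kind of notebook-style interactive analysis described above. It assumes the open source PyODPS library (the odps package) for MaxCompute access; all connection values and table names are placeholders, and the built-in connections of a Data Studio notebook may differ.

```python
# Notebook-style analysis sketch. Assumes the open source PyODPS library
# (pip install pyodps) and pandas; all credentials, project names, and
# table names below are placeholders.
from odps import ODPS

# Connect to a MaxCompute project (placeholder values).
o = ODPS(
    access_id="<your-access-key-id>",
    secret_access_key="<your-access-key-secret>",
    project="my_project",
    endpoint="https://service.cn-hangzhou.maxcompute.aliyun.com/api",
)

# Run engine-specific SQL and pull the result into pandas for
# interactive, visualized exploration.
sql = "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category"
with o.execute_sql(sql).open_reader() as reader:
    df = reader.to_pandas()

print(df.sort_values("cnt", ascending=False).head(10))
```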

Public preview description

  • You can turn on Participate in Public Preview of DataStudio of New Version only when you create a workspace.

    Note
    • Existing workspaces cannot participate in the public preview of Data Studio.

    • Data in the new-version Data Studio service is independent of and does not communicate with the data in the old-version DataStudio service. The feature of migrating existing nodes from the old-version DataStudio service to the new-version Data Studio service is being developed.

  • Data Studio is available for public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), and China (Shenzhen).

Go to the Data Studio page

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio (New Version) in the Actions column.

Note

This entry point is available only to workspaces for which Data Studio in public preview is activated.

Main features of Data Studio

This section describes the main features provided by Data Studio. For information about the related terms, see the Appendix: Terms related to data development section in this topic.



Workflow management

DataWorks Data Studio provides a workflow-based development mode, a new R&D method that allows you to manage complex tasks with ease in a directed acyclic graph (DAG) from the business perspective.

Note

The following limits apply to the numbers of workflows and objects that you can create in new-version Data Studio and old-version DataStudio in each workspace:

  • Workflow: You can create a maximum of 10,000 workflows.

  • Object (node, file, table, resource, or function): For DataWorks Enterprise Edition, you can create a maximum of 200,000 objects. For DataWorks Professional Edition, DataWorks Standard Edition, or DataWorks Basic Edition, you can create a maximum of 100,000 objects.

If the number of workflows or objects in the current workspace reaches the upper limit, you can no longer create a workflow or an object.

Task development

  • Various capabilities:

    • Provides nodes of a wide range of compute engine types to fully encapsulate compute engine capabilities.

    • Provides general nodes. You can combine general nodes and nodes of a specific compute engine type in DataWorks to process complex business logic. For example, you can enable external systems to trigger the scheduling of nodes in DataWorks, check whether files exist, route results based on logical conditions, execute the code of specific nodes in loops, and pass output between nodes. For a conceptual illustration, see the sketch after this list.

    • Supports the development of stream computing tasks based on Realtime Compute for Apache Flink and also supports collaborative task development between Realtime Compute for Apache Flink and other compute engines such as MaxCompute and Hologres.

  • Simple operations:

    • Allows you to develop data on the configuration tab of a workflow in a visualized manner. You can drag components to perform hybrid orchestration of tasks of different compute engine types.

    • Provides an intelligent SQL editor. The SQL editor provides features such as code hinting, display of the code structure by using SQL operators, and permission verification.
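The following conceptual sketch expresses in plain Python the kind of logic that general nodes represent declaratively in DataWorks, such as branch routing, do-while loops, and output passing. It is an illustration of the control flow only, not the DataWorks API, and all names in it are hypothetical.

```python
# Conceptual sketch only: plain Python that mimics what general nodes
# (branch, do-while, parameter passing) express declaratively in DataWorks.
import os

def check_file_exists(path: str) -> bool:
    """Stands in for a node that checks whether an upstream file exists."""
    return os.path.exists(path)

def process_partition(day: int) -> str:
    """Stands in for an inner node whose code is executed in each loop round."""
    return f"processed partition {day}"

# Branch-style routing: route downstream execution based on a logical condition.
if check_file_exists("/data/upstream_ready.flag"):
    # Do-while-style loop: run inner node code in loops and collect output
    # to pass to the next node.
    outputs = []
    day = 1
    while day <= 3:
        outputs.append(process_partition(day))
        day += 1
    print(outputs)
else:
    print("Upstream file is missing; the fallback branch is taken.")
```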

Task scheduling

  • Trigger methods: Task scheduling can be triggered by external systems, by events, or by the output of ancestor tasks. The output of ancestor tasks triggers task scheduling based on automatic lineage parsing.

  • Dependencies: You can configure same-cycle and cross-cycle dependencies. You can also configure dependencies between different types of tasks whose scheduling frequencies are different.

  • Execution control: You can determine whether to rerun a task and manage the scheduling time of a task based on the output of its ancestor task. You can specify a validity period during which a task is automatically run as scheduled and the scheduling type of a task. For example, you can specify a task as a dry-run task or freeze a task. After you specify a task as a dry-run task, the system returns a success response for the task without running the task. The scheduling of descendant tasks of the task is not blocked. After you freeze a task, the system does not run the task, and the scheduling of descendant tasks of the task is blocked.

  • Idempotence: Data Studio provides a rerun mechanism that you can use to configure custom rerun conditions and rerun times.
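As a conceptual illustration of the execution-control and rerun semantics described above, the following Python sketch mimics dry-run, freeze, and custom rerun behavior. In DataWorks these behaviors are configured in the scheduling settings of a task rather than coded by hand, and every name here is hypothetical.

```python
# Conceptual sketch of dry-run, freeze, and rerun semantics; in DataWorks
# these are scheduling settings, not hand-written code.
import time

def run_with_rerun(task, max_reruns=3, should_rerun=lambda exc: True,
                   interval_seconds=60):
    """Rerun a failed task up to max_reruns times when the custom
    rerun condition matches the failure."""
    for attempt in range(1 + max_reruns):
        try:
            return task()
        except Exception as exc:
            if attempt == max_reruns or not should_rerun(exc):
                raise
            time.sleep(interval_seconds)  # wait before the next rerun

def run_scheduled(task, mode="normal"):
    if mode == "dry-run":
        # Return success without running the task; the scheduling of
        # descendant tasks is not blocked.
        return "success"
    if mode == "frozen":
        # Do not run the task; the scheduling of descendant tasks is blocked.
        raise RuntimeError("Task is frozen: descendant scheduling is blocked.")
    return run_with_rerun(task)
```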

Quality management

Data Studio provides a standard task deployment mechanism and various methods for quality management. The operations that you can perform include but are not limited to the following:

  • Review code before a task is deployed. This helps prevent faulty code from being executed in the production environment.

  • Configure custom process control on task committing and deployment to the production environment, in combination with governance items provided by Data Governance Center and verification logic customized based on extensions.

  • Associate a monitoring rule with a scheduling node. After the node is run, the monitoring rule is triggered to check the data generated by the node, and data anomalies are reported at the earliest opportunity.
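The following sketch illustrates the type of check that a monitoring rule can perform on the data generated by a node after the node is run. The rule engine itself is provided by DataWorks; the check logic, thresholds, and input values here are hypothetical.

```python
# Illustrative post-run data quality check; thresholds and inputs are
# hypothetical, and a real monitoring rule is configured in DataWorks.
def check_output_quality(row_count: int, null_key_count: int,
                         min_rows: int = 1) -> list:
    """Return a list of anomaly messages; an empty list means the check passed."""
    anomalies = []
    if row_count < min_rows:
        anomalies.append(f"table is empty or too small: {row_count} rows")
    if null_key_count > 0:
        anomalies.append(f"{null_key_count} rows have a NULL primary key")
    return anomalies

# Example values that would normally be collected from the generated table.
problems = check_output_quality(row_count=0, null_key_count=5)
if problems:
    # A real monitoring rule would report the anomalies or block descendants.
    print("data quality anomalies:", problems)
```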

Other features

  • Openness: DataWorks Open Platform provides various API operations and a large number of built-in extension points. You can subscribe to event messages related to data development on DataWorks Open Platform.

  • Permission control: You can manage the permissions on service modules of DataWorks and the data access permissions. For more information, see Manage permissions on workspace-level services.

Task development process

DataWorks Data Studio allows you to create real-time synchronization tasks, batch synchronization tasks, batch processing tasks, and manually triggered tasks for different compute engine types. For more information about data synchronization, see Data Integration overview.

The following two workspace modes are available: standard mode and basic mode. The task development process varies based on the workspace mode.

Figure: Task development process in a workspace in standard mode.

Figure: Task development process in a workspace in basic mode.

  • Basic process: Take a workspace in standard mode as an example. The development process consists of the following stages: development, debugging, configuration of scheduling settings, deployment, and O&M.

  • Process control: During task development, you can use the code review capability of Data Studio, check items preset in Data Governance Center, and verification logic customized based on extensions in Open Platform to ensure that task development meets the specified standards and requirements.

Data development methods

Data Studio allows you to specify a custom development process. You can use the workflow feature to quickly build a data processing process. You can also manually create different types of nodes and configure scheduling dependencies for the nodes.
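As an illustration of how scheduling dependencies determine run order in a DAG, the following sketch performs a topological sort over a hypothetical set of nodes by using the graphlib module in the Python standard library. DataWorks resolves this order for you; the sketch only shows the underlying concept.

```python
# Topological ordering over node -> ancestor dependencies (Python 3.9+).
# The node names are hypothetical.
from graphlib import TopologicalSorter

# Each node lists the ancestor nodes that it depends on.
dependencies = {
    "ods_sync": [],                 # data synchronization node
    "dwd_clean": ["ods_sync"],      # MaxCompute SQL node
    "ads_report": ["dwd_clean"],    # Hologres SQL node
    "notify": ["ads_report"],       # general node
}

# A node becomes runnable only after all of its ancestors finish.
print(list(TopologicalSorter(dependencies).static_order()))
# ['ods_sync', 'dwd_clean', 'ads_report', 'notify']
```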

Node types supported by Data Studio

The Data Studio service of DataWorks allows you to create various types of nodes, such as data synchronization, MaxCompute, Hologres, EMR, Realtime Compute for Apache Flink, Python, notebook, and AnalyticDB. You can enable DataWorks to periodically schedule instances that are generated for nodes. You can also select a specific type of node to develop data based on your business requirements. For more information about the node types that are supported by DataWorks, see DataWorks nodes.

Appendix: Terms related to data development

Terms related to task development


Auto triggered workflow

A new R&D method that allows you to manage complex tasks with ease in a DAG from the business perspective. You can create various types of nodes in a workflow, such as data synchronization, MaxCompute, Hologres, EMR, Realtime Compute for Apache Flink, Python, notebook, and AnalyticDB nodes. You can configure scheduling settings at the workflow level.

Manually triggered workflow

A collection of tasks, tables, resources, and functions that serve a specific business requirement. Nodes in this type of workflow are manually triggered to run, whereas nodes in an auto triggered workflow are triggered to run as scheduled.

Node

A basic execution unit of DataWorks. Data Studio allows you to create various types of nodes, such as Data Integration nodes used for data synchronization, compute engine nodes used for data cleansing, and general nodes used together with compute engine nodes to process complex logic. Compute engine nodes include MaxCompute SQL nodes, Hologres SQL nodes, and EMR Hive nodes. General nodes include zero load nodes that can be used to manage multiple other nodes and do-while nodes that can run node code in loops. You can combine multiple types of nodes in your business to meet your different data processing requirements.

Terms related to task scheduling


Dependency

Used to define the sequence in which tasks are run. If Node B can run only after Node A finishes running, Node A is the ancestor node of Node B, and Node B depends on Node A. In a DAG, dependencies are represented by arrows between nodes.

Output name

The identifier of the output of a task. When you configure dependencies between tasks within an Alibaba Cloud account, the output name of a task is used to connect the task to its descendant tasks.

When you configure dependencies for a task, you must use the output name of the task instead of the node name or ID. After you configure the dependencies, the output name of the task serves as the input name of its descendant tasks.

Output table name

We recommend that you use the name of the table generated by the current task as the output table name. Proper configuration of an output table name can help check whether data is from an expected ancestor table when you configure dependencies for a descendant node. We recommend that you do not manually modify an output table name that is generated based on automatic parsing. The output table name serves only as an identifier. Modifying an output table name does not affect the name of the table that is actually generated by executing SQL statements. The name of an actually generated table is subject to the SQL logic.

Note

An output name must be globally unique. However, no such limit is imposed on an output table name.

Resource group for scheduling

A resource group used for task scheduling. For more information about resource groups, see Overview.

Scheduling parameter

A parameter that is configured for a node and whose value is dynamically replaced at the scheduling time of the node. If you want to obtain information about the runtime environment, such as the date and time, each time the code is run, you can dynamically assign values to variables in the code based on the definitions of scheduling parameters in DataWorks.

Data timestamp

The date that is directly related to a business activity and reflects the time when a business transaction is conducted. This term is especially important in offline computing scenarios. For example, if you want to collect statistics on the turnover generated on October 10, 2024 in your retail business, the calculation starts in the early morning of October 11, 2024. In this case, October 10, 2024 is the date on which the business transactions are conducted and is the data timestamp.

Scheduling time

The point in time at which auto triggered tasks are scheduled to run. The scheduling time can be accurate to the minute.

Important

Task running is affected by various factors. A task may not start to run when the scheduling time of the task arrives. Before a task is run, DataWorks checks whether ancestor tasks of the task are successfully run, whether the scheduling time of the task arrives, and whether scheduling resources are sufficient. The task can be triggered to run only if all the preceding conditions are met.
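As a worked example of the relationship between the scheduling time and the data timestamp described above, the following sketch uses plain datetime arithmetic. In DataWorks, built-in scheduling parameters resolve to such values at the scheduling time of a node; the variable and table names here are illustrative only.

```python
# The data timestamp of a task that runs in the early morning of
# October 11, 2024 is the previous day, October 10, 2024.
from datetime import datetime, timedelta

scheduling_time = datetime(2024, 10, 11, 0, 30)
data_timestamp = (scheduling_time - timedelta(days=1)).strftime("%Y%m%d")
print(data_timestamp)  # 20241010

# A scheduling parameter placeholder in node code is replaced with this
# value at scheduling time, for example in a SQL statement:
sql = f"SELECT * FROM sales WHERE ds = '{data_timestamp}';"
print(sql)
```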