
DataWorks:Usage description of DataWorks modules

Last Updated: Jan 09, 2025

This topic describes the features and basic use scenarios of DataWorks modules.

Data processing procedure and main modules

Data processing procedure

DataWorks is an end-to-end data development and governance platform. The data processing procedure includes the phases that are shown in the following figure.

(Figure: phases of the DataWorks data processing procedure)

DataWorks modules

The modules are grouped by feature directory and described in the following sections.

Data integration

Data Integration

Data Integration provides comprehensive data synchronization solutions and supports batch synchronization, real-time synchronization, and full or incremental data synchronization. Data Integration provides the following benefits:

  • Flexible scheduling: Data Integration allows you to configure a scheduling cycle for batch synchronization tasks.

  • High compatibility: Data Integration supports more than 50 types of heterogeneous data sources, including relational databases, data warehouses, NoSQL databases, file storage systems, and message queues.

  • Network connectivity: Data Integration supports data synchronization between heterogeneous data sources that are deployed over the Internet, data centers, and virtual private clouds (VPCs) in complex network environments.

  • Security monitoring: Security control and O&M monitoring are integrated to ensure the security and reliability of data synchronization.
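
A batch synchronization task is essentially a reader/writer pipeline between a source and a sink. The following Python snippet sketches the shape of such a task configuration; the step types, data source names, fields, and the `${bizdate}` scheduling parameter are illustrative assumptions, not a verbatim DataWorks configuration.

```python
import json

# A minimal sketch of a batch synchronization job in script mode.
# The reader/writer step types and connection fields below are
# illustrative assumptions, not an exact DataWorks configuration.
job = {
    "type": "job",
    "steps": [
        {
            "stepType": "mysql",          # hypothetical source: a MySQL reader
            "category": "reader",
            "parameter": {
                "datasource": "my_mysql_source",
                "table": "orders",
                "column": ["order_id", "amount", "gmt_create"],
            },
        },
        {
            "stepType": "odps",           # hypothetical sink: a MaxCompute writer
            "category": "writer",
            "parameter": {
                "datasource": "my_maxcompute_sink",
                "table": "ods_orders",
                "partition": "ds=${bizdate}",  # scheduling parameter per cycle
                "column": ["order_id", "amount", "gmt_create"],
            },
        },
    ],
    "setting": {"speed": {"concurrent": 2}},  # parallelism for the sync
}

print(json.dumps(job, indent=2))
```

The flexible scheduling benefit above maps to the partition expression: each scheduling cycle substitutes a concrete business date, so the same task configuration synchronizes a new partition on every run.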

Data modeling and development

Data Modeling

Data Modeling consists of the following sub-modules: Data Warehouse Planning, Data Standard, Dimensional Modeling, and Data Metric.

  • Data Warehouse Planning: allows you to plan data layers, data domains, and data marts, and configure model design workspaces. Different business units can share the same data standards and data models.

  • Data Standard: allows you to define data standards, lookup tables, measurement units, and naming dictionaries. This sub-module also allows the system to generate quality rules based on the defined standards, which simplifies subsequent quality checks.

  • Dimensional Modeling: supports reverse modeling, which helps resolve the cold-start issue when you model based on existing data warehouses. This sub-module also supports visualized dimensional modeling based on data warehouses, allows you to import data by using Excel files, and lets you quickly build data models by using FML statements, a domain-specific language (DSL) similar to SQL. You can seamlessly integrate this sub-module with DataStudio to enable the system to generate extract, transform, and load (ETL) code.

  • Data Metric: allows you to create atomic metrics and derived metrics. You can create a single derived metric or multiple derived metrics at a time based on the same atomic metric and different periods and modifiers. This sub-module is seamlessly integrated with Dimensional Modeling.
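
Creating multiple derived metrics at a time amounts to crossing one atomic metric with a set of periods and modifiers. The following Python snippet sketches that combination; the naming pattern and the specific periods and modifiers are assumptions for illustration, not DataWorks' exact scheme.

```python
from itertools import product

# Illustrative sketch: deriving multiple metrics from one atomic metric
# by crossing it with statistical periods and business modifiers.
# The naming pattern below is an assumption, not DataWorks' exact scheme.
atomic_metric = "payment_amount"
periods = ["1d", "7d", "30d"]
modifiers = ["all_channels", "app_only"]

derived_metrics = [
    f"{atomic_metric}_{modifier}_{period}"
    for period, modifier in product(periods, modifiers)
]

print(len(derived_metrics))   # 3 periods x 2 modifiers = 6 derived metrics
print(derived_metrics[0])
```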

DataStudio

DataStudio supports various compute engines. DataStudio provides an intelligent code editor, visualization tools, an independent development environment, and reliable management features to ensure efficient task management and standardized data development processes.

  • Multi-engine support: DataStudio supports various compute engines, including MaxCompute, E-MapReduce (EMR), Cloudera's Distribution including Apache Hadoop (CDH), Hologres, AnalyticDB, and ClickHouse. You can create, test, deploy, and perform O&M operations on tasks of the preceding compute engine types in DataWorks.

  • Intelligent development tools: DataStudio provides an intelligent code editor and scheduling capabilities, and allows you to configure scheduling dependencies in a visualized manner. The scheduling capabilities have been proven by the complex tasks and business dependencies of Alibaba Group, which ensures efficient and reliable task management.

  • Environment isolation and process standardization: DataStudio isolates the development environment from the production environment and provides features such as version management, code review, smoke testing, and deployment management. In addition, DataStudio works together with ActionTrail. This way, enterprises can develop data in a standard manner, and project quality and security are ensured.

Operation Center

Operation Center allows you to perform the following O&M operations on auto triggered tasks, manually triggered tasks, and real-time tasks that are deployed in DataStudio:

  • Task management: Operation Center monitors the task status to help you identify and troubleshoot issues at the earliest opportunity.

  • Viewing of key metrics: Operation Center provides key metrics for task O&M and task lists for each compute engine type to help you gain a deeper understanding of task performance.

Data Map

Data Map works based on table searches and provides features such as table usage instructions, data categories, data lineage, and field-level lineage to help users and owners of data tables manage data in an efficient manner and facilitate collaborative development.
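
Field-level lineage can be thought of as a directed graph in which each field points to the upstream fields it is derived from. The following Python snippet sketches a traversal of such a graph; the table and field names are hypothetical.

```python
# Illustrative sketch of field-level lineage as a directed graph:
# each field maps to the upstream fields it is derived from.
# Table and field names are hypothetical.
lineage = {
    "dws.daily_sales.amount": ["dwd.orders.amount"],
    "dwd.orders.amount": ["ods.raw_orders.amount"],
    "ods.raw_orders.amount": [],
}

def upstream(field, graph):
    """Collect every field that transitively feeds the given field."""
    found = []
    for parent in graph.get(field, []):
        found.append(parent)
        found.extend(upstream(parent, graph))
    return found

print(upstream("dws.daily_sales.amount", lineage))
```

Walking the graph in the other direction answers the impact-analysis question: which downstream fields are affected if a source field changes.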

Data analysis

SQL Query

SQL Query helps you perform SQL-based analysis online, gain insights into business requirements, and modify and share data. SQL Query allows you to save query results as chart cards and quickly generate visualized data reports based on the chart cards for daily reporting.

Data Insight

Data Insight supports data exploration and visualization. You can use Data Insight to understand data distribution, create data cards, and combine data cards into a data report. In addition, data insight results can be shared as long images.

Data governance

Data Quality

Data Quality can check the data quality of common big data storage systems, such as MaxCompute, EMR, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and CDH. Data Quality allows you to configure monitoring rules that focus on multiple dimensions of data, such as integrity, accuracy, validity, consistency, uniqueness, and timeliness. You can configure a monitoring rule for a specific table and associate the monitoring rule with a scheduling node that generates the table data. After a task on the node is run, a check is automatically triggered. This facilitates reporting of data anomalies and allows you to handle data anomalies at the earliest opportunity. You can also configure a monitoring rule as a strong rule or a weak rule to determine whether to terminate the associated node when Data Quality detects anomalies. This way, you can prevent dirty data from spreading downstream and minimize the waste of time and money on data restoration.
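
The strong-rule and weak-rule semantics described above can be sketched in a few lines: a check evaluates a quality dimension against a threshold, and only a failed strong rule blocks the downstream node. The rule thresholds, data, and field names in the following Python snippet are hypothetical.

```python
# Illustrative sketch of what table-level monitoring rules evaluate.
# The strong/weak semantics mirror the description above; the sample
# data, field names, and thresholds are hypothetical.
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},   # integrity violation: missing value
    {"order_id": 2, "amount": 5.0},    # uniqueness violation: duplicate key
]

def null_rate(rows, field):
    return sum(r[field] is None for r in rows) / len(rows)

def has_duplicates(rows, key):
    keys = [r[key] for r in rows]
    return len(keys) != len(set(keys))

# Strong rule: terminate the associated node when violated.
# Weak rule: report the anomaly but let the node continue.
checks = [
    ("amount null rate == 0", null_rate(rows, "amount") == 0, "strong"),
    ("order_id is unique",    not has_duplicates(rows, "order_id"), "weak"),
]

block_downstream = any(strength == "strong" and not passed
                       for _, passed, strength in checks)
print(block_downstream)  # True: the strong rule failed, so downstream is blocked
```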

Data Asset Governance

Data Asset Governance can detect issues that need to be handled in the data storage, task computing, code development, data quality, and security dimensions based on governance plans. Data Asset Governance provides health scores to assess the effectiveness of data governance and visualizes the governance results by providing governance reports and leaderboards of governance issues from the global, workspace, and individual dimensions. This helps you achieve governance objectives in an efficient manner. Data Asset Governance also provides features such as business asset management, asset analysis, resource consumption details of tasks, and cost estimation to help you better understand the usage details of various resources and optimize resource configurations.

Data service

DataService Studio

DataService Studio provides a service bus to help enterprises create and manage private and public APIs in a centralized manner. DataService Studio also solves the last-mile issue among data warehouses, databases, and data applications, and facilitates data forwarding and sharing.

  • Dual-mode data API generation: DataService Studio allows you to create APIs based on tables in various data sources without the need to write code. You can also create APIs by specifying custom SQL statements. DataService Studio allows you to use functions to process the request parameters and returned results of APIs.

  • Serverless architecture: DataService Studio is built on a serverless architecture. You can publish APIs to API Gateway with a few simple operations, without needing to manage infrastructure such as the runtime environment.

Others

Security Center

Security Center provides the following core features:

  • Data permission management: Security Center supports fine-grained permission requesting, request processing, and permission auditing. This allows you to manage permissions based on the principle of least privilege. You can easily track the progress of request processing to ensure that related requests are processed at the earliest opportunity.

  • Data security management: Security Center uses features, such as data classification, sensitive data identification, data access auditing, and data source tracking, to help you identify data with security risks and handle the risks at the earliest opportunity. This ensures the security and reliability of data.

  • Security diagnostics and best practices: Security Center provides the platform security diagnostics and data usage diagnostics features to help you identify and resolve various security issues based on security specifications. The features ensure that your business operates more effectively in an optimal security environment.

Data Security Guard

In Data Security Guard, you can configure sensitive data identification rules, identify sensitive data based on the rules, view identification results, and process sensitive data. You can identify and manage sensitive data before, during, and after it is generated to ensure data security.
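
A sensitive data identification rule can be sketched as a named pattern applied to field values. The following Python snippet illustrates the idea; the rule names and the simplified regular expressions are assumptions, not Data Security Guard's built-in rules.

```python
import re

# Sketch of rule-based sensitive data identification: each rule is a
# named regular expression applied to a field value. The patterns below
# are simplified assumptions, not Data Security Guard's built-in rules.
RULES = {
    "email":        re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone_number": re.compile(r"^\d{11}$"),
}

def classify(value):
    """Return the names of all rules the value matches."""
    return [name for name, pattern in RULES.items() if pattern.match(value)]

print(classify("user@example.com"))  # ['email']
print(classify("13800000000"))       # ['phone_number']
```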

Migration Assistant

Migration Assistant allows you to export data objects in your workspace, including auto triggered tasks, manually triggered tasks, resources, functions, data sources, table metadata, ad hoc queries, and script templates. You can also create full export tasks, incremental export tasks, or custom export tasks to export your data objects in DataWorks based on your business requirements.