Module | Description | References |
--- | --- | --- |
Data Modeling | Data Modeling is the first step of end-to-end data governance. It applies the modeling methodology of the Alibaba data mid-end and interprets enterprise business data from a business perspective by using the data warehouse planning, data standard, dimensional modeling, and data metric modules. This allows personnel across the enterprise to quickly understand and share a consistent, specification-compliant way of measuring and interpreting business data. | Data Modeling overview |
DataStudio | DataWorks encapsulates the capabilities of CDP and CDH compute engines. This way, you can use a CDP or CDH compute engine to run data synchronization and development tasks.<br>Data synchronization: DataStudio supports only specific batch and real-time synchronization scenarios. For more information, see Data Integration overview.<br>Data development: You can develop different types of tasks in DataWorks and let the system schedule them periodically, without the need to use complex command lines. You can use general nodes and engine-specific nodes to process complex logic. DataWorks supports the following types of general nodes (see the control-flow sketch after this table):<br>• Zero load nodes, which are used to manage workflows.<br>• HTTP Trigger nodes, which are used when an external scheduling system triggers the scheduling of nodes in DataWorks, OSS object inspection nodes, and FTP Check nodes.<br>• Assignment nodes, which pass input and output parameters between nodes, and parameter nodes.<br>• Do-while nodes, which execute node code in loops, for-each nodes, which traverse and evaluate the outputs of assignment nodes in loops, and branch nodes.<br>• Other nodes, such as common Shell nodes and MySQL database nodes.<br>After tasks are developed, you can perform the following operations based on your business requirements:<br>• Configure scheduling properties: To enable DataWorks to run tasks periodically, configure scheduling properties for the nodes, such as scheduling dependencies and scheduling parameters (see the scheduling sketch after this table).<br>• Debug nodes: To ensure that tasks run efficiently in the production environment and to avoid wasting compute resources, we recommend that you debug tasks before you deploy them.<br>• Deploy nodes: Tasks can be scheduled only after they are deployed to the production environment. After deployment, you can view and manage the tasks on the Auto Triggered Nodes page in Operation Center.<br>• Manage nodes: You can deploy and undeploy tasks and modify scheduling properties for multiple tasks at a time.<br>• Perform process management: DataWorks provides process control for task development and deployment to ensure that operations on tasks are accurate and secure. For example, DataWorks provides the code review, forceful smoke testing, and code review logic customization features. | |
Operation Center | Operation Center is an end-to-end big data O&M and monitoring platform. You can view the status of tasks and perform O&M operations on tasks in which exceptions occur, such as running intelligent diagnostics or rerunning the tasks. Operation Center also provides the intelligent baseline feature to resolve issues such as the uncontrollable output time of important tasks and the difficulty of monitoring a large number of tasks, which helps ensure that tasks produce output on time (see the baseline sketch after this table). | Perform basic O&M operations on auto triggered nodes |
Data Quality | Data Quality ensures data availability throughout the end-to-end data R&D process and provides reliable data for your business in an efficient manner. It helps you identify data quality issues at the earliest opportunity and prevents such issues from escalating through effective rule-based quality checks and by combining monitoring rules with task scheduling processes (see the rule-check sketch after this table). | Data Quality overview |
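The general nodes listed in the DataStudio row are control-flow building blocks. The following plain-Python sketch is only an analogy of how assignment, for-each, and branch nodes cooperate; the node names and data are invented, and real nodes are configured visually in DataStudio rather than written as a script.

```python
# A plain-Python analogy of DataWorks control-flow nodes. The names and data
# below are hypothetical; real assignment/for-each/branch nodes are configured
# in DataStudio, not written as code.

def assignment_node():
    """Like an assignment node: produce an output that downstream nodes consume."""
    return ["table_a", "table_b", "table_c"]  # hypothetical output rows

def branch_node(item):
    """Like a branch node: route each item down one of several paths."""
    return "full_scan" if item == "table_a" else "incremental"

def for_each_node(items):
    """Like a for-each node: traverse the assignment node's output in a loop."""
    for item in items:
        path = branch_node(item)
        print(f"{item} -> run {path} sub-workflow")

if __name__ == "__main__":
    for_each_node(assignment_node())
```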
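Scheduling parameters let a periodically run task know which business date it should process. The following is a minimal sketch, assuming the scheduler passes the business date as the first command-line argument in yyyymmdd format (similar to how the $bizdate parameter resolves for Shell nodes); the table name dwd_orders and the ds partition are hypothetical.

```python
import sys
from datetime import datetime, timedelta

# Minimal sketch of a periodically scheduled task that consumes a business-date
# scheduling parameter. Assumption: the scheduler passes the parameter as the
# first command-line argument in yyyymmdd format.
def main() -> None:
    if len(sys.argv) > 1:
        bizdate = sys.argv[1]
    else:
        # Fallback for local debugging: yesterday, matching bizdate semantics.
        bizdate = (datetime.now() - timedelta(days=1)).strftime("%Y%m%d")
    # `dwd_orders` and the `ds` partition are hypothetical names.
    print(f"ALTER TABLE dwd_orders ADD IF NOT EXISTS PARTITION (ds='{bizdate}');")
    print(f"-- process data for partition ds={bizdate}")

if __name__ == "__main__":
    main()
```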
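DataWorks does not document the internals of the intelligent baseline feature, but the core idea can be illustrated: propagate expected runtimes along the task dependency chain and alert when the slack before a committed output time becomes too small. All task names, runtimes, and the alert threshold below are invented.

```python
from datetime import datetime, timedelta

# Illustrative only: estimate each task's finish time by propagating average
# runtimes through the dependency DAG, then alert if the final task's margin
# against its baseline commitment is too small. All numbers are hypothetical.
avg_runtime_min = {"extract": 30, "transform": 45, "report": 25}
deps = {"extract": [], "transform": ["extract"], "report": ["transform"]}

def estimated_finish(task: str, start: datetime) -> datetime:
    ready = max((estimated_finish(d, start) for d in deps[task]), default=start)
    return ready + timedelta(minutes=avg_runtime_min[task])

start = datetime(2024, 1, 1, 0, 0)
baseline = datetime(2024, 1, 1, 2, 0)   # committed output time for "report"
margin = baseline - estimated_finish("report", start)
if margin < timedelta(minutes=30):      # hypothetical alert threshold
    print(f"ALERT: only {margin} of slack before the baseline is breached")
```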
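The Data Quality row describes rule-based checks that are combined with task scheduling so that bad data does not propagate downstream. The following generic sketch shows that combination; the rules, thresholds, and sample rows are hypothetical, and the actual module is configured through monitoring rules rather than code.

```python
from dataclasses import dataclass
from typing import Callable

# Generic sketch of rule-based quality checks hooked into a scheduling flow:
# evaluate each monitoring rule after a task produces data, and block
# downstream tasks if a blocking rule fails. Rules and data are hypothetical.
@dataclass
class Rule:
    name: str
    check: Callable[[list[dict]], bool]
    blocking: bool  # a failed blocking rule stops downstream scheduling

rows = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": None}]  # sample output

rules = [
    Rule("table is not empty", lambda r: len(r) > 0, blocking=True),
    Rule("null ratio of amount <= 10%",
         lambda r: sum(x["amount"] is None for x in r) / len(r) <= 0.10,
         blocking=False),
]

def run_checks(rows: list[dict]) -> bool:
    ok = True
    for rule in rules:
        passed = rule.check(rows)
        print(f"{'PASS' if passed else 'FAIL'}: {rule.name}")
        if not passed and rule.blocking:
            ok = False
    return ok

if not run_checks(rows):
    print("Blocking rule failed: downstream tasks are not triggered.")
```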