Module | Description | References |
--- | --- | --- |
Data Modeling | Data Modeling is the first step of end-to-end data governance. It applies the modeling methodology of the Alibaba data mid-end and interprets enterprise business data from a business perspective by using the data warehouse planning, data standard, dimensional modeling, and data metric modules. This allows personnel across the enterprise to quickly understand and share a consistent, specification-compliant way of measuring and interpreting business data. | Data Modeling overview |
DataStudio | DataWorks encapsulates the capabilities of CDP and CDH compute engines. This way, you can use a CDP or CDH compute engine to run data synchronization and development tasks.<br>Data synchronization: DataStudio supports only specific batch and real-time synchronization scenarios. For more information, see Data Integration overview.<br>Data development: You can develop different types of tasks in DataWorks and let the system schedule them periodically, without the need to use complex command lines. You can use general nodes and engine-specific nodes to process complex logic. DataWorks supports the following types of general nodes (see the control-flow sketch after this table):<br>• Zero load nodes, which are used to manage workflows.<br>• HTTP Trigger nodes, which are used when an external scheduling system triggers the scheduling of nodes in DataWorks, OSS object inspection nodes, and FTP Check nodes.<br>• Assignment nodes, which pass input and output parameters between nodes, and parameter nodes.<br>• Do-while nodes, which execute node code in loops, for-each nodes, which traverse and evaluate the outputs of assignment nodes in loops, and branch nodes.<br>• Other nodes, such as common Shell nodes and MySQL database nodes.<br>After tasks are developed, you can perform the following operations based on your business requirements:<br>• Configure scheduling properties: To enable DataWorks to run tasks periodically, configure scheduling properties for the nodes, such as scheduling dependencies and scheduling parameters (see the scheduling sketch after this table).<br>• Debug nodes: To ensure that tasks run efficiently in the production environment and to avoid wasting compute resources, we recommend that you debug tasks before you deploy them.<br>• Deploy nodes: Tasks can be scheduled only after they are deployed to the production environment. After deployment, you can view and manage the tasks on the Auto Triggered Nodes page in Operation Center.<br>• Manage nodes: You can deploy and undeploy tasks and modify scheduling properties for multiple tasks at a time.<br>• Perform process management: DataWorks provides process control for task development and deployment to ensure that operations on tasks are accurate and secure. For example, DataWorks provides the code review, forceful smoke testing, and code review logic customization features. | |
Operation Center | Operation Center is an end-to-end big data O&M and monitoring platform. You can view the status of tasks and perform O&M operations on tasks in which exceptions occur, such as running intelligent diagnostics or rerunning the tasks. Operation Center also provides the intelligent baseline feature to resolve issues such as the uncontrollable output time of important tasks and the difficulty of monitoring a large number of tasks, which helps ensure that tasks produce output on time (see the baseline sketch after this table). | Perform basic O&M operations on auto triggered nodes |
Data Quality | Data Quality ensures data availability throughout the end-to-end data R&D process and provides reliable data for your business in an efficient manner. It helps you identify data quality issues at the earliest opportunity and prevents such issues from escalating through effective rule-based quality checks and by combining monitoring rules with task scheduling processes (see the rule-check sketch after this table). | Data Quality overview |
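The general nodes listed in the DataStudio row are control-flow building blocks. The following plain-Python sketch is only an analogy of how assignment, for-each, and branch nodes cooperate; the node names and data are invented, and real nodes are configured visually in DataStudio rather than written as a script.

```python
# A plain-Python analogy of DataWorks control-flow nodes. The names and data
# below are hypothetical; real assignment/for-each/branch nodes are configured
# in DataStudio, not written as code.

def assignment_node():
    """Like an assignment node: produce an output that downstream nodes consume."""
    return ["table_a", "table_b", "table_c"]  # hypothetical output rows

def branch_node(item):
    """Like a branch node: route each item down one of several paths."""
    return "full_scan" if item == "table_a" else "incremental"

def for_each_node(items):
    """Like a for-each node: traverse the assignment node's output in a loop."""
    for item in items:
        path = branch_node(item)
        print(f"{item} -> run {path} sub-workflow")

if __name__ == "__main__":
    for_each_node(assignment_node())
```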
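Scheduling parameters let a periodically run task know which business date it should process. The following is a minimal sketch, assuming the scheduler passes the business date as the first command-line argument in yyyymmdd format (similar to how the $bizdate parameter resolves for Shell nodes); the table name dwd_orders and the ds partition are hypothetical.

```python
import sys
from datetime import datetime, timedelta

# Minimal sketch of a periodically scheduled task that consumes a business-date
# scheduling parameter. Assumption: the scheduler passes the parameter as the
# first command-line argument in yyyymmdd format.
def main() -> None:
    if len(sys.argv) > 1:
        bizdate = sys.argv[1]
    else:
        # Fallback for local debugging: yesterday, matching bizdate semantics.
        bizdate = (datetime.now() - timedelta(days=1)).strftime("%Y%m%d")
    # `dwd_orders` and the `ds` partition are hypothetical names.
    print(f"ALTER TABLE dwd_orders ADD IF NOT EXISTS PARTITION (ds='{bizdate}');")
    print(f"-- process data for partition ds={bizdate}")

if __name__ == "__main__":
    main()
```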
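DataWorks does not document the internals of the intelligent baseline feature, but the core idea can be illustrated: propagate expected runtimes along the task dependency chain and alert when the slack before a committed output time becomes too small. All task names, runtimes, and the alert threshold below are invented.

```python
from datetime import datetime, timedelta

# Illustrative only: estimate each task's finish time by propagating average
# runtimes through the dependency DAG, then alert if the final task's margin
# against its baseline commitment is too small. All numbers are hypothetical.
avg_runtime_min = {"extract": 30, "transform": 45, "report": 25}
deps = {"extract": [], "transform": ["extract"], "report": ["transform"]}

def estimated_finish(task: str, start: datetime) -> datetime:
    ready = max((estimated_finish(d, start) for d in deps[task]), default=start)
    return ready + timedelta(minutes=avg_runtime_min[task])

start = datetime(2024, 1, 1, 0, 0)
baseline = datetime(2024, 1, 1, 2, 0)   # committed output time for "report"
margin = baseline - estimated_finish("report", start)
if margin < timedelta(minutes=30):      # hypothetical alert threshold
    print(f"ALERT: only {margin} of slack before the baseline is breached")
```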
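The Data Quality row describes rule-based checks that are combined with task scheduling so that bad data does not propagate downstream. The following generic sketch shows that combination; the rules, thresholds, and sample rows are hypothetical, and the actual module is configured through monitoring rules rather than code.

```python
from dataclasses import dataclass
from typing import Callable

# Generic sketch of rule-based quality checks hooked into a scheduling flow:
# evaluate each monitoring rule after a task produces data, and block
# downstream tasks if a blocking rule fails. Rules and data are hypothetical.
@dataclass
class Rule:
    name: str
    check: Callable[[list[dict]], bool]
    blocking: bool  # a failed blocking rule stops downstream scheduling

rows = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": None}]  # sample output

rules = [
    Rule("table is not empty", lambda r: len(r) > 0, blocking=True),
    Rule("null ratio of amount <= 10%",
         lambda r: sum(x["amount"] is None for x in r) / len(r) <= 0.10,
         blocking=False),
]

def run_checks(rows: list[dict]) -> bool:
    ok = True
    for rule in rules:
        passed = rule.check(rows)
        print(f"{'PASS' if passed else 'FAIL'}: {rule.name}")
        if not passed and rule.blocking:
            ok = False
    return ok

if not run_checks(rows):
    print("Blocking rule failed: downstream tasks are not triggered.")
```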