DataWorks lets you connect to Cloudera's Distribution Including Apache Hadoop (CDH) and Cloudera Data Platform (CDP) clusters. You can register CDH and CDP clusters in DataWorks to perform a series of data development and administration operations, such as task development, scheduling, Data Map (metadata management), and Data Quality. Before you register a CDH or CDP cluster, you must obtain the required configuration information and configure network connectivity between the cluster and the resource group. This topic uses a CDH cluster as an example and describes how to obtain cluster information and configure network connectivity between the cluster and a resource group.
Background information
CDH is an open source platform distribution from Cloudera. It provides out-of-the-box features such as cluster management, monitoring, and diagnostics. It also supports various components to help you run end-to-end big data workflows.
CDP is Cloudera's data platform and the successor to CDH. It combines the capabilities of CDH and Hortonworks Data Platform (HDP) into a single platform for data management and analytics.
You can register CDH and CDP clusters in DataWorks to perform data development and administration operations for your business needs. These operations include task development, scheduling, Data Map (metadata management), and Data Quality.
Prerequisites
A CDH cluster is deployed.
DataWorks supports CDH clusters that are not deployed on Alibaba Cloud ECS instances. However, you must ensure that the environment where the CDH cluster is deployed can connect to an Alibaba Cloud virtual private cloud (VPC). You can typically use network solutions such as Express Connect or VPN to ensure connectivity.
A DataWorks resource group is purchased: either a Serverless resource group (recommended) or an old-version exclusive resource group for scheduling.
By default, a newly purchased DataWorks resource group cannot connect to other cloud products. Before you run tasks on a CDH cluster, you must establish network connectivity between the cluster and the resource group.
Note: Serverless resource groups (recommended) are general-purpose resource groups. They can be used for various task types, such as data synchronization and task scheduling. For more information about purchasing a Serverless resource group, see Use a Serverless resource group. New users, that is, users who have not activated any version of DataWorks in the current region, can purchase only new (Serverless) resource groups.
If you have purchased an old-version exclusive resource group for scheduling, you can also use it to run CDH or CDP tasks. For more information, see Use an exclusive resource group for scheduling.
Obtain CDH cluster configuration information
Follow these steps to obtain the CDH configuration information that you need to register the CDH cluster in DataWorks.
Obtain the CDH version.
Log on to Cloudera Manager. On the main page, find the version of the deployed CDH cluster. The version is displayed to the right of the cluster name.

Obtain the host and component addresses. You will use this information to configure the cluster connection when you register the CDH cluster.
Check the addresses manually in Cloudera Manager
Log on to Cloudera Manager. From the Hosts drop-down menu, select Roles. Identify the service to configure based on its keyword and icon. Then, find the corresponding Host on the left and record the address in the required format.

The role keywords map to the following services:
HS2: HiveServer2
HMS: Hive Metastore
ID: Impala Daemon
RM: YARN ResourceManager
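The following minimal sketch shows one way to turn the recorded role-to-host mapping into address strings. The ports are the defaults commonly used by these services and the address templates are hypothetical illustrations. Neither comes from this topic, so confirm the actual ports in Cloudera Manager and the exact format that the DataWorks registration page expects.

```python
# Commonly used default ports. These are assumptions, not values from this
# topic: confirm the ports that your cluster actually uses in Cloudera Manager.
DEFAULT_PORTS = {
    "HS2": 10000,  # HiveServer2
    "HMS": 9083,   # Hive Metastore
    "ID": 21050,   # Impala Daemon (HiveServer2-compatible port)
    "RM": 8032,    # YARN ResourceManager (client RPC)
}

# Hypothetical address templates, for illustration only. Use the format
# shown on the DataWorks registration page.
ADDRESS_TEMPLATES = {
    "HS2": "jdbc:hive2://{host}:{port}",
    "HMS": "thrift://{host}:{port}",
    "ID": "jdbc:impala://{host}:{port}",
    "RM": "{host}:{port}",
}


def build_addresses(role_hosts, ports=None):
    """Map each role keyword (HS2, HMS, ID, RM) to an address string.

    role_hosts: dict of role keyword -> host name recorded from Cloudera Manager.
    ports: optional dict that overrides DEFAULT_PORTS for specific roles.
    """
    merged_ports = {**DEFAULT_PORTS, **(ports or {})}
    return {
        role: ADDRESS_TEMPLATES[role].format(host=host, port=merged_ports[role])
        for role, host in role_hosts.items()
    }


if __name__ == "__main__":
    # The host name is the example host used in this topic.
    recorded = {"HS2": "cdh-header-1-cn-shanghai", "HMS": "cdh-header-1-cn-shanghai"}
    for role, address in build_addresses(recorded).items():
        print(role, address)
```

Keeping the ports and templates in dictionaries makes it easy to override a single value when a cluster uses a non-default port.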
Obtain the configuration file. You will upload this file when you register the CDH cluster.
Log on to Cloudera Manager.
On the Status page, click the cluster's drop-down menu and select View Client Configuration URL.

In the dialog box, download the configuration package. This example uses YARN.
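Before you upload the package, it can help to confirm that it contains the expected settings. The sketch below reads the Hadoop-style `<property>` entries from an XML file inside the downloaded archive. It assumes that the package is a ZIP archive containing a file whose name ends with `yarn-site.xml`. The function name and the archive file name in the usage note are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def read_hadoop_properties(zip_source, file_suffix="yarn-site.xml"):
    """Return {name: value} from the first archive member ending with file_suffix.

    zip_source: path to the downloaded package, or a binary file-like object.
    file_suffix: assumed name of the configuration file inside the archive.
    """
    with zipfile.ZipFile(zip_source) as archive:
        matches = [name for name in archive.namelist() if name.endswith(file_suffix)]
        if not matches:
            raise FileNotFoundError(f"no member ending with {file_suffix!r} in archive")
        with archive.open(matches[0]) as handle:
            root = ET.parse(handle).getroot()

    properties = {}
    # Hadoop configuration files store settings as <property><name/><value/></property>.
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties
```

For example, `read_hadoop_properties("yarn-clientconfig.zip").get("yarn.resourcemanager.address")` would return the ResourceManager address if that property is present. The archive name here is a placeholder for the file that you downloaded.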

Obtain the network information of the CDH cluster. You will use this information to configure network connectivity with the DataWorks resource group.
Log on to the ECS console.
In the instance list, find the ECS instance where the CDH cluster is deployed. Click the instance name to go to the Instance Details page. On this page, record the security group, VPC, and vSwitch information.
Configure network connectivity
Serverless Resource Groups
This section describes how to configure network connectivity between a Serverless resource group and a CDH cluster.
After you purchase a DataWorks Serverless resource group, it cannot connect to other cloud products by default. To connect to a CDH cluster, you must obtain the network information of the CDH cluster and attach the resource group to the VPC where the cluster is deployed. This ensures network connectivity between the CDH cluster and the resource group.
Go to the network configuration page for the Serverless resource group.
Log on to the DataWorks console.
In the navigation pane on the left, click Resource Group. The Exclusive Resource Groups tab on the Resource Group List page is displayed by default.
Click Network Settings next to your resource group.
Attach a VPC.
On the VPC Binding tab, in the Data Scheduling & Data Integration section, click Add Binding. On the configuration page, select the VPC, zone, and vSwitch where the CDH cluster is located. Use the information that you recorded in Step 4 of the "Obtain CDH cluster configuration information" section.
Configure the host.
Go to the Alibaba Cloud DNS console and add an authoritative zone in PrivateZone for the host addresses that you recorded in Step 2 of the "Obtain CDH cluster configuration information" section.
Activate internal DNS resolution. For more information, see Activate internal DNS resolution.
Note: If you have already activated internal DNS resolution, you can skip this step.
Add a built-in authoritative domain name. For more information, see Add a built-in authoritative domain name.
Note: This topic uses the host domain name cdh-header-1-cn-shanghai, recorded in Step 2 of the "Obtain CDH cluster configuration information" section, as an example. An authoritative resolution is configured for the domain name cdh-header-1-cn-shanghai. Adjust this parameter based on your host domain name. The resolved IP address is the Private IP Address of the ECS instance where the CDH cluster is deployed.
Set the scope of the domain name. For more information, see Set the scope of a domain name.
Note: When you set the scope of the domain name, select the VPC to which the CDH cluster and the resource group are attached.
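After the DNS settings take effect, you may want to confirm that the host name resolves and that a service port accepts connections. The following minimal sketch attempts a TCP connection. It assumes that you run it on a machine inside the same VPC, such as an ECS instance, and not on the resource group itself. The port in the usage note is a commonly used HiveServer2 default, not a value from this topic.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Return True if host resolves and a TCP connection to port succeeds."""
    try:
        # create_connection resolves the host name and then connects.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

For example, `check_endpoint("cdh-header-1-cn-shanghai", 10000)` returns `False` if the host name does not resolve or the port is unreachable. A successful TCP connection shows only that the port is open. It does not verify that the service behind it is healthy.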
Exclusive Resource Group for Scheduling
This section describes how to configure network connectivity between an exclusive resource group for scheduling and a CDH cluster.
After you purchase a DataWorks exclusive resource group for scheduling, it cannot connect to other cloud products by default. To connect to a CDH cluster, you must obtain the network information of the CDH cluster and attach the exclusive resource group for scheduling to the VPC where the cluster is deployed. This ensures network connectivity between the CDH cluster and the exclusive resource group for scheduling.
Go to the network configuration page for the exclusive resource group.
Log on to the DataWorks console.
In the navigation pane on the left, click Resource Group. The Resource Group List page appears, and the Exclusive Resource Groups tab is selected by default.
Click Network Settings next to your exclusive resource group for scheduling.
Attach a VPC.
On the VPC Binding tab, click Add Binding. On the configuration page, select the VPC, zone, vSwitch, and security group for the CDH cluster. You recorded this information in Step 4 of the "Obtain CDH cluster configuration information" section.
Configure the host.
On the Host Configuration tab, click Batch Modify. In the dialog box, enter the host address information that you recorded in Step 2 of the "Obtain CDH cluster configuration information" section.
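If the cluster has many hosts, you can generate the host entries instead of typing them. The sketch below formats each entry as an IP address followed by the host name, one entry per line, as in a hosts file. This line format is an assumption, so check the format that the Batch Modify dialog box requests. The IP address in the usage note is a placeholder.

```python
import ipaddress


def format_host_entries(host_to_ip):
    """Return hosts-file style lines ("<ip> <hostname>"), sorted by host name.

    host_to_ip: dict of host name -> private IP address of the ECS instance.
    Raises ValueError if an IP address is malformed.
    """
    lines = []
    for hostname, ip in sorted(host_to_ip.items()):
        ipaddress.ip_address(ip)  # Validate before emitting the entry.
        lines.append(f"{ip} {hostname}")
    return "\n".join(lines)
```

For example, `format_host_entries({"cdh-header-1-cn-shanghai": "192.168.0.10"})` produces a single line that pairs the placeholder address with the example host name used in this topic.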

What to do next
After you complete the preparations described in this topic, you can register the CDH cluster in DataWorks and perform development operations. For more information, see Data Development (Legacy): Attach a CDH computing resource.