You can register Cloudera's Distribution Including Apache Hadoop (CDH) and Cloudera Data Platform (CDP) clusters to DataWorks. After registration, you can use the data development and governance features of DataWorks to manage CDH or CDP data. These features include task development, task scheduling, metadata management in Data Map, and Data Quality. Before you register a CDH or CDP cluster to DataWorks, you must obtain the required configuration information about the cluster and configure network connectivity between the cluster and a specific resource group. This topic describes how to complete both preparations for a CDH cluster.
Background information
CDH is the open source platform distribution of Cloudera. CDH provides out-of-the-box features such as cluster management, cluster monitoring, and cluster diagnostics. CDH also supports a variety of components to help you run end-to-end big data workflows.
CDP is Cloudera Data Platform, the successor to CDH. CDP combines the capabilities of CDH and Hortonworks Data Platform (HDP) in a single data platform.
You can register CDH and CDP clusters to DataWorks. Then, you can use DataWorks features such as task development, task scheduling, metadata management in Data Map, and data quality monitoring to develop and manage data in the clusters based on your business requirements.
Prerequisites
A CDH cluster is deployed on an Elastic Compute Service (ECS) instance.
The CDH cluster can also be deployed in an environment other than Alibaba Cloud ECS. In this case, you must make sure that the environment is connected to an Alibaba Cloud virtual private cloud (VPC). You can use a service such as Express Connect or VPN Gateway to establish network connectivity.
A new-version serverless resource group or an old-version exclusive resource group for scheduling is purchased. We recommend that you purchase a serverless resource group.
By default, DataWorks resource groups are not connected to the networks of other cloud services after the resource groups are purchased. A CDH cluster must be connected to a specific resource group before you can use the CDH cluster.
Note: DataWorks provides general-purpose serverless resource groups, and we recommend that you purchase this type of resource group. Serverless resource groups are suitable for scenarios that involve different task types, such as data synchronization and task scheduling. For information about how to purchase a serverless resource group, see Create and use a serverless resource group. Users who have not activated DataWorks of any edition in the current region can purchase only serverless resource groups.
If you have purchased an old-version exclusive resource group for scheduling, you can also use the resource group to run CDH or CDP tasks. For more information, see Create and use an exclusive resource group for scheduling.
Obtain the configuration information about the CDH cluster
Perform the following steps to obtain the configuration information about the CDH cluster. The configuration information is required when you register the CDH cluster to DataWorks.
Obtain the version information about the CDH cluster.
Log on to the Cloudera Manager Admin Console. On the page that appears, the version information is displayed to the right of the cluster name.
Obtain the host and component addresses of the CDH cluster. The addresses are required when you register the CDH cluster to DataWorks.
Obtain the addresses from the Cloudera Manager Admin Console
Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on the keywords and icons. Then, record the hostnames displayed on the left and construct the component addresses from the hostnames and the required address format.
Components:
HS2: HiveServer2
HMS: Hive Metastore
ID: Impala Daemon
RM: YARN ResourceManager
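The following Python sketch shows one way to build component addresses from a recorded hostname. The ports are the usual CDH defaults and the URI schemes are the common Hive, Impala, and YARN conventions. Both are assumptions that are not stated in this topic, so confirm the ports in Cloudera Manager and use the address format that the DataWorks registration page requires. The names DEFAULT_PORTS, ADDRESS_TEMPLATES, and build_address are hypothetical.

```python
# Assumed default ports; confirm the actual values in Cloudera Manager.
DEFAULT_PORTS = {
    "HS2": 10000,  # HiveServer2
    "HMS": 9083,   # Hive Metastore
    "ID": 21050,   # Impala Daemon
    "RM": 8032,    # YARN ResourceManager
}

# Assumed address formats; replace them with the format that DataWorks requires.
ADDRESS_TEMPLATES = {
    "HS2": "jdbc:hive2://{host}:{port}",
    "HMS": "thrift://{host}:{port}",
    "ID": "jdbc:impala://{host}:{port}",
    "RM": "{host}:{port}",
}


def build_address(component, host, port=None):
    """Return the address of a component on the given host."""
    if component not in ADDRESS_TEMPLATES:
        raise ValueError(f"Unknown component: {component}")
    if port is None:
        port = DEFAULT_PORTS[component]
    return ADDRESS_TEMPLATES[component].format(host=host, port=port)


# Example with the hostname used later in this topic.
for name in ("HS2", "HMS", "ID", "RM"):
    print(name, build_address(name, "cdh-header-1-cn-shanghai"))
```

If a component listens on a non-default port, pass the port explicitly, for example build_address("HS2", "cdh-header-1-cn-shanghai", port=10001).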
Obtain the configuration files of the CDH cluster. The configuration files must be uploaded when you register the CDH cluster to DataWorks.
Log on to the Cloudera Manager Admin Console.
On the Status tab, click the drop-down arrow to the right of the cluster name and select View Client Configuration URLs.
In the Client Configuration URLs dialog box, download a specific configuration package. In this example, the YARN configuration package is downloaded.
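Before you upload a downloaded configuration package, you may want to check what it contains. The following Python sketch reads the properties from a Hadoop-style XML file inside the package. It assumes that the package is a ZIP archive that contains a file whose name ends with yarn-site.xml. The function name read_hadoop_properties and the archive name in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile


def read_hadoop_properties(zip_path, file_suffix="yarn-site.xml"):
    """Return {name: value} from the first archive member that ends with file_suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        member = next(
            (name for name in archive.namelist() if name.endswith(file_suffix)),
            None,
        )
        if member is None:
            raise FileNotFoundError(f"{file_suffix} not found in {zip_path}")
        with archive.open(member) as handle:
            root = ET.parse(handle).getroot()

    properties = {}
    # Hadoop configuration files store settings as <property><name/><value/></property>.
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties
```

For example, read_hadoop_properties("yarn-clientconfig.zip").get("yarn.resourcemanager.address") returns the ResourceManager address if the property is set in the package, and None otherwise.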
Obtain the network information about the CDH cluster. The network information is used to configure network connectivity between the CDH cluster and a DataWorks resource group.
Log on to the ECS console.
In the left-side navigation pane, choose Instances & Images > Instances. In the top navigation bar, select the region where the ECS instance that hosts the CDH cluster resides. On the Instance page, find the ECS instance and click its ID. On the Instance Details tab of the page that appears, view and record the network information about the instance, such as the security group, VPC, and vSwitch.
Configure network connectivity
Serverless resource group
This section uses a serverless resource group as an example to describe how to establish a network connection between the resource group and a CDH cluster.
By default, DataWorks serverless resource groups are not connected to the networks of other cloud services after the resource groups are created. Before you use CDH, you must obtain the network information of your CDH cluster and associate your DataWorks serverless resource group with the VPC in which the CDH cluster is deployed. This ensures network connectivity between the CDH cluster and DataWorks serverless resource group.
Go to the network configuration page of the serverless resource group.
Log on to the DataWorks console.
In the left-side navigation pane, click Resource Group. The Exclusive Resource Groups tab appears.
Find the desired serverless resource group and click Network Settings in the Actions column.
Associate the resource group with the VPC in which the CDH cluster is deployed.
In the Data Scheduling & Data Integration section of the VPC Binding tab that appears, click Add VPC Association. In the Add VPC Association dialog box, select the VPC, zone, and vSwitch that are recorded in Step 4 in the "Obtain the configuration information about the CDH cluster" section.
Configure hosts.
Log on to the Alibaba Cloud DNS console. On the Private DNS (PrivateZone) page, configure authoritative DNS resolution for the host addresses that are recorded in Step 2 in the "Obtain the configuration information about the CDH cluster" section.
Activate Private DNS. For more information, see Activate Private DNS.
Note: If you have activated Private DNS, you can skip this step.
Add a built-in authoritative zone. For more information, see Add a built-in authoritative zone.
Note: In this example, authoritative DNS resolution is performed on the cdh-header-1-cn-shanghai host address that is obtained from the Cloudera Manager Admin Console. You can change the value based on your host address configuration. The resolved IP address is the private IP address of the ECS instance on which your CDH cluster is deployed.
Set an effective scope for the built-in authoritative zone. For more information, see Set an effective scope for a built-in authoritative zone.
Note: When you specify a VPC where the built-in authoritative zone takes effect, you must select the VPC with which your CDH cluster and resource group are associated.
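After the zone takes effect, you can check name resolution and port reachability from a machine in the same VPC, such as another ECS instance. The following Python sketch is one way to do this with the standard library. It is not a DataWorks feature, and the function names resolve_ipv4 and can_connect are hypothetical. The port in the usage comment is the assumed HiveServer2 default.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses of hostname, or an empty list if it does not resolve."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, run from a machine in the VPC:
#   resolve_ipv4("cdh-header-1-cn-shanghai")        # should list the private IP address
#   can_connect("cdh-header-1-cn-shanghai", 10000)  # 10000 is the assumed HiveServer2 port
```

If resolve_ipv4 returns an empty list, recheck the effective scope of the zone. If the hostname resolves but can_connect returns False, check the security group rules of the ECS instance.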
Exclusive resource group for scheduling
This section uses an exclusive resource group for scheduling as an example to describe how to establish a network connection between the resource group and a CDH cluster.
By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of other cloud services after the resource groups are created. Before you use CDH, you must obtain the network information of your CDH cluster. Then, associate your DataWorks exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed. This ensures network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
Go to the network configuration page of the exclusive resource group for scheduling.
Log on to the DataWorks console.
In the left-side navigation pane, click Resource Group. The Exclusive Resource Groups tab appears.
Find the desired exclusive resource group for scheduling and click Network Settings in the Actions column.
Associate the resource group with the VPC in which the CDH cluster is deployed.
On the VPC Binding tab of the page that appears, click Add VPC Association. In the Add VPC Association dialog box, select the VPC, zone, vSwitch, and security group that are recorded in Step 4 in the "Obtain the configuration information about the CDH cluster" section.
Configure hosts.
Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the host addresses that are recorded in Step 2 in the "Obtain the configuration information about the CDH cluster" section.
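The following Python sketch prepares the text for batch entry from the recorded hostnames. It assumes that the dialog box accepts one mapping per line in hosts-file style, with the IP address followed by the hostname, and that each address is the private IP address of the ECS instance. Verify both assumptions in the console. The function name format_host_mappings and the IP address in the example are hypothetical.

```python
import ipaddress


def format_host_mappings(mappings):
    """Return one 'IP hostname' line per entry of a {hostname: ip} dictionary."""
    lines = []
    for hostname, ip in mappings.items():
        address = ipaddress.ip_address(ip)  # raises ValueError for a malformed address
        if not address.is_private:
            raise ValueError(f"{ip} is not a private IP address")
        lines.append(f"{address} {hostname}")
    return "\n".join(lines)


# Example with a placeholder private IP address.
print(format_host_mappings({"cdh-header-1-cn-shanghai": "192.168.0.10"}))
```

Rejecting public addresses catches a common copy error, because the resource group reaches the cluster through the VPC.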
What to do next
After you complete the preparations, you can register the CDH cluster to DataWorks for data development. For more information, see Register a CDH or CDP cluster to DataWorks.