DataWorks can connect to Cloudera's Distribution Including Apache Hadoop (CDH) and Cloudera Data Platform (CDP). After you register a CDH or CDP cluster to DataWorks, you can use the data development and governance features of DataWorks to manage the data in the cluster. These features include task development, task scheduling, metadata management in Data Map, and data quality monitoring in Data Quality.
Background information
CDH is the open source platform distribution of Cloudera. CDH provides out-of-the-box features such as cluster management, cluster monitoring, and cluster diagnostics. CDH also supports a variety of components to help you run end-to-end big data workflows.
CDP is the Cloudera data platform that succeeds CDH. It combines technologies from CDH and Hortonworks Data Platform (HDP) and provides data management and analytics capabilities for on-premises and cloud environments.
You can register CDH and CDP clusters to DataWorks. Then, you can use DataWorks features such as task development, task scheduling, metadata management in Data Map, and data quality monitoring to develop and manage data in the clusters based on your business requirements.
Prerequisites
The identity that you want to use is prepared and granted the required permissions. Only the following identities can register a CDH or CDP cluster:
An Alibaba Cloud account.
A DataWorks workspace member that is assigned the Workspace Administrator role. For more information about how to assign roles to members, see Add a RAM user to a workspace as a member and assign roles to the member.
A DataWorks workspace member to which the AliyunDataWorksFullAccess policy is attached. For information about how to grant permissions, see Grant permissions to a RAM user and Grant permissions to a RAM role. For information about how to add a user to a DataWorks workspace as a member, see Add a RAM user to a workspace as a member and assign roles to the member.
A CDH or CDP cluster is deployed, and the required configuration information about the cluster is obtained. For more information, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.
Limits
Only serverless resource groups or old-version exclusive resource groups for scheduling can be used to run CDH or CDP tasks.
Note: DataWorks provides general-purpose serverless resource groups, and we recommend that you use this type of resource group to run CDH or CDP tasks. Serverless resource groups suit scenarios that involve different task types, such as data synchronization and task scheduling. For information about how to purchase a serverless resource group, see Create and use a serverless resource group. If you have purchased an old-version exclusive resource group for scheduling, you can also use the resource group to run CDH or CDP tasks. For more information, see Create and use an exclusive resource group for scheduling.
New users can purchase only serverless resource groups.
If you register a cluster of a custom version to DataWorks, you can use only old-version exclusive resource groups for scheduling to run relevant tasks. For more information about cluster versions, see the Step 2: Register a CDH or CDP cluster section in this topic.
You can register a CDH or CDP cluster to DataWorks only in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), China (Chengdu), and Germany (Frankfurt).
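The limits above can be checked mechanically before you start the registration. The following Python sketch is illustrative only: the function name check_limits and the resource group labels are hypothetical and are not part of any DataWorks API. It encodes only the region list and the resource group rules stated in this section.

```python
from typing import List

# Regions listed in the Limits section of this topic.
SUPPORTED_REGIONS = {
    "China (Beijing)", "China (Shanghai)", "China (Hangzhou)",
    "China (Shenzhen)", "China (Zhangjiakou)", "China (Chengdu)",
    "Germany (Frankfurt)",
}

# Hypothetical labels for the two resource group types named in this topic.
SERVERLESS = "serverless"
EXCLUSIVE_SCHEDULING = "old_version_exclusive_scheduling"


def check_limits(region: str, resource_group: str, custom_version: bool) -> List[str]:
    """Return a list of violated limits; an empty list means none were found."""
    problems = []
    if region not in SUPPORTED_REGIONS:
        problems.append("Region is not supported for CDH or CDP registration: " + region)
    if resource_group not in (SERVERLESS, EXCLUSIVE_SCHEDULING):
        problems.append("Resource group type cannot run CDH or CDP tasks: " + resource_group)
    elif custom_version and resource_group != EXCLUSIVE_SCHEDULING:
        # Custom-version clusters are limited to old-version exclusive
        # resource groups for scheduling.
        problems.append("Custom-version clusters require an old-version "
                        "exclusive resource group for scheduling")
    return problems
```

For example, calling check_limits("China (Shanghai)", SERVERLESS, True) reports the custom-version restriction, which signals that a different resource group type is needed before registration.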
Step 1: Go to the cluster registration page
Go to the Management Center page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Management Center in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box that appears, click CDH to go to the cluster registration page.
Step 2: Register a CDH or CDP cluster
If you use a workspace in standard mode, you must register the cluster in the development and production environments. For information about the modes of workspaces, see Differences between workspaces in basic mode and workspaces in standard mode.
The procedure of registering a CDP cluster to DataWorks is similar to the procedure of registering a CDH cluster to DataWorks. This topic describes how to register a CDH cluster to DataWorks.
Configure the basic information about the cluster.
Parameter
Description
Display Name Of Cluster
The name of the cluster in DataWorks. The name must be unique within the current tenant.
Cluster Version
The version of the cluster that you want to register.
You can select CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, or CDP 7.1.7 from the drop-down list. After you select one of these cluster versions, the component versions that are compatible with the cluster version are fixed. You can view the component versions when you configure settings in the Cluster Connection Information section. If the provided cluster versions do not meet your business requirements, you can select Custom Version from the drop-down list and specify the version for each component based on your business requirements.
Note: The components required for the cluster vary based on the cluster version. You can view the components whose versions you must specify in the Cluster Connection Information section.
If you register a cluster of a custom version to DataWorks, you can use only an old-version exclusive resource group for scheduling to run relevant tasks. After the registration is complete, you must submit a ticket to contact technical support to initialize the environment.
Cluster Name
The name of the cluster that you want to register. This parameter is used to determine the source of the configuration information that is required when you register a cluster. You can select a cluster that is registered to another DataWorks workspace or create a cluster.
If you select a cluster that is registered to another DataWorks workspace, you can reference the configuration information of the cluster.
If you create a cluster, you must configure the cluster before you can register the cluster.
Configure the cluster connection information.
Select versions for required components that are deployed in the cluster based on your business requirements and enter the component addresses that you obtained. For more information about how to obtain component addresses, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.
Note: If you want to use a serverless resource group to access CDH-related components by domain name, you must perform authoritative DNS resolution on the components in the Alibaba Cloud DNS console. For more information, see Add a built-in authoritative zone and Set an effective scope for a built-in authoritative zone.
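Before you enter component addresses, it can help to confirm that each address resolves and accepts TCP connections. The following Python sketch uses only the standard socket module. The function names are hypothetical, and you must substitute the component addresses that you obtained during preparation. A successful check shows only that the port is reachable from the machine that runs the script. It does not prove that a DataWorks resource group can reach or authenticate to the component.

```python
import socket
from typing import Dict, Tuple


def resolves(hostname: str) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


def check_endpoints(endpoints: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Map each component label to True if its address resolves and its port is open."""
    return {
        name: resolves(host) and port_open(host, port)
        for name, (host, port) in endpoints.items()
    }
```

A call might look like check_endpoints({"HiveServer2": ("cdh-master-1.example.internal", 10000)}), where the host name and port are placeholders. Any component that maps to False is worth investigating before you continue with the registration.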
Add configuration files.
You can upload configuration files of required components that are deployed in the cluster based on your business requirements. For more information about how to obtain configuration files, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.
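Which files you must upload depends on the task types that you plan to run and on the account mapping type. The following Python sketch encodes the scenarios that this topic documents for each configuration file. It is illustrative only and the function name is hypothetical. Two entries are assumptions rather than documented rules: this topic states no separate scenario for the Hdfs-Site file or the Presto.Jks file, so the sketch treats them like the Core-Site file and the Config.Properties file respectively.

```python
from typing import Set

# Account mapping type labels as they appear in this topic.
NO_AUTH = "No Authentication"
OPENLDAP = "OpenLDAP Account"
KERBEROS = "Kerberos Account"


def required_config_files(task_types: Set[str], mapping_type: str,
                          uses_presto: bool = False) -> Set[str]:
    """Return the configuration files to upload for the given setup.

    task_types may contain "spark" and "mapreduce" (hypothetical labels).
    """
    runs_spark = "spark" in task_types
    runs_mapreduce = "mapreduce" in task_types
    kerberos = mapping_type == KERBEROS
    required = set()
    if runs_spark or runs_mapreduce:
        required.add("Core-Site")
        required.add("Hdfs-Site")  # Assumption: same scenario as Core-Site.
    if runs_mapreduce:
        required.add("Mapred-Site")
    if runs_spark or runs_mapreduce or kerberos:
        required.add("Yarn-Site")
    if kerberos:
        required.add("Hive-Site")
    if runs_spark:
        required.add("Spark-Defaults")
    if uses_presto and mapping_type in (OPENLDAP, KERBEROS):
        required.add("Config.Properties")
        required.add("Presto.Jks")  # Assumption: same scenario as Config.Properties.
    return required
```

The function returns a set so that you can compare it directly with the set of files you have collected from the cluster.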
The following table describes the configuration files.
Configuration file
Description
Scenario
Core-Site file
Contains the global configurations of the Hadoop core library, such as I/O settings that are commonly used by Hadoop Distributed File System (HDFS) and MapReduce.
You must upload such a file if you want to run Spark or MapReduce tasks.
Hdfs-Site file
Contains HDFS-related configurations, such as the data block size, the number of replicas, and the path name.
Mapred-Site file
Contains MapReduce-related parameters. For example, you can use this file to configure the execution method and scheduling settings of MapReduce jobs.
You must upload such a file if you want to run MapReduce tasks.
Yarn-Site file
Contains all configurations that are related to the YARN daemon, such as the configurations of resource managers, configurations of node managers, and runtime environment configurations of applications.
You must upload such a file if you want to run Spark or MapReduce tasks or if Kerberos Account is selected as the account mapping type.
Hive-Site file
Contains the parameters that are used to configure Hive. For example, you can use this file to configure the database connection information, Hive Metastore, and an execution engine.
You must upload such a file if Kerberos Account is selected as the account mapping type.
Spark-Defaults file
Contains the default configurations based on which a Spark job is run. You can use the spark-defaults.conf file to pre-configure a series of parameters, such as the memory size and CPU cores. The parameter settings are used when a Spark application is run.
You must upload such a file if you want to run Spark tasks.
Config.Properties file
Contains the configurations of a Presto server. For example, you can use this file to configure global properties for coordinator and working nodes in a Presto cluster.
You must upload such a file if you want to use the Presto component and OpenLDAP Account or Kerberos Account is selected as the account mapping type.
Presto.Jks file
Stores security certificates, including private keys and public key certificates issued to applications. In the Presto query engine, the presto.jks file is used to enable SSL- or TLS-encrypted communication for the Presto process to ensure data transmission security.
Configure the default access identity for the cluster.
Configure the identity that is used to access the CDH cluster when you run CDH tasks in DataWorks. The supported identities vary based on the runtime environment.
Note: If the Default Access Identity parameter is set to a value other than Cluster Account, but no required account mapping is configured or the Mapping Type parameter is set to No Authentication, tasks will fail to run.
Runtime environment
Default access identity
References
Development environment
Cluster Account: A fixed cluster account is used to access the CDH cluster regardless of who runs CDH tasks in DataWorks, such as an Alibaba Cloud account or a RAM user that is assigned the Development role.
Cluster account mapped by task performer: You must configure a mapping between a DataWorks tenant member that is used to run CDH tasks and a specific cluster account. After the configuration is complete, the mapped cluster account is used to access the CDH cluster.
Configure mappings between tenant member accounts and cluster accounts
Production environment
Cluster Account: A fixed cluster account is used to access the CDH cluster regardless of who runs CDH tasks in DataWorks, such as an Alibaba Cloud account or a RAM user that is assigned the Development role.
Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User: If you select one of these values for the Default Access Identity parameter, you must configure a mapping between the account that runs CDH tasks and a specific CDH cluster account. After the configuration is complete, the mapped CDH cluster account is actually used to run CDH tasks in DataWorks.
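The way that the default access identity and the account mappings interact can be modeled in a few lines. The following Python sketch is a simplified model of the behavior described above, not DataWorks code. All function, class, and parameter names are hypothetical, and the sketch treats every value other than Cluster Account as a mapped identity.

```python
from typing import Dict

CLUSTER_ACCOUNT = "Cluster Account"
NO_AUTHENTICATION = "No Authentication"


class AccessIdentityError(Exception):
    """Raised where, according to this topic, a task would fail to run."""


def resolve_cluster_account(default_identity: str, mapping_type: str,
                            performer: str, fixed_account: str,
                            mappings: Dict[str, str]) -> str:
    """Return the cluster account that would be used to access the CDH cluster."""
    if default_identity == CLUSTER_ACCOUNT:
        # A fixed cluster account is used regardless of who runs the task.
        return fixed_account
    if mapping_type == NO_AUTHENTICATION:
        raise AccessIdentityError(
            "A mapped identity is selected, but the mapping type is No Authentication")
    account = mappings.get(performer)
    if account is None:
        raise AccessIdentityError("No account mapping is configured for " + performer)
    return account
```

For example, with mappings {"alice": "etl_user"} and a mapped default identity, a task that alice runs resolves to etl_user, whereas a task run by a member without a mapping raises AccessIdentityError.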
Click Complete. The CDH cluster is registered to DataWorks.
Step 3: Initialize a resource group
You must initialize the resource group that you use in the following cases: the first time you register a CDH cluster to DataWorks, when the service configurations of your CDH cluster change, or when the version of a component in your CDH cluster is updated. Initialization ensures that the resource group can access the CDH cluster and that CDH tasks run as expected with the current environment configurations of the resource group. For example, if you modify the core-site.xml configuration file of your CDH cluster, you must initialize the resource group. To initialize a resource group, go to the Cluster Management page in Management Center, find the CDH cluster that is registered to DataWorks, and then click Initialize Resource Group in the upper-right corner.
DataWorks allows you to use only serverless resource groups or exclusive resource groups for scheduling to run CDH tasks. Therefore, you can select only a serverless resource group or an exclusive resource group for scheduling when you initialize a resource group. If no serverless resource group or exclusive resource group for scheduling is available, create a serverless resource group or an exclusive resource group for scheduling based on your business requirements. For more information, see Create and use a serverless resource group and Create and use an exclusive resource group for scheduling.
If you register a cluster of a custom version to DataWorks, you can use only an old-version exclusive resource group for scheduling to run relevant tasks. After the registration is complete, you must submit a ticket to contact technical support to initialize the environment.
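Because a resource group must be initialized again after cluster configurations change, it can be useful to detect such changes automatically. The following Python sketch is one possible approach and is not a DataWorks feature. It stores SHA-256 fingerprints of local copies of the cluster configuration files and reports when they differ from the fingerprints recorded at the last initialization. The function names and the state file are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterable


def fingerprint(paths: Iterable[str]) -> Dict[str, str]:
    """Return a SHA-256 digest for each configuration file."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def needs_reinitialization(paths: Iterable[str], state_file: str) -> bool:
    """Return True if no fingerprints are recorded yet or any file has changed."""
    state = Path(state_file)
    if not state.exists():
        # Nothing recorded yet, which corresponds to the first registration.
        return True
    return json.loads(state.read_text()) != fingerprint(paths)


def record_initialization(paths: Iterable[str], state_file: str) -> None:
    """Save the current fingerprints after the resource group is initialized."""
    Path(state_file).write_text(json.dumps(fingerprint(paths), indent=2))
```

Call record_initialization only after Initialize Resource Group has completed in the console, so that a failed initialization does not hide a pending change. Component version updates do not necessarily alter these files, so the check covers configuration changes only.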
What to do next
Configure identity mappings for the CDH cluster: If you set the Default Access Identity parameter to a value other than Cluster Account when you register the CDH cluster to DataWorks, you must configure identity mappings for the CDH cluster. The identity mappings are used to isolate and control the permissions on the CDH cluster in DataWorks.
Data development: You can create CDH Hive, CDH Spark, CDH MapReduce, CDH Impala, or CDH Presto nodes in DataStudio for data development. For more information, see Use DataWorks for data development.