Register an EMR cluster to DataWorks - DataWorks - Alibaba Cloud Documentation Center

DataWorks allows you to create various nodes, such as Hive, MapReduce, Presto, and Spark SQL nodes, based on an E-MapReduce (EMR) cluster for various purposes, such as configuring a workflow for EMR tasks, periodically scheduling the workflow, and managing metadata of the workflow. This helps ensure that data can be generated and managed in an efficient and stable manner. This topic describes how to register an EMR cluster to DataWorks in same-account or cross-account mode.

Background information

EMR is a big data processing solution that runs on the Alibaba Cloud platform.

EMR is based on Apache Hadoop and Apache Spark, which allows you to use peripheral systems in the Hadoop and Spark ecosystems to perform data analysis and processing with ease. EMR can also read data from or write data to other Alibaba Cloud storage systems and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS. Alibaba Cloud allows you to deploy the EMR service in different forms based on your business requirements. The forms include Elastic Compute Service (ECS), Alibaba Cloud Kubernetes (ACK), and Serverless.

You can select various EMR components to run EMR tasks in DataWorks. The optimal configurations of different EMR components to run an EMR task vary. When you configure an EMR cluster, you can refer to Instruction on configuring an EMR cluster to select components that meet your business requirements.

Supported EMR cluster types

You must register an EMR cluster to DataWorks before you can use the cluster in the DataWorks console to run tasks. Before you can perform operations related to EMR in the DataWorks console, you must create required EMR clusters. You can register the following types of EMR clusters to DataWorks:

DataLake cluster (new data lake): created on the EMR on ECS page
Custom cluster: created on the EMR on ECS page
Hadoop cluster (old data lake): created on the EMR on ECS page
Important
- You can use EMR Hadoop clusters of the following versions in DataWorks:
  EMR V3.38.2, EMR V3.38.3, EMR V4.9.0, EMR V5.6.0, EMR V3.26.3, EMR V3.27.2, EMR V3.29.0, EMR V3.32.0, EMR V3.35.0, EMR V4.3.0, EMR V4.4.1, EMR V4.5.0, EMR V4.5.1, EMR V4.6.0, EMR V4.8.0, EMR V5.2.1, and EMR V5.4.3.
- We recommend that you do not use Hadoop clusters. We recommend that you migrate data from Hadoop clusters to DataLake clusters at the earliest opportunity. For more information, see Migrate data from a Hadoop cluster to a DataLake cluster.
Spark cluster: created on the EMR on ACK page
EMR Serverless StarRocks instance

Note

If your cluster cannot be registered to DataWorks, submit a ticket to contact technical support.

Limits

Task type: You cannot run EMR Flink tasks in the DataWorks console.

Task running: You can use a serverless resource group (recommended) or an old-version exclusive resource group for scheduling to run an EMR task.
Task governance:
- Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster is of V3.43.1, V5.9.1, or a minor version later than V3.43.1 or V5.9.1, you can view the table-level lineages and field-level lineages of the preceding nodes that are created based on the cluster.
  Note
  For Spark-based EMR nodes, if the EMR cluster is of V5.8.0, V3.42.0, or a minor version later than V5.8.0 or V3.42.0, the Spark-based EMR nodes can be used to view table-level and field-level lineages. If the EMR cluster is of a minor version earlier than V5.8.0 or V3.42.0, only the Spark-based EMR nodes that use Spark 2.x can be used to view table-level lineages.
- If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. EMR-HOOK can be configured for EMR Hive and EMR Spark SQL services. For more information, see Use the Hive extension feature to record data lineage and historical access information and Use the Spark SQL extension feature to record data lineage and historical access information.
Supported regions: EMR Serverless Spark is available only in the China (Zhangjiakou) region.
For an EMR cluster for which Kerberos authentication is enabled, you must add inbound rules of UDP ports to the security group of the EMR cluster for the CIDR block of the vSwitch with which a resource group is associated.
Note
To add an inbound rule, perform the following operations: Log on to the EMR console. Go to the Basic Information tab of your EMR cluster. In the Security section of the Basic Information tab, click the icon to the right of the Cluster Security Group parameter. On the Security Group Details tab of the Security Groups page, click the Inbound tab in the Access Rule section. On the Inbound tab, click Add Rule. Set the Protocol Type parameter to Custom UDP, the Port Range parameter to the configuration specified in the /etc/krb5.conf file of your EMR cluster, and the Authorization Object parameter to the CIDR block of the vSwitch with which a resource group is associated.

Prerequisites.

The identity that you want to use is prepared and granted the required permissions.
Only the following identities can register an EMR cluster. For information about how to grant permissions to a RAM user, see Grant permissions to RAM users.
- An Alibaba Cloud account
- A RAM user or RAM role that is assigned the Workspace Administrator role and attached the AliyunEMRFullAccess policy
- A RAM user or RAM role that is attached the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies
An EMR cluster that meets your business requirements is purchased.
For information about the types of EMR clusters that you can register to DataWorks, see the Supported EMR cluster types section in this topic.

Precautions

If you want to isolate EMR data in the development environment from EMR data in the production environment by using a workspace in standard mode, you must register different EMR clusters in the development and production environments of the workspace. In addition, the metadata of the EMR clusters must be stored by using one of the following methods:
- Method 1: Store the metadata in two different catalogs in DLF. We recommend that you use this method. For more information, see Use DLF for unified metadata storage.
- Method 2: Store the metadata in two different ApsaraDB RDS databases. For information about how to configure an ApsaraDB RDS database as the metadatabase of an EMR cluster, see Configure a self-managed ApsaraDB RDS for MySQL database.
You can register an EMR cluster to multiple workspaces within the same Alibaba Cloud account but cannot register an EMR cluster to multiple workspaces across Alibaba Cloud accounts. For example, if you register an EMR cluster to a workspace within the current Alibaba Cloud, you cannot register the cluster to a workspace in another Alibaba Cloud account.

Step 1: Go to the Register EMR Cluster page

Go to the SettingCenter page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.

Step 2: Register an EMR cluster

On the Register EMR Cluster page, configure cluster information.

Note

If your workspace is in standard mode, you must configure separate cluster information for the development environment and production environment. For information about workspace modes, see Differences between workspaces in basic mode and workspaces in standard mode.

Cluster Display Name: the name of the EMR cluster in DataWorks. The name must be unique within the current tenant.
Alibaba Cloud Account To Which Cluster Belongs: the type of the Alibaba Cloud account to which the EMR cluster you want to register in the current workspace belongs. Valid values:
- Current Alibaba Cloud Account: the current Alibaba Cloud account
- Another Alibaba Cloud Account: another Alibaba Cloud account
  Note
  You cannot register an EMR Serverless Spark cluster across Alibaba Cloud accounts. This means that you cannot register an EMR Serverless Spark cluster that belongs to another Alibaba Cloud account to a workspace of the current Alibaba Cloud account.

The parameters that you must configure vary based on the account type that you select. The following tables describe the parameters.

Parameters that need to be configured if you select Current Alibaba Cloud Account

If you set the Alibaba Cloud Account To Which Cluster Belongs parameter to Current Alibaba Cloud Account, you must configure the parameters in the following table.

Parameter	Description
Cluster Type	The type of the EMR cluster that you want to register. For information about the types of EMR clusters that you can register to DataWorks, see the Supported EMR cluster types section in this topic.
Cluster	The EMR cluster that you want to register. Note If you set the Cluster Type parameter to EMR Serverless Spark, you must configure parameters such as E-MapReduce Work Space, Default Engine Version, and Default Resource Queue, which are displayed after you select EMR Serverless Spark. You must set the E-MapReduce Work Space parameter to the EMR cluster that you want to register.
Default Access Identity	The identity that you want to use to access the EMR cluster in the current workspace. Development environment: You can select Cluster Account: `hadoop` or Cluster Account Mapped to Account of Task Executor. Production environment: You can select Cluster Account: `hadoop`, Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User. Note If you select Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User for the Default Access Identity parameter, you can configure a mapping between a DataWorks tenant member and a specified EMR cluster account. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts. The mapped EMR cluster account is used to run EMR tasks in DataWorks. If no mapping is configured between a DataWorks tenant member and an EMR cluster account, DataWorks implements the following policies on task running: If you set the Default Access Identity parameter to Cluster Account Mapped to RAM User and select a RAM user from the RAM User drop-down list, the EMR cluster account that has the same name as the RAM user is automatically used to run EMR tasks in DataWorks. If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR tasks fail to be run. If you set the Default Access Identity parameter to Cluster Account Mapped to Alibaba Cloud Account, errors will be reported when EMR tasks are run in DataWorks.
Pass Proxy User Information	Specifies whether to pass the proxy user information. Note If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR cluster issues an authentication credential to each ordinary user, which can be a cumbersome operation. To manage permissions of users in a centralized manner, a superuser (real user) is used for permission authentication on behalf of a proxy user (ordinary user). When the proxy user accesses the EMR cluster, the identity authentication information of the superuser is used. You need to add a user as a proxy user if you want the user to be authenticated by using the identity authentication information of a superuser. Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the proxy user. DataStudio and DataAnalysis: The name of the Alibaba Cloud account used by the task executor is dynamically passed. The proxy user information is the account information of the task executor. Operation Center: The name of the Alibaba Cloud account used by the default access identity, which is specified when you register the EMR cluster, is consistently passed. The proxy user information is the account information of the default access identity. Do Not Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the account authentication method that is specified when you register the EMR cluster. The method used to pass the proxy user information varies based on the type of an EMR task: EMR Kyuubi tasks: The proxy user information is passed by using the `hive.server2.proxy.user` configuration item. EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks: The proxy user information is passed by using the `-proxy-user` configuration item.
Configuration files	If you set the Cluster Type parameter to HADOOP, you must also upload the configuration files that are required. You can obtain the configuration files in the EMR console. For more information, see Export and import service configurations. After you export a service configuration file, change the name of the file based on the file upload requirements of the GUI. You can also log on to the EMR cluster that you want to register, and go to the following paths to obtain the required configuration files: `/etc/ecm/hadoop-conf/core-site.xml /etc/ecm/hadoop-conf/hdfs-site.xml /etc/ecm/hadoop-conf/mapred-site.xml /etc/ecm/hadoop-conf/yarn-site.xml /etc/ecm/hive-conf/hive-site.xml /etc/ecm/spark-conf/spark-defaults.conf /etc/ecm/spark-conf/spark-env.sh`

Parameters that need to be configured if you select Another Alibaba Cloud Account

If you set the Alibaba Cloud Account To Which Cluster Belongs parameter to Another Alibaba Cloud Account, you must configure the parameters in the following table.

Parameter	Description
Alibaba Cloud Account UID	The UID of the Alibaba Cloud account to which the EMR cluster you want to register belongs.
RAM Role	The RAM role that you want to use to access the EMR cluster. The RAM role must meet the following requirements: The RAM role is created within the specified Alibaba Cloud account. The RAM role is authorized to access the DataWorks service activated within the current logon account. Note For information about how to register an EMR cluster across accounts, see Scenario: Register a cross-account EMR cluster.
EMR Cluster Type	The type of the EMR cluster that you want to register. You can register only `DataLake clusters`, `Hadoop clusters`, and `custom clusters` that are created on the EMR on ECS page across accounts.
EMR Cluster	The EMR cluster that you want to register.
Configuration files	The configuration files that are required. You can configure the parameters that are displayed to upload the required configuration files. For information about how to obtain configuration files, see Export and import service configurations. After you export a service configuration file, change the name of the file based on the file upload requirements of the GUI. You can also log on to the EMR cluster that you want to register, and go to the following paths to obtain the required configuration files: `/etc/ecm/hadoop-conf/core-site.xml /etc/ecm/hadoop-conf/hdfs-site.xml /etc/ecm/hadoop-conf/mapred-site.xml /etc/ecm/hadoop-conf/yarn-site.xml /etc/ecm/hive-conf/hive-site.xml /etc/ecm/spark-conf/spark-defaults.conf /etc/ecm/spark-conf/spark-env.sh`
Default Access Identity	The identity that you want to use to access the EMR cluster in the current workspace. Development environment: You can select Cluster Account: hadoop or Cluster Account Mapped to Account of Task Executor. Production environment: You can select Cluster Account: hadoop, Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User. Note If you select Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User for the Default Access Identity parameter, you can configure a mapping between a DataWorks tenant member and a specified EMR cluster account. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts. The mapped EMR cluster account is used to run EMR tasks in DataWorks. If no mapping is configured between a DataWorks tenant member and an EMR cluster account, DataWorks implements the following policies on task running: If you set the Default Access Identity parameter to Cluster Account Mapped to RAM User and select a RAM user from the RAM User drop-down list, the EMR cluster account that has the same name as the RAM user is automatically used to run EMR tasks in DataWorks. If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR tasks fail to be run. If you set the Default Access Identity parameter to Cluster Account Mapped to Alibaba Cloud Account, errors will be reported when EMR tasks are run in DataWorks.
Pass Proxy User Information	Specifies whether to pass the proxy user information. Note If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR cluster issues an authentication credential to each ordinary user, which can be a cumbersome operation. To manage permissions of users in a centralized manner, a superuser (real user) is used for permission authentication on behalf of a proxy user (ordinary user). When the proxy user accesses the EMR cluster, the identity authentication information of the superuser is used. You need to add a user as a proxy user if you want the user to be authenticated by using the identity authentication information of a superuser. Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the proxy user. DataStudio and DataAnalysis: The name of the Alibaba Cloud account used by the task executor is dynamically passed. The proxy user information is the account information of the task executor. Operation Center: The name of the Alibaba Cloud account used by the default access identity, which is specified when you register the EMR cluster, is consistently passed. The proxy user information is the account information of the default access identity. Do Not Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the account authentication method that is specified when you register the EMR cluster. The method used to pass the proxy user information varies based on the type of an EMR task: EMR Kyuubi tasks: The proxy user information is passed by using the `hive.server2.proxy.user` configuration item. EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks: The proxy user information is passed by using the `-proxy-user` configuration item.

Step 3: Initialize a resource group

If you register an EMR cluster to DataWorks for the first time, modify the service configurations of your EMR cluster, such as configurations in the core-site.xml file, or update the version of a component in your EMR cluster, you must initialize the resource group that you use. This ensures that the resource group can properly access the EMR cluster and EMR tasks can run as expected by using the current environment configurations of the resource group. To initiate a resource group, perform the following steps:

Go to the Cluster Management page in SettingCenter. Find the desired EMR cluster that is registered to DataWorks and click Initialize Resource Group in the section that displays the information of the EMR cluster.
In the Initialize Resource Group dialog box, find the desired resource group and click Initialize.
After the initialization is complete, click Confirmation.

Note

DataWorks allows you to use serverless resource groups (recommended) or old-version exclusive resource groups for scheduling to run EMR tasks. Therefore, you can select a serverless resource group or an exclusive resource group for scheduling when you need to initialize a resource group.
Resource group initialization may cause failure of tasks that are in progress. Therefore, we recommend that you initialize a resource group during off-peak hours unless otherwise required. For example, if cluster configurations are modified, you must immediately reinitialize a specified resource group. Otherwise, a large number of tasks may fail to run.

What to do next

Data development: You can refer to General development process to configure the required component environments.
Configure identity mappings for the EMR cluster: If you log on to the DataWorks console as a RAM user and set the Default Access Identity parameter to a value other than Cluster Account: hadoop when you register the EMR cluster to DataWorks, you must configure identity mappings for the EMR cluster. The identity mappings are used to control the permissions of the RAM user on the EMR cluster in DataWorks.
Specify global YARN queues: You can specify global YARN queues that can be used by each service of DataWorks and specify whether the global settings can overwrite the settings that are separately configured in each service.
Configure global Spark properties: You can refer to the official documentation for Spark to configure global Spark-related parameters. In addition, you can specify whether the global settings can overwrite the Spark-related parameters that are separately configured in each service of DataWorks and have the same names as the global parameters.
Configure the Kyuubi connection information: If you want to log on to Kyuubi by using a custom account and password to run related tasks, you can configure the Kyuubi connection information based on your business requirements by referring to this topic.