E-MapReduce:Create a cluster

Last Updated:Dec 12, 2024

Alibaba Cloud E-MapReduce (EMR) allows you to build and run open source big data frameworks such as Hadoop, Spark, Hive, and Presto for large-scale data processing and analysis. This topic describes how to create an EMR cluster on the EMR on ECS page in the EMR console.

Note

If you create your first EMR cluster after 17:00 (UTC+8) on December 19, 2022, you cannot create a Hadoop, Data Science, Presto, or ZooKeeper cluster.

Prerequisites

RAM authorization is complete. For more information, see Assign roles to an Alibaba Cloud account.

Precautions

When you create a DataLake cluster, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1 or a later minor version, or of EMR V3.46.1 or a later minor version, you can remove a newly added task node group if the services that you select do not depend on its nodes. To do so, click Remove Node Group in the Actions column of the task node group in the Node Group section.

Procedure

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.

    • The region of a cluster cannot be changed after the cluster is created.

    • By default, all resource groups in your account are displayed.

  3. On the EMR on ECS page, click Create Cluster.

  4. Configure the cluster as prompted.

    When you create a cluster, you need to configure the software, hardware, and basic information, and confirm the order for the cluster.

    Note

    After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correctly configured when you create a cluster.

  5. After you verify that all configurations are correct, read the terms of service and select the check box.

  6. Click Confirm.

    Important
    • Pay-as-you-go clusters: The cluster is created immediately. After the cluster is created, the cluster is in the Running state.

    • Subscription clusters: An order is generated. The cluster will be created after you complete the payment.

Parameter description

Software parameters

Parameter

Description

Region

The geographic location where the Elastic Compute Service (ECS) instances of the cluster are located. To ensure minimal network latency, select a region that is close to your geographical location. After the cluster is created, you cannot change the region.

Select a region from the drop-down list.

Business Scenario

Select a business scenario based on your business requirements. Valid values:

  • Data Lake: provides a big data compute engine that allows you to analyze data in a flexible, reliable, and efficient manner.

    • Supports the data lake architecture and accelerates data queries in data lakes based on JindoFS.

    • Supports the OSS-HDFS (fully managed HDFS) service for storage, which helps you reduce O&M costs. You are charged based on actual usage of the OSS-HDFS service.

    For more information, see DataLake cluster.

  • Data Analytics: provides efficient, real-time, and flexible data analytics capabilities to meet the requirements of various business scenarios, such as user profiling, recipient selection, BI reports, and business analytics. You can write data to online analytical processing (OLAP) engines such as ClickHouse and StarRocks for analysis by importing data or using external tables.

  • Real-time Data Streaming: provides an end-to-end (E2E) real-time computing solution. Dataflow clusters incorporate Kafka, a distributed messaging system with high throughput and scalability, and the commercial Flink kernel provided by Ververica, which is powered by Apache Flink. The clusters address E2E real-time computing requirements and are widely used in real-time extract, transform, and load (ETL) scenarios and in log collection and analysis scenarios. You can use either component or both.

  • Data Service:

    • Provides a DataServing cluster that allows you to analyze data in a flexible, reliable, and efficient manner.

    • Provides semi-managed HBase clusters and can decouple computing clusters from data storage based on the OSS-HDFS (JindoFS) service.

    • Supports data caching by using JindoData to improve the read and write performance of DataServing clusters.

    For more information, see DataServing cluster.

  • Custom Cluster: provides various services. You can select services based on your business requirements.

    Note

    We recommend that you do not deploy multiple storage services on the same node group in the production environment.

Product Version

The version of EMR. For more information, see Overview.

High Service Availability

By default, this switch is turned off. If you turn on the switch, multiple master nodes are created in the cluster to ensure the high availability of the ResourceManager and NameNode processes. In addition, EMR distributes the master nodes across different underlying hardware devices to reduce the risk of failures.

Optional Services (Select One At Least)

The services that you can select for the cluster. You can select services based on your business requirements. The processes related to the services that you select are automatically started.

Important
  • The more services you select, the higher instance specifications a cluster needs to handle the services. You must select the instance type that matches the number of services you specified when you configure the hardware. Otherwise, resources may be insufficient for the cluster to run the services.

  • The services cannot be uninstalled after they are deployed in an EMR cluster.

  • The parameters that you need to configure vary based on the product version and services that you select.

Collect Service Operational Logs

Specifies whether to enable log collection for all services. By default, this switch is turned on to collect the service operational logs of your cluster. The logs are used only for cluster diagnostics.

After you create a cluster, you can modify the Collection Status of Service Operational Logs parameter on the Basic Information tab.

Important

If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impact of doing so, see How do I stop collection of service operational logs?

Metadata

The method for storing and managing metadata. Valid values:

  • DLF Unified Metadata: Metadata is stored in Data Lake Formation (DLF). We recommend that you select this method.

    After you activate DLF, the system selects a DLF catalog for you to store metadata. The ID of your account is used by default. If you want different clusters to be associated with different DLF catalogs, you can perform the following operations to create DLF catalogs:

    1. Click Create Catalog. In the popover that appears, enter a catalog ID and click OK.

    2. Select the catalog that you created from the DLF Catalog drop-down list.

  • Self-managed RDS: Metadata is stored in a self-managed or Alibaba Cloud ApsaraDB RDS database.

    If you select Self-managed RDS, you must configure the parameters of the existing ApsaraDB RDS database. For more information, see Configure a self-managed ApsaraDB RDS for MySQL database.

  • Built-in MySQL: Metadata is stored in the local MySQL database of your cluster. This method is not recommended for production environments.

    Note
    • Test environment: We recommend that you select Built-in MySQL.

    • Production environment: We recommend that you select DLF Unified Metadata or Self-managed RDS.
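The recommendation in the preceding note can be expressed as a small lookup. The following Python sketch is illustrative only: the enum, the function name, and the "test" and "production" labels are hypothetical and are not part of any EMR API.

```python
from enum import Enum
from typing import List


class MetadataStore(Enum):
    """Metadata options as they are named on the cluster creation page."""
    DLF_UNIFIED = "DLF Unified Metadata"
    SELF_MANAGED_RDS = "Self-managed RDS"
    BUILT_IN_MYSQL = "Built-in MySQL"


def recommended_metadata_stores(environment: str) -> List[MetadataStore]:
    """Return the options recommended for a 'test' or 'production' environment."""
    env = environment.strip().lower()
    if env == "test":
        return [MetadataStore.BUILT_IN_MYSQL]
    if env == "production":
        return [MetadataStore.DLF_UNIFIED, MetadataStore.SELF_MANAGED_RDS]
    raise ValueError(f"unknown environment: {environment!r}")


if __name__ == "__main__":
    for store in recommended_metadata_stores("production"):
        print(store.value)
```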

Root Storage Directory of Cluster

The root storage directory of cluster data. This parameter is required only if you select the OSS-HDFS service.

Important

If you click Create OSS-HDFS Bucket to create a bucket, you can read data from or write data to the bucket only in the EMR console. You cannot perform operations on the bucket in the OSS console or by using a specified API.

The first time you use OSS-HDFS, you must complete authorization as prompted. If you use a RAM user, you must attach the AliyunEMRDlsFullAccess policy and assign the AliyunOSSDlsDefaultRole and AliyunEMRDlsDefaultRole roles to the RAM user by using your Alibaba Cloud account. For more information, see Grant permissions to RAM users. Select a bucket for which OSS-HDFS is enabled in the same region, or click Create OSS-HDFS Bucket to create an OSS-HDFS bucket as the root storage path of the cluster.

Note
  • Before you use the OSS-HDFS service, make sure that the OSS-HDFS service is available in the region in which you want to create a cluster. If the OSS-HDFS service is unavailable in the region, you can change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions.

  • You can select the OSS-HDFS service when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a minor version later than EMR V5.12.1 or EMR V3.46.1.

Parameters related to services and product version

You need to configure the following parameters based on the services and product version that you select.

  • If you create a cluster of EMR V5.12.0, EMR V3.46.0, or a minor version earlier than EMR V5.12.0 or EMR V3.46.0 and select the Hive service for the cluster, you must configure the following parameter.

    Parameter

    Description

    Hive Storage Mode

    The storage mode of Hive data. By default, the Data Lake Storage check box is selected, and an OSS-HDFS or OSS directory is used for storage. If you clear the check box, HDFS of the cluster is used for storage.

    If you do not clear the check box, you must configure the Hive Data Warehouse Path parameter. We recommend that you select a bucket for which the OSS-HDFS service is enabled.

    Note

    Make sure that you have the required permissions to access the selected OSS or OSS-HDFS bucket.

  • If you create a cluster of EMR V5.12.0, EMR V3.46.0, or a minor version earlier than EMR V5.12.0 or EMR V3.46.0 and select the HBase service for the cluster, you must configure the following parameter.

    Parameter

    Description

    HBase Storage Mode

    The storage mode of HBase data files. Valid values: OSS-HDFS and OSS.

    If you set the HBase Storage Mode parameter to OSS-HDFS, you must configure the HBase Storage Path parameter. We recommend that you select a bucket for which the OSS-HDFS service is enabled.

  • If you create a cluster of EMR V5.12.1 or a later minor version, or of EMR V3.46.1 or a later minor version and select the OSS-HDFS and HBase services for the cluster, you must configure the following parameter. After the cluster is created, the HBase-HDFS service is automatically deployed. For more information, see HBase-HDFS.

    Parameter

    Description

    HBase Log Storage

    This check box is selected by default, which indicates that HBase stores HLog files in HDFS.
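The version rules in this section can be summarized in code. The following Python sketch returns the storage-related parameters that must be configured for a given version and set of services. The version string format (for example, "EMR-5.12.1"), the service labels, and all function names are assumptions made for illustration. Only the V5 and V3 version lines that this topic describes are handled.

```python
import re
from typing import List, Set, Tuple

# First version in each major line that uses the newer parameter set.
_NEW_STYLE_FROM = {5: (5, 12, 1), 3: (3, 46, 1)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from a string such as 'EMR-5.12.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_new_style(version: str) -> bool:
    """True for V5.12.1 or V3.46.1 and later minor versions of the same major line."""
    parsed = parse_version(version)
    threshold = _NEW_STYLE_FROM.get(parsed[0])
    if threshold is None:
        raise ValueError(f"major version not covered by this sketch: {version!r}")
    return parsed >= threshold


def storage_parameters(version: str, services: Set[str]) -> List[str]:
    """List the version-dependent parameters to configure for the selected services."""
    params: List[str] = []
    if is_new_style(version):
        if {"OSS-HDFS", "HBase"} <= services:
            params.append("HBase Log Storage")
    else:
        if "Hive" in services:
            params.append("Hive Storage Mode")
        if "HBase" in services:
            params.append("HBase Storage Mode")
    return params


if __name__ == "__main__":
    print(storage_parameters("EMR-5.12.0", {"Hive", "HBase"}))
    print(storage_parameters("EMR-5.12.1", {"OSS-HDFS", "HBase"}))
```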

More

Important

If this is the first time you create an EMR cluster after 17:00 (UTC+8) on December 19, 2022, you cannot create a Data Science, Hadoop, Presto, or ZooKeeper cluster.

  • Machine Learning: is used for big data and AI scenarios.

    • Provides a distributed deep learning framework.

    • Provides more than 200 typical machine learning algorithm packages.

    • Provides AutoML capabilities and more than 10 deep learning algorithms, covering scenarios such as recommendation and advertising.

  • Old Data Lake: provides frameworks and pipelines for you to process and analyze large amounts of data, and supports open source components such as Apache Hive, Spark, and Presto. The following types of clusters are supported:

    • Hadoop:

      • Provides a complete list of open source components that are fully compatible with the Hadoop ecosystem.

      • Supports various scenarios such as big data offline processing, real-time processing, and interactive query.

      • Supports the data lake architecture and accelerates data queries in data lakes based on JindoFS.

    • ZooKeeper: provides a distributed and consistent lock service that facilitates coordination among large-scale Hadoop, HBase, and Kafka clusters.

    • Presto: is an in-memory distributed SQL engine used for interactive queries. Presto clusters support various data sources and are suitable for complex analysis of petabytes of data and cross-data source queries.

(Optional) Advanced Settings

Parameter

Description

Kerberos Authentication

Specifies whether to enable Kerberos authentication for the cluster. This switch is turned off by default. Kerberos is an identity authentication protocol based on symmetric-key cryptography. Kerberos provides the identity authentication feature for other services. For more information, see Overview.

Important
  • Knox: Kerberos authentication is not supported.

  • Kudu: If you enable Kerberos authentication for Kudu, you must make additional configurations for Kerberos authentication to take effect. For more information, see Authentication.

Custom Software Configuration

Specifies whether to customize the configurations of software. You can use a JSON file to customize the configurations of basic software required for a cluster, such as Hadoop, Spark, and Hive. For more information, see Customize software configurations.

Note

For more information about how to configure the parallelism of Hive jobs, see How do I estimate the maximum number of Hive jobs that can be concurrently run?
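As a rough illustration of the Custom Software Configuration parameter, the following Python sketch builds a JSON document that overrides two configuration items. The field names (ApplicationName, ConfigFileName, ConfigItemKey, and ConfigItemValue) and the chosen values are assumptions. Verify the exact schema in Customize software configurations before you use a file like this.

```python
import json
from typing import Dict, List


def config_item(application: str, file_name: str, key: str, value: object) -> Dict[str, str]:
    """Build one configuration override; the field names are assumed, not confirmed."""
    return {
        "ApplicationName": application,
        "ConfigFileName": file_name,
        "ConfigItemKey": key,
        "ConfigItemValue": str(value),
    }


def build_custom_config(items: List[Dict[str, str]]) -> str:
    """Serialize the overrides as a JSON array."""
    return json.dumps(items, indent=2)


# Example overrides; the values are placeholders, not tuning advice.
overrides = [
    config_item("YARN", "yarn-site.xml", "yarn.nodemanager.resource.memory-mb", 8192),
    config_item("HIVE", "hive-site.xml", "hive.exec.parallel", "true"),
]
payload = build_custom_config(overrides)

if __name__ == "__main__":
    print(payload)
```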

Hardware parameters

Parameter

Description

Billing Method

The billing method of the cluster. Subscription is selected by default. EMR supports the following billing methods:

  • Pay-as-you-go: a billing method that allows you to pay for an instance after you use the instance. The system charges you for a cluster based on the hours the cluster is actually used. Bills are generated on an hourly basis at the top of every hour. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.

  • Subscription: a billing method that allows you to use an instance only after you pay for the instance.

    Note
    • We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.

    • If you select Subscription for Billing Method, you must also configure the Subscription Duration and Auto-renewal parameters. By default, the subscription period is six months and the Auto-renewal switch is turned on. If you turn on Auto-renewal, the system renews your subscription for one more month seven days before the expiration date. For more information, see Renewal policy.
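To make the auto-renewal timing concrete, the following Python sketch computes the date on which the system would renew a subscription, based on the seven-day lead time described in the preceding note. The function name and the sample expiration date are hypothetical.

```python
from datetime import date, timedelta

# Auto-renewal runs seven days before the subscription expires.
AUTO_RENEW_LEAD = timedelta(days=7)


def auto_renewal_date(expiration: date) -> date:
    """Return the date on which auto-renewal is triggered."""
    return expiration - AUTO_RENEW_LEAD


if __name__ == "__main__":
    # Hypothetical expiration date used only for illustration.
    print(auto_renewal_date(date(2025, 6, 30)))
```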

Zone

The zone where you want to create a cluster. A zone in a region is a physical area with independent power supplies and network facilities. Clusters in zones within the same region can communicate with each other over an internal network. In most cases, you can use the zone that is selected by default.

VPC

The virtual private cloud (VPC) where you want to deploy the cluster. A VPC is a logically isolated network over which you have full control.

You can select an existing VPC or click Create VPC to create a VPC in the VPC console. For more information, see Create and manage a VPC.

Note

The internal IP address of the cluster is associated with the VPC. Therefore, you cannot modify the internal IP address after the cluster is created.

vSwitch

The vSwitch of the cluster. A vSwitch is a basic component of a VPC and is used to establish network communication between cloud resources.

You can select an existing vSwitch or click Create vSwitch to create a vSwitch in the VPC console. For more information, see Create and manage a vSwitch.

Default Security Group

The security group of the cluster. A security group is a virtual firewall that is used to control the inbound and outbound traffic of instances in the security group. For more information, see Overview.

You can select an existing security group or click create a new security group to create a security group in the ECS console. For more information, see Create a security group.

Important

Do not use an advanced security group that is created in the ECS console.

Node Group

The node groups of the cluster. You can select instance types based on your business requirements. For more information, see Instance families.

  • Master node group: runs control processes, such as ResourceManager and NameNode.

  • Core node group: stores all the data of a cluster. You can add core nodes based on your business requirements after a cluster is created.

  • Task node group: stores no data and is used to adjust the computing capabilities of clusters. No task node group is configured by default. You can configure a task node group based on your business requirements.

  • Add to Deployment Set: If you turn on the High Service Availability switch, the master nodes are added to a deployment set by default. A deployment set is used to control the distribution of ECS instances. For more information, see Deployment set.

  • System Disk: You can select a standard SSD, enhanced SSD, or ultra disk based on your business requirements. You can adjust the size of the system disk based on your business requirements.

  • Data Disk: You can select standard SSDs, enhanced SSDs, or ultra disks based on your business requirements. You can adjust the size of the data disks based on your business requirements.

    Note

    If you select enhanced SSDs, you can specify different performance levels (PLs) for the enhanced SSDs based on the disk capacity to meet different cluster performance requirements. The default performance level is PL1. When you configure the system disk, you can select an enhanced SSD of the following performance levels: PL0, PL1, and PL2. When you configure data disks, you can select enhanced SSDs of the following performance levels: PL0, PL1, PL2, and PL3. For more information, see Disks.

  • Instances: One master node is configured by default. If you turn on the High Service Availability switch, multiple master nodes can be configured.

    Two core nodes are configured in the core node group by default. You can change the number of core nodes based on your business requirements.

  • Additional Security Group: An additional security group allows interactions between different external resources and applications. You can associate a node group with up to two additional security groups.

  • Assign Public Network IP: specifies whether to associate an EIP address with the cluster. This switch is turned off by default. You can assign public IP addresses only to the node groups of DataLake clusters.

    Note

    If you do not turn on this switch but want to access the cluster over the Internet after you create the cluster, you must apply for a public IP address on ECS. For information about how to apply for an EIP address, see Elastic IP addresses.
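Several of the node group limits above can be checked before you submit the configuration. The following Python sketch validates the performance levels of enhanced SSDs and the number of additional security groups. The class, field names, and sample security group IDs are hypothetical, and the sketch assumes that both the system disk and the data disks are enhanced SSDs.

```python
from dataclasses import dataclass, field
from typing import List

# Performance levels that can be selected for enhanced SSDs.
SYSTEM_DISK_PLS = ("PL0", "PL1", "PL2")
DATA_DISK_PLS = ("PL0", "PL1", "PL2", "PL3")
MAX_ADDITIONAL_SECURITY_GROUPS = 2


@dataclass
class NodeGroup:
    """Hypothetical view of one node group; PL1 is the default performance level."""
    name: str
    system_disk_pl: str = "PL1"
    data_disk_pl: str = "PL1"
    additional_security_groups: List[str] = field(default_factory=list)


def validate_node_group(group: NodeGroup) -> List[str]:
    """Return a list of problems; an empty list means that the checks passed."""
    problems: List[str] = []
    if group.system_disk_pl not in SYSTEM_DISK_PLS:
        problems.append(f"{group.name}: system disk does not support {group.system_disk_pl}")
    if group.data_disk_pl not in DATA_DISK_PLS:
        problems.append(f"{group.name}: data disk does not support {group.data_disk_pl}")
    if len(group.additional_security_groups) > MAX_ADDITIONAL_SECURITY_GROUPS:
        problems.append(f"{group.name}: more than {MAX_ADDITIONAL_SECURITY_GROUPS} additional security groups")
    return problems


if __name__ == "__main__":
    group = NodeGroup("core", system_disk_pl="PL3", additional_security_groups=["sg-a", "sg-b", "sg-c"])
    for problem in validate_node_group(group):
        print(problem)
```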

Basic parameters

Parameter

Description

Cluster Name

The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
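The naming rule can be checked with a regular expression. The following Python sketch assumes that "letters" means ASCII letters only, which this topic does not confirm. The function name is hypothetical.

```python
import re

# 1 to 64 characters: letters, digits, hyphens, and underscores.
# Assumption: only ASCII letters are accepted.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Check a cluster name against the documented length and character rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("emr-prod_01", "bad name", ""):
        print(repr(candidate), is_valid_cluster_name(candidate))
```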

Identity Credentials

The credentials that are used to log on to the master node of the cluster. For more information, see Log on to a cluster. Valid values:

  • Key Pair (default): Select an existing key pair or click Create Key Pair to create a key pair.

    Key pairs are a secure and convenient authentication method provided for ECS instance logons. Only Linux instances support key pair-based authentication. For information about how to use a key pair, see SSH key pair overview.

  • Password: Configure a password for the master node and confirm the password. By default, the username is root.

    • The password must be 8 to 30 characters in length and must contain uppercase letters, lowercase letters, digits, and special characters.

    • The following special characters are supported: ! @ # $ % ^ & *
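The password rules can be verified locally before you enter the password in the console. The following Python sketch assumes that letters are ASCII letters and that no characters other than letters, digits, and the listed special characters are allowed. Both points are assumptions, and the function name is hypothetical.

```python
import string

# Special characters listed as supported.
SPECIAL_CHARACTERS = set("!@#$%^&*")
_ALLOWED = set(string.ascii_letters + string.digits) | SPECIAL_CHARACTERS


def is_valid_password(password: str) -> bool:
    """Check the length, the allowed characters, and the four required character classes."""
    if not 8 <= len(password) <= 30:
        return False
    if any(ch not in _ALLOWED for ch in password):
        return False
    return (
        any(ch in string.ascii_uppercase for ch in password)
        and any(ch in string.ascii_lowercase for ch in password)
        and any(ch in string.digits for ch in password)
        and any(ch in SPECIAL_CHARACTERS for ch in password)
    )


if __name__ == "__main__":
    for candidate in ("Emr@2024x", "emr@2024x", "Em@1"):
        print(repr(candidate), is_valid_password(candidate))
```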

(Optional) Advanced Settings

Parameter

Description

ECS Application Role

You can assign an ECS application role to a cluster. EMR applies for a temporary AccessKey pair when applications running on the compute nodes of the cluster access other Alibaba Cloud services, such as OSS. This way, you do not need to manually enter an AccessKey pair. You can grant the application role access permissions on specific Alibaba Cloud services based on your business requirements.

Bootstrap Actions

You can configure bootstrap actions to run custom scripts before a cluster starts. You can use bootstrap actions to install third-party software and modify the runtime environment of your clusters. For more information, see Manage bootstrap actions.

Release Protection

You can turn on Release Protection when you create a pay-as-you-go cluster or after the cluster is created to prevent the cluster from being accidentally released. After you enable release protection for a cluster, you cannot directly release the cluster. To release the cluster, you must disable release protection. For more information, see Enable and disable release protection.

Tags

You can add a tag when you create a cluster or add a tag on the Basic Information tab after a cluster is created. Tags help you identify and manage cluster resources. For more information, see Manage and use tags.

Resource Group

You can group your resources based on usage, permissions, and ownership. For more information, see Use resource groups.

Data Disk Encryption

You can turn on this switch only when you create a cluster. If you turn on this switch, both data in transit and data at rest on the disk are encrypted. For more information, see Enable data disk encryption.

System Disk Encryption

You can turn on this switch only when you create a cluster. After you enable the system disk encryption feature for an EMR cluster, the operating system, program files, and other system-related data on the system disk are encrypted. For more information, see Enable system disk encryption.

Remarks

Remarks are used to record important information about an EMR cluster. If you do not configure the Remarks parameter when you create a cluster, you can add remarks after the cluster is created. You can also modify existing remarks on the Basic Information tab.

Order confirmation

Optional. If a key pair is used for identity authentication, you can click Save as Cluster Template to save the configurations of the current cluster as a cluster template.

  1. In the Save as Cluster Template dialog box, configure the Cluster Template Name and Cluster Template Resource Group parameters.

    Parameter

    Description

    Cluster Template Name

    Enter a cluster template name to facilitate template management. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).

    Cluster Template Resource Group

    Select an existing resource group based on your business requirements to manage cluster templates by group.

    If you want to use a new resource group, click Create Resource Group to create one. For more information, see Create a resource group.

  2. Click OK.

    A cluster template is created in the Manage Cluster Templates panel. For more information about cluster templates, see Create a cluster template.

References

  • For information about cluster-related issues, see FAQ about cluster management.

  • For information about how to add services to an existing cluster, see Add services.

  • For information about how to log on to a cluster, see Log on to a cluster.

  • For information about how to select an instance type, see ECS instances.

  • For information about component-related issues, see FAQ.

  • For information about how to create a cluster by calling an API operation, see CreateCluster.