This topic describes how to create an E-MapReduce (EMR) cluster.
Prerequisites
Procedure
- Go to the cluster creation page.
- Log on to the Alibaba Cloud EMR console.
- In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.
- The region of a cluster cannot be changed after the cluster is created.
- All resource groups within your account are displayed by default.
- Click Cluster Wizard in the Clusters section.
- Configure the cluster. To create a cluster, you must configure software parameters, hardware parameters, and basic parameters as guided by the wizard.
Important: After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correctly configured when you create a cluster.
- Configure software parameters.
Cluster Type: The type of the cluster that you want to create. EMR supports the following types of clusters:
- Hadoop:
- Provides Hadoop, Hive, and Spark components that serve as semi-hosted services and are used to store and compute large-scale distributed data offline.
- Provides Presto and Impala components for interactive queries.
- Provides other Hadoop ecosystem components, such as Oozie and Pig.
- Data Science: Data Science clusters are commonly used in big data and AI scenarios. Data Science clusters support the offline extract, transform, load (ETL) of big data based on Hive and Spark, and TensorFlow model training. You can choose the CPU+GPU heterogeneous computing framework and deep learning algorithms supported by NVIDIA GPUs to run computing jobs more efficiently.
- Druid: provides a semi-hosted, real-time, and interactive analytics service. Druid clusters can query big data within milliseconds and ingest data in multiple ways. You can use Druid clusters with services such as EMR Hadoop, EMR Spark, Object Storage Service (OSS), and ApsaraDB RDS to build a flexible and stable system for real-time queries.
- Presto: an open source interactive query engine that provides the SQL on Everything capability. Presto clusters can be used to quickly analyze and query data of any size. Presto clusters support non-relational data sources.
Cloud Native Option: on ECS is selected by default.
EMR Version: The major version of EMR. The latest version is selected by default.
Required Services: The default components required for a specific cluster type. After a cluster is created, you can start or stop components on the cluster management page.
Optional Services: The other components that you can specify based on your business requirements. By default, the relevant service processes for the components you specify are started.
Note: The more components you specify, the higher the instance specifications that the cluster needs to run them. When you configure the hardware, select an instance type that matches the number of components you specified. Otherwise, the cluster may have insufficient resources to run the components.
Advanced Settings:
- Kerberos Mode: specifies whether to enable Kerberos authentication for clusters. This feature is disabled by default. It is not required by clusters created for common users.
- Custom Software Settings: customizes software settings. You can use a JSON file to customize the parameters of the basic components required for a cluster, such as Hadoop, Spark, and Hive. For more information, see Customize software configurations. This feature is disabled by default.
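As an illustration of the Custom Software Settings option, the following Python sketch builds a JSON document that overrides two component parameters. The field names (ServiceName, FileName, ConfigKey, ConfigValue) and the chosen values are assumptions made for this example and are not taken from this topic. Check the exact schema in Customize software configurations before you use such a file.

```python
import json


def make_override(service, file_name, key, value):
    """Build one override entry.

    The field names are assumed for illustration. Confirm them against
    the "Customize software configurations" topic.
    """
    return {
        "ServiceName": service,
        "FileName": file_name,
        "ConfigKey": key,
        "ConfigValue": str(value),  # values are written as strings in this sketch
    }


# Hypothetical overrides for two Hadoop components.
overrides = [
    make_override("YARN", "yarn-site", "yarn.nodemanager.resource.cpu-vcores", 8),
    make_override("HDFS", "hdfs-site", "dfs.replication", 2),
]

# Serialize the overrides so that the text can be saved as a JSON file.
custom_settings_json = json.dumps(overrides, indent=2)
print(custom_settings_json)
```

Generating the file from code rather than writing it by hand makes it easier to keep the overrides valid JSON and to reuse the same settings for several clusters.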
- Configure hardware parameters.
Billing Method
Billing Method: Subscription is selected by default. EMR supports the following billing methods:
- Pay-As-You-Go: a billing method that allows you to pay for an instance after you use the instance. The system charges you on an hourly basis for the hours that the cluster is actually used. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.
- Subscription: a billing method that allows you to use an instance only after you pay for the instance.
Note
- We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.
- If you select Subscription for Billing Method, you must also specify Subscription Period and Auto Renewal. By default, the subscription period is one month and the Auto Renewal switch is not turned on. If you turn on the Auto Renewal switch, the system renews your subscription for one more month seven days before the expiration date. For more information, see Renewal policy.
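The auto-renewal timing described in the note above can be expressed as simple date arithmetic. The following Python sketch assumes that "seven days before the expiration date" means exactly seven calendar days. The function name and the example date are hypothetical.

```python
from datetime import date, timedelta

# From the note above: the subscription is renewed seven days before it expires.
RENEWAL_LEAD_DAYS = 7


def auto_renewal_date(expiration: date) -> date:
    """Return the day on which the auto-renewal is expected to take place."""
    return expiration - timedelta(days=RENEWAL_LEAD_DAYS)


# Hypothetical expiration date, used only for illustration.
example_expiration = date(2024, 3, 31)
print(auto_renewal_date(example_expiration))  # 2024-03-24
```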
Network Configuration
Zone: The zone where you want to create a cluster. Zones are different geographical areas located in the same region. They are interconnected by an internal network. In most cases, you can use the zone selected by default.
Network Type: The network type of the cluster. The VPC network type is selected by default.
VPC: The virtual private cloud (VPC) where you want to deploy the cluster. Select a VPC in the same region as the zone. If no VPC is available in the region, click Create VPC/VSwitch to create a VPC.
VSwitch: The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create a vSwitch.
Security Group Name: The security group of the cluster. An existing security group is selected by default. For more information about security groups, see Overview. You can click Create Security Group and enter a security group name to create a security group.
Important: Do not use an advanced security group that is created in the Elastic Compute Service (ECS) console.
High Availability
High Availability: This feature is disabled by default. For a Hadoop cluster, if High Availability is enabled, two or three master nodes are created in the cluster to ensure the availability of the ResourceManager and NameNode processes. HBase clusters always work in high availability mode: if you do not enable High Availability for an HBase cluster, only one master node is created, but a core node is used to support high availability. If you enable it, two master nodes are created to ensure higher security and reliability.
Instance
- Master Instance: runs control processes, such as ResourceManager and NameNode.
You can select an instance type based on your business requirements. For more information, see Instance families.
- System Disk Type: You can select an SSD, ESSD, or ultra disk based on your business requirements.
- Disk Size: You can resize a disk based on your business requirements. The recommended minimum disk size is 120 GB. Valid values: 40 to 2048. Unit: GB.
- Data Disk Type: You can select an SSD, ESSD, or ultra disk based on your business requirements.
- Disk Size: You can resize a disk based on your business requirements. The recommended minimum disk size is 80 GB. Valid values: 40 to 32768. Unit: GB.
- Master Nodes: One master node is configured by default. If high availability is enabled, two or three master nodes are configured.
- Core Instance: stores all the data of a cluster. You can add core nodes as needed after a cluster is created.
- System Disk Type: You can select an SSD, ESSD, or ultra disk based on your business requirements.
- Disk Size: You can resize a disk based on your business requirements. The recommended minimum disk size is 120 GB.
- Data Disk Type: You can select an SSD, ESSD, or ultra disk based on your business requirements.
- Disk Size: You can resize a disk based on your business requirements. The recommended minimum disk size is 80 GB.
- Core Nodes: Two core nodes are configured by default. You can change the number of core nodes based on your business requirements.
- Task Instance: stores no data. It is used to adjust the computing capabilities of clusters. No task node is configured by default. You can add task nodes based on your business requirements.
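The disk size limits and master node counts listed above can be checked before you fill in the wizard. The following Python sketch uses the valid ranges and recommended minimums stated for master instance disks. The function names and the messages are hypothetical, and the EMR console performs its own validation.

```python
# Valid ranges and recommended minimums for master instance disks, in GB,
# as listed in the Instance section above.
SYSTEM_DISK_RANGE_GB = (40, 2048)
DATA_DISK_RANGE_GB = (40, 32768)
RECOMMENDED_MIN_SYSTEM_GB = 120
RECOMMENDED_MIN_DATA_GB = 80


def check_master_disks(system_gb, data_gb):
    """Return (errors, warnings) for the given master disk sizes in GB."""
    errors, warnings = [], []
    checks = (
        ("system", system_gb, SYSTEM_DISK_RANGE_GB, RECOMMENDED_MIN_SYSTEM_GB),
        ("data", data_gb, DATA_DISK_RANGE_GB, RECOMMENDED_MIN_DATA_GB),
    )
    for label, size, (low, high), recommended in checks:
        if not low <= size <= high:
            errors.append(f"{label} disk: {size} GB is outside {low}-{high} GB")
        elif size < recommended:
            warnings.append(
                f"{label} disk: {size} GB is below the recommended {recommended} GB"
            )
    return errors, warnings


def master_node_counts(high_availability):
    """Return the possible numbers of master nodes for the chosen mode."""
    return (2, 3) if high_availability else (1,)


errors, warnings = check_master_disks(system_gb=100, data_gb=80)
print(errors)
print(warnings)
print(master_node_counts(high_availability=True))
```

A size inside the valid range but below the recommended minimum is reported as a warning rather than an error, because the section above describes the minimum only as a recommendation.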
- Configure basic parameters.
Basic Information
Cluster Name: The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
Type: The location where the metadata of the cluster is stored. The following options are available:
- DLF Unified Metadata (recommended): Metadata is stored in a data lake.
Data Lake Formation (DLF) provides a fully managed, maintenance-free, unified metadata service with high availability and performance. The metadata service is compatible with multiple versions of Hive and facilitates metadata migration between a Hive metastore and DLF. For more information, see Overview.
- Self-managed RDS: Metadata is stored in an ApsaraDB RDS database. For more information, see Configure an independent ApsaraDB RDS for MySQL database.
- Built-in MySQL (not recommended): Metadata is stored in the local MySQL database of a cluster.
Note: You can select this option only in test scenarios. The local MySQL database is deployed on a single node of an EMR cluster. This cannot ensure high availability for services and may cause stability risks. We recommend that you select DLF Unified Metadata or Self-managed RDS in production scenarios.
Assign Public IP Address: Specifies whether an elastic IP address (EIP) is associated with the cluster. This feature is disabled by default.
Note: To access the cluster over the Internet, apply for a public IP address on ECS. For information about how to apply for an EIP address, see Elastic IP addresses.
Key Pair: For information about how to use a key pair, see SSH key pair overview.
Password: The password used to log on to a master node. The password must be 8 to 30 characters in length and contain uppercase letters, lowercase letters, digits, and special characters. The following special characters are supported: ! @ # $ % ^ & *
Advanced Settings
Add User: The user added to access the web UIs of open source big data software.
Permission Settings: The RAM roles that allow applications running in a cluster to access other Alibaba Cloud services. You can use the default RAM roles.
- EMR Role: The value is fixed as AliyunEMRDefaultRole and cannot be changed. This RAM role authorizes a cluster to access other Alibaba Cloud services, such as ECS and OSS.
- ECS Role: You can also assign an application role to a cluster. Then, EMR applies for a temporary AccessKey pair when applications running on the compute nodes of that cluster access other Alibaba Cloud services, such as OSS. This way, you do not need to manually enter an AccessKey pair. You can grant the access permissions of the application role on specific Alibaba Cloud services based on your business requirements.
Data Disk Encryption: This feature is disabled by default. If you turn on Enable Encryption, data in all cloud disks that serve as the data disks of the ECS instances in the cluster is encrypted.
Important: You cannot encrypt data in local disks.
Bootstrap Actions: Optional. You can configure bootstrap actions to run custom scripts before a cluster starts Hadoop. For more information, see Manage bootstrap actions.
Tag: Optional. You can add a tag pair when you create a cluster or add a tag pair on the cluster details page after a cluster is created. For more information, see Manage and use tags.
Resource Group: Optional. For more information, see Use resource groups.
Note: The cluster configurations appear on the right side of the page when you configure parameters.
After you complete the configurations, click Next: Confirm. You are directed to the Confirm step, in which you can confirm the configurations and the fee for the creation of your cluster. The fee varies based on the billing method.
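The cluster name and password rules from the Basic Information section can be checked locally before you enter the values in the wizard. The following Python sketch makes two assumptions that this topic does not state: "letters" means ASCII letters, and a password only needs at least one character from each required class. The function names are hypothetical.

```python
import re
import string

# 1 to 64 characters: letters, digits, hyphens (-), and underscores (_).
# Assumption: "letters" means ASCII letters.
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

# Special characters listed for the Password parameter.
SPECIAL_CHARS = "!@#$%^&*"


def is_valid_cluster_name(name):
    """Check the length and character rules for Cluster Name."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


def is_valid_password(password):
    """Check the length rule and that every required character class appears.

    This sketch does not reject characters outside the listed classes,
    because this topic does not say whether they are allowed.
    """
    if not 8 <= len(password) <= 30:
        return False
    required_classes = (
        string.ascii_uppercase,
        string.ascii_lowercase,
        string.digits,
        SPECIAL_CHARS,
    )
    return all(any(ch in group for ch in password) for group in required_classes)


print(is_valid_cluster_name("emr-test_01"))
print(is_valid_password("Emr#2024x"))
```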
- Verify that the configuration is correct, read and select E-MapReduce Service Terms, and then click Create.
Important
- Pay-as-you-go clusters: Creation immediately starts after you click Create.
After the cluster is created, its status changes to Idle.
- Subscription clusters: An order is generated after you click Create. The cluster is created after you pay the fee.