This topic describes how to use an Alibaba Cloud account to log on to the E-MapReduce (EMR) console and create a cluster on the EMR on ACK page.
Prerequisites
The AliyunOSSFullAccess and AliyunDLFFullAccess policies are attached to a RAM role. For more information, see Attach policies to a RAM role.
A Container Service for Kubernetes (ACK) cluster is created. For more information, see Create an ACK dedicated cluster or Create an ACK managed cluster.
A node pool is created. For more information, see Create a node pool.
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
Procedure
Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.
On the EMR on ACK page, click Create Cluster.
On the E-MapReduce on ACK page, configure the parameters. The following table describes the parameters.
Parameter
Description
Region
The region in which you want to create a cluster. You cannot change the region after the cluster is created.
Cluster Type
The type of the cluster. Valid values:
Shuffle Service: an extension provided by EMR to optimize the shuffle operations of compute engines. Its remote shuffle service allows Spark jobs to run on nodes that do not have local disks and supports dynamic resource allocation. The service is suitable for Spark clusters in the ACK environment. For more information, see Celeborn.
Important: When you create a Shuffle Service cluster, the nodes in the dedicated node pool or the nodes of the associated ACK cluster must belong to the big data instance families or to instance families with local SSDs. Otherwise, the remote shuffle service fails to be deployed.
Note: In EMR on ACK scenarios, the system provides a built-in automatic cleanup task named rss-pvc-clean for Shuffle Service clusters. The task cleans up PVC resources that are no longer used, either periodically or under specific conditions. This optimizes the management of storage resources and prevents storage space from being occupied by invalid or redundant persistent data.
Presto: an in-memory distributed SQL engine that is used for interactive queries.
Presto clusters support various data sources and are suitable for complex analysis of petabytes of data and cross-data source queries.
Spark: a general-purpose distributed big data processing engine that provides various capabilities, such as extract, transform, and load (ETL), batch processing, and data modeling.
Important: If you want to associate a Spark cluster with a Shuffle Service cluster, the EMR versions of the clusters must be the same. For example, a Spark cluster whose EMR version is EMR-5.x-ack can be associated with only a Shuffle Service cluster whose EMR version is EMR-5.x-ack.
Flink: a distributed compute engine for stateful computing on bounded or unbounded data streams. Flink on ACK is developed based on EMR on ACK and Flink Kubernetes Operator 1.0.1. By default, Flink on ACK uses the kernel of Flink Enterprise Edition, which ensures that users can use Flink on ACK without additional configurations.
Product Version
The version of EMR. By default, the latest version is selected.
Component Version
The type and version of each component that is deployed in a cluster of the specified type.
ACK Cluster
Select an existing ACK cluster or create an ACK cluster in the ACK console.
To reserve nodes for EMR, click Configure Dedicated Nodes. An EMR-dedicated node or node pool is configured by adding taints and labels to the node or node pool. This way, the node or node pool can be used only for EMR.
Note: We recommend that you configure dedicated nodes in a node pool. If no node pool is available, create a node pool. For more information, see Create a node pool.
OSS Bucket
Select an existing bucket or create a bucket in the Object Storage Service (OSS) console.
Cluster Name
The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
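Two rules from the preceding table are easy to check before you fill in the form: the cluster name format, and the requirement that a Spark cluster and its associated Shuffle Service cluster use the same EMR version. The following Python sketch is illustrative only and is not part of the EMR console or any EMR SDK. The function names are hypothetical, and it assumes that "letters" means ASCII letters, which the rule above does not state explicitly.

```python
import re

# Assumption: "letters" means ASCII letters (A-Z, a-z); the naming rule
# itself does not say whether other alphabets are accepted.
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name is 1 to 64 characters long and contains only
    letters, digits, hyphens (-), and underscores (_)."""
    # fullmatch requires the whole string to match, so a trailing newline
    # or any other stray character makes the name invalid.
    return _CLUSTER_NAME.fullmatch(name) is not None


def can_associate(spark_emr_version: str, shuffle_emr_version: str) -> bool:
    """Return True if the two clusters report the same EMR version string,
    which is the stated condition for associating them."""
    return spark_emr_version == shuffle_emr_version


if __name__ == "__main__":
    # "emr-spark_01" is a made-up example name.
    print(is_valid_cluster_name("emr-spark_01"))
    print(is_valid_cluster_name("name with spaces"))
    print(can_associate("EMR-5.x-ack", "EMR-5.x-ack"))
```

The version check compares the version strings exactly, because the only rule given is that the versions must be the same.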
Click Create.
If the status of the cluster changes to Running, the cluster is created.
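If you script the wait instead of watching the console, the status check can be expressed as a polling loop with a timeout. The following Python sketch assumes that you supply your own `get_status` callable that returns the cluster status as a string. That callable, the function name `wait_until_running`, and the default timeout and interval are all hypothetical, and no EMR API is called here. Only the Running status comes from this topic.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "Running" or the timeout expires.

    get_status is a caller-provided function (hypothetical); this sketch
    does not know how the status is actually retrieved.
    Returns True if the cluster reached Running, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Running":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The status is checked once before the deadline test, so even a timeout of zero performs at least one check.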