This topic describes how to create and configure a Dataflow Kafka cluster, which refers to a Dataflow cluster that is deployed with the Kafka service.
Limits
Kafka is no longer supported in E-MapReduce (EMR) V5.18.0, EMR V3.52.0, or later minor versions. For those versions, we recommend that you use ApsaraMQ for Kafka or install Kafka manually.
Precautions
When you create a Dataflow Kafka cluster, you must select the appropriate type of Elastic Compute Service (ECS) instance and determine the number of brokers based on the estimated load of your business. No general cluster plan can be provided due to the variety of business scenarios. You need to create a cluster based on your actual environment. In most cases, we recommend that you consider the following items when you select an instance type:
Deploy Kafka brokers on ECS instances whose CPU-to-memory ratio is 1:4.
Use cloud disks to store data.
Consider the relationship between the I/O throughput of cloud disks and the network interface controller (NIC) bandwidth.
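The instance-type considerations above can be turned into a quick sanity check. The following is a minimal sketch, not an EMR or ECS API: the BrokerSpec class, its field names, and the units (MB/s for disks, Gbit/s for the NIC) are assumptions made for illustration, and you supply the numbers from the specifications of the instance type and disk category that you are evaluating.

```python
from dataclasses import dataclass


@dataclass
class BrokerSpec:
    """Hypothetical description of one broker node; you supply all values."""
    vcpus: int
    memory_gib: int
    disk_count: int
    disk_throughput_mb_s: float  # per-disk throughput in MB/s (assumed unit)
    nic_bandwidth_gbit_s: float  # NIC bandwidth in Gbit/s (assumed unit)


def has_one_to_four_ratio(spec: BrokerSpec) -> bool:
    """Return True if the CPU-to-memory ratio is 1:4 (1 vCPU per 4 GiB)."""
    return spec.memory_gib == 4 * spec.vcpus


def throughput_bottleneck(spec: BrokerSpec) -> str:
    """Compare aggregate disk throughput with NIC bandwidth, both in MB/s."""
    disk_total = spec.disk_count * spec.disk_throughput_mb_s
    nic_total = spec.nic_bandwidth_gbit_s * 1000 / 8  # Gbit/s -> MB/s
    if disk_total < nic_total:
        return "disk"
    if disk_total > nic_total:
        return "network"
    return "balanced"


# Illustrative numbers only; they do not describe a real instance type.
example = BrokerSpec(vcpus=8, memory_gib=32, disk_count=4,
                     disk_throughput_mb_s=140.0, nic_bandwidth_gbit_s=5.0)
```

With the illustrative values, has_one_to_four_ratio(example) is True and throughput_bottleneck(example) reports "disk", which suggests that adding disks would help more than choosing an instance type with a faster NIC.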
Consider the following factors when you configure the deployment parameters:
The Kafka versions used in EMR depend on the ZooKeeper service, so the availability of ZooKeeper determines whether the Kafka service is highly available. We recommend that you turn on High Service Availability when you create a cluster. In this case, three nodes are deployed for the ZooKeeper service.
If the master node group is only used to deploy ZooKeeper, you need to configure only one data disk for the master node group.
For more information about evaluation-based suggestions, see Suggestions for evaluating cluster resources.
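To see why three ZooKeeper nodes matter for availability, recall that a ZooKeeper ensemble stays available only while a strict majority of its nodes is reachable. The following minimal sketch works through that arithmetic. The function names are illustrative and are not part of EMR or ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for size in (1, 3):
    print(size, quorum_size(size), tolerated_failures(size))
```

A three-node ensemble needs two nodes for a quorum and therefore survives the loss of one node, whereas a single ZooKeeper node tolerates no failure.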
Procedure
Go to the cluster creation page.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
Optional. In the top navigation bar, select a region and a resource group based on your business requirements.
You cannot change the region of a cluster after the cluster is created.
By default, all resource groups in your account are displayed.
On the EMR on ECS page, click Create Cluster.
Configure the cluster.
To create a cluster, you must configure software parameters, hardware parameters, and basic parameters as guided by the wizard.
Important: After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correctly configured when you create a cluster. For more information, see Create a cluster.
Configure software parameters.
Parameter
Example
Description
Region
China (Hangzhou)
The region in which you want to create the cluster. You cannot change the region of a cluster after the cluster is created.
Business Scenario
Real-time Data Streaming
The scenario in which you want to use the cluster. Select Real-time Data Streaming.
Product Version
EMR-3.43.1
The version of EMR. After you select an EMR version, you can view the version of each service.
For example, in an EMR V3.43.1 cluster, the version of Kafka is 2.12_2.4.1. The value 2.12 indicates the Scala version, and the value 2.4.1 indicates the version of open source Kafka.
High Service Availability
On
By default, the switch is turned off.
Important: If you turn on High Service Availability when you create the cluster, three nodes are deployed in the master node group for the ZooKeeper service. Because the Kafka versions used in EMR depend on the ZooKeeper service, we recommend that you turn on High Service Availability when you create a cluster.
Optional Services (Select One At Least)
Kafka
The services that you want to deploy in the cluster. Select Kafka.
You can select other services based on your business requirements. By default, the relevant components of the services that you selected are started.
Collect Service Operational Logs
On
Specifies whether to enable log collection for all services. By default, this switch is turned on to collect the service operational logs of your cluster. The logs are used only for cluster diagnostics.
After you create a cluster, you can modify the Collection Status of Service Operational Logs parameter on the Basic Information tab.
Important: If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impacts of disabling it, see How do I stop collection of service operational logs?
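The Product Version description above explains that a Kafka version label such as 2.12_2.4.1 combines the Scala version and the open source Kafka version. The following minimal sketch splits such a label. It assumes that the label always uses a single underscore between the two parts, as in the example. The function name is illustrative.

```python
from typing import Tuple


def parse_kafka_version(label: str) -> Tuple[str, str]:
    """Split a label such as '2.12_2.4.1' into (scala_version, kafka_version)."""
    scala, separator, kafka = label.partition("_")
    if not separator or not scala or not kafka:
        raise ValueError(f"unexpected Kafka version label: {label!r}")
    return scala, kafka


scala_version, kafka_version = parse_kafka_version("2.12_2.4.1")
print(scala_version, kafka_version)
```

Knowing the open source Kafka version is useful when you choose a client library version that is compatible with the cluster.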
Configure hardware parameters.
Parameter
Example
Description
Billing Method
Pay-as-you-go
The billing method of the cluster. By default, Subscription is selected. EMR supports the following billing methods:
Pay-as-you-go: a billing method that allows you to pay for an instance after you use the instance. The system charges you for a cluster based on the number of hours for which the cluster is actually used. Bills are generated on an hourly basis at the top of every hour. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.
Subscription: a billing method that allows you to use an instance only after you pay for the instance.
Note: We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.
Zone
Zone I
The zone in which you want to create a cluster. A zone in a region is a physical area with independent power supplies and network facilities. Clusters in zones within the same region can communicate with each other over an internal network. In most cases, you can use the zone that is selected by default.
VPC
emr_test/vpc-bp1f4epmkvncimpgs****
The virtual private cloud (VPC) where you want to deploy the cluster. An existing VPC is selected by default.
If you want to use a new VPC, go to the VPC console to create one. For more information, see Create and manage a VPC.
vSwitch
vsw_test/vsw-bp1e2f5fhaplp0g6p****
The vSwitch of the cluster. Select a vSwitch in the specific zone based on your business requirements. If no vSwitch is available in the zone, go to the VPC console to create one. For more information, see Create and manage a vSwitch.
Default Security Group
sg-bp1ddw7sm2risw****/sg-bp1ddw7sm2risw****
The security group of the cluster. By default, an existing security group is selected. For more information about security groups, see Overview.
You can also click create a new security group to create a security group in the ECS console. For more information, see Create a security group.
Important: Do not use an advanced security group that is created in the ECS console.
Node Group
Configure settings based on your business requirements
Instance Type: You can select instance types and specifications based on your business requirements or based on evaluation-based suggestions. For more information about evaluation-based suggestions, see Suggestions for evaluating cluster resources.
Add to Deployment Set: If you turn on High Service Availability, the master nodes are added to a deployment set by default. For more information about deployment sets, see Add nodes to the deployment set.
System Disk: You can select a type of system disk based on your business requirements.
System disk size: You can specify the size of a disk based on your business requirements. The recommended minimum disk size is 120 GiB. Valid values: 80 to 500. Unit: GiB.
Data Disk: You can select a type of data disk based on your business requirements.
Note: We recommend that you select a cloud disk type.
Data disk size: You can specify the size of a disk based on your business requirements. The recommended minimum disk size is 80 GiB. Valid values: 40 to 32768. Unit: GiB.
Instances: By default, three master nodes and three core nodes are deployed.
Additional Security Group: You can associate the node group with a maximum of two additional security groups. An additional security group allows for interactions between different external resources and applications in a flexible manner.
Assign Public Network IP: specifies whether to associate an elastic IP address (EIP) with the cluster. By default, this switch is turned off.
Note: For information about how to apply for an EIP, see Elastic IP addresses.
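The disk size limits listed for the node group can be checked before you fill in the wizard. The following minimal sketch encodes the valid ranges and recommended minimums stated above. The constant and function names are illustrative, and the console remains the authority on what it accepts.

```python
from typing import Tuple

# Valid ranges and recommended minimums in GiB, taken from the descriptions above.
SYSTEM_DISK_RANGE_GIB = (80, 500)
SYSTEM_DISK_RECOMMENDED_MIN_GIB = 120
DATA_DISK_RANGE_GIB = (40, 32768)
DATA_DISK_RECOMMENDED_MIN_GIB = 80


def check_disk_size(size_gib: int, valid_range: Tuple[int, int],
                    recommended_min: int) -> str:
    """Classify a disk size as 'invalid', 'below recommended', or 'ok'."""
    low, high = valid_range
    if not low <= size_gib <= high:
        return "invalid"
    if size_gib < recommended_min:
        return "below recommended"
    return "ok"


print(check_disk_size(100, SYSTEM_DISK_RANGE_GIB, SYSTEM_DISK_RECOMMENDED_MIN_GIB))
print(check_disk_size(1000, DATA_DISK_RANGE_GIB, DATA_DISK_RECOMMENDED_MIN_GIB))
```

A 100 GiB system disk falls inside the valid range but below the recommended minimum of 120 GiB, so the sketch flags it instead of rejecting it.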
Configure basic parameters.
Configure parameters in the Basic Information step.
Important: The following table describes all parameters except those in the Advanced Settings section. The parameters in the Advanced Settings section are not supported. Do not configure them.
Parameter
Example
Description
Cluster Name
Emr-Kafka
The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
Identity Credentials
Custom password
Key Pair (default): Use an SSH key pair to access the Linux instance.
For information about how to use an SSH key pair, see SSH key pair overview.
Password: Use the password that you set for the master node to access the Linux instance.
The password must be 8 to 30 characters in length and must contain uppercase letters, lowercase letters, digits, and special characters.
The following special characters are supported: ! @ # $ % ^ & *
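The naming and password rules above can be validated locally before you submit the form. The following is a minimal sketch under two assumptions that the text does not state explicitly: "letters" means ASCII letters, and a password may contain only characters from the four listed classes. The function names are illustrative.

```python
import re
import string

# Assumption: "letters" means ASCII letters only.
_CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
_SPECIAL_CHARACTERS = "!@#$%^&*"


def is_valid_cluster_name(name: str) -> bool:
    """1 to 64 characters: letters, digits, hyphens (-), and underscores (_)."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None


def is_valid_password(password: str) -> bool:
    """8 to 30 characters with at least one character from each required class."""
    if not 8 <= len(password) <= 30:
        return False
    classes = (string.ascii_uppercase, string.ascii_lowercase,
               string.digits, _SPECIAL_CHARACTERS)
    # Assumption: characters outside the four classes are rejected.
    if any(all(ch not in cls for cls in classes) for ch in password):
        return False
    return all(any(ch in cls for ch in password) for cls in classes)


print(is_valid_cluster_name("Emr-Kafka"))
print(is_valid_password("Abcdef1!"))
```

Both example values satisfy the rules. A name that contains a space or a password without an uppercase letter would be rejected.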
In the Confirm step, read the terms of service and select the check box.
Click Confirm.
Refresh the EMR on ECS page to view the creation progress. When Status becomes Running, the cluster is created.
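If you track cluster creation from a script instead of refreshing the page, the same wait can be expressed as a polling loop. The following minimal sketch is deliberately generic: get_status is a hypothetical callable that you supply (for example, a wrapper around whatever tool you use to query the cluster state) and is not an EMR API. "Running" is the only status name taken from this topic. The intervals are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until_running(get_status: Callable[[], str],
                       interval_seconds: float = 30.0,
                       timeout_seconds: float = 1800.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it returns 'Running' or the timeout elapses.

    Returns True if the cluster reached 'Running', False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if get_status() == "Running":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_seconds)
```

The sleep function is injectable so that the loop can be exercised without real delays. In normal use, the default time.sleep applies.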
What to do next
After the cluster is created, you can modify the values of the default parameters of the cluster to meet production requirements. Examples:
Specify whether to enable the SSL encryption feature for an EMR Kafka cluster. For more information, see Use SSL to encrypt Kafka data.
Specify whether to enable the Simple Authentication and Security Layer (SASL) feature to perform logon authentication for an EMR Kafka cluster. For more information, see Log on to a Kafka cluster by using SASL.
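Enabling SSL or SASL on the cluster also changes what clients must configure. The following minimal sketch shows how the two choices map to the standard Apache Kafka client property security.protocol. The property keys are standard Kafka client configuration names, but the function name, the default mechanism, and the truststore path are placeholders. The values that an EMR Kafka cluster actually requires are described in the linked topics, not here.

```python
from typing import Dict, Optional


def client_security_properties(use_ssl: bool, use_sasl: bool,
                               sasl_mechanism: str = "PLAIN",
                               truststore_path: Optional[str] = None) -> Dict[str, str]:
    """Build Kafka client security properties for the chosen features."""
    if use_ssl and use_sasl:
        protocol = "SASL_SSL"
    elif use_ssl:
        protocol = "SSL"
    elif use_sasl:
        protocol = "SASL_PLAINTEXT"
    else:
        protocol = "PLAINTEXT"

    properties = {"security.protocol": protocol}
    if use_sasl:
        # Assumption: the mechanism must match what the brokers are configured with.
        properties["sasl.mechanism"] = sasl_mechanism
    if use_ssl and truststore_path is not None:
        # Placeholder path; point it at the truststore that you obtained for the cluster.
        properties["ssl.truststore.location"] = truststore_path
    return properties


print(client_security_properties(use_ssl=True, use_sasl=True,
                                 truststore_path="/path/to/truststore"))
```

The sketch covers only protocol selection. Credentials such as the JAAS configuration and truststore password are intentionally left out and must come from your own secure configuration.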