This topic provides answers to some frequently asked questions about cluster management.
Why does a high-availability EMR cluster have three master nodes?
How do I enable data disk encryption? What are the impacts after I enable data disk encryption?
How do I specify the disk size when I scale out an EMR cluster?
In which scenarios do I need to turn on Add to Deployment Set?
How do I configure the Add to Deployment Set parameter when I scale out an EMR cluster?
How do I associate a public IP address with an ECS instance in an existing EMR cluster?
How do I modify the configurations of ECS instances in a node group?
FAQ about services:
Do I need to restart a service after I modify the configurations of the service?
Can I add services to an EMR cluster after I create the cluster?
Do EMR clusters support Hue? What do I do if Hue is not supported and I want to use the service?
Do EMR clusters support Oozie? What do I do if Oozie is not supported and I want to use the service?
FAQ about EMR Doctor:
Can I upgrade an EMR cluster?
No, you cannot upgrade an E-MapReduce (EMR) cluster and the services that are deployed in the cluster. To upgrade an EMR cluster or the services that are deployed in the cluster, release the cluster and create another cluster.
What services do EMR clusters support?
The services that are supported by EMR clusters vary based on the cluster version and cluster type. For more information, see Overview.
Do EMR clusters support Hue? What do I do if Hue is not supported and I want to use the service?
DataLake clusters of EMR V5.8.0 or a later minor version and EMR V3.42.0 or a later minor version do not provide Hue. If you want to use the Hue service in a DataLake cluster, you must install the service first. For more information, see Install Hue in DataLake clusters.
Do EMR clusters support Oozie? What do I do if Oozie is not supported and I want to use the service?
DataLake clusters of EMR V5.8.0 or a later minor version and EMR V3.42.0 or a later minor version do not provide Oozie. If you want to use a workflow to schedule the service, you can use EMR Workflow. For more information, see What is EMR Workflow?
Why does a high-availability EMR cluster have three master nodes?
A high-availability EMR cluster with three master nodes is more reliable than a high-availability EMR cluster with two master nodes. High-availability EMR clusters with only two master nodes are no longer supported.
For high-availability clusters, EMR distributes master nodes across different underlying hardware to reduce the risk of failures.
How do I enable data disk encryption? What are the impacts after I enable data disk encryption?
When you create an EMR cluster, you can determine whether to turn on Data Disk Encryption in the Advanced Settings section of the Basic Configuration step. For more information, see Enable data disk encryption.
You can enable data disk encryption only when you create an EMR cluster. You cannot enable data disk encryption for an existing EMR cluster.
After a data disk is encrypted, data in transit and data at rest on the disk are encrypted. You can use the data disk encryption feature if your business has security compliance requirements. Data disk encryption is transparent to applications at the operating system level of Elastic Compute Service (ECS) instances and does not affect the running of jobs.
How do I release a cluster that fails to be created?
In most cases, an EMR cluster fails to be created because the RDS configurations are incorrect or the selected ECS instance type is unavailable.
If some ECS instances are created and the cluster status is Startup Failed, you must go to the ECS console to release the ECS instances. After you release the ECS instances, the system releases the EMR cluster.
If the EMR cluster fails to be deployed and the cluster status is Unexpectedly Terminated, the cluster does not have resources and you are not charged for the cluster. You can click Delete in the Actions column to delete the cluster.
Can I add services to an EMR cluster after I create the cluster?
EMR allows you to add specific services to an EMR cluster after you create the cluster. For more information, see Add services.
After you add services, you may need to modify the related configurations of specific existing services and restart the existing services for the configurations to take effect. We recommend that you add services during off-peak hours.
The services that you can add to a cluster vary based on the version of EMR. The actual services that you can add to a cluster are displayed in the EMR console.
Do I need to restart a service after I modify the configurations of the service?
If you modify the server-side configurations of an EMR cluster, such as the configurations of Spark, Hive, or Hadoop Distributed File System (HDFS), you must restart the service for the modification to take effect. If you modify the client-side configurations of an EMR cluster, you need to only click Deploy Client Configuration for the modification to take effect. For information about how to modify or add a configuration item, see Manage configuration items.
What is rolling restart?
Rolling restart refers to the process in which the system restarts an ECS instance only after the previous ECS instance is restarted and all big data services that are deployed on the previous ECS instance are restored. It takes approximately five minutes to restart an ECS instance.
How do I associate a public IP address with an ECS instance in an existing EMR cluster?
You can create an elastic IP address (EIP) and associate the EIP with an ECS instance that is of the virtual private cloud (VPC) network type and has no public IP address. This way, you can access the instance over the Internet. For more information, see the Associate one or more EIPs with an instance section of the Associate or disassociate an EIP topic.
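If you prefer to call API operations instead of using the console, the association can also be performed with the VPC API operation AssociateEipAddress, for example through Alibaba Cloud CLI. The following command is only a sketch: the region ID, EIP allocation ID, and ECS instance ID are placeholders that you must replace with your own values.
aliyun vpc AssociateEipAddress --RegionId cn-hangzhou --AllocationId eip-xxxxxxxxxxxx --InstanceId i-xxxxxxxxxxxx --InstanceType EcsInstance
After the call succeeds, you can verify the association on the EIP page in the console.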
In which scenarios do I need to turn on Add to Deployment Set?
The deployment set feature is provided by Alibaba Cloud ECS and is used to control the distribution of ECS instances. If the core nodes of your cluster use ECS instances that are equipped with local disks, we recommend that you turn on Add to Deployment Set to improve data security. You can turn on Add to Deployment Set to prevent multiple ECS instances from being deployed on the same physical machine. This way, if a physical machine fails, the ECS instances on other physical machines are not affected and the local HDFS data is not lost.
You can add up to 20 ECS instances to a deployment set. For more information, see Add nodes to the deployment set.
How do I configure the Add to Deployment Set parameter when I scale out an EMR cluster?
By default, Add to Deployment Set is turned on for ECS instances that are equipped with local disks and turned off for ECS instances that are equipped with other disks. You can specify whether to turn on Add to Deployment Set based on your business requirements. For more information, see Add nodes to a deployment set.
How do I specify the disk size when I scale out an EMR cluster?
When you scale out an EMR cluster, the disk size of new nodes is the same as the disk size of existing nodes in the node group and cannot be modified. You can expand the disks of a node group based on your business requirements. For more information, see Expand a disk.
Can I resize the disks of EMR clusters?
You can expand but cannot shrink the data disks of EMR clusters. You cannot resize a system disk.
To expand the data disk of an EMR cluster, perform the following steps: On the Nodes tab of the desired cluster, find the node group and click Expand Disk in the Actions column. For more information, see Expand a disk.
Can I scale out or scale in an EMR cluster?
Yes, you can scale out or scale in an EMR cluster. When you scale out or scale in an EMR cluster, take note of the following items:
Scale-out: You can add only core nodes and task nodes. By default, the configurations of an added node are the same as the configurations of an existing node in the node group. During a scale-out, you must make sure that you complete the payment for your order. If you do not complete the payment for your order, the scale-out fails. For more information, see Scale out an EMR cluster.
Scale-in:
For information about how to scale in a pay-as-you-go task node group, see Scale in an EMR cluster.
If you want to scale in a subscription task node group, you must perform manual scale-in because subscription to ECS instances can only be manually canceled. If you want to scale in a core node group, you must perform manual scale-in to ensure the stability of HDFS data in the core node group. For more information, see Scale in a node group.
How do I modify the configurations of ECS instances in a node group?
You can upgrade the specifications of a node group in a subscription EMR cluster to modify the configurations of the ECS instances in the node group. You cannot downgrade the specifications of a node group.
To upgrade the specifications of a node group, perform the following steps: On the Nodes tab of the desired cluster, find the desired node group, move the pointer over the icon in the Actions column, and then select Upgrade Configuration. For more information, see Upgrade node configurations.
What do I do if the error message "The specified parameter AddNumber is not valid" appears during the scale-out of a cluster?
Problem description: The following error message appears during the scale-out of a cluster:
The specified parameter AddNumber is not valid. add instances number :xxx larger than deploymentSet availableAmount: xxx deploymentSetId: ds-uf6gwfou0a13kekupt14xxxx
Cause: Add to Deployment Set is turned on for the cluster that you want to scale out, and the number of nodes in the node group exceeds the upper limit of a deployment set. For more information about deployment sets, see Add nodes to a deployment set.
Solution: Contact ECS technical support to increase the maximum number of nodes that can be added to a deployment set for your current account.
How do I stop collection of service operational logs?
If you do not want EMR to collect your data, you can disable collection of service operational logs.
If you disable collection of service operational logs, the EMR cluster health check and technical support are limited. Other features of your cluster remain available. Proceed with caution.
Solution:
Disable collection of service operational logs.
When you create a cluster, turn off Collect Service Operational Logs in the Software Configuration step.
For an existing cluster, turn off Collection Status of Service Operational Logs in the Software Information section of the Basic Information tab of the cluster.
Check whether collection of service operational logs is disabled.
Check whether the namenode-log information exists in the /usr/local/ilogtail/user_log_config.json file. If the namenode-log information does not exist, collection of service operational logs is disabled.
Note: After you disable collection of service operational logs, it takes about 2 to 3 minutes for the configuration to take effect.
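For example, you can run the following command on a node to check the file. The command uses only the file path and key that are mentioned above; if the command returns no output, the namenode-log information does not exist and collection of service operational logs is disabled on that node.
grep "namenode-log" /usr/local/ilogtail/user_log_config.json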
What information is collected in service operational logs?
Service operational logs contain only data about the running of service components on a cluster. You can enable or disable log collection for all services with one click. If you disable log collection, the EMR cluster health check and after-sales technical support are limited.
Collect Service Operational Logs is turned on by default when you create a cluster. You can determine whether to turn off the switch based on your business requirements. For more information, see How do I stop collection of service operational logs?
Which types of clusters support EMR Doctor (health diagnostics feature in the EMR console)?
Only DataLake and Hadoop clusters support the health diagnostics feature. After you create an EMR cluster, click the name of the cluster. On the page that appears, you can click the Monitoring and Diagnostics tab and then the Health Diagnostics subtab to use the health diagnostics feature.
If you create an EMR Hadoop cluster, you must activate EMR Doctor before you can use the health diagnostics feature in the cluster. For more information, see Activate EMR Doctor (Hadoop clusters).
Does the installation or upgrade of EMR Doctor exert impacts on services in an EMR cluster and jobs that run on the cluster?
During the installation or upgrade of EMR Doctor, no services in an EMR cluster are restarted, and existing jobs that run on the cluster are not affected. After EMR Doctor is installed, the required parameters for EMR Doctor are automatically configured for the cluster. You do not need to perform any manual configuration.
During the installation or upgrade of EMR Doctor, EMR delivers configurations of services such as YARN, Spark, Tez, and Hive to clusters. Before you install or upgrade EMR Doctor, we recommend that you check whether some service configurations are modified and saved but are not delivered and evaluate the impacts of delivering the service configurations to clusters.
What type of data does EMR Doctor collect?
EMR Doctor does not collect your actual data or scan your actual files or file content.
EMR Doctor collects only necessary event data, such as the start time, end time, metrics, and counters of a job.
Am I charged for EMR Doctor?
EMR Doctor is free of charge.
What are the impacts of job data collection on job execution?
The storage metadata collection feature of EMR Doctor dynamically adjusts how much data it collects based on the amount of user resources. This prevents excessive user resources from being occupied.
The job collection feature of EMR Doctor is based on Java probe technology and does not start a separate Java monitoring process. Job data is collected asynchronously and does not block the main process of a job. If the collection pressure is high, the collected data is automatically discarded, and you can adjust the collection frequency by configuring parameters.
The following table lists the data of some TPC-DS tests.
SQL and engine | Average job duration when EMR Doctor collection is used (10 runs) | Average job duration when EMR Doctor collection is not used (10 runs) |
query7 (Spark) | 21.0s | 21.2s |
query71 (Tez) | 50.8s | 49.8s |
query19 (MapReduce) | 68.6s | 68.2s |
This test is derived from the TPC-DS benchmark but does not meet all requirements of a TPC-DS benchmark test. Therefore, the results may not match the published results of the TPC-DS benchmark.
When can I obtain the collection report?
After EMR Doctor is installed or upgraded in an EMR cluster, the daily cluster report feature performs an analysis based on the jobs that run in the cluster and on whether the storage metadata collection feature is enabled. Reports can be generated only if the cluster contains jobs.
Computing jobs: After data about computing jobs in an EMR cluster is collected, the latest reports for the jobs can be viewed on the next day. The reports provide an overall evaluation of the cluster based on the execution status of the jobs in the cluster.
Storage analysis: The Collect Information About Storage Resources feature of EMR Doctor is disabled by default, and you can manually enable it. After you enable the feature, storage information is collected at 10:00 in the morning of the current day. The collected data is analyzed in the morning of the next day, and reports are generated based on the analysis results. If data is collected in the afternoon of the current day, you can view the reports on the day after the next day.
Can specific values be provided for parameters?
Optimization suggestions that are provided by EMR Doctor are directional. For example, a suggestion may recommend that you decrease the amount of memory or modify the garbage collection parameters without providing specific parameter values. EMR Doctor collects job data by recording and sampling, which is designed to prevent impacts on your programs. You need to adjust parameters based on the suggestions and then check whether the new configuration is suitable.
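For example, if a report suggests that you decrease executor memory or tune garbage collection for a Spark job, you apply the change to your own job submission and then check the result. The class name, JAR file, and values in the following command are illustrative placeholders, not values recommended by EMR Doctor.
spark-submit \
  --class com.example.YourJob \
  --executor-memory 4g \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  your-job.jar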
What do I do if an error message that indicates insufficient ECS resources appears when I scale out a cluster?
Problem description: A cluster fails to be scaled out, and the "Insufficient ECS resources_OutofStock" or "Insufficient ECS resources_OperationDenied.NoStock" error message appears.
Cause: ECS instances of the specified instance type that you want to add to a node group are insufficient.
Solution: Perform the scale-out operation after ECS instances of the specified instance type are back in stock. Alternatively, create a node group that uses another instance type. For more information, see Create a node group.
What do I do if an error message that indicates insufficient ECS resources appears when I create a cluster or a node group?
Problem description: A cluster or a node group fails to be created and the "Insufficient ECS resources_OutofStock" or "Insufficient ECS resources_OperationDenied.NoStock" error message appears.
Cause: ECS instances of the instance type that you select when you create a cluster or a node group are insufficient.
Solution: When you create the cluster or node group, select another ECS instance type that has sufficient stock and meets your business requirements.
How do I remove a service that is no longer needed?
You cannot remove existing services that are deployed in a cluster. After a service is started, you cannot remove the service in the console or by calling an API operation.
How do I log on to a node of a cluster?
After an EMR cluster is created, you can use the password that you specified when you created the cluster to log on to the master node of the cluster. For information about how to log on to other nodes of the cluster, see Log on to other nodes of the cluster.
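For example, if the security group rules of the cluster allow SSH access, you can connect to the master node with a standard SSH client and enter the password that you specified when you created the cluster. The public IP address in the following command is a placeholder, and root is assumed to be the logon user.
ssh root@<public-IP-address-of-the-master-node>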
How do I view the vSwitch with which a node is associated?
In Alibaba Cloud EMR on ECS, vSwitches are associated with nodes in node groups. You cannot view the associations on the Basic Information tab. To view the vSwitch with which a node is associated, perform the following steps: Go to the Nodes tab of your cluster, find the node group to which the node belongs, and then click the node group name. In the Node Group Attributes panel, view the setting of the vSwitch parameter.
What do I do if packet loss frequently occurs in clusters?
Problem description: Packet loss frequently occurs in clusters, and error messages such as neighbour: arp_cache: neighbor table overflow! may appear in system logs. This indicates that the Address Resolution Protocol (ARP) cache table has reached its capacity limit and can no longer effectively map media access control (MAC) addresses to IP addresses. As a result, network performance decreases.
Problem analysis: In a large-scale distributed system, especially when your EMR cluster is of a minor version earlier than EMR V5.18.0 or V3.52.0 and has more than 1,000 servers, the network may become unstable and packet loss may occur. You can configure system parameters to optimize the management of the ARP cache.
An ARP cache table stores MAC and IP address pairs. Related parameters:
net.ipv4.neigh.default.gc_thresh1: If the number of entries in the ARP cache table is less than the value of this parameter, the garbage collector does not collect entries. Default value: 128.
net.ipv4.neigh.default.gc_thresh2: If the number of entries in the ARP cache table is greater than the value of this parameter, the garbage collector collects entries within 5 seconds. Default value: 512.
net.ipv4.neigh.default.gc_thresh3: The maximum number of entries that can be kept in the ARP cache table. Default value: 1024.
Note: The default values of these parameters are small. As a result, packet loss and network instability occur when the cluster has more than 1,000 servers. In this case, you need to change the default values based on your business requirements.
Solution:
Add the following settings to the /etc/sysctl.conf file to increase the values of the preceding parameters and the maximum number of connections that can be tracked on the servers:
net.ipv4.neigh.default.gc_thresh1 = 512
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 10240
net.nf_conntrack_max = 524288
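If you prefer to append the settings from the command line instead of editing the file in a text editor, you can use standard shell tools, as shown in the following example:
sudo tee -a /etc/sysctl.conf <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 512
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 10240
net.nf_conntrack_max = 524288
EOF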
Run the sudo sysctl -p command to make the modifications take effect.
Note: If the error message sysctl: cannot stat /proc/sys/net/nf_conntrack_max: No such file or directory appears when you run the sysctl -p command, run the sudo modprobe nf_conntrack command to load the nf_conntrack module, and then run the sysctl -p command again.
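To confirm that the new values are in effect, you can query the kernel parameters with the standard sysctl utility:
sudo sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
sudo sysctl net.nf_conntrack_max
Each command prints the current value of the corresponding parameter.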