E-MapReduce (EMR) allows you to create alert rules to monitor the usage of service resources in EMR clusters. If resource metrics meet specific alert conditions, alerts are triggered and CloudMonitor sends alert notifications. This way, you can identify and handle the exceptions of monitored clusters at the earliest opportunity. This topic describes how to create and view alert rules in the EMR console.
Background information
The alerting feature is provided by CloudMonitor. You can manage alert rules or use more monitoring and alerting features in the CloudMonitor console. For more information, see What is CloudMonitor?
Prerequisites
An EMR cluster is created. For more information, see Create a cluster.
Limits
If you use a RAM user, you must grant the following permissions to the RAM user. For more information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.
{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "cms:DescribeContactGroupList",
        "cms:DescribeMetricMetaList",
        "cms:PutResourceMetricRules",
        "cms:DescribeMetricRuleList"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
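The policy above can also be generated and checked programmatically before it is attached to a RAM user. The following is a minimal sketch that uses only the Python standard library. The helper names `build_policy` and `missing_actions` are hypothetical and are not part of any Alibaba Cloud SDK. The check compares action names literally and does not expand wildcards such as `cms:*`.

```python
import json

# The four CloudMonitor actions listed in the policy above.
REQUIRED_ACTIONS = {
    "cms:DescribeContactGroupList",
    "cms:DescribeMetricMetaList",
    "cms:PutResourceMetricRules",
    "cms:DescribeMetricRuleList",
}


def build_policy(actions=REQUIRED_ACTIONS):
    """Return the policy document as a dict (hypothetical helper)."""
    return {
        "Version": "1",
        "Statement": [
            {"Action": sorted(actions), "Resource": "*", "Effect": "Allow"}
        ],
    }


def missing_actions(policy):
    """Return the required actions that no Allow statement grants.

    Action names are compared literally; wildcards are not expanded.
    """
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            # A single action may be written as a plain string.
            actions = [actions]
        granted.update(actions)
    return REQUIRED_ACTIONS - granted


if __name__ == "__main__":
    # Print the policy in a form that can be pasted into a policy editor.
    print(json.dumps(build_policy(), indent=2))
```

An empty result from `missing_actions` indicates that the policy document lists all four actions explicitly.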
Create alert rules
Create alert rules by using a template
Go to the Alert Management subtab.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select a region and a resource group based on your business requirements.
On the EMR on ECS page, click the ID of the desired cluster.
On the page that appears, click the Monitoring and Diagnostics tab.
Click the Alert Management subtab.
On the Alert Management subtab, click Create Alert Rules.
In the Create Alert Rules panel, find the desired service and click Create Alert Rules in the Actions column.
Configure parameters and click Create. The following table describes the parameters.
Parameter | Description |
Rule Description | The description of the alert rules in the template. You can view the metric names and change the default thresholds of the metrics. For information about the services to which the template applies and the metric descriptions, see Services in alert rule templates. |
Mute Period | The interval at which the alert notification is resent before the alert is cleared. |
Validity Period | The period during which the alert rules are valid. The system monitors data based on the alert rules only within this period. |
Alert Contact Group | The alert contact groups to which alert notifications are sent. |
Alert notification method | The methods used to send alert notifications, and the alert email subject. Supported alert notification methods: (1) Phone Call, Text Message, Email, and DingTalk Chatbot; (2) Text Message, Email, and DingTalk Chatbot; (3) Email and DingTalk Chatbot. Alert Email Subject: optional. If you specify this parameter, the specified content is included in the alert notification email. |
Callback URL | The callback URL that can be accessed over the Internet. CloudMonitor sends a POST request to push an alert to the callback URL that you specify. Only HTTP requests are supported. |
After you create an alert rule, the rule takes effect on the instances in the cluster. You can view the created alert rules on the Alert Management subtab.
You can also click Manage Alert Rules to go to the CloudMonitor console to view or modify alert rules.
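The Mute Period and Validity Period parameters can be easier to reason about as small predicates. The following sketch is illustrative only and does not describe how CloudMonitor is implemented. It assumes that the validity period is a daily time window and that the first notification for an alert is sent immediately. The function names are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def in_validity_period(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily window [start, end].

    Assumption: the validity period is a daily time-of-day window.
    """
    current = now.time()
    if start <= end:
        return start <= current <= end
    # Window that wraps past midnight, for example 22:00 to 06:00.
    return current >= start or current <= end


def should_notify(now: datetime,
                  last_sent: Optional[datetime],
                  mute_period: timedelta) -> bool:
    """Decide whether to send a notification for an alert that is still active.

    Assumption: the first notification is sent immediately, and later ones
    are resent only after the mute period has elapsed.
    """
    if last_sent is None:
        return True
    return now - last_sent >= mute_period
```

For example, with a mute period of one hour, an alert that stays active after a notification at 09:00 is not notified again at 09:30 but is notified again at 10:00.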
Create custom alert rules
Go to the Alert Management subtab.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select a region and a resource group based on your business requirements.
On the EMR on ECS page, click the ID of the desired cluster.
On the page that appears, click the Monitoring and Diagnostics tab.
Click the Alert Management subtab.
On the Alert Management subtab, click Create Alert Rules.
In the Create Alert Rules panel, click Create Custom Rule.
Configure parameters and click Create. The following table describes the parameters.
Parameter | Description |
Alert Rule | The name and content of an alert rule. This parameter specifies the condition that triggers an alert. You can click Add Alert Rule to create multiple alert rules. Note: For information about the EMR metrics in alert rules, see CloudMonitor metrics. |
Mute Period | The interval at which the alert notification is resent before the alert is cleared. |
Validity Period | The period during which the alert rules are valid. The system monitors data based on the alert rules only within this period. |
Alert Contact Group | The alert contact groups to which alert notifications are sent. |
Alert notification method | The methods used to send alert notifications, and the alert email subject. Supported alert notification methods: (1) Phone Call, Text Message, Email, and DingTalk Chatbot; (2) Text Message, Email, and DingTalk Chatbot; (3) Email and DingTalk Chatbot. Alert Email Subject: optional. If you specify this parameter, the specified content is included in the alert notification email. |
Callback URL | The callback URL that can be accessed over the Internet. CloudMonitor sends a POST request to push an alert to the callback URL that you specify. Only HTTP requests are supported. This parameter is optional. |
After you create an alert rule, the rule takes effect on the instances in the cluster. You can view the created alert rules on the Alert Management subtab.
You can also click Manage Alert Rules to go to the CloudMonitor console to view or modify alert rules.
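If you specify a callback URL, the endpoint must accept the HTTP POST requests that CloudMonitor sends. The following is a minimal sketch of such a receiver, written with the Python standard library. It assumes that the request body is form-encoded, which this topic does not state, so verify the payload format against the CloudMonitor documentation. The helper names, the port, and any field names are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def parse_callback_body(body: bytes) -> dict:
    """Decode a form-encoded POST body into a flat dict.

    Assumption: the callback body is form-encoded. If a field is
    repeated, the last value wins.
    """
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    return {name: values[-1] for name, values in fields.items()}


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and logs the decoded alert fields."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = parse_callback_body(self.rfile.read(length))
        # Replace this print with your own handling, such as opening a ticket.
        print("received alert fields:", alert)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen on all interfaces over plain HTTP (hypothetical default port)."""
    HTTPServer(("", port), AlertCallbackHandler).serve_forever()
```

Calling `serve()` starts the listener. The host must be reachable over the Internet for the callback to be delivered.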
View alert rules
You can view alert rules on the Alert Management subtab.
Parameter | Description |
Rule Name | The name of the alert rule. |
Status | The status of the alert rule in CloudMonitor. Valid values: OK, Alert, No Data, Disabled, and Enabled. |
Rule Description | The description of the alert rule. An alert is triggered when the conditions of an alert rule are met. |
Alert Contact Group | The alert contact groups to which alert notifications are sent. |
Actions | |
Services in alert rule templates
Service name | Component name | Metric | Description |
Node (Host) | Disk | emr_node_part_max_used | An alert is triggered if the average value of the metric is greater than 80% in two consecutive checks. The check is performed once every minute. |
Node (Host) | CPU | emr_node_cpu_idle | An alert is triggered if the average value of the metric is less than 5% in five consecutive checks. The check is performed once every minute. |
Node (Host) | Memory | emr_node_mem_used_percent | An alert is triggered if the average value of the metric is greater than 90% in two consecutive checks. The check is performed once every minute. |
HDFS | NameNode | hdfs_namenode_jvm_MemHeapUsedM / hdfs_namenode_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks, or if no metric data is generated. The check is performed once every minute. |
HDFS | NameNode | hdfs_namenode_rpc_service_activity_CallQueueLength | An alert is triggered if the average value of the metric is greater than 1000 in two consecutive checks. The check is performed once every minute. |
HDFS | NameNode | hdfs_namenode_fsnamesystem_CorruptBlocks | An alert is triggered if the average value of the metric is greater than 1 in two consecutive checks. The check is performed once every minute. |
HDFS | NameNode | hdfs_namenode_safemode_status | An alert is triggered if the NameNode is in safe mode. The check is performed once every minute. |
HDFS | DataNode | hdfs_datanode_jvm_MemHeapUsedM / hdfs_datanode_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks, or if no metric data is generated. The check is performed once every minute. |
Spark | SparkHistoryServer | spark_history_jvm_old_space_utilization | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks. The check is performed once every minute. |
Spark | SparkThriftServer | spark_thrift_driver_jvm_heap_used / spark_thrift_driver_jvm_heap_max | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks. The check is performed once every minute. |
Hive | HiveMetaStore | hive_metastore_memory_heap_used / hive_metastore_memory_heap_max | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks. The check is performed once every minute. |
Hive | HiveMetaStore | hive_metastore_threads_blocked_count | An alert is triggered if the average value of the metric is greater than 50% in two consecutive checks. The check is performed once every minute. |
Hive | HiveServer2 | hive_server_memory_heap_used / hive_server_memory_heap_max | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks. The check is performed once every minute. |
Hive | HiveServer2 | hive_server_threads_deadlock_count | An alert is triggered if the average value of the metric is greater than 50% in two consecutive checks. The check is performed once every minute. |
YARN | ResourceManager | yarn_cluster_status | An alert is triggered if one of the following conditions is met within the previous five minutes: two or more HA switchovers occur, the status of a node is 1, or the status of a node is always -1. |
YARN | ResourceManager | yarn_resourcemanager_jvm_MemHeapUsedM / yarn_resourcemanager_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks, or if no metric data is generated. The check is performed once every minute. |
YARN | NodeManager | yarn_cluster_unhealthyNodes | An alert is triggered if the average value of the metric is greater than 1 in two consecutive checks. The check is performed once every minute. |
YARN | NodeManager | yarn_nodemanager_jvm_MemHeapUsedM / yarn_nodemanager_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks, or if no metric data is generated. The check is performed once every minute. |
YARN | TimelineServer | yarn_timelineserver_jvm_MemHeapUsedM / yarn_timelineserver_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks, or if no metric data is generated. The check is performed once every minute. |
YARN | MRHistoryServer | yarn_jobhistory_jvm_MemHeapUsedM / yarn_jobhistory_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% in two consecutive checks, or if no metric data is generated. The check is performed once every minute. |
Zookeeper | Zookeeper | zk_znode_count | An alert is triggered if the average value of the metric is greater than or equal to 10000 in two consecutive checks. The check is performed once every minute. |
Zookeeper | Zookeeper | zk_watch_count | An alert is triggered if the average value of the metric is greater than or equal to 1000 in two consecutive checks. The check is performed once every minute. |
Kafka | KafkaBroker | Kafka_Broker_kafka_log_LogManager_OfflineLogDirectoryCount | An alert is triggered if the average value of the metric is greater than 0 in two consecutive checks. The check is performed once every minute. |
Kafka | KafkaBroker | Kafka_Broker_kafka_server_ReplicaManager_UnderReplicatedPartitions | An alert is triggered if the average value of the metric is greater than 0 in two consecutive checks. The check is performed once every minute. |
Presto/Trino | Trino | trino_QueryManager_FailedQueries_OneMinute_Count | An alert is triggered if the average value of the metric is greater than or equal to 1 in two consecutive checks. The check is performed once every minute. |
Presto/Trino | Trino | trino_ClusterMemoryPool_name_general_BlockedNodes | An alert is triggered if the average value of the metric is greater than 0 in two consecutive checks. The check is performed once every minute. |
Presto/Trino | Presto | presto_QueryManager_FailedQueries_OneMinute_Count | An alert is triggered if the average value of the metric is greater than or equal to 1 in two consecutive checks. The check is performed once every minute. |
Presto/Trino | Presto | presto_ClusterMemoryPool_name_general_BlockedNodes | An alert is triggered if the average value of the metric is greater than 0 in two consecutive checks. The check is performed once every minute. |
Impala | Impalad | num_waiting_queries | An alert is triggered if the average value of the metric is greater than or equal to 10 in two consecutive checks. The check is performed once every minute. Note: You can adjust the threshold based on the number of concurrent queries supported by the cluster. |
Kudu | kudu-master | kudu_cluster_replica_skew | An alert is triggered if the average value of the metric is greater than or equal to 1000 in two consecutive checks. The check is performed once every minute. Note: You can adjust the threshold based on your business requirements. |
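Most template rules in the preceding table follow the same pattern: a per-minute average is compared with a threshold, and an alert is triggered only after the condition holds for a number of consecutive checks. The following sketch illustrates that pattern for "greater than" rules. It does not show how CloudMonitor evaluates rules internally, and the function names and the `alert_on_no_data` flag are hypothetical.

```python
from typing import Iterable, Optional


def breaches_consecutively(values: Iterable[Optional[float]],
                           threshold: float,
                           times: int,
                           alert_on_no_data: bool = False) -> bool:
    """Return True if `times` consecutive values exceed `threshold`.

    Each value stands for the average of one per-minute check. `None`
    stands for a check that produced no metric data.
    """
    streak = 0
    for value in values:
        if value is None:
            if alert_on_no_data:
                # Rules with the "or no metric data is generated" clause.
                return True
            streak = 0
            continue
        if value > threshold:
            streak += 1
            if streak >= times:
                return True
        else:
            # A single check within the threshold resets the count.
            streak = 0
    return False


def heap_usage_percent(used_mb: float, max_mb: float) -> float:
    """Ratio metric such as MemHeapUsedM / MemHeapMaxM, as a percentage."""
    return 100.0 * used_mb / max_mb
```

For example, `breaches_consecutively([70, 85, 90], threshold=80, times=2)` is true because the last two checks both exceed 80, whereas `[85, 70, 85]` does not trigger because the breaches are not consecutive.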