E-MapReduce: Manage alert rules

Last Updated: Sep 02, 2024

E-MapReduce (EMR) allows you to create alert rules to monitor the usage of service resources in EMR clusters. If resource metrics meet the conditions of an alert rule, alerts are triggered and CloudMonitor sends alert notifications. This helps you identify and handle exceptions in monitored clusters promptly. This topic describes how to create and view alert rules in the EMR console.

Background information

The alerting feature is provided by CloudMonitor. You can manage alert rules or use more monitoring and alerting features in the CloudMonitor console. For more information, see What is CloudMonitor?

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Limits

If you use a RAM user, you must grant the following permissions to the RAM user. For more information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.

{
    "Version": "1",
    "Statement": [
        {
            "Action": [
                "cms:DescribeContactGroupList",
                "cms:DescribeMetricMetaList",
                "cms:PutResourceMetricRules",
                "cms:DescribeMetricRuleList"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
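
Before you attach a policy to a RAM user, it can help to confirm that the policy grants every action listed above. The following Python snippet is a minimal sketch of such a check. The names `REQUIRED_ACTIONS` and `missing_actions` are illustrative and are not part of any Alibaba Cloud SDK. The check looks only at actions that are listed explicitly in `Allow` statements and does not evaluate wildcards or `Resource` scoping.

```python
import json

# The four actions listed in the policy above.
REQUIRED_ACTIONS = {
    "cms:DescribeContactGroupList",
    "cms:DescribeMetricMetaList",
    "cms:PutResourceMetricRules",
    "cms:DescribeMetricRuleList",
}


def missing_actions(policy_json: str) -> set[str]:
    """Return the required actions that no Allow statement lists explicitly."""
    policy = json.loads(policy_json)
    granted: set[str] = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        # A single action may be written as a string instead of a list.
        if isinstance(actions, str):
            actions = [actions]
        granted.update(actions)
    return REQUIRED_ACTIONS - granted


if __name__ == "__main__":
    policy_text = json.dumps({
        "Version": "1",
        "Statement": [{
            "Action": sorted(REQUIRED_ACTIONS),
            "Resource": "*",
            "Effect": "Allow",
        }],
    })
    # An empty set means that every required action is granted.
    print(missing_actions(policy_text))
```

An empty result means that all four actions are present. Any action names that are returned still need to be added to the policy.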

Create alert rules

Create alert rules by using a template

  1. Go to the Alert Management subtab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select a region and a resource group based on your business requirements.

    3. On the EMR on ECS page, click the ID of the desired cluster.

    4. On the page that appears, click the Monitoring and Diagnostics tab.

    5. Click the Alert Management subtab.

  2. On the Alert Management subtab, click Create Alert Rules.

  3. In the Create Alert Rules panel, find the desired service and click Create Alert Rules in the Actions column.

  4. Configure parameters and click Create. The following table describes the parameters.

    Parameter

    Description

    Rule Description

    The description of the alert rules in the template. You can view the metric names and change the default thresholds of the metrics.

    For information about the services to which the template applies and the metric descriptions, see Services in alert rule templates.

    Mute Period

    The interval at which the alert notification is resent before the alert is cleared.

    Validity Period

    The period during which the alert rules are valid. The system monitors data based on the alert rules only within the validity period.

    Alert Contact Group

    The alert contact groups to which alert notifications are sent.

    Alert notification method

    The methods that you want to use to send alert notifications and the alert email subject. Supported alert notification methods:

    • Phone Call, Text Message, Email, and DingTalk Chatbot

    • Text Message, Email, and DingTalk Chatbot

    • Email and DingTalk Chatbot

    Alert Email Subject: optional. If you specify the alert email subject, the specified remarks are included in the alert notification email.

    Callback URL

    The callback URL that can be accessed over the Internet. CloudMonitor sends a POST request to push an alert to the callback URL that you specify. Only HTTP requests are supported.

    After you create an alert rule, the rule takes effect on the instances in the cluster. You can view the created alert rules on the Alert Management subtab.

    You can also click Manage Alert Rules to go to the CloudMonitor console to view or modify alert rules.
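
To try out the Callback URL parameter, you need an HTTP endpoint that accepts POST requests. The following Python snippet is a minimal sketch of such a receiver that uses only the standard library. This topic does not describe the format of the pushed alert, so the sketch makes no assumption about field names: it decodes the body as JSON if the content type indicates JSON and as URL-encoded form data otherwise. The names `decode_callback_body`, `AlertCallbackHandler`, and `run_server`, and the port 8080, are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def decode_callback_body(content_type: str, raw: bytes) -> dict:
    """Decode a POST body as JSON or as URL-encoded form data."""
    text = raw.decode("utf-8")
    if "application/json" in content_type.lower():
        return json.loads(text)
    # Fall back to form decoding and keep the first value of each field.
    return {key: values[0] for key, values in parse_qs(text).items()}


class AlertCallbackHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that logs each pushed alert."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        alert = decode_callback_body(self.headers.get("Content-Type", ""), raw)
        print("Received alert callback:", alert)
        # Acknowledge the request with 200 OK.
        self.send_response(200)
        self.end_headers()


def run_server(port: int = 8080) -> None:
    """Serve plain HTTP on all interfaces until the process is stopped."""
    HTTPServer(("0.0.0.0", port), AlertCallbackHandler).serve_forever()
```

Calling `run_server()` starts the receiver. The endpoint must be reachable over the Internet and must accept plain HTTP, as described for the Callback URL parameter.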

Create custom alert rules

  1. Go to the Alert Management subtab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select a region and a resource group based on your business requirements.

    3. On the EMR on ECS page, click the ID of the desired cluster.

    4. On the page that appears, click the Monitoring and Diagnostics tab.

    5. Click the Alert Management subtab.

  2. On the Alert Management subtab, click Create Alert Rules.

  3. In the Create Alert Rules panel, click Create Custom Rule.

  4. Configure parameters and click Create. The following table describes the parameters.

    Parameter

    Description

    Alert Rule

    The name and content of an alert rule.

    This parameter specifies the condition that triggers an alert.

    Note
    • For information about the EMR metrics in alert rules, see CloudMonitor metrics.

    • You can click Add Alert Rule to create multiple alert rules.

    Mute Period

    The interval at which the alert notification is resent before the alert is cleared.

    Validity Period

    The period during which the alert rules are valid. The system monitors data based on the alert rules only within the validity period.

    Alert Contact Group

    The alert contact groups to which alert notifications are sent.

    Alert notification method

    The methods that you want to use to send alert notifications and the alert email subject. Supported alert notification methods:

    • Phone Call, Text Message, Email, and DingTalk Chatbot

    • Text Message, Email, and DingTalk Chatbot

    • Email and DingTalk Chatbot

    Alert Email Subject: optional. If you specify the alert email subject, the specified remarks are included in the alert notification email.

    Callback URL

    The callback URL that can be accessed over the Internet. CloudMonitor sends a POST request to push an alert to the callback URL that you specify. Only HTTP requests are supported. This parameter is optional.

    After you create an alert rule, the rule takes effect on the instances in the cluster. You can view the created alert rules on the Alert Management subtab.

    You can also click Manage Alert Rules to go to the CloudMonitor console to view or modify alert rules.
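
The Mute Period and Validity Period parameters can be easier to reason about in code. The following Python snippet is a sketch of one reading of the descriptions above, not the CloudMonitor implementation. It assumes that the validity period is a daily time window and that the first notification for an alert is sent immediately. The function names are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def in_validity_period(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily window [start, end]."""
    # Assumption: the validity period is a same-day window such as 09:00-18:00.
    return start <= now.time() <= end


def should_notify(now: datetime,
                  last_sent: Optional[datetime],
                  mute_period: timedelta) -> bool:
    """Send the first notification at once, then resend after each mute period."""
    if last_sent is None:
        return True
    return now - last_sent >= mute_period
```

For example, with a mute period of one hour, an alert that stays active is notified again no earlier than one hour after the previous notification. Outside the validity period, the rule is not evaluated at all.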

View alert rules

You can view alert rules on the Alert Management subtab.

Parameter

Description

Rule Name

The name of the alert rule.

Status

The status of the alert rule in CloudMonitor. Valid values: OK, Alert, No Data, Disabled, and Enabled.

Rule Description

The description of the alert rule. An alert is triggered when the conditions of an alert rule are met.

Alert Contact Group

The alert contact groups to which alert notifications are sent.

Actions

  • Details: You can click Details to go to the CloudMonitor console to view the details of an alert rule, such as the alert contact groups, the alert history, and alert resources.

  • Edit Rule: You can click Edit Rule to go to the CloudMonitor console to modify the parameters that are configured for the alert rule.

Services in alert rule templates

Unless otherwise stated, the check for each metric is performed once every minute.

| Service name | Component name | Metric | Description |
| --- | --- | --- | --- |
| Node (Host) | Disk | emr_node_part_max_used | An alert is triggered if the average value of the metric is greater than 80% two consecutive times. |
| Node (Host) | CPU | emr_node_cpu_idle | An alert is triggered if the average value of the metric is less than 5% five consecutive times. |
| Node (Host) | Memory | emr_node_mem_used_percent | An alert is triggered if the average value of the metric is greater than 90% two consecutive times. |
| HDFS | NameNode | hdfs_namenode_jvm_MemHeapUsedM / hdfs_namenode_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% two consecutive times, or if no metric data is generated. |
| HDFS | NameNode | hdfs_namenode_rpc_service_activity_CallQueueLength | An alert is triggered if the average value of the metric is greater than 1000 two consecutive times. |
| HDFS | NameNode | hdfs_namenode_fsnamesystem_CorruptBlocks | An alert is triggered if the average value of the metric is greater than 1 two consecutive times. |
| HDFS | NameNode | hdfs_namenode_safemode_status | An alert is triggered if the NameNode is in safe mode. |
| HDFS | DataNode | hdfs_datanode_jvm_MemHeapUsedM / hdfs_datanode_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% two consecutive times, or if no metric data is generated. |
| Spark | SparkHistoryServer | spark_history_jvm_old_space_utilization | An alert is triggered if the average value of the metric is greater than 95% two consecutive times. |
| Spark | SparkThriftServer | spark_thrift_driver_jvm_heap_used / spark_thrift_driver_jvm_heap_max | An alert is triggered if the average value of the metric is greater than 95% two consecutive times. |
| Hive | HiveMetaStore | hive_metastore_memory_heap_used / hive_metastore_memory_heap_max | An alert is triggered if the average value of the metric is greater than 95% two consecutive times. |
| Hive | HiveMetaStore | hive_metastore_threads_blocked_count | An alert is triggered if the average value of the metric is greater than 50% two consecutive times. |
| Hive | HiveServer2 | hive_server_memory_heap_used / hive_server_memory_heap_max | An alert is triggered if the average value of the metric is greater than 95% two consecutive times. |
| Hive | HiveServer2 | hive_server_threads_deadlock_count | An alert is triggered if the average value of the metric is greater than 50% two consecutive times. |
| YARN | ResourceManager | yarn_cluster_status | An alert is triggered if one of the following conditions is met in the previous five minutes: two or more HA switchovers occur, the status of a node is 1, or the status of a node is always -1. |
| YARN | ResourceManager | yarn_resourcemanager_jvm_MemHeapUsedM / yarn_resourcemanager_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% two consecutive times, or if no metric data is generated. |
| YARN | NodeManager | yarn_cluster_unhealthyNodes | An alert is triggered if the average value of the metric is greater than 1 two consecutive times. |
| YARN | NodeManager | yarn_nodemanager_jvm_MemHeapUsedM / yarn_nodemanager_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% two consecutive times, or if no metric data is generated. |
| YARN | TimelineServer | yarn_timelineserver_jvm_MemHeapUsedM / yarn_timelineserver_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% two consecutive times, or if no metric data is generated. |
| YARN | MRHistoryServer | yarn_jobhistory_jvm_MemHeapUsedM / yarn_jobhistory_jvm_MemHeapMaxM | An alert is triggered if the average value of the metric is greater than 95% two consecutive times, or if no metric data is generated. |
| Zookeeper | Zookeeper | zk_znode_count | An alert is triggered if the average value of the metric is greater than or equal to 10000 two consecutive times. |
| Zookeeper | Zookeeper | zk_watch_count | An alert is triggered if the average value of the metric is greater than or equal to 1000 two consecutive times. |
| Kafka | KafkaBroker | Kafka_Broker_kafka_log_LogManager_OfflineLogDirectoryCount | An alert is triggered if the average value of the metric is greater than 0 two consecutive times. |
| Kafka | KafkaBroker | Kafka_Broker_kafka_server_ReplicaManager_UnderReplicatedPartitions | An alert is triggered if the average value of the metric is greater than 0 two consecutive times. |
| Presto/Trino | Trino | trino_QueryManager_FailedQueries_OneMinute_Count | An alert is triggered if the average value of the metric is greater than or equal to 1 two consecutive times. |
| Presto/Trino | Trino | trino_ClusterMemoryPool_name_general_BlockedNodes | An alert is triggered if the average value of the metric is greater than 0 two consecutive times. |
| Presto/Trino | Presto | presto_QueryManager_FailedQueries_OneMinute_Count | An alert is triggered if the average value of the metric is greater than or equal to 1 two consecutive times. |
| Presto/Trino | Presto | presto_ClusterMemoryPool_name_general_BlockedNodes | An alert is triggered if the average value of the metric is greater than 0 two consecutive times. |
| Impala | Impalad | num_waiting_queries | An alert is triggered if the average value of the metric is greater than or equal to 10 two consecutive times. Note: You can adjust the threshold based on the number of concurrent queries supported by the cluster. |
| Kudu | kudu-master | kudu_cluster_replica_skew | An alert is triggered if the average value of the metric is greater than or equal to 1000 two consecutive times. Note: You can adjust the threshold based on your business requirements. |
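
Most template rules follow the same pattern: a threshold condition on the per-minute average of a metric must hold a given number of consecutive times. The following Python snippet is a minimal sketch of that evaluation logic and is not how CloudMonitor is implemented. It assumes that each sample is the average value for one check, ordered from oldest to newest, and it does not model the "no metric data" case. The function names are hypothetical.

```python
from typing import Callable, Sequence


def alert_triggered(samples: Sequence[float],
                    condition: Callable[[float], bool],
                    consecutive: int) -> bool:
    """Return True if the most recent `consecutive` samples all meet the condition."""
    if consecutive <= 0 or len(samples) < consecutive:
        return False
    return all(condition(value) for value in samples[-consecutive:])


def heap_usage_percent(used_mb: float, max_mb: float) -> float:
    """Heap usage as a percentage, for ratio metrics such as MemHeapUsedM / MemHeapMaxM."""
    return 100.0 * used_mb / max_mb


# Disk rule: greater than 80% two consecutive times.
disk_alert = alert_triggered([70.0, 82.0, 85.0], lambda v: v > 80, 2)

# CPU rule: idle less than 5% five consecutive times. One sample of 6% breaks the run.
cpu_alert = alert_triggered([4.0, 3.0, 6.0, 2.0, 1.0], lambda v: v < 5, 5)
```

In this example, `disk_alert` is `True` because the last two samples exceed 80%, and `cpu_alert` is `False` because the third sample is not below 5%.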