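AWS CloudWatch is a service that monitors AWS resources and the applications running on AWS in real time. CloudWatch can send alert messages through AWS SNS. After you configure the URL of the Log Service open alert endpoint in AWS SNS, CloudWatch alert messages are delivered to Log Service, and the Log Service alerting system handles noise reduction, notification, and other processing.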
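CloudWatch configuration
CloudWatch alert messages
CloudWatch alerts are classified into static threshold alerts and anomaly detection alerts. The two message types differ in the value of the Trigger field. For more information, see the CloudWatch::Alarm property description.
- In a static threshold alert message, the Trigger field contains fields such as MetricName and Dimensions.
- In an anomaly detection alert message, the Trigger field contains fields such as Metrics, whose value is a list of metric data queries.
- Static threshold alert message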
{ "AlarmName": "test-alert", "AlarmDescription": "this is a test alert", "AWSAccountId": "123456", "NewStateValue": "ALARM", "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (04/08/21 03:06:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).", "StateChangeTime": "2021-08-04T03:10:10.215+0000", "Region": "US East (Ohio)", "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:test-alert", "OldStateValue": "OK", "Trigger": { "MetricName": "NumberOfMessagesPublished", "Namespace": "AWS/SNS", "StatisticType": "Statistic", "Statistic": "SUM", "Unit": null, "Dimensions": [ { "value": "my-topic", "name": "TopicName" } ], "Period": 60, "EvaluationPeriods": 1, "ComparisonOperator": "GreaterThanOrEqualToThreshold", "Threshold": 1.0, "TreatMissingData": "- TreatMissingData: missing", "EvaluateLowSampleCountPercentile": "" } }
- Anomaly detection alert message
{ "AlarmName": "cpu alrm", "AlarmDescription": "this is a cpu alarm", "AWSAccountId": "123456", "NewStateValue": "INSUFFICIENT_DATA", "NewStateReason": "Threshold Crossed: no datapoints were received for 2 periods and 2 missing datapoints were treated as [Breaching].", "StateChangeTime": "2021-08-05T08:38:47.104+0000", "Region": "US East (Ohio)", "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm", "OldStateValue": "OK", "Trigger": { "Period": 60, "EvaluationPeriods": 2, "ComparisonOperator": "GreaterThanUpperThreshold", "ThresholdMetricId": "ad1", "TreatMissingData": "- TreatMissingData: breaching", "EvaluateLowSampleCountPercentile": "", "Metrics": [ { "Id": "m1", "MetricStat": { "Metric": { "Dimensions": [ { "value": "i-1a2b3c4d", "name": "InstanceId" } ], "MetricName": "CPUUtilization", "Namespace": "AWS/EC2" }, "Period": 60, "Stat": "Average" }, "ReturnData": true }, { "Expression": "ANOMALY_DETECTION_BAND(m1, 0.1)", "Id": "ad1", "Label": "CPUUtilization (expected)", "ReturnData": true } ] } }
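The distinction between the two message types can be sketched in code. The following is a minimal illustration, not the Log Service implementation: the function name `classify_cloudwatch_alarm` is hypothetical, and it assumes that the presence of `Metrics` in `Trigger` is enough to identify an anomaly detection alert, as in the samples above. The returned strings reuse the `__cloud_watch_alert_type__` values that appear in the mapped alerts in this topic.

```python
import json


def classify_cloudwatch_alarm(message: dict) -> str:
    """Classify a CloudWatch alarm message by inspecting its Trigger field.

    Hypothetical helper: assumes a Trigger with "Metrics" is an anomaly
    detection alert and a Trigger with "MetricName" is a static threshold alert.
    """
    trigger = message.get("Trigger") or {}
    if "Metrics" in trigger:
        return "AnomalyDetection"
    if "MetricName" in trigger:
        return "StaticThreshold"
    raise ValueError("Trigger contains neither Metrics nor MetricName")


# Trimmed-down versions of the two sample messages above.
static_msg = json.loads(
    '{"AlarmName": "test-alert", "Trigger": {"MetricName": '
    '"NumberOfMessagesPublished", "Dimensions": '
    '[{"value": "my-topic", "name": "TopicName"}]}}'
)
anomaly_msg = json.loads(
    '{"AlarmName": "cpu alrm", "Trigger": {"ThresholdMetricId": "ad1", '
    '"Metrics": [{"Id": "m1", "ReturnData": true}, '
    '{"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 0.1)"}]}}'
)

print(classify_cloudwatch_alarm(static_msg))
print(classify_cloudwatch_alarm(anomaly_msg))
```

Checking `Metrics` first is a design choice for this sketch only. Other CloudWatch alarm types may also carry a `Metrics` list, so a production check might need additional conditions.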
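Alert message mapping
After a CloudWatch alert is ingested into Log Service, it is mapped to Log Service alert content. Examples:
- Static threshold alert message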
{ "aliuid": "aliuid1", "alert_instance_id": "{auto-generated}", "alert_id": "CloudWatch_test-alert", "alert_type": "sls_pub", "alert_name": "test-alert", "region": "{region of the alert center project}", "project": "{project to which the alert center belongs}", "project_id": 0, "next_eval_interval": 60, "alert_time": 1628046610, "fire_time": 1628046610, "fire_results": null, "fire_results_count": 0, "resolve_time": 0, "status": "firing", "results": null, "labels": { "TopicName": "my-topic", "__comparison_operator__": "GreaterThanOrEqualToThreshold", "__statistic__": "SUM", "__statistic_type__": "Statistic", "__threshold__": "1", "metric_name": "NumberOfMessagesPublished" }, "annotations": { "__alarm_arn__": "arn:aws:cloudwatch:us-east-2:123456:alarm:test-alert", "__aws_accountId__": "123456", "__aws_region__": "US East (Ohio)", "__cloud_watch_alert_type__": "StaticThreshold", "__config_app__": "sls_pub_alert", "__pub_alert_app__": "{open alert application ID}", "__pub_alert_protocol__": "cloud_watch", "__pub_alert_region__": "{region of the endpoint that receives alert messages}", "__pub_alert_service__": "{open alert service ID}", "desc": "this is a test alert", "title": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (04/08/21 03:06:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)." }, "severity": 10, "policy": { "alert_policy_id": "{ID of the alert policy configured in the open alert application}", "action_policy_id": "{ID of the action policy configured in the open alert application}", "use_default": false, "repeat_interval": "{repeat interval configured in the open alert application}" }, "template": null, "drill_down_query": "https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/test-alert" }
- Anomaly detection alert message
{ "aliuid": "aliuid1", "alert_instance_id": "{auto-generated}", "alert_id": "CloudWatch_cpu alrm", "alert_type": "sls_pub", "alert_name": "cpu alrm", "region": "{region of the alert center project}", "project": "{project to which the alert center belongs}", "project_id": 0, "next_eval_interval": 120, "alert_time": 1628152727, "fire_time": 1628152727, "fire_results": null, "fire_results_count": 0, "resolve_time": 0, "status": "firing", "results": null, "labels": { "__comparison_operator__": "GreaterThanUpperThreshold", "__threshold_metricId__": "ad1" }, "annotations": { "__alarm_arn__": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm", "__aws_accountId__": "123456", "__aws_region__": "US East (Ohio)", "__cloud_watch_alert_type__": "AnomalyDetection", "__config_app__": "sls_pub_alert", "__pub_alert_app__": "{open alert application ID}", "__pub_alert_protocol__": "cloud_watch", "__pub_alert_region__": "{region of the endpoint that receives alert messages}", "__pub_alert_service__": "{open alert service ID}", "desc": "this is a cpu alarm", "title": "Threshold Crossed: no datapoints were received for 2 periods and 2 missing datapoints were treated as [Breaching]." }, "severity": 8, "policy": { "alert_policy_id": "{ID of the alert policy configured in the open alert application}", "action_policy_id": "{ID of the action policy configured in the open alert application}", "use_default": false, "repeat_interval": "{repeat interval configured in the open alert application}" }, "template": null, "drill_down_query": "https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/cpu%20alrm" }
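In both mapped examples, the drill_down_query value can be reproduced from the alarm ARN alone. The sketch below shows one way to do that. It is inferred from the two samples, not taken from the Log Service implementation, and the function name `build_drill_down_url` is hypothetical. It assumes the region is the fourth colon-separated part of the ARN and that the alarm name is percent-encoded.

```python
from urllib.parse import quote


def build_drill_down_url(alarm_arn: str) -> str:
    """Build a CloudWatch console URL from an alarm ARN (hypothetical helper).

    Assumed ARN layout:
    arn:aws:cloudwatch:<region>:<account>:alarm:<alarm name>
    """
    # Split at most 6 times so that colons inside the alarm name are preserved.
    parts = alarm_arn.split(":", 6)
    region, alarm_name = parts[3], parts[6]
    return (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(alarm_name, safe='')}"
    )


print(build_drill_down_url("arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm"))
```

For the ARN of the "cpu alrm" alarm, the space in the alarm name is encoded as %20, which matches the drill_down_query value in the anomaly detection example.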
The following table describes how the fields of a Log Service alert message map to the fields of a CloudWatch alert message.
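Log Service field | CloudWatch field | Description |
---|---|---|
aliuid | None | The ID of the Alibaba Cloud account that owns the open alert application used to ingest the alert. |
alert_id | None | The ID of the alert monitoring rule. The value is CloudWatch_{$alert_name}, where {$alert_name} is the name of the alert monitoring rule. |
alert_type | None | The alert type. The value is fixed to sls_pub. |
alert_name | AlarmName | The name of the alert monitoring rule. |
status | NewStateValue | The alert status. |
next_eval_interval | Period, EvaluationPeriods | The alert evaluation interval, which is the product of the Period and EvaluationPeriods values in the CloudWatch alert message. |
alert_time | StateChangeTime | The time when the alert was triggered. |
fire_time | StateChangeTime | The time when the alert was first triggered. |
resolve_time | StateChangeTime | The time when the alert was resolved. |
labels | None | Label information. |
annotations | None | Annotation information. Log Service adds fields to annotations, such as the __alarm_arn__, __aws_accountId__, __aws_region__, and __cloud_watch_alert_type__ fields shown in the examples above. |
severity | NewStateValue | The alert severity. |
policy | None | The alert policy that you configured in the open alert application. For more information, see Policy structure. |
project | None | The project to which the alert center belongs. For more information, see Project. |
drill_down_query | None | The URL of the corresponding CloudWatch alert. |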
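Several rows of the table describe simple derivations that can be written out directly. The following sketch covers alert_id, alert_name, next_eval_interval, and alert_time only. It is an illustration under assumptions, not the Log Service implementation: the function name `map_basic_fields` is hypothetical, and it assumes alert_time is the StateChangeTime value expressed as a Unix timestamp in seconds, which is consistent with the examples in this topic.

```python
from datetime import datetime


def map_basic_fields(message: dict) -> dict:
    """Derive a few Log Service alert fields from a CloudWatch message.

    Hypothetical helper covering only the fields whose derivation is
    described in the mapping table.
    """
    trigger = message["Trigger"]
    # StateChangeTime looks like 2021-08-04T03:10:10.215+0000.
    changed = datetime.strptime(
        message["StateChangeTime"], "%Y-%m-%dT%H:%M:%S.%f%z"
    )
    return {
        "alert_id": f"CloudWatch_{message['AlarmName']}",
        "alert_name": message["AlarmName"],
        # Evaluation interval = Period x EvaluationPeriods.
        "next_eval_interval": trigger["Period"] * trigger["EvaluationPeriods"],
        # Assumption: Unix timestamp in seconds, truncated.
        "alert_time": int(changed.timestamp()),
    }


# Values taken from the static threshold sample message.
sample = {
    "AlarmName": "test-alert",
    "StateChangeTime": "2021-08-04T03:10:10.215+0000",
    "Trigger": {"Period": 60, "EvaluationPeriods": 1},
}
print(map_basic_fields(sample))
```

With the static threshold sample, this yields alert_id CloudWatch_test-alert, next_eval_interval 60, and alert_time 1628046610, which agree with the mapped example.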