The multi-cluster alert management feature allows you to create and modify alert rules on a Fleet instance. After you create or modify an alert rule, the Fleet instance distributes the alert rule to the associated clusters that you specify. This allows the associated clusters to use the same alert rule. The Fleet instance can also distribute alert rules to clusters that are newly associated with the Fleet instance. This topic describes how to use the multi-cluster alert management feature.
Prerequisites
The Fleet management feature is enabled. For more information, see Enable multi-cluster management.
Two clusters (the service provider cluster and service consumer cluster) are associated with the Fleet instance. For more information, see Associate clusters with a Fleet instance.
The kubeconfig file of the Fleet instance is obtained in the Distributed Cloud Container Platform for Kubernetes (ACK One) console and a kubectl client is connected to the Fleet instance.
The latest version of Alibaba Cloud CLI is installed and configured.
Components required for multi-cluster alert management are installed in the clusters that you want to manage. For more information, see the Install and update the components section of the "Alert management" topic.
Background information
In multi-cluster scenarios, each cluster typically needs to use the same alert rules. If you want to modify an alert rule, you must log on to each cluster and modify the rule separately. This process is complicated and prone to errors. You can use the multi-cluster alert management feature to centrally configure alert rules for multiple clusters that are associated with a Fleet instance. You only need to create alert rules on the Fleet instance. The rules specify the anomalies that trigger alerts and the contacts who receive alert notifications. For more information, see Alert management. The following figure shows how multi-cluster alert management is implemented.
Step 1: Create a contact and a contact group
Perform the following steps to create a contact and a contact group. After you create the contact and the contact group, they can be used by all Container Service for Kubernetes (ACK) clusters that belong to your Alibaba Cloud account.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. On the cluster details page, open the Alerts page from the left-side navigation pane.
Note: If this is the first time that you use the Alerts component, the console prompts you to install the component. If Component needs upgrading is displayed, click Upgrade. After the update is complete, the Alerts page appears.
On the Alerts page, perform the following steps to create a contact and a contact group.
Click the Alert Contacts tab and click Create.
In the Create Alert Contact panel, configure the Name, Phone Number, and Email parameters. Then, click OK.
After the contact is created, the system sends a text message or email to the contact for activation. Activate the contact as prompted.
Click the Alert Contact Groups tab and click Create.
In the Create Alert Contact Group panel, set the Group Name parameter, select contacts in the Contacts section, and then click OK.
You can add contacts to or remove contacts from the Selected Contacts column.
Step 2: Obtain the contact group ID
Run the following command in Alibaba Cloud CLI to query the ID of the contact group that you created.
aliyun cs GET /alert/contact_groups

{
  "contact_groups": [
    {
      "ali_uid": 14783****,
      "binding_info": "{\"sls_id\":\"ack_14783****_***\",\"cms_contact_group_name\":\"ack_Default Contact Group\",\"arms_id\":\"1****\"}",
      "contacts": null,
      "created": "2021-07-21T12:18:34+08:00",
      "group_contact_ids": [
        2***
      ],
      "group_name": "Default Contact Group",
      "id": 3***,
      "updated": "2022-09-19T19:23:57+08:00"
    }
  ],
  "page_info": {
    "page_number": 1,
    "page_size": 100,
    "total_count": 1
  }
}
Set the contactGroups parameter based on the information in the output.
contactGroups:
- arms_contact_group_id: "1****"                    # Set to the value of the contact_groups.binding_info.arms_id field in the output.
  cms_contact_group_name: ack_Default Contact Group # Set to the value of the contact_groups.binding_info.cms_contact_group_name field in the output.
  id: "3***"                                        # Set to the value of the contact_groups.id field in the output.
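If you prefer to extract these values from the CLI output programmatically, the following is a minimal sketch. It assumes that the jq tool is installed; jq is not part of the official procedure.
# A sketch (assumes jq is installed): print the group ID, the ARMS contact group ID,
# and the CloudMonitor contact group name for each contact group.
aliyun cs GET /alert/contact_groups \
  | jq -r '.contact_groups[] | [(.id | tostring), (.binding_info | fromjson | .arms_id), (.binding_info | fromjson | .cms_contact_group_name)] | @tsv'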
Step 3: Create alert rules
You can create an alert rule based on the following template. The template includes all alert rules supported by ACK. This step describes how to enable the error-events rule.
The name of the alert rule must be default. The namespace of the alert rule must be kube-system. For more information about the supported alert rules, see the Configure alert rules by using CRDs section of the "Alert management" topic.
After you create the alert rule, the error-events rule does not take effect until you create a distribution rule that distributes it to the associated clusters.
Set rules.enable to enable for the error-events rule, add the contactGroups parameter that you configured in the preceding step, and then save the file as ackalertrule.yaml.
Run the kubectl apply -f ackalertrule.yaml command to create the alert rule on the Fleet instance.
The following sample code provides an example:
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
name: default
namespace: kube-system
spec:
groups:
- name: error-events
rules:
- enable: enable
contactGroups:
- arms_contact_group_id: "1****"
cms_contact_group_name: ack_Default Contact Group
id: "3***"
expression: sls.app.ack.error
name: error-event
notification:
message: kubernetes cluster error event.
type: event
- name: warn-events
rules:
- enable: disable
expression: sls.app.ack.warn
name: warn-event
notification:
message: kubernetes cluster warn event.
type: event
- name: cluster-core-error
rules:
- enable: disable
expression: prom.apiserver.notHealthy.down
name: apiserver-unhealthy
notification:
message: "Cluster APIServer not healthy. \nPromQL: ((sum(up{job=\"apiserver\"})
<= 0) or (absent(sum(up{job=\"apiserver\"})))) > 0"
type: metric-prometheus
- enable: disable
expression: prom.etcd.notHealthy.down
name: etcd-unhealthy
notification:
message: "Cluster ETCD not healthy. \nPromQL: ((sum(up{job=\"etcd\"}) <= 0)
or (absent(sum(up{job=\"etcd\"})))) > 0"
type: metric-prometheus
- enable: disable
expression: prom.scheduler.notHealthy.down
name: scheduler-unhealthy
notification:
message: "Cluster Scheduler not healthy. \nPromQL: ((sum(up{job=\"ack-scheduler\"})
<= 0) or (absent(sum(up{job=\"ack-scheduler\"})))) > 0"
type: metric-prometheus
- enable: disable
expression: prom.kcm.notHealthy.down
name: kcm-unhealthy
notification:
message: "Custer kube-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-kube-controller-manager\"})
<= 0) or (absent(sum(up{job=\"ack-kube-controller-manager\"})))) > 0"
type: metric-prometheus
- enable: disable
expression: prom.ccm.notHealthy.down
name: ccm-unhealthy
notification:
message: "Cluster cloud-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-cloud-controller-manager\"})
<= 0) or (absent(sum(up{job=\"ack-cloud-controller-manager\"})))) > 0"
type: metric-prometheus
- enable: disable
expression: prom.coredns.notHealthy.requestdown
name: coredns-unhealthy-requestdown
notification:
message: "Cluster CoreDNS not healthy, continuously request down. \nPromQL:
(sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or
(sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)"
type: metric-prometheus
- enable: disable
expression: prom.coredns.notHealthy.panic
name: coredns-unhealthy-panic
notification:
message: "Cluster CoreDNS not healthy, continuously panic. \nPromQL: sum(rate(coredns_panic_count_total{}[3m]))
> 0"
type: metric-prometheus
- enable: disable
expression: prom.ingress.request.errorRateHigh
name: ingress-err-request
notification:
message: Cluster Ingress Controller request error rate high (default error
rate is 85%).
type: metric-prometheus
- enable: disable
expression: prom.ingress.ssl.expire
name: ingress-ssl-expire
notification:
message: "Cluster Ingress Controller SSL will expire in a few days (default
14 days). \nPromQL: ((nginx_ingress_controller_ssl_expire_time_seconds -
time()) / 24 / 3600) < 14"
type: metric-prometheus
- name: cluster-error
rules:
- enable: disable
expression: sls.app.ack.docker.hang
name: docker-hang
notification:
message: kubernetes node docker hang.
type: event
- enable: disable
expression: sls.app.ack.eviction
name: eviction-event
notification:
message: kubernetes eviction event.
type: event
- enable: disable
expression: sls.app.ack.gpu.xid_error
name: gpu-xid-error
notification:
message: kubernetes gpu xid error event.
type: event
- enable: disable
expression: sls.app.ack.image.pull_back_off
name: image-pull-back-off
notification:
message: kubernetes image pull back off event.
type: event
- enable: disable
expression: sls.app.ack.node.down
name: node-down
notification:
message: kubernetes node down event.
type: event
- enable: disable
expression: sls.app.ack.node.restart
name: node-restart
notification:
message: kubernetes node restart event.
type: event
- enable: disable
expression: sls.app.ack.ntp.down
name: node-ntp-down
notification:
message: kubernetes node ntp down.
type: event
- enable: disable
expression: sls.app.ack.node.pleg_error
name: node-pleg-error
notification:
message: kubernetes node pleg error event.
type: event
- enable: disable
expression: sls.app.ack.ps.hang
name: ps-hang
notification:
message: kubernetes ps hang event.
type: event
- enable: disable
expression: sls.app.ack.node.fd_pressure
name: node-fd-pressure
notification:
message: kubernetes node fd pressure event.
type: event
- enable: disable
expression: sls.app.ack.node.pid_pressure
name: node-pid-pressure
notification:
message: kubernetes node pid pressure event.
type: event
- enable: disable
expression: sls.app.ack.ccm.del_node_failed
name: node-del-err
notification:
message: kubernetes delete node failed.
type: event
- enable: disable
expression: sls.app.ack.ccm.add_node_failed
name: node-add-err
notification:
message: kubernetes add node failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.run_command_fail
name: nlc-run-cmd-err
notification:
message: kubernetes node pool nlc run command failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.empty_task_cmd
name: nlc-empty-cmd
notification:
message: kubernetes node pool nlc delete node failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.url_mode_unimpl
name: nlc-url-m-unimp
notification:
message: kubernetes node pool nlc delete node failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.op_not_found
name: nlc-opt-no-found
notification:
message: kubernetes node pool nlc delete node failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.destroy_node_fail
name: nlc-des-node-err
notification:
message: kubernetes node pool nlc destroy node failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.drain_node_fail
name: nlc-drain-node-err
notification:
message: kubernetes node pool nlc drain node failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.restart_ecs_wait_fail
name: nlc-restart-ecs-wait
notification:
message: kubernetes node pool nlc restart ecs wait timeout.
type: event
- enable: disable
expression: sls.app.ack.nlc.restart_ecs_fail
name: nlc-restart-ecs-err
notification:
message: kubernetes node pool nlc restart ecs failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.reset_ecs_fail
name: nlc-reset-ecs-err
notification:
message: kubernetes node pool nlc reset ecs failed.
type: event
- enable: disable
expression: sls.app.ack.nlc.repair_fail
name: nlc-sel-repair-err
notification:
message: kubernetes node pool nlc self repair failed.
type: event
- name: res-exceptions
rules:
- enable: disable
expression: cms.host.cpu.utilization
name: node_cpu_util_high
notification:
message: kubernetes cluster node cpu utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.host.memory.utilization
name: node_mem_util_high
notification:
message: kubernetes cluster node memory utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.host.disk.utilization
name: node_disk_util_high
notification:
message: kubernetes cluster node disk utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.host.public.network.utilization
name: node_public_net_util_high
notification:
message: kubernetes cluster node public network utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.host.fs.inode.utilization
name: node_fs_inode_util_high
notification:
message: kubernetes cluster node file system inode utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.slb.qps.utilization
name: slb_qps_util_high
notification:
message: kubernetes cluster slb qps utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.slb.traffic.tx.utilization
name: slb_traff_tx_util_high
notification:
message: kubernetes cluster slb traffic utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.slb.max.connection.utilization
name: slb_max_con_util_high
notification:
message: kubernetes cluster max connection utilization too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: percent
value: "85"
type: metric-cms
- enable: disable
expression: cms.slb.drop.connection
name: slb_drop_con_high
notification:
message: kubernetes cluster drop connection count per second too high.
thresholds:
- key: CMS_ESCALATIONS_CRITICAL_Threshold
unit: count
value: "1"
type: metric-cms
- enable: disable
expression: sls.app.ack.node.disk_pressure
name: node-disk-pressure
notification:
message: kubernetes node disk pressure event.
type: event
- enable: disable
expression: sls.app.ack.resource.insufficient
name: node-res-insufficient
notification:
message: kubernetes node resource insufficient.
type: event
- enable: disable
expression: sls.app.ack.ip.not_enough
name: node-ip-pressure
notification:
message: kubernetes ip not enough event.
type: event
- enable: disable
expression: sls.app.ack.csi.no_enough_disk_space
name: disk_space_press
notification:
message: kubernetes csi not enough disk space.
type: event
- name: cluster-scale
rules:
- enable: disable
expression: sls.app.ack.autoscaler.scaleup_group
name: autoscaler-scaleup
notification:
message: kubernetes autoscaler scale up.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.scaledown
name: autoscaler-scaledown
notification:
message: kubernetes autoscaler scale down.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.scaleup_timeout
name: autoscaler-scaleup-timeout
notification:
message: kubernetes autoscaler scale up timeout.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.scaledown_empty
name: autoscaler-scaledown-empty
notification:
message: kubernetes autoscaler scale down empty node.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.scaleup_group_failed
name: autoscaler-up-group-failed
notification:
message: kubernetes autoscaler scale up failed.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.cluster_unhealthy
name: autoscaler-cluster-unhealthy
notification:
message: kubernetes autoscaler error, cluster not healthy.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.delete_started_timeout
name: autoscaler-del-started
notification:
message: kubernetes autoscaler delete node started long ago.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.delete_unregistered
name: autoscaler-del-unregistered
notification:
message: kubernetes autoscaler delete unregistered node.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.scaledown_failed
name: autoscaler-scale-down-failed
notification:
message: kubernetes autoscaler scale down failed.
type: event
- enable: disable
expression: sls.app.ack.autoscaler.instance_expired
name: autoscaler-instance-expired
notification:
message: kubernetes autoscaler scale down instance expired.
type: event
- name: workload-exceptions
rules:
- enable: disable
expression: prom.job.failed
name: job-failed
notification:
message: "Cluster Job failed. \nPromQL: kube_job_status_failed{job=\"_kube-state-metrics\"}
> 0"
type: metric-prometheus
- enable: disable
expression: prom.deployment.replicaError
name: deployment-rep-err
notification:
message: "Cluster Deployment replication status error. \nPromQL: kube_deployment_spec_replicas{job=\"_kube-state-metrics\"}
!= kube_deployment_status_replicas_available{job=\"_kube-state-metrics\"}"
type: metric-prometheus
- enable: disable
expression: prom.daemonset.scheduledError
name: daemonset-status-err
notification:
message: "Cluster Daemonset pod status or scheduled error. \nPromQL: ((100
- kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{}
* 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{}))
> 0"
type: metric-prometheus
- enable: disable
expression: prom.daemonset.misscheduled
name: daemonset-misscheduled
notification:
message: "Cluster Daemonset misscheduled. \nPromQL: kube_daemonset_status_number_misscheduled{job=\"_kube-state-metrics\"}
\ > 0"
type: metric-prometheus
- name: pod-exceptions
rules:
- enable: disable
expression: sls.app.ack.pod.oom
name: pod-oom
notification:
message: kubernetes pod oom event.
type: event
- enable: disable
expression: sls.app.ack.pod.failed
name: pod-failed
notification:
message: kubernetes pod start failed event.
type: event
- enable: disable
expression: prom.pod.status.notHealthy
name: pod-status-err
notification:
message: 'Pod status exception. \nPromQL: min_over_time(sum by (namespace,
pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", job="_kube-state-metrics"})[${mins}m:1m])
> 0'
type: metric-prometheus
- enable: disable
expression: prom.pod.status.crashLooping
name: pod-crashloop
notification:
message: 'Pod status exception. \nPromQL: sum_over_time(increase(kube_pod_container_status_restarts_total{job="_kube-state-metrics"}[1m])[${mins}m:1m])
> 3'
type: metric-prometheus
- name: cluster-storage-err
rules:
- enable: disable
expression: sls.app.ack.csi.invalid_disk_size
name: csi_invalid_size
notification:
message: kubernetes csi invalid disk size.
type: event
- enable: disable
expression: sls.app.ack.csi.disk_not_portable
name: csi_not_portable
notification:
message: kubernetes csi not portable.
type: event
- enable: disable
expression: sls.app.ack.csi.deivce_busy
name: csi_device_busy
notification:
message: kubernetes csi disk device busy.
type: event
- enable: disable
expression: sls.app.ack.csi.no_ava_disk
name: csi_no_ava_disk
notification:
message: kubernetes csi no available disk.
type: event
- enable: disable
expression: sls.app.ack.csi.disk_iohang
name: csi_disk_iohang
notification:
message: kubernetes csi ioHang.
type: event
- enable: disable
expression: sls.app.ack.csi.latency_too_high
name: csi_latency_high
notification:
message: kubernetes csi pvc latency load too high.
type: event
- enable: disable
expression: prom.pv.failed
name: pv-failed
notification:
message: 'Cluster PersistentVolume failed. \nPromQL: kube_persistentvolume_status_phase{phase=~"Failed|Pending",
job="_kube-state-metrics"} > 0'
type: metric-prometheus
- name: cluster-network-err
rules:
- enable: disable
expression: sls.app.ack.ccm.no_ava_slb
name: slb-no-ava
notification:
message: kubernetes slb not available.
type: event
- enable: disable
expression: sls.app.ack.ccm.sync_slb_failed
name: slb-sync-err
notification:
message: kubernetes slb sync failed.
type: event
- enable: disable
expression: sls.app.ack.ccm.del_slb_failed
name: slb-del-err
notification:
message: kubernetes slb delete failed.
type: event
- enable: disable
expression: sls.app.ack.ccm.create_route_failed
name: route-create-err
notification:
message: kubernetes create route failed.
type: event
- enable: disable
expression: sls.app.ack.ccm.sync_route_failed
name: route-sync-err
notification:
message: kubernetes sync route failed.
type: event
- enable: disable
expression: sls.app.ack.terway.invalid_resource
name: terway-invalid-res
notification:
message: kubernetes terway have invalid resource.
type: event
- enable: disable
expression: sls.app.ack.terway.alloc_ip_fail
name: terway-alloc-ip-err
notification:
message: kubernetes terway allocate ip error.
type: event
- enable: disable
expression: sls.app.ack.terway.parse_fail
name: terway-parse-err
notification:
message: kubernetes terway parse k8s.aliyun.com/ingress-bandwidth annotation
error.
type: event
- enable: disable
expression: sls.app.ack.terway.allocate_failure
name: terway-alloc-res-err
notification:
message: kubernetes parse resource error.
type: event
- enable: disable
expression: sls.app.ack.terway.dispose_failure
name: terway-dispose-err
notification:
message: kubernetes dispose resource error.
type: event
- enable: disable
expression: sls.app.ack.terway.virtual_mode_change
name: terway-virt-mod-err
notification:
message: kubernetes virtual mode changed.
type: event
- enable: disable
expression: sls.app.ack.terway.config_check
name: terway-ip-check
notification:
message: kubernetes terway execute pod ip config check.
type: event
- enable: disable
expression: sls.app.ack.ingress.err_reload_nginx
name: ingress-reload-err
notification:
message: kubernetes ingress reload config error.
type: event
- name: security-err
rules:
- enable: disable
expression: sls.app.ack.si.config_audit_high_risk
name: si-c-a-risk
notification:
message: kubernetes high risks have been found after running config audit.
type: event
ruleVersion: v1.0.9
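Optionally, confirm that the alert rule was created on the Fleet instance before you distribute it. The following command is a sketch; the ackalertrules resource name follows the resource reference used in the distribution rule in the next step.
# Optional check: confirm that the AckAlertRule resource exists on the Fleet instance.
kubectl get ackalertrules default -n kube-system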
Step 4: Create a distribution rule
An alert rule is a type of Kubernetes resource. Alert rule distribution, like application distribution, is implemented based on open source KubeVela, which can distribute Kubernetes resources from the Fleet instance to associated clusters. Perform the following steps to distribute the alert rule to associated clusters:
Create a file named ackalertrule-app.yaml and copy the following content to the file.
Method 1: Distribute the alert rule to the associated clusters to which the production=true label is added
Run the following command to add a label to a cluster that is associated with the Fleet instance:
kubectl get managedclusters    # Query the IDs of the associated clusters.
kubectl label managedclusters <clusterid> production=true
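Optionally, verify which associated clusters carry the label before you distribute the alert rule. This check is a sketch that uses a standard kubectl label selector; it is not part of the original procedure.
# Optional check: list the associated clusters to which the production=true label is added.
kubectl get managedclusters -l production=true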
Distribute the alert rule to the associated clusters to which the production=true label is added.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1
spec:
  components:
    - name: alertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules
            name: default
  policies:
    - type: topology
      name: prod-clusters
      properties:
        clusterSelector:
          production: "true" # The label that is used to select associated clusters.
Method 2: Distribute the alert rule to the associated clusters that you specify
Replace <clusterid1> and <clusterid2> in the following file with the IDs of the clusters to which you want to distribute the alert rule.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1
spec:
  components:
    - name: alertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules
            name: default
  policies:
    - type: topology
      name: prod-clusters
      properties:
        clusters: ["<clusterid1>", "<clusterid2>"] # The IDs of the clusters to which you want to distribute the alert rule.
Run the following command to create a distribution rule:
kubectl apply -f ackalertrule-app.yaml
Run the following command to query the status of the distributed resources:
kubectl amc appstatus alertrules -n kube-system --tree --detail
Expected output:
CLUSTER                  NAMESPACE       RESOURCE                STATUS    APPLY_TIME            DETAIL
c565e4**** (cluster1)─── kube-system─── AckAlertRule/default     updated   2022-**-** **:**:**   Age: **
cbaa12**** (cluster2)─── kube-system─── AckAlertRule/default     updated   2022-**-** **:**:**   Age: **
Modify an alert rule
Perform the following steps to modify an alert rule:
Modify the ackalertrule.yaml file and run the kubectl apply -f ackalertrule.yaml command to apply the changes to the alert rule.
Update the app.oam.dev/publishVersion annotation in the ackalertrule-app.yaml file and run the kubectl apply -f ackalertrule-app.yaml command to apply the changes to the distribution rule.
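The following commands summarize this update workflow as a minimal sketch. The file names follow the examples in this topic, and the annotation value used for the version bump is an assumption.
# 1. Apply the modified alert rule on the Fleet instance.
kubectl apply -f ackalertrule.yaml
# 2. Bump the app.oam.dev/publishVersion annotation in ackalertrule-app.yaml
#    (for example, from version1 to version2), then re-apply the distribution rule.
kubectl apply -f ackalertrule-app.yaml
# 3. Check the distribution status of the updated alert rule.
kubectl amc appstatus alertrules -n kube-system --tree --detail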