
Container Service for Kubernetes: Use Managed Service for Prometheus to monitor ACK Edge clusters

Last Updated: Apr 30, 2024

You can view predefined dashboards and performance metrics for ACK Edge clusters in Managed Service for Prometheus. This topic describes how to connect Managed Service for Prometheus to ACK Edge clusters.

Prerequisites

  • An ACK Edge cluster whose version is 1.18.8-aliyunedge.1 or later is created. For more information, see Create an ACK Edge cluster.

  • ack-arms-prometheus 1.1.4 or later is installed in the ACK Edge cluster. If an earlier version is installed, update ack-arms-prometheus to the latest version. For more information, see How do I check the version of the ack-arms-prometheus component?

  • Port forwarding for port 9100 of the node exporter and port 9445 of the GPU exporter is configured in the kube-system/edge-tunnel-server-cfg ConfigMap of the cluster. The following code block shows the required port forwarding configuration:

    http-proxy-ports: 9445
    https-proxy-ports: 9100
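
    If these entries are missing, you can add them to the ConfigMap. The following command is only a minimal sketch that assumes no other proxy ports are configured yet; if the ConfigMap already lists other ports, append the new ports as comma-separated values instead of overwriting the existing values.

    # Hedged example: set the exporter ports in the edge tunnel server configuration.
    # This overwrites http-proxy-ports and https-proxy-ports; merge manually if other ports are already configured.
    kubectl -n kube-system patch configmap edge-tunnel-server-cfg \
      --type merge \
      -p '{"data":{"http-proxy-ports":"9445","https-proxy-ports":"9100"}}'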

Introduction to Managed Service for Prometheus

Managed Service for Prometheus is a fully managed monitoring service that is compatible with the open source Prometheus ecosystem. It monitors a wide array of components and provides multiple predefined dashboards. Managed Service for Prometheus saves you the effort of managing underlying services such as data storage, data visualization, and system maintenance.

For more information about Managed Service for Prometheus, see What is Managed Service for Prometheus?

View Grafana dashboards provided by Managed Service for Prometheus

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Operations > Prometheus Monitoring.

    Note

    If this is the first time you access Managed Service for Prometheus, click Install on the Prometheus Monitoring page. Then, the system automatically installs the ack-arms-prometheus component and checks the dashboards. After the ack-arms-prometheus component is installed, you are redirected to the details page of Managed Service for Prometheus.

    On the Prometheus Monitoring page, you can view different monitoring data on the predefined dashboards. For example, you can view the monitoring data of nodes, applications, and GPUs on the Node Monitoring, Application Monitoring, and GPU Monitoring dashboards, respectively.

Configure Prometheus alert rules

You can create alert rules for monitoring tasks. When the conditions in an alert rule are met, the system notifies you in real time by phone call, email, text message, DingTalk message, WeCom message, or webhook, and sends alerts to the contact group that you specify. Before you create a contact group, you must create contacts. When you create a contact, you specify the mobile phone number and email address that the contact uses to receive alerts. You can also specify contact groups in a notification policy so that alerts are handled at the earliest opportunity.

  • For more information about how to create a DingTalk chatbot, see DingTalk chatbots.

  • For more information about how to create a WeCom chatbot, see WeCom chatbots.

Step 1: Create a contact

  1. Log on to the ARMS console. In the left-side navigation pane, choose Alert Management > Notification Objects.
  2. On the Contacts tab, click Create Contact in the upper-right corner.
  3. In the Create Contact dialog box, set the parameters and click OK.

    Name
      Description: The name of the contact.

    Phone Number
      Description: After you specify the mobile phone number of a contact, the contact can be notified by phone call and text message.
      Note: You can specify only verified mobile phone numbers in a notification policy. For more information about how to verify a mobile phone number, see Verify mobile phone numbers.

    Email
      Description: After you specify the email address of a contact, the contact can be notified by email.

    Important: You can create at most 100 contacts.

Step 2: Create a Prometheus alert rule

Use a predefined metric to create an alert rule

If you set Check Type to Static Threshold, you can select a predefined metric and create an alert rule based on the metric.

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Alert Rules.

  3. In the upper-right corner of the Prometheus Alert Rules page, click Create Prometheus Alert Rule.

  4. On the Create Prometheus Alert Rule page, set the following parameters and click Save.

    Alert Name
      Description: Enter the name of the alert rule.
      Example: Production cluster - container CPU utilization alert

    Check Type
      Description: Select Static Threshold.
      Example: Static Threshold

    Prometheus Instance
      Description: Select the Prometheus instance for which you want to create the alert rule.
      Example: Production cluster

    Alert Contact Group
      Description: Select an alert contact group. The options in the drop-down list vary based on the type of the Prometheus instance that you select.
      Example: Kubernetes load

    Alert Metric
      Description: Select the metric that you want to monitor by using the alert rule. Different alert contact groups provide different metrics.
      Example: Container CPU utilization

    Alert Condition
      Description: Specify the condition based on which alert events are generated.
      Example: If the CPU utilization of the container is greater than 80%, an alert event is generated.

    Filter Conditions
      Description: Specify the applicable scope of the alert rule. If a resource meets both the filter condition and the alert condition, an alert event is generated. The following types of filter conditions are supported:
      • Traverse: The alert rule applies to all resources in the current Prometheus instance. By default, Traverse is selected.
      • Equal: If you select this filter condition, you must enter a resource name. The alert rule applies only to the specified resource. You cannot specify multiple resources at the same time.
      • Not equal: If you select this filter condition, you must enter a resource name. The alert rule applies to resources other than the specified resource. You cannot specify multiple resources at the same time.
      • Regex match: If you select this filter condition, you must enter a regular expression to match resource names. The alert rule applies to all resources that match the regular expression.
      • Regex not match: If you select this filter condition, you must enter a regular expression to match resource names. The alert rule applies to resources that do not match the regular expression.
      Note: After you set the filter conditions, the Data Preview section appears.
      Example: Traverse

    Data Preview
      Description: The Data Preview section displays the PromQL statement that corresponds to the alert condition. The section also displays the values of the specified metric in a time series graph. By default, only the real-time values of one resource are displayed. You can specify filter conditions to view the metric values of different resources in different time ranges.
      Note:
      • The threshold in the time series graph is represented by a red line. The part of the curve that meets the alert condition is displayed in dark red, and the part of the curve that does not meet the alert condition is displayed in blue.
      • You can move the pointer over the curve to view resource details at a specific point in time.
      • You can also select a time period on the time series curve to view the time series curve of the selected time period.
      Example: None

    Duration
      Description:
      • If the alert condition is met, an alert event is generated: If a data point reaches the threshold, an alert event is generated.
      • If the alert condition is continuously met for N minutes, an alert event is generated: An alert event is generated only if the duration for which the threshold is reached is greater than or equal to N minutes.
      Example: 1

    Alert Level
      Description: Specify the alert level. Default value: Default. Valid values: Default, P4, P3, P2, and P1. Default indicates the lowest severity level, and P1 indicates the highest severity level.
      Example: Default

    Alert Message
      Description: Specify the alert message that you want to send to the end users. You can specify custom variables in the alert message based on the Go template syntax.
      Example: Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU utilization: {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%. Current value: {{ printf "%.2f" $value }}%

    Advanced Settings

    Alert Check Cycle
      Description: The alert rule is evaluated every N minutes to check whether the alert condition is met. Default value: 1. Minimum value: 1.
      Example: 1

    Specify Notification Policies
      Description:
      • Do Not Specify Notification Policies: If you select this option, you can create a notification policy on the Notification Policy page after you create the alert rule. On the Notification Policy page, you can specify match rules and match conditions. For example, you can specify an alert rule name as the match condition. When the alert rule is triggered, an alert event is generated and an alert notification is sent to the contacts or contact groups that are specified in the notification policy. For more information, see Create and manage a notification policy.
      • You can also select a notification policy from the drop-down list. ARMS automatically adds a match rule to the selected notification policy and specifies the ID of the alert rule as the match condition. The name of the alert rule is displayed on the Notification Policy page. This way, the alert events that are generated based on the alert rule can be matched by the selected notification policy.
      Important: After you select a notification policy, the alert events that are generated based on the alert rule can be matched by the notification policy, and alerts can be generated. The alert events may also be matched by other notification policies that use fuzzy match, and alerts may be generated. One or more alert events can be matched by one or more notification policies.
      Example: Do Not Specify Notification Policies

    Tags
      Description: Specify tags for the alert rule. The specified tags can be used to match notification policies.
      Example: None

    Annotations
      Description: Specify annotations for the alert rule.
      Example: None

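    The Data Preview section shows the exact PromQL statement that ARMS generates for the alert condition. The query below is not that generated statement; it is only an illustrative sketch, based on standard cAdvisor metrics, of a condition comparable to "Container CPU utilization is greater than 80%" in the example above:

    # Illustrative only: CPU usage of each container as a percentage of its CPU limit, evaluated against an 80% threshold.
    # Containers without CPU limits are not covered by this sketch.
    100 * sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod, container)
      / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (namespace, pod, container)
      > 80
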
Use a custom PromQL statement to create an alert rule

To monitor a metric other than the predefined metrics, you can use a custom PromQL statement to create an alert rule.

On the Create Prometheus Alert Rule page, set the following parameters and click Save.

Alert Name
  Description: Enter the name of the alert rule.
  Example: Pod disk usage exceeds 90%

Check Type
  Description: Select Custom PromQL.
  Example: Custom PromQL

Prometheus Instance
  Description: Select the Prometheus instance for which you want to create the alert rule.
  Example: None

Reference Metrics
  Description: Optional. The reference metrics drop-down list displays common metrics. After you select a metric, the PromQL statement of the metric is displayed in the Custom PromQL Statements field. You can modify the statement based on your business requirements. The options in the drop-down list vary based on the type of the Prometheus instance.
  Example: Pod disk usage alert

Custom PromQL Statements
  Description: Specify the PromQL statement based on which alert events are generated.
  Example: max(container_fs_usage_bytes{pod!="", namespace!="arms-prom",namespace!="monitoring"}) by (pod_name, namespace, device)/max(container_fs_limit_bytes{pod!=""}) by (pod_name,namespace, device) * 100 > 90

Data Preview
  Description: The Data Preview section displays the time series graph of resources that meet the specified conditions in the PromQL statement. By default, the alert data of all resources that meet the specified conditions in the PromQL statement is displayed. You can configure filter conditions to display the data of a specific resource in a specific time range.
  Note:
  • You can move the pointer over the curve to view resource details at a specific point in time.
  • You can also select a time period on the time series curve to view the time series curve of the selected time period.
  Example: None

Duration
  Description:
  • If the alert condition is met, an alert event is generated: If a data point reaches the threshold, an alert event is generated.
  • If the alert condition is continuously met for N minutes, an alert event is generated: An alert event is generated only if the duration for which the threshold is reached is greater than or equal to N minutes.
  Example: 1

Alert Level
  Description: Specify the alert level. Default value: Default. Valid values: Default, P4, P3, P2, and P1. Default indicates the lowest severity level, and P1 indicates the highest severity level.
  Example: Default

Alert Message
  Description: Specify the alert message that you want to send to the end users. You can specify custom variables in the alert message based on the Go template syntax.
  Example: Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / The utilization of the {{$labels.device}} disk exceeds 90%. Current value: {{ printf "%.2f" $value }}%

Advanced Settings

Alert Check Cycle
  Description: The alert rule is evaluated every N minutes to check whether the alert condition is met. Default value: 1. Minimum value: 1.
  Example: 1

Specify Notification Policies
  Description:
  • Do Not Specify Notification Policies: If you select this option, you can create a notification policy on the Notification Policy page after you create the alert rule. On the Notification Policy page, you can specify match rules and match conditions. For example, you can specify an alert rule name as the match condition. When the alert rule is triggered, an alert event is generated and an alert notification is sent to the contacts or contact groups that are specified in the notification policy. For more information, see Create and manage a notification policy.
  • You can also select a notification policy from the drop-down list. ARMS automatically adds a match rule to the selected notification policy and specifies the ID of the alert rule as the match condition. The name of the alert rule is displayed on the Notification Policy page. This way, the alert events that are generated based on the alert rule can be matched by the selected notification policy.
  Important: After you select a notification policy, the alert events that are generated based on the alert rule can be matched by the notification policy, and alerts can be generated. The alert events may also be matched by other notification policies that use fuzzy match, and alerts may be generated. One or more alert events can be matched by one or more notification policies.
  Example: Do Not Specify Notification Policies

Tags
  Description: Specify tags for the alert rule. The specified tags can be used to match notification policies.
  Example: None

Annotations
  Description: Specify annotations for the alert rule.
  Example: None

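The Custom PromQL Statements example above is easier to read when it is split across multiple lines. The following block is the same expression, only reformatted and commented; the metric names and label filters are unchanged:

# Maximum filesystem usage per pod, namespace, and device, excluding the arms-prom and monitoring namespaces,
# divided by the filesystem limit and expressed as a percentage. An alert event is generated above 90%.
  max(container_fs_usage_bytes{pod!="", namespace!="arms-prom", namespace!="monitoring"}) by (pod_name, namespace, device)
/ max(container_fs_limit_bytes{pod!=""}) by (pod_name, namespace, device)
* 100
> 90
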
FAQ

How do I check the version of the ack-arms-prometheus component?

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Operations > Add-ons in the left-side navigation pane.

  3. On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component.

    The version number is displayed in the lower part of the component. If a new version is available, click Upgrade on the right side to update the component.

    Note

    The Upgrade button is displayed only if the component is not updated to the latest version.

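If you prefer the command line, you can also infer the installed version from the image tags of the component's workloads. This is only a sketch that assumes the component's pods run in the arms-prom namespace, as in the cleanup commands later in this topic:

  # List the pods in the arms-prom namespace together with their container image tags.
  # The image tag usually reflects the installed component version.
  kubectl get pods -n arms-prom -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
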
How is monitoring data collected from ACK Edge clusters?

In edge computing scenarios, edge nodes are deployed in on-premises data centers, so the virtual private clouds (VPCs) in the cloud and the edge nodes belong to different network planes. The Prometheus agent that is deployed in the cloud cannot directly access the endpoints of the node exporter and the GPU exporter to collect monitoring metrics. ack-arms-prometheus 1.1.4 and later use the cloud-edge tunneling feature of ACK Edge clusters to collect monitoring data from edge nodes to the cloud.
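
To confirm that an exporter on an edge node is reachable locally, you can query its metrics endpoint directly on that node. The port below is the node exporter port from the prerequisites; this is only a quick sanity check, not part of the official procedure:

  # Run on an edge node: fetch the first lines of node exporter metrics from its local endpoint (port 9100).
  curl -s http://127.0.0.1:9100/metrics | head -n 5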

Why is Managed Service for Prometheus unable to monitor GPU-accelerated nodes?

Managed Service for Prometheus may be unable to monitor GPU-accelerated nodes that are configured with taints. You can perform the following steps to view the taints of a GPU-accelerated node and handle them.

  1. Run the following command to view the taints of a GPU-accelerated node:

    kubectl describe node cn-beijing.47.100.***.***

    Expected output:

    Taints:             test-key=test-value:NoSchedule

    If you added custom taints to the GPU-accelerated node, the output shows the custom taints. In this example, a taint whose key is test-key, value is test-value, and effect is NoSchedule is added to the node.
  2. Use one of the following methods to handle the taint:

    • Run the following command to delete the taint from the GPU-accelerated node:

      kubectl taint node cn-beijing.47.100.***.*** test-key=test-value:NoSchedule-
    • Add a toleration rule that allows pods to be scheduled to the GPU-accelerated node with the taint.

      # 1. Run the following command to modify the ack-prometheus-gpu-exporter DaemonSet:
      kubectl edit daemonset -n arms-prom ack-prometheus-gpu-exporter

      # 2. Add the following fields to the YAML file to tolerate the taint.
      # The tolerations field must be added above the containers field, and both fields must be at the same level.
      # Other fields are omitted.
      tolerations:
      - key: "test-key"
        operator: "Equal"
        value: "test-value"
        effect: "NoSchedule"
      containers:
        # Irrelevant fields are not shown.
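
      # After the DaemonSet is updated, you can verify that a GPU exporter pod is scheduled to the node.
      # This check simply filters pods in the arms-prom namespace by the DaemonSet name used above.
      kubectl get pods -n arms-prom -o wide | grep ack-prometheus-gpu-exporter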

What do I do if I fail to reinstall ack-arms-prometheus due to residual resource configurations of ack-arms-prometheus?

If you delete only the namespace of Managed Service for Prometheus, cluster-scoped resource configurations such as ClusterRoles and ClusterRoleBindings are retained. In this case, you may fail to reinstall ack-arms-prometheus. You can perform the following operations to delete the residual resource configurations:

  • Run the following command to delete the arms-prom namespace:

    kubectl delete namespace arms-prom
  • Run the following commands to delete the related ClusterRoles:

    kubectl delete ClusterRole arms-kube-state-metrics
    kubectl delete ClusterRole arms-node-exporter
    kubectl delete ClusterRole arms-prom-ack-arms-prometheus-role
    kubectl delete ClusterRole arms-prometheus-oper3
    kubectl delete ClusterRole arms-prometheus-ack-arms-prometheus-role
    kubectl delete ClusterRole arms-pilot-prom-k8s
    kubectl delete ClusterRole gpu-prometheus-exporter
  • Run the following commands to delete the related ClusterRoleBindings:

    kubectl delete ClusterRoleBinding arms-node-exporter
    kubectl delete ClusterRoleBinding arms-prom-ack-arms-prometheus-role-binding
    kubectl delete ClusterRoleBinding arms-prometheus-oper-bind2
    kubectl delete ClusterRoleBinding arms-kube-state-metrics
    kubectl delete ClusterRoleBinding arms-pilot-prom-k8s
    kubectl delete ClusterRoleBinding arms-prometheus-ack-arms-prometheus-role-binding
    kubectl delete ClusterRoleBinding gpu-prometheus-exporter
  • Run the following commands to delete the related Roles and RoleBindings:

    kubectl delete Role arms-pilot-prom-spec-ns-k8s
    kubectl delete Role arms-pilot-prom-spec-ns-k8s -n kube-system
    kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s
    kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s -n kube-system
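  • Optionally, run the following command to confirm that no residual cluster-scoped resources remain. The grep pattern is only a convenience that matches the resource names used in the preceding commands; if the command returns no output, the cleanup is complete.

    kubectl get clusterrole,clusterrolebinding | grep -E 'arms|gpu-prometheus-exporter'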

After you delete the residual resource configurations, go to the ACK console, choose Operations > Add-ons, and reinstall the ack-arms-prometheus component.