Container Service for Kubernetes: Monitor ACK Edge clusters with Prometheus Monitoring for Alibaba Cloud

Last Updated: Mar 26, 2026

ACK Edge clusters deploy nodes in offline data centers where VPCs and edge nodes run on different network planes. Standard Prometheus agents cannot directly reach Node Exporter or GPU Exporter endpoints on edge nodes. Managed Service for Prometheus solves this by using the built-in cloud-native tunnel in ACK Edge clusters to automatically bridge the cloud-to-edge gap. This topic describes how to connect Managed Service for Prometheus to an ACK Edge cluster, view predefined dashboards, and configure alert rules.

Prerequisites

Before you begin, make sure you have:

  • An ACK Edge cluster of version 1.18.8-aliyunedge.1 or later. For more information, see Create an ACK Edge cluster.

  • The ack-arms-prometheus add-on of version 1.1.4 or later installed in the ACK Edge cluster. To check or upgrade the version, see Check the ack-arms-prometheus version.

  • For clusters running Kubernetes earlier than version 1.26: port forwarding enabled in the kube-system/edge-tunnel-server-cfg ConfigMap for Node Exporter port 9100 and GPU Exporter port 9445:

    http-proxy-ports: 9445
    https-proxy-ports: 9100
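
    The two keys above belong in the data section of the kube-system/edge-tunnel-server-cfg ConfigMap. As a sketch, the edited ConfigMap would look roughly like this (the metadata scaffolding is standard Kubernetes boilerplate; only the two data keys come from this topic):

    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: edge-tunnel-server-cfg
      namespace: kube-system
    data:
      http-proxy-ports: "9445"    # GPU Exporter
      https-proxy-ports: "9100"   # Node Exporter
    ```

    If the ConfigMap already lists other proxied ports, append the new port to the existing value (these keys typically take a comma-separated port list) rather than replacing it.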

How monitoring works in edge clusters

Because VPCs in the cloud and edge nodes run on different network planes, the Prometheus Agent in the cloud cannot directly reach Node Exporter and GPU Exporter endpoints on edge nodes.

Starting from ack-arms-prometheus 1.1.4, the add-on uses the built-in cloud-native O&M communication tunnel (Tunnel) in ACK Edge clusters to automatically establish a collection link between the cloud and the edge.

ACK Edge clusters support the Basic Edition of Container Monitoring.

View Grafana dashboards

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, choose Operations > Prometheus Monitoring.

    If this is your first time accessing this page, follow the on-screen instructions and click Install. The console installs the add-on automatically and redirects you to the Prometheus Monitoring details page after installation.
  3. On the Prometheus Monitoring page, view monitoring data for nodes, applications, and GPUs on the predefined dashboards.

Configure Prometheus alert rules

Alert rules trigger notifications when a metric crosses a defined threshold. When an alert fires, Managed Service for Prometheus sends a notification to the contacts in the specified contact group. Supported notification channels include phone call, text message, email, DingTalk, WeCom, and webhook.

Configuring alerts involves two steps: create a contact to receive notifications, then create the alert rule. For information about setting up DingTalk and WeCom notification channels, see DingTalk Robot and WeCom Robot.

Step 1: Create a contact

  1. Log on to the ARMS console. In the left navigation pane, choose Alert Management > Notification Objects.

  2. On the Contacts tab, click Create Contact.

  3. In the Create Contact dialog box, configure the parameters and click OK.

    Important

    A maximum of 100 contacts can be created.

    Parameter | Description
    Name | The name of the contact.
    Phone Number | The mobile number of the contact. Used for phone call and text message notifications. Only verified mobile numbers can be used in notification policies. For more information, see Verify a mobile number.
    Email | The email address of the contact. Used for email notifications.

Step 2: Create a Prometheus alert rule

Prometheus alert rules support two check types:

Check type | When to use
Static Threshold | Monitor a preset metric with a simple threshold condition.
Custom PromQL | Monitor metrics not available in the Static Threshold preset list, using a custom PromQL expression.

Create an alert rule based on a static threshold

  1. Log on to the ARMS console.

  2. In the left navigation pane, choose Managed Service for Prometheus > Prometheus Alert Rules.

  3. On the Prometheus Alert Rules page, click Create Prometheus Alert Rule.

  4. On the Create Prometheus Alert Rule page, configure the following parameters and click Save.

    Filter conditions are limited to 300 characters.
    Parameter | Description | Example
    Alert Rule Name | The name of the alert rule. | Production cluster - container CPU utilization alert
    Check Type | Select Static Threshold. | Static Threshold
    Prometheus Instance | The Prometheus instance to monitor. | Production cluster
    Alert Contact Group | The contact group to notify when the alert fires. Available groups vary by Prometheus instance type. | Kubernetes load
    Alert Metric | The metric to monitor. Available metrics vary by contact group. | Container CPU Usage
    Alert Condition | The condition that triggers an alert event. | CPU utilization greater than 80%
    Filter Conditions | The scope of resources the alert rule applies to. See the filter condition types below. | Traverse
    Data Preview | Displays the PromQL statement for the alert condition and a time series graph. The threshold appears as a red line; data above it appears in dark red. Available after you set filter conditions.
    Duration | When to generate an alert event: immediately when a data point hits the threshold, or only after the condition is met continuously for N minutes. | 1
    Alert Level | The severity of the alert. Valid values: Default (lowest), P4, P3, P2, P1 (highest). | Default
    Alert Message | The message sent to contacts when the alert fires. Supports Go template variables. | Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU utilization: {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%. Current value: {{ printf "%.2f" $value }}%
    Alert Notification | Simple Mode: Set Notification Objects, Notification Period, and Whether to Resend Notifications. Standard Mode: Assign a notification policy, or create one later on the Notification Policy page. For more information, see Create and manage a notification policy. | Do Not Specify Notification Policy
    Alert Check Cycle (Advanced Settings) | How often the rule evaluates the alert condition, in minutes. Default: 1. Minimum: 1. | 1
    Check When Data Is Complete (Advanced Settings) | Whether to wait for complete data before evaluating the condition. | Yes
    Tags (Advanced Settings) | Tags for the alert rule, used to match notification policies.
    Annotations (Advanced Settings) | Annotations for the alert rule.

    Filter condition types:

    Type | Description
    Traverse | Applies to all resources in the Prometheus instance. Selected by default.
    Equal | Applies to the single resource name you specify.
    Not Equal | Applies to all resources except the one you specify.
    Regex match | Applies to all resources whose names match the regular expression.
    Regex not match | Applies to all resources whose names do not match the regular expression.

Create an alert rule using a custom PromQL statement

Use the Custom PromQL check type to monitor metrics not available in the Static Threshold preset list.

On the Create Prometheus Alert Rule page, set Check Type to Custom PromQL, configure the following parameters, and click Save.

Parameter | Description | Example
Alert Rule Name | The name of the alert rule. | Pod CPU utilization exceeds 8%
Check Type | Select Custom PromQL. | Custom PromQL
Prometheus Instance | The Prometheus instance to monitor.
Alert Contact Group | The contact group to notify. Available groups vary by Prometheus instance type. | Kubernetes load
Reference Metrics | (Optional) Select a preset metric to populate the Custom PromQL Statements field with its PromQL expression. Modify the expression as needed. | Pod disk usage alert
Custom PromQL Statements | The PromQL expression that defines the alert condition. | max(container_fs_usage_bytes{pod!="", namespace!="arms-prom",namespace!="monitoring"}) by (pod_name, namespace, device)/max(container_fs_limit_bytes{pod!=""}) by (pod_name,namespace, device) * 100 > 90
Data Preview | Displays a time series graph of the metric. Move the pointer over the curve to view details at a specific point; select a time period to zoom in.
Duration | When to generate an alert event: immediately when a data point hits the threshold, or only after the condition is met for N minutes. | 1
Alert Level | The severity of the alert. Valid values: Default (lowest), P4, P3, P2, P1 (highest). | Default
Alert Message | The notification message. Supports Go template variables. | Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / The utilization of the {{$labels.device}} disk exceeds 90%. Current value: {{ printf "%.2f" $value }}%
Alert Notification | Simple Mode or Standard Mode. See the Static Threshold section for details. | Do Not Specify Notification Policy
Alert Check Cycle (Advanced Settings) | Evaluation frequency in minutes. Default: 1. Minimum: 1. | 1
Check When Data Is Complete (Advanced Settings) | Whether to wait for complete data before evaluating. | Yes
Tags (Advanced Settings) | Tags for matching notification policies.
Annotations (Advanced Settings) | Annotations for the alert rule.
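
For reference, the Custom PromQL example above corresponds to a standard Prometheus alerting rule of roughly the following shape. This is only an illustrative sketch: Managed Service for Prometheus generates and manages the rule for you, so you do not write this file yourself, and the group and alert names are placeholders.

```yaml
groups:
- name: pod-disk-usage            # placeholder group name
  rules:
  - alert: PodDiskUsageAbove90    # placeholder alert name
    expr: |
      max(container_fs_usage_bytes{pod!="", namespace!="arms-prom", namespace!="monitoring"}) by (pod_name, namespace, device)
        / max(container_fs_limit_bytes{pod!=""}) by (pod_name, namespace, device) * 100 > 90
    for: 1m                       # the Duration parameter
    annotations:
      description: 'Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / The utilization of the {{$labels.device}} disk exceeds 90%. Current value: {{ printf "%.2f" $value }}%'
```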

FAQ

Check the ack-arms-prometheus version

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. Click the name of your cluster. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, click the Logs and Monitoring tab and find ack-arms-prometheus.

The version number is displayed on the component card. If a newer version is available, an Upgrade button appears on the card; click it to update.

Why can't GPU monitoring be deployed?

GPU monitoring fails to deploy when the GPU node has taints that prevent pod scheduling. Run the following command to check:

kubectl describe node cn-beijing.47.100.***.***

If you see a taint in the output (for example, Taints: test-key=test-value:NoSchedule), resolve it using one of the following approaches:

  • Remove the taint from the node:

    kubectl taint node cn-beijing.47.100.***.*** test-key=test-value:NoSchedule-
  • Add a toleration to the GPU exporter DaemonSet so pods can be scheduled to the tainted node:

    # Edit the ack-prometheus-gpu-exporter DaemonSet
    kubectl edit daemonset -n arms-prom ack-prometheus-gpu-exporter

    In the YAML file, add the following tolerations field at the same level as containers:

    tolerations:
    - key: "test-key"
      operator: "Equal"
      value: "test-value"
      effect: "NoSchedule"
    containers:
      # Other fields omitted
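
    Alternatively, the same toleration can be added without an interactive edit by applying a strategic merge patch (a sketch; it assumes the same DaemonSet name and taint key/value shown above, and kubectl patch with --patch-file requires a reasonably recent kubectl):

    ```yaml
    # toleration-patch.yaml
    # Apply with:
    #   kubectl patch daemonset ack-prometheus-gpu-exporter -n arms-prom --patch-file toleration-patch.yaml
    spec:
      template:
        spec:
          tolerations:
          - key: "test-key"
            operator: "Equal"
            value: "test-value"
            effect: "NoSchedule"
    ```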

How do I completely remove ARMS-Prometheus configurations when reinstallation fails?

If you deleted only the arms-prom namespace, residual cluster-level resources remain and block reinstallation. Run the following commands to remove all ARMS-Prometheus resources:

  1. Delete the namespace:

    kubectl delete namespace arms-prom
  2. Delete the ClusterRoles:

    kubectl delete ClusterRole arms-kube-state-metrics
    kubectl delete ClusterRole arms-node-exporter
    kubectl delete ClusterRole arms-prom-ack-arms-prometheus-role
    kubectl delete ClusterRole arms-prometheus-oper3
    kubectl delete ClusterRole arms-prometheus-ack-arms-prometheus-role
    kubectl delete ClusterRole arms-pilot-prom-k8s
    kubectl delete ClusterRole gpu-prometheus-exporter
    kubectl delete ClusterRole o11y:addon-controller:role
    kubectl delete ClusterRole arms-aliyunserviceroleforarms-clusterrole
  3. Delete the ClusterRoleBindings:

    kubectl delete ClusterRoleBinding arms-node-exporter
    kubectl delete ClusterRoleBinding arms-prom-ack-arms-prometheus-role-binding
    kubectl delete ClusterRoleBinding arms-prometheus-oper-bind2
    kubectl delete ClusterRoleBinding arms-kube-state-metrics
    kubectl delete ClusterRoleBinding arms-pilot-prom-k8s
    kubectl delete ClusterRoleBinding arms-prometheus-ack-arms-prometheus-role-binding
    kubectl delete ClusterRoleBinding gpu-prometheus-exporter
    kubectl delete ClusterRoleBinding o11y:addon-controller:rolebinding
    kubectl delete ClusterRoleBinding arms-kube-state-metrics-agent
    kubectl delete ClusterRoleBinding arms-node-exporter-agent
    kubectl delete ClusterRoleBinding arms-aliyunserviceroleforarms-clusterrolebinding
  4. Delete the Roles and RoleBindings:

    kubectl delete Role arms-pilot-prom-spec-ns-k8s
    kubectl delete Role arms-pilot-prom-spec-ns-k8s -n kube-system
    kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s
    kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s -n kube-system
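
Because the list of cluster-scoped resources is long, it can help to generate the delete commands from a list and review them before executing anything. A minimal shell sketch (it only prints the commands; pipe the output to sh once you have checked it):

```shell
#!/bin/sh
# Residual cluster-scoped ClusterRoles left behind by ack-arms-prometheus,
# taken from the steps above.
CLUSTER_ROLES="arms-kube-state-metrics arms-node-exporter \
arms-prom-ack-arms-prometheus-role arms-prometheus-oper3 \
arms-prometheus-ack-arms-prometheus-role arms-pilot-prom-k8s \
gpu-prometheus-exporter o11y:addon-controller:role \
arms-aliyunserviceroleforarms-clusterrole"

# Print (not run) one delete command per ClusterRole for review.
for r in $CLUSTER_ROLES; do
  echo "kubectl delete ClusterRole $r"
done
```

The same pattern extends to the ClusterRoleBinding list in step 3.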

After deleting these resources, go to the Container Service for Kubernetes (ACK) console, choose Operations > Add-ons, and reinstall ack-arms-prometheus.