ACK Edge clusters deploy nodes in offline data centers where VPCs and edge nodes run on different network planes. Standard Prometheus agents cannot directly reach Node Exporter or GPU Exporter endpoints on edge nodes. Managed Service for Prometheus solves this by using the built-in cloud-native tunnel in ACK Edge clusters to automatically bridge the cloud-to-edge gap. This topic describes how to connect Managed Service for Prometheus to an ACK Edge cluster, view predefined dashboards, and configure alert rules.
Prerequisites
Before you begin, make sure you have:
An ACK Edge cluster of version 1.18.8-aliyunedge.1 or later. For more information, see Create an ACK Edge cluster.
The ack-arms-prometheus add-on of version 1.1.4 or later installed in the ACK Edge cluster. To check or upgrade the version, see Check the ack-arms-prometheus version.
For clusters running Kubernetes earlier than version 1.26: port forwarding enabled in the kube-system/edge-tunnel-server-cfg ConfigMap for Node Exporter port 9100 and GPU Exporter port 9445:

```
http-proxy-ports: 9445
https-proxy-ports: 9100
```
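If you maintain the ConfigMap yourself, the two keys live in the data section of edge-tunnel-server-cfg in the kube-system namespace. The following is a minimal sketch only; your actual ConfigMap may contain additional keys that must be preserved, so edit the existing object rather than replacing it:

```yaml
# Sketch: only the two port-forwarding keys are shown. Merge these into
# your existing edge-tunnel-server-cfg ConfigMap; do not overwrite other keys.
apiVersion: v1
kind: ConfigMap
metadata:
  name: edge-tunnel-server-cfg
  namespace: kube-system
data:
  http-proxy-ports: "9445"   # GPU Exporter
  https-proxy-ports: "9100"  # Node Exporter
```

You can open the live object with `kubectl -n kube-system edit configmap edge-tunnel-server-cfg` and add the two keys under `data`.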
How monitoring works in edge clusters
Because VPCs in the cloud and edge nodes run on different network planes, the Prometheus Agent in the cloud cannot directly reach Node Exporter and GPU Exporter endpoints on edge nodes.
Starting from ack-arms-prometheus 1.1.4, the add-on uses the built-in cloud-native O&M communication tunnel (Tunnel) in ACK Edge clusters to automatically establish a collection link between the cloud and the edge.
ACK Edge clusters support the Basic Edition of Container Monitoring.
View Grafana dashboards
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, choose Operations > Prometheus Monitoring.
3. If this is your first time accessing this page, follow the on-screen instructions and click Install. The console installs the add-on automatically and redirects you to the Prometheus Monitoring details page after installation.
4. On the Prometheus Monitoring page, view monitoring data for nodes, applications, and GPUs on the predefined dashboards.
Configure Prometheus alert rules
Alert rules trigger notifications when a metric crosses a defined threshold. When an alert fires, Managed Service for Prometheus sends a notification to the contacts in the specified contact group. Supported notification channels include phone call, text message, email, DingTalk, WeCom, and webhook.
Configuring alerts involves two steps: create a contact to receive notifications, then create the alert rule. For information about setting up DingTalk and WeCom notification channels, see DingTalk Robot and WeCom Robot.
Step 1: Create a contact
1. Log on to the ARMS console. In the left navigation pane, choose Alert Management > Notification Objects.
2. On the Contacts tab, click Create Contact.
3. In the Create Contact dialog box, configure the parameters and click OK.
**Important** A maximum of 100 contacts can be created.
| Parameter | Description |
|---|---|
| Name | The name of the contact. |
| Phone Number | The mobile number of the contact. Used for phone call and text message notifications. Only verified mobile numbers can be used in notification policies. For more information, see Verify a mobile number. |
| Email | The email address of the contact. Used for email notifications. |
Step 2: Create a Prometheus alert rule
Prometheus alert rules support two check types:
| Check type | When to use |
|---|---|
| Static Threshold | Monitor a preset metric with a simple threshold condition. |
| Custom PromQL | Monitor metrics not available in the Static Threshold preset list, using a custom PromQL expression. |
Create an alert rule based on a static threshold
1. Log on to the ARMS console.
2. In the left navigation pane, choose Managed Service for Prometheus > Prometheus Alert Rules.
3. On the Prometheus Alert Rules page, click Create Prometheus Alert Rule.
4. On the Create Prometheus Alert Rule page, configure the following parameters and click Save.
Filter conditions are limited to 300 characters.
| Parameter | Description | Example |
|---|---|---|
| Alert Rule Name | The name of the alert rule. | Production cluster - container CPU utilization alert |
| Check Type | Select Static Threshold. | Static Threshold |
| Prometheus Instance | The Prometheus instance to monitor. | Production cluster |
| Alert Contact Group | The contact group to notify when the alert fires. Available groups vary by Prometheus instance type. | Kubernetes load |
| Alert Metric | The metric to monitor. Available metrics vary by contact group. | Container CPU Usage |
| Alert Condition | The condition that triggers an alert event. | CPU utilization greater than 80% |
| Filter Conditions | The scope of resources the alert rule applies to. See the filter condition types below. | Traverse |
| Data Preview | Displays the PromQL statement for the alert condition and a time series graph. The threshold appears as a red line; data above it appears in dark red. Available after you set filter conditions. | — |
| Duration | When to generate an alert event: immediately when a data point hits the threshold, or only after the condition is met continuously for N minutes. | 1 |
| Alert Level | The severity of the alert. Valid values: Default (lowest), P4, P3, P2, P1 (highest). | Default |
| Alert Message | The message sent to contacts when the alert fires. Supports Go template variables. | Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU utilization: {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%. Current value: {{ printf "%.2f" $value }}% |
| Alert Notification | Simple Mode: Set Notification Objects, Notification Period, and Whether to Resend Notifications. Standard Mode: Assign a notification policy, or create one later on the Notification Policy page. For more information, see Create and manage a notification policy. | Do Not Specify Notification Policy |
| Alert Check Cycle (Advanced Settings) | How often the rule evaluates the alert condition, in minutes. Default: 1. Minimum: 1. | 1 |
| Check When Data Is Complete (Advanced Settings) | Whether to wait for complete data before evaluating the condition. | Yes |
| Tags (Advanced Settings) | Tags for the alert rule, used to match notification policies. | — |
| Annotations (Advanced Settings) | Annotations for the alert rule. | — |

Filter condition types:

| Type | Description |
|---|---|
| Traverse | Applies to all resources in the Prometheus instance. Selected by default. |
| Equal | Applies to the single resource name you specify. |
| Not Equal | Applies to all resources except the one you specify. |
| Regex match | Applies to all resources whose names match the regular expression. |
| Regex not match | Applies to all resources whose names do not match the regular expression. |
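For reference, a static-threshold rule configured as above behaves like an open-source Prometheus alerting rule. The sketch below is illustrative only: ARMS generates the actual PromQL from your console settings, and the metric name and label set here are assumptions, not the preset used by the service:

```yaml
# Illustrative sketch only. ARMS generates the real rule from the console
# settings; the metric name and labels below are assumptions.
groups:
- name: container-cpu-alerts
  rules:
  - alert: ContainerCPUUsageHigh
    # Alert Condition: CPU utilization greater than 80%
    expr: |
      sum(rate(container_cpu_usage_seconds_total{container!=""}[1m]))
        by (namespace, pod, container) * 100 > 80
    for: 1m                 # Duration: condition must hold continuously for 1 minute
    labels:
      severity: default     # Alert Level
    annotations:
      # Alert Message with Go template variables
      message: 'Namespace: {{ $labels.namespace }} / Pod: {{ $labels.pod }} / Current value: {{ printf "%.2f" $value }}%'
```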
Create an alert rule using a custom PromQL statement
Use the Custom PromQL check type to monitor metrics not available in the Static Threshold preset list.
On the Create Prometheus Alert Rule page, set Check Type to Custom PromQL, configure the following parameters, and click Save.
| Parameter | Description | Example |
|---|---|---|
| Alert Rule Name | The name of the alert rule. | Pod CPU utilization exceeds 8% |
| Check Type | Select Custom PromQL. | Custom PromQL |
| Prometheus Instance | The Prometheus instance to monitor. | — |
| Reference Alert Contact Group | The contact group to notify. Available groups vary by Prometheus instance type. | Kubernetes load |
| Reference Metrics | (Optional) Select a preset metric to populate the Custom PromQL Statements field with its PromQL expression. Modify the expression as needed. | Pod disk usage alert |
| Custom PromQL Statements | The PromQL expression that defines the alert condition. | max(container_fs_usage_bytes{pod!="", namespace!="arms-prom",namespace!="monitoring"}) by (pod_name, namespace, device)/max(container_fs_limit_bytes{pod!=""}) by (pod_name,namespace, device) * 100 > 90 |
| Data Preview | Displays a time series graph of the metric. Move the pointer over the curve to view details at a specific point; select a time period to zoom in. | — |
| Duration | When to generate an alert event: immediately when a data point hits the threshold, or only after the condition is met for N minutes. | 1 |
| Alert Level | The severity of the alert. Valid values: Default (lowest), P4, P3, P2, P1 (highest). | Default |
| Alert Message | The notification message. Supports Go template variables. | Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / The utilization of the {{$labels.device}} disk exceeds 90%. Current value: {{ printf "%.2f" $value }}% |
| Alert Notification | Simple Mode or Standard Mode. See the Static Threshold section for details. | Do Not Specify Notification Policy |
| Alert Check Cycle (Advanced Settings) | Evaluation frequency in minutes. Default: 1. Minimum: 1. | 1 |
| Check When Data Is Complete (Advanced Settings) | Whether to wait for complete data before evaluating. | Yes |
| Tags (Advanced Settings) | Tags for matching notification policies. | — |
| Annotations (Advanced Settings) | Annotations for the alert rule. | — |
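The example PromQL statement in the table is easier to review when split across lines. The following is the same expression reformatted (PromQL ignores the extra whitespace):

```promql
max(container_fs_usage_bytes{pod!="", namespace!="arms-prom", namespace!="monitoring"})
  by (pod_name, namespace, device)
/
max(container_fs_limit_bytes{pod!=""}) by (pod_name, namespace, device)
* 100 > 90
```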
FAQ
Check the ack-arms-prometheus version
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. Click the name of your cluster. In the left navigation pane, click Add-ons.
3. On the Add-ons page, click the Logs and Monitoring tab and find ack-arms-prometheus. The version number is displayed on the component card. If a newer version is available, an Upgrade button appears on the card; click it to update.
Why can't GPU monitoring be deployed?
GPU monitoring fails to deploy when the GPU node has taints that prevent pod scheduling. Run the following command to check:
```shell
kubectl describe node cn-beijing.47.100.***.***
```

If you see a taint in the output (for example, `Taints: test-key=test-value:NoSchedule`), resolve it using one of the following approaches:
Remove the taint from the node:
```shell
kubectl taint node cn-beijing.47.100.***.*** test-key=test-value:NoSchedule-
```

Add a toleration to the GPU exporter DaemonSet so pods can be scheduled to the tainted node:

```shell
# Edit the ack-prometheus-gpu-exporter DaemonSet
kubectl edit daemonset -n arms-prom ack-prometheus-gpu-exporter
```

In the YAML file, add the following `tolerations` field at the same level as `containers`:

```yaml
tolerations:
- key: "test-key"
  operator: "Equal"
  value: "test-value"
  effect: "NoSchedule"
containers:
# Other fields omitted
```
How do I completely remove ARMS-Prometheus configurations when reinstallation fails?
If you deleted only the arms-prom namespace, residual cluster-level resources remain and block reinstallation. Run the following commands to remove all ARMS-Prometheus resources:
Delete the namespace:
```shell
kubectl delete namespace arms-prom
```

Delete the ClusterRoles:
```shell
kubectl delete ClusterRole arms-kube-state-metrics
kubectl delete ClusterRole arms-node-exporter
kubectl delete ClusterRole arms-prom-ack-arms-prometheus-role
kubectl delete ClusterRole arms-prometheus-oper3
kubectl delete ClusterRole arms-prometheus-ack-arms-prometheus-role
kubectl delete ClusterRole arms-pilot-prom-k8s
kubectl delete ClusterRole gpu-prometheus-exporter
kubectl delete ClusterRole o11y:addon-controller:role
kubectl delete ClusterRole arms-aliyunserviceroleforarms-clusterrole
```

Delete the ClusterRoleBindings:
```shell
kubectl delete ClusterRoleBinding arms-node-exporter
kubectl delete ClusterRoleBinding arms-prom-ack-arms-prometheus-role-binding
kubectl delete ClusterRoleBinding arms-prometheus-oper-bind2
kubectl delete ClusterRoleBinding arms-kube-state-metrics
kubectl delete ClusterRoleBinding arms-pilot-prom-k8s
kubectl delete ClusterRoleBinding arms-prometheus-ack-arms-prometheus-role-binding
kubectl delete ClusterRoleBinding gpu-prometheus-exporter
kubectl delete ClusterRoleBinding o11y:addon-controller:rolebinding
kubectl delete ClusterRoleBinding arms-kube-state-metrics-agent
kubectl delete ClusterRoleBinding arms-node-exporter-agent
kubectl delete ClusterRoleBinding arms-aliyunserviceroleforarms-clusterrolebinding
```

Delete the Roles and RoleBindings:
```shell
kubectl delete Role arms-pilot-prom-spec-ns-k8s
kubectl delete Role arms-pilot-prom-spec-ns-k8s -n kube-system
kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s
kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s -n kube-system
```
After deleting these resources, go to the Container Service for Kubernetes (ACK) console, choose Operations > Add-ons, and reinstall ack-arms-prometheus.