Container Monitoring (Pro) supports a 90-day storage period for basic metrics, as well as managed Prometheus collectors. It also provides built-in monitoring dashboards, default alert rules for Container Service for Kubernetes (ACK) components, and data delivery through Remote Write and EventBridge.
Prerequisites
The pay-as-you-go billing method of Container Monitoring (Pro) is enabled.
Connect an ACK cluster to Container Monitoring (Pro)
Log on to the Application Real-Time Monitoring Service (ARMS) console. In the left-side navigation pane, click Integration Center. On the Integration Center page, search for Kubernetes Cluster Monitor and click Kubernetes Cluster Monitor.
In the Kubernetes Cluster Monitor panel, select the ACK cluster to be connected, set the Select version parameter to Container Monitoring (Pro), and then click OK.
Upgrade Container Monitoring (Basic) to Container Monitoring (Pro)
After you upgrade Container Monitoring (Basic) to Container Monitoring (Pro), you cannot downgrade back to Container Monitoring (Basic).
Only ACK Pro clusters are supported.
Log on to the ARMS console. In the left-side navigation pane, click Integration Management. On the Integration Management page, click the Integrated Environments tab, and click Container Service.
Find the cluster that you want to upgrade and click Upgrade in the Actions column. In the dialog box that appears, click OK.
Dashboards supported by Container Monitoring (Pro)
Type | Dashboard |
Monitoring Overview | Cluster Overview |
Monitoring Overview | Namespace Dashboard |
Key Component Monitoring | ACK Pro API server |
Key Component Monitoring | ACK Pro ETCD |
Key Component Monitoring | ACK Pro Scheduler |
Key Component Monitoring | ACK Pro Cloud Controller Manager |
Key Component Monitoring | ACK Pro Kube Controller Manager |
Node Monitoring | Node Pool Overview |
Node Monitoring | Nodes |
Application Monitoring | Deployment Details |
Application Monitoring | StatefulSet Details |
Application Monitoring | DaemonSet Details |
Application Monitoring | Pods |
Network Monitoring | CoreDNS |
Network Monitoring | Ingresses |
Storage Monitoring | CSI - Cluster Dimension |
Storage Monitoring | CSI - Nodes |
Storage Monitoring | Pod IO Monitoring (Pod Level) |
Storage Monitoring | Frontend Storage IO Monitoring (Cluster Level) |
GPU Monitoring | GPUs - Cluster Dimension |
GPU Monitoring | GPUs - Nodes |
GPU Monitoring | GPUs - Pods |
Cost Analysis/Resource Optimization | Resource Profile |
Others | Backend Storage IO Monitoring (Cluster Level) |
Others | k8s-reclaimed-resource |
Others | Prometheus |
Others | Virtual Node(ECI) Overview |
Default alert rules
Alert rule name or ID | Alert group | Template |
The CPU utilization of the node is greater than 75% | Node | The CPU utilization of the {{ $labels.instance }} node is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
The CPU utilization of the node is greater than 85% | Node | The CPU utilization of the {{ $labels.instance }} node is greater than 85%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
The memory usage of the node is greater than 75% | Node | The memory usage of the {{ $labels.instance }} node is greater than 75%. The current memory usage is {{ printf "%.2f" $value }}%. |
The memory usage of the node is greater than 85% | Node | The memory usage of the {{ $labels.instance }} node is greater than 85%. The current memory usage is {{ printf "%.2f" $value }}%. |
The node is abnormal | Node | The {{$labels.node}} node is unavailable for more than 10 minutes. |
The disk usage is greater than 95% | Node | The usage of the {{ $labels.device }} disk of the {{ $labels.instance }} node is greater than 95%. The current disk usage is {{ printf "%.2f" $value }}%. |
The availability rate of pods in the Deployment is less than 50% | Workload | The availability rate of pods in the {{$labels.deployment}} Deployment of the {{$labels.namespace}} namespace is less than 50%. Currently, the number of unavailable pods is {{ $value }}. |
Fail to execute the job | Workload | The {{$labels.job_name}} job in the {{$labels.namespace}} namespace fails to be executed. |
Fail to start the pod due to timeout | Workload | The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace fails to be started within 15 minutes. Cause: {{$labels.reason}}. |
The pod is abnormal | Workload | The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace is in the {{$labels.phase}} state for more than 10 minutes. |
The pod is frequently restarted | Workload | The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace is restarted more than {{ $labels.metrics_params_value}} times within {{$labels.metrics_params_time}} minutes. Currently, the pod is restarted {{ $value }} times. |
The CPU utilization of the container is greater than 85% | Workload | The CPU utilization of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 85%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
The CPU utilization of the container is greater than 75% | Workload | The CPU utilization of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
The memory usage of the container is greater than 75% | Workload | The memory usage of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 75%. The current memory usage is {{ printf "%.2f" $value }}%. |
The memory usage of the container is greater than 85% | Workload | The memory usage of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 85%. The current memory usage is {{ printf "%.2f" $value }}%. |
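The templates in the preceding table use placeholders that look like Prometheus-style Go template expressions: {{ $labels.<name> }} for a label of the alerting time series, {{ $value }} for the evaluated value, and {{ printf "%.2f" $value }} for a formatted value. The following Python sketch only imitates these three forms to show how a template turns into an alert message. It is not the renderer that ARMS uses. The render function, the label values, and the sample number are hypothetical.

```python
import re

# {{ $labels.<name> }}, with or without spaces inside the braces.
LABEL_RE = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")
# {{ printf "<format>" $value }}
PRINTF_RE = re.compile(r'\{\{\s*printf\s+"([^"]*)"\s+\$value\s*\}\}')
# A bare {{ $value }}
VALUE_RE = re.compile(r"\{\{\s*\$value\s*\}\}")


def render(template, labels, value):
    """Fill in the three placeholder forms used by the default alert templates.

    Hypothetical helper for illustration only; a real Go template engine
    supports far more syntax than this.
    """
    # Replace label references; a missing label becomes an empty string here.
    text = LABEL_RE.sub(lambda m: str(labels.get(m.group(1), "")), template)
    # Apply the printf-style format string to the value (for example "%.2f").
    text = PRINTF_RE.sub(lambda m: m.group(1) % value, text)
    # Print a bare value compactly, so 3.0 is shown as 3.
    text = VALUE_RE.sub(lambda m: format(value, "g"), text)
    return text


if __name__ == "__main__":
    cpu_template = (
        "The CPU utilization of the {{ $labels.instance }} node is greater than 75%. "
        'The current CPU utilization is {{ printf "%.2f" $value }}%.'
    )
    # Hypothetical label set and value for one firing alert.
    print(render(cpu_template, {"instance": "node-a"}, 81.456))
```

With the hypothetical inputs above, the instance label is replaced by node-a and the value is rounded to two decimal places. The literal percent signs outside the braces are left untouched.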