Use Container Monitoring (Pro) - Managed Service for Prometheus

Container Monitoring (Pro) stores basic metrics for 90 days and supports managed Prometheus collectors. It also provides built-in monitoring dashboards, default alert rules for components of Container Service for Kubernetes (ACK), and data delivery based on Remote Write and EventBridge.

Prerequisites

The pay-as-you-go billing method of Container Monitoring (Pro) is enabled.

Connect an ACK cluster to Container Monitoring (Pro)

Log on to the Application Real-Time Monitoring Service (ARMS) console. In the left-side navigation pane, click Integration Center. On the Integration Center page, search for and click Kubernetes Cluster Monitor.
In the Kubernetes Cluster Monitor panel, select the ACK cluster that you want to connect, set the Select version parameter to Container Monitoring (Pro), and then click OK.

Upgrade Container Monitoring (Basic) to Container Monitoring (Pro)

Important

After you upgrade Container Monitoring (Basic) to Container Monitoring (Pro), you cannot downgrade back to Container Monitoring (Basic).
Only ACK Pro clusters are supported.

Log on to the ARMS console. In the left-side navigation pane, click Integration Management. On the Integration Management page, click the Integrated Environments tab, and click Container Service.
Find the cluster that you want to upgrade and click Upgrade in the Actions column. In the dialog box that appears, click OK.

Dashboards supported by Container Monitoring (Pro)

Type	Dashboard
Monitoring Overview	Cluster Overview
Monitoring Overview	Namespace Dashboard
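Key Component Monitoring	ACK Pro API server
Key Component Monitoring	ACK Pro ETCD
Key Component Monitoring	ACK Pro Scheduler
Key Component Monitoring	ACK Pro Cloud Controller Manager
Key Component Monitoring	ACK Pro Kube Controller Manager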
Node Monitoring	Node Pool Overview
Node Monitoring	Nodes
Application Monitoring	Deployment Details
Application Monitoring	StatefulSet Details
Application Monitoring	DaemonSet Details
Application Monitoring	Pods
Network Monitoring	CoreDNS
Network Monitoring	Ingresses
Storage Monitoring	CSI - Cluster Dimension
Storage Monitoring	CSI - Nodes
Storage Monitoring	Pod IO Monitoring (Pod Level)
Storage Monitoring	Frontend Storage IO Monitoring (Cluster Level)
GPU Monitoring	GPUs - Cluster Dimension
GPU Monitoring	GPUs - Nodes
GPU Monitoring	GPUs - Pods
Cost Analysis/Resource Optimization	Resource Profile
Others	Backend Storage IO Monitoring (Cluster Level)
Others	k8s-reclaimed-resource
Others	Prometheus
Others	Virtual Node(ECI) Overview

Default alert rules

Alert rule name or ID	Alert group	Template
The CPU utilization of the node is greater than 75%	Node	The CPU utilization of the {{ $labels.instance }} node is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%.
The CPU utilization of the node is greater than 85%	Node	The CPU utilization of the {{ $labels.instance }} node is greater than 85%. The current CPU utilization is {{ printf "%.2f" $value }}%.
The memory usage of the node is greater than 75%	Node	The memory usage of the {{ $labels.instance }} node is greater than 75%. The current memory usage is {{ printf "%.2f" $value }}%.
The memory usage of the node is greater than 85%	Node	The memory usage of the {{ $labels.instance }} node is greater than 85%. The current memory usage is {{ printf "%.2f" $value }}%.
The node is abnormal	Node	The {{$labels.node}} node is unavailable for more than 10 minutes.
The disk usage is greater than 95%	Node	The usage of the {{ $labels.device }} disk of the {{ $labels.instance }} node is greater than 95%. The current disk usage is {{ printf "%.2f" $value }}%.
The availability rate of pods in the Deployment is less than 50%	Workload	The availability rate of pods in the {{$labels.deployment}} Deployment of the {{$labels.namespace}} namespace is less than 50%. Currently, the number of unavailable pods is {{ $value }}.
Fail to execute the job	Workload	The {{$labels.job_name}} job in the {{$labels.namespace}} namespace fails to be executed.
Fail to start the pod due to timeout	Workload	The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace fails to be started within 15 minutes. Cause: {{$labels.reason}}.
The pod is abnormal	Workload	The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace is in the {{$labels.phase}} state for more than 10 minutes.
The pod is frequently restarted	Workload	The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace is restarted more than {{ $labels.metrics_params_value}} times within {{$labels.metrics_params_time}} minutes. Currently, the pod is restarted {{ $value }} times.
The CPU utilization of the container is greater than 85%	Workload	The CPU utilization of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 85%. The current CPU utilization is {{ printf "%.2f" $value }}%.
The CPU utilization of the container is greater than 75%	Workload	The CPU utilization of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%.
The memory usage of the container is greater than 75%	Workload	The memory usage of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 75%. The current memory usage is {{ printf "%.2f" $value }}%.
The memory usage of the container is greater than 85%	Workload	The memory usage of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 85%. The current memory usage is {{ printf "%.2f" $value }}%.
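
The templates in the preceding table use placeholders such as {{ $labels.instance }} and {{ printf "%.2f" $value }}, which appear to follow the Go text/template conventions common in Prometheus alerting. The following minimal Go sketch shows how such a template could expand into a notification message. The renderAlert function, the alertData type, and the sample label and value are hypothetical names chosen for illustration; they are not part of the ARMS API, and the sketch does not claim to reproduce how the service itself renders alerts.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertData holds the values that a template can reference.
// Hypothetical type for this sketch only.
type alertData struct {
	Labels map[string]string
	Value  float64
}

// renderAlert expands a template that refers to $labels and $value.
// The prefix binds both variables to the supplied data so that the
// template text from the table can be used unchanged (an assumption
// about how the variables are made available).
func renderAlert(text string, labels map[string]string, value float64) (string, error) {
	const prefix = "{{ $labels := .Labels }}{{ $value := .Value }}"
	tmpl, err := template.New("alert").Parse(prefix + text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, alertData{Labels: labels, Value: value}); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// Template text copied from the node CPU utilization rule (75%).
	text := `The CPU utilization of the {{ $labels.instance }} node is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%.`

	// Sample label and value, made up for this example.
	msg, err := renderAlert(text, map[string]string{"instance": "node-1"}, 81.5)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```

With the sample input, $labels.instance is replaced by node-1 and the value is formatted with two decimal places, so the message reports a current CPU utilization of 81.50%.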