Managed Service for Grafana: Container Monitoring (Pro)

Last Updated: Sep 27, 2024

Container Monitoring (Pro) stores basic metrics for 90 days and supports managed Prometheus collectors. It also provides built-in monitoring dashboards, default alert rules for Container Service for Kubernetes (ACK) components, and data delivery based on Remote Write and EventBridge.

Prerequisites

The pay-as-you-go billing method of Container Monitoring (Pro) is enabled.

Connect an ACK cluster to Container Monitoring (Pro)

  1. Log on to the Application Real-Time Monitoring Service (ARMS) console. In the left-side navigation pane, click Integration Center. On the Integration Center page, search for Kubernetes Cluster Monitor and click Kubernetes Cluster Monitor.

  2. In the Kubernetes Cluster Monitor panel, select the ACK cluster that you want to connect, set the Select version parameter to Container Monitoring (Pro), and then click OK.

Upgrade Container Monitoring (Basic) to Container Monitoring (Pro)

Important
  • The upgrade cannot be reversed. After you upgrade Container Monitoring (Basic) to Container Monitoring (Pro), you cannot downgrade to Container Monitoring (Basic).

  • Only ACK Pro clusters are supported.

  1. Log on to the ARMS console. In the left-side navigation pane, click Integration Management. On the Integration Management page, click the Integrated Environments tab, and click Container Service.

  2. Find the cluster that you want to upgrade and click Upgrade in the Actions column. In the dialog box that appears, click OK.

Dashboards supported by Container Monitoring (Pro)

| Type | Dashboards |
|---|---|
| Monitoring Overview | Cluster Overview, Namespace Dashboard |
| Key Component Monitoring | ACK Pro API server, ACK Pro ETCD, ACK Pro Scheduler, ACK Pro Cloud Controller Manager, ACK Pro Kube Controller Manager |
| Node Monitoring | Node Pool Overview, Nodes |
| Application Monitoring | Deployment Details, StatefulSet Details, DaemonSet Details, Pods |
| Network Monitoring | CoreDNS, Ingresses |
| Storage Monitoring | CSI - Cluster Dimension, CSI - Nodes, Pod IO Monitoring (Pod Level), Frontend Storage IO Monitoring (Cluster Level) |
| GPU Monitoring | GPUs - Cluster Dimension, GPUs - Nodes, GPUs - Pods |
| Cost Analysis/Resource Optimization | Resource Profile |
| Others | Backend Storage IO Monitoring (Cluster Level), k8s-reclaimed-resource, Prometheus, Virtual Node(ECI) Overview |

Default alert rules

| Alert rule name or ID | Alert group | Template |
|---|---|---|
| The CPU utilization of the node is greater than 75% | Node | The CPU utilization of the {{ $labels.instance }} node is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
| The CPU utilization of the node is greater than 85% | Node | The CPU utilization of the {{ $labels.instance }} node is greater than 85%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
| The memory usage of the node is greater than 75% | Node | The memory usage of the {{ $labels.instance }} node is greater than 75%. The current memory usage is {{ printf "%.2f" $value }}%. |
| The memory usage of the node is greater than 85% | Node | The memory usage of the {{ $labels.instance }} node is greater than 85%. The current memory usage is {{ printf "%.2f" $value }}%. |
| The node is abnormal | Node | The {{$labels.node}} node is unavailable for more than 10 minutes. |
| The disk usage is greater than 95% | Node | The usage of the {{ $labels.device }} disk of the {{ $labels.instance }} node is greater than 95%. The current disk usage is {{ printf "%.2f" $value }}%. |
| The availability rate of pods in the Deployment is less than 50% | Workload | The availability rate of pods in the {{$labels.deployment}} Deployment of the {{$labels.namespace}} namespace is less than 50%. Currently, the number of unavailable pods is {{ $value }}. |
| Fail to execute the job | Workload | The {{$labels.job_name}} job in the {{$labels.namespace}} namespace fails to be executed. |
| Fail to start the pod due to timeout | Workload | The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace fails to be started within 15 minutes. Cause: {{$labels.reason}}. |
| The pod is abnormal | Workload | The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace is in the {{$labels.phase}} state for more than 10 minutes. |
| The pod is frequently restarted | Workload | The {{$labels.pod_name}} pod in the {{$labels.namespace}} namespace is restarted more than {{ $labels.metrics_params_value}} times within {{$labels.metrics_params_time}} minutes. Currently, the pod is restarted {{ $value }} times. |
| The CPU utilization of the container is greater than 85% | Workload | The CPU utilization of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 85%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
| The CPU utilization of the container is greater than 75% | Workload | The CPU utilization of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 75%. The current CPU utilization is {{ printf "%.2f" $value }}%. |
| The memory usage of the container is greater than 75% | Workload | The memory usage of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 75%. The current memory usage is {{ printf "%.2f" $value }}%. |
| The memory usage of the container is greater than 85% | Workload | The memory usage of the {{$labels.container}} container in the {{$labels.pod_name}} pod of the {{$labels.namespace}} namespace is greater than 85%. The current memory usage is {{ printf "%.2f" $value }}%. |
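
The alert templates above appear to follow the Go template syntax that Prometheus uses for alerting rules, where $labels holds the label set of the firing series and $value holds the measured value. The following is a minimal sketch of how such a template expands into a message, using only the Go standard library. The alertData type, the renderAlert helper, and the sample instance label and value are hypothetical and exist only for illustration; the service performs this rendering itself, and its internal implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertData is a hypothetical stand-in for the data that an alerting
// engine passes to a template. The field names are assumptions.
type alertData struct {
	Labels map[string]string
	Value  float64
}

// renderAlert expands a template that refers to $labels and $value.
// The prefix binds both variables before the template body runs,
// which mirrors the convention used for Prometheus alert templates.
func renderAlert(tmpl string, labels map[string]string, value float64) (string, error) {
	const prefix = "{{ $labels := .Labels }}{{ $value := .Value }}"
	t, err := template.New("alert").Parse(prefix + tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, alertData{Labels: labels, Value: value}); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Template text copied from the node CPU rule (75% threshold).
	tmpl := `The CPU utilization of the {{ $labels.instance }} node is greater than 75%. ` +
		`The current CPU utilization is {{ printf "%.2f" $value }}%.`

	// Sample label and value, made up for this sketch.
	out, err := renderAlert(tmpl, map[string]string{"instance": "192.168.0.10:9100"}, 81.456)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With the sample inputs, the printf call rounds the value to two decimal places, so the message reports a current CPU utilization of 81.46%. The same helper can expand the workload templates if the label map contains the keys that the template references, such as namespace and pod_name.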