System Observer Monitoring (SysOM) is an OS kernel-level container monitoring method. Container Service for Kubernetes (ACK) allows you to monitor containers at the OS kernel level based on SysOM. This capability can help you better deploy and migrate containerized applications and monitor containers. This topic describes how to enable and use ack-sysom-monitor. This topic also describes the SysOM metrics for container monitoring.
Prerequisites
An ACK managed cluster is created or an ACK Serverless cluster is created after October 2021, and the Kubernetes version of the cluster is 1.18.8 or later. For more information about how to create a cluster, see Create an ACK managed cluster and Create an ACK Serverless cluster. For more information about how to update a cluster, see Manually update ACK clusters.
Managed Service for Prometheus is enabled. For more information, see Enable Managed Service for Prometheus.
Introduction to ack-sysom-monitor
ack-sysom-monitor is a SysOM component that uses extended Berkeley Packet Filter (eBPF) technology to collect node and container metrics at the kernel level. In addition to standard system metrics, ack-sysom-monitor provides enhanced metrics and supports both pod kernel-level monitoring and node kernel-level monitoring to help you identify common issues, including system jitter, latency, resource leaks, and pod memory exceptions.
Billing of ack-sysom-monitor
After the ack-sysom-monitor component is enabled, related components automatically send monitoring metrics to Managed Service for Prometheus. These metrics are treated as custom metrics, and fees are charged for custom metrics.
Before you enable this feature, we recommend that you read Billing overview to understand the billing rules of custom metrics. The fees may vary based on the cluster size and number of applications. You can follow the steps in View resource usage to monitor and manage resource usage.
Enable ack-sysom-monitor
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center.
In the Infrastructure section of the Integration Center page, find and click SysOM System Observation.
In the Start Integration step of the SysOM System Observation panel, select the ACK cluster that you want to integrate and click OK.
Use ack-sysom-monitor
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, go to the Prometheus Monitoring page.
On the Prometheus Monitoring page, click the SysOM tab to view the metrics provided by ack-sysom-monitor.
ack-sysom-monitor supports node kernel-level monitoring and pod kernel-level monitoring.
Node kernel-level monitoring
On the SysOM - Nodes tab, you can view the CPU, memory, scheduling, storage, and network metrics of a node.
Pod kernel-level monitoring
On the SysOM - Pods tab, you can view the memory, CPU, network, and I/O metrics of a pod in real time.
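The metrics shown on these tabs are stored in Prometheus, so they can also be read programmatically. The following minimal sketch builds the URL of an instant query against the standard Prometheus HTTP API path `/api/v1/query`. The base URL is a placeholder and an assumption, not an address from this topic; use the HTTP API address of your own Prometheus instance.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url, promql):
    """Return the URL of a Prometheus instant query (/api/v1/query)."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical endpoint: replace it with the address of your Prometheus instance.
url = build_instant_query_url("http://prometheus.example.com:9090", "sysom_proc_loadavg")
print(url)
```

The function only builds the URL. Sending the request and handling authentication depend on how your Prometheus instance is exposed.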
What to do next
If you want to disable kernel-level container monitoring based on SysOM, you can uninstall the ack-sysom-monitor component. This avoids incurring additional fees. For more information, see Manage components.
Metrics
The metrics provided by ack-sysom-monitor are defined based on the data model used by Prometheus.
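In the Prometheus data model, each sample consists of a metric name, an optional set of key-value labels, and a numeric value. The following sketch parses one sample line in the Prometheus text format. The label names `instance` and `mode` are assumptions used for illustration only; the labels that ack-sysom-monitor attaches may differ.

```python
import re

# name{label="value",...} value  (optional timestamps are not handled in this sketch)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one text-format sample into (metric name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample line; the label names are illustrative.
name, labels, value = parse_sample('sysom_proc_cpu_total{instance="node-1",mode="user"} 12.5')
print(name, labels, value)
```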
Node metrics
Node metrics include CPU, memory, storage, network, and other metrics.
Metrics related to CPUs and scheduling
Metric | Type | Unit | Description |
sysom_proc_cpu_total | gauge | % | Displays information about the CPU uptime of a node. This metric indicates the ratio of the CPU uptime in a state to the total CPU uptime. The following states are supported: user mode, kernel mode, softirq, hardirq, idle, and iowait. |
sysom_proc_cpus | gauge | % | Displays information about the uptime of a CPU on a node. This metric indicates the ratio of the uptime of a CPU in a state to the total uptime of the CPU. The following states are supported: user mode, kernel mode, softirq, hardirq, idle, and iowait. |
sysom_proc_sirq | gauge | % | Displays information about softirq of a node. This metric indicates the number of times that each type of softirq occurs. Supported softirq types include HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, and RCU softirqs. |
sysom_proc_stat_counters | gauge | - | Displays whether the node runs an excessive number of processes in the D state and information about the system load. This metric indicates the number of processes in the Running or D state. It also indicates the system startup time and the number of context switches. |
sysom_proc_loadavg | gauge | - | Displays the load average of a node. This metric indicates the load average, including the runq length, load average within the previous 1 minute, load average within the previous 5 minutes, load average within the previous 15 minutes, and total number of system processes. |
sysom_proc_schedstat | gauge | ns (nanoseconds) | Displays information about the scheduling latency of a node. This metric displays statistics related to CPU scheduling, including the waiting time of the processes in the queue of the current CPU and the length of the timeslice that runs in the current CPU. |
sysom_cpu_dist | gauge | - | Displays the overall scheduling information of a node. This metric indicates the interval between the time when a process releases the CPU and the next time when the process is scheduled onto the CPU. The metric also counts the number of times that this interval falls into each of the following buckets: 1us, 10us, 100us, 1ms, 10ms, 100ms, and 1s. |
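The percentages reported by sysom_proc_cpu_total and sysom_proc_cpus are ratios of the time spent in one state to the total time. The following sketch shows that calculation, assuming you have raw per-state time counters. The dictionary keys and the sample numbers are hypothetical.

```python
def cpu_state_shares(ticks):
    """Return the percentage of total CPU time spent in each state."""
    total = sum(ticks.values())
    if total == 0:
        return {state: 0.0 for state in ticks}
    return {state: 100.0 * value / total for state, value in ticks.items()}


# Hypothetical counters for the six states listed in the table.
sample = {"user": 300, "kernel": 100, "softirq": 20, "hardirq": 10, "idle": 550, "iowait": 20}
shares = cpu_state_shares(sample)
print(shares["user"], shares["idle"])
```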
Metrics related to memory
Metric | Type | Unit | Description |
sysom_proc_meminfo | gauge | KiB | Displays the usage of different types of memory resources on a node. This metric indicates the memory usage, including but not limited to the total memory (Total), free memory (Free), available memory (Available), caches (Cache), buffers (Buffers), reclaimable memory (SReclaimable), and unreclaimable memory (SUnreclaim). |
sysom_proc_vmstat | gauge | - | Displays the memory usage and memory events of a node in detail. This metric indicates the memory statistics of different pages and memory events. The memory information and memory events include free pages (Free Pages), dirty pages (Dirty Pages), page reads and writes (Pages Read/Write), number of pages reclaimed from the Inactive list (Pages Reclaimed from Inactive List), and number of times that the Out-of-Memory (OOM) killer kills applications. |
sysom_proc_buddyinfo | gauge | - | Displays information about how the buddy system allocates and releases kernel memory. This metric indicates the detailed information about the kernel buddy system, including all memory nodes and zones and the number of blocks in different sizes in linked lists. |
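The buddy system tracks free blocks of 2^order contiguous pages per node and zone. The following sketch computes the free memory per zone from text in the Linux `/proc/buddyinfo` layout. It assumes 4 KiB pages, and the sample numbers are made up. How ack-sysom-monitor labels these values in sysom_proc_buddyinfo is not covered here.

```python
PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages


def free_kib_per_zone(buddyinfo_text):
    """Map (node, zone) to free memory in KiB from /proc/buddyinfo-style text."""
    result = {}
    for line in buddyinfo_text.strip().splitlines():
        head, rest = line.split("zone", 1)
        node = int(head.replace("Node", "").replace(",", "").strip())
        fields = rest.split()
        zone = fields[0]
        # Column i holds the number of free blocks of 2**i pages.
        blocks = [int(count) for count in fields[1:]]
        pages = sum(count * (2 ** order) for order, count in enumerate(blocks))
        result[(node, zone)] = pages * PAGE_SIZE_KIB
    return result


# Made-up sample in the /proc/buddyinfo layout (orders 0 to 10).
sample = """Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone   Normal    100     50     25      0      0      0      0      0      0      0      1"""
free = free_kib_per_zone(sample)
print(free)
```

Many small blocks and few large ones in a zone indicate fragmentation, even if the total free memory looks sufficient.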
Metrics related to storage
Metric | Type | Unit | Description |
sysom_proc_disks | gauge | - | Displays information about the input, output, IOPS, and latency of each disk on a node. This metric indicates disk and partition statistics, including the number of read and write requests completed by a partition, total amount of time used to complete the read and write requests, number of times that read and write requests are merged, and number of inflight read and write requests. |
sysom_fs_stat | gauge | - | Displays the usage of file systems mounted to a node. This metric indicates the usage of a file system, including the mount target of the file system, block size, number of used blocks and number of available blocks, and number of used inodes and number of available inodes. |
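The block and inode counts reported for a file system can be turned into usage percentages. The following sketch assumes that you have already extracted the counts into a simple record. The field names and the numbers are hypothetical and are not label names defined by ack-sysom-monitor.

```python
from dataclasses import dataclass


@dataclass
class FsStat:
    # Hypothetical field names that mirror the values listed for sysom_fs_stat.
    block_size: int
    blocks_used: int
    blocks_available: int
    inodes_used: int
    inodes_available: int


def usage_percent(used, available):
    """Return used / (used + available) as a percentage."""
    total = used + available
    return 100.0 * used / total if total else 0.0


fs = FsStat(block_size=4096, blocks_used=750_000, blocks_available=250_000,
            inodes_used=10_000, inodes_available=90_000)
block_usage = usage_percent(fs.blocks_used, fs.blocks_available)
inode_usage = usage_percent(fs.inodes_used, fs.inodes_available)
available_bytes = fs.block_size * fs.blocks_available
print(block_usage, inode_usage, available_bytes)
```

Checking inode usage separately is useful because a file system can run out of inodes while blocks are still available.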
Metrics related to networks
Metric | Type | Unit | Description |
sysom_proc_networks | gauge | - | Displays information about the data transfer of the network interface cards (NICs) on a node. This metric indicates the data transfer information of an NIC, including the total number of data packets received or sent by the NIC, total number of bytes, total number of data packets discarded by the device driver, and total number of data packets that fail to be sent or received. |
sysom_proc_pkt_status | gauge | - | Displays information about data packets processed by the network protocol stack of a node. This metric indicates the number of events that occur when data packets pass through the network protocol stack, including the number of packet losses, the number of overflows, and the number of invalid assertions. |
sysom_sock_stat | gauge | - | Helps you identify socket or buffer shortages caused by the application logic or system parameters. This metric displays statistics about the usage of sockets and buffers, including the usage of total, raw, TCP, and UDP sockets, the number of sockets in the TCP time wait or orphan state, and the memory usage of TCP and UDP sockets. |
sysom_softnets | gauge | - | Displays information about data packets received by the NIC softirqs of each CPU on a node. This metric indicates statistics about the NIC softirqs of a CPU, including the number of packets received or sent by a softirq and the number of times that the net_rx_action function is called to handle packet reception softirqs. |
sysom_net_health_hist | gauge | - | Displays the trend of the round-trip time (RTT) of all TCP connections on a node. This metric indicates the trend of the RTT of all TCP connections on a node. It counts the number of connections that correspond to each average RTT value, such as 10 milliseconds, 100 milliseconds, and 1 second. |
sysom_net_health_count | gauge | - | This metric is similar to the |
sysom_net_retrans_count | gauge | - | Displays retransmission information about all TCP connections on a node. This metric indicates the types of data packets that are retransmitted through TCP connections and the number of retransmitted data packets of each type (such as SYN, SYN-ACK, and RESET packets), including the number of packets retransmitted due to retransmission timeouts. |
sysom_net_tcp_count | gauge | - | Displays basic information about the TCP connections on a node. This metric indicates statistics about TCP connections, including the number of active TCP connections, number of TCP segments received or sent, number of TCP segments retransmitted, and number of packets that fail to be received. |
sysom_net_udp_count | gauge | - | Displays basic information about the UDP connections on a node. This metric indicates statistics about UDP connections, including the number of UDP packets received or sent, the number of times that the UDP send or receive buffer encounters errors, and the number of data packets that encounter errors because no ports are available. |
sysom_net_ip_count | gauge | - | Displays basic information about the IP layer of a node. This metric indicates statistics about the IP layer, including the number of data packets that are forwarded, received, or sent. |
sysom_net_icmp_count | gauge | - | Displays basic information about the ICMP protocol of a node. This metric indicates statistics about the ICMP protocol, including the number of data packets that are received or sent by ICMP and the number of data packets that fail to be received or sent. |
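A common way to use TCP statistics such as those in sysom_net_tcp_count is to compare the segments retransmitted with the segments sent between two readings. The following sketch assumes that both values are cumulative counts. The keys `out_segs` and `retrans_segs` are hypothetical names chosen for illustration.

```python
def retransmission_ratio(previous, current):
    """Return retransmitted / sent segments between two cumulative readings."""
    sent = current["out_segs"] - previous["out_segs"]
    retransmitted = current["retrans_segs"] - previous["retrans_segs"]
    if sent <= 0:
        return 0.0
    return retransmitted / sent


# Two made-up readings taken some time apart.
earlier = {"out_segs": 10_000, "retrans_segs": 50}
later = {"out_segs": 12_000, "retrans_segs": 90}
ratio = retransmission_ratio(earlier, later)
print(ratio)
```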
Other system metrics
Metric | Type | Unit | Description |
sysom_cgroups | gauge | - | Displays the number of cgroups used by different cgroup subsystems to help you identify cgroup leaks. This metric indicates the number of cgroups in different cgroup subsystems, including the CPU, Cpuacct, Memory, Pids, Blkio, and Devices subsystems. |
sysom_uptime | gauge | s (seconds) | Displays system loads. This metric indicates the uptime of the system from the time when the system starts up to the current time. This metric also indicates the idle time of the system. |
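The uptime and idle time can be combined into an idle fraction. The following sketch assumes that the idle time is summed over all CPUs, as in the Linux `/proc/uptime` file. Whether sysom_uptime reports idle time in the same way is an assumption.

```python
def idle_fraction(uptime_seconds, idle_seconds, cpu_count):
    """Return the share of total CPU capacity that was idle since startup."""
    capacity = uptime_seconds * cpu_count
    if capacity <= 0:
        return 0.0
    return min(idle_seconds / capacity, 1.0)


# Made-up values: a 4-CPU node that has been up for 1,000 seconds.
fraction = idle_fraction(uptime_seconds=1000.0, idle_seconds=3000.0, cpu_count=4)
print(fraction)
```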
Metrics related to containers
Container metrics include CPU, memory, IO, network, and other metrics.
Metrics related to CPUs and scheduling
Metric | Type | Unit | Description |
sysom_container_cpu_stat | gauge | - | Helps you monitor and assess whether resource quotas need to be adjusted or other optimizations are required. This metric indicates statistics about CPU limits for containers, including the number of times that CPU limits are enforced in each cgroup, total number of times that CPU limits are enforced, and duration of CPU limit enforcement. |
sysom_container_cpu_acctstat | gauge | % | Displays the CPU usage information of containers. This metric indicates the CPU utilization of tasks in a container that runs in each mode, including the CPU utilization in user mode, CPU utilization in kernel mode, and total CPU utilization. |
sysom_container_cpu_cfsquota | gauge | - | Displays the period of time during which a container is limited by the Completely Fair Scheduler (CFS). This metric indicates the amount of time that a container can run within each CFS time window, including the cfs_period_us and cfs_quota_us parameters. |
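The cfs_quota_us and cfs_period_us parameters together define the CPU limit of a container: the container can run for at most cfs_quota_us microseconds in every window of cfs_period_us microseconds. The following sketch converts the two values to a CPU limit in cores, assuming the cgroup v1 convention that a negative quota means no limit.

```python
def cpu_limit_cores(cfs_quota_us, cfs_period_us):
    """Return the CPU limit in cores, or None if no limit is set."""
    if cfs_period_us <= 0:
        raise ValueError("cfs_period_us must be positive")
    if cfs_quota_us < 0:  # assumption: -1 means unlimited, as in cgroup v1
        return None
    return cfs_quota_us / cfs_period_us


print(cpu_limit_cores(200_000, 100_000))  # two full cores per period
print(cpu_limit_cores(50_000, 100_000))   # half a core per period
print(cpu_limit_cores(-1, 100_000))       # no limit
```

If the throttling counts in sysom_container_cpu_stat keep increasing for a container, compare its actual CPU usage with this limit before you adjust the quota.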
Metrics related to memory
Metric | Type | Unit | Description |
sysom_container_memory_stat | gauge | KiB | Displays the usage of different types of memory resources in containers. This metric indicates statistics about the memory usage of containers, including the total memory (Total), free memory (Free), available memory (Available), caches (Cache), buffers (Buffers), reclaimable memory (SReclaimable), and unreclaimable memory (SUnreclaim). |
sysom_container_memory_filecache | gauge | KiB | This metric helps you quickly learn the usage of page caches in containers and identify issues such as insufficient memory, memory latency, and memory jitters caused by overuse of page caches. The metric indicates the usage of page caches in containers, including the top 10 files that occupy the most page caches in each container, the size of each file, and the total size of page caches that are occupied. |
sysom_container_memory_gdrcm_latency | gauge | Times | Displays the number and duration of delays caused by memory reclamation due to insufficient memory resources. Delays are counted in the following ranges: 1 millisecond to 5 milliseconds, 5 milliseconds to 10 milliseconds, 10 milliseconds to 100 milliseconds, 100 milliseconds to 500 milliseconds, 500 milliseconds to 1,000 milliseconds, and more than 1,000 milliseconds. |
sysom_container_memory_cdrcm_latency | gauge | Times | Displays the number and duration of delays caused by memory reclamation due to insufficient memory cgroups. Note: This metric is valid only if the current memory cgroups are non-root cgroups or memory limits are configured for the current memory cgroups. Delays are counted in the following ranges: 1 millisecond to 5 milliseconds, 5 milliseconds to 10 milliseconds, 10 milliseconds to 100 milliseconds, 100 milliseconds to 500 milliseconds, 500 milliseconds to 1,000 milliseconds, and more than 1,000 milliseconds. |
sysom_container_memory_cpt_latency | gauge | Times | Displays the number and duration of delays caused by kernel memory adjustment. When a process in a container applies for memory resources, memory adjustment is triggered if the node has insufficient memory or an excessive number of memory fragments exists. Delays are counted in the following ranges: 1 millisecond to 5 milliseconds, 5 milliseconds to 10 milliseconds, 10 milliseconds to 100 milliseconds, 100 milliseconds to 500 milliseconds, 500 milliseconds to 1,000 milliseconds, and more than 1,000 milliseconds. |
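The three latency metrics above count delays per duration range. The following sketch shows how individual delay durations map to those ranges. It assumes that the lower bound of each range is inclusive and that delays shorter than 1 millisecond are not counted. Both are assumptions, and the sample durations are made up.

```python
from bisect import bisect_right
from collections import Counter

EDGES_MS = [1, 5, 10, 100, 500, 1000]
LABELS = ["1-5ms", "5-10ms", "10-100ms", "100-500ms", "500-1000ms", ">=1000ms"]


def bucket_delays(delays_ms):
    """Count delays per range; delays below 1 ms are ignored."""
    counts = Counter({label: 0 for label in LABELS})
    for delay in delays_ms:
        index = bisect_right(EDGES_MS, delay)  # 0 means shorter than 1 ms
        if index > 0:
            counts[LABELS[index - 1]] += 1
    return dict(counts)


# Made-up delay durations in milliseconds.
histogram = bucket_delays([0.4, 2, 7, 7.5, 50, 250, 800, 1500])
print(histogram)
```

A growing count in the upper ranges suggests that memory pressure is causing noticeable stalls for processes in the container.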
Metrics related to IO
Metric | Type | Unit | Description |
sysom_container_blkio_stat | gauge | - | Displays basic IO information about containers. This metric indicates the IO statistics of a disk used by a container, including the number and bytes of read or write requests to the disk, the number and bytes of read or write requests that are submitted to the queue, and the waiting time of the read or write requests. |
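The request counts and waiting time can be combined into an average waiting time per request. The following sketch assumes that the waiting time is a cumulative total in nanoseconds, as in the cgroup v1 blkio statistics. The unit and the parameter names are assumptions.

```python
def average_wait_ms(total_wait_ns, request_count):
    """Return the average waiting time per request in milliseconds."""
    if request_count == 0:
        return 0.0
    return total_wait_ns / request_count / 1e6


# Made-up values: 1,000 requests that waited 5 seconds in total.
average = average_wait_ms(total_wait_ns=5_000_000_000, request_count=1_000)
print(average)
```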
Metrics related to networks
Metric | Type | Unit | Description |
sysom_container_network_stat | gauge | - | Displays basic data transfer information about containers. This metric indicates the data transfer statistics of a virtual NIC, including the number of data packets or bytes received or sent by the virtual NIC and the number of data packets that are discarded by the virtual NIC device. Data packets that are discarded by the network protocol stack are not taken into account. |