CloudMonitor: Operating system monitoring

Last Updated: Jul 10, 2024

You can install the CloudMonitor agent on Elastic Compute Service (ECS) instances or third-party hosts to collect operating system monitoring data. You can also configure alert rules for operating system metrics. When the value of a metric meets the alert condition, an alert notification is sent to you so that you can respond in a timely manner.

Prerequisites

The CloudMonitor agent is installed on the host that you want to monitor. For more information, see Install and uninstall the CloudMonitor agent.

Metrics

The values of operating system metrics are collected every 15 seconds. The following metrics are available:

  • CPU metrics

    • Windows

      The system calls the NtQuerySystemInformation function in ntdll.dll to obtain the CPU time that is consumed by each process or thread. The system calls this function twice to obtain the CPU utilization of each process or thread that runs in the period between two calls.

    • Linux

      You can check the output of the top command to view information about the metrics that are described in the following table.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks for Linux |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent)cpu.idle | The percentage of time that the CPU is idle. | % | cpu_idle | userId and instanceId | Maximum, Minimum, and Average | The percentage of the CPU idle time to the total CPU time. |
    | (Agent)cpu.system | The CPU utilization of the kernel. | % | cpu_system | userId and instanceId | Maximum, Minimum, and Average | The CPU utilization consumed by system context switches. A high value indicates that excessive processes or threads run on the host. |
    | (Agent)cpu.user | The CPU utilization of user processes. | % | cpu_user | userId and instanceId | Maximum, Minimum, and Average | The CPU utilization of user processes. |
    | (Agent)cpu.wait | The percentage of the CPU that waits for I/O operations to complete. | % | cpu_wait | userId and instanceId | Maximum, Minimum, and Average | A high value indicates frequent I/O operations. |
    | (Agent)cpu.other | The percentage of the CPU that is occupied by other operations. | % | cpu_other | userId and instanceId | Maximum, Minimum, and Average | Calculation method: CPU utilization of low-priority processes + CPU utilization of SoftIrq + CPU utilization of Irq + CPU utilization of Stolen. |
    | (Agent)cpu.total | The percentage of the CPU that is occupied. | % | cpu_total | userId and instanceId | Maximum, Minimum, and Average | The current CPU consumption. Calculation method: 100% - (Agent)cpu.idle. |
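
The two-sample approach described for CPU metrics can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: it assumes Linux-style cumulative CPU time counters (such as those exposed in /proc/stat), and the CpuTimes class and cpu_percentages function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuTimes:
    """Cumulative CPU time counters in any consistent unit (assumed layout)."""
    user: float
    nice: float
    system: float
    idle: float
    iowait: float
    irq: float
    softirq: float
    steal: float

    def total(self) -> float:
        return (self.user + self.nice + self.system + self.idle
                + self.iowait + self.irq + self.softirq + self.steal)


def cpu_percentages(first: CpuTimes, second: CpuTimes) -> dict:
    """Convert two cumulative samples into percentages for the interval between them."""
    elapsed = second.total() - first.total()
    if elapsed <= 0:
        raise ValueError("the second sample must be taken after the first")

    def pct(field: str) -> float:
        return 100.0 * (getattr(second, field) - getattr(first, field)) / elapsed

    idle = pct("idle")
    return {
        "cpu_idle": idle,
        "cpu_system": pct("system"),
        "cpu_user": pct("user"),
        "cpu_wait": pct("iowait"),
        # Low-priority (nice) + SoftIrq + Irq + Stolen, per the cpu.other remark.
        "cpu_other": pct("nice") + pct("softirq") + pct("irq") + pct("steal"),
        # Everything that is not idle time.
        "cpu_total": 100.0 - idle,
    }


# Made-up sample values for illustration only.
before = CpuTimes(user=100, nice=0, system=50, idle=800, iowait=50, irq=0, softirq=0, steal=0)
after = CpuTimes(user=130, nice=5, system=60, idle=840, iowait=55, irq=2, softirq=3, steal=5)
result = cpu_percentages(before, after)
print(result)
```

With these sample counters, the non-idle categories (user, system, wait, and other) add up to the cpu_total value, which is the relationship that the remarks for cpu.other and cpu.total describe.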

  • Memory metrics

    • Windows

      The system calls the GlobalMemoryStatusEx function in kernel32.dll to obtain the current physical and virtual memory usage in a 32-bit Windows operating system.

    • Linux

      You can check the output of the free command to view information about the metrics described in the following table. The free command obtains memory information from the /proc/meminfo file.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks for Linux |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent)memory.total.space | The total size of memory. | Byte | memory_totalspace | userId and instanceId | Maximum, Minimum, and Average | The total size of memory on the host. Data source: the value of the MemTotal parameter in the /proc/meminfo file. |
    | (Agent)memory.free.space | The size of available memory. | Byte | memory_freespace | userId and instanceId | Maximum, Minimum, and Average | The size of available memory in the system. Data source: the value of the MemFree parameter in the /proc/meminfo file. |
    | (Agent)memory.used.space | The size of used memory. | Byte | memory_usedspace | userId and instanceId | Maximum, Minimum, and Average | The size of used memory in the system. Calculation method: Total size of memory - Size of available memory. |
    | (Agent)memory.actualused.space | The size of memory that is consumed by users. | Byte | memory_actualusedspace | userId and instanceId | Maximum, Minimum, and Average | Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file: Total size of memory - Value of MemAvailable. If the MemAvailable parameter does not exist in the /proc/meminfo file: Size of used memory - Size of memory that is used by buffers - Size of cached memory. Note: The calculation result is more accurate in CentOS 7.2, Ubuntu 16.04, or their later versions that use the latest Linux kernel. For more information about the MemAvailable parameter, see commit. |
    | (Agent)memory.free.utilization | The percentage of available memory. | % | memory_freeutilization | userId and instanceId | Maximum, Minimum, and Average | Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file: Value of MemAvailable/Total size of memory × 100%. If the MemAvailable parameter does not exist in the /proc/meminfo file: (Total size of memory - Value of actualused)/Total size of memory × 100%. |
    | (Agent)memory.used.utilization | The memory usage. | % | memory_usedutilization | userId and instanceId | Maximum, Minimum, and Average | Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file: (Total size of memory - Value of MemAvailable)/Total size of memory × 100%. If the MemAvailable parameter does not exist in the /proc/meminfo file: (Total size of memory - Total size of available memory - Size of memory that is used by buffers - Size of cached memory)/Total size of memory × 100%. |
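
The memory formulas above, including the fallback that applies when MemAvailable is absent, can be sketched as follows. This is an illustrative reading of the documented calculation methods, not the agent's code: parse_meminfo and memory_metrics are hypothetical helper names, and the sample text uses made-up values in the /proc/meminfo layout.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {name: bytes} mapping."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        amount = int(fields[0])
        # /proc/meminfo reports most values in kB (1024 bytes).
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[name.strip()] = amount
    return values


def memory_metrics(info: dict) -> dict:
    """Apply the documented formulas, preferring MemAvailable when it exists."""
    total = info["MemTotal"]
    free = info["MemFree"]
    used = total - free
    if "MemAvailable" in info:
        actual_used = total - info["MemAvailable"]
    else:
        # Fallback: used memory minus buffers and cached memory.
        actual_used = used - info.get("Buffers", 0) - info.get("Cached", 0)
    return {
        "memory_totalspace": total,
        "memory_freespace": free,
        "memory_usedspace": used,
        "memory_actualusedspace": actual_used,
        "memory_freeutilization": 100.0 * (total - actual_used) / total,
        "memory_usedutilization": 100.0 * actual_used / total,
    }


# Made-up sample in the /proc/meminfo layout.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    6000000 kB
Buffers:          500000 kB
Cached:          4000000 kB
"""

with_available = memory_metrics(parse_meminfo(SAMPLE))

# Simulate an older kernel that does not report MemAvailable.
legacy_info = parse_meminfo(SAMPLE)
del legacy_info["MemAvailable"]
without_available = memory_metrics(legacy_info)
```

On a Linux host, the same functions could be fed the contents of /proc/meminfo instead of the sample string. Both branches reduce to the same pair of ratios, which is why memory_freeutilization and memory_usedutilization always add up to 100%.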

  • Metrics of average system loads

    • Windows

      The metric is unavailable for Windows hosts.

    • Linux

      You can check the output of the top command to view information about the metrics that are described in the following table. A higher value of a metric indicates more running processes.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics |
    | --- | --- | --- | --- | --- | --- |
    | (Agent)load.1m | The average system load within the previous minute. | None | load_1m | userId and instanceId | Maximum, Minimum, and Average |
    | (Agent)load.5m | The average system load within the previous 5 minutes. | None | load_5m | userId and instanceId | Maximum, Minimum, and Average |
    | (Agent)load.15m | The average system load within the previous 15 minutes. | None | load_15m | userId and instanceId | Maximum, Minimum, and Average |
    | (Agent)load.1m.percore | The average system load per CPU core within the previous minute. | None | load_per_core_1m | userId and instanceId | Maximum, Minimum, and Average |
    | (Agent)load.5m.percore | The average system load per CPU core within the previous 5 minutes. | None | load_per_core_5m | userId and instanceId | Maximum, Minimum, and Average |
    | (Agent)load.15m.percore | The average system load per CPU core within the previous 15 minutes. | None | load_per_core_15m | userId and instanceId | Maximum, Minimum, and Average |
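
The relationship between the load metrics and their per-core counterparts can be sketched as follows. The sketch assumes that a per-core value is the load average divided by the number of logical CPUs. The document does not state the exact formula, so treat this as an assumption. The function names are hypothetical.

```python
import os


def load_per_core(load_averages, cpu_count):
    """Divide each load average by the CPU count (assumed per-core formula)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


def current_load_metrics():
    """Read the 1-, 5-, and 15-minute load averages on a Unix-like host."""
    loads = os.getloadavg()          # not available on Windows
    cores = os.cpu_count() or 1      # cpu_count() can return None
    per_core = load_per_core(loads, cores)
    names = ("1m", "5m", "15m")
    metrics = {f"load_{n}": v for n, v in zip(names, loads)}
    metrics.update({f"load_per_core_{n}": v for n, v in zip(names, per_core)})
    return metrics


# Example with fixed values: a load of 2.0 on a 4-core host is 0.5 per core.
example = load_per_core((2.0, 1.0, 0.5), 4)
```

A per-core value makes hosts of different sizes comparable: a 1-minute load of 2.0 means something very different on a 2-core host than on a 32-core host.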

  • Disk metrics

    • Windows

      The system calls the GetDiskFreeSpaceExA function in Kernel32.dll to obtain the used disk space, disk usage, free disk space, and total disk space. The system calls the RegConnectRegistryA function to connect to the HKEY_PERFORMANCE_DATA entry in the registry. Then, the system calls the RegQueryValueExA function to query the disk information in HKEY_PERFORMANCE_DATA, including the read count, write count, read bytes, written bytes, read time, write time, and disk active time.

    • Linux

      You can check the output of the df command to view information about the metrics for disk and inode usage. You can check the output of the iostat command to view information about the metrics for disk reads and writes.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics |
    | --- | --- | --- | --- | --- | --- |
    | Host.diskusage.used | The disk space in use. | Byte | diskusage_used | userId, instanceId, and device | Maximum, Minimum, and Average |
    | Host.diskusage.utilization | The disk usage. | % | diskusage_utilization | userId, instanceId, and device | Maximum, Minimum, and Average |
    | Host.diskusage.free | The size of available disk space for regular users and superusers. | Byte | diskusage_free | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)disk.usage.avail_device | The size of available disk space for regular users. | Byte | diskusage_avail | userId, instanceId, and device | Maximum, Minimum, and Average |
    | Host.diskusage.total | The size of the total disk space. | Byte | diskusage_total | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)disk.read.bps_device | The number of bytes that are read from the disk per second. | Byte/s | disk_readbytes | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)disk.write.bps_device | The number of bytes that are written to the disk per second. | Byte/s | disk_writebytes | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)disk.read.iops_device | The number of read requests that the disk receives per second. | Requests/s | disk_readiops | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)disk.write.iops_device | The number of write requests that the disk receives per second. | Requests/s | disk_writeiops | userId, instanceId, and device | Maximum, Minimum, and Average |
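
The difference between diskusage_free (space available to regular users and superusers) and diskusage_avail (space available to regular users only) can be illustrated with the counters that os.statvfs returns on Unix-like systems. This is a sketch under the assumption that the two metrics correspond to the f_bfree and f_bavail fields. The document does not confirm how the agent derives them, and the DiskSpace type and both function names are hypothetical.

```python
import os
from collections import namedtuple

DiskSpace = namedtuple("DiskSpace", "total used free avail")


def disk_space(block_size, blocks, blocks_free, blocks_avail):
    """Convert file system block counters into byte values."""
    total = blocks * block_size
    free = blocks_free * block_size    # includes blocks reserved for the superuser
    avail = blocks_avail * block_size  # what a regular user can still allocate
    return DiskSpace(total=total, used=total - free, free=free, avail=avail)


def disk_space_for(path):
    """Read the counters for the file system that contains path (Unix only)."""
    st = os.statvfs(path)
    return disk_space(st.f_frsize, st.f_blocks, st.f_bfree, st.f_bavail)


# Made-up counters: 1,000 blocks of 4 KiB, 300 free, 250 available to regular users.
example = disk_space(4096, 1000, 300, 250)
```

On file systems that reserve blocks for the superuser, the avail value is smaller than the free value, so an alert on diskusage_avail fires earlier than one on diskusage_free.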

  • File system metric

    • Windows

      The metric is unavailable for Windows hosts.

    • Linux

      You can check the output of the df command to view information about the metric described in the following table.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks for Linux |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent)fs.inode.utilization_device | The inode usage. | % | fs_inodeutilization | userId, instanceId, and device | Maximum, Minimum, and Average | Linux operating systems use inode numbers rather than file names to identify files. If inodes are used up, you cannot create files even if the disk space is sufficient. Therefore, the system must monitor the inode usage. The number of inodes indicates the number of files. A large number of small files can cause a high inode usage. |
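
Inode usage can be estimated from the total and free inode counts of a file system. The following sketch assumes the ratio used inodes/total inodes, in the spirit of the df -i output. The document does not spell out the agent's exact formula, and the function names are hypothetical.

```python
import os


def inode_utilization(inodes_total, inodes_free):
    """Return the percentage of inodes in use."""
    if inodes_total == 0:
        # Some file systems do not report inode counts.
        return 0.0
    return 100.0 * (inodes_total - inodes_free) / inodes_total


def inode_utilization_for(path):
    """Read inode counters for the file system that contains path (Unix only)."""
    st = os.statvfs(path)
    return inode_utilization(st.f_files, st.f_ffree)


# Made-up counters: 1,000 inodes in total, 250 still free.
example = inode_utilization(1000, 250)
```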

  • Network metrics

    • Windows

      The system calls the GetAdaptersAddresses function in Iphlpapi.dll to obtain the addresses of NICs on the host. The system calls the GetIfTable function to obtain the data of metrics for each interface, for example, the number of bits that an interface receives and sends per second, the number of packets that an interface receives and sends per second, and the number of error packets that an interface receives and sends.

    • Linux

      • You can check the output of the ss command to view information about the TCP connection metric.

        Note

        TCP connections represent the connections that are established between ECS instances and clients over TCP.

        By default, the CloudMonitor agent collects the following data about TCP connections in different states: TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED. TCP_TOTAL indicates the total number of connections. ESTABLISHED indicates the number of established connections. NON_ESTABLISHED indicates the number of connections that are not in the ESTABLISHED state.

      • You can check the output of the iftop command to view information about the network metrics.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics |
    | --- | --- | --- | --- | --- | --- |
    | (Agent)network.in.rate_device | The number of bits that the NIC receives per second. This is the downstream bandwidth of the NIC. | bit/s | networkin_rate | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)network.out.rate_device | The number of bits that the NIC sends per second. This is the upstream bandwidth of the NIC. | bit/s | networkout_rate | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)network.in.packages_device | The number of packets that the NIC receives per second. | Count/s | networkin_packages | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)network.out.packages_device | The number of packets that the NIC sends per second. | Count/s | networkout_packages | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)network.in.errorpackages_device | The number of inbound error packets that the driver detects. | Count/s | networkin_errorpackages | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)network.out.errorpackages_device | The number of outbound error packets that the driver detects. | Count/s | networkout_errorpackages | userId, instanceId, and device | Maximum, Minimum, and Average |
    | (Agent)network.tcp.connection_state | The number of TCP connections in each state. These connection states include LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED. | Count | net_tcpconnection | userId, instanceId, and state | Maximum, Minimum, and Average |
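
Two ideas from the network metrics can be sketched in a few lines: converting cumulative interface counters into a per-second rate, and grouping TCP connection states into the default TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED counts. Both functions are hypothetical illustrations. The sketch assumes that the interface exposes cumulative byte counters and that the connection states are already available as a list of strings.

```python
from collections import Counter


def bits_per_second(bytes_first, bytes_second, interval_seconds):
    """Turn two cumulative byte counter readings into a bit/s rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (bytes_second - bytes_first) * 8 / interval_seconds


def summarize_tcp_states(states):
    """Group per-connection states into the three default counts."""
    counts = Counter(states)
    total = sum(counts.values())
    established = counts.get("ESTABLISHED", 0)
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        "NON_ESTABLISHED": total - established,
    }


# Made-up readings taken 15 seconds apart, matching the collection interval.
rate = bits_per_second(1000, 16000, 15)

summary = summarize_tcp_states(["ESTABLISHED", "ESTABLISHED", "LISTEN", "TIME_WAIT"])
```

The same delta-over-interval pattern applies to the packet and error packet counters, without the multiplication by 8.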

  • Metrics related to the top five processes

    • Windows

      • Query

        The system calls the OpenProcess function in Kernel32.dll to obtain the handle of a process. The system calls the GetProcessTimes function twice to obtain the CPU time consumed by the process and then calculates the CPU utilization of the process in the interval between the two calls. The system calls the RegConnectRegistryA function to connect to the HKEY_PERFORMANCE_DATA entry in the registry. Then, the system calls the RegQueryValueExA function to query the process information in HKEY_PERFORMANCE_DATA, including the process ID, parent process ID, priority, virtual memory, resident memory, shared memory, number of files that the process opens, thread count, page faults, read bytes, and written bytes.

      • Count the number of processes that match the specified keyword

        • The system calls the OpenProcess function to obtain the handle of a process. The system calls the NtQueryInformationProcess function in ntdll.dll to obtain RTL_USER_PROCESS_PARAMETERS of the process. The system calls the ReadProcessMemory function to obtain the arguments and root path of the process from the command line information. This way, the system can obtain the directory of the process.

        • The system calls the OpenProcessToken function to obtain the handle of a token. The system calls the GetTokenInformation function to obtain the token information. The system calls the LookupAccountSid function to obtain the username and user group of the process.

        • The system matches the directory, username, and user group of the process with the keyword. If the process information matches the keyword, the system increases the value of Host.process.number by 1.

    • Linux

      • You can check the output of the top command to view information about the CPU utilization and memory usage of processes. The CPU utilization indicates the consumption of multi-core CPUs.

      • You can check the output of the lsof command to view information about the Host.process.openfile metric.

      • You can check the output of the ps aux | grep '<Keyword>' command to view information about the Host.process.number metric.

    | Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent)process.cpu_pid | The CPU utilization of a process. | % | process.cpu | userId, instanceId, name, and pid | Average | You cannot configure alert rules for this metric. |
    | (Agent)process.memory_pid | The memory usage of a process. | % | process.memory | userId, instanceId, name, and pid | Average | You cannot configure alert rules for this metric. |
    | (Agent)process.openfile_pid | The number of files that are opened by a process. | Count | process.openfile | userId, instanceId, name, and pid | Average | You cannot configure alert rules for this metric. |
    | (Agent)process.count_processname | The number of processes that match the specified keyword. | Count | process.number | userId, instanceId, and processName | Average | You cannot configure alert rules for this metric. |
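
The two process-related operations, selecting the top five processes and counting the processes that match a keyword, can be sketched as follows. This is a simplified illustration: ProcessSample, top_processes, and count_matching are hypothetical names, ranking by CPU utilization is an assumption, and the keyword is matched as a plain substring of the user name and command line, which only approximates the ps aux | grep '<Keyword>' check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """A single process observation (hypothetical structure)."""
    pid: int
    user: str
    command: str
    cpu_percent: float


def top_processes(samples, limit=5):
    """Return the processes with the highest CPU utilization."""
    return sorted(samples, key=lambda p: p.cpu_percent, reverse=True)[:limit]


def count_matching(samples, keyword):
    """Count processes whose user name or command line contains the keyword."""
    return sum(1 for p in samples if keyword in f"{p.user} {p.command}")


# Made-up process list for illustration.
samples = [
    ProcessSample(101, "root", "/usr/sbin/sshd -D", 0.5),
    ProcessSample(202, "www", "nginx: worker process", 12.0),
    ProcessSample(203, "www", "nginx: worker process", 9.5),
    ProcessSample(300, "app", "java -jar service.jar", 45.0),
    ProcessSample(400, "root", "/usr/sbin/crond -n", 0.1),
    ProcessSample(500, "db", "mysqld --defaults-file=/etc/my.cnf", 20.0),
]

top_five = top_processes(samples)
nginx_count = count_matching(samples, "nginx")
```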

View the monitoring data of the operating system

  1. Log on to the CloudMonitor console.

  2. In the left-side navigation pane, click Cloud Service Monitoring > Host Monitoring.

  3. On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.

    On the OS Monitoring tab, you can view the monitoring data of the operating system. You can also configure alert rules for the metrics and view the alerting status. For more information, see Step 2: Create an alert rule for the host and Step 3: View host alerts.

References