Metrics for the monitoring of host operating systems - CloudMonitor

You can install the CloudMonitor agent on Elastic Compute Service (ECS) instances or third-party hosts to collect operating system monitoring data. You can also configure alert rules for operating system metrics. When the value of a metric meets the alert condition, an alert notification is sent to you so that you can respond in a timely manner.

Prerequisites

The CloudMonitor agent is installed on the host that you want to monitor. For more information, see Install and uninstall the CloudMonitor agent.

Metrics

The values of operating system metrics are collected every 15 seconds. The following metrics are available:

CPU metrics

Windows
The system calls the NtQuerySystemInformation function in ntdll.dll to obtain the CPU time that is consumed by each process or thread. The system calls this function twice to obtain the CPU utilization of each process or thread that runs in the period between two calls.
Linux
You can check the output of the top command to view information about the metrics that are described in the following table.

Metric	Description	Unit	MetricName	Dimensions	Statistics	Remarks for Linux
(Agent)cpu.idle	The CPU idle time in percentage.	%	cpu_idle	userId and instanceId	Maximum, Minimum, and Average	The percentage of the CPU idle time to the total CPU time.
(Agent)cpu.system	The CPU utilization of the kernel.	%	cpu_system	userId and instanceId	Maximum, Minimum, and Average	The CPU utilization of kernel-space operations, such as system context switches. A high value indicates that excessive processes or threads run on the host.
(Agent)cpu.user	The CPU utilization of user processes.	%	cpu_user	userId and instanceId	Maximum, Minimum, and Average	The CPU utilization of user processes.
(Agent)cpu.wait	The percentage of the CPU that waits for I/O operations to complete.	%	cpu_wait	userId and instanceId	Maximum, Minimum, and Average	A high value indicates frequent I/O operations.
(Agent)cpu.other	The percentage of the CPU that is occupied by other operations.	%	cpu_other	userId and instanceId	Maximum, Minimum, and Average	Calculation method: CPU utilization of low-priority processes + CPU utilization of SoftIrq + CPU utilization of Irq + CPU utilization of Stolen.
(Agent)cpu.total	The percentage of the CPU that is occupied.	%	cpu_total	userId and instanceId	Maximum, Minimum, and Average	Calculation method: 100% - (Agent)cpu.idle.
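
The relationship between the CPU metrics in the preceding table can be illustrated by taking two samples of cumulative CPU time counters and computing each share of the elapsed time. The following is a minimal sketch, not the agent's actual implementation. It assumes that the counters appear in the order of the aggregate cpu line of the /proc/stat file on Linux (user, nice, system, idle, iowait, irq, softirq, steal). The function names cpu_percentages and read_proc_stat are hypothetical.

```python
from typing import Dict, List, Sequence

# Assumed field order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_percentages(prev: Sequence[int], curr: Sequence[int]) -> Dict[str, float]:
    """Derive CPU percentages from two samples of cumulative time counters."""
    delta = {name: c - p for name, p, c in zip(FIELDS, prev, curr)}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("the two samples must be taken at different times")
    pct = {name: 100.0 * value / total for name, value in delta.items()}
    return {
        "cpu_idle": pct["idle"],
        "cpu_system": pct["system"],
        "cpu_user": pct["user"],
        "cpu_wait": pct["iowait"],
        # Low-priority (nice) processes + SoftIrq + Irq + Stolen.
        "cpu_other": pct["nice"] + pct["softirq"] + pct["irq"] + pct["steal"],
        "cpu_total": 100.0 - pct["idle"],
    }


def read_proc_stat(path: str = "/proc/stat") -> List[int]:
    """Read the first eight counters of the aggregate "cpu" line (Linux only)."""
    with open(path) as f:
        return [int(x) for x in f.readline().split()[1:9]]
```

On a Linux host, you can call read_proc_stat twice with a pause in between and pass both results to cpu_percentages.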

Memory metrics

Windows
The system calls the GlobalMemoryStatusEx function in kernel32.dll to obtain the current physical and virtual memory usage in a 32-bit Windows operating system.
Linux
You can check the output of the free command to view information about the metrics described in the following table. The free command obtains memory information from the /proc/meminfo file.

Metric	Description	Unit	MetricName	Dimensions	Statistics	Remarks for Linux
(Agent)memory.total.space	The total size of memory.	Byte	memory_totalspace	userId and instanceId	Maximum, Minimum, and Average	The total size of memory on the host. Data source: the value of the MemTotal parameter in the /proc/meminfo file.
(Agent)memory.free.space	The size of available memory.	Byte	memory_freespace	userId and instanceId	Maximum, Minimum, and Average	The size of available memory in the system. Data source: the value of the MemFree parameter in the /proc/meminfo file.
(Agent)memory.used.space	The size of used memory.	Byte	memory_usedspace	userId and instanceId	Maximum, Minimum, and Average	The size of used memory in the system. Calculation method: Total size of memory - Size of available memory.
(Agent)memory.actualused.space	The size of memory that is consumed by users.	Byte	memory_actualusedspace	userId and instanceId	Maximum, Minimum, and Average	Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file, the following formula is used for calculation: Total size of memory - Value of MemAvailable. If the MemAvailable parameter does not exist in the /proc/meminfo file, the following formula is used for calculation: Size of used memory - Size of memory that is used by buffers - Size of cached memory. Note: The calculation result is more accurate in CentOS 7.2, Ubuntu 16.04, or later versions, which use newer Linux kernels. For more information about the MemAvailable parameter, see commit.
(Agent)memory.free.utilization	The percentage of available memory.	%	memory_freeutilization	userId and instanceId	Maximum, Minimum, and Average	Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file, the following formula is used for calculation: Value of MemAvailable/Total size of memory × 100%. If the MemAvailable parameter does not exist in the /proc/meminfo file, the following formula is used for calculation: (Total size of memory - Value of actualused)/Total size of memory × 100%.
(Agent)memory.used.utilization	The memory usage.	%	memory_usedutilization	userId and instanceId	Maximum, Minimum, and Average	Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file, the following formula is used for calculation: (Total size of memory - Value of MemAvailable)/Total size of memory × 100%. If the MemAvailable parameter does not exist in the /proc/meminfo file, the following formula is used for calculation: (Total size of memory - Total size of available memory - Size of memory that is used by buffers - Size of cached memory)/Total size of memory × 100%.
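
The calculation methods in the preceding table can be sketched as follows. This is a minimal illustration, not the agent's actual code. It assumes that the input is the text of the /proc/meminfo file, in which values are reported in kB, and that the Buffers and Cached parameters correspond to the memory used by buffers and the cached memory. The function names parse_meminfo and memory_metrics are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content. Values with a kB unit are converted to bytes."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


def memory_metrics(info: Dict[str, int]) -> Dict[str, float]:
    """Apply the formulas from the table, with and without MemAvailable."""
    total = info["MemTotal"]
    free = info["MemFree"]
    used = total - free
    if "MemAvailable" in info:
        actual_used = total - info["MemAvailable"]
    else:
        # Fallback for kernels that do not report MemAvailable.
        actual_used = used - info.get("Buffers", 0) - info.get("Cached", 0)
    return {
        "memory_totalspace": total,
        "memory_freespace": free,
        "memory_usedspace": used,
        "memory_actualusedspace": actual_used,
        "memory_freeutilization": 100.0 * (total - actual_used) / total,
        "memory_usedutilization": 100.0 * actual_used / total,
    }
```

On a Linux host, you can pass the content of the /proc/meminfo file to parse_meminfo and then pass the result to memory_metrics.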

Metrics of average system loads

Windows
The metrics are unavailable for Windows hosts.
Linux
You can check the output of the top command to view information about the metrics that are described in the following table. A higher value of a metric indicates more running processes.

Metric	Description	Unit	MetricName	Dimensions	Statistics
(Agent)load.1m	The average system load within the previous minute.	None	load_1m	userId and instanceId	Maximum, Minimum, and Average
(Agent)load.5m	The average system load within the previous 5 minutes.	None	load_5m	userId and instanceId	Maximum, Minimum, and Average
(Agent)load.15m	The average system load within the previous 15 minutes.	None	load_15m	userId and instanceId	Maximum, Minimum, and Average
(Agent)load.1m.percore	The average system load per CPU core within the previous minute.	None	load_per_core_1m	userId and instanceId	Maximum, Minimum, and Average
(Agent)load.5m.percore	The average system load per CPU core within the previous 5 minutes.	None	load_per_core_5m	userId and instanceId	Maximum, Minimum, and Average
(Agent)load.15m.percore	The average system load per CPU core within the previous 15 minutes.	None	load_per_core_15m	userId and instanceId	Maximum, Minimum, and Average
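
The per-core metrics in the preceding table relate the load averages to the number of CPU cores. The following is a minimal sketch that assumes each per-core value is the load average divided by the number of logical CPU cores. The source does not state how the agent counts cores, so treat this as an illustration. The function names load_per_core and current_load_per_core are hypothetical.

```python
import os
from typing import Tuple


def load_per_core(load: Tuple[float, float, float], cores: int) -> Tuple[float, ...]:
    """Divide the 1-minute, 5-minute, and 15-minute load averages by the core count."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return tuple(value / cores for value in load)


def current_load_per_core() -> Tuple[float, ...]:
    """Read the load averages of the local host (Unix only)."""
    # os.cpu_count() can return None, so fall back to a single core.
    return load_per_core(os.getloadavg(), os.cpu_count() or 1)
```

A per-core value that stays above 1 suggests that more processes are runnable than the CPU cores can serve at the same time.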

Disk metrics

Windows
The system calls the GetDiskFreeSpaceExA function in Kernel32.dll to obtain the used disk space, disk usage, free disk space, and total disk space. The system calls the RegConnectRegistryA function to connect to the HKEY_PERFORMANCE_DATA entry in the registry. Then, the system calls the RegQueryValueExA function to query the disk information in HKEY_PERFORMANCE_DATA, including the read count, write count, read bytes, written bytes, read time, write time, and disk active time.
Linux
You can check the output of the df command to view information about the metrics for disk and inode usage. You can check the output of the iostat command to view information about the metrics for disk reads and writes.

Metric	Description	Unit	MetricName	Dimensions	Statistics
Host.diskusage.used	The disk space in use.	Byte	diskusage_used	userId, instanceId, and device	Maximum, Minimum, and Average
Host.diskusage.utilization	The disk usage.	%	diskusage_utilization	userId, instanceId, and device	Maximum, Minimum, and Average
Host.diskusage.free	The size of available disk space for regular users and superusers.	Byte	diskusage_free	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)disk.usage.avail_device	The size of available disk space for regular users.	Byte	diskusage_avail	userId, instanceId, and device	Maximum, Minimum, and Average
Host.diskusage.total	The size of the total disk space.	Byte	diskusage_total	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)disk.read.bps_device	The number of bytes that are read from the disk per second.	Byte/s	disk_readbytes	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)disk.write.bps_device	The number of bytes that are written to the disk per second.	Byte/s	disk_writebytes	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)disk.read.iops_device	The number of read requests that the disk receives per second.	Requests/s	disk_readiops	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)disk.write.iops_device	The number of write requests that the disk receives per second.	Requests/s	disk_writeiops	userId, instanceId, and device	Maximum, Minimum, and Average
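
The difference between diskusage_free and diskusage_avail in the preceding table can be illustrated with the fields of the POSIX statvfs structure. The following is a minimal sketch, not the agent's actual implementation. It assumes that the disk usage is calculated in the same way as the Use% column of the df command, which is used space divided by the sum of used space and the space available to regular users. The function name disk_usage_metrics is hypothetical.

```python
from typing import Any, Dict


def disk_usage_metrics(st: Any) -> Dict[str, float]:
    """Compute disk space metrics from a statvfs-like object."""
    block = st.f_frsize
    total = st.f_blocks * block
    free = st.f_bfree * block    # free space, including blocks reserved for superusers
    avail = st.f_bavail * block  # free space available to regular users
    used = total - free
    # Assumption: same ratio as df, which excludes reserved blocks from the base.
    base = used + avail
    utilization = 100.0 * used / base if base else 0.0
    return {
        "diskusage_total": total,
        "diskusage_used": used,
        "diskusage_free": free,
        "diskusage_avail": avail,
        "diskusage_utilization": utilization,
    }
```

On a Unix host, you can pass the result of os.statvfs("/") from the Python standard library to this function.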

File system metric

Windows
The metric is unavailable for Windows hosts.
Linux
You can check the output of the df command to view information about the metric described in the following table.

Metric	Description	Unit	MetricName	Dimensions	Statistics	Remarks for Linux
(Agent)fs.inode.utilization_device	The inode usage.	%	fs_inodeutilization	userId, instanceId, and device	Maximum, Minimum, and Average	Linux operating systems use inode numbers rather than file names to identify files. If inodes are used up, you cannot create files even if the disk space is sufficient. Therefore, the system must monitor the inode usage. The number of inodes indicates the number of files. A large number of small files can cause a high inode usage.
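
The inode usage can be illustrated with the inode counters of the POSIX statvfs structure. The following is a minimal sketch that assumes the usage is the number of used inodes divided by the total number of inodes, which is not confirmed as the agent's exact formula. The function name inode_utilization is hypothetical.

```python
from typing import Any


def inode_utilization(st: Any) -> float:
    """Return the percentage of used inodes from a statvfs-like object."""
    total = st.f_files
    if total == 0:
        # Some file systems do not report inode counts.
        return 0.0
    used = total - st.f_ffree
    return 100.0 * used / total
```

On a Unix host, you can pass the result of os.statvfs for a mount point to this function. A value close to 100 means that new files cannot be created soon, even if disk space remains.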

Network metrics

Windows
The system calls the GetAdaptersAddresses function in Iphlpapi.dll to obtain the addresses of NICs on the host. The system calls the GetIfTable function to obtain the data of metrics for each interface, for example, the number of bits that an interface receives and sends per second, the number of packets that an interface receives and sends per second, and the number of error packets that an interface receives and sends.
Linux
- You can check the output of the ss command to view information about the TCP connection metric.
  Note
  TCP connections represent the connections that are established between ECS instances and clients over TCP.
  By default, the CloudMonitor agent collects the following data about TCP connections in different states: TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED. TCP_TOTAL indicates the total number of connections. ESTABLISHED indicates the number of established connections. NON_ESTABLISHED indicates the number of connections that are not in the ESTABLISHED state.
- You can check the output of the iftop command to view information about the network metrics.

Metric	Description	Unit	MetricName	Dimensions	Statistics
(Agent)network.in.rate_device	The number of bits that the NIC receives per second. This is the downstream bandwidth of the NIC.	bit/s	networkin_rate	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)network.out.rate_device	The number of bits that the NIC sends per second. This is the upstream bandwidth of the NIC.	bit/s	networkout_rate	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)network.in.packages_device	The number of packets that the NIC receives per second.	Count/s	networkin_packages	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)network.out.packages_device	The number of packets that the NIC sends per second.	Count/s	networkout_packages	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)network.in.errorpackages_device	The number of inbound error packets that the driver detects.	Count/s	networkin_errorpackages	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)network.out.errorpackages_device	The number of outbound error packets that the driver detects.	Count/s	networkout_errorpackages	userId, instanceId, and device	Maximum, Minimum, and Average
(Agent)network.tcp.connection_state	The number of TCP connections in each state. These connection states include LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED.	Count	net_tcpconnection	userId, instanceId, and state	Maximum, Minimum, and Average
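
The default TCP connection counts (TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED) and the bit/s rates in the preceding table can be sketched as follows. This is a minimal illustration under two assumptions: the connection states are already available as a list of state names, and the rates are derived from two samples of a cumulative byte counter. The function names summarize_tcp_states and bits_per_second are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_tcp_states(states: Iterable[str]) -> Dict[str, int]:
    """Aggregate per-connection states into the three default counts."""
    counts = Counter(states)
    total = sum(counts.values())
    established = counts.get("ESTABLISHED", 0)
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        "NON_ESTABLISHED": total - established,
    }


def bits_per_second(prev_bytes: int, curr_bytes: int, interval_seconds: float) -> float:
    """Convert two samples of a cumulative byte counter into a bit/s rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_bytes - prev_bytes) * 8 / interval_seconds
```

For example, a counter that grows by 15,000 bytes over a 15-second collection interval corresponds to a rate of 8,000 bit/s.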

Metrics related to the top five processes

Windows
- Query
  The system calls the OpenProcess function in Kernel32.dll to obtain the handle of a process. The system calls the GetProcessTimes function twice to obtain the CPU time consumed by the process and then calculates the CPU utilization of the process in the interval between the two calls. The system calls the RegConnectRegistryA function to connect to the HKEY_PERFORMANCE_DATA entry in the registry. Then, the system calls the RegQueryValueExA function to query the process information in HKEY_PERFORMANCE_DATA, including the process ID, parent process ID, priority, virtual memory, resident memory, shared memory, number of files that the process opens, thread count, page faults, read bytes, and written bytes.
- Count the number of processes that match the specified keyword
  - The system calls the OpenProcess function to obtain the handle of a process. The system calls the NtQueryInformationProcess function in ntdll.dll to obtain RTL_USER_PROCESS_PARAMETERS of the process. The system calls the ReadProcessMemory function to obtain the arguments and root path of the process from the command line information. This way, the system can obtain the directory of the process.
  - The system calls the OpenProcessToken function to obtain the handle of a token. The system calls the GetTokenInformation function to obtain the token information. The system calls the LookupAccountSid function to obtain the username and user group of the process.
  - The system matches the directory, username, and user group of the process with the keyword. If the process information matches the keyword, the system increases the value of Host.process.number by 1.
Linux
- You can check the output of the top command to view information about the CPU utilization and memory usage of processes. The CPU utilization indicates the consumption of multi-core CPUs.
- You can check the output of the lsof command to view information about the Host.process.openfile metric.
- You can check the output of the ps aux | grep '<Keyword>' command to view information about the Host.process.number metric.

Metric	Description	Unit	MetricName	Dimensions	Statistics	Remarks
(Agent)process.cpu_pid	The CPU utilization of a process.	%	process.cpu	userId, instanceId, name, and pid	Average	You cannot configure alert rules for this metric.
(Agent)process.memory_pid	The memory usage of a process.	%	process.memory	userId, instanceId, name, and pid	Average	You cannot configure alert rules for this metric.
(Agent)process.openfile_pid	The number of files that are opened by a process.	Count	process.openfile	userId, instanceId, name, and pid	Average	You cannot configure alert rules for this metric.
(Agent)process.count_processname	The number of processes that match the specified keyword.	Count	process.number	userId, instanceId, and processName	Average	You cannot configure alert rules for this metric.
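
The keyword-based process count and the selection of the top five processes can be sketched as follows. This is a minimal illustration, not the agent's actual implementation. It assumes that the keyword is matched as a plain substring of each command line and that the top five processes are ranked by CPU utilization. The function names count_matching and top_five_by_cpu and the cpu dictionary key are hypothetical.

```python
import heapq
from typing import Any, Dict, Iterable, List


def count_matching(command_lines: Iterable[str], keyword: str) -> int:
    """Count the command lines that contain the keyword as a substring."""
    return sum(1 for line in command_lines if keyword in line)


def top_five_by_cpu(processes: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return up to five processes with the highest CPU utilization."""
    return heapq.nlargest(5, processes, key=lambda p: p["cpu"])
```

Unlike the output of the ps aux | grep '<Keyword>' command, which typically also lists the grep process itself, this sketch counts only the command lines that you pass in.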

View the monitoring data of the operating system

Log on to the CloudMonitor console.
In the left-side navigation pane, click Cloud Service Monitoring > Host Monitoring.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
On the OS Monitoring tab, you can view the monitoring data of the operating system. You can also configure alert rules for the metrics and view the alerting status. For more information, see Step 2: Create an alert rule for the host and Step 3: View host alerts.
