You can install the CloudMonitor agent on Elastic Compute Service (ECS) instances or third-party hosts to collect the monitoring data of operating systems. You can configure alert rules for operating system metrics. If the value of a metric meets the condition of an alert rule, an alert notification is sent to you so that you can respond in a timely manner.
Prerequisites
The CloudMonitor agent is installed on the host that you want to monitor. For more information, see Install and uninstall the CloudMonitor agent.
Metrics
The values of operating system metrics are collected every 15 seconds. The following metrics are available:
CPU metrics
Windows
The system calls the NtQuerySystemInformation function in ntdll.dll to obtain the CPU time that is consumed by each process or thread. The system calls this function twice to obtain the CPU utilization of each process or thread that runs in the period between two calls.
Linux
You can check the output of the top command to view information about the metrics that are described in the following table.

| Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks for Linux |
| --- | --- | --- | --- | --- | --- | --- |
| (Agent)cpu.idle | The percentage of CPU time that is idle. | % | cpu_idle | userId and instanceId | Maximum, Minimum, and Average | The percentage of the CPU idle time to the total CPU time. |
| (Agent)cpu.system | The CPU utilization of the kernel. | % | cpu_system | userId and instanceId | Maximum, Minimum, and Average | The CPU utilization for system context switches. A high value indicates that excessive processes or threads run on the host. |
| (Agent)cpu.user | The CPU utilization of user processes. | % | cpu_user | userId and instanceId | Maximum, Minimum, and Average | The CPU utilization of user processes. |
| (Agent)cpu.wait | The percentage of CPU time that is spent waiting for I/O operations to complete. | % | cpu_wait | userId and instanceId | Maximum, Minimum, and Average | A high value indicates frequent I/O operations. |
| (Agent)cpu.other | The percentage of the CPU that is occupied by other operations. | % | cpu_other | userId and instanceId | Maximum, Minimum, and Average | Calculation method: CPU utilization of low-priority processes + CPU utilization of SoftIrq + CPU utilization of Irq + CPU utilization of Stolen. |
| (Agent)cpu.total | The percentage of the CPU that is occupied. | % | cpu_total | userId and instanceId | Maximum, Minimum, and Average | Calculation method: 100% - Value of (Agent)cpu.idle. |

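The relationship between the CPU metrics can be illustrated with a minimal Python sketch. It assumes that the percentages are derived from two snapshots of the aggregate cpu line in /proc/stat, which is how tools such as top work; the source does not state how the agent reads these counters on Linux, and the function names and sample values below are hypothetical.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(first: Dict[str, int], second: Dict[str, int]) -> Dict[str, float]:
    """Convert two counter snapshots into percentages for the interval between them."""
    delta = {name: second[name] - first[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    pct = {name: 100.0 * value / total for name, value in delta.items()}
    return {
        "cpu_idle": pct["idle"],
        "cpu_system": pct["system"],
        "cpu_user": pct["user"],
        "cpu_wait": pct["iowait"],
        # Low-priority (nice) + SoftIrq + Irq + Stolen, as in the remark for cpu.other.
        "cpu_other": pct["nice"] + pct["softirq"] + pct["irq"] + pct["steal"],
        # Everything that is not idle time.
        "cpu_total": 100.0 - pct["idle"],
    }


# Two hypothetical snapshots of the "cpu" line, taken a few seconds apart.
before = parse_cpu_line("cpu 1000 50 300 8000 100 10 20 20 0 0")
after = parse_cpu_line("cpu 1300 60 400 8500 150 15 30 45 0 0")
metrics = cpu_percentages(before, after)
```

With these sample values, half of the elapsed ticks are idle, so cpu_idle and cpu_total are both 50%, and cpu_user, cpu_system, cpu_wait, and cpu_other add up to cpu_total.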
Memory metrics
Windows
The system calls the GlobalMemoryStatusEx function in kernel32.dll to obtain the current physical and virtual memory usage in a 32-bit Windows operating system.
Linux
You can check the output of the free command to view information about the metrics described in the following table. The free command obtains memory information from the /proc/meminfo file.

| Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks for Linux |
| --- | --- | --- | --- | --- | --- | --- |
| (Agent)memory.total.space | The total size of memory. | Byte | memory_totalspace | userId and instanceId | Maximum, Minimum, and Average | The total size of memory on the host. Data source: the value of the MemTotal parameter in the /proc/meminfo file. |
| (Agent)memory.free.space | The size of available memory. | Byte | memory_freespace | userId and instanceId | Maximum, Minimum, and Average | The size of available memory in the system. Data source: the value of the MemFree parameter in the /proc/meminfo file. |
| (Agent)memory.used.space | The size of used memory. | Byte | memory_usedspace | userId and instanceId | Maximum, Minimum, and Average | The size of used memory in the system. Calculation method: Total size of memory - Size of available memory. |
| (Agent)memory.actualused.space | The size of memory that is consumed by users. | Byte | memory_actualusedspace | userId and instanceId | Maximum, Minimum, and Average | Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file, the following formula is used: Total size of memory - Value of MemAvailable. If the MemAvailable parameter does not exist in the /proc/meminfo file, the following formula is used: Size of used memory - Size of memory that is used by buffers - Size of cached memory. Note: The calculation result is more accurate in CentOS 7.2, Ubuntu 16.04, or their later versions that use the latest Linux kernel. For more information about the MemAvailable parameter, see commit. |
| (Agent)memory.free.utilization | The percentage of available memory. | % | memory_freeutilization | userId and instanceId | Maximum, Minimum, and Average | Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file, the following formula is used: Value of MemAvailable/Total size of memory × 100%. If the MemAvailable parameter does not exist in the /proc/meminfo file, the following formula is used: (Total size of memory - Value of actualused)/Total size of memory × 100%. |
| (Agent)memory.used.utilization | The memory usage. | % | memory_usedutilization | userId and instanceId | Maximum, Minimum, and Average | Calculation method: If the MemAvailable parameter exists in the /proc/meminfo file, the following formula is used: (Total size of memory - Value of MemAvailable)/Total size of memory × 100%. If the MemAvailable parameter does not exist in the /proc/meminfo file, the following formula is used: (Total size of memory - Total size of available memory - Size of memory that is used by buffers - Size of cached memory)/Total size of memory × 100%. |

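The calculation methods for the memory metrics can be sketched in Python as follows. This is an illustration of the formulas listed above, not the agent's implementation. The function names and the sample values are hypothetical, and "available memory" in the used-space formula is taken to be MemFree, which is the data source that the table gives for (Agent)memory.free.space.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {name: value}, converting kB to bytes."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # Entries without a "kB" suffix are plain counts and are kept as-is.
            scale = 1024 if fields[1:] == ["kB"] else 1
            values[name.strip()] = int(fields[0]) * scale
    return values


def memory_metrics(info: Dict[str, int]) -> Dict[str, float]:
    """Apply the documented formulas, with the fallback when MemAvailable is missing."""
    total = info["MemTotal"]
    free = info["MemFree"]
    used = total - free
    if "MemAvailable" in info:
        actual_used = total - info["MemAvailable"]
    else:
        actual_used = used - info.get("Buffers", 0) - info.get("Cached", 0)
    return {
        "memory_totalspace": total,
        "memory_freespace": free,
        "memory_usedspace": used,
        "memory_actualusedspace": actual_used,
        "memory_freeutilization": 100.0 * (total - actual_used) / total,
        "memory_usedutilization": 100.0 * actual_used / total,
    }


# Hypothetical excerpt of /proc/meminfo.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    6000000 kB
Buffers:          200000 kB
Cached:          4000000 kB
"""

info = parse_meminfo(SAMPLE)
modern = memory_metrics(info)
# Simulate an older kernel that does not report MemAvailable.
legacy = memory_metrics({k: v for k, v in info.items() if k != "MemAvailable"})
```

With the sample values, the memory usage is 25% when MemAvailable is present and 35% when the fallback formula is used, which shows why the two code paths can report different results on the same host.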
Metrics of average system loads
Windows
These metrics are unavailable for Windows hosts.
Linux
You can check the output of the top command to view information about the metrics that are described in the following table. A higher value of a metric indicates more running processes.

| Metric | Description | Unit | MetricName | Dimensions | Statistics |
| --- | --- | --- | --- | --- | --- |
| (Agent)load.1m | The average system load within the previous minute. | None | load_1m | userId and instanceId | Maximum, Minimum, and Average |
| (Agent)load.5m | The average system load within the previous 5 minutes. | None | load_5m | userId and instanceId | Maximum, Minimum, and Average |
| (Agent)load.15m | The average system load within the previous 15 minutes. | None | load_15m | userId and instanceId | Maximum, Minimum, and Average |
| (Agent)load.1m.percore | The average system load per CPU core within the previous minute. | None | load_per_core_1m | userId and instanceId | Maximum, Minimum, and Average |
| (Agent)load.5m.percore | The average system load per CPU core within the previous 5 minutes. | None | load_per_core_5m | userId and instanceId | Maximum, Minimum, and Average |
| (Agent)load.15m.percore | The average system load per CPU core within the previous 15 minutes. | None | load_per_core_15m | userId and instanceId | Maximum, Minimum, and Average |

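The per-core load metrics can be approximated with the following Python sketch. It assumes that the per-core value is the load average divided by the number of logical CPU cores; the source does not spell out this formula, and the function names are hypothetical.

```python
import os
from typing import Dict, Tuple


def load_metrics(loads: Tuple[float, float, float], cores: int) -> Dict[str, float]:
    """Return the 1, 5, and 15 minute load averages and their per-core values."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    one, five, fifteen = loads
    return {
        "load_1m": one,
        "load_5m": five,
        "load_15m": fifteen,
        # Assumption: per-core load = load average / number of logical cores.
        "load_per_core_1m": one / cores,
        "load_per_core_5m": five / cores,
        "load_per_core_15m": fifteen / cores,
    }


def current_load_metrics() -> Dict[str, float]:
    """Read the live values on a Unix-like host (os.getloadavg is unavailable on Windows)."""
    return load_metrics(os.getloadavg(), os.cpu_count() or 1)


# Hypothetical host with four cores and a 1-minute load of 2.0.
example = load_metrics((2.0, 1.0, 0.5), cores=4)
```

A per-core value that stays above 1 suggests that more processes are runnable than the cores can serve, which makes the per-core metrics easier to compare across hosts of different sizes than the raw load averages.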
Disk metrics
Windows
The system calls the GetDiskFreeSpaceExA function in Kernel32.dll to obtain the used disk space, disk usage, free disk space, and total disk space. The system calls the RegConnectRegistryA function to connect to the HKEY_PERFORMANCE_DATA entry in the registry. Then, the system calls the RegQueryValueExA function to query the disk information in HKEY_PERFORMANCE_DATA, including the read count, write count, read bytes, written bytes, read time, write time, and disk active time.
Linux
You can check the output of the df command to view information about the metrics for disk and inode usage. You can check the output of the iostat command to view information about the metrics for disk reads and writes.

| Metric | Description | Unit | MetricName | Dimensions | Statistics |
| --- | --- | --- | --- | --- | --- |
| Host.diskusage.used | The disk space in use. | Byte | diskusage_used | userId, instanceId, and device | Maximum, Minimum, and Average |
| Host.diskusage.utilization | The disk usage. | % | diskusage_utilization | userId, instanceId, and device | Maximum, Minimum, and Average |
| Host.diskusage.free | The size of available disk space for regular users and superusers. | Byte | diskusage_free | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)disk.usage.avail_device | The size of available disk space for regular users. | Byte | diskusage_avail | userId, instanceId, and device | Maximum, Minimum, and Average |
| Host.diskusage.total | The size of the total disk space. | Byte | diskusage_total | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)disk.read.bps_device | The number of bytes that are read from the disk per second. | Byte/s | disk_readbytes | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)disk.write.bps_device | The number of bytes that are written to the disk per second. | Byte/s | disk_writebytes | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)disk.read.iops_device | The number of read requests that the disk receives per second. | Requests/s | disk_readiops | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)disk.write.iops_device | The number of write requests that the disk receives per second. | Requests/s | disk_writeiops | userId, instanceId, and device | Maximum, Minimum, and Average |

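The difference between diskusage_free and diskusage_avail can be illustrated with a short Python sketch based on the POSIX statvfs call, which is also what df relies on. The utilization formula follows the way df computes Use% (used divided by used plus available); whether the agent uses the same formula is an assumption, and the function names and sample values are hypothetical.

```python
import os
from typing import Dict


def disk_usage_metrics(block_size: int, blocks: int,
                       blocks_free: int, blocks_avail: int) -> Dict[str, float]:
    """Derive the disk usage metrics from statvfs-style block counts."""
    total = blocks * block_size
    free = blocks_free * block_size    # includes blocks reserved for the superuser
    avail = blocks_avail * block_size  # what a regular user can still write
    used = total - free
    # Assumption: same ratio that df reports as Use%.
    utilization = 100.0 * used / (used + avail) if used + avail else 0.0
    return {
        "diskusage_total": total,
        "diskusage_used": used,
        "diskusage_free": free,
        "diskusage_avail": avail,
        "diskusage_utilization": utilization,
    }


def disk_usage_for(path: str) -> Dict[str, float]:
    """Read the live values for the file system that contains path (Unix only)."""
    st = os.statvfs(path)
    return disk_usage_metrics(st.f_frsize, st.f_blocks, st.f_bfree, st.f_bavail)


# Hypothetical file system: 1000 blocks of 4096 bytes, 400 free, 350 available to users.
example = disk_usage_metrics(block_size=4096, blocks=1000, blocks_free=400, blocks_avail=350)
```

In the example, 50 blocks are reserved for the superuser, so diskusage_free is larger than diskusage_avail, and the utilization is computed against the space that regular users can actually use.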
File system metric
Windows
The metric is unavailable for Windows hosts.
Linux
You can check the output of the df command to view information about the metric described in the following table.

| Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks for Linux |
| --- | --- | --- | --- | --- | --- | --- |
| (Agent)fs.inode.utilization_device | The inode usage. | % | fs_inodeutilization | userId, instanceId, and device | Maximum, Minimum, and Average | Linux operating systems use inode numbers rather than file names to identify files. If inodes are used up, you cannot create files even if the disk space is sufficient. Therefore, the system must monitor the inode usage. The number of inodes indicates the number of files. A large number of small files can cause a high inode usage. |

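The inode usage can be estimated with the following Python sketch, which uses the inode counts that statvfs returns. The formula (used inodes divided by total inodes) is an assumption about how the percentage is derived, and the function names are hypothetical.

```python
import os


def inode_utilization(total_inodes: int, free_inodes: int) -> float:
    """Return the percentage of inodes in use."""
    if total_inodes == 0:
        # Some file systems allocate inodes dynamically and report no fixed total.
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_utilization_for(path: str) -> float:
    """Read the live value for the file system that contains path (Unix only)."""
    st = os.statvfs(path)
    return inode_utilization(st.f_files, st.f_ffree)


# Hypothetical file system with 1000 inodes, 250 of which are still free.
example = inode_utilization(total_inodes=1000, free_inodes=250)
```

Because the value is independent of the block counts, a file system that holds many small files can report a high inode usage while its disk usage remains low.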
Network metrics
Windows
The system calls the GetAdaptersAddresses function in Iphlpapi.dll to obtain the addresses of NICs on the host. The system calls the GetIfTable function to obtain the data of metrics for each interface, for example, the number of bits that an interface receives and sends per second, the number of packets that an interface receives and sends per second, and the number of error packets that an interface receives and sends.
Linux
You can check the output of the ss command to view information about the TCP connection metric.
Note: TCP connections represent the connections that are established between ECS instances and clients over TCP.
By default, the CloudMonitor agent collects the following data about TCP connections in different states: TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED. TCP_TOTAL indicates the total number of connections. ESTABLISHED indicates the number of established connections. NON_ESTABLISHED indicates the number of connections that are not in the ESTABLISHED state.
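The three default counters can be illustrated with a minimal Python sketch that reads the state column of /proc/net/tcp-style text. This is only an illustration: the source points to the ss command rather than to /proc/net/tcp, whether TCP_TOTAL includes listening sockets is an assumption, and the function names and sample rows are hypothetical. The state codes are the values that the Linux kernel uses.

```python
from collections import Counter
from typing import Dict

# Hexadecimal state codes in the "st" column of /proc/net/tcp.
# Code 07 is TCP_CLOSE in the kernel and is shown here as CLOSED.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV", "04": "FIN_WAIT1",
    "05": "FIN_WAIT2", "06": "TIME_WAIT", "07": "CLOSED", "08": "CLOSE_WAIT",
    "09": "LAST_ACK", "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp: str) -> Counter:
    """Count the sockets in each TCP state."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # the first line is the header
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def summarize(counts: Counter) -> Dict[str, int]:
    """Reduce the per-state counts to the three default counters."""
    total = sum(counts.values())
    established = counts.get("ESTABLISHED", 0)
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        "NON_ESTABLISHED": total - established,
    }


# Hypothetical, shortened rows in the layout of /proc/net/tcp.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 00000000:0016 00000000:0000 0A
   1: 0A000005:0016 0A000009:D2F4 01
   2: 0A000005:0016 0A00000A:C350 01
   3: 0A000005:8F2C 0A000014:01BB 06
"""

states = count_states(SAMPLE)
summary = summarize(states)
```

For the sample rows, two of the four sockets are in the ESTABLISHED state, so NON_ESTABLISHED is also 2.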
You can check the output of the iftop command to view information about the network metrics.

| Metric | Description | Unit | MetricName | Dimensions | Statistics |
| --- | --- | --- | --- | --- | --- |
| (Agent)network.in.rate_device | The number of bits that the NIC receives per second. This is the downstream bandwidth of the NIC. | bit/s | networkin_rate | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)network.out.rate_device | The number of bits that the NIC sends per second. This is the upstream bandwidth of the NIC. | bit/s | networkout_rate | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)network.in.packages_device | The number of packets that the NIC receives per second. | Count/s | networkin_packages | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)network.out.packages_device | The number of packets that the NIC sends per second. | Count/s | networkout_packages | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)network.in.errorpackages_device | The number of inbound error packets that the driver detects. | Count/s | networkin_errorpackages | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)network.out.errorpackages_device | The number of outbound error packets that the driver detects. | Count/s | networkout_errorpackages | userId, instanceId, and device | Maximum, Minimum, and Average |
| (Agent)network.tcp.connection_state | The number of TCP connections in each state. These connection states include LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED. | Count | net_tcpconnection | userId, instanceId, and state | Maximum, Minimum, and Average |

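The per-second network metrics are rates, so they can be derived from the difference between two readings of cumulative interface counters. The following Python sketch assumes the counter layout of /proc/net/dev on Linux and is not taken from the agent; the function names, the interface name, and the sample values are hypothetical.

```python
from typing import Dict


def parse_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/dev-style text into per-interface cumulative counters."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        f = [int(v) for v in rest.split()]
        # Eight receive columns come first, followed by eight transmit columns.
        stats[name.strip()] = {
            "rx_bytes": f[0], "rx_packets": f[1], "rx_errs": f[2],
            "tx_bytes": f[8], "tx_packets": f[9], "tx_errs": f[10],
        }
    return stats


def network_rates(prev: Dict[str, int], cur: Dict[str, int], seconds: float) -> Dict[str, float]:
    """Convert two counter readings taken `seconds` apart into per-second rates."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    d = {key: cur[key] - prev[key] for key in cur}
    return {
        "networkin_rate": d["rx_bytes"] * 8 / seconds,   # bytes to bits
        "networkout_rate": d["tx_bytes"] * 8 / seconds,
        "networkin_packages": d["rx_packets"] / seconds,
        "networkout_packages": d["tx_packets"] / seconds,
        "networkin_errorpackages": d["rx_errs"] / seconds,
        "networkout_errorpackages": d["tx_errs"] / seconds,
    }


# Hypothetical readings for one interface, taken 15 seconds apart.
BEFORE = "eth0: 1000000 1000 0 0 0 0 0 0 500000 800 0 0 0 0 0 0"
AFTER = "eth0: 2500000 2500 3 0 0 0 0 0 1250000 1400 0 0 0 0 0 0"
rates = network_rates(parse_net_dev(BEFORE)["eth0"], parse_net_dev(AFTER)["eth0"], 15.0)
```

The counters are in bytes while the rate metrics are in bit/s, so the byte difference is multiplied by 8 before it is divided by the interval.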
Metrics related to the top five processes
Windows
Query
The system calls the OpenProcess function in Kernel32.dll to obtain the handle of a process. The system calls the GetProcessTimes function twice to obtain the CPU time consumed by the process and then calculates the CPU utilization of the process in the interval between the two calls. The system calls the RegConnectRegistryA function to connect to the HKEY_PERFORMANCE_DATA entry in the registry. Then, the system calls the RegQueryValueExA function to query the process information in HKEY_PERFORMANCE_DATA, including the process ID, parent process ID, priority, virtual memory, resident memory, shared memory, number of files that the process opens, thread count, page faults, read bytes, and written bytes.
Count the number of processes that match the specified keyword
The system calls the OpenProcess function to obtain the handle of a process. The system calls the NtQueryInformationProcess function in ntdll.dll to obtain RTL_USER_PROCESS_PARAMETERS of the process. The system calls the ReadProcessMemory function to obtain the arguments and root path of the process from the command line information. This way, the system can obtain the directory of the process.
The system calls the OpenProcessToken function to obtain the handle of a token. The system calls the GetTokenInformation function to obtain the token information. The system calls the LookupAccountSid function to obtain the username and user group of the process.
The system matches the directory, username, and user group of the process with the keyword. If the process information matches the keyword, the system increases the value of Host.process.number by 1.
Linux
You can check the output of the top command to view information about the CPU utilization and memory usage of processes. The CPU utilization indicates the consumption of multi-core CPUs.
You can check the output of the lsof command to view information about the Host.process.openfile metric.
You can check the output of the ps aux | grep '<Keyword>' command to view information about the Host.process.number metric.

| Metric | Description | Unit | MetricName | Dimensions | Statistics | Remarks |
| --- | --- | --- | --- | --- | --- | --- |
| (Agent)process.cpu_pid | The CPU utilization of a process. | % | process.cpu | userId, instanceId, name, and pid | Average | You cannot configure alert rules for this metric. |
| (Agent)process.memory_pid | The memory usage of a process. | % | process.memory | userId, instanceId, name, and pid | Average | You cannot configure alert rules for this metric. |
| (Agent)process.openfile_pid | The number of files that are opened by a process. | Count | process.openfile | userId, instanceId, name, and pid | Average | You cannot configure alert rules for this metric. |
| (Agent)process.count_processname | The number of processes that match the specified keyword. | Count | process.number | userId, instanceId, and processName | Average | You cannot configure alert rules for this metric. |

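The keyword-based process count and the top five selection can be sketched in Python as follows. The sketch matches the keyword against the command line only, which is narrower than the matching described for Windows (directory, username, and user group) and than the full ps aux output on Linux. The function names and sample data are hypothetical, and reading /proc/<pid>/cmdline is one possible data source on Linux, not necessarily the one that the agent uses.

```python
import os
from typing import Dict, Iterable, Iterator, List, Tuple


def read_command_lines(proc_root: str = "/proc") -> Iterator[str]:
    """Yield the command line of each running process on a Linux host."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # the process exited or access was denied
        if raw:
            # Arguments are separated by NUL bytes in the cmdline file.
            yield raw.replace(b"\0", b" ").decode(errors="replace").strip()


def count_matching(command_lines: Iterable[str], keyword: str) -> int:
    """Count the processes whose command line contains the keyword."""
    return sum(1 for cmd in command_lines if keyword in cmd)


def top_processes(cpu_by_pid: Dict[int, float], limit: int = 5) -> List[Tuple[int, float]]:
    """Return the `limit` processes with the highest CPU utilization."""
    return sorted(cpu_by_pid.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical command lines and per-process CPU utilization values.
commands = ["/usr/sbin/nginx -g daemon off;", "nginx: worker process", "/usr/bin/python3 app.py"]
matches = count_matching(commands, "nginx")
top_five = top_processes({1: 0.5, 2: 12.0, 3: 3.0, 4: 7.5, 5: 0.1, 6: 9.0})
```

On a Linux host, count_matching(read_command_lines(), "<Keyword>") gives a result comparable to counting the lines returned by ps aux | grep '<Keyword>', except that the grep process itself is not counted.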
View the monitoring data of the operating system
Log on to the CloudMonitor console.
In the left-side navigation pane, go to the Host Monitoring page.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
On the OS Monitoring tab, you can view the monitoring data of the operating system. You can also configure alert rules for the metrics and view the alerting status. For more information, see Step 2: Create an alert rule for the host and Step 3: View host alerts.