This topic describes the alert metrics provided by the Application Monitoring sub-service of Application Real-Time Monitoring Service (ARMS). All metric data is collected once per minute.
JVMs
The following JVM metrics are for reference only. JVM-related descriptions are subject to JVM official documentation.
Metrics
Metric name | Unit | Commonly used | Description |
Number of JVM full GCs (instantaneous value) | None | Yes | The number of full garbage collections (GCs) performed by the JVM in the last N minutes. If full GCs frequently occur in your application, errors may occur. |
JVM full GC duration (instantaneous value) | Milliseconds | No | The time consumed for full GCs in the last N minutes. The instantaneous value of full GC duration indicates the garbage collection performance of the current JVM. Generally, the shorter the full GC duration, the better the JVM performance. If full GCs take too long, applications may significantly stutter. This affects the user experience. |
Number of JVM young GCs (instantaneous value) | None | Yes | The number of young GCs performed by the JVM in the last N minutes. The instantaneous value of the number of young GCs indicates the speed of JVM object creation and destruction and the usage of the young generation. Generally, the more young GCs, the more objects are created by the application. A persistently high count may also indicate memory leaks or inefficient memory usage. |
JVM young GC duration (instantaneous value) | Milliseconds | No | The time consumed for young GCs in the last N minutes. The instantaneous value of young GC duration indicates the garbage collection performance of the current JVM. Generally, the longer the young GC duration, the lower the garbage collection efficiency. In this case, the application may stutter. |
Total JVM heap memory | MB | No | The total size of the JVM heap memory, including the memory of young and old generations. The size of JVM heap memory must be properly configured based on the load and performance requirements of the application. Excessively small JVM heap memory leads to frequent garbage collections and affects application performance. Excessively large JVM heap memory occupies a large amount of system resources and affects system stability. |
Used JVM heap memory | MB | Yes | The size of JVM heap memory used by Java programs. The size of used JVM heap memory must be strictly controlled to prevent system performance degradation, or memory overflow caused by memory leaks or excessive memory usage. |
Committed JVM non-heap memory | MB | No | The size of non-heap memory that the JVM has committed for Java programs. The size of committed JVM non-heap memory must be strictly controlled to prevent excessive memory usage caused by excessive class loading or a large number of static variables and constants. |
Initial JVM non-heap memory | MB | No | Generally, the initial size of JVM non-heap memory is dynamically calculated based on factors such as JVM version, operating system, and JVM parameters. |
Maximum JVM non-heap memory | MB | No | If you are using a Java version earlier than 8, this metric is controlled by the JVM parameter MaxPermSize. Otherwise, this metric is controlled by MaxMetaspaceSize. |
Used JVM non-heap memory | MB | Yes | The size of used JVM non-heap memory, including Metaspace and PermGen. |
Number of JVM blocked threads | None | No | The number of blocked threads waiting for monitor locks. Excessive blocked threads may cause system performance degradation. |
Total number of JVM threads | None | Yes | The number of threads in all states. An excessive number of threads may result in insufficient memory and CPU resources. This affects the application performance and stability. |
Number of JVM deadlocked threads | None | No | The number of deadlocked threads. A deadlock occurs when two or more threads each wait for a resource held by the other, so none of them can proceed. Generally, the more deadlocked threads, the more severe the situation, and the application may even crash. |
Number of new JVM threads | None | No | The number of threads created by the JVM. A large number of threads can be created in a JVM, but excessive threads may result in a waste of system resources and pressure on thread scheduling. |
Number of JVM runnable threads | None | No | The number of threads in the RUNNABLE state at JVM runtime. If excessive threads are created, a large amount of memory resources is consumed, and the system may run slowly or crash. |
Number of JVM terminated threads | None | No | The number of threads in the TERMINATED state, that is, threads that have finished execution at JVM runtime. Control the number of threads based on the actual situation to prevent thread resource waste or thread starvation. |
Number of JVM timed-out waiting threads | None | Yes | The number of threads in the TIMED_WAITING state at JVM runtime, that is, threads waiting for a resource for a specified period of time. If this number is excessively large, bottlenecks may exist in the system. In this case, optimize resources to improve the processing capability and response speed of the system. |
Number of JVM waiting threads | None | No | The number of waiting threads in the current JVM. For highly concurrent applications, an increase in the number of JVM waiting threads may result in performance degradation. |
Number of JVM GCs (cumulative value) | None | No | The cumulative number of GCs performed in the JVM. |
JVM mark-and-sweep garbage collection cycles (cumulative value) | None | No | The cumulative number of mark-and-sweep garbage collection cycles in the JVM. |
JVM heap memory usage (%) | None | No | The ratio of the allocated heap memory to the total heap memory at JVM runtime. This metric measures the efficiency and performance of JVM memory management. Generally, keep the JVM heap memory usage below 70% to prevent problems such as memory overflow. |
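Most of the preceding JVM metrics correspond to values exposed by the standard java.lang.management MXBeans. The following minimal Java sketch is for illustration only and does not reflect how the ARMS agent actually collects data; it only shows where comparable raw values come from.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JvmMetricsSample {
    public static void main(String[] args) {
        // Heap and non-heap memory: used, committed, and maximum values.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println("Heap used (bytes): " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("Heap max (bytes): " + memory.getHeapMemoryUsage().getMax());
        System.out.println("Non-heap used (bytes): " + memory.getNonHeapMemoryUsage().getUsed());
        System.out.println("Non-heap committed (bytes): " + memory.getNonHeapMemoryUsage().getCommitted());

        // GC counts and durations are cumulative; instantaneous values can be
        // derived as the difference between two samples taken one minute apart.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }

        // Thread counts: total threads and deadlocked threads, if any.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("Total threads: " + threads.getThreadCount());
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println("Deadlocked threads: " + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```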
Dimension and filter condition
The preceding metrics are monitored by node IP address. You can use one of the following methods to filter IP addresses:
Traversal: traverses the IP address of each node and configures alerting for the metric data of each node.
Equals (=): specifies specific nodes for alerting. Example: =172.20.XX.XX.
No dimension: aggregates and configures alerting for the metric data of all nodes.
Scheduled tasks
Metrics
Metric name | Unit | Commonly used | Description |
Duration | Milliseconds | No | The average duration of the scheduled task. |
Total number of executions | None | No | The number of times that the scheduled task was executed. |
Number of execution errors | None | No | The number of times that the scheduled task was not executed as expected within the specified time interval. |
Scheduling latency | Milliseconds | No | The time spent on scheduling before the scheduled task was started. |
Dimension and filter condition
The preceding metrics are monitored by scheduled task. You can use one of the following methods to filter scheduled tasks:
Traversal: traverses the scheduled tasks and configures alerting for the metric data of each scheduled task.
Equals (=): specifies specific scheduled tasks for alerting. Example: =LoadGenerator.mockUserApiLoad.
No dimension: aggregates and configures alerting for the metric data of all scheduled tasks.
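The scheduled task metrics describe framework-managed jobs such as Spring scheduled methods. The following sketch is illustrative only: it assumes the Spring Framework @Scheduled annotation, and the LoadGenerator.mockUserApiLoad names mirror the filter example above rather than any real task.

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Enables Spring's scheduled-task processing; usually declared once per application.
@Configuration
@EnableScheduling
class SchedulingConfig {
}

// The filter example "=LoadGenerator.mockUserApiLoad" refers to a task by
// "class name.method name"; this hypothetical task mirrors that naming.
@Component
public class LoadGenerator {

    // Runs once a minute. "Duration" covers the method body, "Scheduling latency"
    // is the delay before the task actually starts, and an uncaught exception here
    // would typically count toward "Number of execution errors".
    @Scheduled(fixedRate = 60_000)
    public void mockUserApiLoad() {
        try {
            Thread.sleep(200); // simulated work; replace with real task logic
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```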
Exceptions
Metrics
Metric name | Unit | Commonly used | Description |
Number of exceptions | None | Yes | The number of exceptions that occurred at software runtime, such as null pointer exceptions, array out-of-bounds exceptions, and I/O exceptions. You can use this metric to check whether a call stack throws errors and whether application call errors occur. |
Response time of abnormal interface calls | Milliseconds | Yes | The response time of abnormal interface calls of the application. You can use this metric to estimate the impact of errors thrown by the call stack on the response time of the interface call and to check whether errors occur. |
Dimensions and filter condition
The preceding metrics are monitored by interface name. You can use one of the following methods to filter interface names:
Traversal: traverses the accessed interfaces and configures alerting for the metric data of each interface.
Equals (=): specifies specific interfaces for alerting. Example: =/tb/api/users/{userId}.
Not Equals (!=): excludes specific interfaces from alerting, and separately configures alerting for other interfaces. Example: !=/tb/api/users/{userId}.
Contains: configures alerting for interfaces that contain a specific keyword. Example: Contains api.
Does Not Contain: configures alerting for interfaces that do not contain a specific keyword. Example: Does Not Contain api.
Regular expression: configures alerting for interfaces that match the specified regular expression. Example: =/(api)/i.
No dimension: aggregates and configures alerting for the metric data of all interfaces.
The preceding metrics are monitored by exception. You can use one of the following methods to filter exceptions:
Traversal: traverses the exceptions and configures alerting for the metric data of each exception.
Equals (=): specifies specific exceptions for alerting. Example: =FeignException$InternalServerError.
Not Equals (!=): excludes specific exceptions from alerting, and separately configures alerting for other exceptions. Example: !=FeignException$InternalServerError.
Contains: configures alerting for exceptions that contain a specific keyword. Example: Contains data.
Does Not Contain: configures alerting for exceptions that do not contain a specific keyword. Example: Does Not Contain data.
Regular expression: configures alerting for exceptions that match the specified regular expression. Example: =/(data)/i.
No dimension: aggregates and configures alerting for the metric data of all exceptions.
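The Contains and Regular expression filters above match keywords or patterns against interface and exception names. The example =/(api)/i appears to use the common /pattern/flags notation, where the i flag makes matching case-insensitive. The following Java snippet only illustrates the matching semantics and is not part of ARMS.

```java
import java.util.regex.Pattern;

public class FilterDemo {
    public static void main(String[] args) {
        String interfaceName = "/tb/api/users/{userId}";

        // "Contains api": simple substring match.
        boolean containsApi = interfaceName.contains("api");

        // "=/(api)/i": case-insensitive regular expression match anywhere in the name.
        Pattern pattern = Pattern.compile("(api)", Pattern.CASE_INSENSITIVE);
        boolean regexMatch = pattern.matcher(interfaceName).find();

        System.out.println("contains: " + containsApi + ", regex: " + regexMatch);
    }
}
```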
Application dependency services
Metrics
Metric name | Unit | Commonly used | Description |
Number of application dependency service calls | None | No | The number of calls to the downstream interfaces on which the application depends. You can use this metric to check whether the number of application dependency service calls increases. |
Application dependency service call error rate (%) | None | No | The value of this metric is calculated by using the following formula: Application dependency service call error rate = Number of abnormal downstream interface requests/Total number of downstream interface requests × 100%. You can use this metric to check whether errors of the application dependency services increase and affect the application. |
Response time of application dependency service calls | Milliseconds | Yes | The average response time of the downstream interfaces on which the application depends. You can use this metric to check whether the time consumed by the application dependency services increases and affects the current application. |
Dimension and filter condition
The preceding metrics are monitored by interface call type. You can use one of the following methods to filter interface call types:
Traversal: traverses the interface call types and separately configures alerting for the metric data of each type, such as HTTP, MySQL, and Redis.
Equals (=): specifies specific interface call types for alerting. Example: =http.
No dimension: aggregates and configures alerting for the metric data of all interface call types.
ECS instances
Metrics
Metric name | Unit | Commonly used | Description |
Node CPU utilization (%) | None | No | The CPU utilization of the node. Each node is a server. Excessive CPU utilization may cause problems such as slow system response and service unavailability. |
Node CPU utilization in user mode (%) | None | No | The ratio between the node CPU time occupied by processes running in user mode and the total CPU time. Processes in user mode are applications in user space, such as web services and databases. |
Idle node disk space | MB | Yes | The unused disk space of the node. You can use this metric to check whether the disk space is full. If the disk space is full, the system may crash or fail to work as expected. |
Node disk utilization (%) | None | No | The ratio of the used disk space to the total disk space. The higher the disk utilization, the less free disk space remains on the node. |
Node system load | None | Yes | You can use this metric to check whether the workload of the node is excessively high. As a rule of thumb, the load of a node that has N cores should not exceed N. |
Idle node memory | MB | Yes | The size of the unused memory in the node. You can use this metric to check whether the memory of the node is sufficient. If the memory of the node is insufficient, exceptions such as out-of-memory (OOM) errors may occur. |
Node memory usage (%) | None | No | The percentage of memory in use. If the memory usage of the node exceeds 80%, you need to reduce memory pressure by adjusting the configurations of the node or optimizing the memory usage of tasks. |
Number of error packets received on the node | None | No | The number of error packets that the node received when it processed network communication. These error packets may be caused by network transmission issues or application issues. If error packets are received, the node may fail to process the network communication, and the system may be affected. |
Number of error packets sent from the node | None | No | The number of error packets that the node sent when it processed network communication. These error packets may be caused by network transmission issues or application issues. You can use this metric to check whether the node network is normal. |
Number of JVM instances | None | Yes | The number of JVM instances running in real time. Generally, this metric is used to configure service downtime alerting. |
Number of bytes sent from the node | None | No | The amount of data sent by the node over a network, including data, system messages, and error messages sent by the application. |
Number of packets sent from the node | None | No | The number of packets sent from the node over a network. |
Number of bytes received on the node | None | No | The total amount of data received by the node over a network. |
Number of packets received on the node | None | No | The number of packets received by the node over a network. |
Dimension and filter condition
The preceding metrics are monitored by node IP address. You can use one of the following methods to filter IP addresses:
Traversal: traverses the IP address of each node and configures alerting for the metric data of each node.
Equals (=): specifies specific nodes for alerting. Example: =172.20.XX.XX.
No dimension: aggregates and configures alerting for the metric data of all nodes.
Containers
The ARMS agent v4.1.0 and later collects container monitoring data, such as CPU and memory metrics, that can be used for alerting.
Metrics
Metric name | Unit | Commonly used | Description |
CPU utilization in user mode | None | No | The time spent by the process executing code in user mode. This is the CPU time directly used by the application to perform tasks, including the application code and all library functions that do not run in kernel mode. |
CPU utilization in kernel mode | None | No | The time spent by the process executing in kernel mode (also known as system mode). A process enters kernel mode when it performs system calls, handles interrupts, or uses features provided by the kernel. This part of the time reflects the CPU resources consumed by the operating system in servicing the process. |
Total CPU utilization | None | Yes | The total CPU utilization is CPU utilization in user mode plus CPU utilization in kernel mode. |
Memory usage | Bytes | Yes | The amount of memory being used by the container at runtime. It reflects the total amount of memory actively used by the container, including parts that are marked as non-swappable by the operating system and data that has been cached but is still considered active. |
Number of sent network packets | None | No | The number of packets sent from the container over a network. |
Number of sent bytes | Bytes | Yes | The number of bytes sent from the container over a network. |
Number of sent error packets | None | No | The number of error packets that the container sent when it processed network communication. These error packets may be caused by network transmission issues or application issues. You can use this metric to check whether the container network is normal. |
Number of sent discarded packets | None | No | The total number of outbound network packets that have been dropped by the system or network stack since the container network interface was brought up. |
Number of received packets | None | No | The number of packets received by the container over a network. |
Number of received bytes | Bytes | Yes | The total amount of data received by the container over a network. |
Number of received error packets | None | No | The number of error packets that the container received when it processed network communication. These error packets may be caused by network transmission issues or application issues. If error packets are received, the container may fail to process the network communication, and the system may be affected. |
Number of received discarded packets | None | No | The total number of inbound network packets that have been dropped by the system or network stack since the container network interface was brought up. |
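The user-mode, kernel-mode, and total CPU metrics parallel the per-container CPU accounting that Linux exposes through cgroups. The following sketch is a rough illustration only, assuming cgroup v2 (paths and file names differ under cgroup v1), and is not how the ARMS agent necessarily reads these values.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ContainerCpuStat {
    public static void main(String[] args) throws IOException {
        // cgroup v2 exposes cumulative CPU time in microseconds:
        // usage_usec (total), user_usec (user mode), system_usec (kernel mode).
        List<String> lines = Files.readAllLines(Path.of("/sys/fs/cgroup/cpu.stat"));
        long user = 0;
        long system = 0;
        for (String line : lines) {
            String[] parts = line.split(" ");
            if (parts[0].equals("user_usec")) {
                user = Long.parseLong(parts[1]);
            } else if (parts[0].equals("system_usec")) {
                system = Long.parseLong(parts[1]);
            }
        }
        // Total CPU time is the sum of user-mode and kernel-mode time; utilization
        // percentages are derived from the change in these counters over an interval.
        System.out.println("user_usec=" + user + " system_usec=" + system
                + " total_usec=" + (user + system));
    }
}
```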
Dimension and filter condition
The preceding metrics are monitored by container IP address. You can use one of the following methods to filter containers:
Traversal: traverses the containers and configures alerting for the metric data of each container.
Equals (=): specifies specific containers for alerting. Example: =172.20.XX.XX.
No dimension: aggregates and configures alerting for the metric data of all containers.
Application provided services
Metrics
Metric name | Unit | Commonly used | Description |
Number of calls | None | Yes | The number of application entry point calls, including HTTP and Dubbo calls. You can use this metric to analyze the number of calls of the application, estimate the business volume, and check whether exceptions occur in the application. |
Number of error calls | None | Yes | The number of error entry point calls of the application, including HTTP and Dubbo calls. If status code 400 is returned or the application entry point call is intercepted at the top layer of Dubbo, the call is considered an error. You can use this metric to check whether the application has error calls. |
Call error rate (%) | None | Yes | The error rate of application entry point calls is calculated by using the following formula: Error rate = Number of error application entry point calls/Total number of application entry point calls × 100%. |
Call response time | Milliseconds | Yes | The response time of an application entry point call, such as an HTTP call or a Dubbo call. You can use this metric to check for slow requests and exceptions. |
Dimensions and filter condition
The preceding metrics are monitored by interface name. You can use one of the following methods to filter interface names:
Traversal: traverses the accessed interfaces and configures alerting for the metric data of each interface.
Equals (=): specifies specific interfaces for alerting. Example: =/tb/api/users/{userId}.
Not Equals (!=): excludes specific interfaces from alerting, and separately configures alerting for other interfaces. Example: !=/tb/api/users/{userId}.
Contains: configures alerting for interfaces that contain a specific keyword. Example: Contains api.
Does Not Contain: configures alerting for interfaces that do not contain a specific keyword. Example: Does Not Contain api.
Regular expression: configures alerting for interfaces that match the specified regular expression. Example: =/(api)/i.
No dimension: aggregates and configures alerting for the metric data of all interfaces.
The preceding metrics are monitored by interface call type. You can use one of the following methods to filter interface call types:
Traversal: traverses the interface call types and separately configures alerting for the metric data of each type, such as HTTP, MySQL, and Redis.
Equals (=): specifies specific interface call types for alerting. Example: =http.
No dimension: aggregates and configures alerting for the metric data of all interface call types.
Thread pools
Metrics
Metric name | Commonly used | Description |
Number of core threads | Yes | The number of always active threads in the thread pool. |
Maximum number of threads | Yes | The maximum number of threads that can exist simultaneously in the thread pool. |
Number of active threads | Yes | The number of threads executing tasks. You can use this metric to monitor the status of the thread pool and evaluate the performance of the thread pool. |
Queue size | Yes | The size of the task queue of the thread pool. The appropriate queue size depends on the application requirements and system resource availability. In multithreaded programming, if the queue size is excessively small, tasks may queue for a long time, which reduces the performance of the programs. If the queue size is excessively large, a large amount of system resources may be consumed, which causes system crashes or performance degradation. |
Current number of threads | Yes | The number of threads running or waiting to run. |
Number of executed tasks | Yes | The number of executed and completed tasks in a task queue or the thread pool. You can use this metric to evaluate the performance of the task queue or thread pool. |
Thread pool usage (%) | Yes | The ratio between the number of threads in use in the thread pool and the total number of threads in the thread pool. |
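For Java applications, the preceding thread pool metrics correspond closely to the accessors of java.util.concurrent.ThreadPoolExecutor. The following sketch is illustrative only and shows where comparable values can be read from a pool; it is not the ARMS collection mechanism.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolMetricsSample {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));

        for (int i = 0; i < 10; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(100); // simulated work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        System.out.println("Core threads:    " + pool.getCorePoolSize());
        System.out.println("Maximum threads: " + pool.getMaximumPoolSize());
        System.out.println("Active threads:  " + pool.getActiveCount());
        System.out.println("Queue size:      " + pool.getQueue().size());
        System.out.println("Current threads: " + pool.getPoolSize());
        System.out.println("Completed tasks: " + pool.getCompletedTaskCount());
        // One reading of "Thread pool usage (%)": active threads relative to the
        // threads currently in the pool.
        System.out.println("Usage (%):       "
                + 100.0 * pool.getActiveCount() / Math.max(1, pool.getPoolSize()));

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```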
Dimensions and filter condition
The preceding metrics are monitored by node IP address. You can use one of the following methods to filter IP addresses:
Traversal: traverses the IP address of each node and configures alerting for the metric data of each node.
Equals (=): specifies specific nodes for alerting. Example: =172.20.XX.XX.
No dimension: aggregates and configures alerting for the metric data of all nodes.
The preceding metrics are monitored by thread pool name. You can use one of the following methods to filter thread pool names:
Traversal: traverses the thread pools and configures alerting for the metric data of each thread pool.
Equals (=): specifies specific thread pools for alerting. Example: =pool-*-thread-*.
No dimension: aggregates and configures alerting for the metric data of all thread pools.
The preceding metrics are monitored by thread pool type. You can use one of the following methods to filter thread pool types:
Traversal: traverses the thread pool types and configures alerting for the metric data of each thread pool type.
Equals (=): specifies specific thread pool types for alerting. Example: =FixedThreadPool.
No dimension: aggregates and configures alerting for the metric data of all thread pool types.
HTTP status codes
Metrics
Metric name | Commonly used | Description |
Number of HTTP requests with 4xx status codes | Yes | The number of HTTP requests for which 4xx status codes were returned. 4xx status codes indicate client errors, for example, the requested resource does not exist or required parameters are missing. Common 4xx status codes include 400 and 404. |
Number of HTTP requests with 5xx status codes | Yes | The number of HTTP requests for which 5xx status codes were returned. 5xx status codes indicate server errors, for example, an internal server error occurred or the system is busy. Common 5xx status codes include 500 and 503. |
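Conceptually, these two metrics count responses by status code class. The following sketch uses a hypothetical counter class, not any ARMS internals, to illustrate how 4xx and 5xx responses are classified.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StatusCodeCounters {
    private final AtomicLong http4xx = new AtomicLong();
    private final AtomicLong http5xx = new AtomicLong();

    // Classify a response status code into the 4xx or 5xx bucket.
    public void record(int statusCode) {
        if (statusCode >= 400 && statusCode < 500) {
            http4xx.incrementAndGet();   // client errors, for example 400 or 404
        } else if (statusCode >= 500 && statusCode < 600) {
            http5xx.incrementAndGet();   // server errors, for example 500 or 503
        }
    }

    public long count4xx() { return http4xx.get(); }
    public long count5xx() { return http5xx.get(); }
}
```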
Dimension and filter condition
The preceding metrics are monitored by interface name. You can use one of the following methods to filter interface names:
Traversal: traverses the accessed interfaces and configures alerting for the metric data of each interface.
Equals (=): specifies specific interfaces for alerting. Example: =/tb/api/users/{userId}.
Not Equals (!=): excludes specific interfaces from alerting, and separately configures alerting for other interfaces. Example: !=/tb/api/users/{userId}.
Contains: configures alerting for interfaces that contain a specific keyword. Example: Contains api.
Does Not Contain: configures alerting for interfaces that do not contain a specific keyword. Example: Does Not Contain api.
Regular expression: configures alerting for interfaces that match the specified regular expression. Example: =/(api)/i.
No dimension: aggregates and configures alerting for the metric data of all interfaces.
Databases
Metrics
Metric name | Unit | Commonly used | Description |
Number of database requests | None | Yes | The number of requests that the application sent to a database at runtime. Each request contains a read or write operation. The number of database requests affects the performance and response time of the application. |
Number of database request errors | None | Yes | The number of errors that occurred when the application requested the database at runtime, such as database connection failures, query statement errors, and insufficient permissions. A large number of database request errors indicates that the interaction between the application and the database is abnormal. In this case, the application cannot run as expected. |
Database request response time | Milliseconds | Yes | The time interval between the time that the application sent a request to a database and the time that the database returned a response. The response time of database requests affects the application performance and user experience. If the response time is excessively long, the application may stutter or slow down. |
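The database request metrics describe calls such as JDBC queries. The following sketch measures a single request in the way the response time metric is defined, from sending the request to receiving the response. The JDBC URL, credentials, and query are hypothetical placeholders, and this is not how ARMS instruments database calls.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbRequestTiming {
    public static void main(String[] args) {
        // Hypothetical connection details; replace with real values.
        String url = "jdbc:mysql://mysql-pod:3306/demo_db";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement("SELECT 1")) {

            long start = System.nanoTime();            // request sent
            try (ResultSet rs = stmt.executeQuery()) { // one database request
                rs.next();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Database request response time: " + elapsedMs + " ms");
        } catch (SQLException e) {
            // A failure here would count toward "Number of database request errors".
            e.printStackTrace();
        }
    }
}
```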
Dimension and filter condition
The preceding metrics are monitored by database name. You can use one of the following methods to filter database names:
Traversal: traverses the databases and configures alerting for the metric data of each database.
Equals (=): specifies specific databases for alerting. Example: =mysql-pod:3306(demo_db).
No dimension: aggregates and configures alerting for the metric data of all databases.