Network Intelligence Service (NIS) provides the event center to help you monitor resources based on events. You can view the resources that are exposed to potential risks and configure alert rules for specific events. This way, you can handle these events at your earliest opportunity to prevent business interruptions.
Scenarios
Alibaba Cloud defines NIS events to record information about cloud network resources, such as the status of O&M tasks, resource exceptions, and resource status changes, and to notify users of this information.
Notification for risks and exceptions
When events related to resource availability or performance issues occur, Alibaba Cloud pushes the events to the event center in the NIS console. Such events include instance performance degradation caused by excessive resource usage, business unavailability caused by packet loss in Internet connections, and instance subscription expiration. We recommend that you handle these events at the earliest opportunity to prevent business interruptions.
Automated O&M
Alibaba Cloud defines the status of the events that are displayed in the NIS console. This helps you understand the status of system O&M tasks. New events and status changes of events are reported to CloudMonitor, which allows you to build an event-driven automated O&M system to meet your business requirements.
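The event-driven approach described above can be sketched as a small dispatcher that maps event codes to remediation actions. This is only an illustration: the event payload shape (a dictionary with `code`, `level`, and `resource_id` keys) and the handler functions are assumptions, because the format in which CloudMonitor delivers events is not described in this topic. The event codes and the suggested actions are taken from the Events section.

```python
from typing import Callable, Dict

# Hypothetical payload: {"code": ..., "level": ..., "resource_id": ...}.
# The real CloudMonitor event format may differ.
Handler = Callable[[dict], str]


def increase_peak_bandwidth(event: dict) -> str:
    # Placeholder action: a real handler would call your own tooling.
    return f"increase peak bandwidth for {event['resource_id']}"


def switch_nat_gateway(event: dict) -> str:
    # Placeholder action for a NAT gateway whose data path is unavailable.
    return f"switch traffic away from {event['resource_id']}"


# Map event codes (from the Events section) to hypothetical handlers.
HANDLERS: Dict[str, Handler] = {
    "problem-clb-bandwidthOverLimit": increase_peak_bandwidth,
    "problem-nat-datapathUnavailable": switch_nat_gateway,
}


def dispatch(event: dict) -> str:
    """Return the action chosen for an event, or a fallback notice."""
    handler = HANDLERS.get(event["code"])
    if handler is None:
        return f"no automated action for {event['code']}; notify on-call"
    return handler(event)
```

Keeping the mapping in a dictionary means that a new event code can be supported by adding one entry, and unknown codes fall back to manual handling instead of raising an error.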
Limits
No event is collected for retired instance families. For more information, see the end-of-sale notice for each Alibaba Cloud service.
Basic information
Event types
Alibaba Cloud defines events to record the information about cloud network resources and notify users. Events are categorized into the types described in the following table based on causes.
Type | Description | Example |
Issue event | Exceptions that have impaired business and have been in the In Progress state for 7 days. | |
Risk event | Exceptions that may impair business and have been in the In Progress state for 7 days. | |
Event levels
Alibaba Cloud defines the following levels for events based on their impacts on the operation of instances:
Critical: Events at this level may result in instance unavailability and must be handled at your earliest opportunity.
Warn: Events at this level have affected your business. You must pay close attention to these events or handle them at an appropriate point in time.
Info: You can decide whether to pay attention to the events at this level.
For more information about the codes, names, descriptions, and handling suggestions on events, see the Events section in this topic.
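If you process events in your own tooling, the three levels above can be treated as an ordered scale so that the most severe events are handled first. The following is a minimal sketch; the `Level` enum, the `triage` function, and the dictionary-based event shape are assumptions made for illustration and are not part of NIS.

```python
from enum import IntEnum
from typing import Dict, List


class Level(IntEnum):
    # Ordered from least to most severe, following the list above.
    INFO = 0
    WARN = 1
    CRITICAL = 2


def parse_level(name: str) -> Level:
    """Convert a level name such as 'Warn' to a Level member."""
    return Level[name.upper()]


def triage(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return events sorted from most to least severe."""
    return sorted(events, key=lambda e: parse_level(e["level"]), reverse=True)
```

For example, `triage` places a Critical event ahead of Warn and Info events regardless of the order in which the events arrived.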
Events
This section summarizes the events supported by NIS and provides suggestions on how to handle these events.
Issue events do not apply to shared-resource Classic Load Balancer (CLB) instances.
Issue events
Event code | Event name | Event level | Event name in CloudMonitor | Event description and impact | Alert rules | Suggestion for users |
Internet-facing instance | ||||||
problem-internetBandwidthOverlimit | Packet loss due to excess bandwidth usage | Critical | problem-internetBandwidthOverLimit | Packets are lost because the bandwidth of an Internet-facing instance exceeds the peak bandwidth. An instance that generates Internet data transfer is called an Internet-facing instance, such as an elastic IP address (EIP), bandwidth plan, or CLB instance. | Critical: The bandwidth usage has frequently exceeded the peak bandwidth within the last 10 minutes and packets are lost. | Increase the peak bandwidth. |
Internet NAT gateway | ||||||
problem-nat-sessionOverLimit | Connection drop caused by excess NAT sessions | Critical | problem-nat-sessionOverLimit | The number of sessions on an Internet NAT gateway exceeds the upper limit. As a result, new sessions fail and more than 100 packets are lost per second. | Critical: The number of concurrent sessions has exceeded the upper limit in the last 10 minutes and more than 100 packets are lost per second. | Upgrade the Internet NAT gateway or create multiple Internet NAT gateways. For more information, see Manage NAT Gateway quotas and Create and manage an Internet NAT gateway. |
problem-nat-sessionNewOverLimit | Connection drop caused by excess new NAT sessions | Critical | problem-nat-sessionNewOverLimit | The number of new sessions on an Internet NAT gateway per second exceeds the upper limit. As a result, new sessions fail and more than 100 packets are lost per second. | Critical: The number of new sessions has exceeded the upper limit within the last 10 minutes and more than 100 packets are lost per second. | |
problem-nat-portAllocationError | Allocation failure of SNAT source ports | Critical | problem-nat-portAllocationError | The EIPs bound to an Internet NAT gateway are insufficient. As a result, source ports fail to be allocated and over 10 packets are lost per second. Note You cannot configure subscription policies for this event. | Critical: Source ports have frequently failed to be allocated within the last 10 minutes and more than 10 packets are lost per second. | Increase the number of EIPs that are bound to the Internet NAT gateway. For more information, see Create and manage an Internet NAT gateway. |
problem-nat-datapathUnavailable | Data path unavailability of NAT gateways | Critical | problem-nat-datapathUnavailable | The data path of a NAT gateway is unavailable. The availability of the NAT gateway is 0% within the last 10 minutes. This indicates that all traffic on the NAT gateway is affected and the NAT gateway cannot work as expected. This may be caused by an event on the Alibaba Cloud side. Alibaba Cloud engineers are trying to restore the service. | Critical: The availability of the NAT gateway is 0% within the last 10 minutes. | If you have deployed multiple NAT gateways to implement high service availability, we recommend that you switch traffic to another NAT gateway. For more information, see Deploy multiple NAT gateways to implement high availability. Otherwise, we recommend that you contact Alibaba Cloud engineers to obtain the latest recovery progress. |
problem-nat-datapathDegraded | Data path degradation of NAT gateways | Critical | problem-nat-datapathDegraded | The data path of a NAT gateway is degraded. The availability of the NAT gateway is lower than 80% within the last 10 minutes. This indicates that more than 20% of traffic on the NAT gateway is affected and the NAT gateway cannot work as expected. The packet loss may be caused by an event on the Alibaba Cloud side. Alibaba Cloud engineers are trying to restore the service. | Critical: The availability of the NAT gateway is lower than 80% within the last 10 minutes and packets are lost. | |
CLB instance | ||||||
problem-clb-connectionOverLimit | Discarded new connections caused by excess CLB sessions | Critical | problem-clb-connectionOverLimit | The number of new connections or concurrent connections of a CLB instance exceeds the upper limit. As a result, new sessions fail and the number of dropped connections per second is large. | Critical: The number of concurrent sessions has exceeded the upper limit within the last 10 minutes and packets are lost. | Upgrade the CLB instance to a Network Load Balancer (NLB) instance or an Application Load Balancer (ALB) instance. For more information, see Manage CLB quotas. For more information about NLB and ALB, see What is NLB? and What is ALB? |
problem-clb-bandwidthOverLimit | Packet loss due to excess bandwidth usage of CLB instances | Critical | problem-clb-bandwidthOverLimit | Packet loss occurs because the bandwidth of a CLB instance exceeds the peak bandwidth. | Critical: The bandwidth usage of an instance has frequently exceeded the peak bandwidth within the last 10 minutes and more than 100 bits are lost per second. | Increase the peak bandwidth. For more information, see FAQ about CLB instances. |
problem-clb-connectionFail | Sharp increase in failed CLB connections | Critical | problem-clb-connectionFail | The number of failed connections of a CLB instance sharply increases due to the excess number, excess workload, or business exceptions of the backend servers of the CLB instance. | Critical: The number of failed connections of the CLB instance sharply increases. An alert is triggered if all of the following conditions are met: Condition 1: The number of failed connections exceeds 100 per second. Condition 2: The number of failed connections increases by 30% compared with the previous 10 minutes. Condition 3: AI analyzes the historical failed connection data to establish a baseline range. The actual number of failed connections consecutively exceeds the upper limit of the baseline by more than 30% within 10 minutes. | Upgrade the backend servers, upgrade the CLB instance, or check the service status of the backend servers. For more information, see Manage CLB quotas. |
NLB | ||||||
problem-nlb-connectionFail | Sharp increase in failed NLB connections | Critical | problem-nlb-connectionFail | The number of failed connections between the virtual IP addresses of NLB instances and Elastic Compute Service (ECS) instances is greatly increased for 10 consecutive minutes. Possible causes:
| Critical: An alert is triggered if the number of failed NLB connections meets all of the following conditions: Condition 1: Within a monitoring window of 610 seconds, the proportion of failed connections exceeds 100% of the baseline for 3 consecutive minutes. Condition 2: Within a monitoring window of 610 seconds, the number of failed connections increases by at least 50% compared with the previous hour for 7 consecutive minutes. Condition 3: Within a monitoring window of 610 seconds, the number of failed connections is equal to or greater than 1000 for 8 consecutive minutes. | Check the bandwidth usage and service status of the backend servers. |
problem-nlb-newConnectionSurge | Discarded new NLB connections | Critical | problem-nlb-newConnectionSurge | The number of new connections between the virtual IP addresses of NLB instances and ECS instances is greatly increased. As a result, new connection requests are discarded for consecutive milliseconds or seconds. | Critical: An alert is triggered if the number of NLB connections meets all of the following conditions: Condition 1: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the virtual IP address (VIP) per second is greater than 0. Condition 2: There are more than eight monitoring points within 10 minutes in which the number of connections created by the VIP per second is less than 200000. | Purchase multiple NLB instances to distribute traffic to the NLB instances or submit a ticket to your account manager. |
problem-nlb-newConnectionOverLimit | Excess new NLB connections | Critical | problem-nlb-newConnectionOverLimit | The number of new connections between a virtual IP address of an NLB instance and ECS instances per second exceeds the upper limit. As a result, new connection requests are discarded for consecutive milliseconds or seconds. | Critical: An alert is triggered if the number of NLB connections meets all of the following conditions: Condition 1: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the VIP per second is greater than 0. Condition 2: There are more than eight monitoring points within 10 minutes in which the number of connections created by the VIP per second is greater than or equal to 200000. | |
problem-nlb-concurrentConnectionOverLimit | Excess concurrent NLB connections | Critical | problem-nlb-concurrentConnectionOverLimit | The number of concurrent connections between a virtual IP address of an NLB instance and ECS instances exceeds the upper limit. As a result, new connection requests are discarded for consecutive milliseconds or seconds. | Critical: An alert is triggered if the number of NLB connections meets all of the following conditions: Condition 1: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the VIP per second is greater than 0. Condition 2: There are more than eight monitoring points within 10 minutes in which the number of concurrent connections of the VIP is greater than 5000000. | |
ALB | ||||||
problem-alb-intranetBandwidthOverLimit | Packet loss due to excess private bandwidth usage of ALB instances | Critical | problem-alb-intranetBandwidthOverLimit | The outbound or inbound bandwidth on a virtual IP address of an ALB instance exceeds the upper limit. A domain name is pointed to the IP address. | Critical: There are more than eight monitoring points within 10 minutes in which the traffic discarded by the ALB instance exceeds 100 bit/s. | Add a canonical name (CNAME) record for the ALB instance. For more information, see Add a CNAME record to an ALB instance. |
problem-alb-sessionOverLimit | Discarded new connections caused by excess ALB sessions | Critical | problem-alb-sessionOverLimit | The number of new or concurrent connections that are established between a virtual IP address of an ALB instance and ECS instances exceeds the upper limit. As a result, new sessions fail. A domain name is pointed to the IP address. | Critical: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the ALB instance per second is greater than 0. | |
problem-alb-qpsOverLimit | 503 error code returned because the number of QPS sent to a virtual IP address of an ALB instance exceeds the upper limit | Critical | problem-alb-qpsOverLimit | The number of queries per second (QPS) received by a virtual IP address of an ALB instance exceeds the upper limit. A domain name is pointed to the IP address. | Critical: There are more than eight monitoring points within 10 minutes in which the number of requests discarded by the ALB instance per second exceeds 200 queries per second (QPS). Compared to the previous 7 minutes, the number of requests discarded by the instance per second increases by 30% or more for 10 consecutive minutes. | |
Cloud Enterprise Network (CEN) instance | ||||||
problem-cen-routeOverLimit | Excess CEN routes | Critical | problem-cen-routeOverLimit | The number of CEN routes exceeds the quota, which may result in network issues. | Critical: The number of CEN routes exceeds the quota, which may result in network issues. | Upgrade transit routers. For more information, see Upgrade transit routers from Basic Edition to Enterprise Edition. |
TR | ||||||
problem-cen-vpcAttachBandwidthOverLimit | Packet loss due to excess usage of virtual private cloud (VPC) connection bandwidth | Critical | problem-cen-vpcAttachBandwidthOverLimit | Packet loss occurs because the bandwidth of CEN transit routers exceeds the peak bandwidth. | Critical: There are more than 5 monitoring points within 10 minutes in which the inbound packet loss rate is greater than 0. | Increase the peak bandwidth. For more information, see Manage CEN quotas. |
problem-cen-peerAttachBandwidthOverLimit | Packet loss due to excess usage of inter-region connection bandwidth | Critical | problem-cen-peerAttachBandwidthOverLimit | Packet loss occurs because the bandwidth of CEN transit routers exceeds the peak bandwidth. | Critical: An alert is triggered if the actual traffic of the transit router (TR) meets all of the following conditions. Condition 1: There are more than eight monitoring points within 10 minutes in which the outbound peak bandwidth usage exceeds 90%. Condition 2: There are more than eight monitoring points in which the outbound packet loss rate in rate-limited scenarios exceeds 100 packets per second (pps). | Increase the peak bandwidth. For more information, see Manage CEN quotas. |
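Many alert rules in the preceding table count the monitoring points in a 10-minute window that match a condition. The following sketch shows one way to read the two NLB new-connection rules, using the thresholds quoted in the table (more than eight points, 200000 new connections per second). The function names, the input format (one list of per-second values per metric, one entry per monitoring point), and the decision to count each condition independently are assumptions. The sampling interval that NIS uses is not stated in this topic.

```python
from typing import Callable, List, Sequence

NEW_CONN_THRESHOLD = 200_000  # new connections per second, from the NLB rows
MIN_POINTS = 8                # a rule matches when MORE than eight points qualify


def count_points(samples: Sequence[float], predicate: Callable[[float], bool]) -> int:
    """Count the monitoring points in the window that satisfy the predicate."""
    return sum(1 for value in samples if predicate(value))


def matching_nlb_events(dropped_per_s: Sequence[float],
                        new_per_s: Sequence[float]) -> List[str]:
    """Return the NLB new-connection event codes whose rules match one window.

    Both arguments hold the monitoring points of a single 10-minute window.
    """
    # Condition 1 is shared: more than eight points with discarded connections.
    if count_points(dropped_per_s, lambda v: v > 0) <= MIN_POINTS:
        return []
    codes = []
    # Condition 2 differs per event: below or at/above the threshold.
    if count_points(new_per_s, lambda v: v < NEW_CONN_THRESHOLD) > MIN_POINTS:
        codes.append("problem-nlb-newConnectionSurge")
    if count_points(new_per_s, lambda v: v >= NEW_CONN_THRESHOLD) > MIN_POINTS:
        codes.append("problem-nlb-newConnectionOverLimit")
    return codes
```

The function returns a list because, under this reading, a window with enough monitoring points could satisfy both rules at the same time.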
Risk events
Event code | Event name | Event level | Event name in CloudMonitor | Event description and impact | Alert rules | Suggestion for users |
Internet-facing instance | ||||||
risk-internetPacketLoss | Risk of Internet connection packet loss | Warn | risk-internetPacketLoss | If a packet loss alert is triggered for a physical connection of an Internet service provider (ISP) between two regions of Alibaba Cloud, data transfer over the connection may be affected. In the next 10 minutes, the bandwidth of instances within the current Alibaba Cloud account on the connection may exceed 0.5 Mbit/s or the packet loss rate of the connection may exceed 50%. Important Before you monitor this event, you must enable Internet traffic analysis in specific regions or for specific IP addresses. For more information, see the Enable the Internet traffic analysis capability section of the Work with the Internet traffic analysis capability topic. | Critical: The packet loss rate over the Internet is greater than 50% or nationwide packet loss occurs, and the average bandwidth has exceeded 0.5 Mbit/s in the last 10 minutes. Warn: The packet loss rate over the Internet is less than 50%, and the average bandwidth has exceeded 0.5 Mbit/s in the last 10 minutes. | Check whether the bandwidth of the instances on this physical connection meets your business requirements. For more information, see the 5-tuple data on the Internet Traffic page of the NIS console. If an exception occurs, you can migrate critical business data to other regions. If no exception occurs, ignore this alert. |
risk-internetBandwidthOverlimit | Packet loss risk due to excess bandwidth usage | Warn | risk-internetBandwidthOverlimit | According to historical data, the actual bandwidth of instances may exceed the peak bandwidth at a specific point in time in the future at a probability of greater than 90%. | Warn: The actual bandwidth exceeds the peak bandwidth at a certain time at a probability of greater than 90%, and packet loss occurs. | Take note of the bandwidth. If the peak bandwidth is exceeded, increase the peak bandwidth. |
VPN Gateway | ||||||
risk-vpn-bpsOverLimit | Excess usage risk of VPN connection bandwidth | Warn | risk-vpn-bpsOverLimit | The bandwidth utilization of a VPN connection has exceeded 90% three times in the last 10 minutes. | Warn: There are more than three monitoring points within 10 minutes in which the bandwidth usage exceeds 90%. | Warn: There are more than eight monitoring points within 10 minutes in which the bandwidth usage exceeds 30%. |
risk-vpn-bgpRouteLimit | Risk of excess BGP routes | Warn | risk-vpn-bgpRouteLimit | The number of routes that a VPN gateway has automatically learned by using Border Gateway Protocol (BGP) dynamic routing has exceeded 90% of the BGP route quota in the last 10 minutes. | Warn: There is more than one monitoring point within 10 minutes in which the route usage exceeds 90%. | Take note of the number of routes. If the quota is exceeded, we recommend that you aggregate the CIDR blocks of the VPN gateway based on your network planning. |
Express Connect | ||||||
risk-ec-physicalConnectionFail | Express Connect circuit or port failure | Warn | risk-ec-physicalConnectionFail | Services are interrupted because exceptions occur on the Express Connect circuits of ISPs or device ports. | Warn: The minute-level inbound rate from the data center to the VPC for the VBR instance is monitored. An alert is triggered if all of the following conditions are met. Condition 1: The Express Connect circuit experiences the down status at least 3 times but fewer than 20 times. Condition 2: The Express Connect circuit experiences the down status for more than 2 consecutive time points. Condition 3: The down status does not apply to all Express Connect circuits. | Contact your account manager. |
risk-ec-bgpRouterFail | BGP connection failure | Warn | BGP connection failure | BGP connection failures and route loss occur because connections over Express Connect circuits fail or the BGP settings are abnormal. | Warn: An alert is triggered if the status of the BGP connection changes from Connected to any other status. | Contact your account manager. |
risk-ec-inTrafficDroppedToZero | Inbound VBR traffic plummet | Warn | risk-ec-inTrafficDroppedToZero | Inbound virtual border router (VBR) traffic decreases sharply because exceptions occur on the Express Connect circuits of ISPs or device ports. | Warn: The minute-level inbound rate from the data center to the VPC for the VBR instance is monitored. An alert is triggered if all of the following conditions are met. Condition 1: The rate drops by 99% or more compared to the average rate of the previous 7 minutes for 3 consecutive minutes. Condition 2: Compared to the average rate of the previous 7 minutes, the absolute value of the rate drop each minute is greater than or equal to 1 Mbit/s for 3 consecutive minutes. Condition 3: Compared to the average rates of the previous 15, 30, and 60 minutes, the absolute value of the rate drop each minute is greater than or equal to 0.5 Mbit/s for 3 consecutive minutes. Condition 4 (Intelligent baseline alarm): By analyzing the historical inbound rate patterns of the VBR instance, AI predicts the range of the inbound rate for the next cycle. When the cycle arrives, if the rate falls below the lower limit of the predicted range by 99% for 2 consecutive minutes within 3 minutes, the drop is considered an abnormal failure. | Check whether service traffic is normal or failovers are performed after health checks. If your business is impaired, contact your account manager. |
risk-ec-outTrafficDroppedToZero | Outbound VBR traffic plummet | Warn | risk-ec-outTrafficDroppedToZero | Outbound VBR traffic decreases sharply because exceptions occur on the Express Connect circuits of ISPs or device ports. | Warn: The minute-level outbound rate from the VPC to the data center for the VBR instance is monitored. An alert is triggered if all of the following conditions are met. Condition 1: The rate drops by 99% or more compared to the average rate of the previous 7 minutes for 3 consecutive minutes. Condition 2: Compared to the average rate of the previous 7 minutes, the absolute value of the rate drop each minute is greater than or equal to 1 Mbit/s for 3 consecutive minutes. Condition 3: Compared to the average rates of the previous 15, 30, and 60 minutes, the absolute value of the rate drop each minute is greater than or equal to 0.5 Mbit/s for 3 consecutive minutes. Condition 4 (Intelligent baseline alarm): By analyzing the historical outbound rate patterns of the VBR instance, AI predicts the stable range of the outbound rate for the next cycle. When the cycle arrives, if the rate falls below the lower limit of the predicted range by 99% for 2 consecutive minutes within 3 minutes, the drop is considered an abnormal failure. | Check whether service traffic is normal or failovers are performed after health checks. If your business is impaired, contact your account manager. |
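The first two conditions of the VBR traffic plummet rules can be illustrated with a short check over per-minute rates. This is a sketch under stated assumptions: the input is a list of per-minute rates in Mbit/s, the last three entries are the minutes under test, and the seven entries before them form a fixed baseline. How NIS actually slides the baseline window is not specified in this topic. Conditions 3 and 4 (the longer baselines and the AI-predicted range) are omitted.

```python
from statistics import mean
from typing import Sequence


def traffic_plummeted(rates_mbps: Sequence[float]) -> bool:
    """Check Conditions 1 and 2 of the VBR traffic plummet rules.

    rates_mbps holds per-minute rates in Mbit/s; the last 3 values are the
    minutes under test and the 7 values before them are the baseline
    (an assumed, fixed-window reading of the rule).
    """
    if len(rates_mbps) < 10:
        raise ValueError("need at least 10 per-minute samples")
    baseline = mean(rates_mbps[-10:-3])
    recent = rates_mbps[-3:]
    if baseline <= 0:
        # No traffic in the baseline window, so there is nothing to drop from.
        return False
    # Condition 1: each of the 3 minutes is down 99% or more from the baseline.
    relative_drop = all((baseline - r) / baseline >= 0.99 for r in recent)
    # Condition 2: each of the 3 minutes is down by at least 1 Mbit/s.
    absolute_drop = all(baseline - r >= 1.0 for r in recent)
    return relative_drop and absolute_drop
```

The absolute threshold in Condition 2 keeps low-traffic links from triggering the rule: a link that averages 0.5 Mbit/s and falls to zero has dropped by 100% but by less than 1 Mbit/s, so the check returns `False`.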
Related operations
Operation | Description and references |
View events | You can view events in the following ways:
|
Subscribe to an event | You can configure event subscription policies in the CloudMonitor console. After you configure the policies, you are notified of the occurrence and updates of events by phone call, text message, or email in a timely manner. For more information, see Configure event subscription policies. |
Handle events | After you view events, you can resolve the issues based on the suggestions. For more information, see the Events section in this topic. |