Network Intelligence Service: Manage events

Last Updated: Dec 17, 2024

Network Intelligence Service (NIS) provides the event center to help you monitor resources based on events. You can view the resources that are exposed to potential risks and configure alert rules for specific events. This way, you can handle these events at your earliest opportunity to prevent business interruptions.

Scenarios

Alibaba Cloud defines NIS events to record information about cloud network resources and to notify users of it. This information includes the status of O&M tasks, resource exceptions, and resource status changes.

  • Notification for risks and exceptions

    When events related to resource availability or performance issues occur, Alibaba Cloud pushes the events to the event center in the NIS console. Such events include instance performance degradation caused by excessive resource usage, business unavailability caused by packet loss in Internet connections, and instance subscription expiration. We recommend that you handle these events at the earliest opportunity to prevent business interruptions.

  • Automated O&M

    Alibaba Cloud defines the status of the events that are displayed in the NIS console. This helps you understand the status of system O&M tasks. New events and status changes of events are reported to CloudMonitor, which allows you to build an event-driven automated O&M system to meet your business requirements.
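The event-driven pattern described under Automated O&M can be sketched as a small dispatcher that routes each event to a handler keyed by its event code. This is a minimal illustration, not the CloudMonitor API: the event dictionary shape (code, level, resource_id), the handler names, and the returned action strings are all hypothetical. Only the event codes come from the Events section of this topic.

```python
from typing import Callable, Dict

# Hypothetical event shape: {"code": str, "level": str, "resource_id": str}.
# These field names are illustrative, not the CloudMonitor payload schema.
Event = Dict[str, str]
Handler = Callable[[Event], str]

HANDLERS: Dict[str, Handler] = {}


def on_event(code: str) -> Callable[[Handler], Handler]:
    """Register a handler for one NIS event code."""
    def register(handler: Handler) -> Handler:
        HANDLERS[code] = handler
        return handler
    return register


@on_event("problem-nat-sessionOverLimit")
def handle_nat_sessions(event: Event) -> str:
    # Suggested action in this topic: upgrade or add Internet NAT gateways.
    return f"open ticket: scale out NAT gateway {event['resource_id']}"


@on_event("problem-clb-bandwidthOverLimit")
def handle_clb_bandwidth(event: Event) -> str:
    # Suggested action in this topic: increase the peak bandwidth.
    return f"open ticket: raise peak bandwidth of {event['resource_id']}"


def dispatch(event: Event) -> str:
    """Route an event to its handler, or fall back to manual review."""
    handler = HANDLERS.get(event["code"])
    if handler is None:
        return f"manual review: {event['code']}"
    return handler(event)
```

Keeping the handlers in a registry means that a new event code can be supported by adding one decorated function, while unknown codes still reach a human through the fallback branch.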

Limits

No event is collected for retired instance families. For more information, see the end-of-sale notice for each Alibaba Cloud service.

Basic information

Event types

Alibaba Cloud defines events to record the information about cloud network resources and notify users. Events are categorized into the types described in the following table based on causes.

Category: Issue event

Description: Exceptions that have impaired business and have been in the In Progress state for 7 days.

Examples:

  • Packet loss due to excess public bandwidth usage

  • Instance suspension due to overdue payments

Category: Risk event

Description: Exceptions that may impair business and have been in the In Progress state for 7 days.

Examples:

  • Risk of affecting business due to packet loss on physical connections

  • Risk of failure due to sudden changes in traffic usage

  • Risk of instance suspension due to overdue payments

Event levels

Alibaba Cloud defines the following levels for events based on their impacts on the operation of instances:

  • Critical: Events at this level may result in instance unavailability and must be handled at your earliest opportunity.

  • Warn: Events at this level have affected your business. You must pay close attention to these events or handle them at an appropriate point in time.

  • Info: You can decide whether to pay attention to the events at this level.
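As a rough illustration of how the three levels might drive prioritization in your own tooling, the sketch below sorts events so that Critical events come first. The event dictionaries and the `triage` helper are hypothetical. Only the level names are taken from the list above.

```python
# Lower rank means handle first; the ordering mirrors the list above.
LEVEL_RANK = {"Critical": 0, "Warn": 1, "Info": 2}


def triage(events):
    """Return events sorted so that Critical comes before Warn, then Info."""
    # Unknown levels sort last instead of raising an error.
    return sorted(events, key=lambda e: LEVEL_RANK.get(e["level"], len(LEVEL_RANK)))
```

Because Python's `sorted` is stable, events that share a level keep their original relative order.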

Note

For more information about the codes, names, descriptions, and handling suggestions for events, see the Events section in this topic.

Events

This section summarizes the events supported by NIS and provides suggestions on how to handle these events.

Note

Issue events do not apply to shared-resource Classic Load Balancer (CLB) instances.

Issue events

Columns, in order: Event code, Event name, Event level, Event name in CloudMonitor, Event description and impact, Alert rules, Suggestion for users. Entries are grouped by resource type.

Internet-facing instance

problem-internetBandwidthOverlimit

Packet loss due to excess bandwidth usage

Critical

problem-internetBandwidthOverLimit

Packets are lost because the bandwidth of an Internet-facing instance exceeds the peak bandwidth.

An instance that generates Internet data transfer is called an Internet-facing instance, such as an elastic IP address (EIP), bandwidth plan, or CLB instance.

Critical: The bandwidth usage has frequently exceeded the peak bandwidth within the last 10 minutes and packets are lost.

Increase the peak bandwidth.

Internet NAT gateway

problem-nat-sessionOverLimit

Connection drop caused by excess NAT sessions

Critical

problem-nat-sessionOverLimit

The number of sessions on an Internet NAT gateway exceeds the upper limit. As a result, new sessions fail and more than 100 packets are lost per second.

Critical: The number of concurrent sessions has exceeded the upper limit in the last 10 minutes and more than 100 packets are lost per second.

Upgrade the Internet NAT gateway or create multiple Internet NAT gateways. For more information, see Manage NAT Gateway quotas and Create and manage an Internet NAT gateway.

problem-nat-sessionNewOverLimit

Connection drop caused by excess new NAT sessions

Critical

problem-nat-sessionNewOverLimit

The number of new sessions on an Internet NAT gateway per second exceeds the upper limit. As a result, new sessions fail and more than 100 packets are lost per second.

Critical: The number of new sessions has exceeded the upper limit within the last 10 minutes and more than 100 packets are lost per second.

problem-nat-portAllocationError

Allocation failure of SNAT source ports

Critical

problem-nat-portAllocationError

The EIPs bound to an Internet NAT gateway are insufficient. As a result, source ports fail to be allocated and over 10 packets are lost per second.

Note

You cannot configure subscription policies for this event.

Critical: Source ports have frequently failed to be allocated within the last 10 minutes and more than 10 packets are lost per second.

Increase the number of EIPs that are bound to the Internet NAT gateway. For more information, see Create and manage an Internet NAT gateway.

problem-nat-datapathUnavailable

Data path unavailability of NAT gateways

Critical

problem-nat-datapathUnavailable

The data path of a NAT gateway is unavailable. The availability of the NAT gateway is 0% within the last 10 minutes. This indicates that all traffic on the NAT gateway is affected and the NAT gateway cannot work as expected. This may be caused by an event on the Alibaba Cloud side. Alibaba Cloud engineers are trying to restore the service.

Critical: The availability of the NAT gateway is 0% within the last 10 minutes.

If you have deployed multiple NAT gateways to implement high service availability, we recommend that you switch traffic to another NAT gateway. For more information, see Deploy multiple NAT gateways to implement high availability. Otherwise, we recommend that you contact Alibaba Cloud engineers to obtain the latest recovery progress.

problem-nat-datapathDegraded

Data path degradation of NAT gateways

Critical

problem-nat-datapathDegraded

The data path of a NAT gateway is degraded. The availability of the NAT gateway is lower than 80% within the last 10 minutes. This indicates that more than 20% of traffic on the NAT gateway is affected and the NAT gateway cannot work as expected. The packet loss may be caused by an event on the Alibaba Cloud side. Alibaba Cloud engineers are trying to restore the service.

Critical: The availability of the NAT gateway is lower than 80% within the last 10 minutes and packets are lost.

CLB instance

problem-clb-connectionOverLimit

Discarded new connections caused by excess CLB sessions

Critical

problem-clb-connectionOverLimit

The number of new connections or concurrent connections of a CLB instance exceeds the upper limit. As a result, new sessions fail and the number of dropped connections per second is large.

Critical: The number of concurrent sessions has exceeded the upper limit within the last 10 minutes and packets are lost.

Upgrade the CLB instance to a Network Load Balancer (NLB) instance or an Application Load Balancer (ALB) instance.

For more information, see Manage CLB quotas. For more information about NLB and ALB, see What is NLB? and What is ALB?

problem-clb-bandwidthOverLimit

Packet loss due to excess bandwidth usage of CLB instances

Critical

problem-clb-bandwidthOverLimit

Packet loss occurs because the bandwidth of a CLB instance exceeds the peak bandwidth.

Critical: The bandwidth usage of an instance has frequently exceeded the peak bandwidth within the last 10 minutes and more than 100 bits are lost per second.

Increase the peak bandwidth. For more information, see FAQ about CLB instances.

problem-clb-connectionFail

Sharp increase in failed CLB connections

Critical

problem-clb-connectionFail

The number of failed connections of a CLB instance increases sharply because of an excessive number of connections, excessive workload, or business exceptions on the backend servers of the instance.

Critical: The number of failed connections of the CLB instance sharply increases. An alert is triggered if all of the following conditions are met:

Condition 1: The number of failed connections exceeds 100 per second.

Condition 2: The number of failed connections increases by 30% compared with the previous 10 minutes.

Condition 3: AI analyzes the historical failed connection data to establish a baseline range. The actual number of failed connections consecutively exceeds the upper limit of the baseline by more than 30% within 10 minutes.

Upgrade the backend servers, upgrade the CLB instance, or check the service status of the backend servers.

For more information, see Manage CLB quotas.

NLB

problem-nlb-connectionFail

Sharp increase in failed NLB connections

Critical

problem-nlb-connectionFail

The number of failed connections between the virtual IP addresses of NLB instances and Elastic Compute Service (ECS) instances increases sharply for 10 consecutive minutes. Possible causes:

  • Network jitter

  • Poor performance of backend servers

Critical: An alert is triggered if the number of failed NLB connections meets all of the following conditions:

Condition 1: Within a monitoring window of 610 seconds, the proportion of failed connections exceeds 100% of the baseline for 3 consecutive minutes.

Condition 2: Within a monitoring window of 610 seconds, the number of failed connections increases by at least 50% compared with the previous hour for 7 consecutive minutes.

Condition 3: Within a monitoring window of 610 seconds, the number of failed connections is equal to or greater than 1000 for 8 consecutive minutes.

Check the bandwidth usage and service status of the backend servers.

problem-nlb-newConnectionSurge

Discarded new NLB connections

Critical

problem-nlb-newConnectionSurge

The number of new connections between the virtual IP addresses of NLB instances and ECS instances increases sharply. As a result, new connection requests are discarded for consecutive milliseconds or seconds.

Critical: An alert is triggered if the number of NLB connections meets all of the following conditions:

Condition 1: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the virtual IP address (VIP) per second is greater than 0.

Condition 2: There are more than eight monitoring points within 10 minutes in which the number of connections created by the VIP per second is less than 200000.

Purchase multiple NLB instances to distribute traffic to the NLB instances or submit a ticket to your account manager.

problem-nlb-newConnectionOverLimit

Excess new NLB connections

Critical

problem-nlb-newConnectionOverLimit

The number of new connections between a virtual IP address of an NLB instance and ECS instances per second exceeds the upper limit. As a result, new connection requests are discarded for consecutive milliseconds or seconds.

Critical: An alert is triggered if the number of NLB connections meets all of the following conditions:

Condition 1: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the VIP per second is greater than 0.

Condition 2: There are more than eight monitoring points within 10 minutes in which the number of connections created by the VIP per second is greater than or equal to 200000.

problem-nlb-concurrentConnectionOverLimit

Excess concurrent NLB connections

Critical

problem-nlb-concurrentConnectionOverLimit

The number of concurrent connections between a virtual IP address of an NLB instance and ECS instances per second exceeds the upper limit. As a result, new connection requests are discarded for consecutive milliseconds or seconds.

Critical: An alert is triggered if the number of NLB connections meets all of the following conditions:

Condition 1: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the VIP per second is greater than 0.

Condition 2: There are more than eight monitoring points within 10 minutes in which the concurrent connection number of the VIP is greater than 5000000.

ALB

problem-alb-intranetBandwidthOverLimit

Packet loss due to excess private bandwidth usage of ALB instances

Critical

problem-alb-intranetBandwidthOverLimit

The outbound or inbound bandwidth on a virtual IP address of an ALB instance exceeds the upper limit. A domain name is pointed to the IP address.

Critical: There are more than eight monitoring points within 10 minutes in which the traffic discarded by the ALB instance exceeds 100 bit/s.

Add a canonical name (CNAME) record for the ALB instance. For more information, see Add a CNAME record to an ALB instance.

problem-alb-sessionOverLimit

Discarded new connections caused by excess ALB sessions

Critical

problem-alb-sessionOverLimit

The number of new or concurrent connections that are established between a virtual IP address of an ALB instance and ECS instances exceeds the upper limit. As a result, new sessions fail. A domain name is pointed to the IP address.

Critical: There are more than eight monitoring points within 10 minutes in which the number of connections discarded by the ALB instance per second is greater than 0.

problem-alb-qpsOverLimit

503 error code returned because the number of QPS sent to a virtual IP address of an ALB instance exceeds the upper limit

Critical

problem-alb-qpsOverLimit

The number of queries per second (QPS) received by a virtual IP address of an ALB instance exceeds the upper limit. A domain name is pointed to the IP address.

Critical: There are more than eight monitoring points within 10 minutes in which the number of requests discarded by the ALB instance per second exceeds 200. Compared to the previous 7 minutes, the number of requests discarded by the instance per second increases by 30% or more for 10 consecutive minutes.

Cloud Enterprise Network (CEN) instance

problem-cen-routeOverLimit

Excess CEN routes

Critical

problem-cen-routeOverLimit

The number of CEN routes exceeds the quota, which may result in network issues.

Critical: The number of CEN routes exceeds the quota, which may result in network issues.

Upgrade transit routers. For more information, see Upgrade transit routers from Basic Edition to Enterprise Edition.

Transit router (TR)

problem-cen-vpcAttachBandwidthOverLimit

Packet loss due to excess usage of virtual private cloud (VPC) connection bandwidth

Critical

problem-cen-vpcAttachBandwidthOverLimit

Packet loss occurs because the bandwidth of CEN transit routers exceeds the peak bandwidth.

Critical: There are more than 5 monitoring points within 10 minutes in which the inbound packet loss rate is greater than 0.

Increase the peak bandwidth. For more information, see Manage CEN quotas.

problem-cen-peerAttachBandwidthOverLimit

Packet loss due to excess usage of inter-region connection bandwidth

Critical

problem-cen-peerAttachBandwidthOverLimit

Packet loss occurs because the bandwidth of CEN transit routers exceeds the peak bandwidth.

Critical: An alert is triggered if the actual traffic of the transit router (TR) meets all of the following conditions.

Condition 1: There are more than eight monitoring points within 10 minutes in which the outbound peak bandwidth usage exceeds 90%.

Condition 2: There are more than eight monitoring points in which the outbound packet loss rate in rate-limited scenarios exceeds 100 packets per second (pps).

Increase the peak bandwidth. For more information, see Manage CEN quotas.
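Many of the alert rules above share one pattern: count the monitoring points in a 10-minute window that breach a threshold, then compare the count with a minimum. The sketch below applies that pattern to the two conditions listed for problem-nlb-newConnectionOverLimit. It is an illustration under stated assumptions, not the NIS implementation: the function names are hypothetical, the sampling interval is not stated in this topic (the caller passes whatever points fall in the window), and the two conditions are counted independently of each other.

```python
from typing import Callable, Sequence


def breaches(samples: Sequence[float], is_breach: Callable[[float], bool]) -> int:
    """Count the monitoring points in the window that satisfy is_breach."""
    return sum(1 for value in samples if is_breach(value))


def nlb_new_connection_over_limit(
    discarded_per_sec: Sequence[float],
    created_per_sec: Sequence[float],
    min_points: int = 8,
    new_conn_limit: float = 200_000,
) -> bool:
    """Evaluate both conditions of problem-nlb-newConnectionOverLimit.

    Each sequence holds the monitoring points of one 10-minute window.
    """
    # Condition 1: points at which the VIP discarded connections.
    drops = breaches(discarded_per_sec, lambda v: v > 0)
    # Condition 2: points at which new connections reached the limit.
    surges = breaches(created_per_sec, lambda v: v >= new_conn_limit)
    # "More than eight monitoring points" is a strict comparison.
    return drops > min_points and surges > min_points
```

With ten points in a window, for example, the rule fires only if at least nine of them breach each condition.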

Risk events

Columns, in order: Event code, Event name, Event level, Event name in CloudMonitor, Event description and impact, Alert rules, Suggestion for users. Entries are grouped by resource type.

Internet-facing instance

risk-internetPacketLoss

Risk of Internet connection packet loss

Warn

risk-internetPacketLoss

Packet loss is detected in the following physical connection: {Alibaba Cloud region} to {Country} - {Region} - {ISP}. The businesses within the current account may experience network jitters.

Critical: An alert is triggered if any of the following conditions is met:

Condition 1: The regional ISP network packet loss exceeds 50%.

Condition 2: Nationwide ISP network packet loss occurs. The bandwidth usage of the connection within the previous 10 minutes equals or exceeds 0.05 Mbit/s.

Note
  • Regional: a physical connection whose destination is {Country}-{Region}-{ISP}.

  • Nationwide: a physical connection whose destination is {Country}-{ISP}.

Warn: The packet loss rate over the Internet is less than 50%, and the average bandwidth has exceeded 0.5 Mbit/s in the last 10 minutes.

Check whether the bandwidth of the instances on this physical connection meets your business requirements. For more information, see the 5-tuple data on the Internet Traffic page of the NIS console. If an exception occurs, you can migrate critical business data to other regions. If no exception occurs, ignore this alert.

risk-internetBandwidthOverlimit

Packet loss risk due to excess bandwidth usage

Warn

risk-internetBandwidthOverlimit

According to historical data, the actual bandwidth of instances may exceed the peak bandwidth at a specific point in time in the future with a probability greater than 90%.

Warn: The actual bandwidth exceeds the peak bandwidth at a certain time with a probability greater than 90%, and packet loss occurs.

Take note of the bandwidth. If the peak bandwidth is exceeded, increase the peak bandwidth.

VPN Gateway

risk-vpn-bpsOverLimit

Excess usage risk of VPN connection bandwidth

Warn

risk-vpn-bpsOverLimit

The bandwidth utilization of a VPN connection has exceeded 90% three times in the last 10 minutes.

Warn: There are more than three monitoring points within 10 minutes in which the bandwidth usage exceeds 90%.

Warn: There are more than eight monitoring points within 10 minutes in which the bandwidth usage exceeds 30%.

risk-vpn-bgpRouteLimit

Risk of excess BGP routes

Warn

risk-vpn-bgpRouteLimit

The number of routes that a VPN gateway has automatically learned by using Border Gateway Protocol (BGP) dynamic routing has exceeded 90% of the BGP route quota in the last 10 minutes.

Warn: There is more than one monitoring point within 10 minutes in which the route usage exceeds 90%.

Take note of the number. If the quota is exceeded, we recommend that you aggregate the CIDR blocks of the VPN gateway based on your network planning.

Express Connect

risk-ec-physicalConnectionFail

Express Connect circuit or port failure

Warn

risk-ec-physicalConnectionFail

Services are interrupted because exceptions occur on the Express Connect circuits of ISPs or device ports.

Warn: The minute-level inbound rate from the data center to the VPC for the VBR instance is monitored. An alert is triggered if all of the following conditions are met.

Condition 1: The Express Connect circuit experiences the down status at least 3 times but fewer than 20 times.

Condition 2: The Express Connect circuit experiences the down status for more than 2 consecutive time points.

Condition 3: The down status does not apply to all Express Connect circuits.

Contact your account manager.

risk-ec-bgpRouterFail

BGP connection failure

Warn

BGP connection failure

BGP connection failures and route loss occur because connections over Express Connect circuits fail or the BGP settings are abnormal.

Warn: An alert is triggered if the status of the BGP connection changes from Connected to any other status.

Contact your account manager.

risk-ec-inTrafficDroppedToZero

Inbound VBR traffic plummet

Warn

risk-ec-inTrafficDroppedToZero

Inbound virtual border router (VBR) traffic decreases sharply because exceptions occur on the Express Connect circuits of ISPs or device ports.

Warn: The minute-level inbound rate from the data center to the VPC for the VBR instance is monitored. An alert is triggered if all of the following conditions are met.

Condition 1: The rate drops by 99% or more compared to the average rate of the previous 7 minutes for 3 consecutive minutes.

Condition 2: Compared to the average rate of the previous 7 minutes, the absolute value of the rate drop each minute is greater than or equal to 1 Mbit/s for 3 consecutive minutes.

Condition 3: Compared to the average rates of the previous 15, 30, and 60 minutes, the absolute value of the rate drop each minute is greater than or equal to 0.5 Mbit/s for 3 consecutive minutes.

Condition 4 (Intelligent baseline alarm): By analyzing the historical inbound rate patterns of the VBR instance, AI can predict the range of the inbound rate for the next cycle. At the time the cycle arrives, if the lower limit of the predicted range is exceeded by 99% for 2 consecutive minutes within 3 minutes, it is considered an abnormal failure.

Check whether service traffic is normal or failovers are performed after health checks. If your business is impaired, contact your account manager.

risk-ec-outTrafficDroppedToZero

Outbound VBR traffic plummet

Warn

risk-ec-outTrafficDroppedToZero

Outbound VBR traffic decreases sharply because exceptions occur on the Express Connect circuits of ISPs or device ports.

Warn: The minute-level outbound rate from the data center to the VPC for the VBR instance is monitored. An alert is triggered if all of the following conditions are met.

Condition 1: The rate drops by 99% or more compared to the average rate of the previous 7 minutes for 3 consecutive minutes.

Condition 2: Compared to the average rate of the previous 7 minutes, the absolute value of the rate drop each minute is greater than or equal to 1 Mbit/s for 3 consecutive minutes.

Condition 3: Compared to the average rates of the previous 15, 30, and 60 minutes, the absolute value of the rate drop each minute is greater than or equal to 0.5 Mbit/s for 3 consecutive minutes.

Condition 4 (Intelligent baseline alarm): By learning the historical outbound rate patterns of the VBR instance, AI can predict the stable range of the outbound rate for the next cycle. At the time the cycle arrives, if the lower limit of the predicted range is exceeded by 99% for 2 consecutive minutes within 3 minutes, it is considered an abnormal failure.

Check whether service traffic is normal or failovers are performed after health checks. If your business is impaired, contact your account manager.
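To make the VBR traffic plummet rules more concrete, the sketch below checks Conditions 1 and 2 against a series of per-minute rates. It is a simplified reading, not the NIS implementation. It assumes a single baseline, the average of the 7 minutes that precede the 3-minute run, and it omits Conditions 3 and 4, which the actual alert also requires. The function and parameter names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def vbr_traffic_plummet(
    rates_mbps: Sequence[float],
    baseline_minutes: int = 7,
    run_minutes: int = 3,
    min_drop_ratio: float = 0.99,
    min_drop_mbps: float = 1.0,
) -> bool:
    """Check Conditions 1 and 2 against per-minute rates, oldest first."""
    needed = baseline_minutes + run_minutes
    if len(rates_mbps) < needed:
        return False  # not enough history to evaluate
    window = rates_mbps[-needed:]
    baseline = mean(window[:baseline_minutes])
    if baseline <= 0:
        return False  # no traffic to drop from
    recent = window[baseline_minutes:]
    # Condition 1: each recent minute is down by 99% or more.
    cond1 = all((baseline - r) / baseline >= min_drop_ratio for r in recent)
    # Condition 2: each recent minute is down by at least 1 Mbit/s.
    cond2 = all(baseline - r >= min_drop_mbps for r in recent)
    return cond1 and cond2
```

Condition 2 keeps low-traffic links from triggering the rule: a link that averages 0.5 Mbit/s and falls to zero satisfies the percentage test but not the absolute one.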

Related operations

Operation

Description and references

View events

You can view events in the following ways:

Subscribe to an event

You can configure event subscription policies in the CloudMonitor console. After you configure the policies, you are notified of the occurrence and updates of events by phone call, text message, or email in a timely manner. For more information, see Configure event subscription policies.

Handle events

After you view events, you can resolve the issues based on the suggestions. For more information, see the Events section of the Event center topic.