Fault Management

Fault Management Overview

Fault management is a concept derived from ITIL. In IT and Internet enterprises, its purpose is to restore normal service operation as quickly as possible when a major outage occurs in the production environment and to minimize the business impact of component failures, so that the service level objectives and service quality agreed on in advance with business customers are met.

In practice, faults in IT and Internet enterprises can arise from the following causes:

Scheduled maintenance of hardware and operating systems, such as hard disk replacement and operating system patching.
Application faults, such as application performance problems, application bugs, and application changes.
Human error, such as misoperation and non-standard operations that violate established procedures.
System software faults, such as operating system crashes and database faults.
Hardware failures, such as damaged hard disks or network interface controllers.
Failures of related equipment, such as a power outage caused by a UPS failure.
Natural disasters, such as floods, fires, and earthquakes.

Take Alibaba Group as an example. To reduce the impact of faults, Alibaba Group's fault management system brings the definition, detection, emergency response, and follow-up governance of scenarios that affect real business into the scope of fault management. Combined with Alibaba Group's "risk warning" practice, management starts from hidden risks and covers both common faults, which cause some impact and lead to performance degradation, and "major faults", which seriously affect the business.

In addition, Internet enterprises have some distinctive characteristics: many scenarios that demand extremely fast response, numerous internal applications developed in rapid-iteration environments such as DevOps and Agile, and a coordination mechanism across multiple departments (legal affairs, government affairs, public relations, customer service, and technical support) during major fault emergencies. The fault management system optimizes its mechanisms to account for these characteristics.

Importance of Fault Management

Both theory and practice have shown that as long as a failure is possible, it will eventually occur. This is the idea behind Murphy's law: if the probability of an accident in a single experiment (activity) is p (p > 0), the probability of at least one accident in n independent experiments (activities) is P_n = 1 - (1 - p)^n. As n tends to infinity, P_n approaches 1; that is, the accident becomes practically inevitable.
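
As a rough numerical illustration, the following sketch evaluates the probability of at least one occurrence in n independent activities, 1 - (1 - p)^n, for a few values of n. The function name and the per-activity probability p = 0.001 are assumed example choices, not figures from a real system.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Probability that an event with per-trial probability p occurs
    at least once in n independent trials: 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.001  # assumed example value
    for n in (1, 100, 1_000, 10_000):
        print(f"n={n:>6}: P_n = {prob_at_least_one(p, n):.4f}")
```

Even with a small p, the result climbs toward 1 as n grows, which is why fault handling has to be planned for rather than hoped away.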

To ensure business stability, you can use fault management to:

Identify and resolve risks in advance to prevent problems;
Detect, locate, and recover from faults quickly to limit their scope of impact (the 1-5-10 solution, commonly read as detection within 1 minute, location within 5 minutes, and recovery within 10 minutes);
Ensure that improvement measures are effectively implemented to avoid the recurrence of faults.

By establishing a standardized, closed-loop fault management process that covers the whole fault lifecycle, and by improving the supporting technology, you can effectively reduce the probability of faults, shorten the mean time to repair (MTTR), and ultimately bring the damage caused by faults close to zero.

In daily operations, any service interruption, degradation of service quality, or degradation of user experience is called a fault, regardless of the cause. Problems caused by the user-side environment or by the user's own operations are excluded.

"Degradation of user experience" means that fault handling should center on what users actually experience. You can learn about user complaints through customer service channels and infer how the service behaves on the user side through monitoring channels.
"Service interruption and degradation of service quality" means that even if no user complains (or even if no one is using the service), a problem with the service that the enterprise provides is still a fault.
"Regardless of the cause" means that whether the cause lies with the enterprise itself or with third parties such as suppliers and operators, anything that affects users is a fault.

Fault management is a complete set of emergency response processes and mechanisms for faults. Its core functions include fault emergency response, fault convergence, fault tracking, fault review, and fault improvement. Establishing a fault emergency mechanism helps guarantee stable service operation and service experience. Fault management can also be understood as the escalation of major incidents.

Fault management should include the following functions or features:

Fault level definition: for each line of business, the relevant personnel must be brought together to define fault levels jointly, and all parties must approve the result. Define fault levels based on the following factors:

Importance of the function
Affected products, services, and applications
Scope of impact (number of users, amount of loss, public opinion, and so on)

Fault emergency: supports global emergency notification of faults through multiple channels, such as telephone, SMS, email, and IM, so that key progress on a fault reaches the relevant personnel in time and information flows faster;
Fault convergence: supports alarm convergence by time or by number of alarms, so that converged alarms are handled together as a single fault;
Fault tracking: supports online management of and collaboration on the latest progress of a fault, its scope of impact (service impact), public opinion feedback, and its timeline, so that everyone works from a unified view and faults are handled more efficiently;
Fault review: distills best practices and experience into structured requirements for in-depth fault review and implements them in the product as online checkpoints. These include root cause checkpoints (such as the cause of the failure, recent activities, injection methods, and recovery methods), change checks, monitoring checks, and identification of the responsible person and team for each fault;
Fault improvement: lets you specify improvement measures, acceptance measures, the responsible person, and the completion time for each fault, so that every fault that has gone through in-depth review leads to better business continuity and similar faults do not recur.
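
To make the fault convergence item more concrete, here is a minimal sketch of converging alarms by time and by number. It is an illustration under stated assumptions, not the platform's actual algorithm: the Alarm and Fault classes, the grouping key, and the window_seconds and max_alarms parameters are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    key: str          # hypothetical grouping key, e.g. a fault scenario name
    timestamp: float  # seconds since an arbitrary epoch
    message: str


@dataclass
class Fault:
    key: str
    first_seen: float
    alarms: list = field(default_factory=list)


def converge(alarms, window_seconds=300.0, max_alarms=100):
    """Merge alarms into faults.

    An alarm joins the most recent fault with the same key when it arrives
    within window_seconds of that fault's first alarm and the fault holds
    fewer than max_alarms alarms; otherwise it opens a new fault.
    """
    faults = []
    latest = {}  # key -> most recent Fault opened for that key
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        fault = latest.get(alarm.key)
        in_window = (
            fault is not None
            and alarm.timestamp - fault.first_seen <= window_seconds
        )
        if in_window and len(fault.alarms) < max_alarms:
            fault.alarms.append(alarm)
        else:
            fault = Fault(key=alarm.key, first_seen=alarm.timestamp, alarms=[alarm])
            faults.append(fault)
            latest[alarm.key] = fault
    return faults
```

With a 300-second window, two alarms for the same key at 0 s and 100 s end up in one fault, while a third at 400 s opens a new one.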

Best practices

Formulation and entry of fault level definition

An approach to developing standardized fault level definitions:

Divide the business into major sub-businesses based on business attributes (at the level of the overall technical architecture of the business)
Within each sub-business, distinguish core modules from sub-core and non-core modules (at the functional level)
Based on the service level of each functional module, apply the matching scope of impact and fault level definition template

The key step in this approach is matching each service level to a scope of impact and its corresponding fault level definition template. The following examples illustrate this (for reference only; each business should follow its own situation and adopt the recommended values as appropriate):

For core functionality:

For large volume (for example, more than 1,000 TPS at minute granularity during peak hours and a daily average of more than 1 million), it is recommended to define a drop of 30% or more in the minute-level success rate as P1.
For medium volume (for example, 100 to 1,000 TPS at minute granularity during peak hours and a daily average of 100,000 to 1 million), it is recommended to define a drop of 45% or more in overall success volume within 10 minutes as P1.
For small volume (for example, 10 to 100 TPS at minute granularity during peak hours and a daily average of 10,000 to 100,000), a drop of 45% or more in the overall success rate within 15 or 30 minutes is defined as P1.
For even smaller businesses (daily average below 10,000), a drop of 45% or more in the overall success rate within 60 minutes can be defined as P2.
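
The example thresholds above can be written down as a small lookup. This is only a sketch of the reference values, with several simplifications that are assumptions: it classifies volume by peak-hour TPS alone, treats anything below 10 TPS as the smallest tier, places exactly 1,000 TPS in the medium tier, picks 15 minutes where the text allows 15 or 30, and treats every threshold as a percentage drop measured over the tier's window. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TopLevelRule:
    tier: str
    level: str            # highest fault level defined for this tier
    drop_percent: float   # drop, in percent, that triggers that level
    window_minutes: int   # window over which the drop is measured


LARGE = TopLevelRule("large", "P1", 30.0, 1)
MEDIUM = TopLevelRule("medium", "P1", 45.0, 10)
SMALL = TopLevelRule("small", "P1", 45.0, 15)      # the text allows 15 or 30 minutes
SMALLER = TopLevelRule("smaller", "P2", 45.0, 60)  # assumption: below 10 TPS


def rule_for(peak_tps: float) -> TopLevelRule:
    """Pick the tier rule from peak-hour TPS (a simplification of the text)."""
    if peak_tps < 0:
        raise ValueError("peak_tps must be non-negative")
    if peak_tps > 1000:
        return LARGE
    if peak_tps >= 100:
        return MEDIUM
    if peak_tps >= 10:
        return SMALL
    return SMALLER


def top_level_fault(peak_tps: float, drop_percent: float) -> Optional[str]:
    """Return the tier's highest fault level if the drop, measured over the
    tier's window, reaches the threshold; otherwise return None."""
    rule = rule_for(peak_tps)
    return rule.level if drop_percent >= rule.drop_percent else None
```

The caller is expected to measure drop_percent over rule_for(peak_tps).window_minutes before calling top_level_fault.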

Note

Define business function modules preferably from the user's perspective or from the perspective of external callers, for example a decline in the business metrics that users experience or a decline in the success rate of external calls.

Once the threshold for the highest fault level, P1, is determined, lower the scope of impact step by step to derive the P2 to P4 standards, for example from 30% to 20% or from 45% to 30% for the next level down. (For main-path failures of large-volume services, you can stop at P3 and not define a P4 level.)

For sub-core functions (such as marketing and registration), lower the level by one relative to core functions;

For non-core functions (such as queries and back-office use), lower the level by two relative to core functions;
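
The downgrade rule for sub-core and non-core functions can be sketched as a simple index shift over the P1 to P4 scale. The function and tier names are hypothetical, and the text does not say what happens when a downgrade goes past the lowest level; returning None (no fault level) in that case is an assumption of this sketch.

```python
LEVELS = ["P1", "P2", "P3", "P4"]

# Number of levels to lower relative to a core function.
DOWNGRADE_STEPS = {"core": 0, "sub-core": 1, "non-core": 2}


def adjust_level(core_level, function_tier, lowest_level="P4"):
    """Map the level a core function would receive to the level for the
    given function tier.

    Returns None when the downgrade goes past lowest_level (an assumption).
    Raises ValueError if core_level is not within P1..lowest_level and
    KeyError for an unknown function_tier.
    """
    levels = LEVELS[: LEVELS.index(lowest_level) + 1]
    index = levels.index(core_level) + DOWNGRADE_STEPS[function_tier]
    return levels[index] if index < len(levels) else None
```

For example, an impact that would be P1 for a core function maps to P2 for a sub-core function and P3 for a non-core function; passing lowest_level="P3" models a service that defines no P4 level.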

The template for generating a fault level definition can be as follows (in actual use, it can be appropriately simplified to avoid excessive redundancy)

After the fault level definition is formulated, it must be approved by the technical director and then published to the technical team and to upstream and downstream teams. Hold briefing sessions to explain it when necessary.

Service Groups and Fault Emergency Groups

A service group is a group of personnel that can be bound to one or more fault scenarios. When a fault is triggered, the on-duty member of the service group and the additional service group members are automatically added to the fault emergency group. Service groups also support shift scheduling. In short, a service group is the set of on-duty personnel on the fault management platform.

A fault emergency group is a fault handling group that is created automatically after a fault notification is sent. In addition to the members who are added automatically, other related personnel can join on their own initiative to help troubleshoot the fault. The fault emergency group also provides fault handling functions such as response check-in, assisted troubleshooting, and an emergency playbook.
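
The relationship between service groups and fault emergency groups can be sketched as two small data structures. This is an illustrative model only: the class names, the one-person-per-day rotation, and the rotation start date are assumptions, not the platform's actual scheduling model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ServiceGroup:
    name: str
    scenarios: set   # fault scenarios this group is bound to
    rotation: list   # on-duty rotation, one person per day (assumed model)
    additional_members: list = field(default_factory=list)
    rotation_start: date = date(2024, 1, 1)  # arbitrary example start date

    def on_duty(self, day: date) -> str:
        """Return the person on duty on the given day."""
        offset = (day - self.rotation_start).days
        return self.rotation[offset % len(self.rotation)]


@dataclass
class EmergencyGroup:
    scenario: str
    members: list = field(default_factory=list)

    def join(self, person: str) -> None:
        """Add a person once; used for automatic and voluntary joins alike."""
        if person not in self.members:
            self.members.append(person)


def open_emergency_group(scenario, service_groups, day):
    """Create the emergency group for a triggered fault and add the on-duty
    member and additional members of every service group bound to it."""
    group = EmergencyGroup(scenario)
    for service_group in service_groups:
        if scenario in service_group.scenarios:
            group.join(service_group.on_duty(day))
            for person in service_group.additional_members:
                group.join(person)
    return group
```

Other related personnel can then call join on the returned group to take part in troubleshooting.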

Fault records

While a fault is in progress, record the key points in time and the key operations related to it.
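
A fault record of this kind can be modeled as a timestamped event list. The sketch below is a hypothetical structure rather than the platform's data model; the event kinds used in the usage note ("occurred", "detected", "recovered") are assumed labels.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FaultRecord:
    fault_id: str
    events: list = field(default_factory=list)  # (timestamp, kind, note) tuples

    def log(self, timestamp: datetime, kind: str, note: str = "") -> None:
        """Record a key point in time or a key operation, kept in time order."""
        self.events.append((timestamp, kind, note))
        self.events.sort(key=lambda event: event[0])

    def first(self, kind: str) -> Optional[datetime]:
        """Return the earliest timestamp logged for the given kind, if any."""
        for timestamp, event_kind, _ in self.events:
            if event_kind == kind:
                return timestamp
        return None

    def elapsed(self, start_kind: str, end_kind: str) -> Optional[timedelta]:
        """Time between the first start_kind and the first end_kind event,
        or None if either is missing."""
        start, end = self.first(start_kind), self.first(end_kind)
        if start is None or end is None:
            return None
        return end - start
```

For instance, after logging "occurred" and "recovered" events, record.elapsed("occurred", "recovered") gives the time to recover for that fault, which can later feed the review.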

Fault recovery and improvement measures

Synchronize the fault review information. After the fault ends, identify the person responsible for the fault and establish accountability.

After the service recovers, make targeted improvements to the part that failed so that such faults do not recur.