The ChaosBlade-Based Chaos Engineering Practice of Qunar

By Qunar Basic Platform Team

1. Preface

Microservice architecture has been implemented in Qunar for many years, and the number of microservice applications has reached the thousands. As the call procedure between services becomes more complex, failures occur frequently, bringing huge economic losses to Qunar. Thus, stability construction has become an important work. Netflix proposed to improve system stability through chaos engineering in 2010. Since then, chaos engineering has been proved as an effective means to discover system weaknesses and establish the capability and reliability of the system to resist out-of-control conditions in the production environment. From the end of 2019, Qunar has also combined with its technical system to start the exploration of chaos engineering. The following part introduces our practical experience in this process.

2. Selection

We studied open-source tools related to chaos engineering at the beginning of the project to avoid repeated development. We analyzed them based on the characteristics of our technical system.

At that time, the basic resources were mainly based on KVM, and containerization was also being explored. Therefore, we had to support both platforms.
The main technology stack of our company is Java.

We also considered the active situation of the community. Finally, we chose ChaosBlade as the fault injection tool in combination with the in-house chaos engineering console in the final plan. (There was no chaosblade-box at that time.)

3. Architecture

Based on Qunar's internal system, the overall architecture is listed below:

Its vertical structure from top to bottom is listed below:

Service Governance: Portal is a CI/CD platform that provides application portrait and information, such as application dependencies, application attributes, and runtime resources. Fault drills can be created through chaos engineering UI (fault drills contain application information, application resources, and faults to be injected).
The Chaos engineering console (chaos console) provides the functions of task procedure scheduling of multiple applications and faults and procedure control of fault drills.
SaltStack and chaosblade-operator provide the installation and uninstallation capabilities of ChaosBlade.
The resources of the application are divided into containers hosted by KVM and Kubernetes. The orchestration system of fault drills communicates with the HTTP service started by ChaosBlade and Restful API to inject and recover faults.

Its horizontal structure is listed below:

The automated testing platform mainly provides the regression capability of online cases during drills and labeling assertions for strong and weak dependencies.
At the beginning of the drill, the Chaos console monitors the alerts of the core metrics of relevant applications. If there is any alert information, the responsible personnel will be notified, and the drill will be terminated and restarted to reduce losses promptly.

4. System Evolution

The application of Chaos engineering in Qunar has mainly gone through two stages:

1) Build the Capability of Fault Injection: The main issue to be solved at this stage is enabling users to create fault drills and verify whether some aspects of the system meet expectations through appropriate fault strategies manually.

2) In scenarios with strong and weak dependencies, provide the capability of dependency labeling, strong/weak dependency verification, and automated closed loop of strong and weak dependencies. Use Chaos engineering to improve the efficiency of microservice governance.

4.1 Fault Drill

It is a basic capability of Chaos engineering to simulate faults through fault injection. At this stage, we mainly provide fault injection for three scenarios: machine shutdown, OS-layer failure, and Java application. On top of this, we provide scenario-based functions.

4.1.1 Drill Process

A typical drill process is listed below:

4.1.2 Difficulties

Insufficient Open-Source Plug-ins for Fault Injection

chaosblade-exec-jvm provides the basic capabilities of Java fault injection and plug-ins of some open-source components, but they are still insufficient for the company's internal components. As a result, the personnel responsible for middleware carried out custom development, adding AsyncHttpClient, QRedis, and other fault injection-related plug-ins. The fault injection function based on the call point for HTTP DUBBO was also added.

Containerized Transformation

In the middle of 2021, Qunar started the containerized migration of applications. Fault drills also needed to support containerized drills. Based on chaosblade-operator, we compared the following plans:

The three main concerns in the plan:

Installation and Uninstallation of the Agent
Policy Injection and Recovery
Transformation Cost of the Control End

Based on the comparison of the plans above, we selected the third implementation plan.

4.2 Automated Closed Loop of Strong and Weak Dependencies

4.2.1 Background

Based on the fault drill platform, we provide the fault drill functions in scenarios with strong and weak dependencies:

Displaying and labeling of dependencies between applications
Based on dependency information, fill in the fault policy parameters reversely

However, the verification of the strong/weak dependency still needs to be operated by humans. So, we combine the automatic testing tools to develop the function of automatic labeling of strong and weak dependencies. The maintenance of strong and weak dependencies can be completed through an automated process, forming a closed loop.

4.2.2 Plan

The chaos console obtains the application dependency from the service governance platform periodically. Then, it generates fault drills based on the exception returning policy (according to downstream interfaces.) Then, it injects faults into the test environment of the application, performing assertion through running cases on the automated test platform and differentiating the results in real-time. Finally, the assertion result is obtained. The Chaos console combines the test assertion and logs hits by the fault policy to determine whether the current downstream interface is strongly or weakly dependent.

4.2.3 Difficulties

1) Compatibility of Java Agent

The automated testing platform supports the recording and playback mode. During regression testing, you can use pre-recorded traffic to perform mock operations on certain interfaces. During the process, the recording and playback agent based on JVM-SandBox is used. chaosblade-exec-jvm is also an agent based on JVM-SandBox. You may encounter compatibility issues when using two agents together.

Can two agents take effect at the same time? The namespace feature is added in version 1.3.0 of JVM-SandBox. That means multiple Java agents based on JVM-SandBox can be enabled at the same time if the namespaces are different. The default namespace used by ChaosBlade can be modified.
When AOP is used in a library, will fault injection be invalid if mock takes effect first? The blacklist function is added to the agent for recording and playback to avoid this problem.

2) Test assertions are different from ordinary tests.

When using the automated test platform for regression testing, more attention is paid to the integrity and accuracy of the data. However, when doing fault drills, there is usually a problem with weak dependencies. In addition to conventional judgments of status code, the judgment of the returned results is more about whether the core data nodes are correct. For this reason, a separate set of assertion configurations is added to the automated test platform to adapt to the fault drills.

5. Contribution in Open-Source

The main open-source project used in Qunar's chaos engineering practice is ChaosBlade. We have carried out varied custom development and bug fixes for chaosblade, chaosblade-exec-jvm, and chaosblade-operator. Some modifications have been submitted to the official for repo and merge. At the same time, we also have communicated with the ChaosBlade community. We prepare to participate in community construction to contribute our share to the ChaosBlade open-source community.

6. Planning for the Future

Our fault drill platform has performed over 80 simulated data center power-off drills and over 500 daily drills, involving over 50 core applications and over 4,000 machines. Business lines have also formed a good atmosphere of quarterly cycle drills and pre-launch verification.

The main goal of our next step is to automate online random drills, determine the minimum explosion radius through service-dependent procedures, and establish steady-state assertions for online drills. Finally, we plan to realize regular random drills for all procedures of the core pages of the whole company. At the same time, we will explore the use scenarios of chaos engineering in service governance and stability construction. This will provide technical guarantees for the stable development of the company's business.

Community

The ChaosBlade-Based Chaos Engineering Practice of Qunar

1. Preface

2. Selection

3. Architecture

4. System Evolution

4.1 Fault Drill

4.1.1 Drill Process

4.1.2 Difficulties

4.2 Automated Closed Loop of Strong and Weak Dependencies

4.2.1 Background

4.2.2 Plan

4.2.3 Difficulties

5. Contribution in Open-Source

6. Planning for the Future

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Microservices Engine (MSE)

ACK One

Container Registry

Container Service for Kubernetes