Performing Stress Tests

What are the general process and methods of stress testing? Which data metrics are important? How can you estimate the QPS (queries per second) that the backend must support? This article shares and analyzes the problems that need attention during stress testing. Comments and feedback are welcome.

1. Stress Testing Goal

Before planning the stress tests, we must first clarify the goals. The ultimate goal is always to optimize system performance, but different specific goals call for different approaches. In general, goals may include:

1.1. Find System Bottlenecks and Optimize System Performance

When a new system is launched, there is not enough baseline performance data. In this case, the stress test generally does not specify target values for QPS or RT (response time). Instead, stress is increased continuously to approach the system limit, so that problems are exposed and can be solved.

1.2. Establish a Performance Baseline

This goal aims to collect the current maximum performance indicators of the system. Based on business characteristics, the tolerance for RT and error rate is determined first. Then, the maximum QPS and concurrency that can be supported are calculated through stress testing.

At the same time, it is possible to establish a reasonable alert mechanism by combining performance indicators and monitoring data. Alarm items for system water level, traffic limiting threshold, and elastic policies can also be established.

The system capabilities and SLA can also be quantified, for example, for reference in bidding.

1.3. Performance Regression

For online systems or systems with specific performance requirements, the QPS and RT that the system needs to support can be determined from online operation data. Regression verification can then be run before any change that may affect performance goes live, ensuring that the performance still meets expectations.
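A regression check of this kind can be sketched as a simple comparison between measured and required values. This is only an illustrative sketch: the function name `check_regression` and the 5% tolerance are assumptions for the example, not part of any tool mentioned in this article.

```python
from typing import List


def check_regression(measured_qps: float, measured_rt_ms: float,
                     required_qps: float, required_rt_ms: float,
                     tolerance: float = 0.05) -> List[str]:
    """Return a list of failure messages; an empty list means the check passed.

    The tolerance (5% by default) is an assumed allowance for run-to-run noise.
    """
    failures = []
    if measured_qps < required_qps * (1 - tolerance):
        failures.append(
            f"QPS {measured_qps:.1f} is below the required {required_qps:.1f}")
    if measured_rt_ms > required_rt_ms * (1 + tolerance):
        failures.append(
            f"RT {measured_rt_ms:.1f} ms exceeds the required {required_rt_ms:.1f} ms")
    return failures
```

A release pipeline could run the stress test, pass the measured values to this function, and block the release when the returned list is not empty.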

1.4. System Stability

This goal focuses on whether the system can stably meet its SLA over a long period under a certain level of stress. Generally, you can set the stress value to 80% of the service peak traffic and keep applying it continuously.

1.5. Latency and Stability of Network and Lines

Network quality is critical in some business scenarios, such as DNS resolution, CDN services, multi-player real-time online games, and high-frequency trading. These scenarios require extremely low network latency. Differences between network lines (for example, China Telecom, China Unicom, and the Education Network) are especially important.

2. Stress Testing Object

After clarifying the testing goal, you need to determine what needs to be tested. Generally, you can divide stress testing objects like this:

3. Stress Testing Data

During stress testing, you need to focus on the following data indicators:

3.1. Starter/Client

The three most important indicators are QPS, RT, and success rate. Other indicators include:

Average Page Response Time: This one is important.
Concurrency: Concurrency is not that important while focusing on QPS.
Maximum Number of Concurrent Online Users: Additional stress testing is not needed for the user login except for some special scenarios.
Network Quality: Latency, fluctuations, and others will not be described in this article.
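As a minimal sketch of how the three main client-side indicators relate to raw request records, the following example computes QPS, average RT, and success rate from a list of samples. The `Sample` record shape and the `summarize` function are hypothetical. Real tools such as JMeter produce their own result formats.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Sample:
    """One request as recorded by the load generator (hypothetical record shape)."""
    start_s: float     # request start time, in seconds
    elapsed_ms: float  # response time (RT), in milliseconds
    ok: bool           # True if the request succeeded


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Compute QPS, average RT, and success rate over the sampled window."""
    if not samples:
        raise ValueError("no samples")
    first_start = min(s.start_s for s in samples)
    last_end = max(s.start_s + s.elapsed_ms / 1000.0 for s in samples)
    window_s = max(last_end - first_start, 1e-9)  # guard against a zero-length window
    return {
        "qps": len(samples) / window_s,
        "avg_rt_ms": mean(s.elapsed_ms for s in samples),
        "success_rate": sum(1 for s in samples if s.ok) / len(samples),
    }
```

Here QPS is computed over the whole window from the first request start to the last response, which is one of several reasonable definitions.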

3.2. Server

Monitoring data is a key component, including:

4. Result Analysis

Generally, as stress (the number of concurrent requests) increases, the relationship among QPS, RT, and success rate is explored to find the balance point of the system. Ideally, the server-side monitoring data is also included in the analysis.
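One way to make "balance point" concrete is to pick, among the measured stress levels, the one with the highest QPS that still satisfies the RT and success rate tolerances. The sketch below assumes this definition. The `Step` record and `find_balance_point` function are hypothetical names for illustration.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    """Aggregated result of one stress level (hypothetical record shape)."""
    concurrency: int
    qps: float
    rt_ms: float
    success_rate: float


def find_balance_point(steps: List[Step], max_rt_ms: float,
                       min_success_rate: float) -> Optional[Step]:
    """Return the step with the highest QPS that stays within both tolerances.

    Returns None when no step satisfies the tolerances.
    """
    acceptable = [
        s for s in steps
        if s.rt_ms <= max_rt_ms and s.success_rate >= min_success_rate
    ]
    return max(acceptable, key=lambda s: s.qps, default=None)
```

Past the balance point, QPS typically stops growing while RT rises and the success rate drops, so the steps beyond it are filtered out by the tolerances.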

5. Testing Tools

5.1. JMeter

Concurrency Thread Group

Java Sampler

Composite Chart

You can merge multiple charts into one chart. The coordinate system automatically scales to display the results on one chart.

6. Calculation Methods of Performance Indicators

The preceding indicators are all system-oriented data, which is insensitive or meaningless to users. So, what kind of figures are meaningful?

If you are providing an online web service, users may care about how many concurrent users the system can serve before they notice any lag. The system SLA may tolerate occasional page errors and retries.

If you are providing a settlement system, users may be concerned about settlement performance. Provided that transactions are processed correctly, users care about the number of orders that can be processed per second. Settlement errors are not allowed, but the system SLA can still tolerate occasional timeouts and retries.

Let's take a look at the analysis on the diagram below:

6.1. Basic Calculation Methods

Here, page views (PV) means the number of backend API calls counted from the backend log. If PV is instead counted by the frontend in the usual sense, the same principle applies after a conversion: frontend PV * x-ratio = number of backend calls.

1. Obtain the average and peak value of daily on-site AS API PV and UV.

2. Take max(PV peak value * 0.8, average daily PV) as the target PV'. Take the UV value in the PV' period as the reference number of concurrent users, N'. Calculate PV' / (minutes in the period) / N' and take the result as O', the number of API calls each user triggers per minute.

3. Calculate the QPS that the backend must support according to the following rules:

3.1. Model assumption (Pareto principle): 80% of PV occurs during 20% of the working hours each day (ratio = 0.8).
3.2. Assume that each page request maps to 10 backend API requests (e = 10) and that there are 8 working hours a day.
3.3. Assume that an ordinary user operates the page 10 times per minute during peak hours (o = 10).
3.4. Calculate the QPS required to support the concurrent operations of N' users based on the daily on-site PV: PV' * ratio / (working hours * 60 * 60 * (1 - ratio)) = QPS
3.5. Calculate the QPS required to support the concurrent operations of N' users based on the hourly on-site PV during peak hours: PV' * ratio / (1 * 60 * 60 * (1 - ratio)) = QPS

4. Based on the QPS measured in stress testing, the maximum number of concurrent users is calculated like this:

4.1. Following the assumptions above, each user operates the page 10 times per minute, and each frontend operation corresponds to 10 backend API calls:
PV in one minute = N * 10 * 10
N * 10 * 10 / 60 = QPS ==> N = QPS * 0.6
4.2. If each user operates the page 100 times within an hour, and each frontend operation corresponds to 10 backend API calls:
PV in an hour = N * 100 * 10
N * 100 * 10 / (60 * 60) = QPS ==> N = QPS * 3.6
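The formulas in steps 3 and 4 can be sketched as two small functions. This is an illustrative sketch under the assumptions stated in the steps above (Pareto ratio of 0.8, PV counted as backend API calls). The function and parameter names are chosen for this example only.

```python
def required_qps(pv: float, hours: float, ratio: float = 0.8) -> float:
    """QPS the backend must support for `pv` API calls spread over `hours`.

    Implements PV' * ratio / (hours * 60 * 60 * (1 - ratio)): a share `ratio`
    of the traffic is assumed to arrive in a share (1 - ratio) of the time.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1 (exclusive)")
    return pv * ratio / (hours * 60 * 60 * (1 - ratio))


def max_concurrent_users(qps: float, ops_per_user: float,
                         apis_per_op: float, period_s: float) -> float:
    """Solve N * ops_per_user * apis_per_op / period_s = QPS for N."""
    return qps * period_s / (ops_per_user * apis_per_op)


# Step 3.5 with 10,000 PV in one peak hour: roughly 11.1 QPS.
peak_hour_qps = required_qps(pv=10_000, hours=1)

# Step 4.1 with a measured QPS of 30: 10 operations per minute, 10 APIs each.
users_per_minute_model = max_concurrent_users(30, ops_per_user=10,
                                              apis_per_op=10, period_s=60)
```

With these inputs, the per-minute model in step 4.1 gives 18 users for a QPS of 30, while the per-hour model in step 4.2 (100 operations per hour) gives 108 users for the same QPS.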

6.2. Forward Deduction

Suppose the on-site data shows that 50 people logged in to the system during the peak period from 9:00 to 10:00 and that the PV for that hour was 10,000. Applying the Pareto assumption from step 3.1 to the peak-hour formula in step 3.5 gives 10,000 * 0.8 / (1 * 60 * 60 * (1 - 0.8)) ≈ 11.1. The system can therefore support normal operation for the current number of users only when the overall QPS is greater than or equal to 11.

6.3. Reverse Deduction

If the stress test results from the in-house environment show that the QPS for randomly selected API calls is 30, keep the assumptions above and apply step 4.1: N = 30 * 0.6 = 18. The system can therefore support simultaneous operations from 18 users during peak hours.

6.4. Defects

The methods above treat all APIs as equally weighted (1:1) random calls, but actual calls are unevenly distributed. The distribution of API calls can be collected from the on-site data, and the stress test should simulate a call distribution as close as possible to the actual one.
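A minimal sketch of simulating the observed call distribution is to derive per-API weights from the log and sample calls in proportion to them. The function names and the API paths in the usage note are hypothetical, and real logs would need parsing first.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def build_weights(api_log: List[str]) -> Dict[str, float]:
    """Turn a list of logged API names into relative call frequencies."""
    if not api_log:
        raise ValueError("empty log")
    counts = Counter(api_log)
    total = sum(counts.values())
    return {api: n / total for api, n in counts.items()}


def pick_apis(weights: Dict[str, float], k: int,
              seed: Optional[int] = None) -> List[str]:
    """Draw k API names at random, in proportion to their observed frequency."""
    rng = random.Random(seed)
    apis = list(weights)
    return rng.choices(apis, weights=[weights[a] for a in apis], k=k)
```

For example, if a log contains 7 calls to a hypothetical `/orders` API and 3 calls to `/users`, `build_weights` yields weights of 0.7 and 0.3, and `pick_apis` then produces a request sequence in which `/orders` appears about 70% of the time.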

In addition, the number of page operations per minute and the expansion ratio from each frontend request to backend API calls are estimates. You can make an approximate calculation based on the model, but it is less accurate than a direct calculation based on the on-site data.

7. Other Considerations

Procedure tracing for bottleneck analysis
API Log
eagleye-traceId

Cache impact on the database
Consider the stress testing scenario to determine the amount of stress on the database layer
Is it necessary to create a large amount of random stress testing data? For example, if the cache is optimized for a single user, the performance measured with a single user cannot be extrapolated to multi-user concurrent scenarios.
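One hedged way to avoid measuring only the cached single-user path is to spread requests over a pool of distinct test users. The sketch below generates such a sequence. The `user-000123` ID format and the function name are assumptions for illustration, and the test accounts themselves would still need to exist in the system under test.

```python
import random
from typing import List, Optional


def make_user_ids(n_users: int, n_requests: int,
                  seed: Optional[int] = None) -> List[str]:
    """Return one user ID per request, drawn at random from n_users distinct IDs.

    With n_users == 1 every request hits the same (likely cached) user.
    A larger pool pushes more requests past the cache to the database.
    """
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    rng = random.Random(seed)
    pool = [f"user-{i:06d}" for i in range(n_users)]
    return [rng.choice(pool) for _ in range(n_requests)]
```

Comparing a run with a pool of one user against a run with a large pool gives a rough indication of how much the cache is hiding database load.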

Stress testing for synchronous APIs and asynchronous APIs (staragent)
For asynchronous APIs, which return a result immediately after a task is submitted, test the processing capability of the backend tasks
The impact of traffic limiting settings at different system levels on APIs
There are traffic limiting settings, such as Sentinel in the service layer, X5 in the Nginx layer, and other LVS-based settings.
Message communication, especially for broadcast messages
Databases, especially for write consistency
Long-procedure calls in complex scenarios
The impact of NGINX/Tomcat configurations on requests
The impact of easy-to-ignore object serialization/deserialization on performance
Hotspot Data
