Performance Testing

Updated at: 2023-09-25 06:49

Performance Testing concepts, scenarios, and best practices.

Performance testing uses automated tools to evaluate system performance metrics under normal, peak, and exceptional load scenarios. It requires a solid understanding of system performance and must be carried out under controlled conditions. Based on the goal of the test, performance testing is classified into load testing, stress testing, concurrency testing, configuration testing, and reliability testing.

  • Load testing examines how performance parameters vary as the load is gradually increased.

  • Stress testing pushes the system until it reaches a bottleneck or an unacceptable performance point, in order to determine the maximum service level the system can sustain.

  • Concurrency testing simulates numerous users accessing the system simultaneously. It determines whether performance problems, such as deadlocks, arise when many users access the same software, module, or data record at once.

  • Configuration testing modifies the software or hardware environment of the system under test to determine the extent to which alternative configurations affect its performance. It thus helps identify the best allocation principle for system resources.

  • Reliability testing runs the system under a certain business load for a period of time to check system stability.
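The concurrency scenario above can be sketched as a simple load generator. This is a minimal illustration, not a replacement for tools such as JMeter or PTS; `target` stands in for any business operation under test:

```python
import threading
import time

def run_concurrency_test(target, num_users):
    """Release `num_users` threads at the same instant against `target`
    and collect (success, elapsed_seconds) results for each virtual user."""
    results = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_users)  # all users start simultaneously

    def user():
        barrier.wait()  # simulate simultaneous access
        start = time.perf_counter()
        try:
            target()
            ok = True
        except Exception:
            ok = False
        elapsed = time.perf_counter() - start
        with lock:
            results.append((ok, elapsed))

    threads = [threading.Thread(target=user) for _ in range(num_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A deadlock or lock-contention problem would show up here as sharply increased elapsed times or failed results as `num_users` grows.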

Applicable Scenarios

The following scenarios can benefit from performance testing:

  1. Support for New Systems: Prior to the launch of a new system, performance stress testing can give an accurate picture of the system's load capacity. This, along with an estimate of the number of possible users, helps ensure a positive user experience once the system has been launched.

  2. Verification of Technical Upgrades: During system refactoring, performance testing can efficiently assess the effectiveness of new technologies and provide guidance for the redesign.

  3. Assurance of Peak Business Stability: Sufficient performance testing before a business peak helps ensure the stability of major promotions and other peak-period activities, preventing business losses.

  4. Site Capacity Planning: Performance testing supports fine-grained capacity planning for websites and can guide resource allocation in distributed systems.

  5. Bottleneck Detection: Performance bottlenecks in the system can be detected through performance testing and focused optimization to increase system performance.

Effective performance testing, in conjunction with the system development, refactoring, launch, and optimization phases, provides crucial direction for system stability and is a critical component of the system lifecycle.

Performance Testing Best Practices

Establish Performance Testing Goals and Baselines

Performance testing objectives might be derived from project plans or business requirements. During this step, business, resource, and application KPIs should be established for performance testing. The following "Golden Trio" of business metrics is especially significant.

  • Business Response Time (RT): Different systems have different target response times, which are usually within one second. The RT for major e-commerce systems is typically in the tens of milliseconds.

  • Business Throughput (TPS, Transactions Per Second): The number of transactions the system processes per second is a key measure of its processing capacity. A TPS reference value can be taken from comparable systems in the same industry, adjusted for the specific business. TPS is typically 50~1000 for SMEs, 1000~50000 for banks, and 30000~300000 for large e-commerce websites.

  • Success Rate: The proportion of business requests that succeed under load. Generally, the success rate should exceed 99.6%.
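The "Golden Trio" can be computed directly from raw request samples. The following is a minimal sketch, assuming each sample is an `(ok, rt_ms)` pair collected by the load tool:

```python
def golden_trio(samples, duration_seconds):
    """Compute the three key business metrics from raw samples.

    samples          -- list of (ok, rt_ms) tuples, one per request
    duration_seconds -- wall-clock length of the test window

    Returns (average RT in ms, TPS, success rate).
    """
    total = len(samples)
    avg_rt_ms = sum(rt for _, rt in samples) / total
    tps = total / duration_seconds
    success_rate = sum(1 for ok, _ in samples if ok) / total
    return avg_rt_ms, tps, success_rate
```

For example, 1000 requests in a 10-second window yield a TPS of 100, regardless of how the response times are distributed.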

Once these business criteria have been established, a performance baseline should be defined. After each performance test, whether manual or automated, compare the results against this baseline to determine whether the test passed.
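The baseline comparison itself can be automated. Below is a sketch assuming the baseline and measured results are simple dicts, with a hypothetical 10% tolerance on response-time regression:

```python
def passes_baseline(measured, baseline, rt_tolerance=1.10):
    """Check a test run against the performance baseline.

    Both arguments are dicts with keys 'rt_ms', 'tps', 'success_rate'.
    RT may regress by at most the tolerance factor (10% here, an
    illustrative assumption); TPS and success rate must meet or
    exceed the baseline.
    """
    return (measured['rt_ms'] <= baseline['rt_ms'] * rt_tolerance
            and measured['tps'] >= baseline['tps']
            and measured['success_rate'] >= baseline['success_rate'])
```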

Establish the Performance Testing Environment

The performance testing environment is classified as follows:

Brand-new Production Environment

When migrating to the cloud or a new IDC, the system can be put into formal operation after end-to-end business stress tests have been executed. The verification results are highly reliable because the tests run in the final real-world environment. Data from the existing production environment can typically be anonymized and imported, ensuring that business data (transaction data, flow records, and other key business records) covers at least six months and that data integrity (including cross-system business integrity) is preserved; the stress test is based on this core data. Once the foundational data is in place, the relevant core business traffic (logins, shopping-cart behavior, transactions, etc.) is constructed, followed by data cleanup and initialization immediately before going into production.

Proportional Performance Environment

This environment is produced by partitioning off a section of the production environment and provisioning capacity proportional to production, with a shared access layer, for performance testing. Although this environment can be expensive, the benefits include better risk controllability and more precise capacity planning. The shared access layer (CDN dynamic acceleration, BGP, WAF, ALB/CLB, layer-4/7 load balancing, etc.) must be identical to production, as this helps uncover problems. Backend service capacity should be at least one quarter of the production environment; a larger disparity considerably reduces accuracy. Database configurations should match production. For basic data preparation, the approaches used for a brand-new production environment apply: business data volume should cover more than half a year, and data integrity must be ensured.

Production Environment

There are two approaches to basic data in the production environment. The first does not modify the database at all: it relies solely on test accounts in the base tables (with the necessary data integrity applied), and the data generated during a stress test is deleted afterward (deletion can be handled by fixed SQL scripts or built into the system). The second marks stress-test traffic separately (for example, with a dedicated header) so that it is recognized and propagated through the business process, including asynchronous messages and middleware, and the resulting data eventually lands in a shadow table or shadow database. In addition, performing production stress tests during off-peak business hours minimizes the impact on live traffic. Whichever strategy is used, a dedicated stress-test cluster helps isolate the tests from production workloads.
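The traffic-marking approach can be sketched as follows. The header name `X-Stress-Test` and the `_shadow` suffix are illustrative assumptions, not a fixed convention; each system defines its own marker and propagates it through middleware and async messages:

```python
# Hypothetical header used to mark stress-test traffic end to end.
STRESS_HEADER = "X-Stress-Test"

def table_for_request(headers, table):
    """Route writes from marked stress-test traffic to a shadow table,
    leaving ordinary production traffic untouched."""
    if headers.get(STRESS_HEADER) == "1":
        return f"{table}_shadow"
    return table
```

The same check would run at every tier (service, message consumer, data-access layer) so that marked traffic never writes to real business tables.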

In the age of cloud computing, the proportional performance environment strategy is recommended: compute resources can be scaled up quickly during stress tests and scaled back down afterward, saving cost.

Build Performance Testing Scenarios

Before running performance tests, test scripts should be written and an input parameter file prepared to imitate the business's real request chain and load as closely as possible. A recording approach is typically used to ensure that test scripts match real user behavior and that no interfaces are left out: it reliably records browser or client user actions and automatically converts them into a stress-test script. The open-source JMeter stress-testing tool and Alibaba Cloud's PTS (Performance Testing Service) both offer script recording capabilities to help users create test scripts quickly.
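A recorded script ultimately amounts to an ordered request chain plus an input parameter file. Below is a minimal sketch with hypothetical endpoints, showing how per-user parameters are substituted into the recorded steps before replay:

```python
# A recorded scenario as an ordered request chain (hypothetical endpoints).
SCENARIO = [
    ("POST", "/login",    {"user": "u_{id}"}),
    ("POST", "/cart/add", {"sku": "SKU-001"}),
    ("POST", "/checkout", {}),
]

def render_scenario(scenario, user_id):
    """Substitute input parameters into the recorded steps for one
    virtual user, so each user replays the chain with its own data."""
    steps = []
    for method, path, body in scenario:
        rendered = {k: (v.format(id=user_id) if isinstance(v, str) else v)
                    for k, v in body.items()}
        steps.append((method, path, rendered))
    return steps
```

Tools like JMeter and PTS perform the same parameterization internally, driving each recorded step with rows from the input parameter file.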

Run Performance Tests and Analyze the Results

Once the test scripts and input parameter files are ready, performance testing tools can run the tests as planned. Throughout the process, the request success rate, response time, and business throughput should be monitored. Large fluctuations in these indicators, such as a considerable drop in success rate or throughput or an increase in response time, indicate that a performance bottleneck has been reached. The specific bottleneck can then be located via system resource monitoring and application monitoring and addressed with appropriate elastic scaling. After the changes, the tests are repeated to confirm the effect of the scaling.
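Watching for such fluctuations can be automated with simple thresholds. The following sketch uses the one-second RT and 99.6% success-rate targets mentioned earlier as illustrative defaults:

```python
def detect_bottleneck(window, rt_limit_ms=1000.0, min_success_rate=0.996):
    """Inspect a monitoring window of (rt_ms, ok) samples and return a
    list of reasons suggesting a bottleneck (empty list = healthy)."""
    avg_rt = sum(rt for rt, _ in window) / len(window)
    success_rate = sum(1 for _, ok in window if ok) / len(window)
    reasons = []
    if avg_rt > rt_limit_ms:
        reasons.append("response time above limit")
    if success_rate < min_success_rate:
        reasons.append("success rate below threshold")
    return reasons
```

In practice these checks would run on sliding windows of live monitoring data, and a non-empty result would trigger deeper system-resource and application monitoring.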

Continuous Stress Testing to Prevent Performance Decline

After performance testing is completed, regular regression tests should be run and compared against the performance baseline to prevent performance degradation during ongoing iteration. The testing frequency is usually aligned with agile development cycles: automated performance testing every week or every two weeks is advised. If considerable performance degradation is found, the system version can be rolled back promptly, and performance monitoring used to identify the bottleneck.
