Develop the key technical requirements for the execution of the performance testing - Performance Testing

This topic describes the key technical requirements for executing performance tests. These requirements help Performance Testing Service (PTS) users prevent technical risks that arise after a system goes online, evaluate the real capabilities of online systems, and test online capabilities based on business models so that potential risks are handled in advance.

Applicable scope

The technical requirements apply to all projects that require performance testing. They cover the key technologies involved in executing performance tests: system environments, test metrics, business models, data volume, test models, test types, scripts (APIs), scenarios, monitoring, bottleneck analysis, tuning, and the distributed cloud-based stress testing tools used in performance testing.

System environments

Analysis
System environments are classified into production environments, test environments, and other environments. Production and test environments each have advantages and disadvantages. A production environment provides more accurate measurements and a better reference for optimization. However, you must clean up all test data afterward and refer to the data volume requirements later in this topic when you structure basic data, or filter out the test data when you collect statistics on Business Intelligence (BI) data. A more efficient solution is the full-session stress testing provided by Alibaba Cloud. To avoid negative impacts on your production services, we recommend that you perform stress testing in a production environment during off-peak hours.
A test environment lets you control risks. However, a test environment is difficult to build, and building one at the same scale as the corresponding production environment is costly. Therefore, common solutions are to build the test environment at a fraction of the production scale, such as one half, one quarter, or one eighth, to separately deploy test clusters for some applications in the production environment, or to share databases. In addition, the test environment must import masked basic data from the production environment, such as data generated within the last 6 or 12 months, to keep the data relevant. This is important for the accuracy and reference value of stress testing performed in the test environment.
Risks
The risks of a test environment lie mainly in its differences from the corresponding production environment, which reduce the reference value of the test results. You can select a method that addresses your needs. For example, if you care most about verifying the network at the entry point, you can share the entry point between the test environment and the production environment. In addition, if you are not familiar with the operating system platform, middleware, and databases of the test environment, you cannot easily analyze and tune bottlenecks.
Requirements
1. Build a test environment
  After you are familiar with the preceding issues, you must meet the following requirements for building a test environment:
  - Make sure that the architecture of the test environment is the same as that of the production environment corresponding to the test environment.
  - Make sure that the machine models of the test environment are the same as those of the production environment. For cloud-based resources, use Elastic Compute Service (ECS) instances or containers of the same specifications.
  - Make sure that the software versions of the test environment are the same as those of the production environment. The versions include the operating system version, middleware version, database version, and application version.
  - Make sure that the parameters of the test environment are the same as those of the production environment. The parameters include operating system parameters, middleware parameters, database parameters, and application parameters.
  - Make sure that the basic data volume of the test environment is in the same order of magnitude as that of the production environment.
  - Reduce the number of load generators in the test environment and proportionally scale down other resources in the test environment.
  - Take note that the ideal configuration of the test environment is one half or one quarter of that of the production environment.
2. Investigate a test environment
  You must investigate the following layers of a test environment:
  - System architecture: You can investigate the system composition, layer functions, and differences with the production environment corresponding to the test environment. These results are mainly used for bottleneck analysis and production environment performance evaluation.
  - Operating system platform: Investigate the operating system platform so that you can monitor it by using tools.
  - Middleware: Investigate the middleware in the test environment so that you can monitor it by using tools and locate bottlenecks.
  - Database: Investigate the databases in the test environment so that you can monitor them by using tools and locate bottlenecks.
  - Application: Investigate the instances and parameters that are started in the test environment so that you can find issues and locate bottlenecks.
  You can use an Application Performance Management (APM) tool, such as Application Real-Time Monitoring Service (ARMS), to locate issues that occur at the middleware, database, or application layer.
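The build requirements above can be turned into an automated parity check. The following is a minimal sketch, not a PTS feature: the `Environment` class, its field names, and the sample values are hypothetical and only illustrate how differences between a test environment and its production environment might be listed.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Environment:
    """Hypothetical description of one environment (field names are illustrative)."""
    architecture: str
    instance_spec: str
    versions: dict = field(default_factory=dict)    # OS, middleware, database, application
    parameters: dict = field(default_factory=dict)  # OS, middleware, database, application
    node_count: int = 1


def parity_issues(test: Environment, prod: Environment) -> list:
    """Return human-readable differences between a test and a production environment."""
    issues = []
    if test.architecture != prod.architecture:
        issues.append("architecture differs")
    if test.instance_spec != prod.instance_spec:
        issues.append("instance specification differs")
    for name, prod_version in prod.versions.items():
        if test.versions.get(name) != prod_version:
            issues.append(f"version mismatch: {name}")
    for name, prod_value in prod.parameters.items():
        if test.parameters.get(name) != prod_value:
            issues.append(f"parameter mismatch: {name}")
    # The text names one half or one quarter of production as the ideal scale.
    ratio = Fraction(test.node_count, prod.node_count)
    if ratio not in (Fraction(1, 2), Fraction(1, 4)):
        issues.append(f"scale ratio {ratio} is not 1/2 or 1/4")
    return issues
```

An empty list means that the test environment matches production on every checked item and is scaled to one half or one quarter.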

Test metrics

Analysis
Test metrics are classified into business metrics, resource metrics, application metrics, and frontend metrics.
- Business metrics: include the number of concurrent users, transactions per second (TPS), success rate, and response time (RT).
- Resource metrics: include CPU utilization, memory utilization, I/O, and kernel parameters such as semaphores and the number of open files.
- Application metrics: include the number of idle threads, the number of database connections, the number of GCs or full heap GCs, and function duration.
- Frontend metrics: include page loading time and network time, which consists of Domain Name System (DNS) resolution time, connection time, and transfer time.
Risks
Different users require different types of metrics and have different expectations. You must investigate the metrics in advance and specify a threshold for each metric based on the roles of the relevant personnel. You can then test the performance that the system reaches within the thresholds, locate bottlenecks, and tune performance. If you do not pay attention to test metrics in advance, you may obtain test results that the relevant personnel do not need.
Requirements
1. Business metrics
  - RT: This metric is of interest to all relevant personnel. Business departments care most about its specific value. In most cases, the expected RT varies based on the system services. We recommend a target RT of no more than 1 second. For example, the RT of Taobao system services is mostly within tens of milliseconds.
  - TPS: This metric indicates the number of transactions processed by a system per second. It is key to measuring the processing capability of the system. When you specify the TPS for your services, you can refer to the TPS of systems in the same industry and to your own business. For reference, the TPS is 50 to 1,000 for small and medium-sized enterprises, 1,000 to 50,000 for banks, and 30,000 to 300,000 for Taobao.
  - Success rate: This metric measures the success rate of requests under high system workloads. In most cases, the success rate in the industry exceeds 99.6%.
2. Resource metrics
  In most cases, system resource metrics must not exceed their bottleneck thresholds. For example, keep the CPU utilization at or below 75% and prevent the system from using swap partitions. Ideally, when system capabilities stop improving, the cause is that resources have become the bottleneck, not that some other bottleneck exists. In this case, adding resources improves system capabilities. In practice, however, system capabilities often stop improving in performance tests before resources reach their limits.
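The thresholds mentioned above (RT within 1 second, success rate above 99.6%, CPU utilization at most 75%, and no swap usage) can be checked automatically after each test run. The following is a minimal sketch; the function name, the dictionary keys, and the default values as parameters are assumptions for illustration, and you would replace the defaults with the thresholds agreed on with the relevant personnel.

```python
def check_thresholds(measured, max_rt_ms=1000.0, min_success_rate=99.6, max_cpu_percent=75.0):
    """Return the list of threshold violations for one test run.

    `measured` is a hypothetical dict with the keys rt_ms, success_rate,
    cpu_percent and, optionally, swap_used_mb.
    """
    violations = []
    if measured["rt_ms"] > max_rt_ms:
        violations.append("RT above threshold")
    if measured["success_rate"] < min_success_rate:
        violations.append("success rate below threshold")
    if measured["cpu_percent"] > max_cpu_percent:
        violations.append("CPU utilization above threshold")
    if measured.get("swap_used_mb", 0) > 0:
        violations.append("swap partition in use")
    return violations
```

For example, `check_thresholds({"rt_ms": 120, "success_rate": 99.9, "cpu_percent": 60})` returns an empty list because every metric is within its threshold.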

Business models

Analysis
A system handles many types of business. The logic, volume, and resource consumption differ from one business to another. Therefore, the business types and their proportions determine the processing capabilities of the system, and business models play a key role in performance testing. For example, in e-commerce scenarios, different promotion forms and main categories determine the overall ratio of resources consumed. If the stress testing traffic generated by PTS accurately reflects the business model, you can find system bottlenecks and make full use of machine resources to achieve your business goals.
Risks
If the business types and proportions selected for a business model differ greatly from those in the production environment, test results have no reference value and you may misjudge the system capabilities. In addition, even if a business accounts for a very low proportion of the business volume in a system, a mutation (sudden change) in that business can be fatal to the system. Therefore, you must also focus on such businesses in performance testing.
Requirements
In most cases, select as the typical businesses of your system those that are high-volume, high-risk, frequently used, and likely to grow in the future. For systems that have been launched, you can evaluate them based on historical business volume and production environment performance during peak hours. For systems that are about to be launched, you can evaluate them based on investigation results and the resource consumption of a transaction.
1. Online systems
  - Collect the business types and business volume of different peak periods in the production environment, and determine whether they differ greatly between time periods. If the difference is large, you must use multiple business models. If the difference is small, a single business model is sufficient.
  - Collect the time points at which resource consumption is high or resource exceptions occur during peak hours in the production environment, and identify the causes of the high resource consumption and resource exceptions.
  - Collect production issues and analyze them. If the issues are caused by a business and the business has been ignored in previous performance testing, the business is significantly at risk and needs to be added to business models in subsequent performance testing.
2. Offline systems
  - Determine the business type and the business proportion by conducting investigation.
  - Determine whether some businesses may mutate during promotions and other activities by conducting investigation.
  - Determine the resource consumption of each business based on the test results. If some businesses that have a low proportion consume a large amount of resources, you must adjust the proportion of the business.
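The first step for online systems, comparing the business mix of different peak periods, can be sketched as follows. The function names, the sample business names, and the 10-percentage-point tolerance are assumptions for illustration. The text only says that a large difference calls for multiple business models and does not define how large.

```python
def proportions(volumes: dict) -> dict:
    """Convert per-business transaction counts into proportions that sum to 1."""
    total = sum(volumes.values())
    if total <= 0:
        raise ValueError("total business volume must be positive")
    return {name: count / total for name, count in volumes.items()}


def needs_separate_models(peak_a: dict, peak_b: dict, tolerance: float = 0.10) -> bool:
    """Return True if any business proportion differs by more than `tolerance`
    between two peak periods. The tolerance is an assumed cutoff."""
    share_a, share_b = proportions(peak_a), proportions(peak_b)
    names = set(share_a) | set(share_b)
    return any(abs(share_a.get(n, 0.0) - share_b.get(n, 0.0)) > tolerance for n in names)


# Hypothetical transaction counts collected during two peak periods.
morning_peak = {"browse": 7000, "order": 2000, "pay": 1000}
evening_peak = {"browse": 4000, "order": 4000, "pay": 2000}
```

Here `needs_separate_models(morning_peak, evening_peak)` returns `True`, because the share of `browse` drops from 70% to 40% between the two periods.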

Data volume

Analysis
Data volume mainly includes basic data volume (historical data volume, bottom data volume, or existing data volume in databases) and parametric data volume. Data volume plays a very important role in performance testing. The results of a query over several entries differ significantly from those of a query over millions of entries. As the business volume increases, the number of entries also increases. Therefore, when you use a performance test environment, you must keep its data volume in the same order of magnitude as that of the corresponding production environment. If you insert a test account in the production environment, you can ensure the authenticity of the environment and, to a certain extent, keep the basic data volume in the same order of magnitude. Full-session stress testing provided by Alibaba Cloud also requires the same order of magnitude of basic data volume between the test and production environments. In addition, you must consider the parametric data volume and the data distribution during testing.
Risks
If the basic data volume of a test environment is not in the same order of magnitude as that of the corresponding production environment, the values of the relevant metrics are unreliable. For example, the response time may be much shorter than that in the production environment, to the point that test results have no reference value. If the parametric data volume is too small or data distribution is not considered, test results are also unreliable and meaningless. If you want to insert a test account into the production environment, you must consider the integrity of the data preparation and the cleanup logic. Full-session stress testing provided by Alibaba Cloud requires high transformation costs and involves continuous iteration and maintenance.
Requirements
1. Basic data volume
  The basic data volume of a test environment needs to be in the same order of magnitude as that of the production environment corresponding to the test environment. In most cases, you must consider the growth trend of data volume in the next three years. If data volume quickly increases, you must construct more data in the test environment.
2. Parametric data volume
  - Maximize the parametric data volume. If necessary, you can clean up the cache or generate parametric data by writing code.
  - Properly distribute parametric data. If a business has obvious geographical distribution characteristics, you must consider the data distribution.
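The basic data volume requirements can be checked with simple arithmetic. The following is a minimal sketch with hypothetical function names. It reads "same order of magnitude" as "within a factor of 10", which is one possible interpretation, and it assumes a constant annual growth rate for the three-year projection.

```python
import math


def same_order_of_magnitude(test_rows: int, prod_rows: int) -> bool:
    """Treat two volumes as comparable if they are within a factor of 10 (assumed reading)."""
    if test_rows <= 0 or prod_rows <= 0:
        raise ValueError("row counts must be positive")
    return max(test_rows, prod_rows) < 10 * min(test_rows, prod_rows)


def projected_rows(current_rows: int, annual_growth: float, years: int = 3) -> int:
    """Project the data volume after `years` years of compound growth.

    annual_growth is a fraction, for example 0.5 for 50% growth per year.
    """
    return math.ceil(current_rows * (1 + annual_growth) ** years)
```

For example, 1,000,000 rows that grow by 50% per year reach 3,375,000 rows after three years, so the test environment should be sized for a volume of that order and not for the current one.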

Test models

Analysis
Test models evolve from business models. In most cases, the test models and the business models are the same. However, if a business cannot be simulated or must be excluded for security reasons, you must remove the business and recalculate the business proportions.
Risks
- Refer to the preceding risks of business models.
- If a removed business is at risk, you must evaluate the risks of the business. If the risks are high, you must adopt other solutions.
Requirements
Refer to the preceding requirements of business models.
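Recalculating the business proportions after a business is removed is a renormalization of the remaining shares. A minimal sketch, with a hypothetical function name and sample business names:

```python
def remove_and_renormalize(model: dict, removed: set) -> dict:
    """Drop the removed businesses and rescale the remaining shares to sum to 1."""
    kept = {name: share for name, share in model.items() if name not in removed}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("at least one business with a positive share must remain")
    return {name: share / total for name, share in kept.items()}


# Hypothetical business model, expressed as percentages of total volume.
business_model = {"browse": 50, "order": 30, "pay": 20}
test_model = remove_and_renormalize(business_model, {"pay"})
```

With `pay` removed, `browse` and `order` keep their 5:3 ratio and become 62.5% and 37.5% of the test model.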

Test types

Analysis
Test types are mainly classified into load testing and stress testing. The test types include single transaction benchmark testing, load testing, stress testing, hybrid transaction load testing (capacity testing), business mutation testing, hybrid transaction stability testing, hybrid transaction reliability testing, batch testing, and testing for the impacts of batch testing on hybrid transactions. Each test type serves a different purpose. You can select a proper test type based on the reality of your production system.
Risks
If a test type is skipped, some scenarios of the real production system are not covered, which causes risks such as system crashes and long RT.
Requirements
If you have sufficient time, we recommend that you perform most of the test types. You can also refer to the following requirements:
- Single transaction benchmark testing: optional.
- Single transaction load testing: optional. If a system is offline, we recommend that you perform the load testing for the system to view the resource consumption.
- Hybrid transaction load testing (capacity testing): required.
- Hybrid transaction stress testing: optional.
- Business mutation testing: optional.
- Hybrid transaction stability testing: required.
- Hybrid transaction reliability testing: optional.
- Batch testing: optional.
- Testing for the impacts of batch testing on hybrid transactions: optional.

Business sessions

Analysis
A business session is an ordered set of stress testing APIs that have business meanings. A session is similar to a transaction and is used to simulate the business operations that users perform. The accuracy of the simulation directly affects the measured performance of your system. Simulating business operations requires parameterized data. For more information about the distribution and data volume of parameterized data, see Data volume.
Risks
If a business does not run successfully or its business logic is significantly different from that of the real production environment, test results have no reference value.
Requirements
- Orchestrate business sessions based on business rules in a production environment.
- Verify the return values of your server at key points and add checkpoints (assertions) to a stress testing API (a request on the client that is triggered by a user behavior). For more information, see Interface response parameters.
- Parameterize data and maximize the data volume.
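The idea of an ordered session with checkpoints can be sketched in plain code. This is not how PTS implements sessions. The `Step` class, `run_session`, and the `fake_send` stub that stands in for real HTTP calls are all hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    build_request: Callable[[dict], dict]  # builds a request from the session context
    checkpoint: Callable[[dict], bool]     # assertion on the server response


def run_session(steps, send, params):
    """Run the steps in order and stop at the first failed checkpoint.

    Returns (succeeded, name_of_failed_step_or_None).
    """
    context = dict(params)  # parameterized data for this virtual user
    for step in steps:
        response = send(step.build_request(context))
        if not step.checkpoint(response):
            return False, step.name
        # Pass response parameters on to the later steps.
        context.update(response.get("extract", {}))
    return True, None


def fake_send(request):
    """Stand-in for a real HTTP call so that the sketch runs without a server."""
    if request["api"] == "login":
        return {"status": 200, "extract": {"token": "t-" + request["user"]}}
    if request["api"] == "order":
        ok = request.get("token", "").startswith("t-")
        return {"status": 200 if ok else 401, "extract": {}}
    return {"status": 404, "extract": {}}


steps = [
    Step("login", lambda ctx: {"api": "login", "user": ctx["user"]},
         lambda resp: resp["status"] == 200),
    Step("order", lambda ctx: {"api": "order", "token": ctx.get("token", "")},
         lambda resp: resp["status"] == 200),
]
```

Running `run_session(steps, fake_send, {"user": "u1"})` succeeds. If the login step is left out, the order step fails its checkpoint, which shows why the order of the APIs and the checkpoints at key points both matter.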

Scenarios

Analysis
A (stress testing) scenario is a combination of several HTTP-based or HTTPS-based Uniform Resource Locators (URLs) or APIs. The scenario is used to simulate business scenarios in the real production environment, and includes the pressure mode, pressure increment method, and running time. A simulated scenario must be consistent with the scenario in the production environment. In particular, within a given period of time, the tested TPS proportion of each business must be consistent with the business proportion during peak hours in the production environment.
Risks
The main risk of a scenario is that the tested TPS proportion of a business is inconsistent with the proportion of that business in the production environment. If the tested TPS proportion seriously deviates from the business proportion, test results are unreliable or invalid and cannot reflect the business scenario in the production environment.
Requirements
The TPS proportion of each business in the test results must be consistent with the business proportion of the business model in the production environment. To ensure consistency, you can use the PTS-exclusive Requests Per Second (RPS) mode, which tests throughput directly. For example, if interfaces A and B have a ratio of 1:4 and RTs of 1 ms and 100 ms respectively, you only need to set the RPS of the two interfaces at a ratio of 1:4 in PTS. If the traditional concurrency mode is used, the ratio must be converted: concurrency is the product of throughput and RT, so the concurrency of the two interfaces must be set at a ratio of (1 × 1 ms):(4 × 100 ms), which is 1:400. This way, the final business model is consistent with that of the production environment.
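The 1:400 conversion follows from Little's Law: concurrency equals throughput multiplied by response time. The following is a minimal sketch that assumes steady-state load with no think time. The function name and the absolute RPS values are illustrative, and only the 1:4 ratio and the RTs come from the example above.

```python
def required_concurrency(rps: float, rt_ms: float) -> float:
    """Little's Law: concurrent requests = requests per second * RT in seconds."""
    return rps * rt_ms / 1000.0


# Interfaces A and B at an RPS ratio of 1:4, with RTs of 1 ms and 100 ms.
concurrency_a = required_concurrency(rps=1000, rt_ms=1)    # 1.0
concurrency_b = required_concurrency(rps=4000, rt_ms=100)  # 400.0
```

The resulting concurrency ratio is 1:400. It changes whenever the RT of either interface changes, which is why the RPS mode is easier to keep consistent with the business model.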

Monitoring

Analysis
Monitoring is mainly designed to support performance testing analysis, to monitor systems in a comprehensive manner, and to locate bottlenecks more efficiently. In most cases, operating systems, middleware, databases, and applications need to be monitored. Make sure that the metrics configured for each type of monitoring are comprehensive.
Risks
Without comprehensive system monitoring, you cannot analyze performance, locate system bottlenecks, or identify tuning items.
Requirements
Focus on the following metrics:
- Operating system-specific metrics: CPU utilization that can be reflected by the User, Sys, Wait, or Idle metric, memory utilization (including swap partition utilization), disk I/O, network I/O, and kernel parameters.
- Middleware-specific metrics: thread pools, Java Database Connectivity (JDBC) connection pools, and JVMs (including GC size, full heap GC size, or heap size).
- Database-specific metrics: inefficient SQL statements, locks, cache, sessions, and the number of processes.
- Application-specific metrics: method duration, synchronization and asynchronization, buffering, and cache.
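As an illustration of how the User, Sys, Wait, and Idle metrics relate to CPU utilization, the following sketch derives percentages from two samples of cumulative CPU time counters. The function name and the counter values are hypothetical. The sketch treats all non-idle time as busy, which is an assumption, because some tools also exclude Wait time from utilization.

```python
def cpu_breakdown(before: dict, after: dict) -> dict:
    """Return the percentage of time spent in each state between two samples.

    Both arguments hold cumulative counters with the keys user, sys, wait, idle.
    """
    states = ("user", "sys", "wait", "idle")
    deltas = {state: after[state] - before[state] for state in states}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different points in time")
    shares = {state: 100.0 * delta / total for state, delta in deltas.items()}
    # Assumption: everything that is not idle counts as utilization.
    shares["utilization"] = 100.0 - shares["idle"]
    return shares


# Hypothetical counter samples taken at the start and end of an interval.
sample_start = {"user": 100, "sys": 50, "wait": 10, "idle": 840}
sample_end = {"user": 160, "sys": 70, "wait": 20, "idle": 850}
```

For these samples, the interval breaks down into 60% User, 20% Sys, 10% Wait, and 10% Idle. A high Wait share points to I/O, whereas a high User share points to application code.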

Bottleneck analysis

Analysis
Bottleneck locating is designed to analyze bottleneck points in a system and prepare for tuning. The performance bottleneck points of the system are mainly distributed in the operating system resources, middleware parameter settings, database issues, and application algorithms. Targeted tuning is conducive to the improvement of system performance.
Risks
If the bottleneck points of a system cannot be analyzed, newly launched or core businesses are at risk, which may cause a poor user experience or even system crashes during peak hours.
Requirements
Focus on the following metrics for the analysis of system bottleneck points:
- Operating system resource consumption metrics: CPU, memory, disk I/O, and network I/O.
- Middleware metrics: thread pools, JDBC connection pools, and JVMs (including GC size, full heap GC size, or heap size).
- Database metrics: inefficient SQL statements, lock waits, deadlocks, cache hit ratio, sessions, and processes.
- Application: method duration, algorithms, synchronization and asynchronization, cache, and buffering.
- Load generator: the resource consumption of load generators. In most cases, load generators are unlikely to become a bottleneck point. Load generators used in PTS have protection and scheduling mechanisms, so you do not need to pay separate attention to them.
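Before you drill into the metrics above, it helps to know at which load level the system stops scaling. The following sketch is one simple heuristic and is not a method prescribed by PTS: it scans results from a stepped load test and reports the first step at which TPS grows by less than an assumed 5% while RT still rises. The function name, the threshold, and the sample data are hypothetical.

```python
def find_saturation(samples, min_tps_gain=0.05):
    """Find the load level at which throughput stops scaling.

    samples: list of (concurrency, tps, rt_ms), ordered by increasing concurrency.
    Returns the concurrency of the first step whose relative TPS gain is below
    min_tps_gain while RT increases, or None if no such step exists.
    """
    for previous, current in zip(samples, samples[1:]):
        _, prev_tps, prev_rt = previous
        cur_concurrency, cur_tps, cur_rt = current
        tps_gain = (cur_tps - prev_tps) / prev_tps
        if tps_gain < min_tps_gain and cur_rt > prev_rt:
            return cur_concurrency
    return None


# Hypothetical results of a stepped load test.
stepped_results = [
    (10, 100, 50),
    (20, 198, 51),
    (40, 380, 55),
    (80, 390, 110),
    (160, 385, 230),
]
```

For this data, `find_saturation(stepped_results)` returns 80: doubling the concurrency from 40 to 80 adds less than 3% TPS while RT doubles. The monitoring data collected around that step is where bottleneck analysis should start.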

Tuning

Analysis
Tuning is designed to improve system performance. Tuning addresses the bottleneck points identified through analysis, and the resulting improvement in system performance is verified through testing.
Risks
If a system goes online without being tuned, the user experience may be poor and the system may even crash.
Requirements
System tuning covers the following items:
- Middleware tuning: thread pools, database connection pools, and JVMs.
- Database tuning: inefficient SQL statements, deadlocks and lock waits, cache hit ratio, processes, and session parameters.
- Application tuning: method duration, algorithms, synchronization and asynchronization, cache, and buffering.
- System resources: In most cases, high consumption of system resources, such as CPUs, is caused by improper settings of applications and parameters instead of insufficient resources.

Distributed cloud-based stress testing tools used in the performance testing

Overview
PTS is a Web-based software as a service (SaaS) performance testing platform that delivers powerful distributed stress testing capabilities and can simulate the real business scenarios of massive users.
PTS can simulate access from Content Delivery Network (CDN) nodes that are deployed in hundreds of cities and across various operators around the world. Compared with industry products that generate load from cloud hosts, PTS can quickly initiate stress testing from more regions and delivers more powerful pulse and traffic regulation capabilities. PTS emphasizes visual orchestration on pages, supports complex interactive stress testing without coding, supports the RPS stress testing mode, and applies traffic adjustments immediately.
PTS is designed to continuously simplify performance stress testing so that you can focus more energy on your businesses and performance issues. You can use PTS to construct complex interactive traffic that is as close as possible to real business scenarios at low labor and resource costs, and to quickly measure the business performance of your system. PTS also helps you locate performance issues, configure capacity ratios, and construct traffic for full-session stress testing. This enhances user experience, promotes business development, and maximizes the business value of your enterprise.
Features
1. Stress testing scenario construction
  PTS allows stress testing APIs to be executed in sequence or orchestrated in parallel. PTS supports parameterized combinations that include data files, system functions, strings, and response parameters, provides high compatibility with cookies, and delivers a wide range of instruction extension scenarios. The debugging feature provided by PTS allows you to easily verify the data flow in complex scenarios.
2. Stress testing traffic regulation
  PTS supports the concurrency and RPS modes and can start stress testing within minutes. PTS reduces deviation, supports automatic and fully manual modes, applies adjustments to stress testing traffic within seconds, can generate instantaneous pulses of traffic in the tens of millions, and ensures that stress testing traffic stops in time.
3. Monitoring and stress testing reports
  Stress testing data includes the concurrency, TPS, RT, and sampled logs of each API.
Benefits
1. Stability and reliability
  - PTS provides high technical stability.
  - PTS is suitable for a wide range of industries, including e-commerce, multimedia, finance and insurance, logistics express, advertising and marketing, and social networking.
2. Powerful features
  - PTS is a complete SaaS offering that requires no additional installation or deployment.
  - PTS provides recording plug-ins for mainstream browsers.
  - PTS serves as a data factory that formats the request parameters of APIs and URLs used in stress testing through simple encoding.
  - PTS orchestrates complex scenarios in a fully visualized manner, supports logon status sharing, parameter passing, and business assertion, and provides an extensible instruction feature that supports multiple forms of think time and traffic regulation.
  - PTS supports multiple stress testing modes, such as RPS and concurrency.
  - PTS allows you to dynamically adjust the traffic within seconds and instantly initiates millions of queries per second (QPS).
  - PTS provides a powerful reporting feature that displays and collects statistics on the real-time data of a stress testing client in multiple dimensions, and automatically generates reports for later reference.
  - PTS allows you to debug stress testing APIs and scenarios and query the logs of a stress testing process.
3. Real traffic
  - Traffic comes from hundreds of cities around the world and covers all operators. PTS truly simulates the traffic source of end users. Therefore, the corresponding reports and data are closer to the real user experiences.
  - PTS delivers powerful stress capabilities and supports stress testing traffic with a high RPS.