The system protection feature provides node-level traffic protection capabilities to handle unexpected situations in various scenarios. For example, if an interface is not configured with traffic protection rules and traffic on the interface surges, system protection can still provide basic traffic protection to keep the application stable. Microservices Governance provides a wide range of protection capabilities for traffic on servers and clients, including adaptive overload protection, throttling based on the total queries per second (QPS), throttling based on the total concurrency, circuit breaking for abnormal calls, and circuit breaking for slow calls.
For more information about the relationship between system protection and traffic protection, see What is the relationship between system protection and traffic protection? in this topic.
Prerequisites
Microservices Governance Enterprise Edition is activated. For more information, see Activate Microservices Governance.
Microservices Governance is enabled for your application. For more information, see Enable Microservices Governance for Java microservice applications in an ACK or ACS cluster and Enable Microservices Governance for microservice applications on ECS instances.
Procedure
Log on to the MSE console, and select a region in the top navigation bar.
In the left-side navigation pane, choose Microservices Governance > Application Governance.
On the Application list page, click the resource card of the destination application. In the left-side navigation pane, click Traffic management.
Click the System Protection tab and configure the relevant feature.
Adaptive overload protection
To use adaptive overload protection, you must make sure that the agent version is 3.1.4 or later.
Description
Adaptive overload protection uses CPU utilization as the measure of system load and adaptively adjusts the percentage of server traffic that is throttled. In unexpected traffic surge scenarios, it keeps the CPU utilization fluctuating within a small range around the configured threshold.
Effective scope
Adaptive overload protection takes effect on all server interfaces and has a lower priority than traffic protection rules.
Scenarios
Adaptive overload protection provides CPU-based basic protection for server interfaces and is suitable for CPU-sensitive applications. If unexpected traffic surges occur on an interface, the system CPU load increases and, as a result, the response time (RT) of core interfaces increases.
The steady-state CPU utilization varies based on the business of different applications. You can use stress testing or historical data to determine the maximum CPU utilization in the steady state and configure a larger value as the threshold.
GUI description
On the left side of the Adaptive Overload Protection section, you can view the adaptive overload protection events. On the right side of the section, you can view the trend of the average CPU utilization of each application node in the previous 5 minutes.
Events are reported for each node to indicate changes in throttling status as determined by the algorithm. The system generates events when throttling starts, takes effect, and ends.
You can click View in the Actions column of an event to query the CPU utilization that corresponds to a specific node IP address and play back the data during the interval in which the event was reported. This allows you to observe the node information, such as the CPU utilization and throttling probability, when the event was reported.
| Parameter | Description |
| --- | --- |
| ON | Specifies whether to enable adaptive overload protection. |
| vCPU Utilization | The expected CPU utilization threshold. If adaptive overload protection is enabled, the system uses algorithms to adaptively adjust the probability of triggering interface throttling based on the actual CPU utilization and the configured threshold. This allows the system to reject specific requests in high-load scenarios and keeps the CPU utilization fluctuating within a small range around the configured threshold. |
| Exception Settings | For more information, see Exception settings. |
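To make the adaptive behavior easier to picture, the following Java sketch shows one simple way to turn a CPU utilization threshold into a rejection probability. It is only an illustration of the general idea, not the algorithm that MSE uses. The class name, the linear ramp, and the use of the JDK's com.sun.management.OperatingSystemMXBean (getCpuLoad() requires JDK 14 or later on a HotSpot JVM) are assumptions made for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: reject a fraction of requests when CPU utilization exceeds a threshold.
// The rejection probability grows as utilization rises further above the threshold.
public class CpuOverloadGuard {
    private final double cpuThreshold; // e.g. 0.70 for a 70% target utilization
    private final com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    public CpuOverloadGuard(double cpuThreshold) {
        this.cpuThreshold = cpuThreshold;
    }

    /** Returns true if the request should be rejected (for example, with HTTP 429). */
    public boolean shouldReject() {
        double cpu = os.getCpuLoad(); // system-wide CPU utilization, 0.0 to 1.0 (JDK 14+)
        if (cpu < cpuThreshold) {
            return false; // below the threshold: let everything through
        }
        // Simple linear ramp: reject ~0% at the threshold, ~100% at full CPU utilization.
        double rejectProbability = (cpu - cpuThreshold) / (1.0 - cpuThreshold);
        return ThreadLocalRandom.current().nextDouble() < rejectProbability;
    }
}
```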
Throttling based on the total QPS
Description
Throttling based on the total QPS allows the system to measure the total QPS of a node. The total QPS is the sum of the QPS of all server interfaces on a single node. If the total QPS exceeds the configured threshold, the system performs throttling on the requests.
To implement throttling based on the total QPS, you must make sure that the agent version is 4.2.0 or later.
Effective scope
Throttling based on the total QPS takes effect on all server interfaces and has a lower priority than traffic protection rules.
Scenarios
The performance of some system behaviors may not be related to the CPU utilization. In low CPU utilization scenarios, specific applications may encounter performance deterioration due to issues related to memory, network, or other resources. If you enable throttling based on the total QPS, the system throttles requests based on the total QPS of a node and provides traffic-based protection.
If unexpected traffic surges occur on an interface, resource competition occurs and resources become insufficient. As a result, the core interface is adversely affected.
You can use stress testing or historical data to determine the total QPS of a node in the steady state and configure a larger value as the threshold.
GUI description
On the left side of the Total QPS Throttling section, you can view the events for throttling based on the total QPS. On the right side of the section, you can view the trend of the average total QPS of each application node in the previous 5 minutes.
Events are reported for nodes and interfaces on which requests are throttled based on the total QPS in the previous 5 minutes. The event reporting interval is 5 minutes.
You can click View in the Actions column of an event to query the total QPS that corresponds to a specific node IP address and play back the data during the interval in which the event was reported. This allows you to observe the total QPS of the relevant node and check whether throttling works as expected when the event was reported. If you need to view detailed information, such as the data of interfaces or nodes, you can go to the API Details or Node details page. The page redirection capability will be provided later.
| Parameter | Description |
| --- | --- |
| ON | Specifies whether to enable throttling based on the total QPS. |
| Total QPS Threshold | The total QPS threshold of a node. |
| Exception Settings | For more information, see Exception settings. |
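The following hypothetical Java sketch illustrates the idea of counting all server requests on a node against a single QPS threshold. It is not MSE's implementation: the class name and the fixed one-second window are assumptions, and a production throttler would typically use a sliding window for smoother behavior.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: count every server request on a node within the current one-second
// window and reject the excess once the total exceeds the configured QPS threshold.
public class TotalQpsLimiter {
    private final long qpsThreshold;
    private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
    private final AtomicLong counter = new AtomicLong();

    public TotalQpsLimiter(long qpsThreshold) {
        this.qpsThreshold = qpsThreshold;
    }

    /** Call once per incoming request, regardless of which interface it targets. */
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStartMillis.get();
        if (now - start >= 1000 && windowStartMillis.compareAndSet(start, now)) {
            counter.set(0); // a new one-second window begins
        }
        return counter.incrementAndGet() <= qpsThreshold;
    }
}
```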
Throttling based on the total concurrency
Description
Throttling based on the total concurrency allows the system to measure the total concurrency of a node. The total concurrency is the sum of the concurrency of all server interfaces on a single node. If the total concurrency exceeds the configured threshold, the system performs throttling on the requests.
To implement throttling based on the total concurrency, you must make sure that the agent version is 4.2.0 or later.
Effective scope
Throttling based on the total concurrency takes effect on all server interfaces and has a lower priority than traffic protection rules.
Scenarios
When the RT of a call is high (longer than 1 second in most cases), throttling based on the total QPS alone has an obvious limitation. If system resources such as thread pools, memory, and connection pools are occupied, requests queue up and the interface RT increases. In this case, even if you throttle based on QPS, a small number of new requests are still admitted every second, but the queued requests cannot be processed within seconds. As a result, the queue keeps growing and the RT of both old and new requests increases significantly. If you use throttling based on the total concurrency together with throttling based on the total QPS, the system directly rejects new requests while outstanding requests are still being processed, and admits subsequent requests only after the outstanding requests are completed, which keeps queuing durations short. This way, the success rate and average RT of requests can be significantly improved.
If unexpected traffic surges occur on an interface, resource competition occurs, resources become insufficient, and requests are queued. As a result, the RT of all requests increases.
You can use stress testing or historical data to determine the total concurrency of a node in the steady state and configure a larger value as the threshold.
GUI description
On the left side of the Total Concurrency Throttling section, you can view the events for throttling based on the total concurrency. On the right side of the section, you can view the trend of the average total concurrency of each application node in the previous 5 minutes.
Events are reported for nodes and interfaces on which requests are throttled based on the total concurrency in the previous 5 minutes. The event reporting interval is 5 minutes.
You can click View in the Actions column of an event to query the total concurrency that corresponds to a specific node IP address and play back the data during the interval in which the event was reported. This allows you to observe the total concurrency of the relevant node and check whether throttling works as expected when the event was reported. If you need to view detailed information, such as the data of interfaces or nodes, you can go to the API Details or Node details page. The page redirection capability will be provided later.
| Parameter | Description |
| --- | --- |
| ON | Specifies whether to enable throttling based on the total concurrency. |
| Total Concurrency Threshold | The total concurrency threshold of a node. |
| Exception Settings | For more information, see Exception settings. |
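The following hypothetical Java sketch illustrates node-level concurrency limiting with a semaphore: requests beyond the threshold are rejected immediately instead of queuing. The class name and the way rejections are signaled are assumptions made for this example, not MSE behavior.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap the number of in-flight requests on a node with a semaphore.
// New requests are rejected immediately when the cap is reached instead of queuing.
public class TotalConcurrencyLimiter {
    private final Semaphore permits;

    public TotalConcurrencyLimiter(int maxConcurrency) {
        this.permits = new Semaphore(maxConcurrency);
    }

    public <T> T call(Callable<T> handler) throws Exception {
        if (!permits.tryAcquire()) {
            // Over the threshold: fail fast, e.g. translate this into an HTTP 429 response.
            throw new IllegalStateException("node concurrency limit reached");
        }
        try {
            return handler.call();
        } finally {
            permits.release();
        }
    }
}
```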
Circuit breaking for abnormal calls
Description
Circuit breaking for abnormal calls allows the system to measure the abnormal call percentage of each client interface. If the abnormal call percentage exceeds the configured threshold, the system triggers circuit breaking for the interface. During the circuit breaking period, requests to the interface fail fast, and the system sends detection requests at a specific interval. If the detection requests are successful, circuit breaking ends.
To implement circuit breaking for abnormal calls, you must make sure that the agent version is 4.2.0 or later.
Effective scope
Circuit breaking for abnormal calls takes effect on all client interfaces, except for the interfaces that are configured with interface-level circuit breaking rules.
Scenarios
Circuit breaking for abnormal calls is suitable for two types of scenarios.
Timeout scenarios: If timeout issues frequently occur on a client interface, the service provider is very likely experiencing exceptions. This causes more requests of the caller application to be queued and affects other interfaces of the application. In these scenarios, circuit breaking makes calls to the service provider fail fast, which prevents request queuing.
Non-timeout scenarios: If non-timeout issues frequently occur on a client interface, circuit breaking for abnormal calls allows the system to report the relevant errors so that they can be handled. This minimizes the impact of the issues and improves the user experience when the issues occur.
GUI description
On the left side of the Abnormal Call Circuit Breaking section, you can view the events that are reported for circuit breaking for abnormal calls. On the right side of the section, you can view the top 10 application interfaces with a high abnormal call percentage in the previous 5 minutes.
Events are reported for nodes and interfaces on which circuit breaking is triggered for abnormal calls in the previous 5 minutes. The event reporting interval is 5 minutes.
| Parameter | Description |
| --- | --- |
| ON | Specifies whether to enable circuit breaking for abnormal calls. |
| Circuit Breaking Percentage Threshold (%) | The abnormal call percentage threshold for triggering circuit breaking on an interface. |
| Exception Settings | For more information, see Exception settings. |
| Advanced Settings | |
| Statistics Window Duration (s) | The length of the statistics time window. You can specify a length from 1 second to 120 minutes. |
| Circuit Breaking Duration (s) | The period during which circuit breaking is in effect. During this period, all requests to the resource fail fast. |
| Minimum number of requests | The minimum number of requests required to trigger circuit breaking. If the number of requests in the current time window is less than this value, circuit breaking is not triggered even if the circuit breaking rule is met. |
| Fuse recovery strategy | Specifies whether the circuit breaker retriggers circuit breaking after the circuit breaking period elapses. |
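For readers who want to relate the parameters above to code, the following hypothetical Java sketch shows a minimal error-ratio circuit breaker with a statistics window, a minimum request count, and a circuit breaking duration. It is an illustration only, not MSE's implementation: it uses one fixed statistics window and simply closes again after the circuit breaking period, without a dedicated half-open probe state.

```java
// Hypothetical sketch of an error-ratio circuit breaker for a single client interface.
public class ErrorRatioCircuitBreaker {
    private final double errorRatioThreshold; // e.g. 0.5 means 50% abnormal calls
    private final long minRequests;           // corresponds to Minimum number of requests
    private final long windowMillis;          // corresponds to Statistics Window Duration
    private final long openMillis;            // corresponds to Circuit Breaking Duration

    private long windowStart = System.currentTimeMillis();
    private long total;
    private long errors;
    private long openUntil;

    public ErrorRatioCircuitBreaker(double errorRatioThreshold, long minRequests,
                                    long windowMillis, long openMillis) {
        this.errorRatioThreshold = errorRatioThreshold;
        this.minRequests = minRequests;
        this.windowMillis = windowMillis;
        this.openMillis = openMillis;
    }

    /** Returns false while the circuit is open; callers should fail fast. */
    public synchronized boolean allowRequest() {
        long now = System.currentTimeMillis();
        if (now < openUntil) {
            return false; // circuit is open: reject immediately
        }
        if (now - windowStart >= windowMillis) {
            windowStart = now; // start a new statistics window
            total = 0;
            errors = 0;
        }
        return true;
    }

    /** Record the outcome of a call that was allowed through. */
    public synchronized void record(boolean abnormal) {
        total++;
        if (abnormal) {
            errors++;
        }
        if (total >= minRequests && (double) errors / total >= errorRatioThreshold) {
            openUntil = System.currentTimeMillis() + openMillis; // trip the breaker
            windowStart = openUntil; // statistics restart when the circuit closes again
            total = 0;
            errors = 0;
        }
    }
}
```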
Circuit breaking for slow calls
Description
Circuit breaking for slow calls allows the system to measure the slow call percentage of each client interface. If the slow call percentage is greater than the configured threshold, the system triggers circuit breaking for the interface. During the circuit breaking period, requests to the interface fail fast, and the system sends detection requests at a specific interval. If the detection requests are successful, circuit breaking ends.
To implement circuit breaking for slow calls, you must make sure that the agent version is 4.2.0 or later.
Effective scope
Circuit breaking for slow calls takes effect on all client interfaces, except for the interfaces that are configured with interface-level circuit breaking rules.
Scenarios
Circuit breaking for slow calls is suitable for timeout scenarios where circuit breaking for abnormal calls can also be triggered. Unlike circuit breaking for abnormal calls, circuit breaking for slow calls allows you to dynamically adjust the RT value that is used to determine slow calls without considering timeout settings.
GUI description
On the left side of the Slow Call Circuit Breaking section, you can view the circuit breaking events that are reported for slow calls. On the right side of the section, you can view the top 10 average RT values of the application in the previous 5 minutes.
Events are reported for nodes and interfaces on which circuit breaking is triggered for slow calls in the previous 5 minutes. The event reporting interval is 5 minutes.
| Parameter | Description |
| --- | --- |
| ON | Specifies whether to enable circuit breaking for slow calls. |
| Slow Call RT (ms) | Request calls whose RT exceeds this value are considered slow calls. |
| Degradation Threshold (%) | If the percentage of request calls whose RT exceeds the value of Slow Call RT (ms) is greater than this threshold, circuit breaking is triggered. |
| Exception Settings | For more information, see Exception settings. |
| Advanced Settings | |
| Statistics Window Duration (s) | The length of the statistics time window. You can specify a length from 1 second to 120 minutes. |
| Circuit Breaking Duration (s) | The period during which circuit breaking is in effect. During this period, all requests to the resource fail fast. |
| Minimum number of requests | The minimum number of requests required to trigger circuit breaking. If the number of requests in the current time window is less than this value, circuit breaking is not triggered even if the circuit breaking rule is met. |
| Fuse recovery strategy | Specifies whether the circuit breaker retriggers circuit breaking after the circuit breaking period elapses. |
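Circuit breaking for slow calls follows the same pattern as the error-ratio sketch above, except that a call counts toward the threshold when its RT exceeds the configured Slow Call RT rather than when it fails. The following hypothetical Java sketch shows how a caller might classify calls; it reuses the ErrorRatioCircuitBreaker sketch from the previous section, and all threshold values are placeholders, not recommended settings.

```java
import java.util.function.Supplier;

// Hypothetical sketch: treat "slow" as "abnormal" when feeding the breaker from the
// previous sketch, so that a high slow-call percentage trips the circuit.
public class SlowCallGuardedClient {
    private static final long SLOW_CALL_RT_MILLIS = 500; // placeholder for Slow Call RT (ms)
    private final ErrorRatioCircuitBreaker breaker =
            new ErrorRatioCircuitBreaker(0.8, 10, 10_000, 5_000); // 80% over 10 s, open for 5 s

    public String call(Supplier<String> remoteCall) {
        if (!breaker.allowRequest()) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        long start = System.currentTimeMillis();
        try {
            return remoteCall.get();
        } finally {
            long rt = System.currentTimeMillis() - start;
            breaker.record(rt > SLOW_CALL_RT_MILLIS); // slow calls count toward the threshold
        }
    }
}
```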
Exception settings
Description
You can configure exception settings for all system protection features. Requests on interfaces that are listed in the exception settings are allowed to pass directly without rule checking.
To configure exception settings, you must make sure that the agent version is 4.2.0 or later.
Scenarios
In most cases, you need to configure exception settings only for health check interfaces and key interfaces of the system. For health check interfaces, exception settings prevent throttling from affecting the health status of nodes. Key interfaces usually have separate throttling limits and are not expected to be subject to the system-wide throttling mechanism.
GUI description
In the Available Interfaces section on the left side, the interfaces that are recently called are displayed. For the interfaces that are not displayed, you can enter the interface name in the search box, click the search icon to search for the interface, and then add the interface to the Selected Interfaces section on the right side.
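As a rough illustration of what exception settings do, the following hypothetical Java sketch bypasses protection checks for an exclusion list of interface paths. The class name and the paths are placeholders chosen for this example, not values taken from MSE.

```java
import java.util.Set;

// Hypothetical sketch: skip node-level protection checks for interfaces in the exception
// list, such as health check endpoints. The paths below are placeholders.
public class ExceptionListFilter {
    private final Set<String> excludedPaths = Set.of("/actuator/health", "/healthz");

    public boolean shouldBypassProtection(String requestPath) {
        return excludedPaths.contains(requestPath);
    }
}
```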
FAQ
What is the relationship between system protection and traffic protection?
Both system protection and traffic protection can ensure that applications are in a steady state. However, the scenarios and traffic loss of system protection and traffic protection are different.
After throttling is triggered, the system returns HTTP status code 429. Custom configurations are not supported.
System protection provides traffic protection based on node-level metrics, which keeps applications in a steady state in most scenarios. System protection is applied at the application level, so the same system protection rules apply to all interfaces of an application. However, the interfaces of an application differ in importance and in their impact on system load. Traffic protection allows you to configure different thresholds for different interfaces, which covers more scenarios and minimizes the amount of traffic that is throttled.
Both system protection and traffic protection provide protection capabilities, but traffic protection performs better in terms of scenario coverage and traffic loss, whereas system protection is simpler to configure. Therefore, we recommend that you use system protection together with traffic protection: system protection ensures the stability of applications, and traffic protection reduces the amount of throttled traffic through fine-grained configurations without compromising protection performance.
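Because throttled requests receive a fixed HTTP 429 response, callers can decide how to react to it. The following hypothetical Java sketch shows one common pattern: a short exponential backoff with a bounded number of retries, followed by a fallback. The URL, retry counts, and fallback value are placeholders, not MSE requirements.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: back off and retry a limited number of times when a request is
// throttled with HTTP 429, then fall back instead of failing outright.
public class ThrottleAwareCaller {
    private final HttpClient client = HttpClient.newHttpClient();

    public String callWithBackoff(String url, int maxRetries) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                return response.body(); // not throttled: use the result
            }
            Thread.sleep((long) (100 * Math.pow(2, attempt))); // exponential backoff before retrying
        }
        return "fallback"; // still throttled: degrade gracefully
    }
}
```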
References
For more information about traffic protection policies, see Traffic protection.