Microservices Engine: Configure system protection

Last Updated: Jan 14, 2025

The system protection feature provides node-level traffic protection for unexpected situations that interface-level rules do not cover. For example, if traffic surges on an interface that has no traffic protection rules configured, system protection still provides basic traffic protection to keep the application stable. Microservices Governance provides a wide range of protection capabilities for traffic on servers and clients: adaptive overload protection, throttling based on the total queries per second (QPS), throttling based on the total concurrency, circuit breaking for abnormal calls, and circuit breaking for slow calls.

Note

For more information about the relationship between system protection and traffic protection, see What is the relationship between system protection and traffic protection? in this topic.

Prerequisites

Procedure

  1. Log on to the MSE console, and select a region in the top navigation bar.

  2. In the left-side navigation pane, choose Microservices Governance > Application Governance.

  3. On the Application list page, click the resource card of the destination application. In the left-side navigation pane, click Traffic management.

  4. Click the System Protection tab and configure the relevant feature.

Adaptive overload protection

Note

To use adaptive overload protection, you must make sure that the version of the agent is V3.1.4 or later.

Description

The adaptive overload protection capability uses the CPU utilization as the basis for measuring the system load and adaptively adjusts the throttling percentage of server traffic. The adaptive overload protection capability can also keep the CPU utilization fluctuating within a small range around the configured threshold in unexpected traffic surge scenarios.
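The feedback loop described above can be sketched as follows. This is a minimal illustration, not the algorithm that the agent actually uses, which this topic does not specify. The class name, the `step` gain, and the proportional update rule are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class AdaptiveOverloadGuard:
    """Hypothetical CPU-driven throttle; not the actual MSE algorithm."""

    cpu_threshold: float             # configured CPU utilization threshold, 0.0-1.0
    step: float = 0.5                # assumed gain: how strongly to react to the gap
    reject_probability: float = 0.0  # current share of server traffic to throttle

    def update(self, cpu_utilization: float) -> float:
        """Adjust the reject probability from the latest CPU sample."""
        error = cpu_utilization - self.cpu_threshold
        # Raise the probability above the threshold, lower it below the threshold.
        adjusted = self.reject_probability + self.step * error
        self.reject_probability = min(1.0, max(0.0, adjusted))
        return self.reject_probability

    def allow(self, rng: random.Random) -> bool:
        """Admit a request unless it falls into the throttled share."""
        return rng.random() >= self.reject_probability


guard = AdaptiveOverloadGuard(cpu_threshold=0.70)
for sample in (0.60, 0.90, 0.95, 0.65):
    print(sample, round(guard.update(sample), 3))
```

In this sketch, the throttling percentage rises while the CPU utilization stays above the threshold and falls back when the load drops, which is what keeps the utilization near the configured value.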

Effective scope

Adaptive overload protection takes effect on all server interfaces and has a lower priority than traffic protection rules.

Scenarios

Adaptive overload protection provides CPU-based basic protection for server interfaces and is suitable for CPU-sensitive applications. A typical case is an unexpected traffic surge on one interface that raises the system CPU load and, as a result, increases the response time (RT) of core interfaces.

The steady-state CPU utilization varies based on the business of different applications. You can use stress testing or historical data to determine the maximum CPU utilization in the steady state and configure a larger value as the threshold.

GUI description

On the left side of the Adaptive Overload Protection section, you can view the adaptive overload protection events. On the right side of the section, you can view the trend of the average CPU utilization of each application node in the previous 5 minutes.

Events are reported per node to indicate changes in the throttling status that the algorithm determines. The system generates an event when throttling starts, while throttling is in effect, and when throttling ends.

You can click View in the Actions column of an event to query the CPU utilization that corresponds to a specific node IP address and play back the data during the interval in which the event was reported. This allows you to observe the node information, such as the CPU utilization and throttling probability, when the event was reported.

Parameter

Description

ON

  • Close: Adaptive overload protection is disabled.

  • Simulated Execution: In this state, if adaptive overload protection is triggered, only the relevant events are generated, and the traffic protection policies are not adjusted.

  • Open: In this state, if adaptive overload protection is triggered, the traffic protection policies are adjusted to throttle a specific percentage of ingress traffic.

vCPU Utilization

The expected CPU utilization threshold. If adaptive overload protection is enabled, the system uses algorithms to adaptively adjust the probability of triggering interface throttling based on the actual CPU utilization and the configured CPU utilization threshold. This allows the system to reject specific requests in high load scenarios and keeps the CPU utilization fluctuating within a small range around the configured threshold.

Exception Settings

For more information, see Exception settings.

Throttling based on the total QPS

Description

Throttling based on the total QPS allows the system to measure the total QPS of a node. The total QPS is the sum of the QPS of all server interfaces on a single node. If the total QPS exceeds the configured threshold, the system performs throttling on the requests.
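A minimal sketch of the idea, assuming a fixed one-second counting window (the windowing scheme that the agent uses is not documented here). `TotalQpsLimiter` and its injectable `clock` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


class TotalQpsLimiter:
    """Node-wide counter shared by every server interface (illustrative only)."""

    def __init__(self, threshold: int, clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.clock = clock
        self.window_start = int(clock())
        self.count = 0

    def try_acquire(self) -> bool:
        now = int(self.clock())        # index of the current one-second window
        if now != self.window_start:   # a new second started: reset the counter
            self.window_start = now
            self.count = 0
        if self.count >= self.threshold:
            return False               # the caller would answer with HTTP 429
        self.count += 1
        return True


fake_now = [0.0]                       # fake clock so the example is deterministic
limiter = TotalQpsLimiter(threshold=3, clock=lambda: fake_now[0])
results = [limiter.try_acquire() for _ in range(5)]
fake_now[0] = 1.2                      # move to the next second
results.append(limiter.try_acquire())
print(results)  # [True, True, True, False, False, True]
```

Because the counter is shared by all interfaces, a surge on one interface consumes the budget of the whole node, which is why the threshold is derived from the steady-state total QPS.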

Note

To implement throttling based on the total QPS, you must make sure that the agent version is 4.2.0 or later.

Effective scope

Throttling based on the total QPS takes effect on all server interfaces and has a lower priority than traffic protection rules.

Scenarios

System performance is not always tied to CPU utilization. Even when CPU utilization is low, an application may suffer performance deterioration due to memory, network, or other resource issues. If you enable throttling based on the total QPS, the system throttles requests based on the total QPS of a node, which provides traffic-based protection that does not depend on CPU metrics.

If unexpected traffic surges occur on an interface, resource competition occurs and resources become insufficient. As a result, the core interface is adversely affected.

You can use stress testing or historical data to determine the total QPS of a node in the steady state and configure a larger value as the threshold.

GUI description

On the left side of the Total QPS Throttling section, you can view the events for throttling based on the total QPS. On the right side of the section, you can view the trend of the average total QPS of each application node in the previous 5 minutes.

Events are reported for nodes and interfaces on which requests are throttled based on the total QPS in the previous 5 minutes. The event reporting interval is 5 minutes.

You can click View in the Actions column of an event to query the total QPS that corresponds to a specific node IP address and play back the data during the interval in which the event was reported. This allows you to observe the total QPS of the relevant node and check whether throttling works as expected when the event was reported. If you need to view detailed information, such as the data of interfaces or nodes, you can go to the API Details or Node details page. The page redirection capability will be provided later.

Parameter

Description

ON

  • Close: Throttling based on the total QPS is disabled.

  • Enable: Throttling is performed on requests if the configured threshold is exceeded.

Total QPS Threshold

The total QPS threshold of a node.

Exception Settings

For more information, see Exception settings.

Throttling based on the total concurrency

Description

Throttling based on the total concurrency allows the system to measure the total concurrency of a node. The total concurrency is the sum of the concurrency of all server interfaces on a single node. If the total concurrency exceeds the configured threshold, the system performs throttling on the requests.
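The mechanism can be pictured as a node-wide in-flight counter. The following sketch is an assumption-based illustration, not the agent's implementation; `TotalConcurrencyLimiter`, `try_enter`, and `leave` are hypothetical names.

```python
import threading


class TotalConcurrencyLimiter:
    """Tracks requests in flight across all server interfaces of one node."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        """Admit the request only if the node is below the threshold."""
        with self._lock:
            if self.in_flight >= self.threshold:
                return False          # rejected immediately instead of queued
            self.in_flight += 1
            return True

    def leave(self) -> None:
        """Call when an admitted request finishes, successfully or not."""
        with self._lock:
            self.in_flight -= 1


limiter = TotalConcurrencyLimiter(threshold=2)
print(limiter.try_enter(), limiter.try_enter())  # True True
print(limiter.try_enter())                       # False: two requests in flight
limiter.leave()                                  # one request completes
print(limiter.try_enter())                       # True
```

In real request handling, `leave` belongs in a `finally` block so that failed requests also release their slot.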

Note

To implement throttling based on the total concurrency, you must make sure that the agent version is 4.2.0 or later.

Effective scope

Throttling based on the total concurrency takes effect on all server interfaces and has a lower priority than traffic protection rules.

Scenarios

If the RT of a call is high (longer than 1s in most cases), throttling based on the total QPS alone has an obvious weakness. When system resources such as thread pools, memory, and connection pools are occupied, requests are queued and the interface RT increases. QPS-based throttling still admits a small number of new requests every second, but the queued requests cannot be processed within that second. As a result, more requests are queued, and the RT of both old and new requests increases significantly. If you use throttling based on the total concurrency together with throttling based on the total QPS, the system rejects new requests outright while earlier requests are still being processed. After those requests are processed, the system admits subsequent requests, which then complete with a shorter queuing duration. This way, the success rate and average RT of requests improve significantly.
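The effect can be quantified with Little's law, which states that the average number of requests in flight equals the admission rate multiplied by the time each request spends in the system. The numbers below (a total QPS threshold of 200 and three RT values) are hypothetical and only illustrate the arithmetic.

```python
def steady_state_in_flight(admitted_qps: float, rt_seconds: float) -> float:
    """Little's law: average requests in flight = admission rate x time in system."""
    return admitted_qps * rt_seconds


# Same QPS threshold, increasingly slow calls.
for rt in (0.05, 1.0, 3.0):
    print(f"RT {rt}s -> about {steady_state_in_flight(200, rt):.0f} requests in flight")
```

With the same QPS threshold, the number of requests that occupy threads and connections grows in proportion to the RT. A total concurrency threshold caps that number directly, regardless of how slow the calls become.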

If unexpected traffic surges occur on an interface, resource competition occurs, resources become insufficient, and requests are queued. As a result, the RT of all requests increases.

You can use stress testing or historical data to determine the total concurrency of a node in the steady state and configure a larger value as the threshold.

GUI description

On the left side of the Total Concurrency Throttling section, you can view the events for throttling based on the total concurrency. On the right side of the section, you can view the trend of the average total concurrency of each application node in the previous 5 minutes.

Events are reported for nodes and interfaces on which requests are throttled based on the total concurrency in the previous 5 minutes. The event reporting interval is 5 minutes.

You can click View in the Actions column of an event to query the total concurrency that corresponds to a specific node IP address and play back the data during the interval in which the event was reported. This allows you to observe the total concurrency of the relevant node and check whether throttling works as expected when the event was reported. If you need to view detailed information, such as the data of interfaces or nodes, you can go to the API Details or Node details page. The page redirection capability will be provided later.

Parameter

Description

ON

  • Close: Throttling based on the total concurrency is disabled.

  • Enable: Throttling is performed on requests if the configured threshold is exceeded.

Total Concurrency Threshold

The total concurrency threshold of a node.

Exception Settings

For more information, see Exception settings.

Circuit breaking for abnormal calls

Description

Circuit breaking for abnormal calls allows the system to measure the abnormal call percentage of each client interface. If the abnormal call percentage exceeds the configured threshold, the system triggers circuit breaking for the interface. During the circuit breaking period, the interface quickly fails, and the system sends detection requests at a specific interval. If the requests are successful, the circuit breaking process ends.
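The following sketch shows one way such a breaker could behave, using the parameters described later in this section (percentage threshold, statistics window, circuit breaking duration, minimum number of requests) and single detection recovery. It is an illustration under stated assumptions, not the agent's implementation. `ErrorRatioBreaker`, the state names, and the injectable `clock` are hypothetical.

```python
import time
from typing import Callable


class ErrorRatioBreaker:
    """Hypothetical client-side breaker keyed on the abnormal call percentage."""

    def __init__(self, threshold_pct: float, window_s: float, break_s: float,
                 min_requests: int, clock: Callable[[], float] = time.monotonic):
        self.threshold_pct = threshold_pct
        self.window_s = window_s
        self.break_s = break_s
        self.min_requests = min_requests
        self.clock = clock
        self.state = "closed"          # closed -> open -> half_open -> closed
        self.opened_at = 0.0
        self._reset_window()

    def _reset_window(self) -> None:
        self.window_start = self.clock()
        self.total = 0
        self.errors = 0

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()

    def allow(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.break_s:
                return False           # fail fast during the breaking period
            self.state = "half_open"   # let one detection request through
            return True
        return self.state == "closed"  # half_open: a probe is already in flight

    def record(self, success: bool) -> None:
        if self.state == "half_open":  # result of the detection request
            if success:
                self.state = "closed"
                self._reset_window()
            else:
                self._trip()
            return
        if self.state == "open":
            return                     # late results of earlier calls are ignored
        if self.clock() - self.window_start >= self.window_s:
            self._reset_window()       # start a new statistics window
        self.total += 1
        self.errors += 0 if success else 1
        if (self.total >= self.min_requests
                and 100.0 * self.errors / self.total > self.threshold_pct):
            self._trip()


now = [0.0]                            # fake clock for a deterministic example
breaker = ErrorRatioBreaker(threshold_pct=50, window_s=10, break_s=5,
                            min_requests=4, clock=lambda: now[0])
for ok in (True, False, False, False):
    breaker.record(ok)
print(breaker.state)    # open: 3 of 4 calls failed
now[0] = 6.0
print(breaker.allow())  # True: detection request after the breaking period
breaker.record(True)
print(breaker.state)    # closed
```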

Note

To implement circuit breaking for abnormal calls, you must make sure that the agent version is 4.2.0 or later.

Effective scope

Circuit breaking for abnormal calls takes effect on all client interfaces, except for the interfaces that are configured with interface-level circuit breaking rules.

Scenarios

Circuit breaking for abnormal calls is suitable for two types of scenarios.

Timeout scenarios: If timeouts frequently occur on a client interface, the service provider is very likely experiencing exceptions. This causes more requests of the caller application to be queued and affects other interfaces of the application. In these scenarios, circuit breaking makes calls to the service provider fail fast to prevent request queuing.

Non-timeout scenarios: If non-timeout issues frequently occur on a client interface, circuit breaking for abnormal calls allows the system to report relevant errors for user handling. This minimizes the impact of the issues and optimizes the user experience when the issues occur.

GUI description

On the left side of the Abnormal Call Circuit Breaking section, you can view the events that are reported for circuit breaking for abnormal calls. On the right side of the section, you can view the top 10 application interfaces with a high abnormal call percentage in the previous 5 minutes.

Events are reported for nodes and interfaces on which circuit breaking is triggered for abnormal calls in the previous 5 minutes. The event reporting interval is 5 minutes.

Parameter

Description

ON

  • Close: Circuit breaking for abnormal calls is disabled.

  • Enable: Circuit breaking is triggered if the abnormal call percentage exceeds the configured threshold.

Circuit Breaking Percentage Threshold (%)

The abnormal call percentage threshold for triggering circuit breaking on an interface.

Exception Settings

For more information, see Exception settings.

Advanced Settings

Statistics Window Duration (s)

The length of the statistics time window. You can specify the length of the time window from 1 second to 120 minutes.

Circuit Breaking Duration (s)

The period in which circuit breaking is implemented. If circuit breaking is implemented on the resources, all requests quickly fail in the configured duration.

Minimum number of requests

The minimum number of requests to trigger circuit breaking. If the number of requests in the current time window is less than the value of this parameter, circuit breaking is not triggered even if the circuit breaking rule is met.

Fuse recovery strategy

Specifies how the circuit breaker resumes traffic after the circuit breaking period elapses. Valid values:

  • Single detection recovery: After the circuit breaking period elapses, the circuit breaker detects the next request. If a slow call or an abnormal call does not occur on the request, the circuit breaking process ends. Otherwise, circuit breaking is triggered again.

  • Progressive recovery: If you select this option, you must set the Number of recovery phases and Minimum number of passes per step parameters.

      • After the circuit breaking period elapses, the circuit breaker performs progressive recovery based on the specified number of recovery stages. If the number of requests in a stage reaches the value of the Minimum number of passes per step parameter, a check is triggered. If the number of checked requests does not exceed the configured threshold, the percentage of requests that are allowed to pass is gradually increased until all the requests are allowed to pass. If the number of checked requests exceeds the configured threshold in a stage, circuit breaking is triggered again.

      • The request percentage is calculated based on the following formula: Request percentage (T) = 100/Number of recovery stages (N). The request percentage in the first stage is T, and the request percentage in the second stage is 2T. The percentage increases by T in each stage until it reaches 100%.

      • For example, if the number of recovery stages is 3 and the minimum number of requests that are allowed to pass in each stage is 5, requests are admitted in the three stages at 33%, 67%, and 100%. If the number of requests in a stage is greater than or equal to 5, a check is triggered. If the number of requests in a stage is less than 5, the system enters the next recovery stage until all the requests are allowed to pass.
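The stage percentages in the formula above can be reproduced with a few lines. Rounding to whole percentages is an assumption made to match the 33%, 67%, and 100% example. The function name is hypothetical.

```python
def recovery_percentages(stages: int) -> list[int]:
    """Share of requests admitted in each stage: T, 2T, ... up to 100."""
    step = 100 / stages               # T = 100 / N
    return [round(step * k) for k in range(1, stages + 1)]


print(recovery_percentages(3))  # [33, 67, 100]
print(recovery_percentages(4))  # [25, 50, 75, 100]
```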

Circuit breaking for slow calls

Description

Circuit breaking for slow calls allows the system to measure the slow call percentage of each client interface. If the slow call percentage is greater than the configured threshold, the system triggers circuit breaking for the interface. During the circuit breaking period, the interface quickly fails, and the system sends detection requests at a specific interval. If the requests are successful, the circuit breaking process ends.
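The trigger condition can be sketched as a check over the RT values collected in one statistics window. The function and parameter names are hypothetical and mirror the Slow Call RT (ms), Degradation Threshold (%), and Minimum number of requests parameters described later in this section. The sample RT values are made up.

```python
from typing import Sequence


def should_break_on_slow_calls(rts_ms: Sequence[float], slow_call_rt_ms: float,
                               degradation_threshold_pct: float,
                               min_requests: int) -> bool:
    """Return True if the slow call percentage of the window exceeds the threshold."""
    if len(rts_ms) < min_requests:
        return False                  # too few samples to make a decision
    slow = sum(1 for rt in rts_ms if rt > slow_call_rt_ms)
    return 100.0 * slow / len(rts_ms) > degradation_threshold_pct


window = [120, 950, 1100, 80, 1300, 1500]   # RT samples in milliseconds
print(should_break_on_slow_calls(window, 1000, 40, 5))   # True: 3 of 6 calls are slow
print(should_break_on_slow_calls(window, 1000, 40, 10))  # False: fewer than 10 requests
```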

Note

To implement circuit breaking for slow calls, you must make sure that the agent version is 4.2.0 or later.

Effective scope

Circuit breaking for slow calls takes effect on all client interfaces, except for the interfaces that are configured with interface-level circuit breaking rules.

Scenarios

Circuit breaking for slow calls is suitable for timeout scenarios where circuit breaking for abnormal calls can also be triggered. Unlike circuit breaking for abnormal calls, circuit breaking for slow calls allows you to dynamically adjust the RT value that is used to determine slow calls without considering timeout settings.

GUI description

On the left side of the Slow Call Circuit Breaking section, you can view the circuit breaking events that are reported for slow calls. On the right side of the section, you can view the 10 application interfaces with the highest average RT in the previous 5 minutes.

Events are reported for nodes and interfaces on which circuit breaking is triggered for slow calls in the previous 5 minutes. The event reporting interval is 5 minutes.

Parameter

Description

ON

  • Close: Circuit breaking for slow calls is disabled.

  • Enable: Circuit breaking is triggered if the slow call percentage exceeds the configured threshold.

Slow Call RT (ms)

Calls whose RT exceeds the value of this parameter are considered slow calls.

Degradation Threshold (%)

If the percentage of request calls whose RT values are greater than the value of Slow Call RT (ms) exceeds the threshold specified by this parameter, circuit breaking is triggered.

Exception Settings

For more information, see Exception settings.

Advanced Settings

Statistics Window Duration (s)

The length of the statistics time window. You can specify the length of the time window from 1 second to 120 minutes.

Circuit Breaking Duration (s)

The period in which circuit breaking is implemented. If circuit breaking is implemented on the resources, all requests quickly fail in the configured duration.

Minimum number of requests

The minimum number of requests to trigger circuit breaking. If the number of requests in the current time window is less than the value of this parameter, circuit breaking is not triggered even if the circuit breaking rule is met.

Fuse recovery strategy

Specifies how the circuit breaker resumes traffic after the circuit breaking period elapses. Valid values:

  • Single detection recovery: After the circuit breaking period elapses, the circuit breaker detects the next request. If a slow call or an abnormal call does not occur on the request, the circuit breaking process ends. Otherwise, circuit breaking is triggered again.

  • Progressive recovery: If you select this option, you must set the Number of recovery phases and Minimum number of passes per step parameters.

      • After the circuit breaking period elapses, the circuit breaker performs progressive recovery based on the specified number of recovery stages. If the number of requests in a stage reaches the value of the Minimum number of passes per step parameter, a check is triggered. If the number of checked requests does not exceed the configured threshold, the percentage of requests that are allowed to pass is gradually increased until all the requests are allowed to pass. If the number of checked requests exceeds the configured threshold in a stage, circuit breaking is triggered again.

      • The request percentage is calculated based on the following formula: Request percentage (T) = 100/Number of recovery stages (N). The request percentage in the first stage is T, and the request percentage in the second stage is 2T. The percentage increases by T in each stage until it reaches 100%.

      • For example, if the number of recovery stages is 3 and the minimum number of requests that are allowed to pass in each stage is 5, requests are admitted in the three stages at 33%, 67%, and 100%. If the number of requests in a stage is greater than or equal to 5, a check is triggered. If the number of requests in a stage is less than 5, the system enters the next recovery stage until all the requests are allowed to pass.

Exception settings

Description

You can configure exception settings for all system protection features. Requests on the interfaces that are listed in the exception settings are allowed to pass directly without rule checking.
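Conceptually, the exception list is consulted before any system protection rule is evaluated. The sketch below illustrates that order. The function name, the interface paths, and the `system_check` callback are hypothetical.

```python
from typing import Callable, FrozenSet


def admit(interface: str, exceptions: FrozenSet[str],
          system_check: Callable[[], bool]) -> bool:
    """Admit exempt interfaces directly; run the system rule for all others."""
    if interface in exceptions:
        return True                   # listed interface: no rule checking
    return system_check()


exceptions = frozenset({"/health"})


def node_overloaded_check() -> bool:
    """Stand-in for a system protection rule that currently rejects requests."""
    return False


print(admit("/health", exceptions, node_overloaded_check))  # True
print(admit("/orders", exceptions, node_overloaded_check))  # False
```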

Note

To configure exception settings, you must make sure that the agent version is 4.2.0 or later.

Scenarios

In most cases, you need to configure exception settings only for health check interfaces and key interfaces of the system. Exempting health check interfaces prevents throttling from affecting the health status of nodes. Key interfaces of the system have separate throttling limits and are not expected to be subject to the system-wide throttling mechanism.

GUI description

In the Available Interfaces section on the left side, the interfaces that are recently called are displayed. For the interfaces that are not displayed, you can enter the interface name in the search box, click the search icon to search for the interface, and then add the interface to the Selected Interfaces section on the right side.

FAQ

What is the relationship between system protection and traffic protection?

  • Both system protection and traffic protection can ensure that applications are in a steady state. However, the scenarios and traffic loss of system protection and traffic protection are different.

  • After throttling is triggered, the system returns the 429 HTTP status code. Custom configurations are not supported.

  • System protection provides traffic protection based on node-level metrics. This ensures that applications are in a steady state in most scenarios. System protection is implemented from the aspect of applications, and the same system protection rules are applied to all interfaces of an application. However, the interfaces of an application have different levels of importance and have different impacts on system loads. Traffic protection allows you to configure different thresholds for different interfaces to cover more scenarios and minimizes the amount of traffic that is throttled.

  • Both system protection and traffic protection can provide protection capabilities. However, traffic protection delivers better protection performance in terms of scenario coverage and traffic loss. Compared with traffic protection, system protection allows you to configure settings in a simpler way. Therefore, we recommend that you follow best practices for using system protection together with traffic protection. System protection helps ensure the stability of applications, and traffic protection helps reduce the amount of throttled traffic based on fine-grained configurations without compromising protection performance.

References

For more information about traffic protection policies, see Traffic protection.