Ensure the availability of HSF applications by removing outlier instances

In a microservices framework, if service consumers cannot detect abnormal instances of service providers, calls routed to those instances fail. This in turn reduces the serviceability and availability of the service consumers. The outlier instance removal feature monitors the availability of High-Speed Service Framework (HSF) applications and service instances and dynamically adjusts the set of instances that receive calls. This ensures successful service calls and improves service stability and quality of service (QoS).

Background information

The following figure shows a system that requires outlier ejection. In this example, the system has Applications A, B, C, and D, and Application A calls the instances of Applications B, C, and D. If instances of Application B, C, or D become abnormal and Application A cannot identify them, some of the calls initiated by Application A fail. In the figure, Application B has one abnormal instance, and Applications C and D have two abnormal instances each. If Applications B, C, and D have a large number of abnormal instances, the service performance and availability of Application A may be affected.

To ensure the service performance and availability of Application A, you can configure an outlier ejection policy for Application A. After the policy is configured, Enterprise Distributed Application Service (EDAS) can monitor the instance status of Applications B, C, and D, and dynamically add or remove instances to ensure successful service calls.

The following content describes the process of outlier ejection:

EDAS detects whether Application B, C, or D has abnormal instances. If abnormal instances are found, EDAS determines whether to remove the abnormal instances from the application based on the Instance Removal Rate Threshold parameter.
EDAS does not distribute the call requests of Application A to the removed instances.
EDAS detects whether the abnormal instances are recovered based on the Recovery Detection Unit Time parameter.
The detection interval increases linearly with the number of detections, in multiples of the value of the Recovery Detection Unit Time parameter. The default value of Recovery Detection Unit Time is 30000 ms, which equals 0.5 minutes. After the number of detections reaches the threshold specified by the Max Number of Instance Checked Before Restoration parameter, EDAS continues to detect whether the abnormal instances are recovered at the maximum detection interval.
After the abnormal instances are recovered, EDAS adds the instances back to the application to process call requests. The detection interval is reset to the value of the Recovery Detection Unit Time parameter, such as 30000 ms.
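The recovery detection schedule described above can be illustrated with a minimal sketch. It assumes that the first check happens one unit time after removal and that the n-th consecutive check waits n unit times, capped at the maximum multiple. The function name and the sample values are hypothetical and are not part of EDAS.

```python
def recovery_check_intervals(unit_ms, max_multiple, failed_checks):
    """Return the wait time (ms) before each recovery check for an instance
    that stays abnormal for `failed_checks` consecutive checks."""
    intervals = []
    for attempt in range(1, failed_checks + 1):
        # The interval grows by one unit per check until the cap is reached.
        multiple = min(attempt, max_multiple)
        intervals.append(multiple * unit_ms)
    return intervals


# Assumed sample values: 30000 ms unit time, cap of 3 multiples, 5 failed checks.
print(recovery_check_intervals(30000, 3, 5))
# [30000, 60000, 90000, 90000, 90000]
```

After an instance is recovered, the schedule starts again from the first entry, which matches the reset to the Recovery Detection Unit Time value.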

Note

If the ratio of abnormal instances of a provider exceeds the threshold that is specified by the Instance Removal Rate Threshold parameter, EDAS removes abnormal instances only up to this threshold.
If the provider has only one instance available, EDAS does not remove this instance even if the threshold specified by the Error Rate Threshold parameter is exceeded.
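The two rules in this note can be combined into a rough calculation of how many instances may be removed. The sketch below uses the rounding rule from the parameter table in this topic (round down, but allow one instance when the result is less than 1). It also assumes that at least one instance always stays in service. The function name is hypothetical.

```python
import math
from fractions import Fraction


def max_removable(total_instances, max_isolation_rate):
    """Estimate the maximum number of instances that may be removed."""
    if total_instances <= 1:
        # A provider with a single instance keeps that instance.
        return 0
    # Fraction(str(...)) avoids binary floating-point rounding surprises.
    cap = math.floor(total_instances * Fraction(str(max_isolation_rate)))
    # A result below 1 still allows one removal.
    cap = max(1, cap)
    # Assumption: never remove the last remaining instance.
    return min(cap, total_instances - 1)


print(max_removable(6, 0.6))  # 3
print(max_removable(4, 0.2))  # 1
print(max_removable(1, 0.2))  # 0
```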

Create an outlier instance removal policy

For HSF applications, you can create application- and service-level outlier instance removal policies.

Log on to the EDAS console.
In the left-side navigation pane, choose Application Management > Microservice Configurations > Configurations.
In the top navigation bar, select a region. On the Configurations page, select a microservices namespace from the drop-down list. Then, click Create Configuration.
In the Create Configuration panel, set the parameters. Then, click Create in the lower part of the panel.
The following section describes the parameters for creating an outlier instance removal policy:
- Region: The value is the region that you select before you create the outlier instance removal policy and cannot be changed.
- Microservice Namespace: The value is the namespace that you select before you create the outlier instance removal policy and cannot be changed.
- Data ID: Enter an ID for the outlier instance removal policy in the format of <App ID>.QOSCONFIG. You can obtain the App ID (ID of an application) on the details page of the application.
- Group: The value is HSF and cannot be changed.
- Data encryption: Turn on or off the switch to specify whether to encrypt the data. If the outlier instance removal policy contains sensitive data, we recommend that you turn on Data encryption to reduce the risk of data leaks.
- Configuration Format: Select a data format for the content of the outlier instance removal policy. The system verifies the data based on the format that you select.
- Configuration Content: Enter the content of the outlier instance removal policy.
  You can create an outlier instance removal policy for an HSF application at the application level or the service level by specifying the related properties and their values. The following examples show how to create policies at these two levels.
  Note: A service-level outlier instance removal policy takes precedence over an application-level outlier instance removal policy.
  - Example on how to create an application-level outlier instance removal policy
```
{
  "DEFAULT": {
    "errorRateThreshold": 0.5,
    "isolationTime": 60000,
    "maxIsolationRate": 0.2,
    "maxIsolationTimeMultiple": 15,
    "qosEnabled": true,
    "requestThreshold": 20,
    "timeWindowInSeconds": 10,
    "ipDimension": true
  }
}
```
  - Example on how to create a service-level outlier instance removal policy
```
{
  "DEFAULT": {
    "errorRateThreshold": 0.5,
    "isolationTime": 60000,
    "maxIsolationRate": 0.2,
    "maxIsolationTimeMultiple": 15,
    "qosEnabled": true,
    "requestThreshold": 20,
    "timeWindowInSeconds": 10
  },
  "service:version": {
    "errorRateThreshold": 0.5,
    "isolationTime": 60000,
    "maxIsolationRate": 0.2,
    "maxIsolationTimeMultiple": 15,
    "qosEnabled": true,
    "requestThreshold": 20,
    "timeWindowInSeconds": 10
  }
}
```
  If you have other requirements, see Parameters for creating an outlier instance removal policy.
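If you generate the configuration content with a script, a sketch such as the following may help. It builds the Data ID in the <App ID>.QOSCONFIG format, serializes a policy to JSON, and shows one possible reading of the precedence rule: the block of a matching service key is used as a whole, and DEFAULT is used otherwise. The App ID, the service key com.example.OrderService:1.0.0, and the whole-block lookup are assumptions for illustration only.

```python
import json


def build_data_id(app_id):
    """Return the Data ID for an outlier instance removal policy."""
    return f"{app_id}.QOSCONFIG"


def effective_policy(config, service_key):
    """Pick the service-level block if present, else the application-level one.
    Assumption: the service-level block replaces DEFAULT as a whole."""
    return config.get(service_key, config["DEFAULT"])


default_policy = {
    "errorRateThreshold": 0.5,
    "isolationTime": 60000,
    "maxIsolationRate": 0.2,
    "maxIsolationTimeMultiple": 15,
    "qosEnabled": True,
    "requestThreshold": 20,
    "timeWindowInSeconds": 10,
}

config = {
    "DEFAULT": default_policy,
    # Hypothetical service key in the "service:version" form.
    "com.example.OrderService:1.0.0": {**default_policy, "errorRateThreshold": 0.3},
}

data_id = build_data_id("my-app-id")      # hypothetical App ID
content = json.dumps(config, indent=2)    # paste into Configuration Content

print(data_id)
print(effective_policy(config, "com.example.OrderService:1.0.0")["errorRateThreshold"])
```

Serializing with json.dumps guarantees that the content passes JSON format verification before you paste it into the console.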

Parameters for creating an outlier instance removal policy

You can create an outlier instance removal policy by using the related properties in configuration management, or by using -D parameters of the Java Virtual Machine (JVM). Policies created in configuration management take precedence over those created by using -D parameters. We recommend that you create outlier instance removal policies in configuration management.
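If you choose the -D parameters, the following sketch converts a set of properties into JVM flags by using the property-to-parameter mapping from the table in this section. The helper name and the lowercase true/false formatting of Boolean values are assumptions. How the flags are passed to your application depends on your startup command.

```python
# Mapping of configuration properties to -D parameters, taken from the table
# in this section.
PROPERTY_TO_FLAG = {
    "requestThreshold": "-Dhsf.qos.request.threshold",
    "errorRateThreshold": "-Dhsf.qos.error.rate.threshold",
    "maxIsolationRate": "-Dhsf.qos.max.isolation.rate",
    "isolationTime": "-Dhsf.qos.isolation.time",
    "maxIsolationTimeMultiple": "-Dhsf.qos.max.isolation.time.multiple",
    "qosEnabled": "-Dhsf.qos.enable",
    "timeWindowInSeconds": "-Dhsf.qos.time.window.in.seconds",
    "bizExceptionPredicateClassName": "-Dhsf.qos.biz.exception.class.name",
}


def to_jvm_flags(policy):
    """Render a policy dict as a list of -D flags."""
    flags = []
    for prop, value in policy.items():
        if prop not in PROPERTY_TO_FLAG:
            raise KeyError(f"no -D parameter is listed for property {prop!r}")
        if isinstance(value, bool):
            # Assumption: Boolean values are written as lowercase true/false.
            value = "true" if value else "false"
        flags.append(f"{PROPERTY_TO_FLAG[prop]}={value}")
    return flags


print(" ".join(to_jvm_flags({"qosEnabled": True, "requestThreshold": 20})))
# -Dhsf.qos.enable=true -Dhsf.qos.request.threshold=20
```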


Parameter	Property	-D parameter	Description	Default value
Maximum number of calls	requestThreshold	-Dhsf.qos.request.threshold	The call count threshold. An outlier instance is removed only when the number of calls in the most recent statistics window exceeds this threshold.	10
Lower error rate limit	errorRateThreshold	-Dhsf.qos.error.rate.threshold	The lower limit of the error rate. When the error rate of an instance of the called application or service exceeds this limit, the instance is removed.	0.5
Upper limit of instance removal ratio	maxIsolationRate	-Dhsf.qos.max.isolation.rate	The maximum proportion of instances that can be removed. After this limit is reached, no more abnormal instances are removed. For example, if an application has six instances and you set this parameter to 60%, the calculated limit is 6 × 60% = 3.6, which is rounded down to 3, so at most three instances can be removed. If the calculation result is less than 1, one instance is removed.	0.2
Recovery detection unit time	isolationTime	-Dhsf.qos.isolation.time	The unit time used to detect whether abnormal instances are recovered. After abnormal instances are removed, EDAS keeps detecting whether they are recovered, and the detection interval grows by the specified unit time after each detection. Unit: ms.	60 × 1,000 ms (1 minute)
Maximum cumulative number of times not restored	maxIsolationTimeMultiple	-Dhsf.qos.max.isolation.time.multiple	The maximum number of detections that lengthen the detection interval. EDAS continuously detects abnormal instances, and the detection interval increases linearly with the number of detections, in multiples of the recovery detection unit time. After the specified maximum number of detections is reached, EDAS continues to detect whether abnormal instances are recovered at the longest detection interval. For example, the recovery detection unit time is set to 60,000 ms, and this parameter is set to 60. If an abnormal instance remains abnormal after it is detected 60 times, the instance is subsequently detected at intervals of 60 minutes, which is calculated by using the following formula: 60 × 60,000 ms = 3,600,000 ms = 60 minutes. If the instance is recovered before the specified maximum number of detections is reached, the detection interval is reset to the initial interval, which is the value of the recovery detection unit time.	60
Enable outlier instance removal	qosEnabled	-Dhsf.qos.enable	Specifies whether to enable outlier instance removal for the application or service.	false
Time window for statistics	timeWindowInSeconds	-Dhsf.qos.time.window.in.seconds	The statistical period, in seconds, over which calls are counted for comparison with the maximum number of calls.	10 s
Exception type	bizExceptionPredicateClassName	-Dhsf.qos.biz.exception.class.name	The class that determines which service exceptions of the application or service are counted as exceptions. By default, all service exceptions are counted. You can use one of the following values: com.taobao.hsf.exception.CountBizExceptionPredicate counts all service exceptions as exceptions. com.taobao.hsf.exception.IgnoreBizExceptionPredicate ignores all service exceptions. To count only specific service exceptions, specify a custom bizExceptionPredicate class in the application code that implements com.taobao.hsf.Predicate.	com.taobao.hsf.exception.CountBizExceptionPredicate: defines all service exceptions as exceptions.
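The way requestThreshold and errorRateThreshold work together can be sketched as follows. The sketch reads both conditions as strict comparisons, as they are worded in the table, and uses the default values from the table. The function name and its inputs are hypothetical; the actual statistics are collected by EDAS.

```python
def should_remove(calls, errors, request_threshold=10, error_rate_threshold=0.5):
    """Decide whether an instance qualifies for removal in one statistics window."""
    if calls <= request_threshold:
        # Too few calls in the window to judge the instance.
        return False
    return errors / calls > error_rate_threshold


print(should_remove(calls=20, errors=15))  # True: 20 > 10 and 0.75 > 0.5
print(should_remove(calls=10, errors=10))  # False: call count does not exceed 10
print(should_remove(calls=20, errors=10))  # False: 0.5 does not exceed 0.5
```

An instance that qualifies is still subject to the upper limit of the instance removal ratio.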

Verify the result

The outlier ejection feature takes effect after you create an outlier ejection policy. To verify the policy, go to the details page of the application for which you configured outlier ejection and view the application monitoring information. For example, on the Topology tab, you can check whether call requests are still forwarded to abnormal instances and whether Error Rate / 1 Min for application calls is higher than the value of the Error Rate Threshold parameter. For more information, see Application overview.