
Microservices Engine: FAQ about graceful start and shutdown

Last Updated: Oct 16, 2024

This topic provides answers to some frequently asked questions about the graceful start and shutdown solution provided by Microservices Governance of Microservices Engine (MSE).

How do I configure the advanced settings of the graceful start and shutdown solution in the new version of Microservices Governance in the MSE console? What is the impact of the new version of Microservices Governance on applications? How do I enable or disable the features provided by the graceful start and shutdown solution in the new version of Microservices Governance?

Important

If the new version of Microservices Governance is enabled for your applications in the MSE console, this question does not apply to you.

Advanced settings for the graceful start and shutdown solution are provided in the old version of Microservices Governance in the MSE console. However, these settings involve many microservices concepts, so for ease of use they are hidden in the new version of Microservices Governance. The advanced settings in the old version involve the following features:

  • Completion of service registration before readiness probe

    In the new version of Microservices Governance in the MSE console, the feature to complete service registration before the readiness probe is still provided and is enabled by default. If you did not enable this feature for your application in the old version, it is automatically enabled after you enable the graceful start feature in the new version. If you enabled this feature in the old version, we recommend that you do not disable it in the new version. This feature has no negative impact on your application and can prevent the risk of traffic dropping to zero in specific scenarios during version releases. For more information, see What is 55199/health? If I do not configure 55199/health, traffic may drop to zero. Why?

  • Completion of service prefetching before readiness probe

    The feature to complete service prefetching before the readiness probe is no longer provided in the new version of Microservices Governance. This feature was designed to ensure that prefetching meets expectations and to prevent a sudden surge in the queries per second (QPS) of prefetched services in the line chart. The feature works by delaying the time at which a Kubernetes readiness probe check passes. This prolongs the overall version release time, but it also gives new nodes sufficient time to prefetch and keeps old nodes processing a share of the traffic during prefetching, so that the QPS of prefetched services gradually increases over time. If all old nodes are disabled while the new nodes are still prefetching, the new nodes must process all online traffic, and low-traffic service prefetching cannot be performed. If you enabled this feature for your application in the old version of Microservices Governance, you can continue to use it in the new version, and it has no negative impact on your application. However, you cannot enable this feature for new applications when you configure rules in the new version. If you enabled this feature in the old version, we recommend that you do not disable it in the new version. To achieve the expected effect of low-traffic service prefetching in the new version of Microservices Governance, follow the instructions in What is the best practice for low-traffic service prefetching?

How does the low-traffic service prefetching feature work? If I want to use low-traffic service prefetching, I need to enable Microservices Governance for both consumers and providers. Why?

When a consumer calls a service, it selects one of the service's providers. If low-traffic service prefetching is enabled for providers, Microservices Governance optimizes this provider selection process. The weight of each provider is calculated as a percentage value (0% to 100%), and the consumer preferentially calls providers that have high weights. When a provider for which low-traffic service prefetching is enabled starts, the weight that consumers calculate for it is low, so consumers call the prefetching node with a low probability. The calculated weight gradually increases over time until it reaches 100%. At that point, prefetching is complete, and the node receives traffic as a normal node. The provider includes its startup time in the metadata of its service registration information, and the consumer calculates the provider's weight based on that startup time. This is why Microservices Governance must be enabled for both consumers and providers.
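
For example, assume a linear ramp-up and a prefetching duration of 120 seconds (both the curve shape and the duration here are illustrative, not the documented behavior): a provider node that started 30 seconds ago is assigned a weight of 30/120 = 25%, so the consumer routes only about a quarter of a normal node's share of traffic to it. After 120 seconds, the calculated weight reaches 100%, and the node receives traffic as a normal node.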

Why is the prefetching data trend in the curve of the line chart not as expected? How do I resolve this issue?

Note

Before you read this answer, we recommend that you learn about the principles of low-traffic service prefetching.

The following line chart shows an example of the QPS curve for low-traffic service prefetching in normal cases.

[Figure: QPS curve of a service during normal low-traffic prefetching]

In some cases, however, the QPS value in the curve does not gradually increase over time for applications for which low-traffic service prefetching is performed. The following content describes the two common situations that you may encounter.

  • The QPS value surges at a point in time

    [Figure: QPS curve that surges at a point in time]

    In most cases, this situation occurs during service releases. If old nodes are disabled before new nodes are completely prefetched, consumers have no choice but to call the new nodes even though their weights are still low. In the preceding QPS line chart, the QPS value of a new node slowly increases and then surges after all the old nodes are disabled. You can resolve this issue by following the instructions in What is the best practice for low-traffic service prefetching?

  • The QPS value does not gradually increase

    [Figure: QPS curve that does not gradually increase]

    If this situation occurs, check whether Microservices Governance is enabled for the consumer. If it is not, enable Microservices Governance for the consumer to resolve this issue. If the traffic of the application to be prefetched comes from an external service for which Microservices Governance is not enabled, such as a Java gateway, low-traffic service prefetching is not supported in this scenario.

What is the best practice for low-traffic service prefetching?

Incomplete prefetching often occurs during rolling deployments. You can refer to the following practices to ensure that service prefetching meets expectations.

  • (Recommended) Configure the minimum ready time: You can configure .spec.minReadySeconds for your workload to specify the interval between the time a pod is declared ready and the time it becomes available. Make sure that the specified interval is longer than the low-traffic service prefetching duration of the pod. This way, your Kubernetes cluster proceeds with the rolling deployment only after each pod is completely prefetched. If your cluster is a Container Service for Kubernetes (ACK) cluster, you can go to the ACK console, find the workload on the Deployments page, click More in the Actions column, and select Upgrade Policy. In the Upgrade Policy dialog box, configure the Minimum Ready Time (minReadySeconds) parameter so that the system continues the deployment only after each newly started pod has been ready for the specified period of time. For a configuration sketch, see the example after this list.

  • (Recommended) Use the batch deployment method: You can use methods or tools such as OpenKruise to deploy workloads in batches. The deployment interval between the batches must be longer than the low-traffic service prefetching duration. After the new nodes in a batch are fully prefetched, you can continue to release the next batch of nodes.
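
The following is a minimal sketch of the minReadySeconds setting, assuming a low-traffic prefetching duration of 120 seconds. The workload name, image, and all values are illustrative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-provider                 # Illustrative workload name.
spec:
  # A pod that has been ready for this long is treated as available.
  # Set it longer than the low-traffic prefetching duration (120 seconds
  # in this example) so that the rolling update proceeds only after each
  # new pod is completely prefetched.
  minReadySeconds: 180
  replicas: 3
  selector:
    matchLabels:
      app: demo-provider
  template:
    metadata:
      labels:
        app: demo-provider
    spec:
      containers:
        - name: app
          image: example-registry/demo-provider:1.0.0   # Illustrative image.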

You can also increase the value of the initialDelaySeconds parameter to resolve the issue, but we recommend that you do not use this method. The initialDelaySeconds parameter specifies the time to wait before the first readiness probe of the workload runs. You must make sure that the parameter value is greater than the sum of the low-traffic service prefetching duration, the delayed registration duration, and the application startup time. Take note that the application startup time must be obtained from actual log outputs and varies as your business evolves. In addition, if you delay the time at which the readiness probe passes, the newly started pods may be slow to be added to the endpoints of the Kubernetes Services. Therefore, to ensure the optimal prefetching effect, we recommend that you do not use this method. A configuration sketch follows.
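
If you still choose this method, the following sketch shows where the parameter is configured. The probe path, port, and values are illustrative; the value must exceed the sum described above.

readinessProbe:
  httpGet:
    path: /health              # Illustrative probe path.
    port: 8080                 # Illustrative application port.
  # Must be greater than the prefetching duration + delayed registration
  # duration + application startup time.
  initialDelaySeconds: 240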

Note

If you follow the best practices and the prefetching QPS curve still does not meet your expectations, check whether all the traffic received by the application comes from consumers for which Microservices Governance is enabled. If Microservices Governance is not enabled for some consumers, or if some traffic comes from external load balancers, the QPS curve during application prefetching will not meet expectations.

What is 55199/health? If I do not configure 55199/health, traffic may drop to zero. Why?

55199/health is a built-in HTTP readiness check port provided by MSE Microservices Governance. If 55199/health is configured for the Kubernetes readiness probe of your application, the probe returns the status code 500 until a newly started node completes service registration, and returns the status code 200 after the node completes service registration.

Based on the default deployment policy of Kubernetes, old nodes can be disabled only after new nodes are ready. If 55199/health is configured for the readiness probe, a new node becomes ready only after it completes service registration, so old nodes are disabled only after new nodes are registered. This ensures that the service registered with the registry always has available nodes. If you do not configure 55199/health for the readiness probe, old nodes may be disabled before new nodes are registered during a release. As a result, the service temporarily has no available nodes in the registry, and all consumers of the service report errors when they call providers. This may cause traffic to drop to zero. Therefore, we strongly recommend that you enable the graceful start feature and configure 55199/health for the readiness probe of your application.
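
The following is a minimal sketch of such a readiness probe, assuming that the built-in check is exposed as an HTTP GET on path /health over port 55199, as the name 55199/health suggests:

readinessProbe:
  httpGet:
    path: /health
    port: 55199
  periodSeconds: 5             # Illustrative probe interval.
  # The probe fails (status code 500) until the node completes service
  # registration, and succeeds (status code 200) afterward.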

Why are prefetching events of my application reported before service registration events? What do I do if this issue occurs?

In the current version, when a service receives its first external request, the service starts the prefetching process and reports a prefetching start event. In some cases, however, the first request that an application receives may not be a microservices call, so prefetching is triggered by a request that is not business traffic. For example, if you configure a Kubernetes HTTP liveness probe for the workload of an application, the system considers prefetching to have started as soon as Kubernetes performs a probe check on a newly started node, even if the node has not yet registered its services.
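
For example, an HTTP liveness probe similar to the following sends requests to the application as soon as the container starts, and such a probe request may be counted as the first request that triggers prefetching. The path, port, and values here are illustrative.

livenessProbe:
  httpGet:
    path: /healthz             # Illustrative path served by the application.
    port: 8080                 # Illustrative application port.
  initialDelaySeconds: 10
  periodSeconds: 10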

To prevent this issue, you can configure the following parameter in the environment variables of the provider. This way, the system no longer triggers the prefetching logic for these requests.

# Disable the triggering of the prefetching for requests whose paths are /xxx or /yyy/zz.
profile_micro_service_record_warmup_ignored_path="/xxx,/yyy/zz"
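
For reference, the following is a minimal sketch of how this environment variable can be set in a Kubernetes container spec. The container name and image are illustrative; the paths follow the example above.

containers:
  - name: provider                              # Illustrative container name.
    image: example-registry/provider:1.0.0      # Illustrative image.
    env:
      # Disable the triggering of the prefetching for requests whose
      # paths are /xxx or /yyy/zz.
      - name: profile_micro_service_record_warmup_ignored_path
        value: "/xxx,/yyy/zz"
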
Important

  • This parameter can also be configured as a Java virtual machine (JVM) startup parameter.

  • The value of this parameter does not support regular expression matching.

What is proactive notification? When do I need to enable proactive notification?

Proactive notification is an advanced capability provided by the graceful shutdown module. This capability allows a Spring Cloud provider that is going offline to proactively initiate a network request that notifies consumers of its offline state. After a consumer receives the notification, the consumer no longer calls the provider node. In normal cases, if both providers and consumers use the Spring Cloud framework, consumers cache the provider node list locally. In specific scenarios, even if a consumer receives a notification from the registry, the local cache may not be refreshed at the earliest opportunity. As a result, the consumer still initiates calls to offline nodes. The proactive notification feature is introduced to resolve this issue.

By default, the proactive notification feature is disabled. In the default graceful shutdown solution, after Microservices Governance is enabled, a provider that is going offline adds a special header to the response of each request that it receives during shutdown. When the consumer receives such a response, it identifies the header and no longer calls that provider node. Therefore, if a consumer sends requests to a provider that is going offline, the consumer can detect the offline state and automatically remove the node from its provider node list. However, if the consumer does not send any request to the provider within the grace period (about 30 seconds) before the provider goes offline, the consumer may not detect the offline state and may send a request right when the provider goes offline. As a result, a request error is reported. In this case, you can enable the proactive notification feature. We recommend that you enable proactive notification for providers whose consumers send few requests at long intervals.

After a graceful shutdown event is reported, traffic does not drop to zero in a short period of time. Why?

In most cases, traffic drops to zero in a short period of time after a graceful shutdown event is reported. The following content describes possible causes of, and solutions to, the issue that traffic does not drop to zero.

  • The application receives requests from non-microservices sources such as external load balancers, or requests initiated by methods such as local scripts and scheduled tasks.

    The graceful start and shutdown solution supports only requests from microservice applications for which Microservices Governance is enabled, and is not supported in the preceding scenarios. We recommend that you build custom solutions based on the graceful shutdown features provided by your infrastructure and frameworks.

  • The proactive notification feature is not enabled for the application. (For more information, see What is proactive notification? When do I need to enable proactive notification?)

    We recommend that you check whether the curve in the line chart meets your expectations after you enable the proactive notification feature.

  • The version of the framework that is used by the application is not supported by the graceful start and shutdown solution. For more information about frameworks that are supported by the graceful start and shutdown solution of Microservices Governance, see Java frameworks supported by Microservices Governance.

    If the framework version of your application is not supported, upgrade your application to a supported framework version.