Microservices: How to Release New Versions under Heavy Traffic

1. Microservices Threshold Reduced: The Key to Effective Utilization

Research data indicates that about 70% of production failures result from changes. How do enterprise applications on Alibaba Cloud address the stability risks that changes bring and release smoothly during periods of heavy daytime traffic? This article unveils the end-to-end graceful release solution for microservices on Alibaba Cloud.

As microservice frameworks have been open-sourced and their ecosystem has matured, microservice technology has become increasingly standardized. The figure below illustrates the choice between monolithic and microservice architectures, where the x-axis represents system complexity and the y-axis represents productivity. The green curve represents monolithic architecture, while the blue curve represents microservice architecture. As system complexity increases, the two curves cross at an inflection point, and to the right of that point microservices are the more suitable choice. The standardization of microservice technology has greatly lowered the threshold for adopting microservices on the cloud, so the inflection point keeps moving left. As a result, more and more enterprises are adopting microservices. Once an enterprise uses microservices at scale, the key to making full use of them lies in stability and efficiency.

2. The High Risk of Microservice Change Leads to a Midnight Release

Let's take a look at the practical demands of some customers. For instance, at one tea beverage company, changes accounted for more than 60% of incidents. As a result, new versions had to be released during off-peak hours, usually in the early morning. Similarly, a new energy company required high business continuity, with its core services online 24/7, while rapid business growth also demanded efficient iteration. To keep the business stable, it too had to release during off-peak hours, typically early in the morning. Another technology company lacked online canary capabilities, so every full release carried risk and its impact could not be controlled.

According to research data, about 70% of production failures are caused by changes. Besides failures in the application itself, runtime risks can be summarized as uncertain traffic, unstable calls, and unstable infrastructure. The solution discussed here aims to eliminate change risks and thereby address the stability risks of production.

First, let's analyze why microservice applications cannot be released during the day. Compared with a monolithic application, a microservice application has a complex call topology and dependency structure. If its services do not start and shut down gracefully, each microservice will be briefly unavailable during a release, and a large number of exceptions will occur within a short period, resulting in business interruption.

Imagine accidentally clicking redeploy on an application: the alert group immediately receives a flood of alerts that the order success rate has dropped. It is very scary. In addition, the release is the last step in bringing a feature update online. Problems accumulated during design, R&D, and testing may only be triggered at this final step. If a release involves multiple applications, how can we release them without traffic loss caused by version mismatches? We don't want the same user to hit the new version when placing an order but the old version at the payment stage.

The following figure shows the overall request results of a stress test on an open source Spring Cloud application that I restarted while the test was running. We can see a large number of exceptions. In a production system under heavy daytime traffic, even a small problem is rapidly amplified by the traffic volume, and its impact becomes uncontrollable. Therefore, many enterprises and business teams prefer to release in the early morning.

3. Graceful End-to-End Release Provided by MSE to Eliminate Change Risks

How can we thoroughly resolve the stability risks of changes? The answer is the subject of this section: the graceful end-to-end release provided by MSE.

We can analyze the problem from two perspectives:

If there is no problem with our new version of the code, how do we solve the problem of traffic loss during the process of service start and shutdown?
If there is a problem with our new version of the code, how can we control the impact of the problem?

First of all, if there is no problem with the code, why is traffic lost while services start and shut down? When a service provider shuts down, the service consumer cannot detect in real time that the node is gone and keeps calling its address, so requests keep failing. If the provider is stopped abruptly while it is executing requests, the in-transit and in-process requests fail, and in severe cases online data becomes inconsistent. Application startup has similar problems. For example, if a newly started node takes heavy traffic before it has warmed up, it can be overwhelmed, and in severe cases the service cannot even be scaled out under heavy traffic. Likewise, if the lifecycle of a microservice is not aligned with the Kubernetes readiness and liveness checks, request exceptions may occur during the release.

If there is a problem with our new version of the code, we want to limit its impact as much as possible during the release. Alibaba's best practices for safe production state that every change should support canary release, observability, and rollback. A canary release limits the impact of a problem by exposing the new version to only a small share of traffic. Without it, a bug in the new business code reaches all users, so in most cases the only option is to release during off-peak hours (early morning). Microservices require an end-to-end canary release. However, it is currently difficult to achieve this with a self-built solution based on open source components. If canary traffic escapes its lane, it flows into the production environment, which introduces additional uncontrollable risks. Rollback requires us to handle a problem quickly after it appears. Without a complete set of rollback solutions and strategies, a rollback may be slow or may introduce an even larger problem, and neither is acceptable.

How to Solve the Problem of Traffic Loss in the Process of Microservice Start and Shutdown

If there is no problem with our new version of the code, how do we solve the problem of traffic loss during the process of service start and shutdown?

Reducing unnecessary API errors provides the best user experience and the best microservice development experience. However, how can we solve this troublesome problem in the field of microservices? Before that, let's first analyze why our microservices may suffer from traffic loss during the shutdown process.

Principle Analysis: Graceful Shutdown

As shown on the right in the figure above, this is a normal process for a microservice node to be shut down.

1. Before the node shuts down, the consumer calls the service provider according to the load balancing rules, and the business is normal.
2. Service provider node A is about to shut down. An operation on that node triggers the signal to stop the Java process.
3. While the Java process is stopping, the service provider node sends a deregistration (logoff) request to the registry center.
4. After receiving the change to the service provider node list, the registry center notifies the consumer that a node in the service provider list has been shut down.
5. After the service consumer receives the new list of service provider nodes, it refreshes the client's address list cache and then recalculates routing and load balancing based on the new address list.
6. Eventually, the service consumer no longer calls the shutdown node.

Although the process of microservices shutdown is relatively complicated, the whole process is still very logical. Since the microservices architecture detects nodes through service registration and discovery, it naturally detects node shutdown change in the same way.

The practical data below may change your view. From step 2 to step 6, Eureka takes up to 2 minutes in the worst case, and even Nacos takes up to 50 seconds. In step 3, all Dubbo versions before 3.0 use the service-level registration and discovery model, which puts the registry center under great pressure when there are many services. Assuming each registration or logoff takes 20~30 ms, registering or logging off 500 or 600 services takes nearly 15 seconds. In step 5, the Ribbon load balancer used by Spring Cloud refreshes its address cache every 30 seconds by default, which means that even if the client receives the shutdown signal from the registry center in real time, it may still send requests to the old node for a period of time.
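The registration arithmetic above can be checked with a short calculation. This is only a sketch: the function name is hypothetical, and it assumes the calls are made strictly one after another, which the text implies but does not state.

```python
def serial_registration_seconds(service_count: int, per_call_ms: float) -> float:
    """Total time to register (or log off) services one after another."""
    return service_count * per_call_ms / 1000.0


# Figures quoted in the text: 20~30 ms per call, 500 or 600 services.
for count in (500, 600):
    low = serial_registration_seconds(count, 20)
    high = serial_registration_seconds(count, 30)
    print(f"{count} services: {low:.0f} to {high:.0f} seconds")
```

Under these assumptions the range is 10 to 18 seconds, which brackets the "nearly 15 seconds" figure.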

As shown on the left in the preceding figure, only when the client detects that the server is shut down and uses the latest address list for routing and load balancing are the requests not balanced to nodes that are shut down. Then, service requests may report errors during the period from the time when the nodes start to shut down to the time when the requests are no longer sent to the nodes that are shut down. This is referred to as the period for reporting service call errors.

Through the analysis of the microservices shutdown process, we know that the key to solving the shutdown problem is twofold: shorten the period for reporting service call errors as much as possible during each shutdown, and ensure that the node being shut down has processed all requests sent to it before it stops.

So how do we shorten the period for reporting service call errors? There are some strategies:

Move step 3, in which the node logs itself off from the registry center, ahead of step 2, so that the service logoff is announced before the application starts to shut down. Kubernetes provides the preStop hook, so this step can be abstracted and triggered from the preStop hook.
If the registry center is not capable enough, can the server bypass it and directly notify the client that the current server node is about to shut down? This behavior can also be triggered from the Kubernetes preStop hook.
Can the client actively refresh its address list cache after receiving the notification from the server?

How can we ensure that the server node shuts down after processing all requests sent to the node? From the perspective of the server, after the client is notified of the shutdown signal, a waiting mechanism can be provided to ensure that all in-transit requests and requests being processed by the server have been processed before the shutdown process.
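The ordering described above (log off first, give consumers time to react, then drain in-flight requests before stopping) can be sketched with a toy in-process model. Everything here is an illustrative assumption: the `GracefulServer` class, the `set` standing in for a registry center, and the timing values are hypothetical and do not reflect how MSE or any framework implements graceful shutdown.

```python
import threading
import time


class GracefulServer:
    """Toy provider node that logs off first, then drains in-flight requests."""

    def __init__(self, registry: set, address: str):
        self.registry = registry      # stand-in for a registry center
        self.address = address
        self.in_flight = 0
        self.accepting = True
        self._lock = threading.Lock()
        registry.add(address)

    def handle(self, work_seconds: float) -> bool:
        """Process one request; returns False if the node has already stopped."""
        with self._lock:
            if not self.accepting:
                return False
            self.in_flight += 1
        try:
            time.sleep(work_seconds)  # stand-in for business logic
            return True
        finally:
            with self._lock:
                self.in_flight -= 1

    def shutdown(self, notify_wait: float, drain_timeout: float) -> bool:
        """Returns True if every in-flight request finished before the stop."""
        # 1. Log off before stopping, as a preStop hook would trigger.
        self.registry.discard(self.address)
        # 2. Give consumers time to refresh their address lists.
        time.sleep(notify_wait)
        # 3. Wait for in-flight requests to drain, up to a deadline.
        deadline = time.monotonic() + drain_timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self.in_flight == 0:
                    break
            time.sleep(0.01)
        # 4. Only now stop accepting requests.
        with self._lock:
            self.accepting = False
            return self.in_flight == 0


registry = set()
server = GracefulServer(registry, "10.0.0.1:8080")
results = []
worker = threading.Thread(target=lambda: results.append(server.handle(0.2)))
worker.start()
time.sleep(0.05)  # let the request start before shutdown begins
drained = server.shutdown(notify_wait=0.05, drain_timeout=1.0)
worker.join()
print(drained, results, registry)
```

In this model the request that was already running when shutdown began still completes, because the node leaves the registry before it stops accepting work rather than after.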

Practice: Performance of Graceful Shutdown during Scaling-in

Let's take a look at the effect and performance of graceful shutdown through practice:

This practice is divided into four steps. First, we configure scheduled scaling tasks in the ACK console to simulate application scaling: within a 5-minute window, a scale-out task runs in the second minute and a scale-in task runs in the fourth minute. Second, after an application is connected to MSE service governance, graceful shutdown is enabled by default. For the control group, we add an environment variable to the baseline environment to disable graceful shutdown. For the experimental group, the canary environment is connected to MSE service governance with no further operation, so graceful shutdown stays enabled. Third, we send baseline and canary traffic at the same time and observe how each performs. Fourth, in the detailed QPS data module of the MSE application, we can see that 99 exceptions occur during scaling in the unlabeled environment (the application with graceful shutdown disabled), while no errors occur in the canary environment.

📝 For more experiment details, see the MSE document. Result verification: Graceful Shutdown [1]

Principle Analysis: Graceful Start

Why do we need graceful start when graceful shutdown is already available? Because application startup has many problems of its own. We abstract the startup of a microservice application into the stages below, each of which can go wrong.

Let's take a look at the stages of a standard microservice startup process:

The first stage is the application initialization, during which the Spring Bean loading and initialization logic will be performed.
The second stage is establishing connections in connection pools. For example, if we use Redis through Jedis, we create a JedisPool. By default, creating the pool does not create connections immediately; they are created only when requests arrive. In a real case from a major customer, traffic poured in during application startup before the connections had been established. A large number of threads blocked on connection establishment, which led to a large number of request errors. Therefore, at this stage we need to establish the core connections of the connection pools in advance.
The third stage is service registration. Once registration completes, consumer applications can discover the service, and microservice traffic starts to enter the application. Therefore, we must ensure that the application is fully initialized before it registers. In some scenarios, such as MaxCompute, a service needs to load resources asynchronously after startup; big data services, for example, need to pull hundreds of megabytes of data from OSS in advance. The service can be provided only after these resources are ready. If the application registers as soon as it starts, requests that arrive before the asynchronous resources are ready will fail.
The fourth stage is alignment with the Kubernetes lifecycle. Once the Kubernetes readiness check passes, the pod is considered started. During a Kubernetes rolling release, a passed readiness check means the new pods are ready, so Kubernetes automatically moves on to the next batch of pods. If readiness only checks a port, the old pods may be stopped while the first new pods have not yet completed service registration. In this case, the service has no address in the registry center for a short time, and clients may encounter "no provider" exceptions during the release.
The fifth stage is warm-up. Even when the application and pod are ready, a freshly started application has much less capacity than normal because of JIT warm-up and lazy loading of framework resources. This is especially noticeable in Java applications. If the instance immediately receives its even share of online traffic without any preparation, many requests may respond slowly, resources may block, and the instance may even go down. Therefore, at this stage we route only a small amount of traffic to the newly started instance until it is fully warmed up.

Only after properly handling all the above processes can the microservice application gracefully face the heavy traffic of production after it is started.
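One way to picture the ordering constraint in the stages above is a small state tracker in which registration is refused until the earlier stages are done, and the readiness result depends on all of them rather than on a port alone. This is a minimal sketch: the stage names and the `StartupState` class are hypothetical and are not part of MSE, Spring, or Kubernetes.

```python
from dataclasses import dataclass, field

# Hypothetical stage names, in the order the text describes them.
STAGES = ("initialized", "connections_ready", "resources_loaded", "registered")


@dataclass
class StartupState:
    done: set = field(default_factory=set)

    def complete(self, stage: str) -> None:
        """Mark a stage as done, refusing to skip any earlier stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        required = STAGES[:STAGES.index(stage)]
        missing = [s for s in required if s not in self.done]
        if missing:
            raise RuntimeError(f"cannot complete {stage!r} before {missing}")
        self.done.add(stage)

    def ready(self) -> bool:
        """What a readiness endpoint would report to Kubernetes."""
        return all(stage in self.done for stage in STAGES)


state = StartupState()
state.complete("initialized")
try:
    state.complete("registered")  # too early: pools and resources not ready
except RuntimeError as err:
    print("rejected:", err)
for stage in STAGES[1:]:
    state.complete(stage)
print("ready:", state.ready())
```

Tying readiness to the `registered` stage is what prevents a rolling release from stopping old pods before any new pod is discoverable.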

Practice: Graceful Start

Let's take a quick look at the Demo of the graceful start:

First, we need to enable the graceful start on the MSE graceful start page and configure the preheat duration. Then, after restarting the application, we can see that the traffic curve of the released pod is slowly increasing, as shown in the preceding figure, in line with the low-traffic preheating curve.
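A slowly rising traffic curve like this can be produced by giving a new instance a load balancing weight that grows with its uptime. The sketch below assumes a linear ramp, which is an illustrative choice only; the source does not say which curve MSE uses, and the function names are hypothetical.

```python
import random


def warmup_weight(uptime_s: float, warmup_s: float, full_weight: int = 100) -> int:
    """Weight of an instance that started uptime_s seconds ago (linear ramp)."""
    if warmup_s <= 0 or uptime_s >= warmup_s:
        return full_weight
    # Never return 0, so the new instance still receives a trickle of traffic.
    return max(1, int(full_weight * uptime_s / warmup_s))


def pick_instance(uptimes: dict, warmup_s: float, rng: random.Random) -> str:
    """Weighted random choice over instances, keyed by name -> uptime seconds."""
    names = list(uptimes)
    weights = [warmup_weight(uptimes[name], warmup_s) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# One old instance and one that has been up for 12 seconds of a 120 s warm-up.
rng = random.Random(42)
uptimes = {"old-pod": 3600.0, "new-pod": 12.0}
picks = [pick_instance(uptimes, 120.0, rng) for _ in range(10_000)]
print("new-pod share:", picks.count("new-pod") / len(picks))
```

With these inputs the new pod has weight 10 against 100 for the old pod, so it should receive roughly one request in eleven until its weight ramps up.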

📝 For more experiment details, see the MSE document. Result verification: Graceful Start [2]

Practice: Graceful Start and Shutdown Performance in Stress Testing State

The demo of this practice uses the open-source Spring Cloud framework as an example. We have prepared the following demo. Traffic is initiated by Alibaba Cloud Performance Testing Service (PTS) and flows into our system through the open source Zuul gateway. The following figure shows the service call process:

In the figure, traffic flows from the Ingress corresponding to Netflix Zuul, which calls the services corresponding to the SC-A application. The SC-A application calls the services of the SC-B application, and the SC-B application calls the services of the SC-C application.

Let's look at the effect of graceful start and shutdown in a real stress testing scenario (continuous stress testing at 500 QPS):

The left figure shows the performance of an open source Spring Cloud application that is not connected to MSE. Errors start at 17:53:42 and stop at 17:54:24, lasting 42 seconds with more than 10,000 exceptions in total. In contrast, for the experimental group, the application connected to MSE graceful start and shutdown, the release causes no loss of business traffic, and the whole process is very smooth.

If a simple demo application already shows such a striking difference, consider a production microservice system facing a traffic peak of tens of thousands of requests per second: even if the period for reporting service call errors lasts only a few seconds, the impact on the enterprise is severe. In some extreme cases, that period may stretch to several minutes. As a result, many enterprises do not dare to release during the day and end up scheduling every release at two or three o'clock in the morning, and every release puts the R&D team on high alert. Graceful start and shutdown is introduced to solve this problem and ensure that no traffic is lost during releases and changes.

How to Control Risks during Microservice Changes

If there is a problem with our new version of the code, how can we effectively control the impact of the problem?


We all know that a canary release can limit the impact of a problem during a change. However, in microservice scenarios, a traditional canary release often cannot meet the complex and diverse requirements of microservice delivery. This is because:

• The microservice call chain is long and complex. In a microservices architecture, the call chains between services are complex. A change to one service may affect the entire call chain and thus the stability of the entire application.

• A canary release may involve multiple modules, and the entire call chain must use the new versions. Because services in a microservices architecture depend on each other, a modification to one service may require adjustments to others. As a result, the new versions of multiple services must be called together during the canary release, which increases the complexity and uncertainty of the release.

• Parallel projects require multiple environments, which are inflexible to build and costly. In a microservices architecture, multiple projects are often developed in parallel, and each needs its own environment. This increases the difficulty and cost of building environments and makes releases inefficient.

To solve these problems, we need a release method that is more flexible, more controllable, and better suited to microservice scenarios. This is where the end-to-end canary release comes in. In most cases, a canary environment or group is deployed for each microservice application to receive canary traffic. We want traffic that enters the canary environment upstream to also enter the canary environment downstream, so that a request stays in the canary environment throughout, thus forming a traffic lane. Even if some microservices in the lane have no canary environment, a request that passes through their baseline instances can still return to the canary environment when it calls the next downstream service.
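The lane behavior can be illustrated with a per-hop routing function: prefer instances carrying the request's tag, and fall back to unlabeled baseline instances when the lane has none. This is a sketch under stated assumptions; the data shapes, the `gray` tag, and the function names are hypothetical, and the example assumes the tag is propagated unchanged on every hop.

```python
from typing import Dict, List, Optional

BASELINE = ""  # hypothetical marker for unlabeled (baseline) instances


def route(instances: List[Dict[str, str]], traffic_tag: Optional[str]) -> List[Dict[str, str]]:
    """Return the candidate instances for one hop of a request."""
    if traffic_tag:
        tagged = [inst for inst in instances if inst["tag"] == traffic_tag]
        if tagged:
            return tagged
    # No tag on the request, or no instance in that lane: use the baseline.
    return [inst for inst in instances if inst["tag"] == BASELINE]


# Services A and C have a canary ("gray") deployment; B does not.
topology = {
    "A": [{"addr": "a-base", "tag": BASELINE}, {"addr": "a-gray", "tag": "gray"}],
    "B": [{"addr": "b-base", "tag": BASELINE}],
    "C": [{"addr": "c-base", "tag": BASELINE}, {"addr": "c-gray", "tag": "gray"}],
}


def call_chain(tag: Optional[str]) -> List[str]:
    """Addresses visited along A -> B -> C, with the tag carried on every hop."""
    return [route(topology[svc], tag)[0]["addr"] for svc in ("A", "B", "C")]


print(call_chain("gray"))  # ['a-gray', 'b-base', 'c-gray']
print(call_chain(None))    # ['a-base', 'b-base', 'c-base']
```

The first result shows the request leaving the lane at B, which has no canary instance, and returning to it at C because the tag survived the baseline hop.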


The production traffic is end-to-end, and the implementation of the end-to-end canary lane means that we need to ensure a closed loop of traffic in the frontend, gateway, and backend components. In addition to RPC/HTTP traffic, we also need to ensure that asynchronous calls such as MQ traffic comply with calling rules of the end-to-end "lane". The complexity of traffic control involved in the whole process is also very high.

MSE End-to-End Canary Solution


MSE uses a simple UI to show the model of traffic "lanes". You can create lane groups and lanes to quickly implement end-to-end canary releases under the microservices architecture.

1. MSE supports dynamic configuration of traffic matching rules in the console to introduce fine-grained traffic

a) It also supports flexible condition matching rules such as numeric comparison, regular expression matching, and percentage conditions to meet complex canary release demands.

b) HTTP traffic can be matched by parameters such as header, param, and cookie, while Dubbo traffic can be matched by service, method, and parameter.

2. End-to-end isolation of traffic lanes

a) You can "dye" the required traffic by setting traffic rules, and the "dyed" traffic is routed to the canary machines.

b) Canary labels are automatically passed along with canary traffic to form canary traffic lanes. When an application has no canary environment, canary traffic is routed to its unlabeled baseline environment by default. Label passing supports traffic such as RPC and MQ by default.

3. End-to-end stable baseline environment

a) Unlabeled applications are the stable baseline version, that is, the stable online environment. When we release canary code, we can configure rules to direct specific online traffic to it, which controls the risk of the canary code.

b) In the process of routing each hop, the canary traffic preferentially goes to the machine corresponding to the traffic label. If no match is found, the canary traffic will fall back to the baseline environment.

4. One-click dynamic traffic switching: After traffic rules are configured, they can be started or stopped with one click, and adding, deleting, modifying, or querying rules takes effect in real time as required. This makes canary routing of traffic more convenient.

5. Low-cost access without modification of business code, based on Java Agent technology: It seamlessly supports the Spring Cloud and Dubbo versions released over roughly the past five years. Users can adopt it without changing a single line of code or the existing architecture of the business, and can enable or remove it at any time without lock-in.
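The rule types listed in item 1 (exact match, regular expression, and percentage) can be pictured as small predicates that decide whether a request gets "dyed". The sketch below is illustrative only: the header names, the `x-lane-tag` label, the flat request shape, and the CRC-based bucketing are all assumptions, and it does not represent the MSE rule format or engine.

```python
import re
import zlib
from typing import Callable, Dict, List

Request = Dict[str, str]  # hypothetical flat view of headers, params, cookies
Rule = Callable[[Request], bool]


def header_equals(name: str, value: str) -> Rule:
    return lambda req: req.get(name) == value


def header_matches(name: str, pattern: str) -> Rule:
    compiled = re.compile(pattern)
    return lambda req: compiled.fullmatch(req.get(name, "")) is not None


def percentage(key: str, percent: int) -> Rule:
    """Stable bucketing: the same key value always lands in the same bucket."""
    def rule(req: Request) -> bool:
        bucket = zlib.crc32(req.get(key, "").encode("utf-8")) % 100
        return bucket < percent
    return rule


def dye(req: Request, rules: List[Rule], tag: str = "gray") -> Request:
    """Return a copy of the request, tagged only if every rule matches."""
    tagged = dict(req)
    if rules and all(rule(req) for rule in rules):
        tagged["x-lane-tag"] = tag  # hypothetical label name
    return tagged


rules = [header_equals("x-user-type", "internal"), header_matches("x-user-id", r"\d+")]
print(dye({"x-user-type": "internal", "x-user-id": "1024"}, rules))
print(dye({"x-user-type": "guest", "x-user-id": "1024"}, rules))
```

Hashing the key for the percentage rule, rather than drawing a random number per request, keeps a given user in the same group across requests, which matters when one user action spans several calls.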

In addition, the MSE end-to-end canary capability also provides the corresponding observability. Through the end-to-end canary observation, we can observe the canary traffic curve in real time. Whether it is in the testing and verification phase or in the production canary release process, we can know the specific situation of the traffic.

4. Summary

The stability of microservice usage has become a topic of concern. This article outlines the best practices for seamless releases of Alibaba Cloud microservices, aiming to help cloud-based enterprises make the most of microservices.

Reference

[1] Result Verification: Graceful Shutdown
[2] Result Verification: Graceful Start
