Research data indicates that about 70% of production failures result from changes. How do enterprise applications on Alibaba Cloud address the stability risks that changes bring, and release smoothly even during heavy daytime traffic? This article walks through Alibaba Cloud's end-to-end graceful release solution for microservices.
As microservice frameworks have been open-sourced and their ecosystem has matured, microservice technology has become significantly more standardized. The figure below illustrates the choice between monolithic and microservice architectures, where the x-axis represents system complexity and the y-axis represents productivity. The green curve represents monolithic architecture, while the blue curve represents microservice architecture. As system complexity increases, the two curves cross at an inflection point, and to the right of that point microservices are the more suitable choice. Standardization has greatly lowered the threshold for running microservices on the cloud, so this inflection point keeps moving left. As a result, more and more enterprises are adopting microservices. Once an enterprise adopts microservices at scale, the key to making full use of them lies in stability and efficiency.
Let's take a look at the practical demands of some customers. For instance, a tea beverage company found that over 60% of its incidents were caused by changes. As a result, new versions had to be released during off-peak hours, usually in the early morning. Similarly, a new energy company required high business continuity, with its core services needing to be online 24/7. However, rapid business growth also demanded efficient iteration. To ensure business stability, releases had to be made during off-peak hours, typically early in the morning. Another technology company lacked online canary capabilities, so every full release carried an uncontrollable impact and a certain amount of risk.
As noted above, about 70% of production failures are caused by changes. Apart from failures in the application itself, runtime risks can be summarized as uncertain traffic, unstable calls, and unstable infrastructure. The solution discussed here aims to eliminate change risks and the production stability risks they create.
First, we will analyze why microservices applications cannot be released during the day. Compared with a monolithic application, a microservices application has a complex call topology and dependency structure. This means that if the application does not start and shut down gracefully, every microservice will be briefly unavailable during the release, and a large number of exceptions will occur in a short period of time, resulting in business interruption.
Imagine accidentally clicking redeploy on an application: the alert group immediately receives a flood of alerts that the order success rate has dropped. It is very scary. Secondly, the release is the last step in bringing a feature update online. Problems accumulated during design, R&D, and testing are only triggered in this final release step. If multiple applications are involved in the release, how can we release them properly without traffic loss caused by version issues? We don't want the same user to hit the new version when placing an order but the old version at the payment stage.
The following figure shows the overall results of a stress test against an open source Spring Cloud application that I restarted while the test was running. We can see that there are a large number of exceptions. In a production system under heavy daytime traffic, even a small problem is rapidly amplified by that traffic, which leads to an uncontrollable impact. Therefore, many enterprises and business teams prefer to release in the early morning.
How can we thoroughly resolve the stability risks of changes? That is what we will introduce today: the end-to-end graceful release provided by MSE.
We can analyze the problem from two perspectives:
First of all, if there is no problem with the code, why is traffic lost while a service starts and shuts down? When a microservice provider shuts down, the service consumer cannot detect in real time that the node is gone and keeps calling the address of the shutdown node, so requests keep failing. If the provider shuts down while it is executing requests and the application node stops immediately, both in-transit and in-process requests fail, and in severe cases online data can become inconsistent. The start process has similar problems. For example, if a newly started node takes heavy traffic before it has finished preheating, the node can be overwhelmed, and in severe cases the service cannot even be scaled out under heavy traffic. Likewise, if the lifecycle of a microservice is not aligned with the Readiness/Liveness checks in Kubernetes, request exceptions may occur during the release.
If there is a problem with the new version of the code, we want to limit its impact as much as possible during the release. Alibaba's best practices for safe production show that a change release should be protected by three capabilities: canary release, observability, and rollback. Canary release limits the impact of a problem to a small share of traffic. Without it, if the new version of the business code has a bug, in most cases we can only release during off-peak hours (early morning). Microservices require end-to-end canary release, but it is currently difficult to achieve this with self-built open source stacks. If canary traffic escapes its intended path, it flows into the production environment, which introduces additional uncontrollable risks. Rollback requires us to address a problem quickly after it appears. Without a complete set of rollback solutions and strategies, we face either a slow rollback or a larger problem introduced by the rollback itself, and neither is acceptable.
If there is no problem with our new version of the code, how do we solve the problem of traffic loss during the process of service start and shutdown?
Reducing unnecessary API errors provides the best user experience and the best microservice development experience. However, how can we solve this troublesome problem in the field of microservices? Before that, let's first analyze why our microservices may suffer from traffic loss during the shutdown process.
As shown on the right in the figure above, this is a normal process for a microservice node to be shut down.
Although the process of shutting down a microservice is relatively complicated, the whole process is still very logical. Since the microservices architecture discovers nodes through service registration and discovery, it naturally detects node shutdowns in the same way.
Some practical data may change that view. From step 2 to step 6, Eureka takes 2 minutes in the worst case, and even Nacos takes 50 seconds in the worst case. In step 3, all versions before Dubbo 3.0 use the service-level registration and discovery model, which puts the registry under great pressure when the number of services is large. Assuming that each registration or deregistration takes 20 to 30 ms, registering or deregistering 500 to 600 services takes about 15 seconds. In step 5, the Ribbon load balancer used by Spring Cloud refreshes its address cache every 30 seconds by default, which means that even if the client receives the shutdown signal from the registry in real time, it may keep balancing requests to the old node for a while.
As shown on the left in the preceding figure, requests stop being balanced to shutdown nodes only after the client has detected that the server is shut down and is using the latest address list for routing and load balancing. Service requests may therefore report errors from the moment the nodes start to shut down until requests are no longer sent to them. This is referred to as the period for reporting service call errors.
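The figures quoted above can be combined into a rough back-of-the-envelope estimate. The sketch below is only an illustration: the function names are hypothetical, and it assumes a simplified model in which deregistration calls run one after another and the delays simply add up, which real registries and clients do not guarantee.

```python
def deregistration_seconds(num_services: int, per_call_ms: float) -> float:
    """Time to deregister every service one by one (service-level registration)."""
    return num_services * per_call_ms / 1000.0


def worst_case_error_window(deregister_s: float,
                            registry_push_s: float,
                            client_cache_refresh_s: float) -> float:
    """Assumed additive model: the consumer keeps calling the old node until
    deregistration, registry notification, and cache refresh have all completed."""
    return deregister_s + registry_push_s + client_cache_refresh_s


# 500 to 600 services at 20 to 30 ms per call, as quoted in the text.
low = deregistration_seconds(500, 20)
high = deregistration_seconds(600, 30)

# Real-time registry push (0 s) plus the default 30 s Ribbon cache refresh.
window = worst_case_error_window(high, 0.0, 30.0)
print(low, high, window)
```

Under these assumptions the deregistration step alone falls between 10 and 18 seconds, which is consistent with the "about 15 seconds" figure, and the client-side cache can dominate the total window.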
Through the analysis of the microservices shutdown process, we know that the key to solving the shutdown problem is twofold: shorten the period for reporting service call errors as much as possible during each shutdown, and make sure the node that is about to shut down has processed all requests sent to it before it actually stops.
So how do we shorten the period for reporting service call errors? There are some strategies:
How can we ensure that the server node shuts down after processing all requests sent to the node? From the perspective of the server, after the client is notified of the shutdown signal, a waiting mechanism can be provided to ensure that all in-transit requests and requests being processed by the server have been processed before the shutdown process.
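The waiting mechanism described above can be sketched as follows. This is a minimal illustration, not MSE's implementation: the `GracefulServer` class, the in-memory `registry` set, and the two wait parameters are all hypothetical names chosen for the example.

```python
import threading
import time


class GracefulServer:
    """Toy server that deregisters first, then drains in-flight requests."""

    def __init__(self, registry: set, address: str):
        self._registry = registry      # stand-in for a real registry center
        self._address = address
        self._in_flight = 0
        self._accepting = True
        self._cond = threading.Condition()
        registry.add(address)

    def handle(self, work):
        with self._cond:
            if not self._accepting:
                raise RuntimeError("server is shutting down")
            self._in_flight += 1
        try:
            return work()
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def shutdown(self, notify_wait_s: float, drain_timeout_s: float) -> bool:
        # 1. Deregister so consumers stop selecting this node.
        self._registry.discard(self._address)
        # 2. Keep serving while consumers refresh their address lists.
        time.sleep(notify_wait_s)
        # 3. Stop accepting new requests and wait for in-flight ones to finish.
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0,
                                       timeout=drain_timeout_s)
```

`shutdown` returns `True` when every in-flight request finished within the timeout. The ordering is the design point: deregistering before refusing requests keeps late-arriving calls from failing while consumers still hold a stale address list.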
Let's take a look at the effect and performance of graceful shutdown through practice:
This practice is divided into four steps. First, we configure scheduled scaling tasks in the ACK console to simulate application scaling scenarios: within each 5-minute cycle, a scale-out task runs in the second minute and a scale-in task runs in the fourth minute. Second, once an application is connected to MSE service governance, it has the graceful shutdown capability by default. For the control group, we add environment variables to the baseline environment to disable graceful shutdown. For the experimental group, the canary environment is connected to MSE service governance with no further operation, so graceful shutdown stays enabled. Third, we initiate baseline and canary traffic at the same time to observe their performance. Fourth, in the detailed QPS data module of MSE applications, we can see that 99 exceptions occur during scaling in the unlabeled environment (the application with graceful shutdown disabled), but no errors occur in the canary environment.
📝 For more experiment details, see the MSE document. Result verification: Graceful Shutdown [1]
Why do we need a graceful start when the graceful shutdown is available? There are also many problems in the process of application startup. We abstract the startup process of a microservices application into the following processes, and we will encounter some problems in the following stages.
Let's take a look at the stages of a standard microservice startup process:
Only after properly handling all the above processes can the microservice application gracefully face the heavy traffic of production after it is started.
Let's take a quick look at the Demo of the graceful start:
First, we need to enable the graceful start on the MSE graceful start page and configure the preheat duration. Then, after restarting the application, we can see that the traffic curve of the released pod is slowly increasing, as shown in the preceding figure, in line with the low-traffic preheating curve.
📝 For more experiment details, see the MSE document. Result verification: Graceful Start [2]
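The low-traffic preheating behavior described above can be approximated with a weight that grows with the uptime of the node. The sketch below assumes a simple linear ramp; the function names are hypothetical, and the actual MSE preheating curve may have a different shape.

```python
import random


def warmup_weight(uptime_s: float, warmup_s: float, full_weight: int = 100) -> int:
    """Linear ramp (an assumption): weight grows from 1 to full_weight over warmup_s."""
    if warmup_s <= 0 or uptime_s >= warmup_s:
        return full_weight
    ramped = int(full_weight * uptime_s / warmup_s)
    return max(1, ramped)


def pick_node(uptimes: dict, warmup_s: float, rng: random.Random) -> str:
    """Weighted random load balancing using the warm-up weights."""
    names = list(uptimes)
    weights = [warmup_weight(uptimes[name], warmup_s) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# A freshly started pod next to a long-running one, with a 120 s preheat duration.
rng = random.Random(0)
picks = [pick_node({"new": 0.0, "old": 3600.0}, 120.0, rng) for _ in range(1000)]
print(picks.count("new"), picks.count("old"))
```

With this scheme the new pod receives only a small share of requests right after start, and its share rises as its uptime approaches the configured preheat duration.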
The demo of this practice uses the open-source Spring Cloud framework as an example. We have prepared the following demo. Traffic is initiated by Alibaba Cloud Performance Testing Service (PTS) and flows into our system through the open source Zuul gateway. The following figure shows the service call process:
In the figure, traffic flows from the Ingress corresponding to Netflix Zuul, which calls the services corresponding to the SC-A application. The SC-A application calls the services of the SC-B application, and the SC-B application calls the services of the SC-C application.
Let's look at the effect of graceful start and shutdown in real stress testing (continuous stress testing with 500qps) scenarios:
The left figure shows the performance of an open source Spring Cloud application that is not connected to MSE. Errors start at 17:53:42 and stop at 17:54:24, a window of 42 seconds with more than 10,000 exceptions in total. In contrast, for the experimental group, where the applications are connected to MSE graceful start and shutdown, the release causes no loss of business traffic and the whole process is very smooth.
If even a simple demo application behaves this way, consider a production microservice system facing a traffic peak of tens of thousands of requests per second: a period for reporting service call errors of only a few seconds can have a severe impact on the business. In some extreme cases, this period can stretch to several minutes. This makes many enterprises afraid to release new versions, so they end up scheduling every release at two or three o'clock in the morning, and every release demands close attention from R&D. Graceful start and shutdown is designed to solve this problem and keep traffic lossless during releases and changes.
If there is a problem with our new version of the code, how can we effectively control the impact of the problem?
In order to control the risks in the change process, we all know that the impact of the problem can be controlled by the canary release. However, in microservices scenarios, the traditional canary release often cannot meet the complex and diversified requirements of microservice delivery. This is because:
• The microservice call chain is long and complex. In the microservices architecture, the call chain between services is complex. Changes to one service may affect the entire chain, and therefore the stability of the entire application.
• A canary release may involve multiple modules, and the entire call chain must use the new version. Because services in the microservices architecture depend on each other, modifying one service may require adjusting others. As a result, the new versions of multiple services need to be called together during the canary release. This increases the complexity and uncertainty of the release.
• Multiple projects in parallel need multiple environments to be deployed, which are inflexible to build and costly. In the microservices architecture, there are often multiple projects developed in parallel, which need to be supported by multiple environments. This increases the difficulty and cost of building the environment, resulting in inefficient release.
To solve these problems, we need a release method that is more flexible, more controllable, and better suited to microservice scenarios. This is where the end-to-end canary release comes from. In most cases, a canary environment or group is deployed for each microservices application to receive canary traffic. We want traffic that enters the upstream canary environment to also enter the downstream canary environment, so that a request always stays in the canary environment, forming a traffic lane. Even if some microservices in the traffic lane have no canary environment, the request can still return to the canary environment when those applications call downstream.
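The lane behavior described above, including the fallback to baseline and the return to the canary environment further downstream, can be sketched as follows. The data layout, the `gray` tag name, and the function names are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Optional

BASELINE = ""  # assumption: unlabeled instances are stored under the empty tag


def select_instances(instances: Dict[str, List[str]], tag: Optional[str]) -> List[str]:
    """Prefer instances carrying the lane tag, otherwise fall back to baseline."""
    if tag and instances.get(tag):
        return instances[tag]
    return instances.get(BASELINE, [])


def route_chain(topology: Dict[str, Dict[str, List[str]]],
                chain: List[str],
                tag: Optional[str]) -> List[str]:
    """The tag travels unchanged with the request, so each hop decides independently."""
    return [select_instances(topology[service], tag)[0] for service in chain]


topology = {
    "A": {BASELINE: ["a-base"], "gray": ["a-gray"]},
    "B": {BASELINE: ["b-base"]},                      # B has no canary environment
    "C": {BASELINE: ["c-base"], "gray": ["c-gray"]},
}

print(route_chain(topology, ["A", "B", "C"], "gray"))
print(route_chain(topology, ["A", "B", "C"], None))
```

Because the tag is carried on the request rather than derived from the instance that handled the previous hop, a request that falls back to the baseline instance of B still reaches the canary instance of C.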
The production traffic is end-to-end, and the implementation of the end-to-end canary lane means that we need to ensure a closed loop of traffic in the frontend, gateway, and backend components. In addition to RPC/HTTP traffic, we also need to ensure that asynchronous calls such as MQ traffic comply with calling rules of the end-to-end "lane". The complexity of traffic control involved in the whole process is also very high.
MSE uses a simple UI to show the model of traffic "lanes". You can create lane groups and lanes to quickly implement end-to-end canary releases under the microservices architecture.
1. MSE supports dynamic configuration of traffic matching rules in the console, so that traffic can be routed in at fine granularity.
a) It also supports flexible condition matching rules such as numeric comparison, regular expression matching, and percentage conditions to meet complex canary release demands.
b) HTTP traffic can be matched by parameters such as header, param, and cookie, while Dubbo traffic can be matched by service, method, and parameter.
2. End-to-end isolation of traffic lanes
a) You can "dye" the required traffic by setting traffic rules, and the "dyed" traffic is routed to the canary machines.
b) The canary traffic automatically carries its canary label downstream, forming a canary traffic lane. For applications that have no canary environment, canary traffic goes to the unlabeled baseline environment by default. Label transfer is supported by default for traffic such as RPC and MQ.
3. End-to-end stable baseline environment
a) Unlabeled applications are the stable baseline version, that is, the stable online environment. After we release the canary version of the code, we can configure rules that direct specific online traffic to it, which controls the risk of the canary code.
b) In the process of routing each hop, the canary traffic preferentially goes to the machine corresponding to the traffic label. If no match is found, the canary traffic will fall back to the baseline environment.
4. One-click dynamic traffic switching: After traffic rules are configured, they can be stopped, started, added, deleted, modified, and queried with one click, and the changes take effect in real time. This makes canary routing of traffic more convenient.
5. Low-cost access without modification of business code, based on Java Agent technology: It seamlessly supports the Spring Cloud and Dubbo versions released over roughly the last five years. Users do not need to change a single line of code or the existing architecture of the business, and can enable or remove it at any time without lock-in.
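The matching rules listed above (exact match, regular expression, numeric comparison, and percentage over header, param, or cookie values) can be illustrated with a small rule evaluator. Everything here is a hypothetical sketch: the request layout, the function names, and the hash-based percentage bucket are assumptions and do not reflect the MSE rule format or API.

```python
import re
import zlib
from typing import Callable, Dict, List, Optional

Request = Dict[str, Dict[str, str]]  # {"header": {...}, "param": {...}, "cookie": {...}}
Condition = Callable[[Request], bool]


def _get(req: Request, source: str, key: str) -> Optional[str]:
    return req.get(source, {}).get(key)


def equals(source: str, key: str, expected: str) -> Condition:
    return lambda req: _get(req, source, key) == expected


def regex(source: str, key: str, pattern: str) -> Condition:
    compiled = re.compile(pattern)

    def check(req: Request) -> bool:
        value = _get(req, source, key)
        return value is not None and compiled.fullmatch(value) is not None
    return check


def greater_than(source: str, key: str, threshold: float) -> Condition:
    def check(req: Request) -> bool:
        value = _get(req, source, key)
        try:
            return value is not None and float(value) > threshold
        except ValueError:
            return False
    return check


def percentage(source: str, key: str, percent: int) -> Condition:
    # Assumption: a stable hash of the value picks one of 100 buckets,
    # so the same user always gets the same decision.
    def check(req: Request) -> bool:
        value = _get(req, source, key)
        return value is not None and zlib.crc32(value.encode("utf-8")) % 100 < percent
    return check


def dye(req: Request, conditions: List[Condition], tag: str) -> Optional[str]:
    """Return the lane tag when every condition matches, otherwise None."""
    return tag if all(cond(req) for cond in conditions) else None


request = {"header": {"x-user-id": "1024"}, "param": {"region": "cn-hangzhou"}}
rules = [greater_than("header", "x-user-id", 1000), regex("param", "region", r"cn-.*")]
print(dye(request, rules, "gray"))
```

A request that satisfies every condition is tagged with the lane name, and a request that fails any condition stays unlabeled and therefore follows the baseline path.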
In addition, the MSE end-to-end canary capability also provides the corresponding observability. Through the end-to-end canary observation, we can observe the canary traffic curve in real time. Whether it is in the testing and verification phase or in the production canary release process, we can know the specific situation of the traffic.
The stability of microservice usage has become a topic of concern. This article outlines the best practices for seamless releases of Alibaba Cloud microservices, aiming to help cloud-based enterprises make the most of microservices.
[1] Result Verification: Graceful Shutdown
[2] Result Verification: Graceful Start