By Jiahuan He, R&D engineer of Alibaba Cloud Microservices Engine (MSE)
In modern microservice architecture, we break down the system into a series of services and connect them through remote procedure calls. This approach brings some advantages but also presents challenges.
Running microservices in production poses several challenges. One is unexpected traffic. As AIGC became extremely popular over the past year, related websites and services became unavailable under traffic surges, and their operators missed an optimal window for growth.
Another challenge is the lack of fault tolerance mechanisms. For instance, if a service on a video website encounters an exception, the exception propagates along the call chain and can make the entire site portal unavailable. This can affect millions of users and lead to significant economic losses. These recurring production failures highlight the importance of stability in microservices.
To ensure the stability of microservices, architectural evolution is necessary. Let's take a look at the three major components of microservices on the left, which you are already familiar with. While these components enable normal application usage, there is still a long way to go before achieving production readiness. Enterprises and communities have made various explorations and practices to bridge this gap. For example, the Dubbo community has introduced capabilities such as traffic management and high availability in Dubbo 3 to ensure microservice stability. These measures collectively fall under the umbrella of microservices governance.
It is evident that microservices governance is an essential aspect of successfully implementing and deploying microservices. However, the specific tasks and approaches involved in microservices governance remain somewhat unclear.
From the perspective of the software lifecycle, microservices governance can be divided into three domains: development and testing, change management, and runtime.
Each of these domains faces its own challenges, and there have been explorations and practices to address them. For example, graceful start and shutdown configurations avoid traffic loss during releases. Grayscale deployment limits the impact of changes. Traffic control and hotspot protection handle unpredictable traffic. Circuit breaking and isolation handle unstable calls.
There are mature solutions and effective practices in each domain. However, both Alibaba Cloud and other companies encounter many problems when implementing systematic microservices governance.
Firstly, there are many components involved. In a microservice architecture, we need a calling framework like Dubbo, a registry like Nacos, and stability middleware such as Sentinel and Hystrix. If these components cannot be managed in a unified way, the control cost will be high.
Additionally, there is a lack of unified concepts. For example, the isolation in Envoy and the isolation in Sentinel have completely different meanings. Envoy's isolation is about removing unhealthy instances, while Sentinel's isolation is about concurrency control. This can make it difficult for developers to understand.
Moreover, each enterprise and community has its own best practices, resulting in misaligned capabilities and the absence of a unified standard.
There is also an issue of inconsistent configuration. You may have experienced this problem. For instance, Sentinel, Hystrix, and Istio all have circuit-breaking capabilities, but their configurations differ. Developers need to learn and differentiate them, which hinders understanding and unified control.
These problems create significant resistance when implementing systematic microservices governance. What we need is a unified governance interface that enables better microservices governance. In response to this, we have proposed the OpenSergo project.
OpenSergo aims to propose an open, general, and cloud-native-oriented set of microservices governance solutions and standards to ensure the high availability of microservices. The four parts shown in the above figure represent the visions of the OpenSergo community.
The OpenSergo community abstracts industry scenarios and practices of microservices governance into specifications. This approach solves the problems of inconsistent concepts, configurations, and capabilities mentioned earlier. A unified control plane is used to support these specifications and reduce the costs of usage and maintenance.
Vertically, each hop in the entire call chain is abstracted to cover the complete scenario. Horizontally, all languages, such as the Java and Go ecosystems, and all architectures, including traditional microservices and service mesh, will be incorporated into this unified system.
However, as an open standard, OpenSergo cannot be built solely by Alibaba Cloud. Therefore, we have collaborated with many companies and communities, such as Bilibili and China Mobile, to build it jointly, with the hope of truly addressing the stability risks of microservices.
In the following section, I will describe the OpenSergo architecture. The OpenSergo community abstracts scenarios into OpenSergo specifications. However, this is just the first step. To support these specifications, a control plane is needed. In the initial evolution, the community chose to develop a control plane from scratch to manage, monitor, and enforce governance rules.
As the community evolved, we found that extending Istio costs less and allows more capabilities to be reused. Hence, in subsequent evolution, we will build the control plane and decision center on top of Istio for unified control of governance rules and pre-calculation of governance policies.
Alongside the control plane, we also need a data plane to implement specific governance capabilities, which could be middleware like Sentinel or a framework. In the initial architecture, communication between the control plane and the data plane is built on gRPC. After deciding to base the subsequent evolution on Istio, the community chose to embrace xDS as much as possible for the service mesh link. For links that xDS cannot support, we will use our own gRPC link.
As mentioned earlier, the evolution of the community control plane is based on the expansion of Istio. Istio itself has some traffic governance capabilities and is quite popular. However, Istio primarily focuses on traffic management and directing traffic to the appropriate destinations, rather than microservices governance. Therefore, these capabilities provided by Istio are not sufficient to meet our needs in terms of microservice stability.
To address this, building upon Istio, we have abstracted specifications and standards from microservice stability scenarios, such as stability during changes and stability at runtime. Our aim is to make these specifications more suitable for microservice scenarios. In this way, we extend Istio's capabilities in the field of microservices governance rather than compete with them.
Next, let's explore how OpenSergo's standards and specifications solve the aforementioned scenarios.
First, let's discuss traffic routing. The main function of traffic routing is to direct traffic that meets certain characteristics to specific workloads. This capability is commonly used to implement grayscale and intra-zone routing.
The community has defined traffic routing specifications based on the format of Istio's VirtualService/DestinationRule. However, through research and practical experience, we have found that these specifications do not fully meet the requirements of microservice scenarios. Therefore, we have extended them to be more aligned with microservice scenarios. For example, we have added logic to handle routing failures, which is a common requirement in microservice architectures.
Since Istio primarily focuses on HTTP requests, its Custom Resource Definition (CRD) does not adequately support RPC (remote procedure call) frameworks such as Dubbo. To address this, we have added more support for RPC models. In the future, we will also explore solutions that combine our specifications with community standards to make them more universal and standardized.
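The routing behavior described above can be sketched roughly as follows. This is a minimal Python illustration, not the OpenSergo or Istio implementation. The `Instance` and `Route` classes, the `pick_instances` function, the `version` label, and the `x-user-group` header are all hypothetical names chosen for the example. The sketch matches a request by header, selects the instances of the target subset, and falls back to the base subset when the target subset is empty, which is one possible form of the routing-failure handling mentioned earlier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    address: str
    labels: Dict[str, str]


@dataclass
class Route:
    # Header values a request must carry to match this route (hypothetical shape).
    match_headers: Dict[str, str]
    # Value of the "version" label that identifies the target subset.
    target_version: str
    # Subset used when the target subset has no instances (assumption).
    fallback_version: str = "base"


def pick_instances(headers: Dict[str, str],
                   routes: List[Route],
                   instances: List[Instance]) -> List[Instance]:
    """Return the instances a request should be sent to."""
    def subset(version: str) -> List[Instance]:
        return [i for i in instances if i.labels.get("version") == version]

    for route in routes:
        if all(headers.get(k) == v for k, v in route.match_headers.items()):
            matched = subset(route.target_version)
            # Routing-failure handling: empty target subset -> fallback subset.
            return matched if matched else subset(route.fallback_version)
    # No route matched: use the base subset.
    return subset("base")


instances = [
    Instance("10.0.0.1:20880", {"version": "base"}),
    Instance("10.0.0.2:20880", {"version": "gray"}),
]
routes = [Route(match_headers={"x-user-group": "beta"}, target_version="gray")]
```

A request carrying `x-user-group: beta` is sent to the `gray` instance; any other request, or a matching request when no `gray` instance exists, goes to the `base` instance.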
In Alibaba Group's practices over the years, grayscale has been defined as one of the three key elements of secure changes, along with monitoring and rollback capabilities. Grayscale is an essential ability to control the impact area of changes and ensure their stability.
There are several solutions to implement grayscale. The first is physical isolation, where we deploy two identical environments to achieve grayscale. However, this solution incurs high deployment and maintenance costs.
To improve resource utilization, the second solution is traffic grayscale. Instead of deploying a separate environment, we match the characteristics of traffic at each hop and determine whether it should go to a grayscale instance or a base instance. This solution is more flexible and efficient than the previous one, and can be implemented using the traffic routing capability mentioned earlier. However, configuring routing rules for each hop can be somewhat cumbersome.
Furthermore, per-hop matching is hard to implement because some information, such as the uid, is not available at downstream hops. Therefore, the third solution, end-to-end grayscale release, matches traffic at the ingress and tags it. The tag is automatically carried along the call chain, and subsequent hops route based on the tag. This allows us to define grayscale more concisely. OpenSergo abstracts the corresponding CRD for this scenario.
We named this CRD TrafficLane, after a swimming lane, which is a vivid analogy. Please refer to the image above. The orange color represents the normal traffic direction, while the grey color represents the grayscale traffic direction. Together, they resemble a pool divided into multiple swimming lanes.
The CRD for TrafficLane consists of three parts, which are easy to understand. First, we need to match the grayscale traffic, so we define the matching conditions. Then, we define the tag to be applied to the traffic. Finally, we define how this tag is carried between services.
With this CRD, we define a grayscale lane. However, defining it is not enough to achieve end-to-end grayscale release. Every framework along the call chain must also support the OpenSergo system so that tags are passed through automatically and each framework can route by tag. Traffic tagging and tag passthrough are implemented using a standard tracing system, such as OpenTelemetry.
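The three parts of a lane definition (match, tag, passthrough) can be illustrated with a small Python sketch. It is not the TrafficLane CRD or any OpenSergo API. The header name `x-traffic-tag`, the function names, and the fallback to a `base` lane when the tagged lane is missing are assumptions made for the example, and the plain dictionary copy stands in for what a tracing system would carry as propagated context. The match condition reuses the `/A/dubbo` and `name=xiaoming` demo values that appear later in this article.

```python
from typing import Dict, Set

TAG_HEADER = "x-traffic-tag"  # hypothetical header name used to carry the tag


def tag_at_ingress(path: str, params: Dict[str, str],
                   headers: Dict[str, str]) -> Dict[str, str]:
    """Match grayscale traffic at the ingress and attach a lane tag."""
    tagged = dict(headers)
    if path == "/A/dubbo" and params.get("name") == "xiaoming":
        tagged[TAG_HEADER] = "gray"
    return tagged


def propagate(incoming: Dict[str, str]) -> Dict[str, str]:
    """Copy the lane tag from an incoming request onto an outgoing call."""
    outgoing: Dict[str, str] = {}
    if TAG_HEADER in incoming:
        outgoing[TAG_HEADER] = incoming[TAG_HEADER]
    return outgoing


def choose_lane(headers: Dict[str, str], available_lanes: Set[str]) -> str:
    """Route by tag; use the base lane if the tagged lane is not deployed."""
    tag = headers.get(TAG_HEADER)
    return tag if tag in available_lanes else "base"
```

Only the ingress needs to know the matching condition. Every later hop calls `propagate` and `choose_lane`, so no per-hop routing rule is required.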
On the right side of the image above, you can find an example of a CRD. Please take a quick look.
Next, let's examine the scenario of runtime stability.
The first scenario involves a traffic surge, such as during a flash sale event like Double 11. Initially, the traffic and the system are both stable. However, when the traffic surges, the system starts heading towards instability. Abnormal calls also increase, eventually rendering the system unavailable. In such scenarios, we can use the traffic control capability to reject requests that exceed the system's capacity. Alternatively, we can use traffic smoothing to maintain a relatively stable traffic level and avoid service unavailability.
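The difference between rejecting excess traffic and smoothing it can be shown with two small limiters. This is a simplified Python sketch, not Sentinel's implementation. The class names, the one-second fixed window, and the `max_wait` parameter are assumptions for illustration, and the current time is passed in explicitly so that the behavior is deterministic.

```python
from typing import Optional


class RejectingLimiter:
    """Admit at most `threshold` requests per one-second window; reject the rest."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.window_start: Optional[float] = None
        self.count = 0

    def try_acquire(self, now: float) -> bool:
        # Start a new window when none exists or the current one has expired.
        if self.window_start is None or now - self.window_start >= 1.0:
            self.window_start = now
            self.count = 0
        if self.count < self.threshold:
            self.count += 1
            return True
        return False


class SmoothingLimiter:
    """Space requests evenly: each one is scheduled 1/qps after the previous one."""

    def __init__(self, qps: float, max_wait: float):
        self.interval = 1.0 / qps
        self.max_wait = max_wait
        self.next_free = 0.0

    def acquire(self, now: float) -> Optional[float]:
        """Return the seconds to wait before proceeding, or None if the wait is too long."""
        start = max(now, self.next_free)
        wait = start - now
        if wait > self.max_wait:
            return None  # queue is too long, reject instead of waiting
        self.next_free = start + self.interval
        return wait
```

The rejecting limiter protects capacity by failing fast, while the smoothing limiter turns a burst into an evenly paced stream and rejects only when a request would wait longer than `max_wait`.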
The second scenario involves unstable calls leading to service unavailability. For instance, calls to third-party services are often unstable. Here, instability refers to abnormal or slow calls. Taking Dubbo as an example, when a service provider has slow calls, threads accumulate in the service consumer. This affects other normal calls and even the overall stability of the system. The risk propagates backward along the call chain, ultimately impacting the stability of the entire system. In this case, we can use concurrency control or circuit breaking to limit the resources occupied by slow calls and ensure the overall stability of the system.
OpenSergo has developed CRDs for these scenarios. Sentinel is an industry-proven traffic protection solution. It accumulated numerous traffic protection scenarios and practices within Alibaba, and since it was open-sourced in 2018, the industry has further enriched them. We have extracted a set of specifications and standards for traffic protection from this experience.
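Circuit breaking on slow calls can be sketched as a small state machine. The following Python example is an illustration under simplifying assumptions, not Sentinel's or OpenSergo's implementation: the class and parameter names are hypothetical, statistics are cumulative rather than kept in a sliding window, and the current time is passed in explicitly.

```python
class CircuitBreaker:
    """Open the circuit when the share of slow or failed calls is too high."""

    def __init__(self, max_rt: float, slow_ratio: float,
                 min_requests: int, open_seconds: float):
        self.max_rt = max_rt              # calls slower than this count as bad
        self.slow_ratio = slow_ratio      # bad/total ratio that opens the circuit
        self.min_requests = min_requests  # do not judge on too few samples
        self.open_seconds = open_seconds  # how long to stay open before probing
        self.state = "closed"
        self.total = 0
        self.bad = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        if self.state == "open":
            if now - self.opened_at >= self.open_seconds:
                self.state = "half_open"  # let a single probe call through
                return True
            return False
        if self.state == "half_open":
            return False  # a probe is already in flight
        return True

    def record(self, now: float, rt: float, error: bool = False) -> None:
        bad = error or rt > self.max_rt
        if self.state == "open":
            return  # ignore late results from calls admitted before opening
        if self.state == "half_open":
            # The probe decides: success closes the circuit, failure reopens it.
            if bad:
                self._open(now)
            else:
                self._reset("closed")
            return
        self.total += 1
        self.bad += int(bad)
        if self.total >= self.min_requests and self.bad / self.total >= self.slow_ratio:
            self._open(now)

    def _open(self, now: float) -> None:
        self._reset("open")
        self.opened_at = now

    def _reset(self, state: str) -> None:
        self.state = state
        self.total = 0
        self.bad = 0
```

While the circuit is open, the consumer fails fast instead of holding a thread for a slow provider, which is what stops the instability from spreading backward along the call chain.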
Now, let's consider what a traffic protection rule should contain.
Firstly, we need to determine the type of traffic we want to target. We can divide it by interface or based on request characteristics. Once the targets are identified, we need to define the governance strategy to adopt. These strategies include the ones mentioned earlier, as well as advanced strategies like self-overload protection.
Lastly, throttling itself involves some loss. However, we do not want this loss to be passed on to the user side. Therefore, we need to configure different fallback behaviors for different rules to keep the experience user-friendly. For example, for basic throttling in panic buying scenarios, we can return a prompt saying "Please wait in line."
The image on the right side illustrates an example of a CRD. The traffic target is a request with the interface name as /foo. The strategy is global throttling with a threshold of 10, and the fallback is a specific return body.
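The three elements of such a rule (target, strategy, fallback) can be modeled in a few lines of Python. This sketch is not the OpenSergo CRD schema. The class and field names and the 429 status code are assumptions for illustration, while the `/foo` target, the threshold of 10, and the "Please wait in line." prompt come from the examples in this article. Window handling is reduced to a manual reset to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Fallback:
    status: int  # status code returned to throttled callers (assumed value)
    body: str    # response body returned to throttled callers


@dataclass
class ProtectionRule:
    target: str     # which traffic the rule applies to, here an interface name
    threshold: int  # maximum number of requests admitted per window
    fallback: Fallback


class RuleEnforcer:
    """Apply one rule: admit up to the threshold, then answer with the fallback."""

    def __init__(self, rule: ProtectionRule):
        self.rule = rule
        self.admitted = 0

    def handle(self, path: str, handler: Callable[[], str]) -> Tuple[int, str]:
        if path != self.rule.target:
            return 200, handler()  # the rule does not apply to this traffic
        if self.admitted >= self.rule.threshold:
            return self.rule.fallback.status, self.rule.fallback.body
        self.admitted += 1
        return 200, handler()

    def reset_window(self) -> None:
        """Start a new statistics window (a real limiter would do this by time)."""
        self.admitted = 0


rule = ProtectionRule("/foo", 10, Fallback(429, "Please wait in line."))
```

With this rule, the eleventh request to `/foo` within a window receives the fallback body instead of an error, while requests to other interfaces are unaffected.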
With these CRD configurations, we can easily leverage the traffic protection capability in both the Dubbo framework and other frameworks.
There are two ways for framework developers to access the OpenSergo system.
One way is to connect to the data plane of the OpenSergo system. Framework developers only need to implement the adaptation module of Sentinel to complete the integration. For frameworks with special requirements or closer to specific scenarios, they can access the OpenSergo system using OpenSergo standards.
Regardless of the method chosen, by introducing some dependencies, developers can gain the microservices governance capabilities defined by OpenSergo and control the capabilities of these frameworks in a unified control plane. This greatly improves the experience and efficiency of using microservices governance. Now, let's take a look at the effects after accessing the system.
The first practice is end-to-end grayscale release, which controls the risk of instability during changes. In a simple demo, we deploy a CRD that matches requests to /A/dubbo. When the parameter "name=xiaoming" is present, the request is directed to the grayscale environment. Traffic that does not match continues to the baseline environment. The observed request trend matches our expectations. In a more complex production environment that involves various frameworks such as RocketMQ and Spring Cloud Alibaba, connecting these frameworks to the OpenSergo system allows the same CRD to achieve end-to-end, full-framework grayscale releases.
The second practice is that traffic protection and fault tolerance ensure runtime stability - unstable call scenarios. Using a simple demo, application A calls application B through Dubbo. On the right, there is a traffic line chart showing normal and slow call interfaces. Purple represents total traffic, yellow represents rejected traffic, and orange represents abnormal traffic. Initially, with no slow calls, the system is in a steady state with no abnormal traffic.
However, when a slow call is introduced, abnormal traffic occurs. The slow call consumes a significant amount of Dubbo thread resources, causing the normal call resources to be occupied and leading to a large amount of abnormal traffic. The Dubbo side also experiences thread pool exhaustion exceptions.
To solve this problem, traffic control rules can be configured. Throttling traffic may seem like a solution, but it only further accumulates requests in slow call scenarios. Concurrency control is the real solution. By limiting the concurrency of the slow call interface, the number of requests being processed is restricted. With this restriction, even if slow calls still exist, the resources they can occupy are limited, allowing normal interfaces to be called without expanding instability risks and ensuring application stability.
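Concurrency control can be sketched with a semaphore that bounds the number of in-flight calls to one interface. This is a minimal Python illustration, not Sentinel's implementation, and the class and exception names are hypothetical. Requests beyond the limit are rejected immediately rather than queued, so slow calls can never hold more than `max_concurrency` workers.

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimitExceeded(Exception):
    """Raised when a call is rejected because too many calls are in flight."""


class ConcurrencyLimiter:
    def __init__(self, max_concurrency: int):
        self._slots = threading.BoundedSemaphore(max_concurrency)

    @contextmanager
    def guard(self):
        # Non-blocking acquire: fail fast instead of piling up waiting threads.
        if not self._slots.acquire(blocking=False):
            raise ConcurrencyLimitExceeded()
        try:
            yield
        finally:
            self._slots.release()
```

A caller would wrap the slow interface in `with limiter.guard(): ...` and translate `ConcurrencyLimitExceeded` into a fallback response, which leaves the remaining threads free for normal interfaces.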
The third practice is that traffic protection and fault tolerance ensure runtime stability - adaptive overload protection. In our demo, under continuous high load, abnormal traffic gradually increases, disrupting the system's steady state. In this case, adaptive overload protection rules can be configured to adjust throttling behavior and eliminate abnormal requests, helping the system return to a steady state.
The open source community already supports strategies based on BBR (Bottleneck Bandwidth and Round-trip propagation time), and PID (proportional-integral-derivative) control is also used internally. These strategies are not discussed in detail here, but you can join the discussion in our open source community if you are interested.
From these three examples, we can see that after Dubbo connects to the OpenSergo system through Sentinel, it gains the general governance capabilities defined by OpenSergo and can be managed through a unified control plane.
The same applies to other frameworks. If all frameworks involved in production are connected to the OpenSergo system, all services and microservices governance capabilities can be controlled on one control plane, better ensuring system stability.
Above is the ecosystem figure of multi-language service governance. From an ecosystem perspective, our goal for OpenSergo is end-to-end coverage across heterogeneous languages and frameworks. We will primarily focus on the Java/Go + Gateway + Mesh ecosystem and aim to expand our coverage to include more frameworks within the ecosystem.
In terms of capabilities, we are committed to abstracting and implementing additional general microservices governance capabilities. These include traffic protection, self-healing, service fault tolerance, and service authentication.
Currently, we have established contacts and partnerships with various communities such as Dubbo, ShenYu, APISIX, Higress, RocketMQ, and MOSN. Many of these collaborations have already made significant progress.
Now let's discuss our recent plans:
• Control Plane: We will gradually promote the availability of the control plane for production and aim to release the GA version in March next year. This will allow everyone to verify the microservices governance system in a production environment.
• Specifications: We will support microservices security governance and outlier instance removal, while continuously integrating our specifications with community standards.
• Governance Capability Evolution: Our focus will be on completing the upgrade of Sentinel 2.0 traffic governance and exploring secure and adaptive approaches.
• Community Cooperation: We will continue to foster exchanges and cooperation among different communities, promoting the implementation of the ecosystem, unification of the control plane, and co-construction of specifications in various microservices governance areas.
The issues related to stability are complex and the scenarios are diverse. The evolution of microservices governance technology, ecosystem, and standardization requires the participation of various enterprises and communities.
You can contribute to the community in the following three ways:
• Microservices Governance Specifications: As leaders in their respective fields, communities and enterprises can collaborate to develop and improve standards and specifications based on their own scenarios and best practices.
• Unified Control Plane Evolution: The unified control plane of microservices offers numerous possibilities. As a decision-maker, it has a holistic view of the entire system and great potential in the era of booming AI technologies.
• Governance Capability and Community Ecology Contribution: You can participate in the evolution of service governance capabilities and contribute to the integration between various communities and the OpenSergo system.