By Zehuan Shi
Today, distributed microservices are indisputably the mainstream architectural pattern. Microservices divide a complex large-scale system into loosely coupled smaller services, enabling more flexible iteration and higher fault tolerance. Within this architecture, individual microservices can be scaled horizontally, allowing the resources used by each service to be adjusted flexibly based on business needs. To properly invoke a multi-replica service, two fundamental issues need to be addressed: service discovery and load balancing. As a deployment platform widely used in the industry, Kubernetes solves service discovery and load balancing in a non-intrusive manner through its Service resource.
However, confronted with increasingly complex business scenarios, the Kubernetes Service still has some unsatisfactory aspects. For example, its load balancing only supports a random algorithm, which performs poorly in certain scenarios and leaves no room for tuning. Therefore, more and more microservice operation and maintenance teams choose Service Mesh as their network infrastructure, which builds on Kubernetes Services to provide more powerful load balancing capabilities.
In this article, the author will analyze the implementation principles and usage scenarios of various Service Mesh load balancing algorithms, and provide a reference for the selection of Service Mesh load balancing algorithms.
Kubernetes Services implement connection-level load balancing through iptables rules. This approach does not intrude into the application; instead, it performs DNAT on traffic at the network level to achieve load balancing. However, Kubernetes Service load balancing has some significant shortcomings.
First, because it is implemented with iptables rules, the time complexity of the Kubernetes load balancing algorithm is O(n), so load balancing performance can be significantly affected as the cluster gradually grows.
Second, it only supports a random load balancing algorithm, which does not strictly distribute requests evenly across backend instances. In addition, because it is based on iptables rules, it operates at Layer 3 (the IP layer) and can only achieve load balancing at Layer 4, that is, at the TCP connection level, while most mainstream applications today communicate over HTTP/GRPC.
Therefore, Kubernetes Service load balancing based on iptables rules is inherently unable to achieve request-level routing. In contrast, the Service Mesh control plane obtains workload and service information from the Kubernetes cluster's ApiServer and distributes it to the Sidecars. Leveraging their Layer 7 capabilities, Sidecars can perform request-level load balancing for HTTP/GRPC requests. Service Mesh also supports a variety of load balancing algorithms suitable for different scenarios, so users can choose the appropriate algorithm based on their service's characteristics and utilize resources more effectively.
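For reference, the following is a minimal sketch of a Kubernetes Service definition (the service name, selector, and ports are hypothetical). Note that the Service spec exposes no field for choosing or tuning a load balancing algorithm, which is exactly the limitation described above.
# Minimal Kubernetes Service sketch (hypothetical names and ports).
# In iptables mode, kube-proxy translates this Service into DNAT rules
# that pick a backend Pod randomly per connection; there is no field
# here to select or tune the load balancing algorithm.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 80
    targetPort: 8080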
Alibaba Cloud Service Mesh (ASM) provides a variety of load balancing algorithms, including RANDOM, ROUND_ROBIN, and LEAST_REQUEST, which are compatible with the Istio community, and PEAK_EWMA, which is unique to ASM. In the following sections of this article, the author will systematically analyze the characteristics of each load balancing algorithm and their respective application scenarios.
ASM provides the classic RANDOM and ROUND_ROBIN algorithms. For random load balancing, unlike Kubernetes Services, ASM Sidecars work at Layer 7, so they can provide request-level random load balancing for HTTP and GRPC communications. Random load balancing cannot guarantee that requests are evenly distributed among backend servers due to the inherent unpredictability of random selection. If both the processing capacity of the workloads and the computational load generated by requests are relatively balanced, a round-robin algorithm can be considered; it distributes requests evenly across backend workloads. These two algorithms are widely used in load balancing across various distributed systems. They are simple, intuitive, and supported by almost all load balancing hardware and software. However, in practice, there are scenarios where their performance may not meet expectations.
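In ASM, as in open-source Istio, the load balancing algorithm for a service is selected with a DestinationRule. A minimal sketch that switches a hypothetical service named my-service to round-robin load balancing might look like this (RANDOM can be selected the same way by changing the simple field):
# Sketch: select ROUND_ROBIN for a hypothetical service "my-service".
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: default
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN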
The RANDOM and ROUND_ROBIN load balancing algorithms do not consider the status of the backend service at all and compute load balancing results using only local state. Therefore, changes in backend status have no effect on the load balancing decision, which can lead to unsatisfactory results, for example in the following scenarios:
When backend processing capabilities are uneven, evenly distributing requests can actually degrade service quality. Because of the uneven backend capacities, weaker servers may become overloaded sooner. Despite this, traditional load balancing algorithms continue to distribute requests evenly across all servers, and requests routed to already overloaded servers exacerbate their load, further deteriorating latency and overall performance. For example, suppose Server-1 has greater processing capacity and can accept new requests, whereas Server-2 is already overloaded. A random or round-robin load balancer may still direct new requests to Server-2 instead of the less burdened Server-1, resulting in significantly prolonged response times due to Server-2's overload.
Even if backend processing capabilities are similar, the pressure that similar requests place on the backend can vary significantly. A typical example is LLM applications, where prompts like "write me a novel" and "what is the time" can differ drastically in computational intensity, leading to significant variations in response times. For instance, suppose both Server-1 and Server-2 are handling two requests, but Server-2 is processing a computation-heavy request and has become overloaded. A random or round-robin load balancer may still direct new requests to Server-2 instead of the less burdened Server-1, resulting in significantly prolonged response times due to Server-2's overload.
Even in scenarios where the computational loads of requests are roughly equal, erroneous requests can still lead to similar issues. For example, if a client sends an incorrect request and the server rejects it at validation without further processing, the computational load among requests becomes uneven indirectly.
The deficiencies of RANDOM and ROUND_ROBIN are essentially caused by the load balancer's inability to perceive the status of the backend. In theory, the optimal load balancing decision should be based on the real-time status of the backend and the expected computational cost of the request, but both are difficult to obtain in a real load balancing implementation. Is there a way to perceive the backend's status indirectly? The author will introduce the other two load balancing algorithms supported by ASM in the following sections.
If the user does not specify a load balancing algorithm, ASM uses LEAST_REQUEST as the default. This load balancer records the number of outstanding requests on each backend. Backends with stronger processing capabilities complete requests faster and therefore reduce this count faster. The load balancer leverages this to indirectly perceive the backend's load and always routes requests to the backend with the fewest outstanding requests.
Next, let's see the advantage of the LEAST_REQUEST algorithm through an example. First, consider the scenario of uneven backend processing capacities mentioned earlier. At time T1, the load balancer assigns 3 requests each to Server-1 and Server-2. Due to Server-1's greater concurrent processing capacity, by time T2, Server-1 has completed all 3 requests, while Server-2 still has one request pending. Since LEAST_REQUEST always directs requests to the backend with the fewest ongoing requests, any new incoming requests will be routed to Server-1, which has a lower load at that moment.
Let's consider another scenario where the computational load of requests varies significantly among servers. At time T1, Server-1 has 2 ongoing requests, while Server-2 has only one. However, Server-2 is handling a computation-heavy request (for example, a complex prompt). By time T2, Server-1 has completed and returned responses for its 2 requests, while Server-2 is still computing the response for the complex request. At this point, if the load balancer receives a new request, it will allocate it to Server-1 because Server-1 has fewer ongoing requests than Server-2.
Combining the two examples above, we can intuitively see that the LEAST_REQUEST load balancing algorithm adapts to complex scenarios better than the RANDOM and ROUND_ROBIN algorithms. By sensing the backend state through changes in ongoing request counts, it makes better load balancing decisions in real time, reducing average request latency and improving overall system utilization. This makes it well suited to AI scenarios where the amount of computation varies greatly among similar requests.
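Although LEAST_REQUEST is already the default in ASM, it can also be set explicitly, for example to override an algorithm configured earlier. A minimal sketch for a hypothetical service named my-service:
# Sketch: explicitly select LEAST_REQUEST for a hypothetical service.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: default
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST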
LEAST_REQUEST can provide better load balancing performance in most cases, but is LEAST_REQUEST the best solution in every case? Consider a scenario where a backend server encounters an exception, causing erroneous requests to return immediately without actual processing. In the LEAST_REQUEST load balancer, the number of ongoing requests recorded for the faulty node decreases quickly, so the faulty backend might be mistakenly perceived as "high-performance," leading to more requests being routed to it and increasing the overall error rate of the system. Clearly, this flaw arises because the LEAST_REQUEST evaluation cannot perceive error rates. So, is there a better option? Next, the author will introduce the PEAK_EWMA load balancer provided by ASM.
In version 1.21, ASM introduced the PEAK_EWMA load balancer. This load balancer scores endpoints based on factors such as response times, error rates over a recent period, and static weights. Endpoints with shorter response times, lower error rates, and higher static weights receive higher scores and are more likely to be selected by the load balancer. Compared to LEAST_REQUEST, PEAK_EWMA perceives the backend more comprehensively: increases in latency or error rate decrease a backend's weight, so fewer requests are routed to it. This approach aims to minimize overall service latency and maximize the success rate.
Here, we provide an end-to-end example of the PEAK_EWMA load balancer. This example uses the "simple-server" application as the server, which can simulate various server response behaviors, such as returning specific status codes or adding delays within a random range, through startup parameters. In this example, we deploy two Deployments: the Deployment named simple-server-503 uses the --mode=503 startup parameter to simulate a backend that immediately returns a 503 status code after receiving a request, and the Deployment named simple-server-normal uses --mode=normal --delayMin=50 --delayMax=200 to simulate a backend that returns 200 normally, with a processing delay that fluctuates between 50 and 200.
You have created an ASM instance of version 1.21 or later.
You have created a Kubernetes cluster of version 1.21 or later and added it to the ASM instance.
In the Kubernetes cluster, use the following YAML to deploy the simple-server application and service, and the sleep application that is used to initiate test traffic:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-server
  name: simple-server-normal
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-server
  template:
    metadata:
      labels:
        app: simple-server
    spec:
      containers:
      - args:
        - --mode
        - normal
        image: registry-cn-hangzhou.ack.aliyuncs.com/test-public/simple-server:v1.0.0.2-gae1f6f9-aliyun
        imagePullPolicy: IfNotPresent
        name: simple-server
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-server
  name: simple-server-503
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-server
  template:
    metadata:
      labels:
        app: simple-server
    spec:
      containers:
      - args:
        - --mode
        - "503"
        image: registry-cn-hangzhou.ack.aliyuncs.com/test-public/simple-server:v1.0.0.2-gae1f6f9-aliyun
        imagePullPolicy: IfNotPresent
        name: simple-server
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: simple-server
  name: simple-server
  namespace: default
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: simple-server
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hongkong.ack.aliyuncs.com/test/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
To avoid the impact of ASM's default retry mechanism (which retries twice by default) on validation results, we explicitly disable retries for the "simple-server" application using a VirtualService. Use the kubeconfig file of the ASM instance to apply the following VirtualService:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: simple-server
  namespace: default
spec:
  hosts:
  - simple-server.default.svc.cluster.local
  http:
  - name: default
    retries:
      attempts: 0 # Disable retries.
    route:
    - destination:
        host: simple-server.default.svc.cluster.local
Use the kubeconfig file of the Kubernetes cluster and run the following command to initiate the test:
$ kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" simple-server:8080/hello; done'
200
200
200
503
503
503
503
503
200
503
It can be observed that, because one backend of the "simple-server" service consistently returns 503 errors, the default LEAST_REQUEST algorithm does not effectively avoid routing requests to this faulty node during load balancing. Next, let's take a look at the performance of the PEAK_EWMA load balancer of ASM. Apply the following DestinationRule in the ASM instance to enable the PEAK_EWMA load balancer for the simple-server service:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-server
  namespace: default
spec:
  host: simple-server.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: PEAK_EWMA
Run the test again using the same command:
$ kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" simple-server:8080/hello; done'
200
503
200
200
200
200
200
200
200
200
It can be seen that after one 503 error is returned, subsequent responses are all 200. This is because the PEAK_EWMA load balancer, upon receiving a 503 error from a backend, increases that backend's recorded error rate and thereby decreases the endpoint's score. Consequently, for a subsequent period, this backend has a lower score than the normal endpoint and is not selected for routing. After some time, this effect is mitigated by the moving average, allowing the load balancer to potentially select this backend again. As mentioned earlier, the PEAK_EWMA algorithm can sense not only the backend's error rate but also changes in backend latency, routing requests to endpoints with lower latency as much as possible. The ASM official documentation provides an example of this, and you are welcome to try it out.
Load balancing is a crucial topic in modern distributed applications that cannot be overlooked. Each load balancing algorithm has its suitable scenarios, and selecting the right algorithm based on the business context can greatly enhance application performance. Drawing on a large number of real customer scenarios, the Alibaba Cloud Service Mesh team introduced the PEAK_EWMA load balancing algorithm in version 1.21. This algorithm goes further than traditional load balancing algorithms and provides a selection capability that integrates backend latency, error rate, and static weight, allowing the load balancer to dynamically adapt to changes in application status and make better decisions, thus improving the overall performance of applications.