Service Mesh Optimization Center: Optimizing Service Mesh for Higher Performance and Availability

By Xining Wang

Service Mesh is a solution that manages network communication in a microservice architecture. By adding a proxy to each service instance, it enables functions such as traffic control, service discovery, and load balancing. Although Service Mesh offers many great features, there are some performance issues to be aware of:

• Increased latency: Each service request passes through one or more proxies, and the processing in each proxy adds latency.
• Increased resource usage: Each proxy requires a certain amount of CPU, memory, and other resources, so proxy resource usage grows with the number of services.
• TLS overhead: If the TLS security mechanism is enabled, the proxies in Service Mesh encrypt and decrypt traffic, which consumes additional resources.

In practical usage, it is important to choose and optimize Service Mesh according to specific situations to achieve better performance and user experience. Alibaba Cloud service mesh (ASM) is the industry's first fully managed Istio-compatible service mesh product. Its architecture is designed to align with the community and industry trends from the beginning. The control plane components are hosted on Alibaba Cloud and are independent of the user clusters on the data plane. ASM has successfully passed the service mesh performance evaluation of trusted cloud. It has received advanced certification and is ranked first with full scores in all categories.

ASM is customized based on the open-source Istio. It provides component capabilities for fine-grained traffic management and security management on the managed control plane. The managed mode decouples the lifecycle management of Istio components from the managed Kubernetes clusters, making the architecture more flexible and improving system scalability.

The following figure shows the technical architecture of ASM.

Managed ASM serves as the infrastructure for unified management of various heterogeneous computing services. To support enterprise-level use, it provides unified traffic management, unified service security, unified service observability, and WebAssembly-based unified proxy extensibility.

For more information, please visit https://www.alibabacloud.com/product/servicemesh

The following section explains how ASM establishes a mesh optimization center using multi-dimensional and diverse performance optimization methods to improve product performance indicators and ensure high availability:

• Converging the scope of service discovery to enhance the efficiency of mesh configuration push.
• Automatically recommending and generating sidecar objects based on access log analysis to reduce proxy resource consumption.
• AdaptiveXDS: Adaptive configuration push optimization.
• Continuously promoting software and hardware performance optimization.
• Supporting resource overcommit mode.
• Data plane performance optimization based on eBPF tcpip-bypass.

Converge the Scope of Service Discovery to Improve the Efficiency of Mesh Configuration Push

In Istio, converging (narrowing) the scope of service discovery makes the control plane push mesh configurations more efficiently. It reduces the CPU and memory consumption of the control plane component, and it reduces the bandwidth consumed by communication between the control plane component and the mesh proxies. The scope can be converged in the following ways.

The Discovery Selector allows you to define a set of filtering rules to filter the service discovery information that needs to be synchronized to the control plane:

• Automatically discover services in a specified namespace in a Kubernetes cluster on the data plane based on namespace-based filtering.
• Directly interact with Istiod to effectively improve the processing efficiency of the control plane.

In addition, ExportTo and Workload Selector are used to converge the scope of rule configurations.

• ExportTo controls the scope in which a rule (VirtualService, DestinationRule, ServiceEntry, etc.) is visible. Workload Selector configures the set of workloads to which a rule (DestinationRule, ServiceEntry, EnvoyFilter, Sidecar) applies.
• The scope can be set globally (MeshConfig), at the namespace level, or at a fine-grained level (per service or per configuration).
• The definition is similar to the following:
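As a rough illustration of the ExportTo semantics described above, the following Python sketch models the visibility check for a rule. It assumes the standard Istio conventions: "." means the rule's own namespace, "*" means all namespaces, and an empty list falls back to exporting to all namespaces. Mesh-wide defaults in MeshConfig can change that fallback. The function name is_visible is hypothetical.

```python
def is_visible(config_namespace, export_to, consumer_namespace):
    """Return True if a rule defined in config_namespace is visible
    to workloads in consumer_namespace (simplified model)."""
    # Assumption: with no exportTo entries, the rule is exported everywhere.
    if not export_to:
        return True
    for target in export_to:
        if target == "*":
            # Exported to every namespace.
            return True
        if target == "." and consumer_namespace == config_namespace:
            # Exported only to the namespace where the rule is defined.
            return True
        if target == consumer_namespace:
            # Exported to an explicitly named namespace.
            return True
    return False


# A rule in "bookinfo" that is exported only to its own namespace.
print(is_visible("bookinfo", ["."], "bookinfo"))
print(is_visible("bookinfo", ["."], "default"))
```

A narrower ExportTo value means fewer proxies need to receive the rule, which is what reduces the push volume.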

ASM provides corresponding features. Specifically, you can select or deselect a namespace to define the scope of service discovery.

You can also edit label selectors to select namespaces by label. If a namespace of the Kubernetes cluster on the data plane matches any one of the label selectors, the services in that namespace are included in the automatic discovery scope. To match a given label selector, the namespace must satisfy all the rules defined in that selector.
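The matching rule can be summarized as OR across selectors and AND within a selector. The following Python sketch shows this logic, assuming Kubernetes-style label selectors with matchLabels and matchExpressions. The function names are hypothetical, and treating an empty selector list as "discover every namespace" is an assumption of this sketch.

```python
def matches_selector(namespace_labels, selector):
    """A namespace matches one selector only if it satisfies every rule in it."""
    for key, value in selector.get("matchLabels", {}).items():
        if namespace_labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key = expr["key"]
        operator = expr["operator"]
        values = expr.get("values", [])
        present = key in namespace_labels
        if operator == "In":
            ok = present and namespace_labels[key] in values
        elif operator == "NotIn":
            # A missing key also satisfies NotIn in Kubernetes label selectors.
            ok = not present or namespace_labels[key] not in values
        elif operator == "Exists":
            ok = present
        elif operator == "DoesNotExist":
            ok = not present
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


def namespace_discovered(namespace_labels, selectors):
    """A namespace is in scope if it matches any one of the selectors."""
    if not selectors:
        # Assumption: no selectors means no filtering.
        return True
    return any(matches_selector(namespace_labels, s) for s in selectors)


selectors = [
    {"matchLabels": {"env": "prod", "mesh": "enabled"}},
    {"matchExpressions": [{"key": "team", "operator": "In", "values": ["payments"]}]},
]
print(namespace_discovered({"env": "prod", "mesh": "enabled"}, selectors))
print(namespace_discovered({"env": "prod"}, selectors))
```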

For more details, see https://www.alibabacloud.com/help/en/doc-detail/363023.html

Automatically Recommend and Generate Sidecar Objects Based on Access Log Analysis to Reduce Proxy Resource Consumption

This optimization method utilizes xDS and Sidecar objects. Let's start with some background information.

xDS (X Discovery Service) is the communication protocol between the Service Mesh control plane and data plane Sidecar Proxy. The "X" represents a collection of various protocols such as LDS (Listener Discovery Service), CDS (Cluster Discovery Service), EDS (Endpoint Discovery Service), and RDS (Route Discovery Service).

xDS serves as a transmission protocol for the mesh proxy to obtain configuration information. It acts as a bridge between Istio and the mesh proxy. Essentially, xDS consists of service discovery data and governance rules within a mesh. The size of xDS data is directly related to the scale of the mesh.

By default, xDS employs a full delivery policy, meaning that all mesh proxies within the mesh have access to the complete service discovery data of the entire mesh.

However, in many cases, a simple workload in a large-scale cluster only communicates with a few other workloads. Restricting its configuration to include only the necessary services can significantly reduce the memory usage of the mesh proxy. This is where Sidecar resource objects come into play, helping to define these configuration constraints.

To automatically recommend and generate sidecar resource objects based on access log analysis, the following implementation principles are applied:

• By analyzing the access logs generated by the mesh proxy on the data plane, the call dependencies between services on the data plane are obtained. Consequently, the corresponding sidecar resource objects are automatically recommended and generated for each workload on the data plane.
• After generating the sidecar resource objects based on the analysis results, users can double-check or customize the content according to their needs.
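The principle above can be sketched in a few lines of Python. This is a minimal illustration and not the ASM implementation: the log field names (source_namespace, source_app, destination_namespace, destination_host), the generated resource names, and the choice to always include the control plane namespace are assumptions. The output follows the shape of the Istio Sidecar resource, where each egress host is written as namespace/dnsName.

```python
from collections import defaultdict


def recommend_sidecars(access_logs, control_plane_namespace="istio-system"):
    """Derive one Sidecar resource per source workload from observed calls."""
    dependencies = defaultdict(set)
    for entry in access_logs:
        source = (entry["source_namespace"], entry["source_app"])
        host = f'{entry["destination_namespace"]}/{entry["destination_host"]}'
        dependencies[source].add(host)

    sidecars = []
    for (namespace, app), hosts in sorted(dependencies.items()):
        sidecars.append({
            "apiVersion": "networking.istio.io/v1beta1",
            "kind": "Sidecar",
            "metadata": {"name": f"{app}-sidecar", "namespace": namespace},
            "spec": {
                "workloadSelector": {"labels": {"app": app}},
                # Only the observed destinations (plus the control plane
                # namespace) are pushed to this workload's proxy.
                "egress": [{
                    "hosts": [f"{control_plane_namespace}/*"] + sorted(hosts),
                }],
            },
        })
    return sidecars


# Sample data using Bookinfo-style service names.
SAMPLE_LOGS = [
    {"source_namespace": "default", "source_app": "productpage",
     "destination_namespace": "default",
     "destination_host": "details.default.svc.cluster.local"},
    {"source_namespace": "default", "source_app": "productpage",
     "destination_namespace": "default",
     "destination_host": "reviews.default.svc.cluster.local"},
    {"source_namespace": "default", "source_app": "reviews",
     "destination_namespace": "default",
     "destination_host": "ratings.default.svc.cluster.local"},
]

for sidecar in recommend_sidecars(SAMPLE_LOGS):
    print(sidecar["metadata"]["name"], sidecar["spec"]["egress"][0]["hosts"])
```

A dependency that never appears in the logs is never added to the egress list, which is why log coverage matters for this method.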

This method suits scenarios where log data for these service calls already exists, and the call dependencies of existing business services remain relatively unchanged. By utilizing this approach, optimization can be achieved in one go. Taking the Bookinfo example, after a few requests, each mesh proxy generates an access log. For instance, the recommended and generated sidecar resource objects for the productpage service are as follows:

Of course, there are certain prerequisites for using this feature.

You need to enable the log service to collect these access logs, and the generated logs must cover all business calls in order to obtain all dependencies. If a business path does not make calls that generate logs, the corresponding service dependencies may be lost. As a result, the definition of the generated sidecar resource object may be inaccurate, leading to potential failures in subsequent access calls to the business path.

ASM provides a corresponding feature. It offers the ability to automatically recommend sidecar resource objects based on access log analysis, thereby improving the efficiency of xDS push, as shown in the diagram below.

For more details, see https://www.alibabacloud.com/help/en/doc-detail/386398.html

AdaptiveXDS: Adaptive Configuration Push Optimization

To overcome the limitations in the preceding solution, we provide another optimization method, that is, the ability to push xDS configurations on demand and adapt to changes in application services.

Adaptive configuration push optimization analyzes dependencies between services based on the access logs of services in the mesh and automatically generates sidecar resource objects to optimize configuration push for each service workload. After the feature is enabled, an egress gateway named istio-axds-egressgateway is deployed in the cluster. All HTTP traffic between services is initially routed through this egress gateway, and the dependencies between services are automatically analyzed from the access logs recorded by the gateway.

Specifically, as shown in the architecture diagram:

• On the managed side, the Adaptive Xds Controller component manages the lifecycle of the AdaptiveXds-EgressGateway and generates the EnvoyFilter and Bootstrap configurations required by the AdaptiveXds-EgressGateway.
• After this feature is enabled, the AdaptiveXds-EgressGateway will report access logs to the Access Log Service (ALS).
• ALS is responsible for receiving the log data sent by the mesh proxy, analyzing the log content with ALS Analyzer, and then generating the corresponding sidecar resource object according to the service call dependency.

You can enable adaptive configuration push optimization for the services in a cluster at the namespace level. After the feature is enabled for a namespace, configuration push for all services in that namespace is automatically optimized based on the generated sidecar resource objects.

You can also enable the optimization for a single service by adding the asm.alibabacloud.com/asm-adaptive-xds: true annotation to the Kubernetes Service.
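As a small illustration, the following Python sketch adds that annotation to a Service manifest represented as a dictionary. The helper name enable_adaptive_xds and the sample Service are hypothetical. Because Kubernetes annotation values must be strings, the value is written as the string "true".

```python
import copy

ADAPTIVE_XDS_ANNOTATION = "asm.alibabacloud.com/asm-adaptive-xds"


def enable_adaptive_xds(service_manifest):
    """Return a copy of the Service manifest with the annotation set."""
    updated = copy.deepcopy(service_manifest)
    metadata = updated.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    # Annotation values are strings, so the boolean is quoted.
    annotations[ADAPTIVE_XDS_ANNOTATION] = "true"
    return updated


# Hypothetical Service used only for illustration.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "productpage", "namespace": "default"},
    "spec": {"ports": [{"port": 9080}], "selector": {"app": "productpage"}},
}

print(enable_adaptive_xds(service)["metadata"]["annotations"])
```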

In one customer scenario using ASM, this method reduced the configurations in the mesh proxy by 90% and reduced memory consumption from 400 MB to 50 MB.

ASM provides the corresponding function, which supports adaptive configuration push optimization to improve the push efficiency of xDS and reduce unnecessary configuration of mesh proxies.

For more information, see: https://www.alibabacloud.com/help/en/doc-detail/479108.html

Continue to Promote the Performance Optimization of Software and Hardware

The data plane comes in various forms, with different ECS specifications, models, and OS versions running on each node. By detecting the characteristics of the nodes, we can better understand their support capabilities. For example, we can determine whether the relevant features of eBPF are supported based on the Kernel version, enable TLS encryption and decryption processing capabilities based on AVX instruction set support, or determine whether Device Plugin is provided.

In other words, by detecting the hardware features available on each node in the Kubernetes cluster, including CPUID features and instruction set extensions, we can dynamically configure the corresponding features. This allows us to adaptively enable or disable these features without any noticeable impact on users.

This approach allows us to make full use of the node environment that users already have and to enable features dynamically where they are supported. Features already launched in ASM include dynamically enabling the Multi-Buffer feature based on AVX instruction set support to improve TLS encryption and decryption performance.

Specifically:

1) On the service mesh control plane, users are provided with a unified declarative configuration definition by extending MeshConfig or CRD.

2) The control plane configuration is delivered to the Envoy proxies on the data plane through the xDS protocol. This delivery also relies on extension capabilities implemented in ASM.

3) Workload pods are preferentially scheduled and support adaptive dynamic configuration. By identifying node features, ASM preferentially schedules pods that enable the Multi-Buffer feature onto nodes that support it, so that the feature can take effect. In addition to scheduling, ASM supports adaptive dynamic configuration: even if no suitable node is available, these pods adaptively disable the feature when deployed on other nodes.
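The adaptive decision in step 3 can be pictured with a short Python sketch that reads CPU flags in the format used by /proc/cpuinfo on Linux and decides whether a hardware-dependent feature should be turned on. This is an illustration only: the set of required flags is an assumption, not the list that ASM actually checks, and the function names are hypothetical.

```python
# Assumption: illustrative flag set, not the one ASM checks.
REQUIRED_FLAGS = frozenset({"avx512f", "avx512ifma"})


def cpu_flags(cpuinfo_text):
    """Extract the CPU flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            return set(value.split())
    return set()


def multi_buffer_enabled(cpuinfo_text, required=REQUIRED_FLAGS):
    """Enable the feature only when every required flag is present."""
    return set(required) <= cpu_flags(cpuinfo_text)


supported_node = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f avx512ifma\n"
other_node = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2\n"

# The same pod spec yields different settings depending on the node.
print(multi_buffer_enabled(supported_node))
print(multi_buffer_enabled(other_node))
```

On a real node, the text would come from reading /proc/cpuinfo, and the result would feed into the proxy configuration rather than being printed.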

Support in Resource Overcommit Mode

In Kubernetes, the kubelet manages the resource quality of containers on a node, such as OOM (Out of Memory) priority control, by referring to the QoS class of pods. Pod QoS classes are Guaranteed, Burstable, and BestEffort. The QoS class depends on the requests and limits (CPU and memory) configured for the pod.
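The following Python sketch shows how the QoS class follows from requests and limits. It is a simplified model of the Kubernetes rules: quantities are compared as plain strings (so "500m" and "0.5" would be treated as different), only CPU and memory are considered, and the function name qos_class is hypothetical.

```python
QOS_RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod from the resources of its containers."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in QOS_RESOURCES:
            limit = limits.get(name)
            # When a request is omitted, it defaults to the limit.
            request = requests.get(name, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


guaranteed_pod = [{"resources": {"requests": {"cpu": "1", "memory": "1Gi"},
                                 "limits": {"cpu": "1", "memory": "1Gi"}}}]
burstable_pod = [{"resources": {"requests": {"cpu": "100m", "memory": "128Mi"},
                                "limits": {"cpu": "2", "memory": "1Gi"}}}]
best_effort_pod = [{"resources": {}}]

for pod in (guaranteed_pod, burstable_pod, best_effort_pod):
    print(qos_class(pod))
```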

ack-koordinator can dynamically overcommit resources. ack-koordinator monitors the loads of a node in real time and then schedules resources that are allocated to pods but are not in use.

Note that resource limits and resource requests do not have to be set to the same value. We recommend that you configure them based on the workload type.

• For a workload whose QoS class is Guaranteed, we recommend that you set the two to the same value.
• For other types of pods, we recommend that you keep the resource request lower than the resource limit, as is done for native resource types.

ASM provides the corresponding features. You can configure ACK resources that can be dynamically overcommitted for the injected sidecar proxy container and the istio-init container.

For more information, see Configure Sidecar Proxies: https://www.alibabacloud.com/help/en/doc-detail/613582.html

Merbridge: Data Plane Performance Optimization Based on eBPF tcpip-bypass

eBPF is a widely adopted technology that opens up many optimization opportunities, particularly for traffic and performance. For instance, in the service mesh field, eBPF can replace iptables for traffic redirection.

Merbridge, an open-source project under CNCF, focuses on leveraging eBPF to accelerate service mesh. By utilizing eBPF technology to replace iptables, Merbridge enables traffic interception within the service mesh.

With the eBPF and msg_redirect technologies, Merbridge can improve the transmission speed between sidecars and applications and reduce latency, as shown in the following figure.

eBPF makes it possible to hook the tcp_sendmsg function so that an eBPF program takes over tcp_sendmsg.

Additionally, eBPF provides helper functions such as bpf_msg_redirect_hash. Inside the taken-over tcp_sendmsg, this helper short-circuits the transmission paths of two sockets on the same host, bypassing the kernel-mode protocol stack and accelerating inter-process access.

These helper functions rely on a kernel-level map to store connection information, so each connection must have a non-conflicting key in the map.
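To make the role of the map concrete, here is a user-space Python model of the idea. It is not eBPF code and not how Merbridge is implemented: the class and method names are hypothetical, Python lists stand in for socket receive queues, and the key layout (local address, remote address) is an assumption chosen to show why keys must be unique per connection.

```python
class SockMap:
    """Toy model of a socket map keyed by connection endpoints."""

    def __init__(self):
        self._sockets = {}

    def register(self, local, remote, inbox):
        """Record a socket when its connection is established."""
        key = (local, remote)
        if key in self._sockets:
            # Two connections must never share a key.
            raise ValueError(f"conflicting key: {key}")
        self._sockets[key] = inbox

    def send(self, local, remote, payload):
        """Deliver to the peer socket directly if it is on the same host."""
        # The peer sees the same connection with the endpoints reversed.
        peer_inbox = self._sockets.get((remote, local))
        if peer_inbox is None:
            return "kernel-stack"  # fall back to the normal TCP/IP path
        peer_inbox.append(payload)
        return "redirected"


app_addr = ("10.0.0.5", 40000)
proxy_addr = ("10.0.0.5", 15001)
app_inbox, proxy_inbox = [], []

sock_map = SockMap()
sock_map.register(app_addr, proxy_addr, app_inbox)
sock_map.register(proxy_addr, app_addr, proxy_inbox)

print(sock_map.send(app_addr, proxy_addr, b"GET / HTTP/1.1"))
print(sock_map.send(app_addr, ("10.0.9.9", 80), b"ping"))
```

If two pods on a node could produce identical keys, one lookup would return the wrong peer, which is why the keys have to be kept distinct.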

In Ambient mode, adding pods to or removing them from a mesh becomes more flexible. The CNI mode no longer fits because it takes effect only during pod creation, whereas in Ambient mode pods can be added to a mesh by modifying annotations.

Container egress traffic is no longer intercepted at 127.0.0.1:15001; instead, it must be forwarded to the current node's ztunnel instance.

Since pods do not contain sidecars, the previous solution of determining whether to intercept pods by listening to port 15001 no longer applies.

Now, how can we address these challenges in Ambient mode?

Through eBPF, we can observe the creation of processes and associate the cgroup ID and pod IP in user mode. Additionally, we can store pod information.

In the eBPF program, we can use the cgroup ID of the current process to find the pod IP of that process. (Processes within a container share the same cgroup ID, and that ID should not change.)

If traffic originates from a pod in Ambient mode, it will be forwarded to ztunnel.
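The lookup chain described above (cgroup ID to pod IP, then pod IP to Ambient membership) can be modeled in user space as follows. This Python sketch only illustrates the data flow. The class, its methods, and the returned labels are hypothetical, and the real logic runs partly in an eBPF program backed by kernel maps.

```python
class PodRegistry:
    """Toy model of the maps shared between user space and the eBPF program."""

    def __init__(self):
        self._pod_ip_by_cgroup = {}
        self._ambient_pod_ips = set()

    def on_process_created(self, cgroup_id, pod_ip, ambient):
        """User-space side: associate a cgroup ID with its pod IP."""
        self._pod_ip_by_cgroup[cgroup_id] = pod_ip
        if ambient:
            self._ambient_pod_ips.add(pod_ip)
        else:
            self._ambient_pod_ips.discard(pod_ip)

    def egress_target(self, cgroup_id):
        """Program side: decide where traffic from this process goes."""
        pod_ip = self._pod_ip_by_cgroup.get(cgroup_id)
        if pod_ip in self._ambient_pod_ips:
            return "ztunnel"  # forward to the node's ztunnel instance
        return "passthrough"  # not in Ambient mode, leave traffic alone


registry = PodRegistry()
registry.on_process_created(cgroup_id=101, pod_ip="10.0.0.5", ambient=True)
registry.on_process_created(cgroup_id=202, pod_ip="10.0.0.6", ambient=False)

print(registry.egress_target(101))
print(registry.egress_target(202))
```

Because the decision is keyed on the cgroup ID instead of a listening port, it still works when the pod has no sidecar.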

The ASM product team is also focused on improving the performance of certain network plugins based on eBPF and Alibaba Cloud Container Service ACK. We are actively collaborating with the Merbridge team.

Although we have implemented performance optimization methods in the above dimensions, it is widely understood that performance is an ongoing concern. We will continue to explore and study in order to discover more effective solutions to enhance the performance and stability of Service Mesh.
