Improving Kubernetes Service Network Performance with Socket eBPF

Background Information

The network features in Kubernetes mainly include the pod network, the Service network, and network policies. For the pod network and network policies, Kubernetes specifies the model but does not provide a built-in implementation. Service, by contrast, is a built-in feature of Kubernetes, and the official version has multiple implementations:

Service mode	Description
userspace proxy mode	kube-proxy is responsible for list/watch, rule settings, and user-space forwarding.
iptables proxy mode	kube-proxy is responsible for list/watch and rule settings. The iptables-related kernel modules are responsible for forwarding.
IPVS proxy mode	kube-proxy is responsible for list/watch of Kubernetes resources and rule settings. IPVS in the Linux kernel is responsible for forwarding.

The successive Service implementations in Kubernetes share one overall purpose: higher performance and better extensibility.

The Service network is a distributed server load balancer. kube-proxy is deployed as a DaemonSet, listens to Endpoints (or EndpointSlice) and Service objects, and generates forwarding table entries locally on each node. Currently, the iptables and IPVS modes are used in production environments. The following figure shows the principle:

This article introduces the logic of using socket eBPF to implement load balancing at the socket level. This eliminates packet-by-packet NAT processing and improves the performance of the Service network.

Socket eBPF-based Data Plane

Introduction to the Socket eBPF

Whether the Service network uses kube-proxy with IPVS or the tc eBPF acceleration mode, every network request from a pod must pass through IPVS or tc eBPF (POD <--> Service <--> POD), and the performance overhead grows as traffic increases. Service implementations based on kube-proxy and IPVS rely on per-packet processing plus session tracking. Can we instead replace the cluster IP address of the Service with the corresponding pod IP when the connection is made?

With socket eBPF, we can implement the SLB logic without processing packets or performing NAT. At the same time, the Service network path is optimized to POD <--> POD, so the performance of the Service network is equivalent to that of the pod network. The following figure shows the software structure:

In the Linux kernel, eBPF programs of the cgroup socket types (BPF_PROG_TYPE_CGROUP_SOCK and, for address hooks such as connect, BPF_PROG_TYPE_CGROUP_SOCK_ADDR) can be attached to hooks on socket system calls.

By attaching the program to the file descriptor of a specific cgroup, we can control the scope of the hook.
With the socket eBPF hooks, we can hijack specific socket interfaces at the socket level to implement the SLB logic.
POD-SVC-POD forwarding is converted into POD-POD forwarding.

The Linux kernel keeps improving the related hooks to support more bpf_attach_type values. The following are some examples:

BPF_CGROUP_INET_SOCK_CREATE
BPF_CGROUP_INET4_BIND
BPF_CGROUP_INET4_CONNECT
BPF_CGROUP_UDP4_SENDMSG
BPF_CGROUP_UDP4_RECVMSG
BPF_CGROUP_GETSOCKOPT
BPF_CGROUP_INET4_GETPEERNAME
BPF_CGROUP_INET_SOCK_RELEASE

TCP Workflow

TCP is connection-oriented, so the implementation is simple. You only need to hook the connect system call, as shown in the following figure:

Connect system call hijacking logic:

1. Take the dip+dport from the connect call context and look it up in the svc table. If it is not found, return without processing.
2. Look up the affinity session. If it is found, get the backend_id and go to step 4. Otherwise, go to step 3.
3. Schedule randomly to assign a backend.
4. Use the backend_id to look up the be table and obtain the IP+port of the be.
5. Update the affinity information.
6. Rewrite the dip+dport parameter in the connect call context to the be's IP+port.
7. Done.
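
The steps above can be sketched as a small user-space model in Python. It only illustrates the control flow and is not the eBPF program itself, which would be written in C and attached to the cgroup connect hook. The table contents, the ConnectCtx class, and the use of a client name as the affinity key are assumptions made for this example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

Addr = Tuple[str, int]  # (ip, port)


@dataclass
class ConnectCtx:
    """Hypothetical stand-in for the connect call context."""
    client: str  # assumed affinity key for this sketch
    dip: str
    dport: int


# Hypothetical tables standing in for the eBPF maps named in the steps.
svc_table: Dict[Addr, List[int]] = {("10.96.0.10", 80): [1, 2]}
be_table: Dict[int, Addr] = {1: ("172.16.0.11", 8080), 2: ("172.16.0.12", 8080)}
affinity_table: Dict[Tuple[str, Addr], int] = {}


def on_connect(ctx: ConnectCtx) -> None:
    """Rewrite ctx in place if its destination is a Service address."""
    vip = (ctx.dip, ctx.dport)
    backend_ids = svc_table.get(vip)             # step 1: look up the svc table
    if backend_ids is None:
        return                                   # not a Service: leave untouched
    key = (ctx.client, vip)
    backend_id = affinity_table.get(key)         # step 2: look up the affinity session
    if backend_id is None:
        backend_id = random.choice(backend_ids)  # step 3: random scheduling
    be_ip, be_port = be_table[backend_id]        # step 4: look up the be table
    affinity_table[key] = backend_id             # step 5: update the affinity information
    ctx.dip, ctx.dport = be_ip, be_port          # step 6: rewrite dip+dport
```

After on_connect returns, the socket connects to whatever address is left in ctx, so a Service destination has already been replaced by a backend address before any packet is sent.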

The address and port translation is completed at the socket level. For TCP access to a ClusterIP, the traffic is equivalent to east-west communication between pods, which minimizes the overhead of the ClusterIP.

Packet-by-packet DNAT is not required.
Packet-by-packet svc lookup is not required.

UDP Workflow

UDP is connectionless, so the implementation is more complex, as shown in the following figure:

For the definition of the nat_sk table, see: LB4_REVERSE_NAT_SK_MAP

Hijack connect and sendmsg system calls:

1. Take the dip+dport from the system call context and look it up in the svc table. If it is not found, return without processing.
2. Look up the affinity session. If it is found, get the backend_id and go to step 4. Otherwise, go to step 3.
3. Schedule randomly to assign a backend.
4. Use the backend_id to look up the be table and obtain the IP+port of the be.
5. Update the affinity table.
6. Update the nat_sk table. The key is the IP+port of the be, and the value is the vip+vport of the svc.
7. Rewrite the dip+dport in the system call context to the be's IP+port.
8. Done.
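
As a rough illustration of these steps, the following Python model shows one handler serving both connect and sendmsg, and how the nat_sk entry is written for the return path. It is a user-space sketch rather than eBPF code. The addresses, the table contents, and the client name used as the affinity key are assumptions; the nat_sk key and value follow the description above.

```python
import random
from typing import Dict, List, Tuple

Addr = Tuple[str, int]  # (ip, port)

# Hypothetical tables standing in for the eBPF maps named in the steps.
svc_table: Dict[Addr, List[int]] = {("10.96.0.10", 53): [1, 2]}
be_table: Dict[int, Addr] = {1: ("172.16.0.21", 5353), 2: ("172.16.0.22", 5353)}
affinity_table: Dict[Tuple[str, Addr], int] = {}
nat_sk: Dict[Addr, Addr] = {}  # be ip+port -> svc vip+vport


def on_udp_send(client: str, dst: Addr) -> Addr:
    """Return the destination to use for a UDP connect or sendmsg call."""
    backend_ids = svc_table.get(dst)
    if backend_ids is None:
        return dst                               # not a Service: no rewrite
    key = (client, dst)
    backend_id = affinity_table.get(key)
    if backend_id is None:
        backend_id = random.choice(backend_ids)
    backend = be_table[backend_id]
    affinity_table[key] = backend_id
    nat_sk[backend] = dst                        # remembered for the receive path
    return backend


# connect() and sendmsg() can share the handler because both carry dip+dport.
on_udp_connect = on_udp_send
```

The only difference from the TCP case is the nat_sk write: because UDP replies arrive as independent datagrams, the receive path needs this entry to map the backend address back to the Service address.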

Hijack the recvmsg system call:

1. Use the remote IP+port from the system call context to look up the nat_sk table. If no entry is found, return without processing.
2. If an entry is found, take the vip+vport from it and look it up in the svc table. If it is not found, delete the corresponding nat_sk entry and return.
3. Use the vip+vport found in nat_sk to set the remote IP+port in the system call context.
4. Done.
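
The receive path can be modeled in the same spirit. The sketch below assumes a nat_sk table that was filled earlier on the send path, plus one stale entry whose Service no longer exists. All addresses and table contents are made up for illustration, and the function is a plain Python model, not the eBPF hook.

```python
from typing import Dict, List, Tuple

Addr = Tuple[str, int]  # (ip, port)

# Hypothetical table contents for illustration only.
svc_table: Dict[Addr, List[int]] = {("10.96.0.10", 53): [1, 2]}
nat_sk: Dict[Addr, Addr] = {
    ("172.16.0.21", 5353): ("10.96.0.10", 53),  # live entry
    ("172.16.0.99", 5353): ("10.96.0.99", 53),  # stale: Service was deleted
}


def on_udp_recv(remote: Addr) -> Addr:
    """Return the remote address that recvmsg should report to the application."""
    vip = nat_sk.get(remote)      # step 1: look up nat_sk by remote ip+port
    if vip is None:
        return remote             # no entry: leave the address untouched
    if vip not in svc_table:      # step 2: the Service must still exist
        del nat_sk[remote]        # clean up the stale entry
        return remote
    return vip                    # step 3: report vip+vport instead
```

With this translation, an application that sent a datagram to the Service address sees the reply as coming from that same address, even though the packet actually came from a backend pod.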

About Address Correction

The ClusterIP implementation is based on socket eBPF. In addition to the basic forwarding principles above, some special details need to be considered. One of them is the peer address. Unlike implementations such as IPVS, with a ClusterIP based on socket eBPF the client communicates directly with the backend, and the intermediate Service is bypassed. The following figure shows the forwarding path:

If the application on the client calls an interface that queries the peer address, such as getpeername, the address it obtains differs from the address passed to connect. If the application checks the peer address or uses it for a special purpose, unexpected behavior may occur.

To handle this situation, we can correct the address at the socket level with eBPF:

Add a bpf_attach_type to the guest kernel that allows hooks to be attached to getpeername and getsockname.
At connection time, in the corresponding socket hook, define a map and record the correspondence between VIP:VPort and RSIP:RSPort.
When the application calls getpeername/getsockname, use an eBPF program to modify the returned data: change the remote IP+port in the context to vip+vport.
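
A minimal Python model of this correction is shown below. The per-socket map, the sock_id key, and both function names are hypothetical and only illustrate the idea; the real correction would be done by an eBPF program attached to the getpeername hook.

```python
from typing import Dict, Tuple

Addr = Tuple[str, int]  # (ip, port)

# Hypothetical per-socket map: sock_id -> (VIP:VPort, RSIP:RSPort).
peer_map: Dict[int, Tuple[Addr, Addr]] = {}


def record_on_connect(sock_id: int, vip: Addr, rs: Addr) -> None:
    """Called from the connect hook after the destination was rewritten."""
    peer_map[sock_id] = (vip, rs)


def on_getpeername(sock_id: int, kernel_peer: Addr) -> Addr:
    """Return the peer address that the application should see."""
    entry = peer_map.get(sock_id)
    if entry is None:
        return kernel_peer        # socket never went through a Service
    vip, rs = entry
    # Only translate when the kernel reports the recorded real server.
    return vip if kernel_peer == rs else kernel_peer


# Usage: the application connected to the VIP, the kernel knows the real server.
record_on_connect(7, ("10.96.0.10", 80), ("172.16.0.11", 8080))
reported = on_getpeername(7, ("172.16.0.11", 8080))
```

In this example, reported holds the Service address that the application originally passed to connect, so code that compares the peer address with the connect target keeps working.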

Summary

Performance Differences Compared with tc eBPF and IPVS

Test environment: secure container instances with 4 vCPUs and 8 GB of memory; single client, single ClusterIP, and 12 backends.

socket BPF: a Service implementation based on socket eBPF.
tc eBPF: a cls-bpf-based Service implementation, which has been applied in the ACK service.
IPVS-raw: all security group rules and overhead such as veth are removed, and only the Service IPVS forwarding logic is implemented.

Socket BPF improves all performance metrics to varying degrees. For a large number of concurrent short connections, throughput is improved by 15%, and latency is reduced by 20%.

Comparison of Forwarding Performance (QPS)

Comparison of 90% Forwarding Latency (ms)

Continue to Evolve

The Service implementation based on socket eBPF simplifies the load balancer logic and reflects the flexible and compact nature of eBPF. These features of eBPF fit cloud-native scenarios well. Currently, this technology has been implemented in Alibaba Cloud to accelerate the Service network of Kubernetes.
