By Yu Kai
Co-Author: Xie Shi (Alibaba Cloud Native Application Platform)
In recent years, enterprise infrastructure has been moving to cloud-native architectures. From the initial IaaS to today's microservices, customers demand finer granularity and better observability, and container networking has been evolving rapidly to deliver higher performance and higher density. This raises the threshold for, and the difficulty of, observing the cloud-native network. To improve the observability of the cloud-native network and make business links easier to read for customers and for frontend and backend developers, ACK and AES jointly developed ack net-exporter and this observability series on cloud-native network data planes to help readers understand the cloud-native network architecture. The goal is to lower the observability threshold of the cloud-native network, optimize the experience of customer O&M and after-sales support on difficult problems, and improve the stability of cloud-native network links.
The container network can be divided into the Pod, Service, and Node CIDR blocks. How do these three networks achieve interconnection and access control? What does the full link look like? What are the restrictions? What is the difference between Flannel and Terway? What is the network performance in each mode? Customers need to answer these questions according to their business scenarios before building a cluster, because the relevant architecture cannot be changed afterwards, so they need a full understanding of the characteristics of each architecture. The following figure is a schematic diagram: the pod network enables communication and access control between pods on the same ECS instance as well as access between pods on different ECS instances. The backend of an SVC accessed by a pod may be on the same ECS instance or on another one; in different modes, the data links are forwarded differently, and the performance observed on the service side also differs.
This article is the first part of the series. It introduces the forwarding paths of data plane links in Kubernetes Flannel mode. On the one hand, understanding the data plane forwarding paths in different scenarios helps explain why access performance differs between scenarios and helps customers further optimize their business architecture. On the other hand, by understanding the forwarding paths in depth, customer O&M staff and Alibaba Cloud developers know at which points on the link to deploy observation manually, so that they can narrow down the direction and cause of a problem.
In Flannel mode, an ECS instance has only one primary ENI and no secondary ENIs. Pods on the ECS instance and the node itself communicate with external servers through this primary ENI. ACK Flannel creates a virtual bridge, cni0, on each node as a bridge between the pod network and the primary ENI of the ECS instance.
Each node in the cluster runs a flannel agent, and each node is pre-allocated a pod CIDR block, which is a subset of the pod CIDR block of the ACK cluster.
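If you want to confirm which pod CIDR block a node has been allocated, a check like the following can be run (a minimal sketch, assuming kubectl access to the cluster; the node name is the one used in this article's example):

```bash
# List the pod CIDR block pre-allocated to each node of the cluster.
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR

# Or inspect a single node.
kubectl get node ap-southeast-1.10.0.0.180 -o jsonpath='{.spec.podCIDR}{"\n"}'
```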
The network namespace of the container contains a virtual network interface, eth0, and a default route whose next hop points to this interface. The interface serves as the ingress and egress for data exchanged between the container and the host kernel; the data link between the container and the host goes through a veth pair. Now that we have found one end of the veth pair, how do we find the other end?
As shown in the figure, running ip addr in the network namespace of the container shows eth0@if81, where '81' is the interface index of the peer and helps us find the other end of the veth pair in the ECS OS. In the ECS OS, running ip addr | grep 81: finds the virtual interface vethd7e7c6fd, which is the other end of the veth pair on the ECS OS side.
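The same lookup can be reproduced from the command line. The sketch below assumes root access on the node and that the container's PID on the host is already known (503478 for the centos pod later in this article); the peer index 81 is taken from this example:

```bash
PID=503478   # example: PID of the container process on the host

# Inside the container's network namespace, eth0 appears as eth0@if81,
# i.e. its veth peer has interface index 81 in the host namespace.
nsenter -t "$PID" -n ip addr show eth0

# On the ECS OS, the interface with index 81 is the other end of the
# veth pair (vethd7e7c6fd in this example).
ip addr | grep '^81:'
```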
So far, the data link between the container and the OS has been established. How does traffic in the ECS OS decide which container to go to? From the Linux routing table of the OS, we can see that all traffic destined for the pod CIDR block is forwarded to the virtual bridge cni0. cni0 then, working in bridge mode, forwards traffic destined for different pods to the corresponding vethxxx interfaces. At this point, the ECS OS and the network namespaces of the pods have a complete configuration of ingress and egress links.
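Both pieces of configuration can be checked directly on the node; a minimal sketch (the interface names are from this example):

```bash
# The host route that sends all traffic destined for the node's pod CIDR
# block to the cni0 bridge.
ip route | grep cni0

# The vethxxx interfaces attached to the cni0 bridge; each is the host
# side of one pod's veth pair.
ip link show master cni0
```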
Based on the characteristics of container networks, we can divide the network links in Flannel mode into two major SOP scenarios: Pod IP and SVC. We can subdivide them into ten different small SOP scenarios.
The data link of these ten scenarios can be summarized into the following five types of scenarios.
This scenario contains the following sub-scenarios. Data links can be summarized into one.
Two pods exist on the ap-southeast-1.10.0.0.180 node: centos-67756b6dc8-rmmxt with IP address 172.23.96.23 and nginx-7d6877d777-6jkfg with IP address 172.23.96.24.
The IP address of centos-67756b6dc8-rmmxt is 172.23.96.23, the PID of the container on the host is 503478, and the container's network namespace has a default route pointing to the container's eth0.
The veth pair peer of this container's eth0 in the ECS OS is vethd7e7c6fd.
Through the same method, we can find that the IP address of nginx-7d6877d777-6jkfg is 172.23.96.24, the PID of the container on the host is 2981608, and the veth pair peer of the container's eth0 in the ECS OS is vethd3fc7ff4.
In the ECS OS, there is a route whose destination is the pod CIDR block and whose next hop is cni0, and the cni0 bridge contains the vethxxx interfaces of the two containers.
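To confirm that traffic between the two pods on the same node actually traverses cni0, you can capture on the bridge while generating traffic from one pod to the other (a sketch, assuming tcpdump is available on the node and curl is available in the centos pod; pod names and IPs are from this example):

```bash
# On the node: capture traffic exchanged between the two pod IPs on cni0.
tcpdump -i cni0 -nn 'host 172.23.96.23 and host 172.23.96.24'

# In another terminal: generate traffic from the centos pod to the nginx pod.
kubectl exec centos-67756b6dc8-rmmxt -- curl -s 172.23.96.24
```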
▲ Diagram of Data Link Forwarding
▲ Kernel Protocol Stack Diagram
This scenario contains the following sub-scenarios; the data links can be summarized into one:
Two pods exist on the ap-southeast-1.10.0.0.180 node: centos-67756b6dc8-rmmxt with IP address 172.23.96.23 and nginx1-76c99b49df-7plsr with IP address 172.23.96.163.
The ExternalTrafficPolicy of Service nginx1 is Cluster.
The data exchange between the pod and ECS OS network space has been described in detail in 2.1 Scenario 1.
When the source pod accesses the clusterip 192.168.13.23 of the svc and the traffic reaches the ECS OS, it hits the IPVS rule and is load-balanced to one of the backend endpoints of the svc (there is only one pod in this example, so there is only one endpoint).
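The IPVS rule that performs this translation can be listed on the node where the source pod runs (a sketch; the ClusterIP is from this example, and port 80 is an assumed service port):

```bash
# The IPVS virtual service for the svc ClusterIP and its real servers
# (the backend endpoints); port 80 is assumed here.
ipvsadm -Ln -t 192.168.13.23:80

# Or dump all rules and filter by the ClusterIP.
ipvsadm -Ln | grep -A 3 '192.168.13.23'
```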
▲ Diagram of Data Link Forwarding
The VPC route table is automatically configured with custom route entries whose destination is each pod CIDR block and whose next hop is the ECS instance to which that pod CIDR block belongs. These entries are managed by ACK, which configures them by calling the VPC OpenAPI; you do not need to add or delete them manually.
▲ Kernel Protocol Stack Diagram
Node1:
The src is the source pod IP, the dst is the ClusterIP of the svc, and the reply is expected to be sent from 172.23.96.163, one of the endpoints of the svc, back to the source pod.
Node2:
The conntrack table on the ECS instance where the destination pod is located records that the destination pod is accessed by the source pod; the svc clusterip address is not recorded.
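Both observations can be reproduced with conntrack (a sketch, assuming the conntrack-tools package is installed on the nodes; IPs are from this example):

```bash
# On the source node: the entry is keyed by the svc ClusterIP, and the
# reply direction shows the endpoint that was selected.
conntrack -L -d 192.168.13.23

# On the destination node: only the source pod and the endpoint appear;
# the ClusterIP is not recorded.
conntrack -L -d 172.23.96.163
```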
This scenario contains the following sub-scenarios, and the data link can be summarized into one:
1. The client and the SVC backend pod are deployed on different nodes in the cluster, the SVC ExternalIP is used to provide external service, and ExternalTrafficPolicy is Local.
Two pods exist on the ap-southeast-1.10.0.0.180 node: centos-67756b6dc8-rmmxt with IP address 172.23.96.23 and nginx1-76c99b49df-7plsr with IP address 172.23.96.163.
The ExternalTrafficPolicy of Service nginx1 is Local.
The data exchange between the pod and ECS OS network space has been described in detail in Scenario 1 in 2.1.
IPVS Rules of the ECS Where the Source Pod Resides
When the source pod accesses the externalip 8.219.164.113 of the svc and the traffic reaches the ECS OS, it hits the IPVS rule. However, the ExternalIP rule has no backend endpoint, so there is no backend pod to forward to and the connection is refused.
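This can be seen directly on the node where the source pod runs: the IPVS virtual service exists for the ExternalIP but has no real servers behind it, so a request from the pod fails (a sketch; the ExternalIP is from this example, port 80 is assumed, and curl is assumed to be available in the centos pod):

```bash
# The virtual service for the ExternalIP exists, but its real-server list
# is empty.
ipvsadm -Ln -t 8.219.164.113:80

# A request from the centos pod is therefore refused.
kubectl exec centos-67756b6dc8-rmmxt -- curl -sv http://8.219.164.113
```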
▲ Diagram of Data Link Forwarding
▲ Kernel Protocol Stack Diagram
This scenario includes the following sub-scenarios, and the data link can be summarized into one: A. The SVC ExternalIP is accessed with ExternalTrafficPolicy set to Local, the client and server pods are deployed on different ECS instances, and the client is outside the cluster.
The Deployment is nginx1 with three pods: nginx1-76c99b49df-4zsdj and nginx1-76c99b49df-7plsr are deployed on the ap-southeast-1.10.0.1.206 ECS instance, and the remaining pod nginx1-76c99b49df-s6z79 is deployed on the other node, ap-southeast-1.10.0.1.216.
The ExternalTrafficPolicy of Service nginx1 is Local.
The data exchange between the pod and ECS OS network space has been described in detail in Scenario 1 in 2.1.
From the SLB console, you can see that the backend virtual server group of the SLB instance contains only two ECS nodes, ap-southeast-1.10.0.1.216 and ap-southeast-1.10.0.1.206. The other nodes in the cluster (such as ap-southeast-1.10.0.0.180) are not added to the backend virtual server group of the SLB instance. The IP addresses in the virtual server group are the ECS IP addresses, and the port is the service's nodeport, 32580.
Therefore, when ExternalTrafficPolicy is set to Local, only the ECS nodes where the backend pods of the Service are located are added to the backend virtual server group of the SLB instance and participate in SLB traffic forwarding; the other nodes in the cluster do not.
On the two ECS instances in the SLB virtual server group, you can see that the IPVS forwarding rules for nodeip:nodeport are different. When ExternalTrafficPolicy is set to Local, only the backend pods running on the node itself are added to that node's IPVS forwarding rule, and backend pods on other nodes are not. This ensures that a link forwarded by SLB to a node is only forwarded to a pod on that node and not on to other nodes, as shown in the sketch after the node list below.
node1: ap-southeast-1.10.0.1.206
node2: ap-southeast-1.10.0.1.216
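The difference can be verified by listing the nodeport virtual service on each of the two nodes; in Local mode, each node lists only its own local pods as real servers (a sketch; the nodeport 32580 is from this example, and the node IPs 10.0.1.206 and 10.0.1.216 are assumed from the node names):

```bash
# On ap-southeast-1.10.0.1.206: only the two pods running on this node
# appear as real servers for the nodeport.
ipvsadm -Ln -t 10.0.1.206:32580

# On ap-southeast-1.10.0.1.216: only the single local pod appears.
ipvsadm -Ln -t 10.0.1.216:32580
```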
▲ Data Link Forwarding Diagram
This figure shows that only the ECS instances where the backend pods are deployed are added to the SLB backend virtual server group. When the SVC's externalIP (SLB IP) is accessed from outside the cluster, the data link is only forwarded to the ECS instances in the virtual server group and not to other nodes in the cluster.
▲ Kernel Protocol Stack Diagram
Node:
The src is the external client IP, the dst is the node IP, and the dport is the nodeport of the SVC. The reply is expected to be sent from pod 172.23.96.82 on this ECS instance back to the source.
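The preserved client address can be checked on the backend node with conntrack (a sketch; the nodeport 32580 is from this example and conntrack-tools is assumed to be installed):

```bash
# On an ECS instance in the SLB virtual server group: the original
# (external) client IP is still the source of the connection to the
# nodeport, i.e. the client IP is preserved in Local mode.
conntrack -L -p tcp --dport 32580
```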
This scenario contains the following sub-scenarios, and data links can be summarized into one.
1. The SVC ExternalIP is accessed with ExternalTrafficPolicy set to Cluster, the client and server pods are deployed on different ECS instances, and the client is outside the cluster.
The Deployment is nginx1 with three pods: nginx1-76c99b49df-4zsdj and nginx1-76c99b49df-7plsr are deployed on the ap-southeast-1.10.0.1.206 ECS instance, and the remaining pod nginx1-76c99b49df-s6z79 is deployed on the other node, ap-southeast-1.10.0.1.216.
The ExternalTrafficPolicy of Service nginx2 is Cluster.
The data exchange between the pod and ECS OS network space has been described in detail in Scenario 1 in 2.1.
From the SLB console, you can see that all nodes in the cluster, ap-southeast-1.10.0.0.180, ap-southeast-1.10.0.1.216, and ap-southeast-1.10.0.1.206, are added to the SLB virtual server group. The IP addresses in the virtual server group are the ECS IP addresses, and the port is the service's nodeport, 30875.
Therefore, when ExternalTrafficPolicy is set to Cluster, all ECS nodes in the cluster are added to the backend virtual server group of the SLB instance and participate in SLB traffic forwarding.
On the ECS instances in the SLB virtual server group, you can see that the IPVS forwarding rules for nodeip:nodeport are consistent. When ExternalTrafficPolicy is set to Cluster, all backend pods of the service are added to the IPVS forwarding rules of every node. Even if a node hosts a backend pod, traffic is not necessarily forwarded to the pod on that node; it may be forwarded to backend pods on other nodes, as shown in the sketch after the node list below.
node1: ap-southeast-1.10.0.1.206 (this node has a backend pod)
node2: ap-southeast-1.10.0.1.216 (this node has a backend pod)
node3: ap-southeast-1.10.0.0.180 (this node does not have a backend pod)
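This can be confirmed on any of the three nodes: in Cluster mode every node's nodeport rule lists all three backend pods, including on the node that hosts none of them (a sketch; the nodeport 30875 is from this example, and the node IPs are assumed from the node names):

```bash
# Run on each node; all three list the same three real servers.
ipvsadm -Ln -t 10.0.1.206:30875
ipvsadm -Ln -t 10.0.1.216:30875
ipvsadm -Ln -t 10.0.0.180:30875
```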
▲ Diagram of Data Link Forwarding
This figure shows that all ECS instances in the cluster are added to the SLB backend virtual server group. When the SVC's externalIP (SLB IP) is accessed from outside the cluster, the data traffic may first be forwarded to any node and then forwarded again to backend pods on other nodes.
The diagram of the kernel protocol stack has been described in detail in Scenario 1 in 2.4.
Link 1:
ap-southeast-1.10.0.0.180:
When the data link corresponds to link 1 in the diagram, you can see that the data link reaches the ap-southeast-1.10.0.0.180 node, which has no backend pod of the service. From the conntrack information, you can see the following:
The src is the external client IP, the dst is the node IP, and the dport is the nodeport of the SVC. The reply is expected to be sent from 172.23.96.163 back to 10.0.0.180. From the preceding information, we know that 172.23.96.163 is the nginx1-76c99b49df-7plsr pod, which is deployed on ap-southeast-1.10.0.1.206.
ap-southeast-1.10.0.1.206:
From the conntrack table of this node, you can see that the src is the node ap-southeast-1.10.0.0.180, the dst is port 80 of 172.23.96.163, and the reply packet is returned to node ap-southeast-1.10.0.0.180.
In summary, we can see that the src has been rewritten several times along the way, so the real client IP is lost in Cluster mode.
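The SNAT on the forwarding node can be observed with conntrack on ap-southeast-1.10.0.0.180 (a sketch; the nodeport 30875 is from this example):

```bash
# On the node without a local backend pod: the original tuple still shows
# the external client as src, while the reply tuple shows the backend pod
# 172.23.96.163 answering to the node's own IP, i.e. the traffic was
# SNAT'ed and the client IP is not visible to the backend.
conntrack -L -p tcp --dport 30875
```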
Link 2:
The src is the external client IP, the dst is the node IP, and the dport is the nodeport of the SVC. The reply is expected to be sent from pod 172.23.96.82 on this ECS instance to 172.23.96.65, an address inside the SLB cluster.
This article focuses on the data plane forwarding paths of ACK in Flannel mode in different SOP scenarios. With the development of microservices and cloud-native technology, network scenarios have become complex. Flannel, as a Kubernetes-native network model, can be divided into 10 SOP scenarios for different access environments, which in-depth analysis summarizes into five. The forwarding links, technical implementation principles, and cloud product configurations of these five scenarios are sorted out and summarized here, providing preliminary guidance for dealing with link jitter, optimal configuration, link principles, and similar issues under the Flannel architecture. The next article will cover Terway, the CNI developed by Alibaba Cloud, which is currently the most widely used mode for online clusters.