
Achieving Business Continuity at the Edge with OpenYurt

This article discusses problems with Kubernetes in unstable network environments and OpenYurt Edge’s self-consistent capabilities that ensure business continuity.

By Lu Chen and Dong Chen

Background

The OpenYurt project aims to decentralize the powerful control capabilities of cloud Kubernetes to the edge side and integrate massive heterogeneous edge resources into a unified edge computing platform. However, some characteristics of edge scenarios do not match the assumptions of Kubernetes, which was designed to run in the cloud. These are the gaps OpenYurt needs to close, and edge autonomy was created in this context.

Unlike the secure and stable network environment of the cloud, edge nodes are usually not on the same network plane as cloud nodes and must connect to the cloud over the public Internet. The public network connection brings several problems that this article will discuss, such as the high cost of public network traffic, the need for cross-domain communication, and the instability of public network connections. The OpenYurt system addresses all of these problems.

Today, we want to share the OpenYurt community's thinking on these questions and the edge autonomy capability OpenYurt designed for them.

Problems with Kubernetes in an Unstable Network Environment

Let's first see how native Kubernetes behaves in an unstable network environment. When the network connection of a node is interrupted, the Kubernetes cluster performs a series of actions [1] to handle the event, governed by a few default timing parameters (sketched after the list).

  1. The kubelet on the node detects the network problem within 10s and updates the NodeStatus, but it cannot report the status to the control plane because the network is disconnected.
  2. The NodeLifecycle controller on the control plane fails to receive the node's heartbeat for 40s. The node status is changed to NotReady, and no new pods are scheduled to the node.
  3. The NodeLifecycle controller on the control plane fails to receive the node's heartbeat for five minutes and starts to evict all pods on the node.
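
As a quick reference, here is a minimal Go sketch that restates the well-known defaults behind this timeline. The constant names are ours, chosen for illustration; the real values come from kubelet and kube-controller-manager flags.

```go
package main

import (
	"fmt"
	"time"
)

// Default timings behind the timeline above (illustrative constant names).
const (
	// kubelet --node-status-update-frequency: how often the kubelet
	// refreshes NodeStatus and tries to report it to the APIServer.
	nodeStatusUpdateFrequency = 10 * time.Second

	// kube-controller-manager --node-monitor-grace-period: how long the
	// NodeLifecycle controller tolerates missing heartbeats before
	// marking the node NotReady.
	nodeMonitorGracePeriod = 40 * time.Second

	// kube-controller-manager --pod-eviction-timeout: how long a node may
	// stay NotReady before its pods are evicted.
	podEvictionTimeout = 5 * time.Minute
)

func main() {
	fmt.Println("status refreshed every:", nodeStatusUpdateFrequency)
	fmt.Println("marked NotReady after:", nodeMonitorGracePeriod)
	fmt.Println("pods evicted after a further:", podEvictionTimeout)
}
```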

When a node fails to report its heartbeat, the Kubernetes cluster concludes that the node is abnormal and, as an abnormal resource, no longer suitable for hosting upper-layer applications. This approach is appropriate for machines that are online 24 hours a day in a data center, but it is questionable in the complex network environments of edge scenarios.

First, in some edge scenarios, edge nodes must actively interrupt their network connections to support disconnected maintenance. In this case, native Kubernetes will evict the edge containers, and some edge components will report errors or exit because they cannot connect to the APIServer or synchronize resources. This is unacceptable. Looking deeper, there are two possible reasons why a node cannot report its heartbeat: either the machine has failed, taking all its workloads with it, or the machine is still running normally but the network is disconnected. Kubernetes does not differentiate between the two cases and simply sets any node without a heartbeat to NotReady. However, in edge scenarios, network disconnection is common or even a requirement. We should distinguish the two causes and migrate and rebuild pods only when the node has actually failed.

Second, there is a typical class of edge services that requires pods not to be evicted even when a node fails; these services need specific pods bound to specific nodes. For example, an image-processing application needs to be bound to the machine attached to its camera, and a smart-transportation application needs to stay on the machine at a particular intersection. Such node binding violates the Kubernetes design principle of decoupling upper-layer applications from underlying resources, but it is a real requirement of edge services and needs to be supported by OpenYurt.

Finally, we need to consider the combination of network disconnection and restart. In the native Kubernetes architecture, the node agent (kubelet) keeps container information in memory, and service data cannot be fetched from the cloud while the network is disconnected. If the edge node, or the kubelet on it, restarts abnormally, the service containers cannot be recovered.

OpenYurt Edge’s Self-Consistent Capabilities to Ensure Business Continuity

The requirement for edge autonomy can be summarized in one sentence: ensure that edge services keep running in weak-network or even disconnected environments. To achieve this within the Kubernetes system, we need to solve the following problems:

  1. When a node is abnormal or restarted, its in-memory data is lost, and business containers cannot be recovered while the network is disconnected.
  2. When the network is disconnected for a long time, the cloud controller evicts the business containers.
  3. Certain edge services must be bound to specific edge nodes.

OpenYurt offers a complete set of solutions from cloud to edge to address the challenges of edge autonomy.

Edge-Side Data Cache

[Figure 1: YurtHub data caching and request proxying on the edge node]

At the edge, OpenYurt introduces an important component: YurtHub. YurtHub provides data caching and request proxying on edge nodes: communication between the cloud and the node's system components (such as the kubelet) or business containers is proxied through it. Its behavior in the three network states is listed below, followed by a simplified sketch of the caching logic.

  1. When the cloud-edge network is normal, YurtHub acts as a transparent gateway with a data cache: it forwards requests to the cloud and caches the returned data.
  2. When the cloud-edge network is disconnected, YurtHub serves requests from the local cache, so edge components can still obtain the resources they need. If a node or component restarts at this time, the edge services can be recovered from the local data cache without relying on data from the cloud.
  3. After communication with the cloud is restored, YurtHub switches back to forwarding requests to the cloud and updates the local cache as responses come in.
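
To make the cache-and-fallback pattern concrete, here is a deliberately simplified Go sketch. It is not YurtHub's actual implementation (YurtHub caches structured Kubernetes objects on local disk and proxies authenticated APIServer traffic); it only illustrates forwarding requests while the cloud is reachable and serving the last good response when it is not. The cloud endpoint is hypothetical.

```go
package main

import (
	"io"
	"net/http"
	"sync"
)

// cachingProxy forwards requests to a cloud endpoint and remembers the last
// successful response per path; when the cloud is unreachable, it serves the
// cached copy instead. A toy model of YurtHub's cache-and-fallback behavior.
type cachingProxy struct {
	cloudBase string // hypothetical cloud APIServer address
	mu        sync.RWMutex
	cache     map[string][]byte // path -> last good response body
}

func (p *cachingProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if resp, err := http.Get(p.cloudBase + r.URL.Path); err == nil {
		defer resp.Body.Close()
		if body, readErr := io.ReadAll(resp.Body); readErr == nil && resp.StatusCode == http.StatusOK {
			p.mu.Lock()
			p.cache[r.URL.Path] = body // refresh the local copy
			p.mu.Unlock()
			w.Write(body)
			return
		}
	}
	// Cloud unreachable (or bad response): fall back to the local cache.
	p.mu.RLock()
	body, ok := p.cache[r.URL.Path]
	p.mu.RUnlock()
	if ok {
		w.Write(body) // serve the last known good data
		return
	}
	http.Error(w, "cloud unreachable and no local cache", http.StatusBadGateway)
}

func main() {
	proxy := &cachingProxy{
		cloudBase: "https://apiserver.example.com", // hypothetical endpoint
		cache:     make(map[string][]byte),
	}
	http.ListenAndServe("127.0.0.1:10261", proxy) // port chosen for illustration
}
```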

YurtHub solves the disconnection-and-restart problem (problem 1), and this extra layer of encapsulation in front of the APIServer is also the foundation for many other important OpenYurt capabilities [2].

Centralized Heartbeat Proxy Mechanism

OpenYurt enhances the pod eviction policy of native Kubernetes. In native Kubernetes, if an edge node's heartbeat is not reported for a certain period, the cloud controller evicts the pods on the node (deletes them and rebuilds them on healthy nodes). Edge businesses have different expectations in cloud-edge collaboration scenarios: some expect that when the cloud-edge network is disconnected and the heartbeat cannot be reported (while the node itself is healthy), the business pods are kept in place (no eviction occurs), and pods are migrated and rebuilt only when the node actually fails.

OpenYurt 1.2 provides a centralized heartbeat proxy mechanism based on Pool-Coordinator and YurtHub, as shown in the following figure:

[Figure 2: centralized heartbeat proxy mechanism]

  • When a node's cloud-edge network is normal, the kubelet reports the heartbeat through the YurtHub component to both the cloud and Pool-Coordinator.
  • When a node's cloud-edge network is disconnected, the kubelet can no longer report the heartbeat to the cloud through YurtHub. The heartbeat reported to Pool-Coordinator then carries a specific tag.
  • The leader YurtHub lists/watches the heartbeat data in Pool-Coordinator in real time. When the heartbeat data it obtains carries the specific tag, the leader YurtHub forwards that heartbeat to the cloud (a simplified sketch follows this list).
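
The sketch below illustrates the forwarding step in Go using client-go, treating heartbeats as Kubernetes Lease objects in the kube-node-lease namespace (which is how modern kubelets report heartbeats). The label key is hypothetical, and the real leader YurtHub involves leader election and more careful conflict handling; this is only a sketch of the idea.

```go
package heartbeatproxy

import (
	"context"
	"log"

	coordv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical label marking leases written by nodes whose cloud connection
// is down; the actual tag used by YurtHub may differ.
const delegateLabel = "openyurt.io/delegate-heartbeat"

// forwardDelegatedHeartbeats watches node Leases (Kubernetes heartbeats) in
// the pool-coordinator and refreshes the matching Lease in the cloud on the
// node's behalf, so the NodeLifecycle controller keeps seeing heartbeats.
func forwardDelegatedHeartbeats(ctx context.Context, pool, cloud kubernetes.Interface) error {
	w, err := pool.CoordinationV1().Leases("kube-node-lease").Watch(ctx, metav1.ListOptions{
		LabelSelector: delegateLabel + "=true", // only heartbeats tagged as delegated
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		lease, ok := ev.Object.(*coordv1.Lease)
		if !ok {
			continue
		}
		// Read the cloud's copy of the lease and refresh its renew time.
		cl, err := cloud.CoordinationV1().Leases("kube-node-lease").Get(ctx, lease.Name, metav1.GetOptions{})
		if err != nil {
			log.Printf("get cloud lease %s: %v", lease.Name, err)
			continue
		}
		cl.Spec.RenewTime = lease.Spec.RenewTime
		if _, err := cloud.CoordinationV1().Leases("kube-node-lease").Update(ctx, cl, metav1.UpdateOptions{}); err != nil {
			log.Printf("forward heartbeat for %s: %v", lease.Name, err)
		}
	}
	return nil
}
```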

The heartbeat proxy mechanism implemented by Pool-Coordinator and YurtHub ensures that a node's heartbeat keeps reaching the cloud even when the node's cloud-edge network is down, so the pods on the node are not evicted (problem 2). At the same time, a node whose heartbeat is being reported by proxy is also given special taints in real time so that no new pods are scheduled to it.

Node Binding

Some edge services require that pods not be evicted even when a node fails, i.e., that the service be bound to the node. OpenYurt approaches this from two perspectives.

The first is from the perspective of the node: you want all pods on a machine to be bound to that machine. In this case, we can mark the node with node.beta.openyurt.io/autonomy=true.
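
For illustration, here is a minimal client-go sketch that applies this marker as a node annotation (the equivalent of a single kubectl annotate command; check your OpenYurt version's documentation for the exact key and whether it is applied as a label or an annotation):

```go
package autonomy

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markNodeAutonomous marks a node so that the cloud keeps its pods in place
// when heartbeats stop. Roughly equivalent to:
//   kubectl annotate node <name> node.beta.openyurt.io/autonomy=true
func markNodeAutonomous(ctx context.Context, cs kubernetes.Interface, name string) error {
	patch := []byte(`{"metadata":{"annotations":{"node.beta.openyurt.io/autonomy":"true"}}}`)
	_, err := cs.CoreV1().Nodes().Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```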

The second perspective is that of the workload. For example, the smart-transportation business mentioned above wants its lifecycle to follow the lifecycle of the node it runs on. OpenYurt 1.2 adds the apps.openyurt.io/binding label for this purpose: if a pod carries this label, it needs node-binding capability.

In both methods, the binding capability is implemented by adding tolerations to the corresponding pods, as sketched below.
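
For context, pod eviction in current Kubernetes is taint-based: the NodeLifecycle controller places NoExecute taints such as node.kubernetes.io/not-ready and node.kubernetes.io/unreachable on unhealthy nodes, and a pod that tolerates those taints indefinitely is never evicted. The sketch below shows tolerations of that shape; in OpenYurt the tolerations are injected automatically, so this is only to illustrate the mechanism:

```go
package autonomy

import corev1 "k8s.io/api/core/v1"

// bindingTolerations returns tolerations that keep a pod on a node that has
// become NotReady or unreachable: with no TolerationSeconds set, the pod
// tolerates the NoExecute taints forever, so the controller never evicts it.
func bindingTolerations() []corev1.Toleration {
	return []corev1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute},
		{Key: "node.kubernetes.io/unreachable", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute},
	}
}
```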

Summary

The cloud-edge network connection is unstable in edge scenarios, so the edge needs a certain degree of autonomy when cloud support is unavailable. Based on the native Kubernetes architecture, OpenYurt provides a non-intrusive solution to several pain points of edge autonomy, such as node restarts, pod eviction, and binding services to nodes.

OpenYurt 1.2 enhances edge autonomy with the Pool-Coordinator + YurtHub architecture. There is still plenty of room for improvement in the edge autonomy field: for example, in addition to keeping existing pods running while disconnected, OpenYurt will provide node pool O&M capabilities in later versions. Interested developers are welcome to participate in building OpenYurt and exploring the de facto standard for a stable, reliable, non-intrusive cloud-native edge computing platform.

References

[1] A series of actions to handle the event
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/589-efficient-node-heartbeats/README.md

[2] Many other important OpenYurt capabilities
https://openyurt.io/docs/core-concepts/yurthub/
