By Tongtao
A container is a virtualization technology that packages an application together with all of its dependencies and configuration so that it can run in different computing environments. Software containers provide a lightweight and consistent runtime environment, making applications more portable and reliable during development, testing, and deployment.
In 1974, Popek and Goldberg clearly set out three conditions for a virtualized system structure in their paper Formal Requirements for Virtualizable Third Generation Architectures.
Based on these conditions, virtualization can be divided into two types: the bare-metal (Type 1) hypervisor and the hosted (Type 2) hypervisor.
A Type 1 hypervisor runs directly on the physical hardware without requiring an underlying operating system. This type is typically used for enterprise-level virtualization platforms such as VMware ESXi and Microsoft Hyper-V.
A Type 2 hypervisor runs as an application in an operating system. This type is commonly used in development and testing environments such as Oracle VirtualBox and VMware Workstation.
Two types of hypervisors
Three classic implementations
A Type 2 hypervisor is a full software simulation that runs on top of a host operating system and has the following characteristics:
1. OS-based Runtime: A Type 2 hypervisor is installed as a software application on the host operating system, which manages the hardware resources and provides services to the hypervisor and its virtual machines.
2. Performance Overhead: Since a Type 2 hypervisor runs within the host operating system, there is an additional layer between the virtual machine and the physical hardware. This results in higher performance overhead than a Type 1 hypervisor that runs directly on the hardware.
3. Ease of Use and Convenient Installation: A Type 2 hypervisor is generally easier to install and configure than a Type 1 hypervisor. Users can install a Type 2 hypervisor on a standard operating system just like a regular software application.
4. Usage: Type 2 hypervisors are typically used in development, testing, and desktop virtualization scenarios. They provide users with a convenient way to run multiple operating systems on a single machine without dedicated hardware or complex configuration.
5. Isolation: Each virtual machine created by a Type 2 hypervisor is isolated from other virtual machines and the host system. This enables users to try different operating systems, configurations, and applications in a controlled environment.
Common Type 2 hypervisor products include Oracle VirtualBox, VMware Workstation, and Parallels Desktop.
Hardware virtualization is a technology that abstracts and partitions physical computing resources to create multiple independent virtual environments. The goal of this virtualization is to run multiple operating systems and applications on the same physical hardware, achieving more efficient use of hardware resources. This typically involves dividing the physical hardware into multiple virtual environments through a software layer called a hypervisor (also known as a virtual machine monitor, VMM).
Common Type 1 hypervisor products include VMware ESXi, Microsoft Hyper-V, and Xen.
The following gives a brief introduction to the evolution from virtualization to containerization over time:
In 1979, Unix V7 introduced chroot, which builds an independent virtual filesystem view for applications.
In 1999, FreeBSD 4.0 introduced Jail, the first commercial OS-level virtualization technology.
In 2004, Solaris 10 introduced Solaris Zones, the second commercial OS-level virtualization technology.
In 2005, OpenVZ was released, a very important pioneer of OS-level virtualization on Linux.
From 2004 to 2007, OS-level virtualization technologies such as cgroups were used at scale within Google.
In 2006, the Process Container technology used within Google was open-sourced and later renamed cgroups.
In 2008, cgroup functionality was merged into the Linux kernel mainline.
In 2008, the LXC (Linux Containers) project produced the first prototype of Linux containers.
In 2011, Cloud Foundry developed the Warden system, a complete prototype of a container management system.
In 2013, Google open-sourced its internal container system as Let Me Contain That For You (lmctfy).
In 2013, the Docker project was officially released, allowing Linux container technology to gradually take the world by storm.
In 2014, the Kubernetes project was officially released, and container technology and orchestration systems began to grow hand in hand.
In 2015, the Cloud Native Computing Foundation (CNCF) was jointly founded by Google, Red Hat, Microsoft, and several major cloud vendors, starting the wave of cloud-native.
From 2016 to 2017, the container ecosystem began to move towards modularization and standardization. The CNCF accepted containerd and rkt projects, the Open Container Initiative (OCI) released version 1.0, and both the Container Runtime Interface (CRI) and the Container Network Interface (CNI) were widely supported.
From 2017 to 2018, the commercialization of container services developed significantly. AWS EKS, Google GKE, Alibaba ACK/ASK/ECI, Huawei CCI, Oracle Container Engine for Kubernetes, VMware, Red Hat, and Rancher began to offer commercial services based on Kubernetes.
From 2017 to 2019, container engine technology developed rapidly and new technologies continuously emerged. At the end of 2017, the Kata Containers community was established. In May 2018, Google open-sourced gVisor. In November 2018, AWS open-sourced Firecracker, and Alibaba Cloud released Sandboxed-Container V1.
From 2020 to 202x, container engine technology has continued to be upgraded: Kata Containers moved to its 2.0 architecture, and Alibaba Cloud released Sandboxed-Container V2.
The development of container technology over the past 20 years can be roughly divided into four stages.
Currently, the container ecosystem is primarily based on Kubernetes.
Kubernetes ecosystem
cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage of a collection of processes. cgroups provide fine-grained control over system resources (such as CPU, memory, and disk I/O) and allow system administrators to allocate and limit resources for a group of processes. The feature was proposed by Google in 2007 and merged into Linux kernel 2.6.24 in 2008.
The main use cases include resource limiting, prioritization, resource accounting, and process control (for example, freezing and resuming a group of processes).
cgroups are organized into subsystems (also called controllers), such as cpu and memory.
Each hierarchy is a tree structure, where each node of the tree is a cgroup structure (for example, cpu_cgrp and memory_cgrp). Suppose the first cgroup hierarchy is attached to the cpu subsystem and the cpuacct subsystem; then the cgroup structures within that hierarchy can limit CPU resources and account for the CPU usage of processes. Suppose the second cgroup hierarchy is attached to the memory subsystem; then the cgroup structures within that hierarchy can limit memory resources.
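As a concrete illustration of the filesystem interface behind these hierarchies, the following is a minimal Go sketch (assuming a Linux host with the cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory and root privileges; the group name demo is made up) that creates a cgroup, caps its memory, and moves the current process into it:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// Sketch: create a cgroup in the v1 memory hierarchy, cap it at
// 256 MiB, and move the current process into it. Assumes cgroup v1
// is mounted at /sys/fs/cgroup and the program runs as root.
func main() {
	cg := "/sys/fs/cgroup/memory/demo" // hypothetical group name

	// Creating the directory is all it takes to create the cgroup;
	// the kernel populates it with control files automatically.
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}

	// Limit the group to 256 MiB of memory.
	limit := strconv.Itoa(256 * 1024 * 1024)
	if err := os.WriteFile(filepath.Join(cg, "memory.limit_in_bytes"), []byte(limit), 0o644); err != nil {
		panic(err)
	}

	// Writing a PID to cgroup.procs moves that process into the group.
	pid := strconv.Itoa(os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644); err != nil {
		panic(err)
	}

	fmt.Println("process", pid, "now runs under", cg)
}
```

Every knob is an ordinary file, which is why container runtimes can manage cgroups with nothing more than file reads and writes.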
cgroups have been used for over two years in the Alibaba Cloud Security Center agent, which greatly improves the agent's stability and reduces its impact on customers' normal business.
Linux namespaces provide a kernel-level method to isolate system resources by placing the global resources of the system in different namespaces.
| Type | Description |
|---|---|
| Cgroup | cgroup root directory |
| IPC | System V IPC, POSIX message queues |
| Network | Network devices, stacks, ports, etc. |
| Mount | Mount points |
| PID | Process IDs |
| User | User and group IDs |
| UTS | System hostname and NIS (Network Information Service) hostname (sometimes called the domain name) |
| Time | Clocks |
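To see how these namespace types are consumed in practice, here is a minimal Linux-only Go sketch (run as root; the classic build-your-own-container demonstration, not production code) that starts a shell in fresh UTS, PID, and mount namespaces:

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

// Sketch: start a shell in new UTS, PID, and mount namespaces
// (Linux only, needs root). Inside the shell, `hostname demo` no
// longer affects the host, and in a freshly mounted /proc the
// shell appears as PID 1.
func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // hostname isolation
			syscall.CLONE_NEWPID | // process ID isolation
			syscall.CLONE_NEWNS, // mount point isolation
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Adding CLONE_NEWNET, CLONE_NEWIPC, or CLONE_NEWUSER to the flag set would isolate the corresponding rows of the table above in the same way.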
Containers are isolated through Linux namespaces, so Linux network virtualization is naturally responsible for the communication between isolated containers, between containers and the host machine, and even between containers across different physical networks.
The main technologies of Linux network virtualization are the Network Namespace and various virtual devices such as Veth, Linux Bridge, and TAP/TUN. The essence of virtualization is the mapping of the real world. These virtual devices cooperate like physical devices in the real world and connect separate namespaces to build various network topologies that are not limited by the physical environment.
Isolation of namespaces on the network
Veth: Virtual Ethernet, virtual Ethernet devices used to allow two isolated network namespaces to communicate with each other. Since they are always created in pairs, they are also known as veth pairs (veth-pair).
Linux Bridge: At the host level, if multiple hosts need to be networked, a switch (a Layer 2 device) is required. In the Linux virtual network system, this role is played by the virtual network bridge. Linux Bridge is a Layer 2 forwarding tool that has been provided by the Linux kernel since version 2.2. It works like a physical switch and supports connecting any Layer 2 network device, whether a real physical device such as eth0 or a virtual device such as veth and tap. However, Linux Bridge differs slightly from a common physical switch: a common switch only performs simple Layer 2 forwarding, while Linux Bridge can also deliver received packets to the host's Layer 3 protocol stack.
TUN/TAP: TUN and TAP are two relatively independent virtual network devices provided by Linux. TUN simulates a network layer device, works at Layer 3, and operates on IP packets, while TAP simulates an Ethernet device, works at Layer 2, and operates on Ethernet frames. VXLAN, the fundamental protocol of current cloud networks and the basis of cloud SDN (Software-Defined Networking), is implemented on top of this kind of tunneling technology.
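The following hedged Go sketch wires these pieces together by shelling out to the iproute2 ip tool (Linux, root required; the namespace name ns1, the device names veth0/veth1, and the 10.0.0.0/24 addresses are arbitrary examples). It creates a network namespace, connects it to the host with a veth pair, and verifies connectivity:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// run executes one iproute2 command and aborts on failure.
func run(cmdline string) {
	args := strings.Fields(cmdline)
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("%s: %v\n%s", cmdline, err, out))
	}
}

func main() {
	run("ip netns add ns1")                            // new network namespace
	run("ip link add veth0 type veth peer name veth1") // veth pair
	run("ip link set veth1 netns ns1")                 // move one end into ns1
	run("ip addr add 10.0.0.1/24 dev veth0")           // host-side address
	run("ip link set veth0 up")
	run("ip netns exec ns1 ip addr add 10.0.0.2/24 dev veth1")
	run("ip netns exec ns1 ip link set veth1 up")
	run("ip netns exec ns1 ip link set lo up")
	run("ping -c 1 10.0.0.2") // the host reaches into ns1 via the veth pair
	fmt.Println("ns1 is reachable at 10.0.0.2")
}
```

With more namespaces, the veth host-side ends would be attached to a Linux Bridge instead of addressed directly, which is exactly how Docker's default bridge network is built.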
Docker was designed with the slogan "Build, Ship, and Run Any App, Anywhere."
Docker started in 2013 and has experienced 10 years of development. The following figure demonstrates the changes that have occurred after Docker's integration with Kubernetes.
Milestones:
• In June 2015, DockerCon announced the promotion of container standards and the establishment of the OCI (Open Container Initiative).
• In December 2015, the runc project was open-sourced, and in 2019, it was taken over by the OCI and integrated into the OCI ecosystem.
• In February 2017, Docker open-sourced containerd, and in March it became an incubating project of the CNCF.
• In March 2017, CRI v1.0 was released, defining the standard interface between Kubernetes and container runtimes.
Many components, such as dockerd and containerd-shim, have gradually been phased out in the container ecosystem.
With an understanding of Docker's development, we can divide container runtimes into two types according to their features: high-level runtimes, such as containerd, which handle image management, storage, and API access; and low-level runtimes, such as runc, which actually create and run containers.
These two types of runtimes collaborate according to their responsibilities to manage the entire lifecycle of containers.
Early Kubernetes was completely dependent on and bound to Docker, with little consideration given to the possibility of using other container engines in the future. At that time, Kubernetes created and managed containers by calling the Docker API directly through the internal DockerManager.
After Docker became popular, CoreOS released the rkt runtime, and Kubernetes implemented support for rkt. With the vigorous development of container technology, more and more runtime implementations appeared. If Kubernetes had remained strongly bound to Docker, the adaptation workload would have become enormous, so it needed to reconsider compatibility and adaptation for all container runtimes.
Starting from version 1.5, Kubernetes, following the OCI spirit of standardization, abstracts container operations into an interface that serves as a bridge between the kubelet and runtime implementations. The kubelet initiates container startup and management by sending interface requests, and every container runtime that implements this interface can integrate with Kubernetes. This interface is referred to as the CRI (Container Runtime Interface).
CRI is implemented as a set of APIs defined by a protocol buffer, as shown in the following figure:
As can be seen from the figure above, CRI mainly involves three components: a gRPC client, a gRPC server, and the specific container runtime implementation. The kubelet acts as the gRPC client that calls the CRI, and the CRI shim acts as the gRPC server that responds to CRI requests and converts them into concrete runtime management operations. Therefore, any container runtime that wants to integrate with Kubernetes needs to implement a CRI shim (a gRPC server) based on the CRI specification.
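To make the client/server split concrete, the sketch below plays the kubelet's role as a CRI gRPC client (a minimal sketch assuming the k8s.io/cri-api and google.golang.org/grpc Go modules, and containerd listening on its default socket /run/containerd/containerd.sock):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// Sketch: act as a CRI gRPC client, the same role the kubelet plays.
// Assumes containerd is serving CRI on its default Unix socket.
func main() {
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Version is the simplest CRI call: it identifies the runtime.
	v, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("runtime: %s %s (CRI %s)\n",
		v.RuntimeName, v.RuntimeVersion, v.RuntimeApiVersion)

	// ListPodSandbox shows the pods the runtime currently manages.
	pods, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println("sandboxes:", len(pods.Items))
}
```

Everything the kubelet does with containers goes through calls of this shape, which is what lets runtimes such as containerd and CRI-O be swapped in transparently.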
containerd is mainly responsible for the following tasks:
• Container Lifecycle Management. It interacts with the underlying operating system and hardware, and is responsible for the lifecycle management of containers, including creation, starting, stopping, and deletion.
• Image Management. It manages the download, storage, and loading of container images.
• Container Networking. It provides a set of interfaces that allow networking plugins to interact with containers through the Container Network Interface (CNI), enabling the network connection and configuration of containers.
• Security and Isolation. It supports the management of container security and isolation. It ensures that containers are isolated from other containers and host systems at runtime by integrating technologies such as Linux namespaces and cgroups.
• OCI Specification Support. It adheres to the OCI specifications. That is to say, it is compatible with containers and images that conform to the OCI specifications. This standardization allows containerd to integrate with other tools and platforms that conform to the same specification.
• Plug-in System. It provides a plugin system that allows users to extend the functionality as needed. This means that users can choose to use specific storage backends, loggers, and other plugins to meet their specific needs.
Source: https://containerd.io/
The official architecture diagram provided by containerd shows that it adopts a client/server (C/S) architecture. The server exposes low-level gRPC APIs through a Unix domain socket, while clients manage containers on the node through these APIs. Each containerd instance is only responsible for one machine: pulling images, operating on containers (starting, stopping, etc.), networking, and storage are all completed by containerd, while runc is responsible for actually running containers. In fact, all containers that comply with the OCI specifications can be supported.
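These low-level gRPC APIs are usually consumed through containerd's Go client. The following sketch is modeled on containerd's getting-started example (the example namespace, the redis image, and the container ID are arbitrary choices); it pulls an image, creates a container, and starts it as a task backed by runc:

```go
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

// Sketch: drive containerd's gRPC API through its Go client
// (Linux, containerd on its default socket).
func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// All containerd objects live inside a namespace.
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Image management: pull and unpack into a snapshotter.
	image, err := client.Pull(ctx, "docker.io/library/redis:alpine",
		containerd.WithPullUnpack)
	if err != nil {
		panic(err)
	}

	// Container lifecycle: container metadata plus a writable
	// snapshot, with an OCI runtime spec generated from the image.
	container, err := client.NewContainer(ctx, "redis-demo",
		containerd.WithNewSnapshot("redis-demo-snap", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)))
	if err != nil {
		panic(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	// A task is the running instance of the container.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		panic(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		panic(err)
	}
	fmt.Println("container started with PID", task.Pid())
}
```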
For decoupling, containerd divides the system into different components, each of which is completed by one or more modules (in the Core and Backend sections). Each type of module is integrated into containerd in the form of a plugin, and plugins may depend on one another. For example, in the diagram above, each long dashed box represents a type of plugin, including the Service Plugin, Metadata Plugin, GC Plugin, and Runtime Plugin, where the Service Plugin depends on the Metadata Plugin, GC Plugin, and Runtime Plugin. Each small box indicates a subdivided plugin; for example, the Metadata Plugin depends on the Containers Plugin and the Content Plugin.
• Content Plugin: It provides access to the addressable content within an image, where all immutable content is stored.
• Snapshot Plugin: It is used to manage filesystem snapshots of container images, where each layer of the image is decompressed into a file system snapshot.
As a high-level container runtime, containerd can be divided into three blocks: Storage, Metadata, and Runtimes.
Data flow for creating a bundle in containerd
Bundles refer to the configuration, metadata, and rootfs data used by the runtime. A bundle is the on-disk representation of a runtime container, which can be simplified to a directory in the file system.
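As a hedged illustration, the Go sketch below lays out such a bundle: a rootfs/ directory (left unpopulated here) plus a default config.json generated by the `runc spec` command. The /tmp/mybundle path is made up:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// Sketch: lay out an OCI bundle on disk. A bundle is just a
// directory holding rootfs/ plus config.json; `runc spec`
// generates a default config.json. Populating rootfs/ (e.g. by
// unpacking an image) is omitted here.
func main() {
	bundle := "/tmp/mybundle" // hypothetical bundle location

	// rootfs/ holds the container's root filesystem.
	if err := os.MkdirAll(filepath.Join(bundle, "rootfs"), 0o755); err != nil {
		panic(err)
	}

	// `runc spec` writes a default runtime-spec compliant
	// config.json into the bundle directory.
	out, err := exec.Command("runc", "spec", "--bundle", bundle).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("runc spec: %v\n%s", err, out))
	}

	fmt.Println("bundle ready:", bundle)
	// The container could then be started with:
	//   runc run --bundle /tmp/mybundle mycontainer
}
```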
The Linux Foundation established the OCI (Open Container Initiative) in June 2015 to create an open industry standard around container formats and runtimes.
The purpose of standardized containers is specifically divided into the following five principles.
• Operation Standardization: Standard containers define a set of standardized operations: they can be created, started, and stopped with standard container tools, copied and snapshotted with standard filesystem tools, and downloaded and uploaded with standard network tools.
• Content-independent: No matter what the specific container content is, the standard container operations can produce the same effect after they are executed. For example, the container can be uploaded and started in the same way, whether it is a PHP application or a MySQL database service.
• Infrastructure-independent: Whether on a personal laptop, AWS S3, OpenStack, or any other infrastructure, standard container operations should be supported.
• Tailored for Automation: A fundamental purpose of making container operations content-independent and infrastructure-independent is to allow container operations to be automated across platforms.
• Industry-level Delivery: A major goal of developing container standards is to make industry-grade delivery of software a reality.
The OCI contains three specifications:
• runtime-spec (Runtime Specification): Defines the configuration, execution environment, and lifecycle of a container runtime; that is, how to run a container, how to manage the container's state and lifecycle, and how to use the underlying operating system features (namespaces, cgroups, and pivot_root).
• image-spec (Image Specification): Defines the image format, the configuration (including application parameters and environment information), and the format of the dependent metadata. Simply put, it is a static description of an image (a parsing sketch follows this list).
• distribution-spec (Distribution Specification): Specifies the network interaction process for uploading and downloading images.
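For a feel of what image-spec describes, here is a self-contained Go sketch that parses a trimmed OCI image manifest. The structs are hand-rolled and mirror only the fields used here (the digests and sizes are made up; the authoritative schema lives in the opencontainers/image-spec repository):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal structs mirroring the OCI image manifest layout.
type Descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

type Manifest struct {
	SchemaVersion int          `json:"schemaVersion"`
	MediaType     string       `json:"mediaType"`
	Config        Descriptor   `json:"config"`
	Layers        []Descriptor `json:"layers"`
}

func main() {
	// A trimmed example manifest; digests and sizes are made up.
	raw := []byte(`{
	  "schemaVersion": 2,
	  "mediaType": "application/vnd.oci.image.manifest.v1+json",
	  "config": {
	    "mediaType": "application/vnd.oci.image.config.v1+json",
	    "digest": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
	    "size": 1470
	  },
	  "layers": [{
	    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
	    "digest": "sha256:1111111111111111111111111111111111111111111111111111111111111111",
	    "size": 2811478
	  }]
	}`)

	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	for _, l := range m.Layers {
		fmt.Println("layer:", l.Digest, l.Size, "bytes")
	}
}
```

Because every layer is a content-addressed blob, registries and runtimes that follow the same spec can deduplicate, cache, and verify images in the same way.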
According to the OCI specifications, there are three popular runtime implementations:
• opencontainers/runc: As mentioned many times before, it is the reference implementation for the OCI Runtime.
• kata-containers/runtime: The virtual machine's counterattack under the container standard. Formerly known as clearcontainers/runtime and hyperhq/runv, it provides high-performance, OCI-compatible hardware-virtualized containers through virtcontainers. It is only available on Linux and requires specific hardware support.
• Google/gVisor: gVisor is a user-space kernel implemented in Go, which includes an OCI-compatible runtime implementation. The goal is to provide a container runtime sandbox that can run untrusted code. Currently, it is only available for Linux, but support for other architectures may be added in the future.
Proactive defense for containers supported by Security Center
Although containers have many technical advantages, the traditional soft isolation based on a shared kernel, represented by runc, still carries certain risks. If a malicious application exploits a system defect to escape from a container, it poses a serious threat to the host. Especially in a public cloud environment, such a security threat is likely to affect the data and business of other users.
Kata Containers therefore came into being: it combines the security advantages of virtual machines with the speed and manageability of containers to provide users with a standardized, secure, and high-performance container solution.
The Kata Containers runtime complies with the OCI specifications and is compatible with the Kubernetes CRI (as a VM-level implementation of Pods). To shorten the container call chain and integrate efficiently with the Kubernetes CRI, Kata Containers directly integrates containerd-shim, kata-shim, and kata-proxy into a single entity. The integration of CRI and Kata Containers is shown in the following figure:
Container orchestration allows developers to automatically deploy, scale, and manage containerized applications. Container orchestration tools are designed to simplify and automate tasks such as communication, scheduling, scaling, and maintenance between multiple containers. These tools ensure that containers run consistently, reliably, and efficiently throughout the application lifecycle.
Common orchestration tools include Mesos, Swarm, and Kubernetes, among which Kubernetes is currently the most popular in the market.
Why does Kubernetes “win”?
Single-tenant -> Multi-tenant
ECI (Elastic Container Instance) has evolved from version 1.0 to the current 3.0 and transitioned from running on ECS to being co-located with ECS.
Hierarchical architecture:
After introducing Kata Containers, we'll next introduce RunD, Alibaba Cloud's secure container solution. RunD is a lightweight secure container runtime that proposes a host-to-guest full-stack optimization scheme to address the following three issues:
• The file system of the container can be customized according to the characteristics of read-only user images that don't need to be persisted.
• The base image of the guest operating system can be shared among multiple secure containers and compressed on demand to reduce memory overhead.
• Creating cgroups with high concurrency results in high synchronization latency, especially high scheduling overhead in high-density scenarios.
Problems solved by RunD and steps to start the Kata Containers concurrently
When using Kata as the container runtime, the concurrency bottleneck lies in creating the rootfs (red box, Step 1) and creating cgroups (red line, Step 3). The density bottleneck is due to the high memory overhead of MicroVMs (blue box, Step 2) and the overhead of scheduling and maintaining a large number of cgroups (blue box, Step 3).
The following figure shows the architecture of the RunD scheme:
Architecture of RunD
RunD designs and summarizes a host-to-guest full-stack solution. The RunD runtime provides a read-only layer through virtio-fs, creates a non-persistent read-write layer through virtio-blk with built-in storage, and mounts the two together as the final container rootfs by using overlayfs, thereby achieving read/write splitting. RunD utilizes a MicroVM template integrated with a streamlined kernel and creates new MicroVMs from pre-processed images, further amortizing the overhead across different MicroVMs. When creating a secure container, RunD binds a lightweight cgroup from the cgroup pool for resource management.
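The read/write split itself is plain overlayfs. The following Go sketch shows the general technique with the mount(2) syscall (Linux, root required; all paths are hypothetical and this is not RunD's actual code): writes land in the upper layer while the lower layer stays pristine and shareable.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Sketch of an overlayfs read/write split: a read-only lower layer
// plus a writable upper layer merged into one rootfs view.
func main() {
	lower := "/run/demo/lower"   // read-only image layer
	upper := "/run/demo/upper"   // writable, non-persistent layer
	work := "/run/demo/work"     // overlayfs scratch directory
	merged := "/run/demo/merged" // the unified rootfs the container sees

	for _, d := range []string{lower, upper, work, merged} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			panic(err)
		}
	}

	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	if err := syscall.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		panic(err)
	}
	fmt.Println("overlay rootfs mounted at", merged)
}
```

Because the lower directory is never written to, one read-only image layer can back many containers at once, which is exactly the sharing property RunD exploits.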
Based on the above optimizations, when using RunD as the secure container runtime, the secure container will be started according to the following steps:
Step 1: Once containerd receives a user call, it forwards the request to the RunD runtime.
Step 2: RunD prepares the rootfs of the runc container for the VM hypervisor. rootfs is divided into the read-only layer and the writable layer.
Step 3: The hypervisor uses the MicroVM template to create the required sandbox and mounts the rootfs to the sandbox through overlayfs.
Step 4: Finally, a lightweight cgroup is renamed from the cgroup pool and then bound to the sandbox to manage resource usage.
Summary of concurrency performance: RunD can start a single sandbox within 88 ms and concurrently start 200 sandboxes within 1 second. Compared to existing technologies, it has the lowest latency fluctuation and CPU overhead.