
Introduction to Container Technology and Its Basic Principles

This article provides a comprehensive overview of the development, key technologies, architecture, and current industry ecosystem of container technology.


By Tongtao

Introduction

What Is a Container?

A container is a virtualization technology that packages an application together with all its dependencies and configurations so that it can run in different computing environments. Software containers provide a lightweight and consistent runtime environment, making applications more portable and reliable during development, testing, and deployment.

Features of Containers

  1. Cross-platform: Containers can run on different operating systems and cloud platforms and ensure the consistency of applications across environments. This makes applications easier to port and deploy.
  2. Consistency and Repeatability: Containers package an application along with all its dependencies and configurations to ensure its consistency between development, testing, and production environments. By using containers, you can avoid issues caused by environmental differences and achieve a repeatable build and deployment process.
  3. Resource Isolation: Containers provide a certain degree of isolation, allowing multiple containers to run in parallel on the same host without impacting each other. This isolation ensures the stability and security of applications.
  4. Quick Deployment and Startup: Containers can be started in a few seconds, which is significantly faster compared to traditional virtual machines. This feature enables the deployment and scaling of applications to be faster and more flexible.
  5. High Scalability: The container architecture supports automatic scaling and allows you to dynamically adjust the number of container instances based on requirements. This feature enables applications to better respond to changes in traffic and load.
  6. Environment Isolation: Containers provide an independent runtime environment. Each container has its own file system, network, and process space. This feature helps prevent applications from impacting each other and improves system stability and security.
  7. Resource Efficiency: Containers share the kernel of the host operating system, making them more lightweight and more resource-efficient compared to virtual machines.
  8. Continuous Integration and Continuous Deployment (CI/CD): Containers integrate tightly with CI/CD tools, making it easier for development teams to implement automated build, test, and deployment processes.

Evolution and Development of Containers

Virtualization

In 1974, Popek and Goldberg clearly set out three conditions for a virtualized system structure in their paper Formal Requirements for Virtualizable Third Generation Architectures.

  1. Resource Control: The VMM must have complete control of all system resources.
  2. Equivalence: A program (including the operating system) running under the VMM should exhibit behavior essentially identical to what it would exhibit when run directly on the original machine, with the possible exception of differences caused by the availability of system resources and by timing dependencies. In addition, predefined privileged instructions should be able to execute freely.
  3. Efficiency: A statistically dominant subset of the virtual processor's instructions is executed directly by the real processor, with no software intervention by the VMM.

Based on these conditions, hypervisors can be divided into two types: the currently dominant bare-metal (Type 1) hypervisor and the hosted (Type 2) hypervisor.

Type 1 Hypervisor: Bare-Metal Hypervisor (Hardware Virtualization)

A Type 1 hypervisor runs directly on the physical hardware without requiring an underlying operating system. This type is typically used for enterprise-level virtualization platforms such as VMware ESXi and Microsoft Hyper-V.

Type 2 Hypervisor: Hosted Hypervisor (Software Virtualization)

A Type 2 hypervisor runs as an application in an operating system. This type is commonly used in development and testing environments such as Oracle VirtualBox and VMware Workstation.

[Figure: The two types of hypervisors]

[Figure: Three classic virtualization implementations]

Software Virtualization (Type 2 Hypervisor)

A Type 2 hypervisor performs virtualization entirely in software on top of a host operating system and has the following characteristics:

1. OS-based Runtime: A Type 2 hypervisor is installed as a software application on a host operating system, which manages the hardware resources and provides services to the hypervisor and its virtual machines.

2. Performance Overhead: Since the Type 2 hypervisor runs within the host operating system, there is an additional layer between the virtual machine and the physical hardware. This will result in a larger performance overhead than that of the Type 1 hypervisor that runs directly on the hardware.

3. Ease of Use and Convenient Installation: The Type 2 hypervisor is generally easier to install and configure than the Type 1 hypervisor. Users can install it on a standard operating system just like a regular software application.

4. Usage: The Type 2 hypervisor is typically used in development, testing, and desktop virtualization scenarios. It provides users with a convenient method to run multiple operating systems on a single machine without the need for dedicated hardware or complex configurations.

5. Isolation: Each virtual machine created by a Type 2 hypervisor is isolated from other virtual machines and the host system. This enables users to try different operating systems, configurations, and applications in a controlled environment.

Common products with Type 2 hypervisors include:

  1. VMware before V5.5
  2. Xen before V3.0
  3. Virtual PC 2004

Hardware Virtualization (Type 1 Hypervisor)

Hardware virtualization is a technology that abstracts and partitions physical computing resources to create multiple independent virtual environments. The goal of this virtualization is to run multiple operating systems and applications on the same physical hardware, achieving more efficient use of hardware resources. It typically involves partitioning the physical hardware into multiple virtual machines (VMs) by using a software layer, the hypervisor, that runs directly on the hardware.


Common products with Type 1 hypervisors include:

  1. VMware 5.5 and later versions
  2. Xen 3.0 and later versions
  3. Virtual PC 2005
  4. KVM

From Virtualization to Containers

Evolution of Containerization


The following gives a brief introduction to the evolution from virtualization to containerization over time:

In 1979, the Unix V7 system introduced chroot, which builds an independent virtual filesystem view for an application.
In 1999, FreeBSD 4.0 introduced Jail, regarded as the first commercial OS virtualization technology.
In 2004, Solaris 10 introduced Solaris Zones, regarded as the second commercial OS virtualization technology.
In 2005, OpenVZ was released, an important pioneer of Linux OS virtualization technology.
From 2004 to 2007, OS virtualization technologies such as process containers were used on a large scale within Google.
In 2006, the Process Container technology used within Google was open-sourced and later renamed cgroup.
In 2008, cgroup functionality was merged into the Linux kernel mainline.
In 2008, the LXC (Linux Containers) project produced a prototype of the Linux container.
In 2011, Cloud Foundry developed the Warden system, a complete prototype of a container management system.
In 2013, Google open-sourced its internal container system as Let Me Contain That For You (LMCTFY).
In 2013, the Docker project was officially released, and Linux container technology gradually took the world by storm.
In 2014, the Kubernetes project was officially released, marking the point at which container technology and orchestration systems began to grow hand in hand.
In 2015, the Cloud Native Computing Foundation (CNCF) was jointly founded by Google, Red Hat, Microsoft, and several major cloud vendors, starting the cloud-native wave.
From 2016 to 2017, the container ecosystem began to move toward modularization and standardization. The CNCF accepted the containerd and rkt projects, the Open Container Initiative (OCI) released version 1.0 of its specifications, and both the Container Runtime Interface (CRI) and the Container Network Interface (CNI) gained wide support.
From 2017 to 2018, the commercialization of container services developed significantly. AWS EKS, Google GKE, Alibaba Cloud ACK/ASK/ECI, Huawei CCI, Oracle Container Engine for Kubernetes, VMware, Red Hat, and Rancher began to offer commercial services based on Kubernetes.
From 2017 to 2019, container engine technology developed rapidly and new technologies emerged continuously. At the end of 2017, the Kata Containers community was established. In May 2018, Google open-sourced gVisor. In November 2018, AWS open-sourced Firecracker, and Alibaba Cloud released Sandboxed-Container V1.
From 2020 onward, container engine technology has continued to evolve: Kata Containers has moved to its 2.0 architecture, and Alibaba Cloud has released Sandboxed-Container V2.

Development of Containers

The development of container technology over the past 20 years can be roughly divided into four stages.

[Figure: The four stages of container technology development]

Currently, the container ecosystem is primarily based on Kubernetes.

[Figure: The Kubernetes ecosystem]

Container Technology

Container Technology Basics

cgroup

Introduction to cgroup

cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage of a collection of processes. cgroups provide fine-grained control over system resources (such as CPU, memory, disk I/O, etc.) and allow system administrators to allocate and limit resources to a group of processes. It was proposed by Google in 2007 and merged into the Linux kernel V2.6 in 2008.

The use cases mainly include:

  1. Resource limiting and quota
  2. Process isolation
  3. Resource statistics and monitoring
  4. Priority control
  5. Dynamic resource management

Exploration of cgroup

cgroup consists of subsystems, such as CPU and memory.


Each hierarchy is a tree structure in which every node is a cgroup structure (for example, cpu_cgrp or memory_cgrp). The first cgroup hierarchy is attached to the cpu subsystem and the cpuacct subsystem, so the cgroup structures within this hierarchy can limit CPU resources and account for the CPU usage of processes. The second cgroup hierarchy is attached to the memory subsystem, so the cgroup structures within that hierarchy can limit memory resources.

cgroups have been used in the Alibaba Cloud Security Center agent for over two years, which has greatly improved the agent's stability and reduced its impact on customers' normal business.
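
To make the mechanism concrete, the following Go sketch creates a cgroup and applies memory and CPU limits by writing to the cgroup filesystem, which is how container runtimes enforce resource quotas. It assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and root privileges; the group name "demo" and the limit values are illustrative.

    // A minimal sketch of cgroup-based resource limiting (cgroup v2 assumed).
    // Run as root on a Linux host with /sys/fs/cgroup mounted as cgroup2.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func must(err error) {
        if err != nil {
            panic(err)
        }
    }

    func main() {
        // Create a child cgroup; the kernel populates its control files.
        cg := "/sys/fs/cgroup/demo" // illustrative group name
        if err := os.Mkdir(cg, 0o755); err != nil && !os.IsExist(err) {
            panic(err)
        }

        // Limit the group to 256 MiB of memory and half a CPU
        // (50000 us of runtime per 100000 us period).
        must(os.WriteFile(filepath.Join(cg, "memory.max"), []byte("268435456"), 0o644))
        must(os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("50000 100000"), 0o644))

        // Move the current process into the cgroup; every child it spawns
        // (for example, a container's init process) inherits the limits.
        pid := fmt.Sprintf("%d", os.Getpid())
        must(os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644))
        fmt.Println("process", pid, "is now limited by", cg)
    }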

namespace

Linux namespaces provide a kernel-level method to isolate system resources by placing the global resources of the system in different namespaces.

Linux provides the following namespace types:

  Cgroup: the cgroup root directory
  IPC: System V IPC and POSIX message queues
  Network: network devices, stacks, ports, and so on
  Mount: mount points
  PID: process IDs
  User: user and group IDs
  UTS: the system hostname and the NIS (Network Information Service) domain name
  Time: system clocks
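
These namespace types map directly onto the clone flags exposed by the kernel. The following Go sketch starts a shell in new UTS, PID, mount, IPC, and network namespaces, which is essentially what a low-level container runtime does before applying cgroups and switching the root filesystem. It assumes a Linux host with /bin/sh and root privileges.

    // A minimal sketch of namespace isolation (Linux only, run as root).
    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("/bin/sh")
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

        // Give the shell its own hostname, PID space, mount table,
        // IPC objects, and network stack.
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS |
                syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS |
                syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWNET,
        }

        // Inside the new namespaces, commands such as `hostname demo` or `ps`
        // no longer see or affect the host's global resources.
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }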


Container Network

Network Virtualization

Containers are built on the isolation provided by Linux namespaces, so Linux network virtualization technology is naturally responsible for communication between isolated containers, between containers and the host machine, and even between containers on different physical hosts.

The main technologies of Linux network virtualization are the Network Namespace and various virtual devices such as Veth, Linux Bridge, and TAP/TUN. The essence of virtualization is the mapping of the real world. These virtual devices cooperate like physical devices in the real world and connect separate namespaces to build various network topologies that are not limited by the physical environment.

[Figure: Isolation of namespaces on the network]

Linux Virtual Devices

Veth: Virtual Ethernet devices allow two isolated network namespaces to communicate with each other. Since they are always created in pairs, they are also known as veth pairs.


Linux Bridge: At the host level, connecting multiple hosts requires a switch (a Layer 2 device). In the Linux virtual network stack, the same role is played by a virtual network bridge. Linux Bridge is a Layer 2 forwarding facility that the Linux kernel has provided since version 2.2. It works like a physical switch and can connect any Layer 2 network device, whether a real physical device such as eth0 or a virtual device such as a veth or tap. However, Linux Bridge differs slightly from a common physical switch: a switch only performs simple Layer 2 forwarding, while Linux Bridge can also deliver received packets to the host's Layer 3 protocol stack.
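
As a rough illustration of how these virtual devices are wired together, the following Go sketch creates a bridge and a veth pair and attaches the host end of the pair to the bridge. It uses the third-party github.com/vishvananda/netlink package (the same netlink wrapper commonly used by CNI plugins) rather than raw netlink messages; it requires root on Linux, and the device names are illustrative. In a real container setup, the peer end would then be moved into the container's network namespace.

    // A sketch of building a veth pair and attaching it to a Linux bridge.
    // Uses github.com/vishvananda/netlink; Linux only, run as root.
    package main

    import "github.com/vishvananda/netlink"

    func must(err error) {
        if err != nil {
            panic(err)
        }
    }

    func main() {
        // Create a bridge, the software equivalent of a Layer 2 switch.
        br := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: "br-demo"}}
        must(netlink.LinkAdd(br))
        must(netlink.LinkSetUp(br))

        // Create a veth pair: one end stays on the host, the peer end would
        // typically be moved into a container's network namespace.
        veth := &netlink.Veth{
            LinkAttrs: netlink.LinkAttrs{Name: "veth-host"},
            PeerName:  "veth-ctr",
        }
        must(netlink.LinkAdd(veth))

        // Attach the host end to the bridge and bring it up.
        must(netlink.LinkSetMaster(veth, br))
        must(netlink.LinkSetUp(veth))
    }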


TUN/TAP: TUN and TAP are two relatively independent virtual network devices provided by Linux. TUN simulates a network-layer device, works at Layer 3, and handles IP packets, while TAP simulates an Ethernet device, works at Layer 2, and handles Ethernet frames. VXLAN, the fundamental protocol of current cloud networks and the basis of cloud SDN (Software-Defined Networking), is implemented on top of this kind of tunneling technology.


A Comprehensive Guide to Docker

Docker was designed with the slogan Build, Ship, and Run Any App, Anywhere.

Composition

Docker started in 2013 and has gone through 10 years of development. The following figure shows the changes that have occurred since Docker's integration with Kubernetes.

[Figure: Changes in the Docker architecture after its integration with Kubernetes]

Milestones:

In June 2015, DockerCon announced the promotion of container standards and the establishment of the OCI (Open Container Initiative).

In December 2015, the runc project was open-sourced, and in 2019, it was taken over by the OCI and integrated into the OCI ecosystem.

In February 2017, Docker announced the open-sourcing of containerd, which became a CNCF incubation project in March.

In March 2017, CRI v1.0 was released, defining the standard interface between Kubernetes and container runtimes.

Many components, such as dockerd and containerd-shim, have gradually been phased out in the container ecosystem.

Runtime Classification

With this understanding of Docker's development, container runtimes can be divided into two types according to their features:

  1. Runtimes that focus solely on basic features such as namespaces, cgroups, and image unpacking are known as low-level container runtimes. The most widely used low-level container runtimes at present are runc and Kata Containers (formerly runV).
  2. Runtimes that support more advanced features such as image management and Container Runtime Interface (CRI) implementation are known as high-level container runtimes. The most widely used high-level container runtimes at present are containerd and CRI-O.


These two types of runtimes collaborate according to their responsibilities to manage the entire lifecycle of containers.

High-level Container Runtime

CRI Specification

Early Kubernetes was completely dependent on and bound to Docker; little thought was given to the possibility of using other container engines in the future. At that time, Kubernetes managed containers through its internal DockerManager, which called the Docker API directly to create and manage containers.


After Docker became popular, CoreOS released the rkt runtime, and Kubernetes implemented support for rkt. As container technology flourished, more and more runtime implementations appeared. If Kubernetes had remained strongly bound to Docker, its maintenance workload would have become enormous, so it needed to reconsider compatibility and adaptation for all container runtimes.

Starting with version 1.5, Kubernetes abstracts container operations into an interface that serves as a bridge between the kubelet and runtime implementations. The kubelet initiates container startup and management by sending requests over this interface, and any container runtime that implements the interface can integrate with Kubernetes. This interface is the CRI (Container Runtime Interface).

CRI is implemented as a set of APIs defined with Protocol Buffers and gRPC, as shown in the following figure:

[Figure: CRI architecture: kubelet (gRPC client), CRI shim (gRPC server), and the container runtime]

As can be seen from the above figure, CRI mainly involves three components: a gRPC client, a gRPC server, and the concrete container runtime implementation. The kubelet acts as the gRPC client that calls the CRI, while the CRI shim acts as the gRPC server that responds to CRI requests and translates them into concrete runtime management operations. Therefore, any container runtime that wants to integrate with Kubernetes needs to implement a CRI shim (a gRPC server) based on the CRI specification.
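
To show what the CRI looks like from the client (kubelet) side, the following Go sketch dials a runtime's CRI endpoint and issues the simplest RPC, Version. It uses the published k8s.io/cri-api gRPC definitions; the socket path assumes containerd's default endpoint and may differ in your environment.

    // A sketch of a CRI client call, assuming containerd's CRI plugin is
    // listening on /run/containerd/containerd.sock.
    package main

    import (
        "context"
        "fmt"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
    )

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Dial the runtime's Unix domain socket (the CRI shim / gRPC server).
        conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // The kubelet-side view: a generated gRPC client for the RuntimeService.
        client := runtimeapi.NewRuntimeServiceClient(conn)

        // Version is the simplest CRI call; RunPodSandbox, CreateContainer,
        // StartContainer, and so on follow the same request/response pattern.
        resp, err := client.Version(ctx, &runtimeapi.VersionRequest{})
        if err != nil {
            panic(err)
        }
        fmt.Println("runtime:", resp.RuntimeName, resp.RuntimeVersion)
    }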

containerd

containerd is mainly responsible for the following tasks:

Container Lifecycle Management. It interacts with the underlying operating system and hardware, and is responsible for the lifecycle management of containers, including creation, starting, stopping, and deletion.

Image Management. It manages the download, storage, and loading of container images.

Container Networking. It provides a set of interfaces that allow networking plugins to interact with containers through the Container Network Interface (CNI), enabling the network connection and configuration of containers.

Security and Isolation. It supports the management of container security and isolation. It ensures that containers are isolated from other containers and host systems at runtime by integrating technologies such as Linux namespaces and cgroups.

OCI Specification Support. It adheres to the OCI specifications. That is to say, it is compatible with containers and images that conform to the OCI specifications. This standardization allows containerd to integrate with other tools and platforms that conform to the same specification.

Plug-in System. It provides a plugin system that allows users to extend the functionality as needed. This means that users can choose to use specific storage backends, loggers, and other plugins to meet their specific needs.

[Figure: containerd architecture (source: https://containerd.io/)]

The official architecture diagram provided by containerd shows that it adopts a client/server (C/S) architecture. The server exposes low-level gRPC APIs through a Unix domain socket, and clients manage the containers on the node through these APIs. Each containerd instance is responsible for only one machine: pulling images, operating on containers (starting, stopping, and so on), networking, and storage are all handled by containerd, while runc is responsible for actually running containers. In fact, any container that complies with the OCI specifications can be supported.

For decoupling, containerd divides the system into different components, each implemented by one or more modules (the Core and Backend sections in the diagram). Each type of module is integrated into containerd as a plugin, and the plugins are interdependent. For example, each long dashed box in the diagram represents a type of plugin, including the Service Plugin, Metadata Plugin, GC Plugin, and Runtime Plugin, where the Service Plugin depends on the Metadata Plugin, GC Plugin, and Runtime Plugin. Each small box indicates a subdivided plugin; for example, the Metadata Plugin depends on the Containers Plugin and the Content Plugin.

Content Plugin: It provides access to the addressable content within an image, where all immutable content is stored.

Snapshot Plugin: It is used to manage filesystem snapshots of container images, where each layer of the image is decompressed into a file system snapshot.

As a high-level container runtime, containerd can be divided into three blocks: Storage, Metadata, and Runtimes.

[Figure: Data flow for creating a bundle in containerd]

A bundle refers to the configuration, metadata, and rootfs data used by the runtime. It is the on-disk representation of a runtime container and can be simplified to a directory in the filesystem.


  1. Instruct the Distribution Controller to pull a specific image. The distribution component stores the layered content of the image in the content store and registers the image name and the root manifest pointer in the metadata store.
  2. Once the image is pulled, the user can instruct the Bundle Controller to unpack the image into a bundle. The layers are read from the content store and decompressed into snapshots by the snapshot component.
  3. When the snapshot of the container's rootfs is ready, the Bundle Controller can use the image manifest and configuration to prepare the execution configuration. Part of this process is entering the mounts from the snapshot module into the execution configuration.
  4. The prepared bundle is then handed to the runtime subsystem for execution. The runtime subsystem reads the bundle configuration to create a running container.
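
The flow above can be driven with containerd's official Go client. The following sketch pulls an image, creates a snapshot-backed container with an OCI spec (in effect, the bundle), and hands it to the runtime subsystem as a task. The namespace "demo" and the alpine image are illustrative; the program must run as root against a local containerd.

    // A sketch of the pull -> bundle -> task flow using containerd's Go client
    // (github.com/containerd/containerd). Run as root with containerd running.
    package main

    import (
        "context"

        "github.com/containerd/containerd"
        "github.com/containerd/containerd/cio"
        "github.com/containerd/containerd/namespaces"
        "github.com/containerd/containerd/oci"
    )

    func main() {
        client, err := containerd.New("/run/containerd/containerd.sock")
        if err != nil {
            panic(err)
        }
        defer client.Close()

        // containerd scopes all resources to a namespace ("k8s.io" for Kubernetes).
        ctx := namespaces.WithNamespace(context.Background(), "demo")

        // Step 1: pull the image. Layers go into the content store, the image
        // name and manifest pointer go into the metadata store, and
        // WithPullUnpack also unpacks the layers into snapshots.
        image, err := client.Pull(ctx, "docker.io/library/alpine:latest", containerd.WithPullUnpack)
        if err != nil {
            panic(err)
        }

        // Steps 2-3: create the container, i.e. a snapshot-backed rootfs plus
        // an OCI runtime spec (the bundle configuration).
        container, err := client.NewContainer(ctx, "demo",
            containerd.WithImage(image),
            containerd.WithNewSnapshot("demo-rootfs", image),
            containerd.WithNewSpec(oci.WithImageConfig(image)))
        if err != nil {
            panic(err)
        }
        defer container.Delete(ctx, containerd.WithSnapshotCleanup)

        // Step 4: hand the bundle to the runtime subsystem (runc by default)
        // as a task and start it.
        task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
        if err != nil {
            panic(err)
        }
        defer task.Delete(ctx)

        if err := task.Start(ctx); err != nil {
            panic(err)
        }
    }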

Low-level Container Runtime

OCI Specifications

The Linux Foundation established the OCI (Open Container Initiative) in June 2015 to create an open industry standard around container formats and runtimes.

The goals of standardized containers are set out in the following five principles.

Operation Standardization: Standardized container operations include creating, starting, and stopping containers with standard tools, copying and snapshotting containers with standard filesystem tools, and uploading and downloading containers with standard network tools.

Content-independent: No matter what the specific container content is, the standard container operations can produce the same effect after they are executed. For example, the container can be uploaded and started in the same way, whether it is a PHP application or a MySQL database service.

Infrastructure-independent: Whether it is a personal laptop, AWS S3, OpenStack, or any other infrastructure, it should support the various container operations.

Tailored for Automation: One of the fundamental purposes of making container operations content-independent and infrastructure-independent is to enable the automation of container operations across platforms.

Industry-level Delivery: A major goal of developing container standards is to enable the real-world implementation of industry-level delivery for software distribution.

The OCI contains three specifications:

runtime-spec (Runtime Specification): Defines the configuration, execution environment, and lifecycle of a container runtime. That is, how to run a container, how to manage the container's state and lifecycle, and how to use the underlying operating system features (namespaces, cgroups, pivot_root, and so on).

image-spec (Image Specification): Defines the image format, the configuration (including application parameters and environment information), and the format of the dependent metadata. Simply put, it is a static description of an image.

distribution-spec (Distribution Specification): Defines the network interaction process for uploading and downloading images.
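
To give a feel for what the runtime-spec describes, the following Go sketch uses the official github.com/opencontainers/runtime-spec/specs-go types to build a minimal container configuration and write it out as config.json, the file a low-level runtime such as runc reads from the bundle directory. The paths and values are illustrative, not a complete production configuration.

    // A sketch of a minimal OCI runtime-spec config.json, built with the
    // official specs-go types. The rootfs path and process are illustrative.
    package main

    import (
        "encoding/json"
        "os"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    func main() {
        spec := specs.Spec{
            Version: specs.Version, // the spec version the library targets
            Root:    &specs.Root{Path: "rootfs", Readonly: true},
            Process: &specs.Process{
                Args: []string{"/bin/sh"},
                Cwd:  "/",
                Env:  []string{"PATH=/usr/sbin:/usr/bin:/sbin:/bin"},
            },
            Hostname: "oci-demo",
            Linux: &specs.Linux{
                // The namespaces the low-level runtime must create for the container.
                Namespaces: []specs.LinuxNamespace{
                    {Type: specs.PIDNamespace},
                    {Type: specs.MountNamespace},
                    {Type: specs.UTSNamespace},
                },
            },
        }

        // Writing this file into a bundle directory (next to rootfs/) is all a
        // high-level runtime needs to do before invoking `runc create` and
        // `runc start`.
        data, err := json.MarshalIndent(spec, "", "  ")
        if err != nil {
            panic(err)
        }
        if err := os.WriteFile("config.json", data, 0o644); err != nil {
            panic(err)
        }
    }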

Among the runtimes that follow the OCI specifications, three implementations are popular:

opencontainers/runc: As mentioned many times before, it is the reference implementation for the OCI Runtime.

kata-containers/runtime: The container standard's foray back into virtual machine territory. Formerly known as clearcontainers/runtime and hyperhq/runv, it provides high-performance, OCI-compatible, hardware-virtualized containers through virtcontainers. It is only available on Linux and requires specific hardware support.

google/gvisor: gVisor is a user-space kernel implemented in Go that includes an OCI-compatible runtime implementation. Its goal is to provide a container runtime sandbox that can run untrusted code. Currently, it is only available for Linux, but support for other architectures may be added in the future.

Secure Containers


Although containers have many technical advantages, the traditional soft isolation based on a shared kernel, represented by runc, still carries certain risks. If a malicious application exploits a system defect to escape from a container, it poses a serious threat to the host. In a public cloud environment in particular, such a security threat is likely to affect other users' data and business.

To address this, the security advantages of virtual machines are combined with the speed and manageability of containers to provide standardized, secure, and high-performance container solutions. This is how Kata Containers came into being.


The Kata Containers runtime complies with the OCI specifications and is compatible with the Kubernetes CRI (as a VM-level implementation of Pods). To shorten the container call chain and integrate efficiently with the Kubernetes CRI, Kata Containers directly merges containerd-shim, kata-shim, and kata-proxy into a single entity. The integration of CRI and Kata Containers is shown in the following figure:

[Figure: Integration of CRI and Kata Containers]

Container Orchestration

What is Container Orchestration?

Container orchestration allows developers to automatically deploy, scale, and manage containerized applications. Container orchestration tools are designed to simplify and automate tasks such as communication, scheduling, scaling, and maintenance between multiple containers. These tools ensure that containers run consistently, reliably, and efficiently throughout the application lifecycle.
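
In API terms, orchestration means declaring a desired state and letting the orchestrator converge on it. As an illustration, the following Go sketch uses the official Kubernetes client (k8s.io/client-go) to scale a Deployment to five replicas; the deployment name "web", the "default" namespace, and the kubeconfig path are assumptions made for the example.

    // A sketch of declarative scaling through the Kubernetes API using
    // k8s.io/client-go. The deployment "web" in namespace "default" is assumed.
    package main

    import (
        "context"
        "fmt"
        "path/filepath"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
        "k8s.io/client-go/util/homedir"
    )

    func main() {
        // Build a client from the default kubeconfig (~/.kube/config).
        kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
        config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        ctx := context.Background()
        deployments := clientset.AppsV1().Deployments("default")

        // Declare the desired replica count; the orchestrator schedules,
        // starts, and replaces the underlying containers to match it.
        scale, err := deployments.GetScale(ctx, "web", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }
        scale.Spec.Replicas = 5
        if _, err := deployments.UpdateScale(ctx, "web", scale, metav1.UpdateOptions{}); err != nil {
            panic(err)
        }
        fmt.Println("requested 5 replicas of deployment web")
    }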

Development of Container Orchestration

[Figure: Development of container orchestration]

Common orchestration tools include Mesos, Swarm, and Kubernetes, among which Kubernetes is currently the most popular in the market.

Why does Kubernetes “win”?

  1. Open source: Kubernetes is strongly backed by the community.
  2. Standardization: Kubernetes defines its own set of standards and does not strongly depend on Docker; any runtime that conforms to the OCI and CRI specifications can be integrated.
  3. Strong ecosystem: Based on the community, Kubernetes is supported by many plugins.
  4. Popularity of container standards.


Alibaba Cloud Containers

Single-tenant -> Multi-tenant

Elastic Container Instance

Elastic Container Instance (ECI) has evolved from version 1.0 to the current 3.0 and has transitioned from running on ECS to being co-located with ECS.


Hierarchical architecture:

  1. User experience and ecosystem.
  2. IaaS product control (IaaS resource products: ECS, ECI, next-generation products...).
  3. IaaS middleware (the resource assembly workshop): EBS and other services that have no IaaS product concept of their own but supply the underlying resources.
  4. Resource supply, resource scheduling, and resource orchestration.

How to Create an Elastic Container Instance


  1. The ECI control plane calls pync (Alibaba Cloud's standalone control component), which further calls DPDKAVS (Alibaba Cloud's standalone network component) and TDC (Alibaba Cloud's standalone storage component) to produce NICs and disks respectively. [Note: Here, the disk is only used as a data disk; all guest system disks share a single disk to achieve shared storage.]
  2. The ECI control plane calls eciproxy (the ECI control and forwarding component), which further calls the libvirt API to produce an iohub instance (the device emulation component on the Alibaba Cloud X-Dragon platform) on the MOC. The produced iohub instance maps the NIC and disk to the CN in the form of a BDF.
  3. eciproxy calls ecilet (a modified version of kubelet) located in the SYS_VM, sending the sandbox configuration to containerd. Here, ecilet calls the ResourceManager to allocate a vsock CID for each user's virtual machine, providing VMs with different vsock identifiers for vsock communication.
  4. containerd calls the sys-agent on the CN (since containerd is placed in the SYS_VM and cannot communicate directly with rund, sys-agent is required to forward requests), which then calls rund to produce a sandbox. containerd in the rund guest pulls the container image, and then kata-agent produces the container. At this point, a complete elastic container instance has been produced, and users can run their applications.

RunD Secure Container

After introducing Kata Containers, we will now introduce RunD, Alibaba Cloud's implementation of secure containers. RunD is a lightweight secure container runtime that proposes a host-to-guest full-stack optimization scheme to address the following three issues:

• The container filesystem can be customized based on the fact that user images are read-only and do not need to be persisted.

• The guest operating system's base image can be shared among multiple secure containers and compressed on demand to reduce memory overhead.

• Creating cgroups at high concurrency causes high synchronization latency and, in high-density scenarios, especially high scheduling overhead.

[Figure: Problems solved by RunD and the steps to start Kata Containers concurrently]

When using Kata as the container runtime, the concurrency bottleneck lies in creating the rootfs (red box, Step 1) and creating cgroups (red line, Step 3). The density bottleneck is due to the high memory overhead of MicroVMs (blue box, Step 2) and the overhead of scheduling and maintaining a large number of cgroups (blue box, Step 3).

The following figure shows the architecture of the RunD scheme:

[Figure: Architecture of RunD]

RunD designs and consolidates a host-to-guest full-stack solution. The RunD runtime provides a read-only layer through virtio-fs, creates a non-persistent read-write layer through virtio-blk backed by built-in storage, and mounts the two together with overlayfs as the final container rootfs, thereby achieving read/write splitting. RunD uses a MicroVM template integrated with a streamlined guest kernel and creates new MicroVMs from pre-processed images, further amortizing the overhead across different MicroVMs. When creating a secure container, RunD binds a lightweight cgroup from a cgroup pool for resource management.

Based on the above optimizations, when using RunD as the secure container runtime, the secure container will be started according to the following steps:

Step 1: Once containerd receives a user call, it forwards the request to the RunD runtime.

Step 2: RunD prepares the rootfs of the runc container for the VM hypervisor. rootfs is divided into the read-only layer and the writable layer.

Step 3: The hypervisor uses the MicroVM template to create the required sandbox and mounts the rootfs to the sandbox through overlayfs.

Step 4: Finally, a lightweight cgroup is renamed from the cgroup pool and bound to the sandbox to manage resource usage.


Summary of concurrency performance: RunD can start a single sandbox within 88 ms and can start 200 sandboxes concurrently within 1 second. Compared with existing technologies, it has the smallest latency fluctuation and the lowest CPU overhead.



Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
