By Zhiheng Tao (Junchuan)
Cloud computing is a computing model that provides dynamically scalable resources through Internet services. After years of development, it has become an important support for enterprise IT technology. Virtualization is one of the core technologies of cloud computing. It abstracts a computer into multiple logical computers, namely virtual machines. Each virtual machine is a separate and secure environment that can run different operating systems without affecting each other.
Virtualization makes resources easier to use and schedule. A cloud computing system can reschedule resources promptly according to load, improving resource utilization while ensuring that the service quality of applications and services does not degrade due to insufficient resources. However, virtualization also has a cost. Abstracting resources causes performance loss, a problem that virtualization has long been trying to solve.
The resource abstraction of virtualization can be divided into three parts: CPU virtualization, memory virtualization, and device virtualization. Among them, device virtualization can pass network, storage, and other devices through to virtual machines without performance loss. CPU virtualization (with the support of hardware features) executes ordinary instructions with the same performance as a bare machine. Memory virtualization, however, still performs differently from a bare machine, which makes it a problem worthy of attention.
When it comes to memory virtualization, we have to mention the concept of virtual memory. Early operating systems only had physical addresses and limited space. Processes had to be careful when using memory to avoid overwriting the memory of other processes. Virtual memory was introduced as an abstraction to avoid this problem and to give each process a continuous, independent virtual memory space. A process accesses memory through a Virtual Address (VA). A VA issued by the CPU during a memory access is intercepted by the hardware Memory Management Unit (MMU) and converted to a Physical Address (PA). The mapping from VA to PA is managed by the page table, which the MMU queries automatically during translation.
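The VA to PA translation described above can be sketched as a small software model of a page table walk. This is only an illustration: the 4-level layout with 9 index bits per level and 4 KiB pages is an assumption (an x86-64 style layout), and `split_va`, `map_page`, and `translate` are hypothetical helper names, not a real MMU interface.

```python
PAGE_SHIFT = 12   # assumption: 4 KiB pages
INDEX_BITS = 9    # assumption: 512 entries per page table level
LEVELS = 4        # assumption: 4-level page table


def split_va(va):
    """Split a VA into per-level indexes (top level first) and a page offset."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indexes = []
    for level in range(LEVELS, 0, -1):
        shift = PAGE_SHIFT + INDEX_BITS * (level - 1)
        indexes.append((va >> shift) & ((1 << INDEX_BITS) - 1))
    return indexes, offset


def map_page(root, va, pa):
    """Install a VA -> PA mapping, creating intermediate tables as needed."""
    indexes, _ = split_va(va)
    table = root
    for idx in indexes[:-1]:
        table = table.setdefault(idx, {})
    table[indexes[-1]] = pa >> PAGE_SHIFT  # leaf entry stores the frame number


def translate(root, va):
    """Walk the page table the way the MMU would: one lookup per level."""
    indexes, offset = split_va(va)
    entry = root
    for idx in indexes:
        if idx not in entry:
            raise KeyError(f"page fault at VA {va:#x}")
        entry = entry[idx]
    return (entry << PAGE_SHIFT) | offset


root = {}
map_page(root, 0x7F0000001000, 0x2000)
print(hex(translate(root, 0x7F0000001234)))
```

Each loop iteration in `translate` stands for one table query, which is why a 4-level page table costs at most four queries on a physical machine.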
Similar to the concept of virtual memory, each virtual machine on a host considers itself to occupy the entire physical address space. Therefore, the memory needs to be abstracted again, which means the memory is virtualized to ensure that each virtual machine has an independent address space. As such, there are VA and PA concepts in both virtual machines and physical machines, namely Guest Virtual Address (GVA), Guest Physical Address (GPA), Host Virtual Address (HVA), and Host Physical Address (HPA). The program in the virtual machine uses GVA, which needs to be translated into HPA. The mapping from the VA to the PA (GVA to GPA and HVA to HPA) is still managed by the page table. The mapping from the GPA to the HVA typically consists of several linear mappings, determined by the Virtual Machine Monitor (VMM).
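As a rough sketch of how the four address spaces chain together, the snippet below models the two page-table steps (GVA to GPA, HVA to HPA) as page-granular dictionaries and the linear GPA to HVA step as a list of memory slots. The names `page_lookup`, `gpa_to_hva`, and `gva_to_hpa`, the slot tuple layout, and the 4 KiB page size are assumptions made for illustration only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_lookup(table, addr):
    """Page-table step, flattened to one dict: page number -> frame number."""
    page, offset = divmod(addr, PAGE_SIZE)
    return table[page] * PAGE_SIZE + offset


def gpa_to_hva(memslots, gpa):
    """Linear step: each slot is (gpa_base, size, hva_base), chosen by the VMM."""
    for gpa_base, size, hva_base in memslots:
        if gpa_base <= gpa < gpa_base + size:
            return hva_base + (gpa - gpa_base)
    raise LookupError(f"GPA {gpa:#x} is not backed by any memory slot")


def gva_to_hpa(guest_pt, memslots, host_pt, gva):
    gpa = page_lookup(guest_pt, gva)   # guest page table: GVA -> GPA
    hva = gpa_to_hva(memslots, gpa)    # VMM linear mapping: GPA -> HVA
    return page_lookup(host_pt, hva)   # host page table: HVA -> HPA


guest_pt = {0x10: 0x3}
memslots = [(0x0, 0x100000, 0x7F0000000000)]
host_pt = {0x7F0000003: 0x9A}
print(hex(gva_to_hpa(guest_pt, memslots, host_pt, 0x10ABC)))
```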
The VA must be translated to the PA for process memory access. Memory virtualization changes the translation path. Originally, only VA to PA conversion was needed. After virtualization, the translation process becomes GVA → GPA → HVA → HPA. The longer and more complex path brings challenges to the security and performance of memory access. Ensuring both security and performance is therefore the goal of memory virtualization.
Many virtualization solutions have been proposed to achieve the goal of memory virtualization. Shadow Page Table (SPT) and Extended Page Table (EPT) are two typical solutions, and they are also the most familiar ones. Let's take these as a starting point to see how they work and then discuss other virtualization solutions.
Since the original hardware only supports one-layer page table translation, it cannot be directly used to translate GVA to HPA according to the mapping from VA to PA on virtual machines or physical machines. Therefore, SPT establishes a shortcut to directly manage the mapping from GVA to HPA, which is the shadow page table shown in the following figure. Each shadow page table instance corresponds to a process in the virtual machine, and the creation of the shadow page table requires the VMM to query the page table of the process in the virtual machine.
Since the shadow page table manages the direct mapping from GVA to HPA, the SPT address translation path is equivalent to the physical machine path. The address translation can be completed by querying the page table. When using a 4-level page table, the translation process is shown in the following figure:
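A minimal sketch of how a VMM might derive a shadow page table is shown below. It composes a guest process page table (GVA page to GPA page) with the VMM's own GPA to HPA knowledge. The page-granular dictionaries and the names `build_shadow` and `sync_entry` are hypothetical simplifications, not the interface of any real hypervisor.

```python
def build_shadow(guest_pt, gpa_to_hpa):
    """Compose GVA->GPA (guest page table) with GPA->HPA (VMM knowledge)."""
    shadow = {}
    for gva_page, gpa_page in guest_pt.items():
        # Only GPAs that the VMM has backed with host memory are mapped,
        # so the guest cannot reach host memory it does not own.
        if gpa_page in gpa_to_hpa:
            shadow[gva_page] = gpa_to_hpa[gpa_page]
    return shadow


def sync_entry(shadow, gpa_to_hpa, gva_page, new_gpa_page):
    """Assumption: called by the VMM when it intercepts a guest page table write."""
    if new_gpa_page in gpa_to_hpa:
        shadow[gva_page] = gpa_to_hpa[new_gpa_page]
    else:
        shadow.pop(gva_page, None)


guest_pt = {1: 10, 2: 11}        # GVA page -> GPA page (one guest process)
gpa_to_hpa = {10: 100, 11: 101}  # GPA page -> HPA page (maintained by the VMM)
shadow = build_shadow(guest_pt, gpa_to_hpa)
print(shadow)
```

The hardware then walks only `shadow`, so a translation costs the same as on a physical machine, while every guest page table change has to be reflected through something like `sync_entry`.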
Advantages: The SPT address translation process has low overhead and is equivalent to physical machines.
Disadvantages:
Later, hardware added support for nested page tables for virtualization, allowing the hardware to complete two-layer page table translation automatically. EPT is such a hardware-based solution. In addition to the virtual machine page table, which manages the mapping from GVA to GPA, an extended page table is added to manage the mapping from GPA to HPA, as shown in the following figure. The two layers of page tables are independent of each other, and the translation through both layers of mapping is completed automatically by hardware.
Since the contents of the page tables (gL4, gL3, gL2, gL1) at all levels in the virtual machine are only GPA, the extended page tables (nL4, nL3, nL2, nL1) must be walked to get HPA first when querying the next level, making the entire translation path long. When both page tables have 4 levels, the translation process is shown in the following figure:
Advantages: The overhead of establishing address translation relationships is low. The existence of an independent EPT page table ensures the validity of address translation. Therefore, the page table of a virtual machine can be modified without the intervention of VMM.
Disadvantages: The translation process is expensive. In the worst case, 24 (4 + 4 + 4 × 4) hardware table queries are required.
Both classic schemes provide solid security guarantees, but each has a performance defect. SPT pays a cost when establishing the translation relationship to ensure the legitimacy of address translation. EPT eliminates the overhead of establishing the translation relationship, but its translation path is longer.
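The worst-case figure of 24 can be reproduced by counting table queries step by step. The sketch below is a simple counting model that assumes no TLB or paging-structure caches; `nested_walk_refs` is a hypothetical name.

```python
def nested_walk_refs(guest_levels=4, nested_levels=4):
    """Count worst-case table queries for one GVA -> HPA translation."""
    refs = 0
    for _ in range(guest_levels):
        refs += nested_levels  # translate the GPA of this guest table to an HPA
        refs += 1              # read the guest page table entry itself
    refs += nested_levels      # translate the final data GPA to an HPA
    return refs


print(nested_walk_refs(4, 4))  # two 4-level page tables
print(nested_walk_refs(4, 0))  # no second layer: the physical machine case
```

With two 4-level tables the model yields 4 + 4 + 4 × 4 = 24, compared with 4 when there is no second layer.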
There are still many explorations in the industry and academia about memory virtualization. The basic ideas are similar to SPT or EPT and can be divided into three categories:
1) One-Layer Page Table Solution: Similar to SPT, a one-layer page table is used to manage the mapping from GVA to HPA.
2) Two-Layer Page Table Solution: Similar to EPT, two layers of independent page tables are used to manage the mapping from GVA to GPA and from GPA to HPA, respectively.
3) Hybrid Solution: A dynamic selection that combines the first two types of schemes.
Direct Paging is a one-layer page table scheme: the paravirtualization scheme that Xen used when early hardware only supported one-layer page tables. Compared with SPT, the biggest difference is that the virtual machine page table from GVA to GPA is not maintained separately. The virtual machine knows it is in a virtualized environment and that its page table contents are HPAs. The virtual machine still needs to trap out to modify the page table, but because it traps out actively, it can batch the modifications, whereas SPT traps out passively. When reading the page table, only the HPA is available, and a Machine to Physical (M2P) table must be queried to get the GPA.
Direct Paging also uses a page table to manage the mapping from GVA to HPA. The path of address translation is the same as SPT. When using a 4-level page table, only four table queries are needed in the worst case.
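The batching and M2P lookup described above can be sketched with a toy model. Everything here is hypothetical and simplified: `ToyHypervisor`, `apply_updates`, and `gpa_page_of` are illustrative names and do not correspond to Xen's actual hypercall interface.

```python
class ToyHypervisor:
    """Toy model: the guest's page table holds host (machine) frame numbers."""

    def __init__(self, owned_frames):
        self.owned_frames = set(owned_frames)  # host frames given to this guest
        self.page_table = {}                   # GVA page -> HPA page
        self.traps = 0

    def apply_updates(self, updates):
        """One active trap validates and applies a whole batch of updates."""
        self.traps += 1
        for _, hpa_page in updates:
            if hpa_page not in self.owned_frames:
                raise PermissionError(f"frame {hpa_page} is not owned by the guest")
        for gva_page, hpa_page in updates:
            self.page_table[gva_page] = hpa_page


def gpa_page_of(page_table, m2p, gva_page):
    """Reading an entry yields an HPA page; the M2P table recovers the GPA page."""
    return m2p[page_table[gva_page]]


hv = ToyHypervisor(owned_frames={100, 101})
hv.apply_updates([(1, 100), (2, 101)])  # two modifications, a single trap
m2p = {100: 0, 101: 1}                  # HPA page -> GPA page
print(hv.traps, gpa_page_of(hv.page_table, m2p, 2))
```

Validating the whole batch before applying it keeps the safety check in the hypervisor while amortizing the trap cost over several page table modifications.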
Advantages: The address translation process has low overhead and is equivalent to physical machines.
Disadvantages:
Direct Segment is a two-layer page table scheme based on new hardware proposed by academia. The mapping from GVA to GPA is managed the same way as in EPT, using multi-level page tables. However, the mapping from GPA to HPA uses a segmentation mechanism. To convert a GPA to an HPA, the hardware only needs to add an offset.
Although GPA is not equal to HPA, the mapping relationship between the two is simple. The Direct Segment hardware needs only one offset. Compared with the path of the physical machine, the entire translation path differs only slightly: just a few additional hardware offset additions are required. When the virtual machine uses a 4-level page table, the translation path is shown in the following figure, where DS indicates the hardware support for GPA to HPA translation.
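A minimal sketch of the segment-style GPA to HPA step follows. The `base`, `limit`, and `offset` fields are assumptions about what such hardware would hold, and `walk_cost` uses the same simplified counting model as a plain page table walk with no caches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectSegment:
    base: int    # first GPA covered by the segment (assumed register)
    limit: int   # first GPA past the end of the segment (assumed register)
    offset: int  # HPA = GPA + offset inside the segment

    def translate(self, gpa):
        if not (self.base <= gpa < self.limit):
            raise ValueError(f"GPA {gpa:#x} is outside the direct segment")
        return gpa + self.offset  # one addition, no table query


def walk_cost(guest_levels=4):
    """Return (table queries, offset additions) for one GVA -> HPA translation."""
    table_queries = guest_levels       # one read per guest page table level
    offset_adds = guest_levels + 1     # each guest table GPA, plus the final data GPA
    return table_queries, offset_adds


seg = DirectSegment(base=0, limit=8 << 30, offset=0x1_0000_0000)
print(hex(seg.translate(0x1000)), walk_cost(4))
```

Because a single offset covers the whole segment, the guest's memory has to sit in one contiguous range of host physical memory.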
Advantages: The overhead of establishing address translation relationships is low, and the overhead of the translation process is low.
Disadvantages:
Flat EPT is a two-layer page table scheme, also a new hardware-based scheme proposed by academia. The overall idea is similar to EPT. The only difference is in how the mapping from GPA to HPA is managed: EPT uses multi-level page tables, generally four levels with 512 entries each, while Flat EPT uses a single level of flat page table with far more than 512 entries.
Similar to EPT, the content of page tables at all levels in a virtual machine is also GPA. When querying the next level, the flat extended page table (nL4) must be queried first to get the HPA. Since the flat extended page table only has one level, the translation path is shorter than EPT. When using a 4-level page table in a virtual machine, the translation path is shown in the following figure. In the worst case, nine (4 + 1 + 4 × 1) hardware table queries are required.
Advantages: The overhead of establishing address translation relationships is low, and the overhead of the translation process is also low. Compared with Direct Segment, the memory allocation requirements are low. Only a small amount of contiguous memory is required for the flat extended page table (an 8 GB virtual machine needs only 16 MB).
Disadvantages: The hardware is required to support flat extended page tables. Current hardware only supports multi-level extended page tables with 512 entries per level.
The hybrid scheme, proposed earlier in academia, switches dynamically between SPT and EPT in a time-sharing manner. It monitors and collects TLB miss and page fault data while the virtual machine is running, and switches between SPT and EPT when the two metrics reach the set thresholds, as shown in the following figure:
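The 16 MB figure for an 8 GB virtual machine can be checked with a short calculation. The sketch assumes 4 KiB pages and 8-byte table entries, which is consistent with the numbers quoted above; the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages
ENTRY_SIZE = 8    # assumption: 8-byte entries


def flat_ept_entries(guest_mem_bytes):
    """One entry per guest physical page (rounded up)."""
    return -(-guest_mem_bytes // PAGE_SIZE)


def flat_ept_bytes(guest_mem_bytes):
    """Contiguous memory needed for the single-level table."""
    return flat_ept_entries(guest_mem_bytes) * ENTRY_SIZE


def flat_translate(flat_table, gpa):
    """GPA -> HPA with a single indexed query into the flat table."""
    page, offset = divmod(gpa, PAGE_SIZE)
    return flat_table[page] * PAGE_SIZE + offset


GIB = 1024 ** 3
MIB = 1024 ** 2
print(flat_ept_entries(8 * GIB), flat_ept_bytes(8 * GIB) // MIB)
```

An 8 GiB guest has 2,097,152 pages, so the flat table holds that many entries (compared with 512 per level in a multi-level table) and occupies 16 MiB.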
Advantages: It makes full use of the advantages of SPT and EPT to achieve better performance.
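A threshold-based switching policy of this kind might look like the sketch below. The direction of each switch is an assumption inferred from the trade-offs discussed earlier (SPT translates cheaply but pays for page table changes, EPT does the opposite), and the threshold values, `Sample`, and `choose_mode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    tlb_misses: int   # TLB misses observed in the last sampling window
    page_faults: int  # guest page faults observed in the same window


def choose_mode(current, sample, tlb_miss_threshold=10_000, page_fault_threshold=1_000):
    """Return "SPT" or "EPT" for the next window (thresholds are made-up values)."""
    many_faults = sample.page_faults > page_fault_threshold
    many_misses = sample.tlb_misses > tlb_miss_threshold
    if current == "SPT" and many_faults:
        return "EPT"  # frequent page table changes make SPT maintenance costly
    if current == "EPT" and many_misses and not many_faults:
        return "SPT"  # long nested walks dominate, page table is mostly stable
    return current


print(choose_mode("EPT", Sample(tlb_misses=50_000, page_faults=10)))
print(choose_mode("SPT", Sample(tlb_misses=50_000, page_faults=5_000)))
```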
Disadvantages:
The significant advantage of the one-layer page table is the low overhead of the address translation process, which is the same as physical machines. The problem that needs to be solved is reducing the overhead of address translation establishment. One possible direction is to give up security and make page table modifications lighter. Another more practical direction is to use it in appropriate scenarios for loads with infrequent page table modifications.
The advantage of the two-layer page table is that the overhead of establishing address translation is small, and the virtual machine can modify the page table independently. The problem to be solved is shortening the translation path. This is feasible, but the solutions rely on new hardware support, and hardware that meets the requirements is unlikely to appear in the short term.
The hybrid page table intends to make full use of the advantages of the two types of page tables, but dynamic mode switching is difficult to get right. Load differences and even hardware differences may change how well switching works. Targeted tuning for known loads may be feasible.
In the long run, with the support of new hardware, the two-layer page table (especially Flat EPT) is a complete solution: address translation can be efficient without sacrificing security or versatility. However, such hardware is unlikely to appear in the short term, so it is more practical for now to explore and optimize the one-layer page table scheme. We will continue to explore more possibilities in memory virtualization. You are welcome to join OpenAnolis for additional discussion.
Zhiheng Tao (Junchuan) joined the Alibaba Cloud Operating System-Cloud-Native Underlying System Team in 2020 and is currently engaged in performance optimization.