On the DMA Mapping Problem in Direct Device Assignment (SYSTOR 2010)

ABSTRACT

I/O intensive workloads running in virtual machines can suffer massive performance degradation. Direct assignment of I/O devices to virtual machines is the best performing I/O virtualization mechanism, but its performance still remains far from the bare-metal (non-virtualized) case. The primary gap between direct assignment I/O performance and bare-metal I/O performance is the overhead of mapping the VM’s memory pages for DMA in IOMMU translation tables. One could avoid this overhead by mapping all of the VM’s pages for the lifetime of the VM, but this leads to memory consumption which is unacceptable in many scenarios.

The DMA mapping problem can be stated briefly as “when should a memory page be mapped or unmapped for DMA?” We begin by presenting a theoretical framework for reasoning about the DMA mapping problem. Then, using a quota-based approach, we propose the on-demand DMA mapping strategy, which provides the best DMA mapping performance for a given amount of memory consumed. In particular, on-demand mapping can achieve the same performance as state-of-the-art mapping strategies while consuming much less memory (the exact amount depends on the workload’s requirements). We present the design and implementation of on-demand mapping in the Linux-based KVM hypervisor and an experimental evaluation of its application to various workloads.

1. INTRODUCTION

A constantly increasing number of computer systems in today’s data-centers are running multiple operating systems simultaneously, using virtualization technology. Most new CPUs manufactured for servers, desktops, laptops, and even some embedded systems have virtualization capabilities built into the hardware. Virtualization is clearly here to stay.

Virtualizing a computer system’s CPU and memory is a challenging but fairly well understood problem. However, a computer system has three equally important components: CPU, memory, and I/O. Virtualizing I/O is far more challenging and not nearly as well understood as virtualizing the CPU and memory.

There are three prevailing approaches to I/O virtualization: emulating a real (hardware) I/O device [29], using para-virtualized I/O drivers [5, 17, 25], and giving a virtual machine direct access to an I/O device [21, 23, 32, 34]. Other approaches, such as virtualizing the entire I/O stack or dedicating a core to I/O processing, are also possible [20, 27, 28].


[5] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP ’03: 19th ACM Symposium on Operating Systems Principles, 2003.
[7] M. Ben-Yehuda, J. Mason, J. Xenidis, O. Krieger, L. van Doorn, J. Nakajima, A. Mallick, and E. Wahlig. Utilizing IOMMUs for virtualization in Linux and Xen. In OLS ’06: The 2006 Ottawa Linux Symposium, pages 71–86, July 2006.
[25] R. Russell. virtio: towards a de-facto standard for virtual I/O devices. SIGOPS Oper. Syst. Rev., 42(5):95–103, 2008.

[21] J. Liu, W. Huang, B. Abali, and D. K. Panda. High performance VMM-bypass I/O in virtual machines. In USENIX ’06 Annual Technical Conference, page 3.
[23] H. Raj and K. Schwan. High performance and scalable I/O virtualization via self-virtualized devices. In HPDC ’07, pages 179–188, 2007.
[32] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A. L. Cox, and W. Zwaenepoel. Concurrent direct network access for virtual machine monitors. In High Performance Computer Architecture, pages 306–317, 2007.
[34] B.-A. Yassour, M. Ben-Yehuda, and O. Wasserman. Direct device assignment for untrusted fully-virtualized virtual machines. Technical report, IBM Research Report H-0263, 2008.

[20] J. Liu and B. Abali. Virtualization polling engine (VPE): using dedicated CPU cores to accelerate I/O virtualization. In ICS ’09: Proceedings of the 23rd International Conference on Supercomputing, pages 225–234, New York, NY, USA, 2009. ACM.
[27] J. Satran, L. Shalev, M. Ben-Yehuda, and Z. Machulsky. Scalable I/O—a well-architected way to do scalable, secure and virtualized I/O. In WIOV ’08: The First Workshop on I/O Virtualization, 2008.
[28] L. Shalev, J. Satran, E. Borovik, and M. Ben-Yehuda. IsoStack—highly efficient network processing on dedicated cores. In USENIX ATC ’10: USENIX Annual Technical Conference, 2010.

Emulation means that the host emulates a device that the guest already has a driver for [29]. The host traps all device accesses and converts them to operations on a real, possibly different, device, as depicted in Figure 1.
[29] J. Sugerman, G. Venkitachalam, and B.-H. Lim. Virtualizing I/O devices on VMware workstation’s hosted virtual machine monitor. In 2002 USENIX Annual Technical Conference, pages 1–14.

With para-virtualized I/O devices, special hypervisor-aware I/O drivers are installed in the guest. All modern hypervisors implement such para-virtualized drivers [5, 17, 25], but their performance is still far from native [24, 26] and they require special drivers.

[17] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. KVM: the Linux virtual machine monitor. In Ottawa Linux Symposium, pages 225–230, July 2007.

[24] K. K. Ram, J. R. Santos, Y. Turner, A. L. Cox, and S. Rixner. Achieving 10Gbps using safe and transparent network interface virtualization. In VEE ’09: 2009 Conference on Virtual Execution Environments.
[26] J. R. Santos, Y. Turner, J. G. Janakiraman, and I. Pratt. Bridging the gap between software and hardware techniques for I/O virtualization. In USENIX Annual Technical Conference, pages 29–42, June 2008.

Direct device access (interchangeably referred to as“direct device assignment”, “direct access” or “pass-through access”) means that the guest sees a real device and interacts with it directly, without a software intermediary (see Figure 2).

Direct access does away with the software intermediary which other I/O virtualization approaches require. Direct access can provide much better performance than the alternative I/O virtualization approaches [19, 21, 23, 32, 34]. This is its primary benefit and its importance cannot be overstated: the difference in performance for I/O intensive workloads is such that direct access makes it possible to virtualize workloads that otherwise would bring the virtualization system to its knees. Another benefit of direct access is that the guest can use any device it has a driver for.
[19] J. Liu. Evaluating standard-based self-virtualizing devices: a performance study on 10 GbE NICs with SR-IOV support. In IPDPS ’10: IEEE International Parallel and Distributed Processing Symposium, 2010.

To fully benefit from direct access, some hardware support is necessary. An I/O memory management unit (IOMMU) is needed to protect and translate device memory accesses [3, 7], and a self-virtualizing adapter is needed in order to share the adapter between different virtual machines [21, 23, 32]. IOMMUs such as Intel’s VT-d [2, 3], IBM’s Calgary [7], and AMD’s IOMMU [1], as well as PCI-standard SR-IOV devices [13, 19], are now becoming available.


Although direct access can in theory provide bare-metal performance (i.e., the same performance as running the same workload in a non-virtualized environment), in practice sizable performance gaps remain. This paper deals with one such performance gap—the so-called “DMA mapping problem.”

1.1 DMA and IOMMUs

In virtualized environments, guests have their own view of physical memory, which in the Linux-based KVM hypervisor is referred to as “guest physical”, and which is distinct from the host’s “host physical” view of memory [17]. Although there are ways of giving fully-virtualized guests access to portions of host memory without hardware support [18], such approaches can only work for trusted guests. Giving untrusted guests and devices access to system memory requires hardware support in the form of an IOMMU [7, 15].

An I/O Memory Management Unit (IOMMU) validates and translates all device accesses to host memory. To understand why an IOMMU is needed, consider what would happen if we didn’t have one: first, the guest operating system would need to know that it is running on a virtualized system, and be able to translate guest physical addresses to host physical addresses on its own before handing them over to the device. This conflicts with one of the key value propositions of x86 virtualization, running unmodified guest operating systems. Second, we would need to trust the OS to program the device with the right addresses, since otherwise the OS could program the device to DMA anywhere in system memory, including on top of the hypervisor or other virtual machines. This would defeat another key value proposition of virtualization, running untrusted operating systems in isolation.

IOMMUs work by interposing on DMAs made by devices, validating and translating the addresses in those DMAs before letting them read or write main memory. This work used Intel’s VT-d IOMMU [3], but the details are substantially similar for other IOMMUs. In order to validate and translate DMAs generated by different devices, VT-d maintains a separate translation table in memory for each PCI device (as denoted by its PCI Bus/Device/Function ID). When the device initiates a DMA operation, the IOMMU walks through the translation table and checks whether the access is valid. If the translation entry is not valid (for example, if the device is trying to write to memory that it only has read permissions for), the DMA is aborted. If the translation entry is valid, the address is translated and the DMA operation reads or writes memory at the translated address.
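The validate-and-translate walk just described can be sketched as follows. This is a minimal simulation, assuming a flat dictionary keyed by guest-physical page number with a single writable bit per entry; VT-d's actual in-memory format is a multi-level table, and the class and names here are illustrative only.

```python
PAGE = 4096  # 4 KB pages, as on x86

class IommuTable:
    """Per-device translation table: gpa page -> (hpa page, writable)."""
    def __init__(self):
        self.entries = {}

    def map(self, gpa, hpa, writable=True):
        self.entries[gpa // PAGE] = (hpa // PAGE, writable)

    def translate(self, gpa, write):
        entry = self.entries.get(gpa // PAGE)
        if entry is None:
            # No valid translation entry: the DMA is aborted.
            raise PermissionError("DMA aborted: no translation")
        hpa_page, writable = entry
        if write and not writable:
            # Write to a read-only mapping: the DMA is aborted.
            raise PermissionError("DMA aborted: read-only mapping")
        # Valid entry: translate and let the access proceed.
        return hpa_page * PAGE + (gpa % PAGE)

dev_table = IommuTable()
dev_table.map(0x1000, 0x9f000, writable=False)
assert dev_table.translate(0x1234, write=False) == 0x9f234
```

The page offset passes through untranslated; only the page number is looked up, mirroring how the hardware walk operates at page granularity.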

In order to speed up translation entry lookups, most IOMMUs, including VT-d, cache translations in IOTLBs. An IOTLB serves the same purpose for the IOMMU that a TLB serves for the MMU. IOTLBs improve IOMMU performance, but software must keep them coherent when it modifies a translation table. VT-d defines three modes of IOTLB invalidation: global, page level, and directory level. When software invalidates the IOTLB it can either poll for invalidation completion or request an interrupt. There are two mechanisms for invalidating the entries: register-based and queued invalidations. With the register-based approach, software needs to wait for the completion of each address invalidation. When several entries need to be modified, queued invalidation is more efficient. This feature is important, since, as we show, an important optimization is to batch modifications. With the queued method, the IOTLB invalidation time can be reduced. Unfortunately, the VT-d revision on our systems did not support queued invalidations.

When new mappings are added to the translation table that cause an entry to go from non-present to present, no IOTLB invalidation is needed, since VT-d does not cache non-present entries. On the other hand, VT-d’s internal write-buffer does need to be flushed, and software needs to wait for completion. We note that when multiple addresses are to be added, only a single write-buffer flush is needed. Therefore, batching multiple mappings can significantly reduce the mapping cost.
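The batching argument can be made concrete with a toy cost model. The flush cost below is a hypothetical number, not a measurement from this paper; the point is only the shape of the two curves: per-mapping flushing scales linearly with the number of entries, while a batched update pays for exactly one flush.

```python
FLUSH_COST = 100  # hypothetical cycles for one write-buffer flush

def cost_unbatched(n_mappings):
    # One write-buffer flush after every individual mapping.
    return n_mappings * FLUSH_COST

def cost_batched(n_mappings):
    # Add all non-present -> present entries first, then a single flush.
    return FLUSH_COST if n_mappings else 0

assert cost_unbatched(64) == 6400
assert cost_batched(64) == 100
```

For a batch of n mappings the savings factor is n, which is why batching is revisited as a key optimization in Section 3.2.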

1.2 The DMA Mapping Problem

The key goal of direct access is to allow a guest operating system to access a device directly, which requires mapping the guest OS’s memory in an IOMMU so that the device could DMA to it directly. An obvious question which follows is “when should a page of memory be mapped or unmapped in the IOMMU?”. This, in a nutshell, is the DMA mapping problem.

The DMA mapping problem arises because mapping (and unmapping) a page of memory in an IOMMU translation table is expensive, as shown in previous works by Ben-Yehuda et al. [8] and by Willmann, Rixner, and Cox [31], and confirmed in our experiments. A breakdown of the costs is provided in Section 6.
[8] M. Ben-Yehuda, J. Xenidis, M. Ostrowski, K. Rister, A. Bruemmer, and L. van Doorn. The price of safety: evaluating IOMMU performance. In OLS ’07: The 2007 Ottawa Linux Symposium, pages 9–20, July 2007.
[31] P. Willmann, S. Rixner, and A. L. Cox. Protection strategies for direct access to virtualized I/O devices. In USENIX Annual Technical Conference, pages 15–28, 2008.

The early direct access implementations used one of two extreme approaches for DMA mapping: they either mapped all of a guest operating system’s memory up-front (thus incurring minimal run-time overhead), or they mapped memory only immediately before it was DMA’d to or from, and unmapped it immediately when the DMA operation was done [7]. Willmann, Rixner, and Cox named these strategies direct mapping and single-use mapping, respectively. In addition, they presented two other strategies: shared mapping and persistent mapping [31].

Single-use mapping has a non-negligible performance overhead [8] but protects the guest’s memory from malicious devices and buggy drivers. Thus it sacrifices performance for reduced memory consumption and increased protection. Direct mapping, on the other hand, is transparent to the guest and requires minimal CPU overhead—but requires pinning all of the guest’s memory, and provides no protection inside a guest (intra-guest protection), only between different guests (inter-guest protection). Thus it sacrifices memory and protection for increased performance.
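The two extremes can be quantified on a hypothetical trace. The trace, guest size, and the return shape below are invented for illustration; the sketch only counts mapping calls and pinned pages, which is exactly the performance/memory trade-off described above.

```python
def direct_mapping(guest_pages, trace):
    # Map everything up-front: one mapping call, entire guest pinned
    # for the lifetime of the VM, no intra-guest protection.
    return {"calls": 1, "pinned": guest_pages}

def single_use_mapping(guest_pages, trace):
    # Map before each DMA, unmap right after: two calls per DMA; in
    # this serial trace at most one page is pinned at any moment.
    return {"calls": 2 * len(trace), "pinned": 1}

# Guest page numbers targeted by five successive DMAs (note the reuse
# of page 7, which single-use mapping pays for again every time).
trace = [7, 9, 7, 12, 7]
guest_pages = 262144  # a 1 GB guest at 4 KB pages

assert direct_mapping(guest_pages, trace) == {"calls": 1, "pinned": 262144}
assert single_use_mapping(guest_pages, trace) == {"calls": 10, "pinned": 1}
```

The repeated page 7 in the trace is the kind of locality that persistent and on-demand mapping exploit: keeping a recently used mapping around turns those repeat costs into cache hits.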

Shared mapping and persistent mapping provide different tradeoffs between performance, memory consumption, and protection. Shared mapping reuses a single mapping if more than one device is trying to DMA to the same memory location at the same time, and persistent mapping keeps mappings around once they have been created in case they will be reused in the future. A key question is whether there is an optimal mapping strategy, and if yes, what is it. We attempt to address this question in the remainder of this paper.

The DMA mapping problem is not specific to virtual machine direct access. Variants of it also appear in many other areas, such as high-speed networks (Infiniband, Quadrics, Myrinet) with OS-bypass and/or hypervisor-bypass [21, 33], userspace drivers and micro-kernels. In general, whenever a trusted entity wishes to grant access to memory to a device controlled by an untrusted entity, the privileged entity needs to solve a variant of the DMA mapping problem.
[33] P. Wyckoff and J. Wu. Memory registration caching correctness. In CCGRID ’05, pages 1008–1015.

1.3 Our Contributions

We make the following contributions in this paper:

  1. A theoretical framework for reasoning about the DMA mapping problem is presented in Section 2. Using the framework, we build a quota-based model in Section 2.1.

  2. The on-demand mapping strategy, which provides the best performance for a given amount of memory consumed, is presented in Section 2.2.

  3. A prefetching algorithm which reduces the DMA mapping overhead with small quotas is presented in Section 2.5, and several batching mechanisms are presented in Section 3.2.

  4. The design and implementation of on-demand mapping in KVM is presented in Section 3, with an evaluation of its performance and memory consumption with various workloads and quotas in Section 4. Intra-guest protection is discussed in Section 5, and a cycle breakdown of the cost of creating and destroying a DMA mapping is presented in Section 6.


2. THEORETICAL FRAMEWORK

We begin by presenting several requirements that any DMA mapping scheme must satisfy. The first requirement is for the assigned device to operate correctly, that is, to operate as it would in a non-virtualized environment. Our working assumption is that in the majority of cases when a device initiates a DMA request and the DMA request fails, it will require a device reset, since no current hardware supports I/O page faults [27]. It is therefore a requirement that the IOMMU must have a valid translation for every address that the device can validly DMA to. Naturally, if the device tries to DMA to an invalid address, then the request will be blocked.

This requirement leads to the following proposition:

Proposition 2.1. Every guest physical address (gpa) that the guest programmed the device to DMA to must be backed up by a valid guest-physical-to-host-physical mapping in the IOMMU translation table for that device.

For correct operation, every guest-physical-to-host-physical mapping in the IOMMU translation table for a given device must also be backed up by an equivalent mapping in the guest-physical to host-physical translation table used for MMU address translations for the guest VM driving that device (i.e., the software or hardware (EPT/NPT) shadow page table) [4, 9]. Note that if a host physical address (hpa) that has a valid translation leading to it in the IOMMU translation table does not have a valid translation in the shadow page table, then it might be in use by another guest or the host. Since the guest-physical-to-host-physical mapping exists in the IOMMU translation table, the device could read or write to it, thereby reading or writing a host frame that is owned by another guest or by the host and violating the protection (isolation) guarantees of the IOMMU.

Proposition 2.2. Every guest-physical-to-host-physical mapping in the IOMMU translation table of a given device must have a corresponding mapping in the guest-physical-to-host-physical translation table for CPU MMU translations for the guest VM driving that device.
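Proposition 2.2 amounts to a subset check, which the following sketch expresses directly. The dict-based tables and the addresses are illustrative, not real hardware structures: each maps a guest-physical page to a host-physical page.

```python
def satisfies_prop_2_2(iommu_table, mmu_table):
    # Every gpa -> hpa mapping the device can use through the IOMMU
    # must also exist, identically, in the guest's MMU (EPT/NPT)
    # translation table; otherwise the device could touch a host
    # frame owned by another guest or by the host.
    return all(mmu_table.get(gpa) == hpa
               for gpa, hpa in iommu_table.items())

mmu   = {0x1000: 0x9f000, 0x2000: 0xa0000}  # guest's gpa -> hpa view
iommu = {0x1000: 0x9f000}                    # device's gpa -> hpa view

assert satisfies_prop_2_2(iommu, mmu)
# A mapping the MMU tables know nothing about violates isolation:
assert not satisfies_prop_2_2({0x3000: 0xbb000}, mmu)
```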

From Propositions 2.1 and 2.2 we have:

Proposition 2.3. Every valid gpa that the guest programmed the device to DMA to or from must have a mapping to hpa, which must be pinned (cannot be changed) as long as the device may legitimately use that gpa.


The host can apply various policies for programming the IOMMU translation table. Since changing the IOMMU mapping requires a world switch and an IOMMU IOTLB flush, both of which are expensive, the main optimization target is to minimize the number of IOMMU remappings. Mapping the entire guest physical address space requires a single IOMMU mapping operation that is not changed throughout the period that the device is assigned to that guest. On the down side, based on Proposition 2.3, mapping the entire guest physical address space means that the entire guest address space would then be pinned!

Mapping and pinning the entire guest physical memory might be a valid solution in some scenarios, but is not acceptable in the general case, since memory is a precious resource and significant effort is expended in trying to conserve, share, and otherwise make full use of it in virtualized environments [16, 22, 30]. Indeed, memory is considered the most important resource and the main barrier to scalability in virtualized environments. All commonly used hypervisors over-commit memory and try to balance the memory needs of VMs dynamically by continually adjusting the amount of memory assigned to each VM. We need a better solution for DMA mapping, one that does not require pinning a large amount of host physical memory for an unbounded length of time under the control of an untrusted guest.

The simplest solution to avoid mapping the entire guest memory is for the guest to notify the host of the guest pages which are the targets of each DMA request. This way the host can update the IOMMU translation tables only for the addresses that the guest is using for DMA. However, our experiments with KVM in Section 6 confirm earlier experimental results: remapping IOMMU translation entries is expensive [8, 31]. If the guest initiates a hypercall for each DMA transaction, performance is degraded significantly. Therefore it is crucial for the system to minimize the number of IOMMU remappings. In an attempt to reduce the number of IOMMU remappings, the persistent mapping strategy keeps guest physical addresses mapped in the IOMMU translation tables once they have been mapped. Then, if and when the guest tries to reuse that address, the overhead of remapping is avoided since the address is already mapped. Clearly, with such a strategy, after a while all of the guest’s memory pages will be pinned as in the case of the direct mapping strategy, leading to unacceptable memory consumption.

We propose a model in which one first defines a quota of guest memory pages that the guest can pin at any given time for DMA. Without such restrictions a selfish or malicious guest with direct device access can pretend to DMA to the entire guest memory, thus making sure that all of its physical memory is pinned.

Note that with such an approach, the correct minimal quota needs to be determined. If the quota is insufficient for correct operation of the guest, it is equivalent to running a guest with insufficient host physical memory.

The quota can either be defined manually in a static fashion, just like the amount of memory that is assigned to the guest, or be changed dynamically by the host. A combination of the two, where a range of quotas is provided statically and the exact quota within that range is determined dynamically, is also possible. We now ask two questions:

• Guest perspective: Given a quota of DMA mappings, what is the optimal eviction strategy? In other words, how should the guest use its given quota of DMA mappings so that the total number of remappings (evictions) is minimized?

• Host perspective: What is the optimal allocation of memory for DMA purposes (i.e., quotas) between one or more guests, so that the shared resource (memory) is shared fairly between the guests and each guest’s I/O performance is maximized?


We addressed both questions when designing the on-demand mapping implementation, as detailed in Section 3. We also note that while keeping pages mapped in the IOMMU translation tables improves performance and preserves inter-guest protection, it does not guarantee intra-guest protection. We discuss intra-guest protection in Section 5.

2.1 A Model for DMA Mapping

Next we define a formal model for the DMA mapping problem. Given a quota Q and a series of requests by a guest driver to map and unmap guest pages ⟨g_i⟩, where the guest can make a hypercall to the host and ask the host to create or destroy one or more mappings in the IOMMU translation table for those pages, what is the optimal guest mapping strategy such that at every point in time all of the following properties hold:

  1. The guest has a set S of guest pages which are mapped in the IOMMU translation table. The set S, which we term the map cache, is composed of the set of guest pages which are candidates for eviction (E) and the set of guest pages which are pinned (P). S = E ∪ P.

  2. Each guest page g_i is either a candidate for eviction because the driver has already asked the map cache to unmap it: g_i ∈ E, or is pinned because the driver is still using it: g_i ∈ P.

  3. The quota is never exceeded, i.e., the total size of the map cache is smaller than or equal to the quota: |S| ≤ Q.

  4. When a driver tries to map a new guest page g_new, if g_new ∉ S then g_new is added to S: S = S ∪ {g_new}. If adding g_new to S would cause the size of S to exceed the quota, then a guest page g_e which is a candidate for eviction (g_e ∈ E) is evicted from S: E = E \ {g_e}. If there are no candidates for eviction, the mapping request is denied.

  5. Adding or removing one or more guest pages from the map cache S requires a remapping hypercall.

  6. The number of remapping hypercalls is minimized.

Note that our target is to minimize the number of hypercalls rather than minimizing the number of map or unmap operations, since a single hypercall can map or unmap several guest pages, and we assume that the cost of remapping a single page or several pages is similar. This is a reasonable assumption since the cost of the remapping (other than the hypercall) is the cost of the IOMMU write buffer and/or IOTLB flush, and a single IOMMU write buffer and/or IOTLB flush can be used for changing several mappings at the same time.
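The properties above can be exercised with a short simulation. The following is a minimal sketch under assumed names (MapCacheModel and its methods are ours, not the paper's code): it tracks S = E ∪ P, enforces the quota, counts remapping hypercalls, and denies a map request when every mapped page is pinned.

```python
# Minimal sketch of the quota model: S = E ∪ P, |S| ≤ Q.
class MapCacheModel:
    def __init__(self, quota):
        self.quota = quota
        self.pinned = set()      # P: pages the driver mapped and has not yet unmapped
        self.evictable = []      # E: driver-unmapped pages, still in the IOMMU (FIFO order)
        self.hypercalls = 0      # remapping hypercalls issued so far

    def map(self, page):
        """Driver maps `page` for DMA; returns False if the request is denied."""
        if page in self.pinned:
            return True                      # already pinned, nothing to do
        if page in self.evictable:           # cache hit: re-pin without a hypercall
            self.evictable.remove(page)
            self.pinned.add(page)
            return True
        if len(self.pinned) + len(self.evictable) >= self.quota:
            if not self.evictable:           # property 4: no candidates, deny
                return False
            self.evictable.pop(0)            # evict one candidate (FIFO here)
        self.hypercalls += 1                 # a new IOMMU mapping needs a hypercall
        self.pinned.add(page)
        return True

    def unmap(self, page):
        """Driver unmaps `page`: it stays in the IOMMU as an eviction candidate."""
        self.pinned.discard(page)
        self.evictable.append(page)
```

With a quota of 2, mapping a third page while two pages are pinned is denied; once one of them is unmapped (becoming evictable), the third map succeeds by evicting it.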

The DMA mapping problem, as formalized above, has two variants: offline, where requests are known in advance, and online, where requests are not known in advance.

Once formalized this way, the DMA mapping problem is remarkably similar to the well known page replacement problem [10]. However, there is a key difference. In the page replacement problem a single page is accessed at a given time, and any other page can be evicted to make room for it.
In the DMA mapping problem, there is a set of pages which cannot be evicted (the pages which a driver mapped but not yet unmapped, i.e., the pages which a device may be actively using at the moment). These pages cannot be evicted from the set of existing mappings; only pages which were previously unmapped are candidates for eviction. This difference stems from Proposition 2.1, which itself stems from the fact that there is no mechanism for I/O page faults in current IOMMUs, I/O devices, and protocols [27].

It therefore follows that in the DMA mapping problem there are some access patterns which cannot be satisfied. If the guest driver tries to map more pages than the quota allows and there are no evictable pages, then the mapping operation will fail.


2.2 The On-Demand Mapping Strategy

Using the quota model presented in the previous section, we devise a mapping strategy which maximizes performance (i.e., minimizes remapping hypercalls) for a given amount of memory consumed (i.e., a given quota). We restrict the discussion here to strategies which provide inter-guest protection only; intra-guest protection is discussed in Section 5. In the direct mapping strategy, we pin and map the entire guest memory in advance. This maximizes performance since it does not require any remapping hypercalls, but it also requires pinning all of the guest’s memory (i.e., has the worst possible memory consumption). Thus direct mapping is optimal when the quota is equal to the guest’s entire physical memory, but cannot be used when the quota is smaller than the guest’s physical memory.


The single-use mapping strategy is to map and unmap a guest page in the IOMMU translation table whenever the driver maps or unmaps a page for DMA. This strategy incurs the highest number of hypercalls and IOMMU remappings, but it can be used with very small quotas since it also minimizes memory consumption.


The persistent mapping strategy maps guest pages in the IOMMU translation table when the drivers map them for the first time. It is not specified when the pages are unmapped. We introduce a refinement of persistent mapping which we term on-demand mapping, based on the quota-based model. In the on-demand mapping strategy, mappings are created in the IOMMU translation table when the guest driver maps them for the first time, assuming the number of existing mappings is less than the quota. When the guest tries to create a new mapping which would cause the quota to be exceeded, one or more old mappings are evicted from the map cache and the new mapping created in their place.


On-demand mapping avoids the need to map the entire guest memory up front, and maps it only as needed, until the quota is reached. If the quota is set to some minimal value, on-demand behaves like single-use mapping. If the quota is set to the size of guest physical memory, on-demand behaves like direct and persistent mapping. If the quota is set to some value in between, then the behavior of on-demand depends on the workload. See Section 4.1 for a discussion of the "right" quota to set.


We have observed that a quota which is significantly smaller than guest physical memory is often sufficient for a workload’s needs (see Section 4.1). In this case on-demand has the same performance as persistent or direct mapping, while consuming much less memory, and unlike persistent, it will not grow to consume all of memory. Other times, when the host is under memory pressure, on-demand enables the host to decide how much memory to allow a given guest to pin, while also enabling the guest to achieve the best performance given the amount of memory allotted to it.


2.3 Optimal Solution to the Offline DMA Mapping Problem

Let us ignore batching of requests for a moment, and assume that we can only map or unmap a single page per hypercall. Let us further assume that the set of pinned pages P is always strictly smaller than the quota (|P| < Q). Under these assumptions, the offline version of the DMA mapping problem is for all intents and purposes equivalent to the page replacement problem, for which the optimal offline solution is Belady's theoretical memory caching algorithm [6]. Briefly stated, this algorithm always evicts the page that is going to be accessed later than any other page in the cache.

The optimal offline batching algorithm is to simply map the next N different pages in the access pattern in a single batch. The upper bound on the miss rate is 1/(N+1), which is the theoretical bound, and is much lower than the miss rate without batching, which is 1 when there is no reuse.
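The 1/(N+1) bound can be illustrated with a small simulation. The function below is a sketch under assumed names: on each miss it maps the missed page plus the next N distinct pages of the (known, offline) access pattern in one batch, so on a strictly sequential pattern with no reuse a single hypercall covers N+1 accesses.

```python
def hypercall_rate(accesses, quota, batch):
    """Fraction of accesses that trigger a remapping hypercall, when each
    miss maps the missed page plus the next `batch` distinct pages."""
    mapped = []                      # map cache contents, bounded by `quota`
    calls = 0
    for i, page in enumerate(accesses):
        if page in mapped:
            continue                 # hit: no hypercall
        calls += 1
        # One batched hypercall: the missed page plus the next `batch`
        # distinct upcoming pages from the known access pattern.
        upcoming = []
        for p in accesses[i:]:
            if p not in upcoming:
                upcoming.append(p)
            if len(upcoming) == batch + 1:
                break
        for p in upcoming:
            if p in mapped:
                continue
            if len(mapped) >= quota:
                mapped.pop(0)        # evict the oldest mapping
            mapped.append(p)
    return calls / len(accesses)

# A strictly sequential pattern with no reuse: every page touched exactly once.
seq = list(range(1000))
```

Without batching (batch=0) every access misses, giving a miss rate of 1; with a batch of N=9 extra pages the rate drops to 1/(9+1).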

2.4 Online Algorithms

Since the DMA mapping problem is related to the page replacement problem, what can we learn from the known online algorithms for the page replacement problem?
The most familiar online algorithm is the Least Recently Used (LRU) algorithm, which is used as a basis in many systems with additional optimizations. However, LRU also has well-known deficiencies with certain access patterns.
The best known algorithms for the page replacement problem make use of the access graph model [10, 11]. In this model, there is a graph which models the input sequence. In the graph there is an edge between two vertices (v, u) if an access to v is followed by an access to u (or vice versa, since the graph is usually not directed). Obviously a single graph can model many different input sequences, provided they share the same access patterns. The best known algorithms for deciding which page to evict next in the access graph model are FAR [12] and FARL [14]. Intuitively, both perform an online approximation of what the optimal offline algorithm does, evicting the graph node whose next access is the farthest in the future.
FAR and FARL show us that as long as an access pattern has some repetition (some reuse) and is not completely random, then inspecting the history and deciding which pages to evict and which to keep, based on history, is likely to be useful. We present a prefetching algorithm for on-demand mapping which exploits this insight in the next section.

2.5 Prefetching Algorithm

The frequency-based prefetching algorithm takes advantage of the low cost of batching. Many I/O workloads exhibit recurring patterns in their access sequences, often because pages are used as buffers which are reused over and over again. We exploit the fact that if a guest driver maps a page, there is a high probability that it will next try to map additional pages which were mapped in the past following this page.
The prefetching algorithm is as follows: for each page g_i, keep track of which pages were mapped most frequently in the past right after g_i. We say that a page g_i has a follower if there is a page g_j that was mapped right after g_i at least twice. If there is more than one such page g_j, the follower is the page which was mapped most often after g_i.
When the driver maps a page g_i which is not in the map cache, we make a hypercall to map it, and if it has a follower page, we also add the follower page to the batch. If the follower page has a follower page of its own, we add that page to the batch as well, repeating the process until there is no follower page or a maximal batch size is reached.
Assuming that the quota has been reached and that the remapping hypercall is trying to map n new pages, n old pages need to be evicted to make room for them. The pages to be evicted are chosen by a standard LRU algorithm.
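The follower bookkeeping can be sketched as follows. Class and method names here are ours; the at-least-twice threshold follows the text, and the cap of 3 stored followers per page mirrors the memory bound described in Section 3.3.

```python
from collections import defaultdict

class FollowerTable:
    """Track, per page, which pages were mapped right after it (at most 3),
    and pick a follower only if it was observed at least twice."""
    def __init__(self):
        self.counts = defaultdict(dict)   # page -> {next_page: occurrences}
        self.prev = None                  # last page mapped by the driver

    def record(self, page):
        """Called on every driver map request, in order."""
        if self.prev is not None and self.prev != page:
            succ = self.counts[self.prev]
            if page in succ or len(succ) < 3:    # keep at most 3 followers
                succ[page] = succ.get(page, 0) + 1
        self.prev = page

    def follower(self, page):
        succ = self.counts.get(page)
        if not succ:
            return None
        best = max(succ, key=succ.get)           # most frequent successor
        return best if succ[best] >= 2 else None # need at least 2 occurrences

    def prefetch_batch(self, page, max_batch=8):
        """Pages to map in one hypercall: the page plus its follower chain."""
        batch = [page]
        while len(batch) < max_batch:
            nxt = self.follower(batch[-1])
            if nxt is None or nxt in batch:
                break
            batch.append(nxt)
        return batch
```

After observing the pattern 1, 2, 3, 1, 2, 3 the table prefetches pages 2 and 3 along with page 1 in a single batch, since each has followed its predecessor twice.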

3. ON-DEMAND MAPPING

We designed and implemented on-demand mapping in the Linux-based KVM hypervisor [17]. This implementation built upon our earlier implementation of direct access for KVM [34].
The original implementation of direct access for KVM used the direct mapping strategy, enabling it to run unmodified guests and avoiding guest changes. The on-demand mapping strategy, however, requires a paravirtualized guest DMA mapping interface, which we implemented using hypercalls for guest-to-host communication.

3.1 Map Cache

The key guest-side component of on-demand mapping is the map cache, which caches DMA mappings inside the guest in order to avoid the overhead of going to the hypervisor to create or destroy a new mapping in the IOMMU translation table. We implemented the map cache as a Linux DMA-API implementation, which all DMA-using drivers call into [7]. The map cache is limited in size: it has a set quota of mappings it caches. The quota can be changed at run-time, and is dictated by the hypervisor, as discussed in Section 2.1.
[7] M. Ben-Yehuda, J. Mason, J. Xenidis, O. Krieger, L. van Doorn, J. Nakajima, A. Mallick, and E. Wahlig. Utilizing IOMMUs for virtualization in Linux and Xen. In OLS '06: The 2006 Ottawa Linux Symposium, pages 71–86, July 2006.

Pages (mappings) in the map cache are either pinned or candidates for eviction. When a driver asks the map cache to create a mapping for some guest page, the map cache checks if a mapping for the page already exists in the map cache. If it does, a reference count is incremented. If the page isn't mapped, the map cache makes a hypercall and asks the hypervisor to map the page in the IOMMU translation table. If the call succeeds, the map cache increments the reference count on the page. Unmap requests by the driver are handled similarly: the reference count on mappings of that page is decremented, and if it drops to zero, the page is moved to the candidates-for-eviction set. Note however that the page remains mapped in the IOMMU and, from the host's perspective, is still in use for DMA.
When the map cache makes a map or unmap hypercall, the hypervisor makes any necessary checks (e.g., that the guest is not trying to unmap a page that it hasn’t mapped, or that the guest is not about to go over quota) and then makes the necessary changes to the IOMMU translation table.
The main data structure used by the map cache is a red-black tree, using the standard Linux red-black tree implementation. The tree holds all of the pages that are mapped and their reference counts. All pages with a reference count of zero are stored in a free list and are candidates for unmapping from the IOMMU translation table and eviction from the map cache.
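The map cache logic just described can be sketched in a few lines. This is an illustrative model, not the kernel code: GuestMapCache and hypercall_map are hypothetical names, and a plain dict stands in for the red-black tree. The free list holds zero-refcount pages, and an eviction's unmap rides on the hypercall of the map that replaces it.

```python
class GuestMapCache:
    """Guest-side map cache sketch: reference-counted mappings plus a
    free list of zero-refcount pages (the eviction candidates)."""
    def __init__(self, quota, hypercall_map):
        self.quota = quota
        self.refs = {}            # mapped page -> reference count
        self.free_list = []       # pages with refcount 0, in LRU order
        self.hypercall_map = hypercall_map   # host-side map(+unmap) hypercall

    def dma_map(self, page):
        if page in self.refs:
            if self.refs[page] == 0:
                self.free_list.remove(page)  # re-pin an eviction candidate
            self.refs[page] += 1
            return True                      # hit: no hypercall needed
        evict = None
        if len(self.refs) >= self.quota:
            if not self.free_list:
                return False                 # over quota, nothing evictable
            evict = self.free_list.pop(0)
            del self.refs[evict]
        self.hypercall_map(page, evict)      # one hypercall: map + piggybacked unmap
        self.refs[page] = 1
        return True

    def dma_unmap(self, page):
        self.refs[page] -= 1
        if self.refs[page] == 0:
            self.free_list.append(page)      # stays mapped in the IOMMU
```

Re-mapping an eviction candidate costs no hypercall, and the host only hears about an unmap when a new mapping takes its place.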
While the map cache greatly improves performance by caching mappings—thereby reducing the number of DMA remappings—further optimizations are possible. The next two sections describe two classes of optimizations which reduce the amount of interaction between the guest and the hypervisor further.

3.2 Batching Driver Mapping Requests

Many commonly used drivers such as the e1000e NIC driver use the Linux kernel functions dma_map_single and dma_unmap_single to map or unmap memory for DMA.
When a large buffer with multiple pages is mapped for DMA, the driver calls dma_map_single several times to map the entire buffer. Each such call into the map cache could cause a map cache miss and a subsequent hypercall to map the page. Since the driver will not use the mappings until it has completed the entire sequence of mapping requests, we can minimize the number of hypercalls by batching these requests into a single hypercall at the end of the sequence. Batching the sequence reduces the number of hypercalls and enables us to use a single IOMMU write buffer flush for multiple changes.

We identified and implemented the following batching opportunities:

Batch map requests: When a large buffer is mapped by the NIC driver in the guest we map it using a single hypercall. We note that this optimization does not violate the intra-guest protection property.

Batch unmap requests: The equivalent of “batch map requests” for unmap requests.

Unmap piggyback: Perform a single hypercall including both map and unmap requests. In general, when the on-demand strategy is used, there is no reason to unmap a page before the quota is exceeded. Since we only unmap as a result of a new map request, the most efficient way to execute the unmap request is to piggyback it on the map request replacing it.


Adding the batching code involves minimal changes to the driver. For example, in the case of the e1000 driver we added only six new lines of code.
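The batching idea can be sketched as follows. PAGE_SIZE, pages_of, BatchingCache, and the explicit flush point are our illustrative assumptions, not the kernel API: the driver's per-page map requests are queued, and a single hypercall is issued once the whole buffer has been prepared.

```python
PAGE_SIZE = 4096

def pages_of(buf_addr, buf_len):
    """Guest page numbers spanned by a DMA buffer."""
    first = buf_addr // PAGE_SIZE
    last = (buf_addr + buf_len - 1) // PAGE_SIZE
    return list(range(first, last + 1))

class BatchingCache:
    def __init__(self):
        self.pending = []        # map requests deferred until the flush
        self.hypercalls = 0

    def dma_map_page(self, page):
        # Defer: the driver will not use the mapping before the sequence ends.
        self.pending.append(page)

    def flush(self):
        """End of the driver's mapping sequence: one hypercall (and one
        IOMMU write buffer / IOTLB flush) covers all pending pages."""
        if self.pending:
            self.hypercalls += 1
            self.pending.clear()

# A 3-page buffer: one hypercall instead of three.
cache = BatchingCache()
for page in pages_of(buf_addr=0x3000, buf_len=3 * PAGE_SIZE):
    cache.dma_map_page(page)
cache.flush()
```

Without batching, each of the three dma_map_page calls could miss and trigger its own hypercall; with batching only the flush does.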

3.3 Prefetching Mappings

In addition to the batching optimizations mentioned above, we also implemented the prefetching algorithm as described in Section 2.5. The key practical difference between batching and prefetching is that batching requires driver changes, and prefetching doesn’t. Put differently, batching would need to be implemented in every Linux driver, while prefetching can be implemented once in the DMA-API layer.
In order to minimize the algorithm’s memory consumption, we keep for each page no more than 3 pages which followed it, and the number of occurrences for each of these 3 pages. For each mapped page we prefetch the follower with the highest number of occurrences—if the number of occurrences is higher than 2.


3.4 Quota Control Policy

From the point of view of the guest, the host communicates a quota and the guest needs to keep its map cache within the confines of the quota. From the point of view of the host, what quota should it set to the guest?

3.4.1 Cooperative Guests

For cooperative guests, we note that with all modern hypervisors, virtual machines already have a certain amount of memory assigned to them (regardless of direct access) and that amount can be changed dynamically by the hypervisor, either by paging virtual machine pages out to disk, or by using a para-virtualized balloon approach [30].

The balloon driver is a hypervisor-controlled guest driver which the hypervisor uses to "steal" or "withdraw" memory pages from the guest, by asking the balloon driver to allocate them inside the guest. Once the balloon driver has allocated them, the hypervisor can safely remove the guest-physical-to-host-physical mappings of these pages in both CPU MMU page tables and IOMMU translation tables (if the page was previously mapped for DMA) and give the host frames to some other virtual machine.

Therefore, one possible approach to a dynamic quota policy for cooperative guests is to define the quota as whatever amount of memory is currently allocated to the guest anyway. We already gave the guest all of that memory, so we might as well allow it to use it for DMA too, relying on its cooperation (via the balloon driver) when we want to reclaim some of those pages.

Let us assume that the host would like to share memory between multiple virtual machines, and that the load on each virtual machine might change. The host decides how much memory to assign to each guest at each point in time according to a predefined policy. We denote the guest's total memory size Mmax, and the amount of memory that is allocated to the guest at any given time as M(t), M(t) ≤ Mmax. We denote the current DMA mapping quota as Q(t), Q(t) ≤ M(t). The host will use the balloon driver in the guest to "withdraw" memory from the guest as needed for other purposes; we denote the amount of memory withdrawn (i.e., the balloon's current size) as B(t).

It follows that when the balloon is used in the guest, the host does not need to explicitly define a DMA mapping quota, since the quota is implicitly defined by the current size of the balloon: Q(t) = M(t) = Mmax − B(t). Assuming the host has given the guest enough memory for its regular operation, the guest can also use as much of that memory as it wishes for DMA.

In other words, the quota will be indirectly applied, since the pages that are used by the balloon will never be added to the map cache, and pages which were previously in the map cache and are now allocated to the balloon will be removed from the map cache. If the hypervisor wishes to give the guest more pages (whether for DMA purposes or for other purposes) it will shrink the balloon; if the hypervisor wishes to withdraw some pages from the guest, it will inflate the balloon and cause the guest to release some of its pages. Since the map cache is only limited by the amount of currently available memory M(t), it will eventually fill, and in the steady state there will be no hypercalls for DMA remappings, maximizing performance.
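The implicit quota is then just arithmetic; a one-function sketch (implied_quota is a hypothetical name, with sizes in MiB):

```python
def implied_quota(mmax, balloon_size):
    """Q(t) = M(t) = Mmax - B(t): with a cooperative guest, the balloon's
    current size implicitly bounds how much the guest may pin for DMA."""
    assert 0 <= balloon_size <= mmax
    return mmax - balloon_size
```

For instance, a guest with Mmax = 1024 MiB and 256 MiB ballooned out can pin at most 768 MiB; deflating the balloon entirely raises the implicit quota back to Mmax.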

3.4.2 Selfish Guests

With an implicit DMA mapping quota, the host does not limit the amount of pages the guest can pin. A selfish guest could map its entire physical memory in the map cache, and ignore or disable the balloon. In such cases, setting a quota Q(t) that is strictly smaller than available memory M(t) is critical for fair sharing of memory between multiple guests. The specific quota which should be set could depend on multiple factors (e.g., workload, quality of service the host wishes to provide to different guests, or I/O adapters in use) and determining it is left as future work.
It is interesting to note that there are circumstances where the host will want to limit the amount of DMAs done by the guest even when the guest is cooperative, for example in order to conserve memory bandwidth or because there is only a limited number of IOMMU mappings available. In such cases the host can also set Q(t) to be strictly smaller than M(t).

4. EXPERIMENTAL EVALUATION

As noted previously, if the host sets the quota to be larger than the workload’s requirement, then on-demand mapping achieves a steady state where the workload’s pages are mapped and no DMA remapping is needed, i.e., we achieve maximal performance from the point of view of DMA mapping. However, if the workload size is equal to the entire guest memory size, then we might as well use persistent mapping. We therefore evaluated the quota requirements of two common networking protocols, and show that the needed quota is related to the workload size rather than the guest’s memory size, thereby demonstrating the benefits of on-demand mapping over persistent mapping.
Our setup consisted of two Lenovo M57p machines with the Intel Q35 chipset which includes VT-d. Each machine had a 2.66GHz dual-core Intel Core 2 Duo CPU with 4GB of memory. The machines were connected directly with a 1GbE cable. One Lenovo machine ran native Linux (Ubuntu 7.10 for x86 64) and the other ran Linux (Ubuntu 7.10 for x86 64) with KVM, with a single virtual machine running Fedora Core 8 (64 bit), with 1GB of memory. All runs, native and virtualized, used the on-board e1000e PCI-e NIC.

4.1 Quota Requirements of Common Workloads

We began by looking at applications which use the standard TCP/IP stack via the socket API. Due to its semantics, the send socket API must copy the data from the application buffer to a kernel buffer associated with that socket. The kernel buffer is then mapped for DMA and accessed by the NIC. Since the Linux memory allocator recycles pages, there is high likelihood that the same guest pages will be reused for socket buffers. Each socket has an upper bound on the socket buffer size, so the total number of mappings needed to cover all of the send and receive sockets at a given point is given by:

Total = SendBufferSize × NumSendSockets + ReceiveBufferSize × NumReceiveSockets

If the quota is larger than this total and there is a high probability of reuse, then after the initial mapping no further remappings will be required. On the other hand, if the quota is smaller than this total and there is little reuse, then the IOMMU translation table will constantly need to change. This means that the "right" quota is a function of the number of sockets in the system and the level of reuse. Both are likely to be fairly steady for a workload which is in a steady state.

To avoid copying the data from userspace to kernel space, most systems, and specifically Linux, offer a zero-copy API, namely sendpage and sendfile. In this case the data is passed to the device without being copied from userspace to kernel space.

sendpage can be used to map any number of pages, and the usage is completely dependent on the application. However in the case of sendfile the number of pages is bounded by the file size. We tested the Apache webserver and found that the optimal quota is equal to the sum of the sizes of the files being served by Apache at a given point in time. Again, for a workload that is in the steady state, the needed quota is likely to be fairly steady.

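As a worked example of the socket-buffer formula above (socket_quota_pages is our name and the connection counts are illustrative), the page quota needed to cover all socket buffers can be computed as:

```python
PAGE_SIZE = 4096

def socket_quota_pages(send_buf, n_send, recv_buf, n_recv):
    """Total = SendBufferSize x NumSendSockets
             + ReceiveBufferSize x NumReceiveSockets,
    rounded up to whole 4 KiB pages."""
    total_bytes = send_buf * n_send + recv_buf * n_recv
    return -(-total_bytes // PAGE_SIZE)   # ceiling division

# E.g. 100 send and 100 receive sockets with 64 KiB buffers each way
# need 200 * 16 = 3200 mapped pages to avoid any steady-state remapping.
```

A quota at or above this value lets the steady state run without remapping hypercalls, regardless of how large the guest's total memory is.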

4.2 Eviction Strategies

Except when the quota is equal to the entire guest's memory, which is wasteful, there will always be cases where the quota is not sufficient for a workload's needs. We therefore evaluated the eviction strategies described in Section 2.

In order to evaluate each strategy’s performance separately from a specific implementation, we began by considering the map cache hit rate that is achieved by each strategy on different workloads. We recorded I/O access patterns of real workloads and applied each of the strategies to them. As described in Section 2.1, we assume that batching has no additional cost. This assumption is not completely accurate in the real world, since it is dependent on the IOMMU implementation details.


We compare the following eviction strategies for selecting the page to evict from the cache:
• FIFO Evict pages in first-in-first-out order.

• LRU Evict the least recently used page.

• OPT Evict the page that is going to be used later than any other page in the cache. This is the optimal offline algorithm without batching, i.e., only a single page is replaced at a time.

• Optimal batching The optimal offline algorithm, but with batching as defined in Section 2.3. Multiple pages can be replaced at the same time.

• Prefetching The prefetching algorithm as described in Sections 2.5 and 3.3.


We recorded access patterns of two workloads, netperf send with a 65KB socket buffer, and an Apache webserver serving an httperf client that is requesting static wiki pages. For each workload we calculated the total working set size which is the total number of pages that are accessed during the run. We executed each of the eviction strategies for different quota values varying from 0 to 100% of the working set size. For each execution we calculated the cache hit rate. We note that with both workloads the total workload size depended on the workload rather than the guest memory size.

Figure 3 shows the hit rate for a netperf send access pattern for each of the eviction strategies and for different quota values. The basic strategies, non-batching LRU and FIFO, might be useful for large quotas due to their simplicity, but are highly inefficient for low quotas. The optimal batching strategy performs significantly better and achieves close to a 100% hit rate even for a quota that is 10% of the working set! This indicates the great potential of batching to reduce the DMA mapping CPU utilization overhead. The prefetching strategy takes advantage of extra knowledge to batch multiple page replacements together rather than replacing pages one by one. With this extra knowledge it achieves better results than the non-batching strategies, including the optimal non-batching strategy.

Figure 4 shows the hit rate for an Apache access pattern for each of the eviction strategies. Again prefetching does very well, getting a 90% hit rate even with a quota that is only 10% of the working set.


4.3 CPU Utilization

Next we evaluated the effect of the different batching options and prefetching on the performance of two workloads, as measured by CPU utilization (all tests saturated the 1GbE link used). We used the same netperf workload with a 65KB socket size, and an Apache workload where the client is downloading various large files. Again we looked at different quota values ranging from the minimal quota (i.e., no caching allowed) to 100%, where the quota is equal to the total number of pages that are required by the workload.

We evaluated the following optimizations:

• LRU Default LRU algorithm where no batching or caching is used.

• Piggyback Piggybacking unmaps on top of maps.

• Prefetching As previously described.

• Batching LRU with the map and unmap batching optimizations.

The first thing to note in Figures 5 and 6 is the high CPU utilization, regardless of optimization, when no caching is used (minimal quota), highlighting again the importance of improving DMA handling for direct access. As expected, the CPU utilization for IOMMU remapping is reduced as the quota is enlarged.

The piggyback optimization reduces CPU utilization by approximately 10% compared with LRU, and its impact diminishes as the quota increases. Prefetching reduces CPU utilization further, but its impact diminishes quickly as the quota increases.

Piggybacking and prefetching require no driver changes. Observing the batching optimization, we see that making changes to drivers can reduce CPU utilization significantly. In the case of the minimal quota, the batching optimization halves the CPU utilization! Moreover, batching is the only optimization available when we wish to provide intra-guest protection. The benefit of batching is clear and leads us to the conclusion that significant improvement can be gained by changing drivers to batch mappings.
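The intuition for why batching helps so much can be captured with a back-of-the-envelope cost model using the per-step cycle costs measured in Section 6. The batch size and page count below are illustrative assumptions; the model assumes batching amortizes only the hypercall, leaving the per-page IOMMU work unchanged.

```python
import math

HYPERCALL_CYCLES = 6000                     # world switch, measured in Section 6
PER_PAGE_MAP_CYCLES = 4400 + 1400 + 1200    # bookkeeping + I/O PTE update + flush

def cycles_unbatched(n_pages):
    """One hypercall per page mapped."""
    return n_pages * (HYPERCALL_CYCLES + PER_PAGE_MAP_CYCLES)

def cycles_batched(n_pages, batch_size):
    """One hypercall per batch of map requests."""
    hypercalls = math.ceil(n_pages / batch_size)
    return hypercalls * HYPERCALL_CYCLES + n_pages * PER_PAGE_MAP_CYCLES

print(cycles_unbatched(64))      # 832000 cycles for 64 single-page maps
print(cycles_batched(64, 64))    # 454000 cycles with one 64-page batch
```

With the hypercall accounting for almost half the per-mapping cost, amortizing it over a full batch cuts the total by roughly 45%, consistent with the halved CPU utilization observed at the minimal quota.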

5. INTRA-GUEST PROTECTION

Every mapping strategy discussed in this paper, and in particular the on-demand mapping strategy, provides inter-guest protection, i.e., protection of one virtual machine from another. In some circumstances it is also desirable to provide intra-guest protection: protection inside a single virtual machine from malicious or buggy devices.

The only mapping strategies which provide intra-guest protection are single-use mapping and shared mapping. Unsurprisingly, they also have the lowest performance of all the mapping strategies. The on-demand mapping strategy, as formulated in Section 2.2, keeps mappings alive after the guest has unmapped them, which opens a hole through which a buggy or malicious device could DMA to a page that is no longer being used for DMA. Therefore on-demand mapping does not support intra-guest protection.

In order to adapt on-demand mapping for intra-guest protection in Linux and close that hole, we need to add two pieces of information, which do not currently exist, to every guest page: what the page is used for, and who owns it. If every page in the system were marked as either "allocated for DMA" or not, we could cache only pages which have been marked for DMA, thereby providing a limited amount of intra-guest protection. If, in addition, we knew which component owns a given page, we could keep a page mapped only as long as it was owned by that component.

The prevalent mode of DMA mapping for Linux device drivers is to map arbitrary pages for DMA and unmap them when done. Adding an "allocated for DMA" marker to every page can be done by changing device drivers to allocate "DMA" pages through special interfaces (which already exist) rather than map them on the fly. However, changing every Linux device driver is a tall order. Even more problematic is the fact that a driver may be handed a page from some other entity (e.g., the TCP/IP stack) and asked to DMA to or from that page. Changing the way the different components in Linux interact is an even taller order and is left for future work.

Linux is a monolithic operating system, with no easily defined boundaries between different components (e.g., the page cache, TCP/IP stack, and device drivers). The different components freely pass around pages, which makes tracking page ownership difficult.
Implementing the second form of intra-guest protection, where a page is only mapped as long as it is owned by the mapping component, requires both clear boundaries between components and tracking ownership. We believe that either would require extensive changes to Linux, but could be done fairly easily in micro-kernel based operating systems where different components run with different MMU address spaces.

We note that there is another potential relaxation of intra-guest protection, where the map cache is only used for caching read-only mappings of pages. This protects against a device writing to a page of memory it shouldn't (the most common form of device misbehavior that we would like to prevent), while providing a nice boost to workloads which mainly use DMA to read from memory rather than write to it.
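This relaxation could be sketched as a map cache that distinguishes the DMA direction at map time: read-only (device-read) mappings are cached and reused, while writable mappings remain single-use. The class and return labels below are hypothetical illustrations, not part of our implementation.

```python
from collections import OrderedDict

class ReadOnlyMapCache:
    """Map cache that retains only read-only (device-read) mappings.
    Writable mappings are treated as single-use, so a misbehaving device
    can never write to a page the guest has finished using for DMA."""

    def __init__(self, quota):
        self.quota = quota
        self.cached = OrderedDict()   # page -> True; read-only mappings only

    def map(self, page, writable):
        if writable:
            # Map now, unmap immediately after the DMA completes.
            return "miss-single-use"
        if page in self.cached:
            self.cached.move_to_end(page)   # LRU refresh
            return "hit"
        if len(self.cached) >= self.quota:
            self.cached.popitem(last=False)  # evict least recently used
        self.cached[page] = True
        return "miss-cached"

cache = ReadOnlyMapCache(quota=2)
print(cache.map(0x10, writable=False))   # miss-cached
print(cache.map(0x10, writable=False))   # hit
print(cache.map(0x20, writable=True))    # miss-single-use (never cached)
```

Under this scheme a read-heavy workload keeps most of the caching benefit of on-demand mapping while writable pages retain single-use semantics.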

6. ANALYZING THE COST OF A SINGLE MAPPING

As noted in Section 1.2, in order to reduce the total cost of DMA mapping operations, one could either reduce the frequency of operations, or one could reduce the cost of a single mapping request. Just how expensive is it to create or destroy a single mapping?

Using the experimental setup detailed in Section 4, we measured the time needed to create and destroy DMA mappings, breaking the cycle cost down into different steps. In order to create a mapping, the following steps need to take place:

  1. The guest needs to communicate to the host a request to create a mapping. This is most often done using a hypercall, although other techniques are also possible. A single empty hypercall took approximately 6000 cycles on our systems, just for the world switch.

  2. The host needs to perform internal bookkeeping, such as retrieving the arguments to the hypercall from the guest's memory space and translating the guest physical address to a host physical address, which might also require faulting in the page in case it is not present. This step took approximately 4400 cycles in our setup.

  3. The host needs to update the IOMMU translation table (IOMMU translation table walk and creation of the new I/O PTE): this step took approximately 1400 cycles.

  4. The host needs to flush the IOMMU write buffer: an additional 1200 cycles, on average. We note that with the Intel VT-d IOMMU in our systems, there is no need to flush the IOTLB when an I/O PTE goes from not present to present. In the general case an IOTLB flush may be required here as well.

In order to destroy a mapping, the following steps need to take place:

  1. Again, the guest needs to communicate to the host a request to destroy a mapping: 6000 cycles for the hypercall.

  2. The host needs to translate the guest page into a host physical address and perform other related bookkeeping. This step took approximately 2600 cycles in our setup.

  3. The host needs to update the IOMMU translation table (page table walk and clearing of the I/O PTE): this step took approximately 1100 cycles.

  4. When destroying a mapping, the host must flush the entry out of the IOMMU’s IOTLB, which took approximately 2000 cycles on average.
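Totaling the measured step costs above gives roughly 13,000 cycles to create a mapping and 11,700 to destroy one, so a full map/unmap pair costs on the order of 25,000 cycles, with the two hypercalls accounting for nearly half:

```python
# Per-step cycle costs measured in Section 6.
map_cycles = {
    "hypercall (world switch)": 6000,
    "argument fetch + GPA-to-HPA translation": 4400,
    "IOMMU table walk + new I/O PTE": 1400,
    "IOMMU write-buffer flush": 1200,
}
unmap_cycles = {
    "hypercall (world switch)": 6000,
    "translation + bookkeeping": 2600,
    "table walk + I/O PTE clear": 1100,
    "IOTLB flush": 2000,
}

print(sum(map_cycles.values()))    # 13000 cycles to create a mapping
print(sum(unmap_cycles.values()))  # 11700 cycles to destroy a mapping
print(sum(map_cycles.values()) + sum(unmap_cycles.values()))  # 24700 per pair
```

This breakdown makes the two complementary levers concrete: the hypercall term can be amortized by batching, while the remaining per-page terms can only shrink through hardware or host-software optimization.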

Clearly, optimizations can be made in both hardware and software to reduce the cost of a single boundary crossing, such as optimizing the transfer of hypercall arguments and making world switches more efficient. It is also possible to switch from a hypercall-based communication mechanism to a polling mechanism wherein guest and host use shared memory areas to communicate map and unmap requests. But as long as the cost of a request is not negligible, we should also strive to make fewer requests.
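The polling alternative can be sketched as a shared request ring: the guest posts map/unmap requests without a world switch, and the host drains them while applying the IOMMU updates. The toy model below is purely illustrative; a real shared-memory ring would need lock-free producer/consumer indices and memory barriers rather than a Python deque.

```python
from collections import deque

class MapRequestRing:
    """Toy shared-memory request ring (hypothetical): the guest enqueues
    map/unmap requests with no hypercall; the host polls and drains them."""

    def __init__(self, size):
        self.size = size
        self.slots = deque()

    def guest_post(self, op, gfn):
        """op is 'map' or 'unmap'; gfn is a guest frame number."""
        if len(self.slots) == self.size:
            return False            # ring full: guest falls back to a hypercall
        self.slots.append((op, gfn))
        return True

    def host_poll(self):
        """The host would apply each drained request to the IOMMU tables here."""
        drained = []
        while self.slots:
            drained.append(self.slots.popleft())
        return drained

ring = MapRequestRing(size=2)
print(ring.guest_post("map", 0x1000))   # True
print(ring.guest_post("map", 0x1001))   # True
print(ring.guest_post("map", 0x1002))   # False -- ring is full
print(ring.host_poll())                 # [('map', 4096), ('map', 4097)]
```

Polling trades the per-request world switch for host CPU time spent checking the ring, so it pays off mainly when the request rate is high, which is precisely the regime where mapping overhead dominates.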


7. CONCLUSIONS AND FUTURE WORK

Efficient DMA mapping is a challenge for virtual machine direct access to I/O devices. Using a theoretical framework for the DMA mapping problem and a quota-based model, we developed the on-demand DMA mapping strategy. On-demand DMA mapping provides the best DMA mapping performance for a given amount of memory pinned for DMA.

There are two complementary aspects to on-demand mapping. From the host’s perspective, the question is what is the right quota for a given guest. From the guest’s perspective, the question is how to achieve the best performance with a given quota. When given a quota that is sufficient for the workload, on-demand provides maximal performance. When the quota is smaller than the workload’s needs, a heuristic prefetching algorithm that takes advantage of repeating access patterns in I/O workloads can improve performance, without requiring driver modifications. If driver modifications are feasible, batching of map and unmap requests can provide even better performance. In both cases more extensive changes or more computationally-intensive algorithms might reduce the number of re-mappings even further.

With this work we minimized the DMA mapping overhead and brought direct access performance closer to bare-metal, but closing the gap completely requires dealing with other sources of overhead such as interrupts. In addition, providing intra-guest protection with a minimal performance hit, and without requiring extensive changes to the guest operating system, remains an open challenge.

Last, but certainly not least, we believe this work demonstrates that fundamentally, an I/O device should be considered just another core that reads and writes system memory. IOMMUs will end up resembling MMUs even more than they do today, and DMA memory management algorithms will keep inching closer to CPU memory management algorithms. The key missing ingredient for such unification, which we are also pursuing, is an efficient I/O page fault mechanism [27].
