
Writing a GPU and I/O Virtualization Textbook

ChatGPT

GPU and I/O Virtualization: A Comprehensive 6-Month Course Textbook

Chapter 1: Introduction to GPU and I/O Virtualization

Core Concepts: This chapter introduces the why and what of GPU and I/O virtualization. We discuss how virtualization allows sharing expensive hardware resources (GPUs, NICs, etc.) among multiple virtual machines (VMs) and isolates their operation.
The historical lack of documentation and standardization in GPU virtualization has hindered adoption (open-iov.org), underscoring the need for clear guidance. We define basic terms (host, guest, hypervisor) and outline use cases such as cloud multi-tenant GPU servers and high-performance virtual networking.
The chapter establishes a high-level view of direct device assignment versus emulated devices, and the challenges unique to I/O devices compared to CPU or memory virtualization (e.g. handling device registers and DMA).
Though we focus on GPUs, the same principles apply to other devices like NICs and storage controllers (open-iov.org).

Standards & History: We introduce relevant standards (PCI passthrough, device virtualization features in PCI Express) and note vendor-specific technologies (e.g. Intel GVT-g, NVIDIA vGPU, AMD MxGPU) to provide historical context. This sets the stage for detailed exploration in later chapters.

Chapter 2: System Architecture and Virtualization Fundamentals

Core Concepts: This chapter builds foundational knowledge of the hardware and software components that enable I/O virtualization. We explain how PCI Express devices (GPUs, NICs) communicate via Transaction Layer Packets (TLPs) and present memory-mapped registers to the system through Base Address Registers (BARs) (open-iov.org). We then introduce the Memory Management Unit (MMU) and specifically the I/O MMU (IOMMU), which enforces memory isolation for device access. The concept of IOMMU groups is explained: an IOMMU group is the smallest set of devices that the IOMMU cannot distinguish and thus must be treated as a unit for isolation (open-iov.org). These groups form the basis of assigning devices to VMs safely – all devices in a group must be managed together to avoid cross-VM interference.

VFIO Framework: We provide a first look at the Linux VFIO (Virtual Function I/O) framework, which is an IOMMU-agnostic interface for exposing direct device access to user space in a secure manner (open-iov.org). When a device is bound to the VFIO driver (e.g. vfio-pci), it is assigned to a VFIO group (the unit of ownership) which corresponds to the IOMMU group (open-iov.org). This architecture ensures the hypervisor or user-space driver can safely control the device via file descriptors, with the IOMMU preventing any DMA to unauthorized memory.
By understanding these primitives – PCIe BARs, IOMMUs, VFIO groups – the reader builds a mental model of how a physical device can be cleanly and safely handed over to a VM.
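To make these primitives concrete, the chapter can close with a short host-side exercise. The sketch below (shell, assuming a Linux host with the IOMMU enabled; the PCI address 0000:01:00.0 and the group number 13 are placeholders) shows how to find a device's IOMMU group and list the other devices that share it:

    # Resolve the IOMMU group of a device (placeholder address 0000:01:00.0);
    # the symlink ends in the group number, which varies per system
    readlink /sys/bus/pci/devices/0000:01:00.0/iommu_group

    # List every device in that group -- as noted above, they must be
    # assigned (or left idle) together
    ls /sys/kernel/iommu_groups/13/devices/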

Chapter 3: Virtualization Modes – Full, Para, and Assisted Approaches

Core Concepts: In this chapter, we delve into different virtualization strategies for I/O devices and GPUs. We clarify full (native) virtualization versus paravirtualization in the context of device drivers. Full virtualization runs unmodified guest drivers, often by presenting a virtual device identical to real hardware (open-iov.org). Paravirtualization uses modified drivers aware of virtualization, potentially achieving better performance at the cost of requiring guest changes. We also distinguish software-only (unassisted) virtualization from hardware-assisted methods (open-iov.org). This leads to an overview of three key modes for device virtualization:

  • Mediated pass-through (Mdev Mode): A software-based approach requiring no special hardware support on the device. The host driver mediates access to a portion of the device for the guest. This mode can expose the full device API to the guest without hardware assistance (open-iov.org).

  • Direct assignment with SR-IOV (Single Root I/O Virtualization): A hardware-assisted approach where the device itself can partition resources into Virtual Functions (VFs) that appear as independent PCIe devices. The host can be mostly ignorant of guest workload details, as the device and firmware handle isolation (open-iov.org).

  • Scalable I/O Virtualization (SIOV): A newer hardware-assisted paradigm that combines aspects of SR-IOV and mediated devices, introducing fine-grained resource sharing using scalable IDs.
    Like SR-IOV, it uses hardware capabilities (e.g. PCIe Process Address Space IDs), but it also allows more flexible device sharing akin to Mdev (open-iov.org).

We present a comparison of these modes to highlight key differences. For example, Mdev requires no hardware support on the device, whereas SR-IOV and SIOV do rely on hardware capabilities (open-iov.org). In Mdev and SIOV, the host has deeper insight into and involvement with the guest's use of the device (e.g. scheduling and memory management), while SR-IOV allows the host to remain relatively ignorant of the guest workload, offloading isolation to firmware/hardware (open-iov.org). We also compare how error handling and performance monitoring differ (e.g. SR-IOV can report errors directly to the guest driver, whereas a pure software mediation may not propagate errors in the same way) (open-iov.org). By the end of this chapter, readers will understand why multiple virtualization modes exist and how to choose an approach based on hardware availability and desired control.

Chapter 4: VFIO and Device Passthrough in Linux

Core Concepts: This chapter provides a deep dive into the VFIO framework in Linux, which is fundamental for device passthrough. We explain the steps required to give a VM direct access to a physical device:

  • Device Binding: The device's driver on the host must be detached, and the device is bound to the vfio-pci driver (or a vendor-specific VFIO variant). By binding a device to VFIO, it is assigned to a VFIO/IOMMU group and made available for secure user-space access (open-iov.org). We outline how to identify a device's IOMMU group and use sysfs or driver override methods to bind it to VFIO. (This process is often referred to as device assignment or binding.)

  • VFIO Groups and Containers: We explain the relationship between VFIO groups and VFIO containers. A VFIO group corresponds to one IOMMU group (potentially containing multiple devices). The hypervisor (e.g. QEMU) opens the group, then associates it with a VFIO container (via VFIO_GROUP_SET_CONTAINER ioctl) (open-iov.org). The container represents an IOMMU context (address space) that will be used for DMA mappings for that group.

  • Memory Mapping and Security: Once the group is in a container, the hypervisor allocates memory for the guest and uses VFIO_IOMMU_MAP_DMA ioctls on the container to pin and map guest memory into the device's IOMMU page tables (open-iov.org). We discuss how memory-mapped I/O (MMIO) regions of the device are exposed to the guest. VFIO presents device MMIO regions as file descriptors so that reads/writes from the guest (through emulated PCI) are handled by VFIO and mapped to the real device registers (open-iov.org). Interrupts are handled via event file descriptors (eventfd/irqfd mechanism). For example, the guest writing to a register to signal an interrupt can be intercepted and forwarded to the host via an eventfd, and vice versa for device interrupts signalling the guest (open-iov.org).

  • Configuration Example: We provide a practical example (with step-by-step instructions) for assigning a GPU to a VM using VFIO; a consolidated command sketch follows these steps:

    1. Enable the IOMMU in BIOS/UEFI and the Linux kernel (e.g. intel_iommu=on or amd_iommu=on boot parameters).

    2. Identify the device's PCI address and IOMMU group. Ensure the group contains only the device (or understand the security implications if not).

    3. Unbind the device from its host driver (for instance, nvidia or amdgpu for GPUs, or the network driver for a NIC).

    4. Bind the device to vfio-pci (using driverctl or by writing to /sys/bus/pci/devices/../driver_override and rescanning).

    5. Start the VM (using QEMU/KVM or via libvirt), specifying the VFIO device (QEMU uses -device vfio-pci,host=<BDF>). The VFIO infrastructure will provide the guest direct access to the device's registers and DMA, with isolation enforced by the IOMMU.
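A consolidated sketch of those steps on a Linux/KVM host might look as follows (shell; the PCI address 0000:01:00.0, the disk image, and the memory size are placeholders, and exact paths and options vary by distribution and QEMU version):

    # 1. Kernel booted with the IOMMU enabled (e.g. intel_iommu=on or amd_iommu=on)

    # 2. Inspect the device and the devices sharing its IOMMU group
    lspci -nn -s 0000:01:00.0
    ls /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices/

    # 3./4. Detach the host driver and hand the device to vfio-pci
    echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
    echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
    echo 0000:01:00.0 > /sys/bus/pci/drivers_probe
    # ("driverctl set-override 0000:01:00.0 vfio-pci" achieves the same persistently)

    # 5. Launch the guest with the device passed through
    qemu-system-x86_64 -enable-kvm -machine q35 -cpu host -m 8G \
        -device vfio-pci,host=0000:01:00.0 \
        -drive file=guest.qcow2,if=virtio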

Pitfalls: We warn about common issues such as IOMMU grouping pitfalls (some devices can't be isolated due to motherboard or chipset limitations), and the need for interrupt remapping to be enabled on the host. We also mention that some devices require quirks (special handling) in VFIO (for example, GPUs that don't reset cleanly may require vendor-specific reset routines) (open-iov.org). This chapter gives readers a step-by-step understanding of how VFIO underpins safe device passthrough.

Chapter 5: Single Root I/O Virtualization (SR-IOV)

Core Concepts: This chapter focuses on SR-IOV, a PCI Express standard for hardware-assisted virtualization of I/O devices. We start with how SR-IOV works at the hardware level: a device (Physical Function, PF) can create multiple Virtual Functions (VFs) that appear as lightweight PCIe devices. Each VF has its own PCI configuration space and BARs, allowing a guest to directly read/write registers as if it were a real device (open-iov.org). The hardware ensures that each VF's DMA is isolated (usually via IOMMU and internal device logic) so that VFs cannot interfere with each other or with the PF.

Hardware and Standards: We cover key aspects of the SR-IOV specification (PCI-SIG standard):

  • The PF driver on the host allocates VFs by writing to a special register (the SR-IOV capability in PCI config space) to set the number of VFs.

  • Each VF is a thin instance that shares the underlying device resources (e.g., a NIC's VFs share the physical port and MAC/PHY; a GPU's VFs might share the physical GPU's engines and memory).

  • Hardware assistance means the device itself routes I/O and interrupts from each VF to the correct context (e.g., an interrupt from VF1 is tagged so the host can inject it to VM1). The device's internal MMU or firmware isolates memory accesses of VFs (often with the help of the system IOMMU as well) (open-iov.org).

Applicable Scenarios: We discuss common use cases:

  • Network adapters (NICs): SR-IOV is heavily used for high-performance network virtualization. For example, a single 10GbE NIC PF can spawn multiple VFs, each VF assigned to a different VM, achieving near-native throughput and latency by bypassing the software virtual switch.
    We explain that each VF can have its own MAC address and appears as a distinct NIC to the guest OS.

  • GPUs: Fewer GPUs support SR-IOV, but emerging models do (e.g., Intel’s recent GPUs and some AMD GPUs).
    We cover how GPU SR-IOV VFs function similarly – each VF exposes a virtual PCIe GPU to the guest. The hardware (with driver assistance) schedules and partitions GPU resources among PF and VFs.

Configuration Steps: A step-by-step look at enabling SR-IOV on a device (a command sketch follows this list):

  • Ensure the hardware and driver support SR-IOV. (We note that only specific devices have this capability, often found in server-grade NICs or GPUs. For instance, Intel's 11th Gen and newer iGPUs support SR-IOV mode for virtual GPUs (open-iov.org).)

  • Load the host driver with SR-IOV enabled and use sysfs to create VFs. For example, for a NIC, one can write a number to /sys/class/net/<pf>/device/sriov_numvfs to spawn that many virtual functions.

  • Each VF appears as a new PCI device (with its own BDF - Bus/Device/Function). These can then be bound to VFIO or directly to VMs via hypervisor configuration.

  • Assign the VFs to guest VMs. The guest sees a normal device (e.g., a NIC or GPU) and can use it with the standard driver.
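As a concrete illustration of these steps, the following sketch (shell; the PF interface name enp3s0f0, the VF count, the MAC address, and the VF's PCI address 0000:03:10.0 are placeholders, and exact behavior depends on the driver) enables four VFs on a NIC, pins a MAC address to one of them, and hands that VF to VFIO:

    # Create four virtual functions on the physical function
    echo 4 > /sys/class/net/enp3s0f0/device/sriov_numvfs

    # The VFs show up as new PCI devices with their own BDFs
    lspci -nn | grep -i 'virtual function'

    # Optionally pin a MAC address to VF 0 from the PF side
    ip link set enp3s0f0 vf 0 mac 52:54:00:12:34:56

    # If the VF was auto-bound to its host VF driver, unbind it,
    # then hand it to vfio-pci for assignment to a VM
    echo 0000:03:10.0 > /sys/bus/pci/devices/0000:03:10.0/driver/unbind
    echo vfio-pci > /sys/bus/pci/devices/0000:03:10.0/driver_override
    echo 0000:03:10.0 > /sys/bus/pci/drivers_probe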

We also highlight differences from Mdev: with SR-IOV, the host driver does not need to know the guest workload details, since the device itself largely handles multiplexing of resources (open-iov.org). However, the host driver (PF driver) typically manages the creation and destruction of VFs and may handle tasks like resetting VFs or providing some services that the hardware doesn't virtualize fully (e.g., fan control on a GPU might remain under the PF's control).

Pitfalls: We mention limitations such as the fixed maximum number of VFs a device supports (often limited by hardware resources or firmware, e.g., a NIC might support 64 VFs, a GPU might support a small number of VFs).
We also cover that not all features of a device may be available or isolated per VF (for example, a NIC VF might not support SR-IOV within a VF – no “nested” SR-IOV).
Additionally, sharing a device via SR-IOV means if the PF is used by the host, its performance might be impacted by heavy VF usage (unless the device supports strict QoS).
Attentive planning is needed to avoid oversubscription of resources that could degrade performance for all VMs.

Chapter 6: Mediated Devices (Mdev) and Software-Based Virtualization

Core Concepts: This chapter explores Mediated Device virtualization (Mdev), a software-mediated approach for devices that do not have native SR-IOV support or for more flexible sharing. In Mdev mode, the physical device's driver on the host divides the device's capabilities among one or more virtual devices presented to guests.
Importantly, Mdev allows exposing the full device API to the guest driver without hardware virtualization support (open-iov.org). The Linux kernel's mediated device framework provides the infrastructure for this: a mediated core module works with the vendor's physical device driver to create and manage "child" devices that can be assigned to VMs.

Implementation Mechanism: We break down how Mdev works in the Linux kernel:

  • Mediated Device Framework: The framework adds a new bus and core support in the VFIO driver stack (open-iov.org). The vendor's host driver (e.g., i915 for Intel GPUs, or NVIDIA's nvidia.ko for their GPUs) registers itself as supporting mediated devices (open-iov.org). It can then create mediated device instances. Each mediated device is represented in sysfs (usually under the physical device's directory) and has a unique identifier (often a UUID).

  • Creating Mdevs: The chapter provides an example of how an Mdev is created, e.g. by echoing a UUID into the mdev_supported_types sysfs interface for the device (open-iov.org). We note that tools like mdevctl or libraries like LibVF.IO can simplify this process (open-iov.org). When an Mdev is created, the host driver and mediated core collaborate to carve out a portion of the device's resources (memory, engine time, etc.) for that Mdev instance.

  • Guest Assignment: An Mdev appears as a VFIO device that can be passed to QEMU. We describe how QEMU will get a file descriptor for the Mdev (using VFIO just like for a physical device) and then pass that into the VM (open-iov.org). The VM sees a virtual PCI device, often a standard placeholder (for example, a vendor-specific ID that the guest driver recognizes as a virtual instance of the real device).

Resource Partitioning: A major part of this chapter is how the host driver virtualizes resources:

  • Memory: The host driver may partition VRAM (in the case of GPUs) among VMs, or use allocation tricks to give each VM the illusion of contiguous framebuffer memory.

  • Computation: Scheduling of execution units or GPU processing is handled in software by the host driver. Different scheduling algorithms can be used (time-slicing, queuing, fixed share, etc., as determined by the host).

  • MMIO/Register access: For registers that can be directly exposed (safe to be shared or duplicated), the host can map them through to the guest (fast path).
    For registers that control global state, the host driver intercepts guest writes (via trap and emulate or using VFIO's region emulation callbacks) and mediates them, ensuring one VM's request doesn't disrupt another. We explain that an Mdev thus often involves a mix of pass-through regions (directly mapped for performance) and emulated regions (handled by host, requiring traps) (open-iov.org).

Applicable Scenarios: Mdev is commonly used for:

  • GPU sharing: e.g., Intel’s GVT-g for integrated GPUs was an early example, allowing multiple VMs to time-share an iGPU.
    NVIDIA’s Grid vGPU (in a proprietary form) similarly uses a mediated approach – in fact, Linux’s Mdev was initially leveraged to allow NVIDIA’s closed vGPU driver to interface with KVM.
    We highlight that even some data center GPUs without SR-IOV use software mediation so that more VMs can share one GPU than the hardware VF limit.

  • Other devices: Some accelerators or specialized devices might also use mediated frameworks. For instance, before SR-IOV NVMe devices were common, there were software frameworks to virtualize NVMe command queues for VMs.

Configuration Steps: How to set up an Mdev-based vGPU on a Linux host (a command sketch follows these steps):

  1. Ensure the host driver supports Mdev (for example, use a specific kernel version or driver module that includes mediated device support).

  2. Load the driver and verify mdev_supported_types in sysfs (which lists what types of mediated devices can be created, often including attributes like available VRAM per instance, etc.).

  3. Create an Mdev by writing a UUID to the appropriate sysfs path (or use mdevctl to manage it more easily). The Mdev now appears as a device (e.g., /dev/vfio/XX where XX is the group).

  4. Launch the VM with the VFIO Mdev device (similar to normal VFIO, using QEMU's -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/<UUID> option). The guest will load a driver expecting a virtual GPU (for example, Intel's iGVT-g uses a variant of the i915 driver in the guest).
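A condensed sketch of this flow (shell; the parent GPU address 0000:00:02.0, the type name i915-GVTg_V5_4, the disk image, and the memory size are placeholders that vary by driver, hardware, and kernel version) might look like:

    # Discover which mediated device types the parent GPU offers
    ls /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/
    cat /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/description

    # Create a mediated device instance of that type
    UUID=$(uuidgen)
    echo "$UUID" > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create

    # Pass the instance to a guest, as in step 4 above
    qemu-system-x86_64 -enable-kvm -m 4G \
        -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/"$UUID" \
        -drive file=guest.qcow2,if=virtio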

Pitfalls: We caution that mediated devices rely heavily on the correctness of the host driver – bugs or inefficiencies in the mediation can affect all VMs. Overcommitment (assigning too many VMs to one device) can degrade performance.
We also note that Mdev-based GPUs often cannot be used by the host concurrently (unless the driver is a merged driver – see Chapter 9) because the host driver essentially partitions the whole device. Lastly, vendor support can be an issue: for example, AMD's current drivers do not yet support mediated GPU instances (open-iov.org), limiting options on that hardware.

Chapter 7: Scalable I/O Virtualization (SIOV) and Advanced Sharing

Core Concepts: This chapter introduces Scalable I/O Virtualization (SIOV) – an emerging standard aimed at overcoming limitations of SR-IOV. We explain that SIOV (sometimes called virtio-dynamic in Intel parlance or Principal Device Sharing) enables a single physical device to support a large number of isolated work contexts without needing separate PCIe functions for each. Instead of VFs, SIOV uses the concept of Shared Work Queues and ID-based isolation. Specifically, devices supporting SIOV use identifiers like PCIe Process Address Space ID (PASID) and Virtual Device ID (virtio) or Enqueue IDs to distinguish requests from different VMs or processes (open-iov.org). Each transaction from the device is tagged with a PASID, allowing the IOMMU to provide isolation at a finer granularity than an entire function.

Mechanism: We detail how SIOV builds upon ideas from both SR-IOV and Mdev:

  • Like Mdev, it allows a single physical function to be used by many guests, with the device hardware scheduling and separating workloads internally.

  • Like SR-IOV, it leverages hardware assistance: the device is aware of virtualization and can context-switch between guest workloads with minimal hypervisor involvement.
    For example, the device might have multiple command queues (one per guest or per virtual device instance) and hardware-enforced memory protections per queue.

  • The Linux kernel changes to support SIOV have included enhancements to VFIO and the IOMMU drivers (open-iov.org). We mention that a new vfio subsystem (VFIO mediated + IOMMU aware) was introduced to handle PASIDs and multiple instances more efficiently.

Standards: We reference the SIOV specification (Revision 1.0 at OCP) (open-iov.org), which formalizes how devices should implement these capabilities. For example, a network card with SIOV might allow thousands of isolated send/receive queues that can be assigned to many VMs or containers without the 1:1 coupling of VF = VM.

Hardware Requirements: A device must explicitly support SIOV (older devices cannot do this via firmware update – it’s an architecture). Modern Intel devices (like some FPGA or accelerators under the term “Data Streaming Accelerator”, and potentially future GPUs/NICs) are early examples.
    The platform IOMMU and OS also need to handle PASIDs (e.g., Intel VT-d posted I/O page tables per PASID). We note that Linux's Intel IOMMU driver was extended to support PASID-granular DMA mappings for SIOV use cases (open-iov.org).

Use Cases: SIOV is well-suited for scenarios where one might want many more than 64 virtual instances on a device (since SR-IOV typically has a relatively small VF limit). For example, in a cloud environment, a single accelerator might be shared across dozens of lightweight VMs or containers.
SIOV also reduces overhead by not requiring a full separate PCI function for each VM (saving system resources and configuration space).

Comparison: We compare SIOV to the prior two modes:

  • Versus SR-IOV: SIOV does not create actual PCI Functions for each instance, so guest OSes typically use paravirtualized drivers (e.g., a virtio driver) to communicate with the device's shared interface. This means SIOV often goes hand-in-hand with modern virtio standards (like virtio-net or virtio-gpu) but implemented in hardware. It offers more scalability and potentially dynamic instance creation/teardown without bus resets.

  • Versus Mdev: SIOV pushes more functionality into hardware (reducing the software mediation bottleneck). It still might require a host driver for management, but much of the heavy lifting (queue management, isolation) is in silicon.

Practical Status: We inform the reader of the current state (as of 2025): SIOV support in the Linux kernel and QEMU is evolving, with Intel leading the implementation for some devices. Devices supporting SIOV are just entering the market, so this chapter prepares the reader for the future of virtualization hardware. We encourage checking the latest Linux kernel documentation and hardware specs for updates. This forward-looking knowledge ensures the engineer can design solutions that will be compatible with upcoming tech.

Chapter 8: Device Binding and Driver Configuration in Linux

Core Concepts: This chapter serves as a practical guide on how to manipulate device drivers and configuration on a Linux host to set up virtualization. It covers the "plumbing" needed to get devices ready for passthrough or sharing (a host configuration sketch follows this list):

  • PCI Device Binding/Unbinding: We explain how to unbind a device from its current driver and bind it to another (like vfio-pci). This includes using /sys/bus/pci/devices/<BDF>/driver/unbind and then .../driver_override to specify vfio-pci as the new driver, followed by a rescan. Tools like driverctl can automate this.

  • Blacklisting Drivers: In scenarios where a device is reserved for VMs from boot, one may blacklist the vendor driver (so it doesn't claim the device) and let VFIO claim it by default. We provide examples (e.g., adding modprobe.blacklist=nouveau and ensuring vfio-pci.ids=<device_id> on the kernel cmdline to pre-bind a GPU to VFIO).

  • IOMMU Setup: We detail necessary kernel parameters and settings to enable the IOMMU (VT-d on Intel, AMD-Vi on AMD). We also mention ensuring ACS (Access Control Services) on the PCIe switches for better group isolation (some motherboards have ACS quirks).

  • Hugepages and Memory Pinning: For high-performance device assignment (especially for GPUs and NICs), allocating hugepages for guest memory and understanding that VFIO will pin those pages (preventing them from swapping) is important. We introduce the concept of memory pinning limits (via /etc/security/limits.conf for the QEMU process, if needed).
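The sketch below gathers these host-side settings in one place (the vendor:device IDs 10de:1c82,10de:0fb9, the hugepage count, and the user name are placeholders; the right values and file locations depend on the hardware and distribution):

    # /etc/default/grub -- IOMMU on, hugepages reserved, GPU pre-claimed by
    # vfio-pci at boot (effective when vfio-pci is built in or loaded early)
    GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt hugepagesz=1G hugepages=16 vfio-pci.ids=10de:1c82,10de:0fb9"

    # /etc/modprobe.d/vfio.conf -- keep the vendor driver away from the device
    blacklist nouveau
    options vfio-pci ids=10de:1c82,10de:0fb9
    softdep nouveau pre: vfio-pci

    # /etc/security/limits.conf -- let the QEMU user pin enough guest memory
    # qemuuser  soft  memlock  unlimited
    # qemuuser  hard  memlock  unlimited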

Kernel Interfaces and Modules: We outline the relevant kernel modules:

  • vfio_pci (generic for any PCI device),

  • vfio_mdev and vendor mediated drivers (like kvmgt for Intel GVT-g, or vendor GPU drivers that have Mdev support),

  • pci_stub or pci-assign (legacy alternatives, though VFIO is the preferred modern interface).

For network devices with SR-IOV, we show how the host driver (e.g., ixgbe or mlx5) is used to create VFs and then those VFs can be bound to VFIO or even directly used by guests via macvtap or similar.
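For the macvtap path, a minimal sketch (shell; the VF netdev name enp3s0f0v0 is a placeholder) creates a passthrough-mode macvtap on top of a VF, which the VMM can then open as the guest's network backend:

    # Create a macvtap in passthru mode on top of the VF's netdev
    ip link add link enp3s0f0v0 name macvtap0 type macvtap mode passthru
    ip link set macvtap0 up

    # The character device the VMM opens is /dev/tap<ifindex>
    cat /sys/class/net/macvtap0/ifindex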

Configuration Examples: This chapter uses a more tutorial style for a couple of representative devices:

  • Example 1: GPU Passthrough Setup – walking through binding an NVIDIA GPU to vfio-pci (including dealing with the NVIDIA driver's tendency to grab the device, possibly needing to disable the kernel NVIDIA driver or use nopci quirks).

  • Example 2: NIC SR-IOV Setup – enabling SR-IOV on an Intel NIC by echoing to sriov_numvfs, then showing that new ethX interfaces or PCI devices appear for the VFs, and binding those to VFIO or assigning via libvirt XML (a sketch of such an XML fragment follows this list).
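For the libvirt route, a hedged sketch (the VF address 0000:03:10.0 and the domain name guest1 are placeholders; managed='yes' lets libvirt handle the vfio-pci rebinding itself). A hostdev fragment saved as vf-hostdev.xml:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/>
      </source>
    </hostdev>

is then attached with:

    virsh attach-device guest1 vf-hostdev.xml --config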

Pitfalls and Debugging: We list common issues and how to check for them:

  • Device doesn't appear in VFIO: check dmesg for IOMMU errors or group conflicts.

  • IOMMU group too large (contains unwanted devices): sometimes requiring moving devices to different slots or enabling ACS override (a kernel option, though not recommended for production due to security).

  • On some GPUs, the ROM or certain registers might not be accessible – requiring enablevfio=1 for NVIDIA or UEFI boot with permissive settings.

This chapter ensures the reader can confidently manipulate device assignments on the host, a necessary skill to implement the virtualization strategies discussed earlier.

Chapter 9: Kernel Driver Internals for Virtualization Support

Core Concepts: Here we shift perspective to the device driver development side. Senior engineers and architects benefit from understanding how hardware drivers must be designed or adapted to support virtualization features. We cover the concept of Merged Drivers – a single driver that can serve the host and also expose resources to guests concurrently (open-iov.org). For example, Intel's i915 GPU driver can simultaneously drive the physical display for the host (using the PF) and create virtual GPU instances for VMs (as VFs or Mdev devices) (open-iov.org). We outline how this is achieved in driver architecture:

  • The driver must partition resources: memory (VRAM), engine time, interrupts, etc. It maintains multiple contexts – one for each VM – in addition to the host.

  • It registers with frameworks: e.g., an i915 driver registers with the mediated core if Mdev is used (open-iov.org), or with the PCI core to create VFs if SR-IOV is used. In some cases, drivers support both (i915 is transitioning from software GVT-g to hardware SR-IOV, maintaining backward compatibility) (open-iov.org).

  • Command Routing and Filtering: The driver might need to intercept certain commands from the guest. For instance, privileged operations that could harm the host or other guests must be caught. Drivers use mechanisms like irqfd/ioeventfd to handle guest signals safely (open-iov.org). We describe how, in an Mdev model, a guest's MMIO write that triggers an operation is handled by the host driver's callback, which then performs the operation on hardware on behalf of the guest (ensuring it's safe).

  • Interrupt handling: The host driver for a virtualized device often handles interrupts from the hardware and then distributes or injects them to the appropriate guest. This could be via VFIO’s eventfd (if using VFIO for each instance) or via hypercalls.
    Modern designs might use hardware message signaling where possible (e.g., separate MSI-X vectors per VF in SR-IOV, or per PASID in SIOV).

Case Studies: We look at how different vendors' drivers implement virtualization:

  • Intel i915: High-level architecture of splitting the GPU via GVT-g (software) and upcoming SR-IOV. We highlight the GuC (Graphics Microcontroller) firmware used in newer Intel GPUs to assist scheduling between contexts (open-iov.org) and how the driver leverages it or bypasses it via direct submission modes (Execlists vs GuC scheduling) (open-iov.org).

  • NVIDIA: The proprietary NVIDIA driver historically provided vGPU by running a user-space mediation service with their closed source driver. However, NVIDIA's move to open-source their kernel driver (Open GPU Kernel Modules) means we can discuss the structure of their open driver (OpenRM) which is highly object-oriented with distinct "engines" for each service (open-iov.org). We explain that their driver has to manage both physical GPU tasks and potentially mediated vGPU tasks, and how they handle resource accounting.
    (We note that as of now, NVIDIA's open driver does not expose SR-IOV or Mdev to our knowledge, but the groundwork is there for cloud GPUs.)

  • AMD: AMD's driver (AMDGPU) currently does not support mediated vGPUs, and SR-IOV support was limited to their older MxGPU (GIM) which was discontinued (open-iov.org). We discuss what's missing in AMD's driver – for instance, hooks for virtual interrupt handling (using eventfd/irqfd callbacks) are not present (open-iov.org), meaning it can't yet coordinate with VFIO Mdev. This gives an example of what modifications are needed in a driver to support virtualization (AMD is working on SR-IOV for newer CDNA datacenter GPUs, so those drivers differ from the mainstream consumer driver).

Kernel Path Walkthrough: We provide a simplified walkthrough of the data flow in the kernel when a guest interacts with a device:

  • Guest writes to device register -> trapped (if mediated) -> host driver callback -> operation on hardware.

  • Hardware raises interrupt -> host driver interrupt handler -> determine which VM (or host) it's for -> inject interrupt to guest (via VFIO or hypervisor call).

  • Memory allocation in guest (e.g., guest driver allocates GPU memory) -> may correspond to host driver allocating real VRAM and mapping it into the guest's IOMMU context.

Understanding these flows helps an architect design systems with proper driver support. The chapter emphasizes that to successfully implement a virtualization solution, one often has to consider whether the device drivers (both host and guest) support the necessary features and what modifications or configurations are needed.
This insight prepares the reader for deeper vendor-specific chapters to come.

Chapter 10: Hypervisor and VMM Integration

Core Concepts: This chapter examines the role of the hypervisor and the Virtual Machine Monitor (VMM) (like QEMU/KVM, Xen, etc.) in managing virtualized I/O. The hypervisor is responsible for orchestrating the host-side setup (which we covered in previous chapters) and the guest-side exposure of devices. We outline the responsibilities of the hypervisor/VMM in various scenarios:

  • Direct Device Assignment (Passthrough): The VMM must present the device to the guest, often by emulating a generic PCI device until the guest driver takes over. For VFIO, QEMU creates a virtual PCI device of type vfio-pci and maps the real device's BARs into the guest's address space. We explain how QEMU uses the VFIO ioctls to add a device FD to a container and then passes the FD to the VM (open-iov.org) – essentially the VMM is a conduit for the VFIO-managed resource to appear in the guest.

  • Mediated Devices: QEMU has specific support to interface with Mdev. When configured, QEMU will locate the mediated device by its sysfs path and do similar steps as with a physical device. The difference is that the VFIO device in this case is not a whole PCI device but a mediated instance – however, QEMU presents it to the guest still as a PCI device (often with a vendor ID indicating a virtual device).
    We discuss any additional steps, like ensuring the Mdev UUID is passed and QEMU doesn’t reset the physical device, etc., which are handled by the host driver.

  • Emulated vs Paravirtualized Devices: Not all I/O in a VM is done via passthrough. We contrast using passthrough (VFIO) with using fully emulated devices (e.g., QEMU's emulated e1000 NIC or Cirrus VGA) and paravirtual devices (virtio-net, virtio-gpu). The hypervisor plays a different role in each:

    • With emulated devices, the hypervisor handles every operation (high overhead, but no special hardware needed).

    • With paravirt, the hypervisor and guest cooperate (e.g., virtio uses shared memory buffers and virtqueues; this often yields good performance if the hypervisor is optimized, but still uses software switching for NICs or software rendering for GPUs unless backed by real hardware).

    • With passthrough (our focus), the hypervisor mostly sets things up and lets the guest talk directly to the hardware, which is best for performance but requires the hardware and drivers to support it.

VMM Configuration and Management: We provide examples using common hypervisors:

  • How to specify a VF or Mdev in libvirt XML for a domain (VM), including setting vendor-specific properties if needed (like setting <vendor_id> in the XML to hide the hypervisor from the NVIDIA driver to avoid Code 43, if relevant; a hedged XML sketch follows this list).

  • Differences in the Xen or VMware ESXi approach if relevant (briefly, e.g., Xen has a concept of dom0 controlling PCI assignment, and VMware has their VMDirectPath for SR-IOV, etc., though we keep the focus on Linux/KVM).
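As an illustration of those vendor-specific properties, a hedged libvirt domain-XML fragment that hides the KVM signature and sets a Hyper-V vendor ID might look like the following (element availability, and whether the workaround is needed at all, depend on the libvirt, GPU, and guest-driver versions; the vendor_id value is an arbitrary placeholder):

    <features>
      <kvm>
        <hidden state='on'/>
      </kvm>
      <hyperv>
        <vendor_id state='on' value='whatever12ch'/>
      </hyperv>
    </features>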

Hypervisor-side Resource Allocation: We mention that the hypervisor must also allocate interrupts (e.g., assign host IRQs or MSIs to route to guests), manage IOMMU contexts for each VM (in KVM, this is mostly handled by VFIO + kernel, but orchestrated by QEMU).
For SIOV, the hypervisor might have to program PASID entries.

Live Migration Considerations: A brief note – passthrough devices are traditionally not migratable (since state lives in hardware). We discuss evolving solutions:

  • For SR-IOV NICs, techniques like device state migration are being developed (e.g., saving NIC registers/state to migrate VMs with SR-IOV) (forum.level1techs.com). This requires hypervisor support plus vendor hardware support.

  • For mediated devices, live migration can be possible if the host driver supports transferring the internal state of the vGPU between hosts. (This is an advanced topic likely beyond scope, but we mention it as a consideration for architects.)

By end of this chapter, the reader should appreciate that setting up a device for a VM is not just a kernel task but requires hypervisor configuration and sometimes creative workarounds (especially for consumer GPUs that were not intended for virtualization).
The hypervisor’s role is critical in exposing virtual devices in a way the guest OS expects, whether that’s an identical physical-like device or a paravirtual device for new paradigms like SIOV.

Chapter 11: Virtual GPU Resource Modeling and Management

Core Concepts: Focusing on GPUs, this chapter explains how a physical GPU's resources can be modeled and divided into multiple virtual GPUs (vGPUs). We identify the major resource categories of a GPU that must be virtualized:

  • Memory – both local GPU memory (VRAM) and graphics aperture (GGTT/FB aperture), as well as possibly shared system memory.

  • Compute/Graphics Queues – the GPU's engines (3D, compute, copy engines) and how work is submitted to them.

  • Context Scheduling – how multiple GPU contexts (from different VMs) are scheduled on the GPU.

  • Display / Output – if VMs need virtual displays (for example, VMs with virtualized desktop GPUs).

We discuss two primary models for vGPU:

  • Time-Sliced Sharing: The GPU is not physically partitioned but each VM gets a time slice of the whole GPU. This requires fast context switching. We describe how NVIDIA vGPU and Intel GVT-g historically followed this model.

  • Partitioned (Spatial) Sharing: Newer GPUs (like NVIDIA A100’s MIG feature or AMD MxGPU in hardware) partition resources like memory and even specific compute units to each instance.
    This gives more predictable performance isolation but less flexibility (a partition can’t exceed its share even if others are idle).

Memory Management in vGPU: We explain how guest GPU memory is handled:

  • The concept of Graphics Memory Management Unit (GMMU) or GPU MMU: each process (or VM) might see a virtual GPU address space. For example, Intel's i915 uses a Global GTT (GGTT) shared by the whole system and Per-Process GTT (PPGTT) for each client/VM (open-iov.org). A vGPU must have its own address space so that one VM's GPU memory allocations don't clash with another's. The host driver or hardware provides separation, e.g., assigning different ranges of physical VRAM or using the IOMMU to map a guest's "VRAM" pages to real VRAM.

  • We highlight Intel's approach where each VM's vGPU gets a PPGTT context on the GPU, and possibly an aliasing PPGTT for partially shared contexts (open-iov.org). This is an example of how fine-grained the memory modeling can be.

  • For mediated devices, often the host driver will manage memory on behalf of the guest. In contrast, in SR-IOV, the device might natively handle per-VF memory contexts. We compare these.

Scheduling and Queueing: We look at how commands from multiple VMs are handled. Using Intel as an example:

  • The guest i915 driver in a VM produces batches of GPU commands and submits them (in GVT-g, they are trapped and forwarded to the host; in SR-IOV, they go to the hardware via the VF).

  • The host i915 or GPU firmware (GuC) will arbitrate. Intel has two modes: Execlists (software scheduling by the host driver) vs GuC submission (firmware scheduling) (open-iov.org). We explain that in virtualization, the choice of scheduler can affect performance and isolation. The host may choose a simpler method to ensure fairness (execlist) or rely on firmware (GuC) for potentially better efficiency.

  • We mention that NVIDIA's scheduler for vGPU is in their driver/firmware, and AMD MxGPU had a static partitioning (so scheduling was less an issue since each VM had dedicated resources in that model).

Performance Implications: This part educates on how modeling decisions impact performance:

  • Context-switch overhead vs dedicated contexts.

  • Memory overcommit vs strict allocation (if one VM isn't using all VRAM, can another use it? Mdev might allow dynamic sharing, whereas SR-IOV partitions might not).

  • Overhead of virtualization – e.g., if every command requires a VM exit in an emulated model it's slow, but modern methods strive for zero-copy sharing and direct execution. A prime example: Intel's i915 driver now supports direct mapping of a guest's vGPU frame buffer into the host GPU for display output, avoiding any copying or conversion (open-iov.org). This technique, using features like udmabuf, greatly improves performance for desktop virtualization by letting the host GPU scan out the guest's rendered frame directly. We describe such optimizations as key to approaching native performance.

By understanding this modeling, the reader can appreciate the complexity hidden under “it’s just a virtual GPU”.
This chapter empowers them to evaluate or design how a GPU might be shared: whether to implement strict partitions or allow dynamic sharing, how to handle memory mapping, and how to ensure one VM’s workload doesn’t starve others (or the host).

Chapter 12: Memory, IOMMU, and Address Translation in Virtualized I/O

Core Concepts: In this chapter, we take a closer look at memory address translation and protection for device virtualization. The interplay between the CPU's memory virtualization (EPT, extended page tables) and the IOMMU is critical. We describe a scenario step-by-step:

  • A guest application issues a command to a GPU, referencing a buffer in its virtual address space.

  • The guest GPU driver translates that to a guest physical address (GPA) or uses an aperture if the GPU has its own VA space.

  • When the command goes to hardware, the addresses might be guest physical, which the IOMMU on the host must translate to host physical addresses (HPAs). The device, if using DMA, will issue transactions that go through the IOMMU, which consults the appropriate page tables.

We introduce the concept of dual-stage translation for VMs:

  • First stage: guest virtual -> guest physical (handled by the CPU's page tables, irrelevant to the device unless the device supports GPU virtual memory).

  • Second stage: guest physical -> host physical (handled by the IOMMU for DMA). In Intel terms, this is done via the VT-d page tables associated with the VM's context.

For devices like GPUs that have their own MMU (GMMU), an extra layer exists (the GPU does guest virtual -> guest physical via its own page tables, then the IOMMU does guest physical -> host physical). We describe how mediated device frameworks often maintain shadow page tables: the host driver populates the device's page tables with translations directly to host memory, since the guest cannot manage real memory. For instance, the i915 GVT-g mechanism involved shadow GTT pages where each time the guest thought it set up a GPU page, the host would map it to the real location in host memory (open-iov.org).

IOMMU Configuration: We detail how the hypervisor sets up the IOMMU for a VM. Each VM (with passthrough devices) gets an isolated I/O address space, often leveraging an IOMMU domain per VM. VFIO will attach the device's group to an IOMMU container that uses the VM's IOMMU context. We mention how modern hypervisors may use the IOMMU in either identity map mode (mapping guest physical 1:1 to host physical, which requires allocating guest memory in specific ways) or in dynamic mapping mode (mapping on demand as guest memory is pinned).

Interrupt Remapping: Memory virtualization is incomplete without mention of interrupt handling. With IOMMUs often come interrupt remapping tables – ensuring that interrupts from a device (which may be passed through) get delivered to the correct VM and cannot be spoofed. We keep this brief, explaining that enabling the IOMMU's interrupt isolation is required for safe interrupt delivery in passthrough scenarios.
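A few host-side checks tie these pieces together (shell; exact kernel log wording varies across kernel versions and platforms, so the greps are deliberately loose):

    # Confirm the IOMMU came up and look for interrupt-remapping messages
    dmesg | grep -i -e dmar -e iommu -e remapping

    # Registered IOMMUs and the number of isolation groups they produced
    ls /sys/class/iommu/
    ls /sys/kernel/iommu_groups/ | wc -l

    # If interrupt remapping is unavailable, VFIO refuses passthrough unless
    # this (unsafe) module option is set -- worth knowing when debugging
    cat /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts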

PASID and Shared Virtual Memory: For advanced devices (SIOV, or modern GPUs), we describe how Shared Virtual Memory (SVM) works: the device can directly use virtual addresses from the process (even inside a VM) if the IOMMU supports it. PASID is used to tag these and the IOMMU does a combined translation.
We give an overview that a guest could one day have a GPU where the GPU threads use the guest’s virtual addresses and with nested translation the host IOMMU resolves it – but point out that this is still cutting-edge (further reading for those interested).

Known Issues: We reference a couple of known issues or complexities:

  • Peer-to-peer DMA with virtualization: devices doing DMA between each other might bypass the IOMMU and break isolation (open-iov.org), so system designers must be careful (or ensure such transactions are disallowed when using VFIO).

  • IOMMU grouping oddities: some architectures (or errata) might place devices in the same group unexpectedly, limiting isolation. Also some older GPUs might not support being behind certain IOMMU schemes (e.g., need identity mapping for peer access).

By thoroughly understanding memory and address translation, readers will be equipped to troubleshoot why, for example, a device in a VM might say “out of memory” if the mappings aren’t correctly set, or why enabling/disabling certain IOMMU features affects performance.
This knowledge is foundational for ensuring correctness and performance in virtualization.

Chapter 13: Interrupt Handling and Emulation in Virtualized Devices

Core Concepts: This chapter zooms in on how device interrupts are virtualized. We discuss the life cycle of an interrupt in a passthrough scenario:

  • A device raises an interrupt (asserts an MSI or MSI-X message, or legacy INTx).

  • On a bare-metal system, that would go to the CPU’s interrupt controller. In a virtualized setup, the interrupt must be routed to the guest if the device is assigned, or handled by the hypervisor/host if it’s a virtual device.

For MSI/MSI-X interrupts (which modern GPUs and NICs use), the IOMMU/interrupt remapper can directly inject the interrupt to the guest’s interrupt descriptor table if set up properly.
We explain how with VFIO, the hypervisor programs the device’s MSI address/data to point to an interrupt window that the hypervisor manages. The VFIO framework and KVM can then inject a virtual interrupt to the guest’s vCPU. This often leverages the irqfd mechanism: the kernel VFIO driver triggers an eventfd when the physical interrupt occurs, and KVM is listening on that to inject a virtual IRQ to the VM.
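
The signaling primitive underneath irqfd/ioeventfd is a plain eventfd. The minimal sketch below (Python 3.10+ on Linux, which exposes eventfd in the os module) shows the pattern in isolation: one side writes a counter value, the other side wakes and reads it. In the real stack the "writer" role is played by the VFIO interrupt path and the "reader" role by KVM injecting a virtual IRQ, wired together via ioctls rather than user code.

    import os
    import threading

    # Create an eventfd: a kernel counter that can be written and waited on.
    efd = os.eventfd(0)

    def waiter():
        # In the real flow this role is played by KVM, which injects a virtual
        # interrupt into the guest when the eventfd it registered is signaled.
        value = os.eventfd_read(efd)   # blocks until someone writes
        print(f"woken up, counter value = {value}")

    t = threading.Thread(target=waiter)
    t.start()

    # In the real flow this side is the VFIO/host driver interrupt handler (irqfd)
    # or a trapped guest register write (ioeventfd).
    os.eventfd_write(efd, 1)

    t.join()
    os.close(efd)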

For mediated devices, since the host driver often handles the hardware interrupts, it might receive an interrupt and then decide which vGPU it belongs to and signal that VM using ioeventfd/irqfd pairs (open-iov.org). We illustrate this with an example:

  • In a vGPU scenario, the physical GPU generates an interrupt (say, a frame rendered or a command buffer completed). The host GPU driver’s ISR runs, sees it pertains to vGPU instance X. The driver signals the eventfd (irqfd) associated with that instance’s virtual interrupt (which is wired to the guest’s virtual device in QEMU/KVM), causing the guest to receive the interrupt virtually (open-iov.org).

  • Conversely, when a guest needs to trigger something like a doorbell register (common in NICs for submitting packets, or in GPUs to kick off work), writing to that register is trapped (typically via an ioeventfd) and the host driver handles it.

We also cover MSI vs MSI-X vs Legacy: Legacy INTx (line-based interrupts) cannot be shared easily in passthrough because the hypervisor would have to emulate a virtual APIC pin.
Thus, VFIO typically requires MSI/MSI-X, and we mention that virtually all modern high-performance devices support MSI/MSI-X, so this is usually fine.

Pitfalls in Interrupt Handling:

  • Some devices generate frequent interrupts (e.g., high network packet rates). We mention strategies like interrupt coalescing on NICs and how that interacts with virtualization (the VF’s interrupts can be coalesced by hardware, offloading some work from the hypervisor).

  • Interrupt storms or mis-routed interrupts are debugged by ensuring proper interrupt remapping. If one sees host dmesg errors about interrupts from a VF not being remapped, that indicates configuration issues.

  • We include a note that for devices behind a PCIe switch, correct ACS (Access Control Services) is needed so interrupts (and DMA) are isolated per function.

APICv and Posted Interrupts: For advanced readers, we describe that modern Intel CPUs support posting interrupts directly to a running VM’s vCPU (so-called posted interrupts) which reduces latency. This is beneficial for direct-assigned devices as it can bypass the hypervisor for delivering an interrupt.
The IOMMU and APIC can coordinate to deliver an interrupt to a VM without VM exit in some cases. It’s an optimization that might be leveraged in high-end systems.

By mastering this, the engineer will be able to design systems with minimal interrupt latency (important for realtime or network functions) and also diagnose issues like “why doesn’t my VM receive interrupts from the device” (common answer: check interrupt remapping and VFIO/MSI enabling).

Chapter 14: Scheduling and Performance Isolation

Core Concepts: This chapter addresses how performance is managed and Quality of Service (QoS) maintained when sharing devices. It’s not enough to simply give access; one must consider how to prevent one VM from monopolizing a shared GPU/NIC and how to ensure fair or prioritized scheduling.

GPU Scheduling: We build on Chapter 11’s discussion. Here we focus on policies and mechanisms:

  • Time-slice scheduling vs workload-based scheduling. For example, a simple scheduler might give each vGPU 5ms of GPU time in round-robin. More complex schedulers might weight VMs differently (one VM could get a larger timeslice if it’s premium). A toy weighted round-robin sketch follows this list.

  • Preemption: critical for interactive workloads. We cover if and how a running GPU context can be preempted to switch to another (older GPUs were poor at this, newer ones improved).
    E.g., NVIDIA GPUs historically had trouble preempting a long-running kernel, which could cause latency issues for others sharing the GPU.

  • We mention Intel’s solution using GuC submission, which can handle scheduling at a firmware level potentially more efficiently, and how the host driver might choose between letting GuC manage scheduling or doing it in software (with trade-offs of control vs performance) (open-iov.org).
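
To make the time-slice policy concrete, here is a deliberately simplified sketch of a weighted round-robin vGPU scheduler (plain Python, no real GPU involved; the vGPU names and weights are invented for illustration): each instance receives GPU time proportional to its weight within a fixed scheduling period.

    from dataclasses import dataclass

    @dataclass
    class VGpu:
        name: str
        weight: int          # relative share; a "premium" VM gets a larger weight
        runtime_ms: float = 0.0

    def schedule_period(vgpus, period_ms=20.0):
        """Hand out one scheduling period of GPU time proportionally to weight."""
        total_weight = sum(v.weight for v in vgpus)
        for v in vgpus:
            slice_ms = period_ms * v.weight / total_weight
            v.runtime_ms += slice_ms        # in reality: submit/preempt contexts here
            print(f"{v.name}: ran {slice_ms:.1f} ms this period")

    vgpus = [VGpu("vgpu-premium", weight=2), VGpu("vgpu-a", weight=1), VGpu("vgpu-b", weight=1)]
    for _ in range(3):                      # three 20 ms periods
        schedule_period(vgpus)
    print({v.name: round(v.runtime_ms, 1) for v in vgpus})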

NIC Bandwidth Management: If multiple VMs share a NIC (via SR-IOV or even vSwitch), scheduling is about dividing bandwidth. SR-IOV NICs might allow setting per-VF bandwidth limits or weights.
Alternatively, one might rely on higher-level networking QoS (like Linux traffic control in the host for virtio or tap devices).
We present an example of using hardware features: e.g., a 100Gbps NIC could be split into VFs where each VF is rate-limited to 10Gbps, or left free but then one VF’s heavy traffic could starve others unless the NIC or switch enforces fairness. This is an architectural decision.

Isolation vs Efficiency: We discuss performance isolation issues like the Noisy Neighbor problem: one VM’s workload (e.g., cache-filling or memory bandwidth heavy operations on a GPU) might affect another’s performance. Full isolation is hard without hardware support:

  • GPUs: memory bandwidth and caches are typically shared, so one VM could cause cache thrashing that slows another.
    Techniques like cache partitioning are not yet common in GPUs, so we note that complete isolation isn’t achievable – one must either accept performance variability or not mix certain workloads on the same physical GPU.

  • NICs: More straightforward as each VF has separate queue resources, but they share the physical link, so link saturation by one VM affects others. This can be mitigated by QoS.

Monitoring Performance: We advise that virtualization solutions should include monitoring hooks:

  • GPU: Telemetry per VM (some drivers allow reading how many cycles or execution time each context used). For instance, NVIDIA has vGPU scheduler logs; Intel’s tooling might use perf counters to see per-context usage (open-iov.org).

  • NIC: Standard network monitoring per VF can be done (packets, bytes, etc.), and some NICs have per-VF stats registers.

Tuning and Best Practices: We give some tips:

  • If consistent performance is needed, possibly dedicate a whole device to a critical VM rather than sharing (or use static partitioning features if available, like MIG on NVIDIA A100 which gives hardware-isolated slices).

  • Use cgroups (control groups) on the host for device access if applicable (the cgroup device controller can’t throttle hardware directly, but for mediated devices one could hook into their scheduling).

  • Align the scheduling to the use case: e.g., for VDI (Virtual Desktop Infrastructure), latency and fairness are important (every user should get a responsive UI, so timeslices should be short to interleave execution).
    For compute (multiple VMs doing heavy GPU compute), throughput might matter more than latency, so longer timeslices could be fine.

This chapter ensures the reader is not only thinking “can we share?” but “how do we share fairly and efficiently?” — a crucial aspect when designing a virtualization solution for production where SLAs or user experience are on the line.

Chapter 15: Case Study – Intel GPU Virtualization (GVT-g and Xe SR-IOV)

Overview: Intel has been a pioneer in GPU virtualization with its integrated GPUs. In this chapter, we study Intel’s approach in depth, extracting lessons and practical knowledge:

  • GVT-g (Graphics Virtualization Technology): Intel’s earlier solution for Gen8/Gen9-era iGPUs (roughly Broadwell through Coffee Lake), which used a mediated device model (software-based). We describe how GVT-g allowed multiple VMs to share an iGPU via the i915 driver’s mediated mode. Key aspects include the i915 mediated driver (aka KVMGT) that worked in conjunction with a modified KVM and QEMU (open-iov.org). (It required patches to KVM/Xen in the early days to support memory traps for GPU registers, a complexity eventually alleviated by moving to native SR-IOV.)

  • Xe SR-IOV: Starting with Intel’s 11th Gen (Tiger Lake) and newer (Xe architecture GPUs), Intel introduced hardware SR-IOV support. We explain how the newer i915 driver “upstream” now supports creating VFs on the GPU (open-iov.org). This eliminates the need for a custom hypervisor—one can use standard VFIO for each VF. We highlight that one of the motivations was to remove the limitations of GVT-g (e.g., dependency on specific versions, lack of official support).
    Intel’s slides indicate these GPUs can run up to 8 VMs + host simultaneously on one iGPU (open-iov.org).
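
As a sketch of how this looks operationally (assuming a kernel whose i915/xe driver has SR-IOV support enabled; 0000:00:02.0 is the usual PCI slot for an Intel iGPU but should be verified with lspci), enabling virtual functions uses the standard PCI sysfs interface:

    import pathlib

    # Typical PCI address of an Intel integrated GPU; confirm on your system first.
    IGPU = pathlib.Path("/sys/bus/pci/devices/0000:00:02.0")

    total = int((IGPU / "sriov_totalvfs").read_text())
    print(f"GPU advertises up to {total} virtual functions")

    # Ask the PF driver to create 2 VFs (requires root and an SR-IOV-capable driver).
    (IGPU / "sriov_numvfs").write_text("2")

    # The new VFs show up as regular PCI devices that can be bound to vfio-pci
    # (see the binding sketch in Chapter 12) and assigned to VMs.
    vfs = sorted(p.name for p in IGPU.glob("virtfn*"))
    print("created VFs:", [str((IGPU / v).resolve().name) for v in vfs])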

Driver Implementation: We detail how Intel’s driver manages both host and guest contexts:

  • The i915 driver when loaded on an SR-IOV capable GPU will by default drive the host display with the PF and can optionally spawn VFs for VMs. We mention that at first the SR-IOV functionality was not documented in the public driver API (open-iov.org), but open source code is being upstreamed (so engineers might have to refer to code or community wikis like Open-IOV for guidance).

  • One big innovation by Intel is the 0-copy display for vGPUs (open-iov.org). We explain this in simpler terms: traditionally, a vGPU rendering something for a VM would need to copy that image back to the host to display or send it over remote desktop protocols. Intel’s approach maps the guest’s framebuffer directly into the host GPU’s output engine.
    This significantly boosts performance for local GUI and is an example of an optimization specific to Intel’s platform that architects should be aware of for inspiration (achieved via the udmabuf driver and i915’s dma-buf export mechanism).

Hardware Details: We note the hardware requirements Intel lists:

  • Only certain integrated GPUs had GVT-g support (we can mention a few: HD Graphics 530, 630, etc. from the 2015-2018 era) (open-iov.org), and newer ones pivot to SR-IOV (Iris Xe) (open-iov.org).

  • Intel’s use of the GuC and HuC (firmware for scheduling and media) in newer GPUs means some scheduling is offloaded. We might briefly mention how the guest’s perspective sees either a virtual GuC or the host’s scheduling decisions.

Performance and Use Cases: Many use cases of Intel vGPU were for cloud desktops or graphics acceleration in VMs (like Windows VMs using Intel integrated graphics for light 3D). We discuss performance: an iGPU has limited power, so splitting it among many VMs means each gets a fraction.
Still, for tasks like media transcoding or running UI, it can be sufficient. In products, one might pair an Intel iGPU for virtualization with discrete GPUs reserved for heavier tasks.

Lessons Learned: We summarize some lessons:

  • Upstream Support is Key: Intel’s journey shows the importance of getting features upstream (their initial GVT-g needed custom patches, but now SR-IOV support is being upstreamed, meaning easier adoption).

  • Feature Gaps: Early Intel vGPUs lacked some 3D features or had bugs (historically, certain OpenGL features were not virtualized). However, each generation closed gaps.

By analyzing Intel’s solution, readers learn how a real-world virtualization solution is built from driver to hardware.
It also demonstrates the trade-off between a quick software solution (GVT-g) vs a robust hardware-based one (SR-IOV), guiding architects on what might be appropriate given time and hardware constraints.

Chapter 16: Case Study – NVIDIA GPU Virtualization (GRID/vGPU and MIG)

Overview: NVIDIA’s approach to GPU virtualization is examined, including both their traditional GRID/vGPU technology and the newer Multi-Instance GPU (MIG) feature in A-series GPUs. NVIDIA’s ecosystem is slightly different as much of it has been proprietary, but key concepts can be learned:

  • NVIDIA vGPU (GRID): We explain that for years NVIDIA offered virtual GPU support primarily on professional/datacenter GPUs via a proprietary stack. The hypervisor (VMware, Xen, KVM) would use a vendor-supplied kernel module and user-space daemon to carve the physical GPU. On Linux/KVM, this integrates with VFIO Mdev. Indeed, the Linux kernel’s Mdev subsystem was leveraged by NVIDIA to create vGPUs on Tesla/Quadro cards by essentially using a vendor driver that registered an Mdev type for each vGPU.
    We describe at a high level how the NVIDIA driver creates vGPU instances and how the guest uses a special NVIDIA guest driver to interface with the virtual GPU. A notable challenge was license enforcement and device ID check – e.g., consumer GeForce cards by default refused vGPU operation (Code 43 in driver). Many community efforts (like patches or tools) have worked around this, showing the demand for open solutions.

  • Open GPU Kernel Modules: In 2022, NVIDIA released open-source kernel modules for their driver (branded here as OpenRM in the Open-IOV context) (open-iov.org). We share that the open driver potentially makes it easier to integrate with community virtualization (no more black-box). However, the user-space remains closed for now. We talk about how the driver is structured (the “Platform RM” and “Firmware RM (GSP)” concept) (open-iov.org) and that virtualization functionality in the open driver is still evolving. Engineers might consider hybrid approaches (using the open kernel driver with possibly customized user components for vGPU).

MIG (Multi-Instance GPU): Unique to NVIDIA’s Ampere A100 and beyond, MIG allows a single physical GPU to be partitioned at hardware level into several instances, each with dedicated compute slices and memory.
We clarify that MIG is not exactly the same as vGPU for graphics – it’s more aimed at compute workloads and is presented to the system as multiple separate GPUs.
However, we include it because it is a form of resource partitioning that could be used in virtualization (each VM could be assigned a MIG instance, which appears as a full GPU to the VM). MIG gives strong isolation between instances (each has separate SM units, guaranteed memory carve-out, etc.), at the cost that you cannot dynamically resize them easily (you must choose a partitioning scheme).

We detail how MIG works:

  • Requires GPU and driver support; the admin partitions the GPU via driver commands into (for example) 7 instances of equal size, or fewer larger instances.

  • Each MIG instance appears to software as a separate GPU with its own dedicated memory and compute slices; for VM assignment on Ampere, the vGPU stack exposes MIG-backed instances through SR-IOV virtual functions, which can then be assigned to VMs.
    This is akin to SR-IOV conceptually, but the partitioning itself is not a generic PCI mechanism: MIG is configured and managed via NVML/nvidia-smi and the driver.
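
A hedged sketch of the administrative flow (Python driving nvidia-smi via subprocess; this requires an MIG-capable GPU and driver, and the profile name used here is only one of several possibilities, so check the output of nvidia-smi mig -lgip for what your GPU offers):

    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Enable MIG mode on GPU 0 (may require the GPU to be idle or reset).
    run(["nvidia-smi", "-i", "0", "-mig", "1"])

    # 2. List the GPU instance profiles this GPU supports (sizes, counts).
    run(["nvidia-smi", "mig", "-lgip"])

    # 3. Create two GPU instances using a profile name such as "1g.10gb"
    #    (profile names differ per GPU model), with compute instances (-C).
    run(["nvidia-smi", "mig", "-cgi", "1g.10gb,1g.10gb", "-C"])

    # 4. Verify what was created.
    run(["nvidia-smi", "mig", "-lgi"])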

Performance and Use Cases: NVIDIA vGPU (non-MIG) is used in VDI, AI, etc., allowing many VMs to share a big GPU. MIG is used in cloud (NVIDIA allows cloud providers to safely partition an A100 for different customers).
We might mention that MIG has limitations (some features like graphics or CUDA contexts cannot span instances).

Open-Source Alternatives: We highlight any community projects (like LibVF.IO, which was an attempt to enable SR-IOV-like sharing on GeForce cards in open source) (open-iov.org). Such tools can sometimes enable using consumer GPUs in VMs with partial acceleration, a topic an advanced reader might explore if cost is a factor.

Pitfalls:

  • Driver enforcement: NVIDIA drivers will disable themselves (Error 43) if they detect virtualization not using an approved method on consumer cards.
    We mention this because architects should be aware of vendor-imposed constraints (and possible ethical/legal considerations of bypassing them).

  • Compatibility: Not all NVIDIA GPUs support virtualization features; only certain Tesla/Quadro support official vGPU. However, practically many (Maxwell, Pascal, Turing, etc.) can work with the right software (open-iov.org).

  • MIG requires A100 or newer; earlier GPUs cannot do similar slicing except through traditional vGPU.

By examining NVIDIA’s path (which heavily leveraged proprietary solutions), readers see a contrast to Intel’s fully open approach.
The key takeaway is how hardware capability (or lack thereof) drives the solution: lacking SR-IOV in hardware, NVIDIA first did it all in software (with license locks), whereas now that customers demand openness, we’re seeing more open drivers but still unique approaches like MIG for specific segmentation.

Chapter 17: Case Study – AMD GPU Virtualization (MxGPU and Looking Forward)

Overview: AMD’s journey in GPU virtualization has been rocky, and this chapter analyzes it to glean insights and current possibilities:

  • MxGPU (Multiuser GPU): AMD introduced this around 2015 with their FirePro S-series (based on the Tonga GPU), using a technology also known as GIM (Graphics Interface Manager). This was essentially SR-IOV-based – the hardware had the capability to create virtual functions (called “secondary” or “virtual” GPUs) and a special driver (the AMD GIM kernel module) was used on the host to manage the scheduling. We outline that MxGPU was limited to certain GPUs and had a short life (marked End-Of-Life around 2017) (open-iov.org). The open-source component (GIM) was not maintained to work with newer kernels for long, and for some time afterwards AMD did not push SR-IOV in consumer or even most professional GPUs.

  • Current State: As of recent generations, AMD’s focus has been more on high-end server GPUs (the MI series for HPC), possibly with SR-IOV, but their primary driver (AMDGPU) currently doesn’t support mediated vGPU for, say, consumer Radeon or even Radeon Pro parts.
    We mention that the AMDGPU driver lacks the necessary mediated hooks (like handling irqfd/ioeventfd for virtual interrupts) (open-iov.org). It also doesn’t implement multiple VM contexts concurrently for graphics.
    This means if an engineer is looking to virtualize AMD GPUs, the options are limited: essentially PCI passthrough of the whole GPU, or experimental use of SR-IOV on specific enterprise cards (like AMD’s MI series, or older SR-IOV capable cards if you can get the GIM module working with patches) (open-iov.org).

Analysis of Issues: We reflect on why AMD lagged:

  • Possibly fewer enterprise customers for GPU virtualization in the AMD camp, leading to less investment.

  • Technical difficulties or timing (NVIDIA and Intel were ahead in building infrastructure).

  • The open nature of AMD’s driver (a single driver for all GPUs) means integrating virtualization could be complex and risky for them without strong demand.

Future Outlook: AMD has signaled interest in SR-IOV for newer Instinct accelerators. If an AMD GPU supports SR-IOV, presumably it will expose VFs and the AMDGPU driver would eventually handle it. We advise to watch for announcements or patches in the Linux kernel regarding AMD SR-IOV support.
For instance, support for SR-IOV on AMD’s CDNA GPUs (MI100/MI200) might come, but those are compute-focused.

We also mention PCIe Resizable BAR / Large BAR usage and how that ties in (both AMD and NVIDIA use large BARs for GPU memory, which can complicate VM assignment if not enough MMIO space – an aside, but relevant in planning hardware).

Practical Guidance: For someone needing virtualization with AMD GPUs today: the safest route is whole-GPU passthrough (dedicate a GPU per VM). Multi-VM sharing on AMD is not production-ready.
If cost is a concern, one might consider AMD for passthrough since they don’t block virtualization artificially like NVIDIA’s consumer cards do. But if one needs actual multi-tenant sharing on one GPU, AMD is not yet the choice (as of 2025).

Lessons: This case underscores the importance of checking vendor support. A design that works on Intel/NVIDIA might not on AMD due to driver limitations.
It encourages engineers to contribute to or advocate for open solutions (AMD being open-source friendly means if community pushes, perhaps mediated support could be added – there have been discussions, but it requires significant work).

In summary, AMD’s example teaches caution: not all hardware is equally ready for virtualization, and one must plan around these differences.

Chapter 18: Network Device Virtualization (NICs and Beyond)

Core Concepts: While much of the text focuses on GPUs, this chapter addresses network interface virtualization, which is crucial in cloud and virtualization environments. We discuss how NICs are virtualized:

  • Software Emulated NICs: (like e1000 in QEMU) – simple but slow, mainly for legacy.

  • Paravirtualized NICs: virtio-net – efficient, software-based (packets are handled by vhost-net in the host kernel). Widely used due to flexibility (works on any NIC hardware).

  • SR-IOV for NICs: A popular approach for high performance. A single physical NIC (PF) can spawn multiple VFs, each VF can be given to a VM. We revisit SR-IOV basics but in NIC context: each VF has its own PCI function, MAC address, and ring buffers.
    The hardware ensures packets DMA to/from the correct VF’s memory and perhaps enforces basic isolation like VLAN tagging per VF. The benefit is near-native throughput and latency because the packet bypasses the software switch layer (reddit.com).

We outline typical usage:

  • In cloud deployments (OpenStack, Kubernetes with Multus, etc.), SR-IOV is used for NFV (Network Function Virtualization) or high-speed networking VMs, where a VM might get a VF of a 100Gb NIC for direct high performance.

  • The downside is reduced flexibility: a VF directly attached to a VM cannot easily be live-migrated (unless using advanced tricks) and cannot be bridged or filtered by the host easily.
    We mention that approaches like combining Open vSwitch with SR-IOV represent compromises (sometimes one uses hybrid modes where the control plane goes through software and the data plane is direct).

Configuration: Using an Intel Ethernet controller (e.g., X710) as an example (a scripted version of these steps follows the list):

  • Enable SR-IOV in BIOS if needed, load the PF driver (i40e for the X710), and create VFs by using echo to sriov_numvfs (older drivers such as ixgbe also accept a max_vfs module parameter).

  • The driver creates, say, 4 VFs. Each appears as its own PCI function with a corresponding network interface (naming varies by driver and distro).

  • Assign those VFs to VMs (via VFIO or host network namespaces). The guest uses the normal Intel VF driver (which is usually a subset of the PF driver functionality).

  • We mention the need to coordinate MAC addresses and networking – often the host sets the VF’s MAC and VLAN allowed list before giving it to a VM, since some NICs restrict VF privileges for safety.
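
A minimal scripted version of the steps above (Python, run as root; the PF interface name enp5s0f0, the MAC address, and the VLAN ID are placeholders, and the exact privileges a VF retains vary by NIC):

    import pathlib
    import subprocess

    PF_NETDEV = "enp5s0f0"                  # hypothetical PF interface name
    pf_sysfs = pathlib.Path(f"/sys/class/net/{PF_NETDEV}/device")

    # 1. Create 4 virtual functions on the PF.
    (pf_sysfs / "sriov_numvfs").write_text("4")

    # 2. Pin the MAC address and VLAN for VF 0 before handing it to a VM, so the
    #    guest cannot spoof other tenants (many NICs enforce this in hardware).
    subprocess.run(["ip", "link", "set", PF_NETDEV, "vf", "0",
                    "mac", "52:54:00:12:34:56", "vlan", "100"], check=True)

    # 3. Optionally cap VF 0 at 10 Gbit/s if the NIC supports per-VF rate limiting.
    subprocess.run(["ip", "link", "set", PF_NETDEV, "vf", "0",
                    "max_tx_rate", "10000"], check=True)

    # 4. The VF's PCI address (virtfn0) can now be bound to vfio-pci for a VM.
    vf_addr = (pf_sysfs / "virtfn0").resolve().name
    print("VF 0 PCI address:", vf_addr)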

Other I/O Virtualization: We briefly touch that similar concepts apply to:

  • Storage (NVMe): NVMe controllers now have SR-IOV (some high-end NVMe can present multiple controller functions). There’s also NVMe-oF (network based) and virtio-blk for paravirt. But direct NVMe assignment is analogous to NIC assignment.

  • Accelerators (FPGAs, AI chips): Many have started adding SR-IOV or their own sharing (e.g., the Intel QAT engine uses SR-IOV for crypto acceleration). So an architect should look for SR-IOV or virtio support on any device intended for virtualization.

Performance Considerations: We highlight that SR-IOV on NICs can drastically cut CPU overhead (no need for hypervisor to route packets), but it can complicate network topology (each VF is like a separate physical NIC to manage).
Some scenarios use a mix: give VF for data intensive VMs, but use virtio for others to allow flexibility.

Example Scenario: A load-balancer VM might get two SR-IOV VFs (one for public network, one for private) to maximize packet rate. Meanwhile, general-purpose VMs use virtio which is easier for live migration and central management.

This chapter ensures the course covers I/O beyond GPUs, reinforcing that the virtualization principles are universal. It also prepares the reader to consider networking in their designs, as often a solution requires both compute (GPU/CPU) and network planning.

Chapter 19: System Planning and Hardware Selection

Core Concepts: Shifting to a higher level, this chapter guides on planning a virtualization deployment from a hardware and platform perspective. It’s about connecting the technical details to real-world decision making:

  • CPU and Platform Features: Ensure the CPUs support Intel VT-d or AMD-Vi for the IOMMU. Check for interrupt remapping support (required for safe VFIO).
    If planning multi-GPU setups, ensure the motherboard supports ACS and has enough IOMMU groups to isolate GPUs (workstation boards might put multiple slots in one group, which is problematic). We provide a checklist for platform capabilities; a small host-capability check script appears after this list.

  • GPU Selection – iGPU vs dGPU: We discuss the pros and cons:

    • Integrated GPU (iGPU): Usually only one available, limited performance, but often has virtualization support (as we saw, Intel iGPUs support GVT-g/SR-IOV). Good for lightweight graphics in many VMs, and power-efficient. However, an iGPU will share memory bandwidth with CPU and may lack some features.
      If the use-case is modest (e.g., accelerate UI or encode/decode in VMs), an iGPU might suffice and is cost-effective.

    • Discrete GPU (dGPU): Much higher performance, can have large memory, and necessary for heavy workloads (AI, gaming, etc.). Virtualizing a dGPU depends on vendor: NVIDIA dGPUs can do vGPU if they are supported models (or using workarounds for GeForce), AMD dGPUs currently lack broad virtualization support (except very specific models).
      If using NVIDIA dGPUs, one might need to purchase certain SKUs (like Tesla/Quadro with vGPU license) or use open source tools if acceptable.
      We highlight that if isolation and performance are critical (and budget allows), dedicating a dGPU per VM is an option (e.g., in a workstation virtualization scenario).

    • Multiple GPUs: Some systems might integrate one iGPU and one dGPU – an advanced strategy could be to assign the dGPU to a primary VM for heavy lifting and use the iGPU’s virtualization for others. We encourage creative use, with caution on compatibility.

  • NIC Selection: Not all NICs support SR-IOV (most server-class ones do). We note to check the NIC model and driver: Intel, Broadcom, Mellanox all have SR-IOV in many models, but some entry-level or consumer NICs may not.
    Also consider number of VFs supported (some NICs might only allow, say, 8 VFs per port). If you need dozens of VMs with direct NIC access, choose a NIC accordingly. Alternatively, plan for a mix of direct and virtio networking.
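
The sketch below is one way to automate the platform-capability checks mentioned above (Python; it only reads standard /proc and /sys locations, so treat the output as a hint to be confirmed against BIOS settings and dmesg rather than a final verdict):

    import pathlib

    def has_cpu_flag(*flags):
        tokens = pathlib.Path("/proc/cpuinfo").read_text().split()
        return any(flag in tokens for flag in flags)

    # Virtualization extensions (vmx = Intel VT-x, svm = AMD-V).
    print("CPU virtualization:", "yes" if has_cpu_flag("vmx", "svm") else "no")

    # An active IOMMU shows up as populated IOMMU groups.
    groups = pathlib.Path("/sys/kernel/iommu_groups")
    n_groups = len(list(groups.iterdir())) if groups.exists() else 0
    print("IOMMU groups present:", n_groups)

    # SR-IOV capable devices advertise sriov_totalvfs in PCI sysfs.
    sriov_devs = [p.parent.name
                  for p in pathlib.Path("/sys/bus/pci/devices").glob("*/sriov_totalvfs")
                  if int(p.read_text()) > 0]
    print("SR-IOV capable devices:", sriov_devs or "none found")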

Capacity Planning: We advise on how to estimate resource needs:

  • If using SR-IOV or MIG, how many instances can the hardware support and does that meet your VM count?

  • Do you need hardware redundancy (e.g., if one GPU is shared by many VMs and it fails, that’s a single point of failure – maybe better to spread VMs across multiple GPUs).

  • Consider future scalability: maybe invest in a GPU that supports SIOV down the line, or ensure the motherboard has enough extra slots if adding more GPUs later.

Isolation and Security: We mention that from a security angle, having an iGPU share across tenants vs. having dedicated dGPUs is a decision – a shared device could potentially be a vector for covert channels or side-channel attacks.
If VMs belong to different trust zones (different customers in cloud), strong isolation is needed (SR-IOV or MIG with hardware enforcement might be preferable to a purely software Mdev which could have more complex attack surfaces).
In some cases, regulations might demand physical separation.

Example Scenarios:

  • Virtual Desktop Infrastructure (VDI): Likely can leverage one big GPU for multiple VMs. NVIDIA or AMD multi-user GPUs might be considered. If using Intel, typically only a single iGPU is available per host, so scaling on one host is limited.

  • Network Functions (VNFs): Use SR-IOV NICs for near-native performance, perhaps also use CPU pinning (not the main topic here, but tangential).

  • Cloud AI accelerators: Possibly give each VM a MIG partition of an A100, to guarantee performance.

We conclude by providing a decision matrix or checklist for engineers:

  1. Determine workload type (graphics, compute, network heavy).

  2. Choose virtualization tech accordingly (Mdev vs SR-IOV, etc.) – e.g., for highest performance and if hardware supports it, use SR-IOV/MIG; if hardware doesn’t, use Mdev; if no sharing is needed, direct passthrough might suffice.

  3. Verify hardware supports it (device model, driver).

  4. Weigh trade-offs (flexibility vs performance vs cost) and then architect the system (which might mean picking a different GPU or NIC if the initial one doesn’t fit the needs).

By following this, the reader can make an informed strategy for implementing virtualization in a product.

Chapter 20: Tools and Ecosystem for Virtualization Deployment

Core Concepts: This chapter surveys the user-space tools and software ecosystem that assist in building and managing a virtualization solution:

  • Management Tools: We cover common management stacks like Libvirt (which provides an abstraction to configure VFIO devices and SR-IOV interfaces in XML), and oVirt/OpenStack for larger scale (just conceptually how they expose options for vGPUs or SR-IOV networks).

  • CLI Utilities: Introduce mdevctl – a command-line tool to manage mediated devices on a host (create, list, remove vGPUs) (open-iov.org). This is very handy for scripting vGPU setups instead of manual sysfs writes; a sketch of the underlying sysfs flow follows this list.

  • LibVF.IO: A community project aiming to simplify and unify vGPU and SR-IOV usage for GPUs (open-iov.org). We describe that LibVF.IO provides scripts or a framework (written in Nim, as per the FOSDEM reference) to set up virtual functions on consumer GPUs. It’s an example of how open source communities try to bridge gaps where official support is lacking.

  • GVM (Linux-GVM): We saw references to a GVM project (open-iov.org) – which likely stands for “GPU Virtual Machine” or similar. It appears to provide components (like Mdev-GPU) to help register devices that don’t natively support Mdev. We explain that tools like GVM can wrap around drivers to provide missing functionality.
    For instance, if a GPU driver doesn’t support Mdev, GVM might offer a way (via a separate module) to create mediated devices for it (open-iov.org). This is cutting-edge and might require patching or out-of-tree modules, but it’s useful for experimentation.

  • Monitoring and Tuning Tools: e.g., nvidia-smi vgpu (for NVIDIA, to monitor vGPU instances), or similar for Intel (perhaps intel_gpu_top to see per-VM engine usage). Also virtualization-specific ones like virtio-trace for virtio devices.

  • DevOps and Scripting: We encourage using automation (Ansible scripts or Terraform for infra) to consistently set up IOMMU, bind devices, create vGPUs, etc. Many common tasks (like binding VFIO) can be automated at boot.
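
For reference, this is roughly what mdevctl automates under the hood (Python; the parent device address and the type name i915-GVTg_V5_4 are examples only and differ per GPU and driver, so list mdev_supported_types first):

    import pathlib
    import uuid

    # Hypothetical mediated-device-capable parent and type; inspect sysfs to find yours.
    PARENT = pathlib.Path("/sys/bus/pci/devices/0000:00:02.0")
    MDEV_TYPE = "i915-GVTg_V5_4"

    types_dir = PARENT / "mdev_supported_types"
    print("available types:", [t.name for t in types_dir.iterdir()])

    # Creating a mediated device is a single write of a fresh UUID to the type's
    # "create" node; the UUID then names the vGPU instance handed to QEMU.
    mdev_uuid = str(uuid.uuid4())
    (types_dir / MDEV_TYPE / "create").write_text(mdev_uuid)
    print("created mdev:", mdev_uuid)

    # To tear it down later, write 1 to the instance's "remove" node:
    # (PARENT / mdev_uuid / "remove").write_text("1")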

Development Tools: For engineers designing solutions:

  • QEMU and Kernel Debugging: Sometimes one needs to dive into QEMU logs or enable VFIO debug logs to troubleshoot device assignment. We mention where to look (e.g., the per-domain logs under /var/log/libvirt/qemu/ or dmesg for IOMMU messages).

  • Simulators/Emulators: If actual hardware is not available, one could use a software-only GPU (e.g., virtio-gpu with a software renderer) just to test the setup flow, though performance will be low.

Flowcharts and Diagrams: We include or reference flow diagrams (if we had them) illustrating how these tools integrate. For example, a flowchart of using mdevctl: Admin triggers mdevctl -> interacts with sysfs -> kernel creates mdev -> QEMU uses it.
Or a diagram how Libvirt orchestrates an SR-IOV NIC: (Libvirt XML -> calls ip link to create VF or expects pre-created -> binds VF to VM via VFIO).

We also discuss community resources: forums (VFIO mailing list, etc.), documentation sites (like Open-IOV itself, which this textbook is based on, and how to contribute or find updated info there).

By mastering the ecosystem, readers will not just rely on theoretical knowledge but will be aware of practical tools that make deploying virtualization easier.
It underlines that building a solution is not just writing code or toggling hardware features, but also stitching together these components in a maintainable way.

Chapter 21: Troubleshooting and Common Pitfalls

Core Concepts: This chapter is essentially a “How to fix it when it breaks” guide, listing common problems and their solutions:

  • IOMMU Group Issues: “Why can’t I pass through my GPU? – It’s in a group with other devices.” We reiterate the cause (lack of ACS isolation) and mention possible mitigations: try a different slot, or use the pcie_acs_override option (an out-of-tree kernel patch, with the caveat of security risk) to split groups (forum.level1techs.com). We also mention checking for new IOMMU features like IOMMUFD in the latest kernels, which aim to abstract groups (though groups remain fundamental for security).

  • Device Won’t Reset: GPUs often have reset issues (especially AMD and older Nvidias). Symptom: VM cannot be restarted because the device stays busy. Solutions: use vendor-specific reset quirks (some included in kernel for common GPUs), or as a workaround, doing a full host reboot in worst case.
    For NICs, resets are usually simpler but still possible to hang – update firmware if so.

  • NVIDIA Error 43: A notorious issue where consumer NVIDIA drivers detect they are in a VM and disable themselves. Solution: apply known workarounds (OVMF GPU BIOS tweaks, or hypervisor patches to hide virtualization).
    We explain that this is not a technical limitation but an artificial restriction. Nonetheless, it’s a “pitfall” many encounter, so it’s addressed.

  • Performance not as expected: If a VM with a VF NIC is slow – check if it’s actually using the VF driver (maybe the VM fell back to a slower driver). Or if a vGPU is slow – perhaps the scheduling timeslice is too small or there’s CPU bottleneck handling interrupts.
    We advise checking host CPU usage (if high, maybe too many VM exits – consider increasing hugepages or enabling posted interrupts, etc.).

  • Stability issues and driver bugs: We leverage open-iov documented issues to give examples:

    • For instance, NVIDIA vGPU known issues: on some laptops, using a mediated GPU caused errors on battery power (open-iov.org) – advice might be to always test on target hardware and perhaps force the maximum performance state to avoid power state changes. Or the vfio-vmalloc error where overlapping memory caused issues on smaller-VRAM GPUs (open-iov.org) – solution: possibly reduce the number of VMs or use GPUs with more memory to avoid collision.

    • AMD GPU issues: Because AMD doesn’t support Mdev, some tried hacks that inevitably hit issues like lack of interrupt routing (open-iov.org). We say if you see no interrupts in the guest on AMD, it’s likely because the driver doesn’t support it – no easy fix except waiting for official support or community patches.

    • Intel issues: Early GVT-g required a custom hypervisor (KVMGT) (open-iov.org) – if someone inadvertently mixes an old GVT-g with a stock KVM, it won’t work. The fix is to use the versions in sync or move to the newer SR-IOV approach.

  • Compatibility and Version Mismatch: Remind that the host driver version and guest driver version sometimes must be compatible (e.g., NVIDIA vGPU requires matching the guest driver version with the host driver vGPU module). If mismatched, things may not function or may be unstable.

We present a troubleshooting checklist:

  • Check hardware (IOMMU enabled? device in correct group? BIOS settings).

  • Check host software (drivers loaded? any dmesg errors?).

  • Check hypervisor config (correct XML or QEMU args?).

  • Check guest drivers (did the guest actually load the proper driver for the virtual device?).

  • Use tools (like lspci in the guest to see if the device is present, dmesg in the guest to see errors, etc.).
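
The host-side half of that checklist is easy to script. The sketch below (Python, read-only; the PCI address is a placeholder) reports which driver owns a device, which IOMMU group it sits in, and who else shares that group: the three facts needed for most passthrough bug reports.

    import pathlib

    DEV = "0000:01:00.0"                     # placeholder PCI address
    dev = pathlib.Path("/sys/bus/pci/devices") / DEV

    driver_link = dev / "driver"
    driver = driver_link.resolve().name if driver_link.exists() else "none (unbound)"
    print(f"{DEV} bound to driver: {driver}")

    group_link = dev / "iommu_group"
    if group_link.exists():
        group = group_link.resolve().name
        peers = [p.name for p in (group_link / "devices").iterdir() if p.name != DEV]
        print(f"IOMMU group {group}; other devices in the group: {peers or 'none'}")
    else:
        print("no IOMMU group: is the IOMMU enabled in firmware and on the kernel command line?")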

Pitfalls in Configuration:

  • Forgetting to reserve memory for MMIO: e.g., if passing a GPU that has a large BAR, the VM’s firmware might need adjustments (especially in older SeaBIOS days, but UEFI usually handles large BARs).

  • For SR-IOV NICs: sometimes VFs can’t reach the host or other VFs if not configured properly (e.g., a VF might not talk to the PF host because of anti-spoof filters; the fix could be enabling “trust” mode on the VF if needed, depending on the NIC).

By sharing these real-world issues and solutions, we prepare the reader to not only design but also support a virtualization deployment. It emphasizes that even with proper planning, debugging and fine-tuning are part of the process.

Chapter 22: Best Practices and Use Case Strategies

Core Concepts: This chapter synthesizes the knowledge into concrete best practices and strategies for different scenarios. It’s somewhat like a summary, but structured by use-case:

  • High-Performance Computing (HPC) with GPUs: Suggest either pass-through or MIG for strict performance isolation. If using virtualization, keep software overhead minimal (e.g., pin vCPUs, use hugepages, avoid overcommitting the GPU if possible).

  • Cloud Multi-Tenancy (Mixed workloads): Recommend hardware with SR-IOV or strong isolation for tenants that don’t trust each other. E.g., a cloud offering could use NVIDIA A-series with MIG or AMD SR-IOV GPUs (when available) to guarantee one tenant can’t affect another beyond their share.
    Also, perhaps use network namespaces with SR-IOV NICs for tenant networking isolation.

  • Enterprise VDI: Perhaps use a combination: one powerful GPU with vGPU profiles for many users (NVIDIA vGPU profiles or Intel’s VFs). Best practice is to match the profile to user need (e.g., assign more GPU memory to CAD users vs basic office users).

  • Edge Computing (limited hardware): Maybe an Intel iGPU is all that’s available – use GVT-g to split it among a couple of VMs for lightweight tasks, and use a single NIC with SR-IOV for network slicing if needed. Emphasize careful resource splitting and testing since edge devices might be less powerful.

Specific Guidance:

  • iGPU vs dGPU: Use iGPU virtualization if power and cost are a concern and workload is light/medium. Use dGPU if heavy compute or graphics, acknowledging you might need one per VM unless you have virtualization support.
    An interesting strategy: combine them – e.g., host uses iGPU for display, guests use portions of dGPU.

  • When to use SR-IOV NIC vs virtio: If sub-5% overhead is acceptable and flexibility is needed (migration, software-defined networking), virtio may be fine. If absolute performance is needed (line-rate 100Gbps, microsecond latency), go SR-IOV but accept the constraints (no live migration or complex network sharing).

  • Storage Sharing: Though not the main topic, we mention NVMe-oF or using network storage vs direct NVMe assignment as analogous trade-offs (flexibility vs performance).

Security Best Practices: Outline that for multi-tenant setups:

  • Always enable and use the IOMMU (some people turn it off for a slight perf gain – not worth the risk).

  • Use ACS on PCIe or ensure proper grouping – don’t assign a group that has multiple devices to different VMs.

  • Keep device firmware up to date (devices with SR-IOV often have fixes for virtualization issues in firmware updates).

  • Consider using separate root ports for assigned devices to localize errors (if a device misbehaves under VFIO, a separate root port can contain errors that would otherwise propagate).

Emerging Trends to Plan For: We foreshadow that SIOV will become available, and designing software in a flexible way (e.g., using virtio drivers that can take advantage of hardware assist in the future) might be wise. Similarly, technologies like device firmware upgrades to add virtualization (some NICs got SR-IOV via firmware update post-deployment) mean one should keep an eye on vendor updates.

In essence, this chapter is about making architectural decisions with a holistic view – not just technical feasibility, but considering business needs (cost, security, performance guarantees).
It wraps up the actionable advice an architect or senior engineer would take to the planning table.

Chapter 23: Emerging Technologies and Future Directions

Core Concepts: Finally, we look ahead to how the landscape of GPU and I/O virtualization is evolving:

  • Scalable IOV (SIOV) Adoption: Now that we discussed SIOV, here we speculate on its adoption. The PCIe spec revisions and industry consortia (like OCP) have published the SIOV 1.0 spec (open-iov.org). We expect device vendors to start implementing it. We discuss how this could blur the line between SR-IOV and Mdev – in the future, a device might support thousands of instances with minimal overhead.
    This could enable new cloud paradigms (e.g., function as a service accelerators where each invocation gets an isolated slice of a GPU).

  • CXL (Compute Express Link): An upcoming technology related to PCIe that allows memory sharing. We touch on how CXL might enable new forms of device sharing or memory pooling that could benefit VM device assignment (e.g., multiple hosts sharing a pool of accelerators over a CXL fabric, etc.).

  • Improved Mediated Frameworks: Linux continues to refine its virtualization subsystems. Mention IOMMUFD – a new user API in Linux to manage IOMMU and VFIO in a more unified way, which might simplify writing user-space drivers and perhaps allow better integration of things like DPDK with VFIO.
    For instance, QEMU will eventually move to IOMMUFD which will streamline container/group handling (for the user, this means more reliable and possibly faster setup).

  • Confidential Computing Impact: With technologies like Intel TDX or AMD SEV (encrypting VMs), device assignment has challenges (since a device DMA needs to access encrypted memory).
    We mention briefly that future virtualization must reconcile security with performance – vendors are working on mediated solutions that can work with encrypted VMs (like secure mediated devices).

  • Integration with Containers: There’s a trend of using these virtualization techniques in container orchestration (e.g., Kubernetes Device Plugins for GPUs and SR-IOV). We foresee more convergence where a container might get a VF or an mdev as easily as a VM does today.
    This might require further abstraction so that container runtime can handle attaching/detaching devices seamlessly.

Vendor Roadmaps: Without getting too speculative, we note hints:

  • Intel will have discrete GPUs (Intel ARC / Data Center GPUs) and will likely push their virtualization features there too.

  • NVIDIA might eventually open more of their stack or support SR-IOV (some of their data center GPUs reportedly have SR-IOV support for certain functions).

  • AMD’s open approach could yield a unified solution that’s upstream (we hope AMD’s future GPUs revive SR-IOV or adopt SIOV, given their participation in standards).

Encouraging Adaptability: We advise the reader to design virtualization solutions that are modular. For instance, using VFIO and virtio interfaces means if underlying implementation changes (say, from Mdev to SIOV) the higher-level architecture can remain similar. Embrace open standards (virtio, etc.) so that hardware innovations can be utilized with minimal software changes.

We might include a vision: perhaps in a few years, assigning a device to a VM or container will be as simple and common as assigning memory or CPU, thanks to these advances – with negligible overhead and strong isolation by default.

Chapter 24: Conclusion and Recap

In the final chapter, we wrap up the course textbook by summarizing the journey:

  • We revisit the initial premise: enabling advanced engineers to independently evaluate and implement GPU/IO virtualization in their products.
    We encourage the reader to reflect on key takeaways: understanding the hardware capabilities, the kernel/hypervisor software pieces, and the trade-offs of each approach.

  • Recap of Key Learnings: From Chapter 1 to 23, list one line per major point (e.g., “VFIO and IOMMU provide the secure foundation”, “SR-IOV gives hardware-isolated virtual functions”, “Mdev offers flexibility on devices without SR-IOV”, “Intel/NVIDIA/AMD have differing support levels”, “Careful planning needed for performance and security”, etc.).

  • Real-world Impact: We might mention a few real systems that use these technologies (for inspiration, e.g., how AWS or Azure implement GPU instances – often they pass through entire GPUs, whereas some services like NVIDIA Grid are used in enterprise; or how NFV platforms use SR-IOV for packet processing VMs).

  • Next Steps for the Reader: Suggest the reader apply this knowledge: perhaps lab exercises like “set up a VM with a passed-through GPU and measure performance vs bare metal” or “create an SR-IOV VF and ping between host and VM to see near-native latency”.
    Encourage them to keep updated via communities (like the Open-IOV wiki (open-iov.org), Linux kernel mailing lists, etc.) because the field is evolving.

Final Thoughts: We conclude on an encouraging note – the virtualization of GPUs and I/O, once niche and complex, is becoming mainstream and more accessible thanks to open source contributions and industry standards.
With the solid understanding from this textbook, the engineer or architect can design innovative solutions that leverage these technologies for efficient, high-performance systems. The last sentence can reinforce that through systematic learning (as we did in 24 weeks), the once daunting landscape of I/O virtualization can be mastered and applied to solve real-world computing challenges.

References

GPU Support - Open-IOV
https://open-iov.org/index.php/GPU_Support

Virtual I/O Internals - Open-IOV
https://open-iov.org/index.php/Virtual_I/O_Internals

Introduction - Open-IOV
https://open-iov.org/index.php/Introduction

Merged Drivers - Open-IOV
https://open-iov.org/index.php/Merged_Drivers

GPU Driver Internals - Open-IOV
https://open-iov.org/index.php/GPU_Driver_Internals

SR-IOV VF NIC (VM) can't talk to the host - Level1Techs Forums
https://forum.level1techs.com/t/sr-iov-vf-nic-vm-cant-talk-to-the-host/211361

GPU Driver Internals - Open-IOVGPU 驱动内部 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

GPU Driver Internals - Open-IOVGPU 驱动内部 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

GPU Driver Internals - Open-IOVGPU 驱动内部 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

GPU Driver Internals - Open-IOVGPU 驱动内部 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

GPU Driver Internals - Open-IOVGPU 驱动内部 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

Virtual I/O Internals - Open-IOV虚拟 I/O 内部 - Open-IOV

https://open-iov.org/index.php/Virtual_I/O_Internals

Virtual I/O Internals - Open-IOV虚拟 I/O 内部 - Open-IOV

https://open-iov.org/index.php/Virtual_I/O_Internals

Virtual I/O Internals - Open-IOV虚拟 I/O 内部 - Open-IOV

https://open-iov.org/index.php/Virtual_I/O_Internals

Virtual I/O Internals - Open-IOV虚拟 I/O 内部结构 - Open-IOV

https://open-iov.org/index.php/Virtual_I/O_Internals

GPU Driver Internals - Open-IOVGPU 驱动内部结构 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

Merged Drivers - Open-IOV合并驱动 - Open-IOV

https://open-iov.org/index.php/Merged_Drivers

Merged Drivers - Open-IOV合并驱动 - Open-IOV

https://open-iov.org/index.php/Merged_Drivers

Merged Drivers - Open-IOV合并驱动程序 - Open-IOV

https://open-iov.org/index.php/Merged_Drivers

GPU Support - Open-IOVGPU 支持 - Open-IOV

https://open-iov.org/index.php/GPU_Support

GPU Support - Open-IOVGPU 支持 - Open-IOV

https://open-iov.org/index.php/GPU_Support

GPU Driver Internals - Open-IOVGPU 驱动程序内部结构 - Open-IOV

https://open-iov.org/index.php/GPU_Driver_Internals

GPU Support - Open-IOVGPU 支持 - Open-IOV

https://open-iov.org/index.php/GPU_Support

GPU Support - Open-IOVGPU 支持 - Open-IOV

https://open-iov.org/index.php/GPU_Support

GPU Support - Open-IOVGPU 支持 - Open-IOV

https://open-iov.org/index.php/GPU_Support

I/O Virtualization - SR-IOV versus NPAR : r/networking - RedditI/O 虚拟化 - SR-IOV 与 NPAR :r/networking - Reddit

https://www.reddit.com/r/networking/comments/10hc9a1/io_virtualization_sriov_versus_npar/

Virtual I/O Internals - Open-IOV虚拟 I/O 内部结构 - Open-IOV

https://open-iov.org/index.php/Virtual_I/O_Internals

The Pragmatic Neckbeard 3: VFIO, IOMMU and PCIe - Linux务实的极客 3:VFIO、IOMMU 与 PCIe - Linux

https://forum.level1techs.com/t/the-pragmatic-neckbeard-3-vfio-iommu-and-pcie/111251

Merged Drivers - Open-IOV合并驱动程序 - Open-IOV

https://open-iov.org/index.php/Merged_Drivers

Merged Drivers - Open-IOV合并驱动程序 - Open-IOV

https://open-iov.org/index.php/Merged_Drivers