
Automatic Prefix Caching

The core idea of PagedAttention is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.

                    Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|

In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

hash(prefix tokens + block tokens) <--> KV Block
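
To make this mapping concrete, the following is a minimal Python sketch of how such block hashes could be computed: the token sequence is split into fixed-size blocks, and each block’s hash chains the previous block’s hash (which already covers the entire prefix) with the tokens inside the block. The block size, the use of Python’s built-in hash, and the function name are illustrative assumptions rather than vLLM’s exact implementation.

# A minimal sketch of prefix-aware block hashing; the block size and the
# hashing scheme are illustrative, not vLLM's exact implementation.
from typing import Optional

BLOCK_SIZE = 4  # tokens per KV block; kept small for illustration

def compute_block_hashes(token_ids: list[int]) -> list[int]:
    """Return one hash per full block of token_ids.

    Each block's hash chains the previous block's hash (which already encodes
    the whole prefix) with the tokens inside the block, so it realizes
    hash(prefix tokens + block tokens).
    """
    hashes: list[int] = []
    prev_hash: Optional[int] = None
    for i in range(len(token_ids) // BLOCK_SIZE):
        block_tokens = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        prev_hash = hash((prev_hash, block_tokens))
        hashes.append(prev_hash)
    return hashes

# Two prompts that share their first block map to the same first-block hash.
a = compute_block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = compute_block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
assert a[0] == b[0] and a[1] != b[1]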

With this mapping, we can add another level of indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from its logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash values and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.

This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed on their own, which enables us to manage the KV cache in the same way as ordinary caches in operating systems.
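
Under the same assumptions, the sketch below shows how a global hash table could map block hashes to shared physical blocks: a request looks up the hash of the block it needs, reuses an existing physical block if one is found (incrementing its reference count), and allocates a new one otherwise; releasing a block only drops its reference count, leaving it cached until eviction. The class and field names (PhysicalBlock, GlobalBlockTable, etc.) are hypothetical, not vLLM’s actual data structures.

# A hypothetical global hash table of physical KV blocks; class and method
# names are illustrative, not vLLM's actual implementation.
import itertools
import time
from dataclasses import dataclass, field

@dataclass
class PhysicalBlock:
    block_id: int               # index into the physical KV cache
    block_hash: int             # hash(prefix tokens + block tokens)
    num_prefix_blocks: int      # number of blocks before it in its sequence
    ref_count: int = 0          # requests currently using this block
    last_accessed: float = field(default_factory=time.monotonic)

class GlobalBlockTable:
    def __init__(self) -> None:
        # block hash -> physical block; shared prefixes collapse to one entry
        self._blocks: dict[int, PhysicalBlock] = {}
        self._next_id = itertools.count()

    def get_or_allocate(self, block_hash: int, num_prefix_blocks: int) -> PhysicalBlock:
        """Reuse the cached block for this hash, or allocate a fresh one."""
        block = self._blocks.get(block_hash)
        if block is None:
            block = PhysicalBlock(next(self._next_id), block_hash, num_prefix_blocks)
            self._blocks[block_hash] = block
        block.ref_count += 1
        block.last_accessed = time.monotonic()
        return block

    def release(self, block: PhysicalBlock) -> None:
        """Drop one reference; the block stays cached until it is evicted."""
        block.ref_count -= 1
        block.last_accessed = time.monotonic()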

Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.

Managing the KV cache with a hash table allows us to implement flexible caching policies. As an example, in the current vLLM, we implement the following eviction policy (a Python sketch follows the list):

  • When there are no free blocks left, we evict a KV block whose reference count (i.e., the number of current requests using the block) equals 0.
  • If there are multiple blocks with a reference count of 0, we prioritize evicting the least recently used (LRU) block.
  • If there are multiple blocks whose last access times are the same, we prioritize evicting the block at the end of the longest prefix (i.e., the one with the maximum number of blocks before it).
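
A minimal sketch of this eviction order, assuming each candidate block exposes a reference count, a last-access timestamp, and the number of blocks preceding it in its prefix (as in the hypothetical PhysicalBlock above):

def select_eviction_candidate(blocks):
    """Pick the block to evict when no free blocks are left.

    1. Only blocks with ref_count == 0 are evictable.
    2. Among those, evict the least recently used block first.
    3. Break ties by evicting the block at the end of the longest prefix
       (the largest num_prefix_blocks).
    """
    evictable = [b for b in blocks if b.ref_count == 0]
    if not evictable:
        return None  # every block is still in use
    return min(evictable, key=lambda b: (b.last_accessed, -b.num_prefix_blocks))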

Note that, when applied to models with full attention, this eviction policy effectively implements the same policy as RadixAttention, which prioritizes evicting leaf nodes in the prefix tree that have a reference count of zero and are least recently used.

However, hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and to implement more sophisticated eviction policies than the one above:

  • Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block also include the LoRA ID the request is querying, enabling caching across all adapters (see the sketch after this list). In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
  • Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, we can apply perceptual hashing to images so that similar input images share cache entries.
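
As an illustration of the multi-LoRA case, the chained block hash from the earlier sketch can be extended with request-level keys such as a LoRA adapter ID (and, analogously, hashes of multimodal inputs), so that blocks are shared only between requests with matching keys. The parameter names below are hypothetical and not vLLM’s actual API.

# Hypothetical extension of the block hash with extra keys such as a LoRA
# adapter ID or per-image hashes; parameter names are illustrative.
from typing import Any, Optional

BLOCK_SIZE = 4  # kept small for illustration

def compute_block_hashes(token_ids: list[int],
                         lora_id: Optional[int] = None,
                         mm_hashes: tuple[Any, ...] = ()) -> list[int]:
    """Chain block hashes, folding in adapter and multimodal identifiers."""
    hashes: list[int] = []
    prev_hash: Optional[int] = None
    # Request-level keys: blocks are shared only between requests that use
    # the same LoRA adapter and the same (e.g., perceptually hashed) images.
    extra_keys = (lora_id, mm_hashes)
    for i in range(len(token_ids) // BLOCK_SIZE):
        block_tokens = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        prev_hash = hash((prev_hash, extra_keys, block_tokens))
        hashes.append(prev_hash)
    return hashes

# The same prompt issued against different LoRA adapters no longer shares blocks.
base = compute_block_hashes([1, 2, 3, 4])
lora = compute_block_hashes([1, 2, 3, 4], lora_id=7)
assert base[0] != lora[0]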