Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance#

The prefill and decode phases of an LLM request have different computational characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows for better hardware allocation, improved scalability, and higher overall performance. For example, using a larger tensor-parallel (TP) size for the memory-bound decode phase and a smaller TP size for the compute-bound prefill phase allows both phases to run efficiently. In addition, for requests with long contexts, moving the prefill phase to dedicated prefill engines lets ongoing decode requests proceed without being blocked by these long prefills.

Disaggregated execution of a request has three main steps:

1. The prefill engine computes the prefill phase and generates the KV cache.
2. The prefill engine transfers the KV cache to the decode engine.
3. The decode engine computes the decode phase.

However, not every request's prefill phase needs to be computed in a remote prefill engine. If the prefill is short or the decode engine has a high prefix cache hit rate, it is often more efficient to prefill locally in the decode engine. The disaggregation design in Dynamo accounts for all of these scenarios and provides a flexible framework that delivers strong performance across a range of conditions.

Design#

There are four main components in Dynamo disaggregation:

Worker: executes prefill and decode requests.
Prefill worker: executes prefill requests only.
Disaggregated router: decides whether to prefill locally or remotely.
Prefill queue: caches and load-balances the remote prefill requests.

When a worker receives a request, it first uses the disaggregated router to decide whether the prefill should be done locally or remotely, and allocates the KV blocks. If the prefill is remote, the worker pushes a remote prefill request to the prefill queue. A prefill worker then pulls the request from the prefill queue, reads the KV blocks with prefix cache hits from the worker, computes the prefill, and writes the computed KV blocks back to the worker. Finally, the worker completes the remaining decoding.
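
The flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class and function names (`Worker`, `RemotePrefillRequest`, `prefill_worker_step`) are hypothetical, an in-memory `deque` stands in for the prefill queue, and a dictionary write stands in for the NIXL transfer. It is not Dynamo's actual implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RemotePrefillRequest:
    """Hypothetical message pushed to the prefill queue."""
    request_id: str
    worker_id: str
    block_ids: list  # KV blocks already allocated on the (decode) worker


@dataclass
class Worker:
    """Hypothetical decode/aggregated worker that owns a KV block pool."""
    worker_id: str
    kv_blocks: dict = field(default_factory=dict)  # block id -> contents
    next_block: int = 0

    def allocate(self, n):
        ids = list(range(self.next_block, self.next_block + n))
        self.next_block += n
        return ids

    def handle(self, request_id, num_tokens, queue, prefill_remotely, block_size=4):
        # KV blocks are allocated before the local/remote branch.
        num_blocks = -(-num_tokens // block_size)  # ceiling division
        block_ids = self.allocate(num_blocks)
        if prefill_remotely:
            queue.append(RemotePrefillRequest(request_id, self.worker_id, block_ids))
            return "queued"
        for block_id in block_ids:
            self.kv_blocks[block_id] = f"local:{request_id}"
        return "prefilled-locally"


def prefill_worker_step(queue, workers):
    """Pull one request, 'compute' the prefill, and write KV blocks back."""
    request = queue.popleft()
    target = workers[request.worker_id]
    for block_id in request.block_ids:
        # Stand-in for a NIXL write into the worker's KV blocks.
        target.kv_blocks[block_id] = f"remote:{request.request_id}"
    return request.request_id
```

After `prefill_worker_step` returns, the worker's KV blocks are populated and it can continue with decoding, which is the last step of the flow.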

Conditional Disaggregation#

Not every request's prefill phase needs to be computed in a remote prefill engine. The disaggregated router decides at runtime whether the prefill phase of a request should be computed locally or remotely, based on the prefill length and the prefill queue status. Specifically, a request is sent to a remote prefill engine if both of the following conditions are met:

1. The absolute prefill length, excluding the prefix cache hit, is greater than a preset threshold. On the one hand, if the prefill is short, it can be computed efficiently in the decode engine by piggybacking chunked prefill requests onto ongoing decode requests. On the other hand, if the prefix cache hit is long, the prefill becomes memory-bound and can therefore be computed more efficiently in the decode engine.
2. The number of remote prefill requests in the prefill queue is less than a preset threshold. A large number of queued prefill requests indicates that the prefill workers are lagging behind, so it is better to prefill locally until more prefill workers join.

Conditional disaggregation allows Dynamo to achieve high performance for dynamic workloads.
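
The two conditions can be expressed as a small predicate. The sketch below is illustrative only: the names `DisaggRouterConfig`, `max_local_prefill_length`, and `max_prefill_queue_size`, as well as the default values, are assumptions made for this example and are not taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggRouterConfig:
    """Hypothetical thresholds; the values are placeholders, not recommendations."""
    max_local_prefill_length: int = 1000  # tokens
    max_prefill_queue_size: int = 2       # queued remote prefill requests


def should_prefill_remotely(prompt_length, prefix_hit_length, queue_size, config):
    """Return True only when both conditions for remote prefill hold."""
    # Condition 1: prefill length excluding the prefix cache hit exceeds the threshold.
    uncached_length = prompt_length - prefix_hit_length
    long_enough = uncached_length > config.max_local_prefill_length
    # Condition 2: the prefill queue is not backed up.
    queue_has_room = queue_size < config.max_prefill_queue_size
    return long_enough and queue_has_room
```

For example, with these placeholder thresholds, a 3000-token prompt with a 2500-token prefix cache hit is prefilled locally, because only 500 tokens remain to be computed.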

Prefill Queue#

Prefill requests are compute-bound (except for very short prefills) and should be executed in dedicated iterations, without any other requests, to ensure a fast time to first token (TTFT). To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue: workers push remote prefill requests to it, and prefill workers pull and complete the requests one by one. The global prefill queue is implemented on top of a NATS stream to ensure high performance and availability.
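
The pull-based load balancing can be illustrated with Python's standard library. In this sketch a thread-safe `queue.Queue` is a stand-in for the NATS-backed global prefill queue, and `run_prefill_workers` is a hypothetical helper written for this example. Each prefill worker pulls one request at a time, so a busy worker never accumulates a private backlog.

```python
import queue
import threading


def run_prefill_workers(requests, num_workers):
    """Drain a shared queue with several pull-based prefill workers.

    Returns a dict mapping worker index -> list of requests it completed.
    """
    shared = queue.Queue()  # stand-in for the global prefill queue
    for request in requests:
        shared.put(request)

    completed = {i: [] for i in range(num_workers)}

    def prefill_worker(index):
        while True:
            try:
                request = shared.get_nowait()  # pull exactly one request
            except queue.Empty:
                return  # nothing left to do
            # Stand-in for running the prefill in a dedicated iteration.
            completed[index].append(request)
            shared.task_done()

    threads = [
        threading.Thread(target=prefill_worker, args=(i,))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return completed
```

Which worker handles which request depends on thread scheduling, but every request is completed exactly once.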

Efficient KV Transfer#

The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer the KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. In addition, the KV transfer is non-blocking, so the GPU forward pass can keep serving other requests while the transfer is in progress.

After the KV blocks are allocated, the worker scheduler sends the remote prefill request, which contains the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via the prefill queue. Thanks to NIXL's RDMA read and write operations, the prefill worker can then read from and write to the remote KV blocks without explicit handling in the remote worker engine. Once the remote prefill is done, the worker scheduler simply adds the decode request to the worker's in-flight requests. This allows workers to execute forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish.

To reduce the size of memory descriptors, Dynamo applies two optimizations:

1. After a worker finishes initialization and allocates its entire KV cache pool, it stores the memory descriptors of all blocks (also referred to as the NIXL metadata) in etcd, a distributed key-value store. A prefill worker loads and caches a worker's memory descriptors the first time it serves a remote prefill request issued by that worker. Thus, a remote prefill request only needs to carry KV block IDs instead of full memory descriptors.
2. Dynamo encourages the memory allocator in the prefill engine to allocate contiguous blocks and merges contiguous blocks into larger blocks to reduce the total number of KV blocks.
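
To make the second optimization concrete, the sketch below merges runs of consecutive block IDs into `(start, count)` pairs, so that one descriptor can cover a whole run. The function name and the `(start, count)` representation are assumptions for illustration; the text above does not specify how Dynamo represents merged blocks.

```python
def merge_contiguous_blocks(block_ids):
    """Collapse consecutive block IDs into (start, count) runs.

    The input order is preserved; only IDs that are adjacent in the list
    and numerically consecutive are merged.
    """
    runs = []
    for block_id in block_ids:
        if runs and runs[-1][0] + runs[-1][1] == block_id:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)  # extend the current run
        else:
            runs.append((block_id, 1))     # start a new run
    return runs
```

For instance, the six blocks `[4, 5, 6, 10, 11, 20]` collapse into three runs, `[(4, 3), (10, 2), (20, 1)]`. The more contiguous the allocation, the fewer runs need to be described.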

When decode and prefill use different KV layouts (e.g., because of different TP sizes), Dynamo applies a high-performance kernel that transposes the KV blocks into the matching layout on the KV receiver side, after the NIXL reads and before the NIXL writes.
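
As a rough illustration of why a layout conversion is needed, the sketch below computes where each KV head lives under two different TP sizes. It assumes that KV heads are sharded evenly and in order across TP ranks, which the text above does not state, and the function `plan_head_copies` is hypothetical. It only shows the index remapping; it is not the kernel Dynamo uses.

```python
def plan_head_copies(num_kv_heads, src_tp, dst_tp):
    """Map each KV head from its source shard to its destination shard.

    Returns a list of ((src_rank, src_local_head), (dst_rank, dst_local_head))
    pairs, one per global KV head. Assumes even, in-order head sharding.
    """
    if num_kv_heads % src_tp or num_kv_heads % dst_tp:
        raise ValueError("num_kv_heads must be divisible by both TP sizes")
    src_heads_per_rank = num_kv_heads // src_tp
    dst_heads_per_rank = num_kv_heads // dst_tp
    plan = []
    for head in range(num_kv_heads):
        src = (head // src_heads_per_rank, head % src_heads_per_rank)
        dst = (head // dst_heads_per_rank, head % dst_heads_per_rank)
        plan.append((src, dst))
    return plan
```

With 4 KV heads, a prefill engine at TP=2, and a decode engine at TP=4, each prefill rank holds two heads that must end up on two different decode ranks, so a block cannot simply be copied byte for byte.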

Runtime-Reconfigurable xPyD#

The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows runtime-reconfigurable xPyD. Workers and prefill workers can be added and removed at runtime without any system-level synchronization or overhead. New and existing prefill workers alike simply pull remote prefill requests from the NATS prefill queue. The NIXL metadata of new workers (or of existing workers, in the case of a new prefill worker) is lazily loaded and cached when needed. Specifically, adding and removing workers and prefill workers is as easy as:

Add worker: add NIXL metadata in etcd.
Remove worker: flush the engine and delete NIXL metadata in etcd.
Add prefill worker: no explicit action needed.
Remove prefill worker: flush the engine.
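
The lazy loading that makes these operations cheap can be sketched as follows. An in-memory `MetadataStore` stands in for etcd, and the key format `nixl_metadata/<worker_id>`, the class names, and the descriptor values are all hypothetical. The point is that a prefill worker contacts the store at most once per worker and afterwards resolves block IDs from its local cache.

```python
class MetadataStore:
    """In-memory stand-in for etcd that counts reads."""

    def __init__(self):
        self._data = {}
        self.reads = 0

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        self.reads += 1
        return self._data[key]

    def delete(self, key):
        self._data.pop(key, None)


class PrefillWorker:
    """Hypothetical prefill worker with a lazy descriptor cache."""

    def __init__(self, store):
        self._store = store
        self._cache = {}  # worker id -> {block id: memory descriptor}

    def resolve(self, worker_id, block_ids):
        """Turn block IDs from a request into memory descriptors."""
        if worker_id not in self._cache:
            # First request from this worker: load its metadata once.
            self._cache[worker_id] = self._store.get(f"nixl_metadata/{worker_id}")
        table = self._cache[worker_id]
        return [table[block_id] for block_id in block_ids]


# Add worker: publish its metadata. Add prefill worker: nothing to publish.
store = MetadataStore()
store.put("nixl_metadata/worker-0", {0: "desc-0", 1: "desc-1", 2: "desc-2"})
prefill_worker = PrefillWorker(store)
first = prefill_worker.resolve("worker-0", [2, 0])
second = prefill_worker.resolve("worker-0", [1])
```

Here the second call is served entirely from the cache, so `store.reads` stays at 1.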

How this works under the hood#

Auto-Discovery for new workers#

In Dynamo, we use etcd (a distributed key-value store) to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to etcd, allowing the router to discover it and route requests to it. For the KV cache transfer process, newly added decode workers put the memory descriptors of their KV cache (used in NIXL transfers) in etcd. Newly added prefill workers also register with etcd for discovery and simply start pulling prefill requests from the global prefill queue once they spin up. Prefill workers lazily pull the descriptors when they serve a remote prefill request from a given worker for the first time.

You can watch this happen live by running the following:

# in terminal 1 - run the disaggregated serving example
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml

# in terminal 2 - watch the namespace in etcd
watch -cd etcdctl get --prefix <namespace>

You should see something like this show up as the disaggregated serving example starts up:

# worker information
dynamo/components/PrefillWorker/mock:694d967da694ea1e
{
  "component": "PrefillWorker",
  "endpoint": "mock",
  "namespace": "dynamo",
  "lease_id": 7587886413599009310,
  "transport": {
    "nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
  }
}
dynamo/components/Processor/chat/completions:694d967da694ea16
{
  "component": "Processor",
  "endpoint": "chat/completions",
  "namespace": "dynamo",
  "lease_id": 7587886413599009302,
  "transport": {
    "nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
  }
}
dynamo/components/VllmWorker/generate:694d967da694ea1a
{
  "component": "VllmWorker",
  "endpoint": "generate",
  "namespace": "dynamo",
  "lease_id": 7587886413599009306,
  "transport": {
    "nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
  }
}
dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
{
  "component": "VllmWorker",
  "endpoint": "load_metrics",
  "namespace": "dynamo",
  "lease_id": 7587886413599009306,
  "transport": {
    "nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
  }
}

# nixl metadata
dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2

Graceful worker shutdown#

Since worker information is stored in etcd, we can shut down workers by simply revoking their etcd leases. After a lease is revoked:

Decode/aggregated worker endpoints are immediately removed from etcd so that they stop accepting new requests. These workers finish any in-flight requests, shut down their engine, and exit gracefully.
Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote KV cache writes finish.

You can also observe this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script, which does this: revoke_lease.py.
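
The drain-then-exit behavior described in this section can be modeled as a small state machine. The sketch below is an assumption-laden illustration: `DrainingWorker` and its method names are hypothetical, and `on_lease_revoked` is called directly here, whereas in a real deployment the trigger would be the etcd lease revocation.

```python
class DrainingWorker:
    """Hypothetical worker that drains in-flight work before exiting."""

    def __init__(self):
        self.accepting = True
        self.in_flight = set()
        self.exited = False

    def submit(self, request_id):
        """Accept a request unless the worker is draining."""
        if not self.accepting:
            return False
        self.in_flight.add(request_id)
        return True

    def on_lease_revoked(self):
        """Stop taking new work; exit right away if nothing is pending."""
        self.accepting = False
        self._maybe_exit()

    def complete(self, request_id):
        """Mark a request as finished and exit if the drain is complete."""
        self.in_flight.discard(request_id)
        self._maybe_exit()

    def _maybe_exit(self):
        if not self.accepting and not self.in_flight:
            self.exited = True  # stand-in for shutting down the engine
```

The same shape applies to a prefill worker if `in_flight` is read as the set of pending remote KV cache writes and `submit` as pulling from the prefill queue.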
