Late last week, two great blog posts were released with seemingly opposite titles: “Don’t Build Multi-Agents” by the Cognition team, and “How we built our multi-agent research system” by the Anthropic team.
Despite their opposing titles, I would argue they actually have a lot in common and contain some insights as to how and when to build multi-agent systems:
- Context engineering is crucial
- Multi-agent systems that primarily “read” are easier than those that “write”
Context engineering is critical
One of the hardest parts of building multi-agent (or even single-agent) applications is effectively communicating to the models the context of what they’re being asked to do. The Cognition blog post introduces the term “context engineering” to describe this challenge.
In 2025, the models out there are extremely intelligent. But even the smartest human won’t be able to do their job effectively without the context of what they’re being asked to do. “Prompt engineering” was coined as a term for the effort needed to phrase your task in the ideal format for an LLM chatbot. “Context engineering” is the next level of this. It is about doing this automatically in a dynamic system. It takes more nuance and is effectively the #1 job of engineers building AI agents.
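To make “doing this automatically in a dynamic system” concrete, here is a minimal sketch of dynamic context assembly, where each step’s prompt is built from the current system state rather than written once by hand. The helpers `summarize_history` and `fetch_relevant_docs` are hypothetical stand-ins for whatever compression and retrieval your system actually uses.

```python
def summarize_history(text: str) -> str:
    # Hypothetical stand-in: in practice this would be an LLM summarization call.
    return text[:300] + " ...[summarized]"

def fetch_relevant_docs(task: str) -> str:
    # Hypothetical stand-in: in practice this would be retrieval or search.
    return f"(documents relevant to: {task})"

def build_context(task: str, history: list[str], budget: int = 4000) -> str:
    """Assemble this step's prompt from system state: keep recent turns
    verbatim, compress older ones, and pull in task-relevant documents."""
    recent = "\n".join(history[-5:])
    older = "\n".join(history[:-5])
    if len(older) > budget:
        older = summarize_history(older)
    return (
        f"## Task\n{task}\n\n"
        f"## Relevant documents\n{fetch_relevant_docs(task)}\n\n"
        f"## Earlier conversation\n{older}\n\n"
        f"## Recent conversation\n{recent}"
    )
```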
The Cognition post shows, through a few toy examples, that using multi-agent systems makes it harder to ensure that each sub-agent has the appropriate context.
The Anthropic blog post doesn’t explicitly use the term context engineering, but at multiple points it addresses the same issue. It’s clear that the Anthropic team spent a significant amount of time on context engineering. Some highlights below:
Long-horizon conversation management. Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms. We implemented patterns where agents summarize completed work phases and store essential information in external memory before proceeding to new tasks. When context limits approach, agents can spawn fresh subagents with clean contexts while maintaining continuity through careful handoffs. Further, they can retrieve stored context like the research plan from their memory rather than losing previous work when reaching the context limit. This distributed approach prevents context overflow while preserving conversation coherence across extended interactions.
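The pattern described in that excerpt can be sketched in a few lines. Here a plain dict stands in for external memory and character counts stand in for token accounting; none of the names below come from Anthropic’s actual system.

```python
MEMORY: dict[str, str] = {}   # external store that outlives any one context window
CONTEXT_LIMIT = 8_000         # stand-in for a real token budget

def summarize(messages: list[str]) -> str:
    # Stand-in for an LLM summarization of the completed work phase.
    return f"{len(messages)} steps completed; key findings retained."

def maybe_compress(messages: list[str], plan: str) -> list[str]:
    """If the context nears its limit, persist a summary of the work so far
    and hand off to a fresh context seeded from external memory."""
    if sum(len(m) for m in messages) < CONTEXT_LIMIT:
        return messages
    MEMORY["research_plan"] = plan
    MEMORY["phase_summary"] = summarize(messages)
    # The fresh context carries only the plan and summary, not the transcript.
    return [
        f"Research plan (from memory): {MEMORY['research_plan']}",
        f"Completed so far: {MEMORY['phase_summary']}",
    ]
```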
In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries. Without detailed task descriptions, agents duplicate work, leave gaps, or fail to find necessary information.
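One lightweight way to enforce that checklist is to have the lead agent fill in a structured task description for every subagent instead of a free-form string. A sketch, with field names of our own invention:

```python
from dataclasses import dataclass, field

@dataclass
class SubagentTask:
    objective: str                                     # what to find out, specifically
    output_format: str                                 # e.g. "bullets with source URLs"
    tools: list[str] = field(default_factory=list)     # guidance on which tools to use
    sources: list[str] = field(default_factory=list)   # guidance on where to look
    boundaries: str = ""                               # what is explicitly out of scope

task = SubagentTask(
    objective="Summarize the 2021 automotive chip crisis",
    output_format="5-10 bullets, each with a source URL",
    tools=["web_search"],
    sources=["news articles", "industry reports"],
    boundaries="Do not cover 2025 supply chains; another subagent owns that.",
)
```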
Context engineering is critical to making agentic systems work reliably. This insight has guided our development of LangGraph, our agent and multi-agent framework. When using a framework, you need full control over what gets passed into the LLM, and full control over what steps are run and in what order (in order to generate the context that gets passed into the LLM). We prioritize this with LangGraph, which is a low-level orchestration framework with no hidden prompts and no enforced “cognitive architectures”. This gives you full control to do the appropriate context engineering that you require.
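As a small illustration of that control, here is a minimal LangGraph graph in which every step is an explicit function you write, so you decide exactly what reaches the model’s context; the node bodies are placeholders rather than real retrieval or LLM calls.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    context: str
    answer: str

def gather_context(state: State) -> dict:
    # You own retrieval and compression here; nothing is hidden by the framework.
    return {"context": f"(retrieved material for: {state['question']})"}

def answer(state: State) -> dict:
    # You own the exact prompt; swap in a real LLM call in practice.
    prompt = f"Context:\n{state['context']}\n\nQuestion: {state['question']}"
    return {"answer": f"(model response to a {len(prompt)}-char prompt)"}

builder = StateGraph(State)
builder.add_node("gather_context", gather_context)
builder.add_node("answer", answer)
builder.add_edge(START, "gather_context")
builder.add_edge("gather_context", "answer")
builder.add_edge("answer", END)
graph = builder.compile()

result = graph.invoke({"question": "Why does context engineering matter?",
                       "context": "", "answer": ""})
```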
Multi-agent systems that primarily “read” are easier than those that “write”
Multi-agent systems designed primarily for "reading" tasks tend to be more manageable than those focused on "writing" tasks. This distinction becomes clear when comparing the two blog posts: Cognition's coding-focused system and Anthropic's research-oriented approach.
Both coding and research involve reading and writing, but they emphasize different aspects. The key insight is that read actions are inherently more parallelizable than write actions. When you attempt to parallelize writing, you face the dual challenge of effectively communicating context between agents and then merging their outputs coherently. As the Cognition blog post notes: "Actions carry implicit decisions, and conflicting decisions carry bad results." While this applies to both reading and writing, conflicting write actions typically produce far worse outcomes than conflicting read actions. When multiple agents write code or content simultaneously, their conflicting decisions can create incompatible outputs that are difficult to reconcile.
Anthropic's Claude Research illustrates this principle well. While the system involves both reading and writing, the multi-agent architecture primarily handles the research (reading) component. The actual writing—synthesizing findings into a coherent report—is deliberately handled by a single main agent in one unified call. This design choice recognizes that collaborative writing introduces unnecessary complexity.
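The shape of that architecture, parallel read-only subagents feeding a single writer, fits in a few lines; the `research` and `run` functions below are stand-ins for real LLM and tool calls.

```python
import asyncio

async def research(direction: str) -> str:
    # Read-only work parallelizes cleanly: there are no outputs to reconcile.
    return f"Findings on {direction}"

async def run(query: str, directions: list[str]) -> str:
    findings = await asyncio.gather(*(research(d) for d in directions))
    # Writing is deliberately NOT parallelized: one agent, one coherent report.
    return f"Report on {query!r}, synthesized from: " + "; ".join(findings)

report = asyncio.run(run("semiconductor shortage",
                         ["2021 chip crisis", "2025 supply chains"]))
```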
However, even read-heavy multi-agent systems aren't trivial to implement. They still require sophisticated context engineering. Anthropic discovered this firsthand:
We started by allowing the lead agent to give simple, short instructions like 'research the semiconductor shortage,' but found these instructions often were vague enough that subagents misinterpreted the task or performed the exact same searches as other agents. For instance, one subagent explored the 2021 automotive chip crisis while 2 others duplicated work investigating current 2025 supply chains, without an effective division of labor.
Production reliability and engineering challenges
Whether you're using a multi-agent system or just a complex single-agent one, several reliability and engineering challenges emerge. Anthropic's blog post does a great job of highlighting these. These challenges are not unique to Anthropic's use case, but are actually pretty generic. A lot of the tooling we've been building has been aimed at generically solving problems like these.
Durable execution and error handling
Agents are stateful and errors compound. Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents. When errors occur, we can't just restart from the beginning: restarts are expensive and frustrating for users. Instead, we built systems that can resume from where the agent was when the errors occurred.
This durable execution is a key part of LangGraph, our agent orchestration framework. We believe all long-running agents will need this, and accordingly it should be built into the agent orchestration framework.
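As a sketch of what this looks like in LangGraph, compiling a graph with a checkpointer saves state after every step, keyed by a thread ID, so a failed run can be resumed instead of restarted. `MemorySaver` is the in-memory variant; a production system would use a persistent backend.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    step: int

def work(state: State) -> dict:
    # Stand-in for a long-running agent step (tool calls, LLM calls, ...).
    return {"step": state["step"] + 1}

builder = StateGraph(State)
builder.add_node("work", work)
builder.add_edge(START, "work")
builder.add_edge("work", END)

graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "research-run-1"}}
graph.invoke({"step": 0}, config)
# After an interruption, invoking with input None and the same thread_id
# resumes from the last saved checkpoint rather than from the beginning.
```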
Agent Debugging and Observability
Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder. For instance, users would report agents “not finding obvious information,” but we couldn't see why. Were the agents using bad search queries? Choosing poor sources? Hitting tool failures? Adding full production tracing let us diagnose why agents failed and fix issues systematically.
We have long recognized that observability for LLM systems is different than traditional software observability. A key reason is that it needs to be optimized for debugging exactly these kinds of challenges. If you’re not sure what exactly this means, check out LangSmith, our platform for (among other things) agent debugging and observability. We’ve been building LangSmith for the past two years to handle them. Try it out and see why this is so critical!
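The smallest version of this instrumentation is a decorator. With the LANGSMITH_TRACING and LANGSMITH_API_KEY environment variables set, LangSmith’s `traceable` records each call’s inputs and outputs, which is exactly what you need to answer questions like “was the search query bad?”; the search function here is a stand-in for a real tool call.

```python
from langsmith import traceable

@traceable(name="search_step")
def search(query: str) -> list[str]:
    # The trace captures `query` and the returned results, so when an agent
    # "can't find obvious information" you can see what it actually searched for.
    return [f"result for {query}"]  # stand-in for a real search tool
```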
Evaluation of agents
A whole section in the Anthropic post is dedicated to “effective evaluation of agents”. A few key takeaways that we like:
- Start small with evals, even ~20 datapoints is enough
- LLM-as-a-judge can automate scoring of experiments
- Human testing remains essential
This resonates strongly with our approach to evaluation. We’ve been building evals into LangSmith for a while, and have landed on several features to help with those aspects; a short sketch of how they fit together follows the list below:
- Datasets, to curate datapoints easily
- Running LLM-as-a-judge server side (more features coming here soon!)
- Annotation queues to coordinate and facilitate human evaluations
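As a rough sketch of how those pieces might fit together with the LangSmith SDK (evaluator signatures vary across SDK versions, and the judge below is a trivial stand-in for a real LLM-as-a-judge call):

```python
from langsmith import Client, evaluate

client = Client()
dataset = client.create_dataset(dataset_name="agent-evals-v0")
client.create_examples(
    inputs=[{"question": "What caused the 2021 chip shortage?"}],
    outputs=[{"reference": "Pandemic demand shocks plus fab capacity limits."}],
    dataset_id=dataset.id,
)

def correctness(run, example) -> dict:
    # Stand-in LLM-as-a-judge: in practice, prompt a model to grade the
    # agent's answer against the reference and return a score.
    answer = (run.outputs or {}).get("answer", "")
    return {"key": "correctness", "score": float("pandemic" in answer.lower())}

def target(inputs: dict) -> dict:
    # Stand-in for your agent.
    return {"answer": "Pandemic demand shocks strained fab capacity."}

evaluate(target, data="agent-evals-v0", evaluators=[correctness])
```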
Conclusion
Anthropic’s blog post also contains some wisdom for where multi-agent systems may or may not work best:
Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously.
Multi-agent systems work mainly because they help spend enough tokens to solve the problem…. Multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents.
For economic viability, multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance.
Further, some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.
As is quickly becoming apparent when building agents, there is not a “one-size-fits-all” solution. Instead, you will want to explore several options and choose the best one according to the problem you are solving.
Any agent framework you choose should allow you to slide anywhere on this spectrum, something we’ve uniquely emphasized with LangGraph.
Figuring out how to get multi-agent (or complex single-agent) systems to function also requires new tooling. Durable execution, debugging, observability, and evaluation are all new tools that will make your life as an application developer easier. Luckily, this tooling is all generic. This means that you can use tools like LangGraph and LangSmith to get these off-the-shelf, allowing you to focus more on the business logic of your application than on generic infrastructure.