Engineering at Anthropic

How we built our multi-agent research system

Claude now has Research capabilities that allow it to search across the web, Google Workspace, and any integrations to accomplish complex tasks.

The journey of this multi-agent system from prototype to production taught us critical lessons about system architecture, tool design, and prompt engineering. A multi-agent system consists of multiple agents (LLMs autonomously using tools in a loop) working together. Our Research feature involves an agent that plans a research process based on user queries, and then uses tools to create parallel agents that search for information simultaneously. Systems with multiple agents introduce new challenges in agent coordination, evaluation, and reliability.

This post breaks down the principles that worked for us—we hope you'll find them useful to apply when building your own multi-agent systems.

Benefits of a multi-agent system

Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent. When people conduct research, they tend to continuously update their approach based on discoveries, following leads that emerge during investigation.

This unpredictability makes AI agents particularly well-suited for research tasks. Research demands the flexibility to pivot or explore tangential connections as the investigation unfolds. The model must operate autonomously for many turns, making decisions about which directions to pursue based on intermediate findings. A linear, one-shot pipeline cannot handle these tasks.

The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.

Once intelligence reaches a threshold, multi-agent systems become a vital way to scale performance. For instance, although individual humans have become more intelligent in the last 100,000 years, human societies have become exponentially more capable in the information age because of our collective intelligence and ability to coordinate. Even generally-intelligent agents face limits when operating as individuals; groups of agents can accomplish far more.

Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.

Multi-agent systems work mainly because they help spend enough tokens to solve the problem. In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors. This finding validates our architecture that distributes work across agents with separate context windows to add more capacity for parallel reasoning. The latest Claude models act as large efficiency multipliers on token use, as upgrading to Claude Sonnet 4 is a larger performance gain than doubling the token budget on Claude Sonnet 3.7. Multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents.

There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats. For economic viability, multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance. Further, some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.

Architecture overview for Research

Our Research system uses a multi-agent architecture with an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel.
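
As a rough illustration of the pattern, here is a minimal orchestrator-worker sketch. The `llm` callable and the helper names are assumptions for illustration, not our production code:

```python
# A minimal orchestrator-worker sketch. The `llm` callable and helper
# functions are illustrative assumptions, not Anthropic's production code.
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Stand-in for a call to a language model."""
    raise NotImplementedError

def plan_subtasks(query: str) -> list[str]:
    # The lead agent decomposes the query into clearly bounded subtasks.
    plan = llm("Decompose this research query into independent subtasks, "
               "one per line, each with an objective and an output format:\n"
               + query)
    return [line for line in plan.splitlines() if line.strip()]

def run_subagent(subtask: str) -> str:
    # Each subagent explores in its own context window, then condenses
    # only the most important findings for the lead agent.
    return llm("Research this subtask with your search tools and return "
               "a brief summary of the key findings:\n" + subtask)

def research(query: str) -> str:
    subtasks = plan_subtasks(query)
    with ThreadPoolExecutor() as pool:        # subagents run in parallel
        findings = list(pool.map(run_subagent, subtasks))
    return llm("Synthesize these findings into a final answer:\n"
               + "\n\n".join(findings))
```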

The multi-agent architecture in action: user queries flow through a lead agent that creates specialized subagents to search for different aspects in parallel.

When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. As shown in the diagram above, the subagents act as intelligent filters by iteratively using search tools to gather information, in this case on AI agent companies in 2025, and then returning a list of companies to the lead agent so it can compile a final answer.

Traditional approaches using Retrieval Augmented Generation (RAG) use static retrieval. That is, they fetch some set of chunks that are most similar to an input query and use these chunks to generate a response. In contrast, our architecture uses a multi-step search that dynamically finds relevant information, adapts to new findings, and analyzes results to formulate high-quality answers.
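
The contrast is easiest to see side by side. In this sketch (the `embed`, `vector_store`, `search_web`, and `llm` helpers are all assumed), static RAG retrieves once, while the agentic loop keeps reformulating its query based on what it finds:

```python
# Static RAG vs. dynamic multi-step search (illustrative sketch; the
# embed/vector_store/search_web/llm helpers are assumed interfaces).

def static_rag(query, embed, vector_store, llm):
    # One-shot: fetch the chunks nearest the query, then answer.
    chunks = vector_store.nearest(embed(query), k=10)
    return llm(f"Answer using these chunks:\n{chunks}\n\nQuery: {query}")

def dynamic_search(query, search_web, llm, max_steps=10):
    # Multi-step: the agent adapts its next query to new findings.
    notes, next_query = [], query
    for _ in range(max_steps):
        results = search_web(next_query)
        notes.append(llm(f"Summarize what is relevant to {query!r}:\n{results}"))
        decision = llm(f"Given these notes:\n{notes}\n"
                       "Reply DONE if sufficient, else give a refined query.")
        if decision.strip() == "DONE":
            break
        next_query = decision
    return llm(f"Answer {query!r} from these notes:\n{notes}")
```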

Process diagram showing the complete workflow of our multi-agent Research system. When a user submits a query, the system creates a LeadResearcher agent that enters an iterative research process. The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan. It then creates specialized Subagents (two are shown here, but it can be any number) with specific research tasks. Each Subagent independently performs web searches, evaluates tool results using interleaved thinking, and returns findings to the LeadResearcher. The LeadResearcher synthesizes these results and decides whether more research is needed—if so, it can create additional subagents or refine its strategy. Once sufficient information is gathered, the system exits the research loop and passes all findings to a CitationAgent, which processes the documents and research report to identify specific locations for citations. This ensures all claims are properly attributed to their sources. The final research results, complete with citations, are then returned to the user.
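
One detail worth pulling out of the diagram is the plan-persistence step. A toy version of that external memory might look like the following; the `Memory` class here is an illustrative stand-in, not the production store:

```python
# Persisting the lead agent's plan outside the context window (sketch;
# this file-backed Memory store is an assumption, not the real system).
import json, pathlib

CONTEXT_LIMIT_TOKENS = 200_000   # the plan must survive truncation past this

class Memory:
    """Minimal external key-value store backed by a local file."""
    def __init__(self, path="memory.json"):
        self.path = pathlib.Path(path)

    def save(self, key: str, value: str) -> None:
        data = json.loads(self.path.read_text()) if self.path.exists() else {}
        data[key] = value
        self.path.write_text(json.dumps(data))

    def load(self, key: str) -> str | None:
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text()).get(key)

# The lead agent saves its plan up front, then re-reads it after any
# truncation instead of losing the original strategy.
memory = Memory()
memory.save("research_plan", "1) scope the question 2) spawn subagents ...")
plan = memory.load("research_plan")
```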

Prompt engineering and evaluations for research agents

Multi-agent systems have key differences from single-agent systems, including a rapid growth in coordination complexity. Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors. Below are some principles we learned for prompting agents:

  1. Think like your agents. To iterate on prompts, you must understand their effects. To help us do this, we built simulations using our Console with the exact prompts and tools from our system, then watched agents work step-by-step. This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent, which can make the most impactful changes obvious.
  2. Teach the orchestrator how to delegate. In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries. Without detailed task descriptions, agents duplicate work, leave gaps, or fail to find necessary information. We started by allowing the lead agent to give simple, short instructions like 'research the semiconductor shortage,' but found these instructions often were vague enough that subagents misinterpreted the task or performed the exact same searches as other agents. For instance, one subagent explored the 2021 automotive chip crisis while 2 others duplicated work investigating current 2025 supply chains, without an effective division of labor.
  3. Scale effort to query complexity. Agents struggle to judge appropriate effort for different tasks, so we embedded scaling rules in the prompts. Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities. These explicit guidelines help the lead agent allocate resources efficiently and prevent overinvestment in simple queries, which was a common failure mode in our early versions.
  4. Tool design and selection are critical. Agent-tool interfaces are as critical as human-computer interfaces. Using the right tool is efficient—often, it’s strictly necessary. For instance, an agent searching the web for context that only exists in Slack is doomed from the start. With MCP servers that give the model access to external tools, this problem compounds, as agents encounter unseen tools with descriptions of wildly varying quality. We gave our agents explicit heuristics: for example, examine all available tools first, match tool usage to user intent, search the web for broad external exploration, or prefer specialized tools over generic ones. Bad tool descriptions can send agents down completely wrong paths, so each tool needs a distinct purpose and a clear description.
  5. Let agents improve themselves. We found that the Claude 4 models can be excellent prompt engineers. When given a prompt and a failure mode, they are able to diagnose why the agent is failing and suggest improvements. We even created a tool-testing agent—when given a flawed MCP tool, it attempts to use the tool and then rewrites the tool description to avoid failures. By testing the tool dozens of times, this agent found key nuances and bugs. This process for improving tool ergonomics resulted in a 40% decrease in task completion time for future agents using the new description, because they were able to avoid most mistakes.
  6. Start wide, then narrow down. Search strategy should mirror expert human research: explore the landscape before drilling into specifics. Agents often default to overly long, specific queries that return few results. We counteracted this tendency by prompting agents to start with short, broad queries, evaluate what’s available, then progressively narrow focus.
  7. Guide the thinking process. Extended thinking mode, which leads Claude to output additional tokens in a visible thinking process, can serve as a controllable scratchpad. The lead agent uses thinking to plan its approach, assessing which tools fit the task, determining query complexity and subagent count, and defining each subagent’s role. Our testing showed that extended thinking improved instruction-following, reasoning, and efficiency. Subagents also plan, then use interleaved thinking after tool results to evaluate quality, identify gaps, and refine their next query. This makes subagents more effective in adapting to any task.
  8. Parallel tool calling transforms speed and performance. Complex research tasks naturally involve exploring many sources. Our early agents executed sequential searches, which was painfully slow. For speed, we introduced two kinds of parallelization: (1) the lead agent spins up 3-5 subagents in parallel rather than serially; (2) the subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries, allowing Research to do more work in minutes instead of hours while covering more information than other systems.
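
As a concrete illustration of the two layers of parallelism in point 8 above, here is a sketch using `asyncio`; the tool and helper stubs are assumptions, not our actual tool interfaces:

```python
# Two layers of parallelism, sketched with asyncio. The stubs below
# stand in for real tools and helpers; none of these are production names.
import asyncio

async def search_tool(query: str) -> str: ...   # stand-ins for real tools
async def fetch_page(url: str) -> str: ...
def plan_subtasks(query: str) -> list[str]: ...
def summarize(task: str, results: list[str]) -> str: ...
def synthesize(query: str, findings: list[str]) -> str: ...

async def run_subagent(task: str) -> str:
    # Layer 2: a single subagent fires several tool calls concurrently.
    results = await asyncio.gather(
        search_tool(task),
        search_tool(task + " recent developments"),
        fetch_page("https://example.com/placeholder"),
    )
    return summarize(task, list(results))

async def lead_agent(query: str) -> str:
    subtasks = plan_subtasks(query)             # e.g. 3-5 bounded subtasks
    # Layer 1: subagents are spun up in parallel rather than serially.
    findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    return synthesize(query, list(findings))
```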

Our prompting strategy focuses on instilling good heuristics rather than rigid rules. We studied how skilled humans approach research tasks and encoded these strategies in our prompts—strategies like decomposing difficult questions into smaller tasks, carefully evaluating the quality of sources, adjusting search approaches based on new information, and recognizing when to focus on depth (investigating one topic in detail) vs. breadth (exploring many topics in parallel). We also proactively mitigated unintended side effects by setting explicit guardrails to prevent the agents from spiraling out of control. Finally, we focused on a fast iteration loop with observability and test cases.

Effective evaluation of agents

Good evaluations are essential for building reliable AI applications, and agents are no different. However, evaluating multi-agent systems presents unique challenges. Traditional evaluations often assume that the AI follows the same steps each time: given input X, the system should follow path Y to produce output Z. But multi-agent systems don't work this way. Even with identical starting points, agents might take completely different valid paths to reach their goal. One agent might search three sources while another searches ten, or they might use different tools to find the same answer. Because we don’t always know what the right steps are, we usually can't just check if agents followed the “correct” steps we prescribed in advance. Instead, we need flexible evaluation methods that judge whether agents achieved the right outcomes while also following a reasonable process.

Start evaluating immediately with small samples. In early agent development, changes tend to have dramatic impacts because there is abundant low-hanging fruit. A prompt tweak might boost success rates from 30% to 80%. With effect sizes this large, you can spot changes with just a few test cases. We started with a set of about 20 queries representing real usage patterns. Testing these queries often allowed us to clearly see the impact of changes. We often hear that AI developer teams delay creating evals because they believe that only large evals with hundreds of test cases are useful. However, it’s best to start with small-scale testing right away with a few examples, rather than delaying until you can build more thorough evals.
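
A starter eval can be as small as this sketch; the queries and grading hook are placeholders:

```python
# A deliberately small eval: ~20 queries drawn from real usage are enough
# to see large early effect sizes. Queries and grader here are placeholders.

TEST_QUERIES = [
    ("Who founded Anthropic?", "mentions Dario Amodei"),
    ("Compare the top 3 cloud providers' AI offerings", "covers all three"),
    # ... roughly 20 queries representing real usage patterns
]

def run_eval(agent, grade) -> float:
    """agent: query -> output; grade: (output, expectation) -> bool."""
    passed = sum(grade(agent(q), exp) for q, exp in TEST_QUERIES)
    return passed / len(TEST_QUERIES)

# With effect sizes like 30% -> 80%, even this tiny set cleanly
# separates a good prompt change from a bad one.
```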

LLM-as-judge evaluation scales when done well. Research outputs are difficult to evaluate programmatically, since they are free-form text and rarely have a single correct answer. LLMs are a natural fit for grading outputs. We used an LLM judge that evaluated each output against criteria in a rubric: factual accuracy (do claims match sources?), citation accuracy (do the cited sources match the claims?), completeness (are all requested aspects covered?), source quality (did it use primary sources over lower-quality secondary sources?), and tool efficiency (did it use the right tools a reasonable number of times?). We experimented with multiple judges to evaluate each component, but found that a single LLM call with a single prompt outputting scores from 0.0-1.0 and a pass-fail grade was the most consistent and aligned with human judgements. This method was especially effective when the eval test cases did have a clear answer, and we could use the LLM judge to simply check if the answer was correct (i.e. did it accurately list the pharma companies with the top 3 largest R&D budgets?). Using an LLM as a judge allowed us to scalably evaluate hundreds of outputs.
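
A single-call judge of this kind might look like the following sketch, written against the Anthropic Python SDK; the model ID and rubric wording are placeholders rather than our exact production judge:

```python
# Single-call LLM judge with a rubric, scoring 0.0-1.0 plus pass/fail.
# Sketch only: the model ID and rubric text are placeholders.
import json
import anthropic

RUBRIC = """Grade the research report against the source documents.
Score each criterion 0.0-1.0: factual_accuracy, citation_accuracy,
completeness, source_quality, tool_efficiency. Then give an overall
"pass" or "fail". Reply with JSON only."""

def judge(report: str, sources: str) -> dict:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nSOURCES:\n{sources}\n\nREPORT:\n{report}",
        }],
    )
    return json.loads(msg.content[0].text)
```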

Human evaluation catches what automation misses. People testing agents find edge cases that evals miss. These include hallucinated answers on unusual queries, system failures, or subtle source selection biases. In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue. Even in a world of automated evaluations, manual testing remains essential.

Multi-agent systems have emergent behaviors, which arise without specific programming. For instance, small changes to the lead agent can unpredictably change how subagents behave. Success requires understanding interaction patterns, not just individual agent behavior. Therefore, the best prompts for these agents are not just strict instructions, but frameworks for collaboration that define the division of labor, problem-solving approaches, and effort budgets. Getting this right relies on careful prompting and tool design, solid heuristics, observability, and tight feedback loops. See the open-source prompts in our Cookbook for example prompts from our system.

Production reliability and engineering challenges

In traditional software, a bug might break a feature, degrade performance, or cause outages. In agentic systems, minor changes cascade into large behavioral changes, which makes it remarkably difficult to write code for complex agents that must maintain state in a long-running process.

Agents are stateful and errors compound. Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents. When errors occur, we can't just restart from the beginning: restarts are expensive and frustrating for users. Instead, we built systems that can resume from where the agent was when the errors occurred. We also use the model’s intelligence to handle issues gracefully: for instance, letting the agent know when a tool is failing and letting it adapt works surprisingly well. We combine the adaptability of AI agents built on Claude with deterministic safeguards like retry logic and regular checkpoints.
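
A stripped-down version of this combination, checkpointing plus retries plus surfacing tool failures to the model, might look like this sketch (all names illustrative):

```python
# Durable-execution sketch: checkpoint after each step, retry with
# backoff, and tell the agent about tool failures so it can adapt.
import json, pathlib, time

class ToolError(Exception):
    """Raised by a tool wrapper when an external call fails."""

def run_with_checkpoints(agent_step, state_path="agent_state.json",
                         max_retries=3):
    path = pathlib.Path(state_path)
    state = (json.loads(path.read_text()) if path.exists()
             else {"step": 0, "history": [], "done": False})
    while not state["done"]:
        for attempt in range(max_retries):
            try:
                state = agent_step(state)   # one model turn / tool call
                break
            except ToolError as exc:
                # Surface the failure to the agent rather than blindly
                # retrying the identical call.
                state["history"].append(f"tool failed: {exc}")
                time.sleep(2 ** attempt)    # simple exponential backoff
        else:
            raise RuntimeError("step kept failing; resume from checkpoint")
        path.write_text(json.dumps(state))  # durable checkpoint
    return state
```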

Debugging benefits from new approaches. Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder. For instance, users would report agents “not finding obvious information,” but we couldn't see why. Were the agents using bad search queries? Choosing poor sources? Hitting tool failures? Adding full production tracing let us diagnose why agents failed and fix issues systematically. Beyond standard observability, we monitor agent decision patterns and interaction structures—all without monitoring the contents of individual conversations, to maintain user privacy. This high-level observability helped us diagnose root causes, discover unexpected behaviors, and fix common failures.
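
For example, a minimal privacy-preserving trace can be added at the tool boundary. This decorator is a sketch of the idea, not our tracing stack:

```python
# Lightweight decision-pattern tracing (sketch): log which tool was
# called and how long it took, never the conversation contents.
import functools, logging, time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.trace")

def traced(tool_fn):
    """Record decision structure (tool name, latency), not contents."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return tool_fn(*args, **kwargs)
        finally:
            # Only the tool name and duration are logged, so traces stay
            # useful for diagnosis without exposing user data.
            log.info("tool=%s duration=%.2fs",
                     tool_fn.__name__, time.monotonic() - start)
    return wrapper

@traced
def web_search(query: str) -> str:   # hypothetical tool
    return "..."
```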

Deployment needs careful coordination. Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. This means that whenever we deploy updates, agents might be anywhere in their process. We therefore need to prevent our well-meaning code changes from breaking existing agents. We can’t update every agent to the new version at the same time. Instead, we use rainbow deployments to avoid disrupting running agents, by gradually shifting traffic from old to new versions while keeping both running simultaneously.
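
The core mechanic is simple: pin in-flight sessions to the version they started on, and apply a rising rollout fraction only to new sessions. A sketch, with all names assumed:

```python
# Rainbow-deployment sketch: running agents stay pinned to their version;
# only new sessions see the gradually shifting traffic weights.
import random
from dataclasses import dataclass

@dataclass
class Session:
    version: str | None = None   # pinned once the agent starts running

def pick_version(session: Session, rollout_fraction: float) -> str:
    # A deploy never changes code underneath a long-running process.
    if session.version is None:
        session.version = ("new" if random.random() < rollout_fraction
                           else "old")
    return session.version

# Operators ramp rollout_fraction from 0.0 to 1.0 over time, keeping
# both versions serving until traffic on the old one drains.
```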

Synchronous execution creates bottlenecks. Currently, our lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding. This simplifies coordination, but creates bottlenecks in the information flow between agents. For instance, the lead agent can’t steer subagents, subagents can’t coordinate, and the entire system can be blocked while waiting for a single subagent to finish searching. Asynchronous execution would enable additional parallelism: agents working concurrently and creating new subagents when needed. But this asynchronicity adds challenges in result coordination, state consistency, and error propagation across the subagents. As models can handle longer and more complex research tasks, we expect the performance gains will justify the complexity.

Conclusion

When building AI agents, the last mile often becomes most of the journey. Codebases that work on developer machines require significant engineering to become reliable production systems. The compound nature of errors in agentic systems means that minor issues for traditional software can derail agents entirely. One step failing can cause agents to explore entirely different trajectories, leading to unpredictable outcomes. For all the reasons described in this post, the gap between prototype and production is often wider than anticipated.

Despite these challenges, multi-agent systems have proven valuable for open-ended research tasks. Users have said that Claude helped them find business opportunities they hadn’t considered, navigate complex healthcare options, resolve thorny technical bugs, and save up to days of work by uncovering research connections they wouldn't have found alone. Multi-agent research systems can operate reliably at scale with careful engineering, comprehensive testing, detail-oriented prompt and tool design, robust operational practices, and tight collaboration between research, product, and engineering teams who have a strong understanding of current agent capabilities. We're already seeing these systems transform how people solve complex problems.

A Clio embedding plot showing the most common ways people are using the Research feature today. The top use case categories are developing software systems across specialized domains (10%), develop and optimize professional and technical content (8%), develop business growth and revenue generation strategies (8%), assist with academic research and educational material development (7%), and research and verify information about people, places, or organizations (5%).

Acknowledgements

Written by Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. This work reflects the collective efforts of several teams across Anthropic who made the Research feature possible. Special thanks go to the Anthropic apps engineering team, whose dedication brought this complex multi-agent system to production. We're also grateful to our early users for their excellent feedback.

Appendix

Below are some additional miscellaneous tips for multi-agent systems.

End-state evaluation of agents that mutate state over many turns. Evaluating agents that modify persistent state across multi-turn conversations presents unique challenges. Unlike read-only research tasks, each action can change the environment for subsequent steps, creating dependencies that traditional evaluation methods struggle to handle. We found success focusing on end-state evaluation rather than turn-by-turn analysis. Instead of judging whether the agent followed a specific process, evaluate whether it achieved the correct final state. This approach acknowledges that agents may find alternative paths to the same goal while still ensuring they deliver the intended outcome. For complex workflows, break evaluation into discrete checkpoints where specific state changes should have occurred, rather than attempting to validate every intermediate step.
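
In code, that reduces to comparing the final state against a handful of expected checkpoints, as in this sketch (the state keys are hypothetical):

```python
# End-state evaluation: judge the final environment state at a few
# checkpoints rather than validating every intermediate step.
# The keys below are hypothetical, for one imagined test case.

EXPECTED_END_STATE = {
    "ticket_created": True,
    "ticket_status": "resolved",
    "customer_notified": True,
}

def end_state_passes(final_state: dict) -> bool:
    # Any path the agent took is acceptable, as long as the state
    # changes we actually care about occurred.
    return all(final_state.get(key) == want
               for key, want in EXPECTED_END_STATE.items())
```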

Long-horizon conversation management. Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms. We implemented patterns where agents summarize completed work phases and store essential information in external memory before proceeding to new tasks. When context limits approach, agents can spawn fresh subagents with clean contexts while maintaining continuity through careful handoffs. Further, they can retrieve stored context like the research plan from their memory rather than losing previous work when reaching the context limit. This distributed approach prevents context overflow while preserving conversation coherence across extended interactions.
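
A sketch of the compaction handoff, reusing a simple key-value memory like the one shown earlier; the helpers and their prompts are illustrative:

```python
# Compaction handoff sketch: summarize the finished phase into external
# memory, then seed a fresh, clean context from the plan plus summaries.
def finish_phase(llm, memory, phase: str, transcript: str) -> None:
    summary = llm("Summarize the essential findings, sources, and open "
                  "questions from this completed work phase:\n" + transcript)
    memory.save(phase, summary)          # survives any context truncation

def fresh_context(memory, plan_key: str, completed_phases: list[str]) -> str:
    # The new subagent starts with only the plan and phase summaries,
    # not the hundreds of turns that produced them.
    plan = memory.load(plan_key)
    summaries = "\n\n".join(memory.load(p) for p in completed_phases)
    return f"Research plan:\n{plan}\n\nCompleted phases:\n{summaries}"
```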

Subagent output to a filesystem to minimize the ‘game of telephone.’ Direct subagent outputs can bypass the main coordinator for certain types of results, improving both fidelity and performance. Rather than requiring subagents to communicate everything through the lead agent, implement artifact systems where specialized agents can create outputs that persist independently. Subagents call tools to store their work in external systems, then pass lightweight references back to the coordinator. This prevents information loss during multi-stage processing and reduces token overhead from copying large outputs through conversation history. The pattern works particularly well for structured outputs like code, reports, or data visualizations where the subagent's specialized prompt produces better results than filtering through a general coordinator.
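
A minimal version of the artifact pattern might look like this sketch, with a local directory standing in for whatever shared storage a real system would use:

```python
# Artifact sketch: the subagent persists its full output and hands the
# coordinator a lightweight reference instead of the content itself.
import pathlib, uuid

ARTIFACT_DIR = pathlib.Path("artifacts")

def store_artifact(content: str, kind: str) -> str:
    """Subagent side: persist full output, return a small reference."""
    ARTIFACT_DIR.mkdir(exist_ok=True)
    ref = f"{kind}-{uuid.uuid4().hex[:8]}.md"
    (ARTIFACT_DIR / ref).write_text(content)
    return ref                      # only this token travels through chat

def load_artifact(ref: str) -> str:
    # The assembly step dereferences artifacts directly, so large outputs
    # are never copied (or paraphrased) through the coordinator's context.
    return (ARTIFACT_DIR / ref).read_text()
```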