License: arXiv.org perpetual non-exclusive license
arXiv:2506.18096v1 [cs.AI] 22 Jun 2025

Deep Research Agents:
A Systematic Examination And Roadmap

Yuxuan Huang Yihang Chen Haozheng Zhang Kang Li Meng Fang Linyi Yang
Xiaoguang Li
Lifeng Shang Songcen Xu Jianye Hao Kun Shao Jun Wang
Abstract

The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: https://github.com/ai-agents-2030/awesome-deep-research-agent.

1 Introduction

Recent advances in large language models (LLMs) have led to the rapid emergence of sophisticated AI agents capable of autonomous research. Early models such as GPT-3 brown2020languagemodelsfewshotlearners primarily addressed isolated tasks, including question answering and machine translation. Subsequently, integration with external tools enabled models such as WebGPT nakano2021webgpt to navigate the web and synthesize information from diverse sources autonomously. Most recently, a new class of advanced autonomous systems, termed Deep Research (DR) agents, has emerged, exemplified by industry-leading solutions such as OpenAI DR openai2025deepresearch , Gemini DR geminideepresearch , Grok DeepSearch grokdeepresearch , and Perplexity DR perplexitydeepresearch . These deep research agents significantly extend LLMs by incorporating advanced reasoning, dynamic task planning, and adaptive interaction with web resources and analytical tools.

Formally, we define “Deep Research Agents” as:

AI agents powered by LLMs, integrating dynamic reasoning, adaptive planning, multi-iteration external data retrieval and tool use, and comprehensive analytical report generation for informational research tasks.

Specifically, DR agents leverage LLMs as their cognitive core, retrieving external knowledge in real-time through web browsers and structured APIs, and dynamically invoking analytical tools via customized toolkits or standardized interfaces such as the Model Context Protocol (MCP). This architecture enables DR agents to autonomously manage complex, end-to-end research workflows by seamlessly integrating reasoning processes with multimodal resources.

Compared with traditional Retrieval-Augmented Generation (RAG) methods singh2025agentic , which primarily enhance factual accuracy but lack sustained reasoning capabilities chen2025improving , and conventional Tool Use (TU) systems qu2025tool that heavily depend on pre-defined workflows wang2025tdag , DR agents offer significantly greater autonomy, continual and deep reasoning abilities, dynamic task planning, and adaptive real-time interaction. These advanced capabilities uniquely position DR agents to handle complex, evolving, and knowledge-intensive research scenarios. A representative example of such a DR agent architecture is illustrated in Figure 1, which demonstrates the complete workflow from user input through optional planning and intent clarification, to iterative tool utilization encompassing offline retrieval (vector and relational databases), online retrieval (APIs and browsers), and extended capabilities including data analytics, coding, and multimodal generation, ultimately producing a comprehensive structured report.
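To make the workflow of Figure 1 concrete, the following self-contained Python sketch mirrors that loop under stated assumptions: every helper (clarify_intent, make_plan) and both tools are hypothetical stubs standing in for the components described above, not any vendor's implementation.

from typing import Callable

def clarify_intent(query: str) -> str:
    # Stub: a real DR agent may ask the user follow-up questions here.
    return query

def make_plan(task: str) -> list[str]:
    # Stub planner; a dynamic workflow would revise this list as evidence arrives.
    return [f"search: {task}", f"analyze: {task}"]

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[retrieved notes for '{q}']",   # API/browser retrieval stub
    "analyze": lambda q: f"[analysis of '{q}']",          # data-analytics/coding stub
}

def deep_research(query: str) -> str:
    task = clarify_intent(query)                          # optional intent clarification
    evidence = [TOOLS[step.split(": ")[0]](step.split(": ", 1)[1])
                for step in make_plan(task)]              # iterative tool utilization
    return "\n".join([f"# Report: {task}"] + evidence)    # structured analytical report

print(deep_research("impact of MCP on agent ecosystems"))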

Figure 1: A structural overview of a DR agent in a multi-agent architecture, shown for ease of illustration.
Contribution.

This survey systematically reviews recent advancements in DR agents, providing a comprehensive analysis of core technologies, methodologies, optimization pipelines, and representative implementations. Specifically, the contributions of this survey include:

  • A thorough analysis of representative DR systems, explicitly examining their system architectures, retrieval mechanisms, tool invocation methods, and performance characteristics, alongside optimization and tuning paradigms.
  • A unified classification framework (Figure 4) that systematically categorizes DR systems based on workflow characteristics (static versus dynamic), planning strategies, and agent-based architectures (single-agent versus multi-agent), bridging diverse technical methodologies and current industrial solutions.
  • A systematic review and categorization of existing benchmarks utilized to evaluate DR systems, highlighting how these benchmarks assess critical capabilities, such as retrieval accuracy, reasoning depth, and adaptive tool invocation proficiency.
  • A systematic analysis of critical open challenges and research directions, focusing on expanding retrieval scope beyond traditional methods, enabling asynchronous parallel execution, developing comprehensive multi-modal benchmarks, and optimizing multi-agent architectures for enhanced robustness and efficiency.
Survey Organization.

This survey methodically explores recent advancements in DR agents, organized as follows: Section 2 provides foundational concepts, examining recent progress in reasoning, retrieval-augmented generation, and agent communication protocols. Section 3 comprehensively analyzes key DR agent components, including search engine integration (Section 3.1), tool invocation strategies (Section 3.2), architectural workflows (Section 3.3), and optimization methodologies (Section 3.4). Section 4 reviews major industrial applications and practical implementations of DR agents by leading organizations. Section 5 surveys benchmarks used for evaluating DR systems, categorizing them into question-answering and task execution scenarios. Section 6 highlights critical challenges and outlines promising directions for future research, focusing on enhancing information acquisition, asynchronous parallel execution, benchmark alignment, and optimizing multi-agent architectures. Finally, Section 7 concludes with a summary and provides insights into the broader implications and opportunities within DR agent research.

2 Background and Preliminaries

2.1 Advances in Reasoning and Tool Integration

Recent advancements in large reasoning models (LRMs) have greatly enhanced the ability of language models to tackle complex and abstract tasks. These models have shown significant improvements in tasks such as arithmetic, common-sense reasoning, and symbolic problem-solving, largely due to innovations in model architectures and training techniques. One such advancement is Chain-of-Thought (CoT) prompting, introduced by Wei et al. wei2023chainofthoughtpromptingelicitsreasoning , which explicitly guides models to articulate intermediate logical steps, decomposing complex problems into simpler, sequential stages. This has led to notable improvements in both the interpretability and accuracy of LLMs on various reasoning benchmarks. Building upon CoT, subsequent research has introduced methods to further enhance LLM reasoning, particularly in handling lengthy textual contexts. Approaches such as positional interpolation and sparse attention mechanisms bai2024longalign ; wang2024beyond have been proposed to extend the effective context window. Furthermore, specialized benchmarks like LongBench bai2024longbench and LongFinanceQA lin2025facilitating have been developed to rigorously evaluate and improve the performance of these models in extended-context reasoning.

To address reasoning tasks that require real-time or specialized external knowledge, frameworks like Toolformer schick2023toolformerlanguagemodelsteach and MultiTool-CoT inaba2023multitool have been proposed, enabling LLMs to autonomously incorporate external computational resources and APIs directly within reasoning workflows. These approaches effectively enhance performance in tasks dependent on precise numerical calculations and dynamic information retrieval. Maintaining reasoning coherence across multiple conversational turns also poses distinct challenges. Techniques such as Dialogue CoT chae2023dialogue and Structured CoT (SCoT) sultan2024structured explicitly integrate dialogue states and conversational contexts within reasoning chains, significantly improving coherence, context-awareness, and the ability to manage iterative interactions and clarify complex user queries. However, despite substantial improvements, existing reasoning frameworks still encounter critical issues, including hallucinations, static or outdated internal knowledge, and insufficient responsiveness to rapidly changing information needs. These limitations highlight the necessity of integrating external information sources, real-time retrieval mechanisms, and adaptive reasoning strategies, which are core motivations driving recent advances toward more comprehensive and robust reasoning frameworks suitable for DR Agent applications.

2.2 Advances in Retrieval-Augmented Generation and Agentic Retrieval

Retrieval-augmented Generation (RAG), leveraging external knowledge bases (e.g., webs, APIs), has emerged as an effective strategy to mitigate hallucination problems and enhance the accuracy of web information search fan2024survey ; gao2023retrieval ; singh2025agentic . Early RAG architectures typically involved a static pipeline, where retrievers fetched relevant documents from external sources such as Wikipedia or search engines, and generators (e.g., LLMs) produced answers based solely on these retrieved passages. However, static approaches were limited in handling complex or multi-step queries, motivating recent advances toward iterative and interactive retrieval mechanisms to generate richer and more relevant responses, including FLARE zhang2024enhancing , Self-RAG asai2023self , IAG zhang2023iag , and ToC kim2023tree . In addition, studies izacard2023atlas ; lin2023ra expanded retrieval sources from structured databases (e.g., Wikipedia) to large-scale, diverse web corpora such as the Common Crawl dump preprocessed via the CCNet pipeline fu2022ccnet . Further improvements of RAG include hybrid approaches that combine internal LLM knowledge and external retrievals for better accuracy and coherence aliannejadi2024trec . Recently, Huang et al. huang2025rag proposed RAG-RL, introducing reinforcement learning and curriculum learning techniques, enabling reasoning language models (RLMs) to more effectively identify and utilize relevant contexts.

Despite these advancements in retrieval methods and reasoning-enhanced models, RAG approaches still face limitations in effectively managing complex reasoning workflows and dynamically adapting to varied task requirements. To address these challenges, recent research extends RAG into an agentic paradigm, integrating additional reasoning and decision-making layers atop conventional RAG pipelines singh2025agentic . Agentic RAG approaches leverage iterative retrieval, adaptive querying, and dynamic workflow adjustments, significantly enhancing multi-step reasoning capabilities. For example, RL-based query refinement techniques (e.g., Hsu et al. hsu2024grounding ) improve retrieval for complex queries, while graph-based retrieval (e.g., GeAR shen2024gear ) further enhances the processing of multi-hop queries. Despite these advancements, agentic RAG still faces critical challenges, including balancing computational overhead from dynamic reasoning processes singh2025agentic , aligning agent behaviors with user intentions zerhoudi2024personarag , and ensuring interpretability in adaptive workflows hsu2024grounding ; singh2025agentic . Moreover, even advanced agentic RAG approaches remain constrained by their reliance on pre-existing or periodically updated corpora, limiting their ability to handle real-time, rapidly changing, or long-tail information needs effectively. Addressing this challenge requires integrating external APIs and web browsing capabilities into RAG architectures, motivating recent DR methods aimed at further enhancing retrieval comprehensiveness and adaptability.
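As a concrete illustration of the iterative, agentic retrieval loop described above, here is a minimal self-contained sketch: the retriever is a toy lexical ranker and the query-refinement step is a stub, standing in for the learned components used by systems such as Self-RAG or RAG-RL.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy lexical retriever: rank documents by term overlap with the query.
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))
    return ranked[:k]

def agentic_rag(question: str, corpus: list[str], max_rounds: int = 3) -> list[str]:
    context, query = [], question
    for _ in range(max_rounds):
        hits = [d for d in retrieve(query, corpus) if d not in context]
        if not hits:                 # stop when retrieval saturates
            break
        context.extend(hits)
        query = hits[-1]             # stub "query refinement": follow the newest evidence
    return context                   # an LLM would now generate an answer from this context

corpus = ["MCP standardizes tool access", "A2A coordinates agents",
          "agents use retrieval for freshness"]
print(agentic_rag("how do agents access tools", corpus))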

2.3 Model Context Protocol and Agent-to-Agent Policy

Model Context Protocol (MCP) and Agent-to-Agent (A2A) have been proposed to address interoperability challenges in LLM-based agent systems, enabling efficient tool access and effective multi-agent collaboration. MCP: Traditional Tool Use (TU) agents face significant challenges, including inconsistent APIs, high maintenance costs, and redundant development efforts, severely limiting interoperability across systems schick2023toolformerlanguagemodelsteach . To address these issues, Anthropic introduced the MCP, a unified communication layer allowing LLM-based agents to interact securely and consistently with external services and data sources via standardized interfaces. MCP mitigates data silo problems by providing dynamic service discovery and uniform access patterns. A2A: Google’s A2A protocol facilitates decentralized multi-agent collaboration through structured, task-oriented dialogues. Agents from diverse vendors and model architectures can discover peers, delegate responsibilities, and collaboratively manage complex tasks as equal participants google2025a2a . By abstracting agent discovery into Agent Cards, and task coordination into Tasks and Artifacts, A2A supports flexible, incremental, multi-modal workflows, ideally suited to sophisticated collaborative scenarios.

MCP and A2A complement each other by clearly separating responsibilities: MCP serves as a standardized interface for accessing external tools, while A2A orchestrates collaborative agent interactions. Together, they establish a modular and scalable foundation for open, interoperable agent ecosystems, significantly enhancing the practical capabilities of AI systems in tackling complex real-world challenges.
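For concreteness, the sketch below shows the shape of the JSON-RPC 2.0 messages that MCP uses for tool discovery and invocation; the method names follow the published specification, while the tool name and arguments are hypothetical.

import json

# Discover the tools a server exposes.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Invoke one of them; the argument schema is declared in the tools/list response.
call_tool = {
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {
        "name": "web_search",                           # hypothetical tool name
        "arguments": {"query": "deep research agents"},
    },
}

print(json.dumps(call_tool, indent=2))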

3 Deep Research: Search Engine, Tool Use, Workflow, Tuning, Non-parametric Continual Learning

Comparison with Conventional RAG-based Approaches. DR agents expand the capabilities of traditional RAG methods by integrating dynamic retrieval, real-time TU, and adaptive reasoning into a unified system. RAG-based approaches typically rely on fixed pipelines, limiting their flexibility in handling complex, multi-step queries or rapidly changing contexts. In contrast, DR agents provide greater autonomy, context-awareness, and accuracy by dynamically engaging with external tools and managing multi-stage research tasks in real time.

In this section, we explore five core components essential for the development and optimization of DR agents: (3.1) search engine integration, which compares API-based interfaces with browser-based exploration to enhance dynamic knowledge acquisition; (3.2) Tool Use capabilities, which investigate the integration of code execution, mathematical computation, file manipulation, and multimodal processing modules within the agent’s inference pipeline; (3.3) workflow architecture, analyzing foundational designs, the balance between multi-agent and single-agent paradigms, memory mechanisms, and auxiliary components that facilitate the orchestration of complex research workflows; (3.4) tuning methodologies, which examine prompt-driven structured generation, LLM-driven prompting, fine-tuning strategies, and reinforcement learning approaches aimed at optimizing agent performance, and (3.5) Non-parametric continual learning, which enables LLM agents to self-evolve by dynamically adapting external tools, memory, and workflows without updating internal model weights, offering scalable optimization for complex tasks.

Figure 2: General Comparison of API-Based and Browser-Based Retrieval Workflow.

3.1 Search Engine: API vs. Browser

To enhance reasoning depth and accuracy for handling evolving tasks, DR agents employ search engines (SE) to update their knowledge through interaction with the external environment. In Table 1, we present a comparative overview of SEs, base models, and evaluation benchmarks employed by existing DR agents. The SEs can be broadly categorized into two types:

  1) API-Based SEs, which interact with structured data sources, such as search-engine APIs or scientific database APIs, enabling efficient retrieval of organized information.
  2) Browser-Based SEs, which simulate human-like interactions with web pages, facilitating real-time extraction of dynamic or unstructured content and improving the comprehensiveness of the external knowledge.
Table 1: Comparison of DR Agents with Search Engine Details
DR Agent | Benchmark | Base Model | Release
Avatar wu2024avataroptimizingllmagents | Stark | Claude-3-Opus, GPT-4 | Feb-2024
CoSearch-Agent gong2024cosearchagent | - | GPT-3.5-turbo | Feb-2024
MMAC-Copilot song2024mmac | - | GPT-3.5, GPT-4 | Mar-2024
Storm shao2024assistingwritingwikipedialikearticles | FreshWiki | GPT-3.5-turbo | Jul-2024
OpenResearcher zheng2024openresearcher | Privately Collected QA Data | DeepSeek-V2-Chat | Aug-2024
The AI Scientist lu2024aiscientistfullyautomated | MLE-Bench | GPT-4o, o1-mini, o1-preview | Aug-2024
Gemini DR geminideepresearch | GPQA | Gemini-2.0-Flash | Dec-2024
Agent Laboratory schmidgall2025agent | MLE-Bench | GPT-4o, o1-preview | Jan-2025
Search-o1 li2025search | GPQA, NQ, TriviaQA | QwQ-32B-preview | Jan-2025
Agentic Reasoning wu2025agentic | GPQA | DeepSeek-R1, Qwen2.5 | Feb-2025
AutoAgent tang2025autoagentfullyautomatedzerocodeframework | - | Claude-Sonnet-3.5 | Feb-2025
Grok DeepSearch grokdeepresearch | GPQA | Grok3 | Feb-2025
OpenAI DR openai2025deepresearch | - | GPT-o3 | Feb-2025
Perplexity DR perplexitydeepresearch | SimpleQA | Flexible | Feb-2025
AgentRxiv schmidgall2025agentrxiv | GPQA, MedQA | GPT-4o-mini | Mar-2025
Agent-R1 Agent-R1 | HotpotQA | Qwen2.5-1.5B-Inst | Mar-2025
AutoGLM Rumination zhipu2025autoglm | GPQA | GLM-Z1-Air | Mar-2025
Copilot Researcher microsoft_copilot_researcher | - | o3-mini | Mar-2025
H2O.ai DR h2oai | - | h2ogpt-oasst1-512-12b | Mar-2025
Manus manus2025 | - | Claude3.5, GPT-4o | Mar-2025
OpenManus openmanus2025 | - | Claude3.5, GPT-4o | Mar-2025
OWL owl2025 | - | DeepSeek-R1, Gemini2.5-Pro, GPT-4o | Mar-2025
R1-Searcher song2025r1 | 2WikiMultiHopQA, HotpotQA | Llama3.1-8B-Inst, Qwen2.5-7B | Mar-2025
ReSearch chen2025learning | 2WikiMultiHopQA, HotpotQA | Qwen2.5-7B, Qwen2.5-7B-Inst | Mar-2025
Search-R1 jin2025search | 2WikiMultiHopQA, HotpotQA, NQ, TriviaQA | Llama3.2-3B, Qwen2.5-3B/7B | Mar-2025
DeepResearcher zheng2025deepresearcherscalingdeepresearch | HotpotQA, NQ, TriviaQA | Qwen2.5-7B-Inst | Apr-2025
Genspark Super Agent genspark | - | Mixture of Agents (an ensemble of nine base models: GPT-4.1, GPT-o3, GPT-o4-mini-high, Claude-Sonnet-3.7-Thinking, Claude-Sonnet-3.7, Gemini-2.0-Flash, Gemini-2.5-Pro, DeepSeek-V3, DeepSeek-R1) | Apr-2025
WebThinker Li2025webthinker | GPQA, WebWalkerQA | QwQ-32B | Apr-2025
SWIRL goldie2025synthetic | HotpotQA, BeerQA | Gemma-2-27B | Apr-2025
SimpleDeepSearcher SimpleDeepSearcher | 2WikiMultiHopQA | Qwen-2.5-7B-Inst, Qwen-2.5-32B-Inst, DeepSeek-Distilled-Qwen-2.5-32B, QwQ-32B | Apr-2025
Suna AI sunaai | - | GPT-4o, Claude | Apr-2025
AgenticSeek agenticseek | - | GPT-4o, DeepSeek-R1, Claude | May-2025
Alita qiu2025alita | PathVQA | GPT-4o, Claude-Sonnet-4 | May-2025
DeerFlow deerflow | - | Doubao-1.5-Pro-32k, DeepSeek-R1, GPT-4o, Qwen | May-2025
PANGU DEEPDIVER shi2025pangu | C-SimpleQA, HotpotQA, ProxyQA | Pangu-7B-Reasoner | May-2025

API-based retrieval is a fast, efficient, structured, and scalable method that allows DR agents to access external knowledge sources with relatively less time and computational cost. For instance, Gemini DR leverages multi-source interfaces, most notably the Google Search API and the arXiv API, to perform large-scale retrieval across hundreds to thousands of web pages, thereby significantly expanding its information coverage. Grok DeepSearch grokdeepresearch claims to ensure both the freshness and depth of its knowledge base by maintaining a continuous index via news-outlet feeds, the Wikipedia API, and X’s native interface, and by activating a query-driven agent on demand to generate targeted sub-queries and fetch relevant pages in real time. AgentLaboratory schmidgall2025agent uses the arXiv API to extract paper metadata and abstracts for automated literature reviews. AI Scientist lu2024aiscientistfullyautomated issues requests to the Semantic Scholar API to validate the novelty and citation relationships of model-generated research ideas, and CoSearchAgent gong2024cosearchagent integrates SerpApi to deliver Slack-based, real-time search responses. DeepRetrieval jiang2025deepretrieval , operating within a reinforcement-learning framework, optimizes queries against the PubMed and ClinicalTrials.gov APIs to maximize recall on biomedical tasks, and Search-o1 li2025search combines the Bing Search API with the Jina Reader API to dynamically extract and refine passages for downstream reasoning. Whilst these API-driven methods excel at structured, high-throughput data acquisition, they generally struggle when faced with deeply nested client-side JavaScript-rendered content, interactive components, or authentication barriers, thereby motivating the development of browser-based search mechanisms capable of comprehensively extracting and analyzing dynamic or unstructured information.
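As a minimal, runnable example of API-based retrieval, the sketch below queries the public arXiv API (one of the structured sources named above) and parses the returned Atom feed; the endpoint and query parameters follow arXiv's documented interface, and error handling is omitted for brevity.

import urllib.parse, urllib.request
import xml.etree.ElementTree as ET

def arxiv_search(query: str, max_results: int = 5) -> list[str]:
    # Build the documented query-string form of the arXiv API request.
    params = urllib.parse.urlencode({
        "search_query": f"all:{query}", "start": 0, "max_results": max_results})
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        feed = ET.fromstring(resp.read())        # response is an Atom XML feed
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    return [(e.findtext("atom:title", namespaces=ns) or "").strip()
            for e in feed.findall("atom:entry", ns)]

print(arxiv_search("deep research agents"))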

Browser-based retrieval provides DR agents with dynamic, flexible, and interactive access to multimodal and unstructured web content through simulated human-like browser interactions. For example, Manus AI’s browsing agent operates a sandboxed Chromium instance for each research session, programmatically opening new tabs, issuing search queries, clicking through result links, scrolling pages until content thresholds are met, filling out form elements when necessary, executing in-page JavaScript to reveal lazily loaded sections, and downloading files or PDFs for local analysis manus2025 . Although OpenAI DR, Grok DeepSearch, and Gemini 2.5 DR do not publicly disclose the implementation details of their browsing capabilities, their ability to handle interactive widgets, dynamically rendered content, and multi-step navigation strongly suggests that they too employ comparable headless-browser frameworks behind the scenes. Among open-source studies, AutoAgent yu2024auto operates within a BrowserGym environment to scroll, interact with page components, and download files when APIs are unavailable yu2024auto ; DeepResearcher zheng2025deepresearcherscalingdeepresearch employs a dedicated Web Browsing Agent that, upon receiving a browse request, processes each segment of a webpage in turn, decides whether to continue to subsequent segments based on relevance, and incrementally aggregates pertinent information into a short-term memory buffer before returning it for reasoning. While browser-based retrieval excels at capturing real-time and deeply nested content that API calls cannot reach, it also incurs greater latency, resource consumption, and complexity in handling page variability and errors, suggesting that DR agents may benefit from hybrid architectures that combine the efficiency of API-based methods with the comprehensiveness of browser-driven exploration.
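By contrast, a browser-based retrieval step can be sketched with a headless-browser library such as Playwright; this is one plausible implementation of the behaviors described above (rendering client-side JavaScript, scrolling to reveal lazily loaded sections), not the undisclosed stacks of the systems mentioned. It assumes `pip install playwright` and `playwright install chromium`.

from playwright.sync_api import sync_playwright

def browse(url: str, scrolls: int = 3) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)   # sandboxed Chromium instance
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")     # let client-side JS render
        for _ in range(scrolls):                     # reveal lazily loaded sections
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(500)
        text = page.inner_text("body")               # extract rendered page text
        browser.close()
    return text

print(browse("https://arxiv.org/list/cs.AI/recent")[:500])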

3.2 Tool Use: Empowering Agents with Extended Functionalities

Table 2: Comparison of DR Agents with Tool Use Capabilities
DR Agent | Release
CoSearchAgent gong2024cosearchagent | Feb-2024
Storm shao2024assistingwritingwikipedialikearticles | Jul-2024
The AI Scientist lu2024aiscientistfullyautomated | Aug-2024
Agent Laboratory schmidgall2025agent | Jan-2025
Agentic Reasoning wu2025agentic | Feb-2025
AutoAgent tang2025autoagentfullyautomatedzerocodeframework | Feb-2025
Genspark DR genspark | Feb-2025
Grok DeepSearch grokdeepresearch | Feb-2025
OpenAI DR openai2025deepresearch | Feb-2025
Perplexity DR perplexitydeepresearch | Feb-2025
Agent-R1 Agent-R1 | Mar-2025
AutoGLM Rumination zhipu2025autoglm | Mar-2025
Copilot Researcher microsoft_copilot_researcher | Mar-2025
Manus manus2025 | Mar-2025
OpenManus openmanus2025 | Mar-2025
OWL owl2025 | Mar-2025
H2O.ai DR h2oai | Mar-2025
Genspark Super Agent genspark | Apr-2025
WebThinker Li2025webthinker | Apr-2025
Suna AI sunaai | Apr-2025
AgenticSeek agenticseek | May-2025
Alita qiu2025alita | May-2025
DeerFlow deerflow | May-2025

To expand DR agents’ capacity to interact with external environments in complex research tasks, specifically by actively invoking and handling diverse tools and data sources, various DR agents have introduced three core tool modules: code interpreters, data analytics, multimodal processing, along with the Model Context Protocol.

Code Interpreter.

The code interpreter capability enables DR agents to execute scripts during inference, allowing them to perform data processing, algorithm verification and model simulation. Most DR agents, except CoSearchAgent, embed a script execution environment. They typically rely on Python utilities such as Aider and Java utilities to orchestrate dynamic scripting, conduct literature-driven analysis and carry out real-time computational reasoning.
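A minimal sketch of the execute-and-observe loop behind such a code interpreter follows: model-generated Python runs in a separate isolated process under a timeout, and its output (or error) is returned to the agent. Production systems add real sandboxing (containers, resource limits) beyond this illustration.

import subprocess, sys

def run_snippet(code: str, timeout: float = 5.0) -> str:
    # -I runs Python in isolated mode (no site-packages, no user environment).
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True, text=True, timeout=timeout)
    return proc.stdout if proc.returncode == 0 else f"error:\n{proc.stderr}"

# The agent feeds this observation back into its reasoning loop.
print(run_snippet("print(sum(range(10)))"))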

Data Analytics.

By integrating data analytics modules, DR agents transform raw retrievals into structured insights by computing summary statistics, generating interactive visualizations and conducting quantitative model evaluations, thereby accelerating hypothesis testing and decision-making. Many commercial DR agents have implemented analytics features such as charting, table generation and statistical analysis, either locally or via remote services. However, most of these systems have not publicly disclosed technical details of their implementations. In contrast, academic studies often provide concrete examples: CoSearchAgent integrates SQL-based queries within team communication platforms to run aggregate analyses and produce reports; AutoGLM extracts and analyzes structured datasets directly from table-based web interfaces; and Search-o1’s Reason-in-Documents component refines lengthy retrieved texts before extracting key metrics for downstream evaluation.

Multimodal Processing and Generation.

Multimodal processing and generation tools enable DR agents to integrate, analyze and generate heterogeneous data such as text, images, audio and video within a unified reasoning pipeline, thereby enriching their contextual understanding and broadening the range of their outputs. Only a subset of mature commercial and open-source projects, for example Manus, OWL, AutoAgent, AutoGLM, OpenAI, Gemini, Perplexity and Grok DeepSearch, support this capability, whereas most academic prototypes have not implemented it, often due to the high computational cost. As typical open-source examples, OWL and OpenManus extend their pipelines to include interactions with platforms such as GitHub, Notion and Google Maps, and to leverage tools such as SymPy and Excel for combined data analysis and multimodal media processing owl2025 ; openmanus2025 .

Deep Research Agent with Computer Use.

Most recently, the boundaries of DR agents have been progressively expanded through the integration of computer-assisted task execution capabilities (i.e., computer use). For example, Zhipu AI introduced AutoGLM Rumination zhipu2025autoglm , an RL-based system incorporating self-reflection and iterative refinement mechanisms, which significantly enhances multi-step reasoning and advanced function-calling abilities. Specifically, AutoGLM Rumination autonomously interacts with web environments, executes code, invokes external APIs, and effectively accomplishes sophisticated tasks, including data retrieval, analysis, and structured generation of comprehensive reports. Comparison with OpenAI’s DR: While OpenAI DR primarily focuses on intricate reasoning and information retrieval, AutoGLM Rumination exhibits superior autonomy in practical execution. This enhanced autonomy allows it to transform abstract analytical insights into concrete operational tasks, such as automated interactions with web interfaces and real-time data processing. Moreover, AutoGLM Rumination addresses and resolves limitations inherent in simulated browsing environments by seamlessly integrating advanced reasoning capabilities with authentic browser-based interactions. Therefore, the agent gains reliable access to user-authenticated resources, including platforms such as CNKI, Xiaohongshu, and WeChat official accounts. Such integration significantly elevates the agent’s autonomy and adaptability in both information acquisition and execution of real-world tasks.

3.3 Architecture and Workflow

As shown in Figure 4, this section systematically analyzes the construction of DR systems, focusing on workflows categorized into static and dynamic types. We then discuss planning strategies, which enhance task allocation and execution through three distinctive user interaction types to clarify intent: planning-only (direct planning without clarifying user intent), intent-to-planning (clarifying intent before planning to align the task with user goals), and unified intent-planning (generating a plan and requesting user confirmation). The distinction between single-agent and multi-agent systems is examined in the context of dynamic workflows, emphasizing specialization in task management. Additionally, we examine memory mechanisms for managing and integrating retrieved information, which enhance the performance and adaptability of DR systems.

Figure 3: Comparison of Information Retrieval Methods. The upper left corner (Search) represents searching methods, which can use a browser or an API; the lower left corner (RAG, Query) represents Retrieval-Augmented Generation, combining retrieval and generative models to output natural-language answers; the right side (Deep Research) represents the deep research process, generating complex decisions or analyses through retrieval and explicit reasoning.
Figure 4: Comparison of DR Workflows: (1) Static vs. Dynamic Workflows: static workflows rely on predefined task sequences, while dynamic workflows allow LLM-based task planning. (2) Planning Strategies: planning-only (direct planning without clarifying user intent), intent-to-planning (clarifying intent before planning), and unified intent-planning (generating a plan and requesting user confirmation). (3) Single-Agent vs. Multi-Agent: dynamic workflows can be categorized into dynamic multi-agent systems (tasks distributed across specialized agents) or dynamic single-agent systems (an LRM autonomously updates and executes tasks).

3.3.1 Static vs. Dynamic Workflows

Static Workflows.

Static workflows rely on manually predefined task pipelines, decomposing research processes into sequential subtasks executed by dedicated agents. These workflows follow explicitly structured procedures, making them particularly suitable for well-defined, structured research scenarios. For instance, AI Scientist lu2024aiscientistfullyautomated automates scientific discovery through distinct sequential phases, including ideation, experimentation, and reporting. Similarly, Agent Laboratory schmidgall2025agent segments research activities into formalized stages, such as literature review, experimentation, and synthesis of findings. Extending this static paradigm further, AgentRxiv schmidgall2025agentrxiv incorporates inter-agent collaboration mechanisms, enabling incremental knowledge reuse through sharing intermediate research outcomes among specialized agents. Despite their ease of implementation and structured clarity, static workflows suffer from limited generalization capabilities, as each distinct task necessitates a specifically tailored pipeline.

Dynamic Workflows.

To overcome the limitations in flexibility and generalizability inherent in static workflows, dynamic workflows support adaptive task planning, allowing agents to dynamically reconfigure task structures based on iterative feedback and evolving contexts. Dynamic architectures leverage advanced mechanisms including automated planning, iterative refinement, and interactive task allocation, enabling tasks to evolve in real-time as new knowledge or external inputs become available. Consequently, dynamic workflows exhibit superior generality and adaptability, making them highly suitable for complex, knowledge-intensive tasks commonly encountered in AI-driven research scenarios.

3.3.2 Dynamic Workflows: Planning Strategies

To enhance DR agents’ adaptability in response to evolving user requirements and contexts, existing studies propose three distinctive LLM-based planning strategies, each differing in whether and how they interact with the user to clarify intent (a minimal dispatch sketch follows the list):

  1) The Planning-Only approach directly generates task plans based solely on the initial user prompt without engaging in further clarification; it is adopted by the majority of existing DR agents, including Grok grokdeepresearch , H2O h2oai and Manus manus2025 .
  2) The Intent-to-Planning strategy actively clarifies user intent prior to planning through targeted questions, subsequently generating tailored task sequences based on the clarified user inputs; this method is utilized by OpenAI DR openai2025deepresearch .
  3) The Unified Intent-Planning approach synthesizes these methods by generating a preliminary plan from the initial prompt and interactively engaging the user to confirm or revise it. Gemini DR geminideepresearch is representative of this strategy, effectively adopting the strength of user-guided refinement.
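The dispatcher below makes the three strategies concrete as a self-contained sketch; ask_user is a stand-in for whatever clarification channel an agent exposes, and the plans themselves are stubs.

def ask_user(prompt: str) -> str:
    # Stub for the agent's clarification channel (chat UI, follow-up question, ...).
    return input(prompt + " ")

def plan(query: str, strategy: str) -> list[str]:
    if strategy == "intent-to-planning":          # clarify first (e.g., OpenAI DR)
        query += " | " + ask_user("Any constraints to clarify?")
    steps = [f"search: {query}", f"synthesize: {query}"]  # draft plan
    if strategy == "unified-intent-planning":     # propose, then confirm (e.g., Gemini DR)
        if ask_user(f"Proposed plan {steps}; confirm?").lower() not in ("y", "yes"):
            return plan(ask_user("Revised request:"), "planning-only")
    return steps                                  # planning-only: no interaction (e.g., Grok)

print(plan("survey DR agents", "planning-only"))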

3.3.3 Dynamic Workflows: Single-Agent vs. Multi-Agent

Dynamic workflows of DR agents can be differentiated based on agent architectures into single-agent and multi-agent frameworks, each exhibiting distinct characteristics concerning task specialization, coordination complexity, and scalability of execution.

Dynamic Single-Agent Systems.

Dynamic single-agent systems integrate planning, tool invocation, and execution within a unified LRM, streamlining task management into a cohesive cognitive loop. Single-agent architectures autonomously refine task plans and invoke appropriate tools based on evolving contexts, typically without explicit inter-agent coordination. Compared to multi-agent architectures, single-agent systems enable direct end-to-end reinforcement learning (RL) optimization across the entire workflow, facilitating smoother and more coherent integration of reasoning, planning, and tool invocation. Systems such as Agent-R1 Agent-R1 , ReSearch chen2025learning , and Search-R1 jin2025search exemplify this paradigm through iterative cycles of explicit reasoning, action, and reflection, aligning with the ReAct framework yao2023reactsynergizingreasoningacting . However, this streamlined approach places significant demands on the foundation model’s reasoning capabilities, contextual understanding, and autonomous selection and invocation of tools. Additionally, the tightly integrated nature of single-agent systems may limit modular flexibility, complicating independent scaling or optimization of individual functional components.
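The loop below is a self-contained caricature of this ReAct-style cycle of explicit reasoning, action, and reflection; the llm function is a scripted stand-in for a real reasoning model, and the single search tool is a stub.

def llm(transcript: str) -> str:
    # Stub policy: issue one search action, then answer from the observation.
    if "Observation:" not in transcript:
        return "Action: search[capital of France]"
    return "Final: Paris"

def react(question: str, max_turns: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        out = llm(transcript)
        if out.startswith("Final:"):
            return out.removeprefix("Final:").strip()
        arg = out[out.index("[") + 1 : out.rindex("]")]
        observation = f"{arg} -> Paris"          # stub tool invocation (search)
        transcript += f"\n{out}\nObservation: {observation}"
    return "no answer"

print(react("What is the capital of France?"))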

Dynamic Multi-Agent Systems.

Dynamic multi-agent systems leverage multiple specialized agents to collaboratively execute subtasks generated and dynamically allocated through adaptive planning strategies. These systems typically employ hierarchical or centralized planning mechanisms, wherein a coordinator agent continuously assigns and redistributes tasks based on real-time feedback and replanning. Representative frameworks include OpenManus openmanus2025 and Manus manus2025 , both adopting hierarchical planner-toolcaller architectures. Similarly, OWL owl2025 includes a workforce-oriented model, utilizing a central manager agent to orchestrate task distribution among specialized execution agents. Furthermore, Alita qiu2025alita incorporates a self-evolution mechanism into DR agents, allowing the agent to instantiate and configure new MCP servers online, tailored to specific tasks and environmental conditions. Such multi-agent configurations effectively handle complex, parallelizable research tasks, thereby enhancing flexibility and scalability in open-ended research scenarios. Nevertheless, a major current challenge of multi-agent systems lies in the inherent complexity of coordinating multiple independent agents, making it difficult to conduct effective end-to-end reinforcement learning optimization.
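A minimal planner/worker sketch of such a hierarchical planner-toolcaller arrangement follows; the decomposition and the three worker agents are stubs, and a real system would reassign subtasks based on feedback and replanning.

WORKERS = {
    "search": lambda t: f"[search results for '{t}']",   # retrieval specialist stub
    "code":   lambda t: f"[executed analysis for '{t}']", # code/analytics specialist stub
    "write":  lambda t: f"[draft section on '{t}']",      # writing specialist stub
}

def coordinator(task: str) -> str:
    # Stub decomposition; a real coordinator replans as worker feedback arrives.
    subtasks = [("search", task), ("code", task), ("write", task)]
    artifacts = [WORKERS[role](sub) for role, sub in subtasks]
    return "\n".join(artifacts)

print(coordinator("compare DR agent benchmarks"))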

3.3.4 Memory Mechanism for Long-Context Optimization

Memory mechanisms empower DR agents to persistently capture, organize, and recall relevant information across multiple retrieval rounds, thereby reducing redundant queries and improving both the efficiency and coherence of DR tasks. During the DR process, agents typically perform extensive multi-round retrieval, generating hundreds of thousands of tokens (or even millions). Although recent advances in LLMs have significantly expanded context window sizes, current limits still constrain tasks involving extremely long contexts. To address these challenges, DR systems have implemented various optimizations for processing extended contexts. Broadly, these optimizations can be categorized into three main strategies: (i) Expanding the Context Window Length; (ii) Compressing Intermediate Steps; (iii) Utilizing External Structured Storage for Temporary Results.

Extending the context window length is the most intuitively effective approach, exemplified by Google’s Gemini model geminideepresearch , which supports a context window of up to one million tokens, supplemented by a RAG setup. Despite its straightforwardness, this method often incurs high computational costs and may lead to inefficiencies in resource utilization during practical deployments.

An alternative strategy involves compressing or summarizing intermediate reasoning steps, significantly reducing the number of tokens processed by the model and thereby improving both efficiency and output quality. Representative frameworks such as The AI Scientist lu2024aiscientistfullyautomated and CycleResearcher weng2024cycleresearcher pass summarized intermediate results between workflow phases. Further, Li et al. li2025search introduced the concept of “Reason-in-Documents,” utilizing LRMs to compress documents, substantially reducing token volume and enhancing model decision-making efficiency. However, a potential drawback of this approach is the loss of detailed information, potentially impacting the precision of subsequent reasoning.
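The sketch below shows where such compression sits in the pipeline: each intermediate document is reduced to its most question-relevant sentences before being passed on. Real systems use an LLM summarizer for this step (as in the "Reason-in-Documents" idea); the lexical scorer here is only a stand-in.

def compress(question: str, document: str, keep: int = 2) -> str:
    # Score each sentence by term overlap with the question and keep the top few.
    terms = set(question.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    scored = sorted(sentences, key=lambda s: -len(terms & set(s.lower().split())))
    return ". ".join(scored[:keep]) + "."

doc = ("DR agents retrieve many pages. Token budgets are finite. "
       "Compression keeps only question-relevant sentences. Cats are mammals.")
print(compress("how do agents handle token budgets", doc))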

Utilizing external structured storages for preserving and retrieving historical information enables DR agents to persistently and efficiently store vast amounts of past context beyond the constraints of the context window, improving memory capacity, retrieval speed, and semantic relevance. Popular open-source frameworks such as Manus manus2025 , OWL owl2025 , Open Manus openmanus2025 , and Avatar wu2024avataroptimizingllmagents utilize external file systems to store intermediate outcomes and historical data for subsequent retrieval. Frameworks like WebThinker Li2025webthinker and AutoAgent tang2025autoagentfullyautomatedzerocodeframework have developed self-managing modules that leverage vector databases to support scalable memory storage and fast similarity-based lookup. Beyond plain text or vector stores, some works propose more semantically structured memory frameworks: for instance, Wu et al. wu2025agentic employ knowledge graphs to capture intermediate reasoning processes and thereby enhance the precision of information reuse, while Agentrxiv schmidgall2025agentrxiv simulates an academic repository akin to arXiv for storing and retrieving relevant outcomes from other agents. Although these structured approaches offer superior semantic retrieval efficiency and accuracy, they typically entail higher development and maintenance costs due to the need for meticulous data structure design and management.
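A self-contained sketch of this pattern follows: findings are embedded and stored outside the context window, then recalled by similarity when needed. The bag-of-words embedding is a toy stand-in for the learned embeddings and vector databases used by the frameworks above.

import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use learned dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / (norm or 1.0)

class Memory:
    def __init__(self):
        self.items: list[tuple[Counter, str]] = []   # external store, outside the context window
    def add(self, text: str):
        self.items.append((embed(text), text))
    def recall(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        return [t for _, t in sorted(self.items, key=lambda it: -cosine(q, it[0]))[:k]]

mem = Memory()
mem.add("GAIA evaluates general assistant tasks")
mem.add("MCP standardizes tool access")
print(mem.recall("which benchmark covers assistant tasks"))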

3.4 Tuning: Beyond Prompting toward Capability Enhancement

Table 3: Comparison of DR Agents with Tuning Methods
DR Agent | RL Method | Base Model | Data | Reward Design | Release
Gemini DR geminideepresearch | - | Gemini-2.0-Flash | - | - | Dec-2024
Grok DeepSearch grokdeepresearch | - | Grok3 | - | - | Feb-2025
OpenAI DR openai2025deepresearch | - | GPT-o3 | - | - | Feb-2025
Agent-R1 Agent-R1 | PPO schulman2017proximal , Reinforce++ hu2025reinforce , GRPO shao2024deepseekmath | Qwen2.5-1.5B-Inst | HotpotQA | Rule-Outcome | Mar-2025
AutoGLM Rumination zhipu2025autoglm | - | GLM-Z1-Air | - | - | Mar-2025
H2O.ai DR h2oai | - | h2ogpt-oasst1-512-12b | - | - | Mar-2025
Copilot Researcher microsoft_copilot_researcher | - | o3-mini | - | - | Mar-2025
ReSearch chen2025learning | GRPO | Qwen2.5-7B-Inst, Qwen2.5-32B-Inst | 2WikiMultiHopQA | Rule-Outcome | Mar-2025
R1-Searcher song2025r1 | Reinforce++, GRPO | Qwen2.5-7B-Inst, LLaMA-3.1-8B-Inst | 2WikiMultiHopQA, HotpotQA | Rule-Outcome | Mar-2025
Search-R1 jin2025search | PPO, GRPO | Qwen2.5-3B/7B, LLaMA3.2-3B-Inst | NQ, HotpotQA | Rule-Outcome | Mar-2025
DeepResearcher zheng2025deepresearcherscalingdeepresearch | GRPO | Qwen2.5-7B-Inst | NQ, HotpotQA | Rule-Outcome | Apr-2025
Genspark Super Agent genspark | - | Mixture of Agents | - | - | Apr-2025
WebThinker Li2025webthinker | Iterative Online DPO | QwQ-32B | Expert Dataset | Rule-Outcome | Apr-2025
SWIRL goldie2025synthetic | Offline RL | Gemma-2-27B | HotpotQA | - | Apr-2025
SimpleDeepSearcher SimpleDeepSearcher | PPO | Qwen-2.5-7B-Inst, Qwen-2.5-32B-Inst, DeepSeek-Distilled-Qwen-32B, QwQ-32B | NQ, HotpotQA, 2WikiMultiHopQA, Musique, SimpleQA, MultiHop-RAG | Process-based reward | Apr-2025
PANGU DEEPDIVER shi2025pangu | GRPO | Pangu-7B-Reasoner | WebPuzzle | Rule-Outcome | May-2025
Parametric Approaches.

Prompt-based methods directly leverage the capabilities of pre-trained LLMs, enabling complex functionalities without expensive fine-tuning or additional training. However, it remains challenging to systematically optimize prompt structures and workflows. Moreover, since an agent’s performance is inherently limited by its backbone LLM, increasing the complexity of decision-making processes quickly reaches the model’s performance ceiling. To overcome these limitations, it is essential to incorporate advanced optimization techniques such as fine-tuning, reinforcement learning (RL) or hybrid training paradigms to further extend the model’s inherent capabilities. Below, we discuss the two main tuning paradigms, supervised fine-tuning (SFT) and RL, and highlight how each extends agent capabilities beyond prompt-only methods.

3.4.1 SFT-based Optimization

Prompt-based approaches, while effective for rapid adaptation, are fundamentally constrained by the intrinsic generalization capacity of backbone LLMs and often exhibit limited robustness in complex task settings. In order to address these limitations, researchers have increasingly explored fine-tuning methodologies aimed at systematically optimizing LLMs for critical components of deep research agents. These components include search query formulation, structured report generation, and external tool utilization. These efforts aim to enhance retrieval quality, mitigate hallucinations, and enable more reliable long-form and evidence-grounded generation.


An early milestone in this research direction is Open-RAG islam2024open , which augments data construction with diverse supervisory signals, including retrieval tokens, relevance tokens, grounding tokens, and utility tokens. Through adversarial training, Open-RAG improves the model’s capability to filter irrelevant information, thereby enhancing both retrieval accuracy and the quality of downstream tasks. Building upon this foundation, AUTO-RAG yu2024auto enhances the autonomous iterative retrieval capabilities of LLMs. In contrast to earlier multi-hop retrieval approaches that relied on few-shot prompting or hand-crafted templates jiang2023active ; feng2024retrieval ; wang2024llms , AUTO-RAG constructs reasoning-grounded instruction datasets, enabling models to autonomously plan retrieval queries and engage in multi-round interactions with retrievers. The model dynamically refines its retrieval strategy during generation, gathering sufficient evidence before synthesizing a final answer. Extending these retrieval-centric innovations, DeepRAG guan2025deeprag proposes a binary tree search mechanism that recursively generates sub-queries and constructs multi-turn retrieval trajectories. This mechanism enables the model to judiciously balance between internal parametric knowledge and external retrieval-based rollouts. Consequently, it enhances search efficiency and mitigates redundant external queries.


To further reduce reliance on manually constructed supervised fine-tuning (SFT) datasets, recent work has developed fine-tuning strategies based on rejection sampling. CoRAG wang2025chainofretrievalaugmentedgeneration uses rejection sampling to extract intermediate retrieval chains from standard question answering datasets, allowing stepwise retrieval augmentation and dynamic reformulation of subqueries as context evolves, instead of supervising only final outputs. Li et al. li2025start propose a hint-infer mechanism that monitors token patterns during generation and triggers external computational tools, such as Python executors or hint libraries, when specific cues are detected. After an initial supervised fine-tuning phase, the model undergoes a rejection-sampling fine-tuning process that teaches it to generate its own prompts and invoke tools autonomously, without reliance on hand-curated demonstrations. ATLAS chen2025atlas proposes a novel approach for LLM-based agents that trains exclusively on selected critical steps from expert trajectories, significantly improving generalization performance.
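The rejection-sampling recipe these methods share reduces to a simple loop: sample several trajectories per question, keep only those whose final answer verifies, and fine-tune on the survivors. The sketch below is a generic reconstruction under that assumption; `generate_trajectory` and the preference for shorter trajectories are hypothetical stand-ins, not details from the cited papers.

```python
import random

def generate_trajectory(question: str, temperature: float) -> dict:
    """Hypothetical stand-in for sampling the policy LLM with tool calls.
    Returns fake, diverse candidates purely for illustration."""
    steps = [f"search('{question}')"] * random.randint(1, 4)
    answer = random.choice(["Paris", "paris", "Lyon"])
    return {"steps": steps, "answer": answer}

def exact_match(predicted: str, gold: str) -> bool:
    return predicted.strip().lower() == gold.strip().lower()

def build_sft_dataset(qa_pairs, samples_per_question: int = 8):
    """Rejection sampling: keep only trajectories whose answer verifies."""
    dataset = []
    for question, gold in qa_pairs:
        candidates = [generate_trajectory(question, temperature=1.0)
                      for _ in range(samples_per_question)]
        accepted = [c for c in candidates if exact_match(c["answer"], gold)]
        if accepted:
            # Heuristic assumption: prefer the shortest accepted trajectory
            # to discourage redundant retrieval steps.
            dataset.append(min(accepted, key=lambda c: len(c["steps"])))
    return dataset

data = build_sft_dataset([("What is the capital of France?", "Paris")])
print(len(data), "accepted trajectories")
```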


Although these SFT methods enhance the generalization of deep research agents by supporting dynamic retrieval planning, structured information synthesis, and integrated tool use, they remain confined to offline, static retrieval pipelines characteristic of retrieval-augmented systems. In contrast, reinforcement learning offers a more adaptive solution for online query generation and tool invocation. By learning from real-time reward signals, reinforcement learning agents acquire the ability to formulate effective search queries and determine the optimal timing for tool calls. This approach addresses the limitations of synthetic demonstration data and distributional shifts, yielding more robust and adaptive performance in open-ended research environments.

3.4.2 Reinforcement Learning-based Optimization

RL-based methods optimize DR agents by directly enhancing their adaptive capabilities and generalization across diverse tasks, surpassing conventional instruction-following or pattern learning approaches. Recent advances have demonstrated that end-to-end RL training significantly strengthens iterative information retrieval, dynamic tool invocation, and integrated reasoning capabilities within DR agents. See comparative analysis in Table 3.

Early RL-based approaches such as DeepRetrieval jiang2025deepretrieval optimized query generation for improved information retrieval quality, effectively enhancing downstream text generation by producing more relevant search results. Building on query optimization, ReSearch chen2025learning extended RL to adaptive reasoning over retrieved information. The model dynamically refined search strategies and iteratively updated results based on continuous feedback, significantly improving task-solving accuracy. Subsequently, R1-Searcher song2025r1 further optimized retrieval interactions, explicitly training models to refine search strategies through carefully designed reward functions. This allowed better exploitation of external information and improved search result relevance.
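The rule-based "Rule-Outcome" rewards in this line of work typically combine a format check (are tool calls and the answer wrapped in the expected tags?) with a sparse outcome check on the final answer. The sketch below reconstructs that common pattern; the tag names and weights are illustrative, not the exact design of any one system.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the rollout uses the expected <search>/<answer> structure."""
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.S))
    balanced = response.count("<search>") == response.count("</search>")
    return 1.0 if (has_answer and balanced) else 0.0

def outcome_reward(response: str, gold: str) -> float:
    """1.0 on exact match of the extracted answer, else 0.0 (sparse reward)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if not m:
        return 0.0
    return float(m.group(1).strip().lower() == gold.strip().lower())

def rule_outcome_reward(response: str, gold: str,
                        w_format: float = 0.2, w_outcome: float = 0.8) -> float:
    return (w_format * format_reward(response)
            + w_outcome * outcome_reward(response, gold))

rollout = "<search>capital of France</search> ... <answer>Paris</answer>"
print(rule_outcome_reward(rollout, "Paris"))  # 1.0
```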

Search-R1 jin2025search advanced RL-based retrieval by structurally integrating sophisticated search interactions with complex reasoning processes. The method systematically bridged query generation and information reasoning, enabling nuanced responses through refined integration of retrieved content. This research line culminated in Agent-R1 Agent-R1 , a comprehensive DR framework integrating RL into end-to-end training of LLM agents. Agent-R1 leveraged diverse tools such as APIs, search engines, and databases, achieving autonomous multi-step task execution and dynamic tool coordination. Through RL-driven optimization across its entire pipeline, Agent-R1 demonstrated advanced capabilities in adaptive planning, iterative execution, and task refinement. Moreover, WebThinker Li2025webthinker integrates a Web Explorer module for dynamic multi-hop web exploration and employs Iterative Online Direct Preference Optimization (DPO) to seamlessly interleave search, navigation, and report drafting during reasoning, while Pangu DeepDiver shi2025pangu builds on the 7B Pangu model pretrained on Huawei’s Ascend NPUs, introducing Search Intensity Scaling (SIS) through a two-phase SFT and RL curriculum that enables adaptive adjustment of search depth and frequency in open-web environments.

Table 3 reveals three key RL implementation patterns in DR systems: 1) Industrial systems like Gemini DR geminideepresearch and Grok DeepSearch grokdeepresearch employ proprietary RL implementations with undisclosed details, 2) Academic approaches chen2025learning ; song2025r1 favor modular RL optimization using GRPO shao2024deepseekmath and Reinforce++ hu2025reinforce with transparent reward designs, and 3) Emerging hybrid systems like SimpleDeepSearcher SimpleDeepSearcher combine process-based rewards with multi-task training across 6 QA datasets. The table also highlights the prevalence of Qwen2.5 and LLaMA3 model families as preferred base architectures for RL optimization.
3 揭示了 DR 系统中的三种关键 RL 实现模式:1) Gemini DR geminideepresearch 和 Grok DeepSearch grokdeepresearch 等工业系统采用专有的 RL 实现,但细节未披露,2) 学术方法 chen2025learning ; 宋 2025R1 支持使用 GRPO shao2024deepseekmath 和 Reinforce++ hu2025reinforce 进行模块化 RL 优化,并提供透明的奖励设计,以及 3) SimpleDeepSearcher SimpleDeepSearcher 等新兴混合系统将基于流程的奖励与跨 6 个 QA 数据集的多任务训练相结合。该表还强调了 Qwen2.5 和 LLaMA3 模型系列作为 RL 优化的首选基础架构的普遍性。

Reward Model and Policy Model. Most current open-source RL implementations of DR agents, including the methods discussed above, commonly adopt rule-based reward models that explicitly define task-specific objectives such as retrieval relevance, information accuracy, or successful tool invocation. To perform policy optimization efficiently, recent systems have increasingly utilized Proximal Policy Optimization (PPO) schulman2017proximal and Group Relative Policy Optimization (GRPO) shao2024deepseekmath . In particular, GRPO fundamentally reconfigures the advantage-estimation paradigm by replacing traditional value functions with group-relative advantage computation. It expands the reward space through intra-group normalization: sparse binary rewards are transformed into continuous advantage values spanning a wider range. This expanded signal space provides richer gradient information for policy updates, as evidenced by a higher density of high-reward responses compared to PPO. In addition, GRPO provides a variance-suppression mechanism by constraining advantage estimation within dynamically clustered response groups, such as groups formed by reasoning depth or tool-usage patterns, reducing policy-gradient variance through local standardization. In contrast to PPO, GRPO eliminates the separate value network, removing conflicting optimization objectives between policy and value functions. Empirical measurements show GRPO reduces gradient-direction conflicts from 12 to 3 per training epoch, significantly accelerating convergence. As a result, GRPO outperforms conventional PPO in reward-distribution coverage, exploration capacity, and the speed of KL-divergence stabilization during alignment.
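The group-relative advantage at the heart of GRPO can be written in a few lines: sample a group of rollouts per prompt, score them, and z-score each reward within the group, A_i = (r_i - mean(r)) / std(r), instead of querying a learned value function. The sketch below shows only this estimation step; the clipped policy-gradient loss and KL penalty around it are omitted.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: z-score rewards within one prompt's sample group.
    Sparse binary rewards become graded, signed learning signals."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Eight rollouts for one prompt, scored by a sparse rule-based reward:
rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(rewards).round(2))
# Correct rollouts receive positive advantages, incorrect ones negative,
# with no separate value network required.
```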

3.5 Non-parametric Continual Learning

DR agents depend heavily on LRMs and often utilize complex hierarchical workflows. Parameter-based learning approaches such as SFT and RL encounter significant obstacles in this context, including the need to scale model parameters, manage extensive volumes of structured experience data, and design increasingly intricate training algorithms. In contrast, non-parametric continual learning approaches offer a scalable alternative: agents refine their capabilities at runtime by optimizing external memory, workflows, and tool configurations through continuous interaction with the external environment rather than by updating internal weights. This non-parametric continual learning paradigm enables efficient online adaptation with minimal data and computational overhead, making it well-suited to DR agents with complex architectures.


Non-parametric continual learning approaches, most notably case-based reasoning (CBR), are currently a mainstream method in LLM-driven agent systems. CBR-based methods enable agents to dynamically retrieve, adapt, and reuse structured problem-solving trajectories from an external case bank. Unlike traditional RAG-based methods, which rely on static databases, CBR facilitates online contextual adaptation and effective task-level generalization. Such flexibility underscores its potential as a scalable and practical optimization solution for DR agents with complex architectures. DS-Agent guo2024dsagentautomateddatascience is a pioneering LLM-driven agent that introduced CBR into automated data science workflows, employing approximate online retrieval from a constructed case bank. Similarly, LAM guo2025optimizing applies CBR techniques to functional test generation, combining trajectory-level retrieval with LLM planning in a modular system design. Although DS-Agent itself does not include a learning phase, Agent K grosnit2024largelanguagemodelsorchestrating advances this paradigm with dynamic external case retrieval and reuse guided by a reward-based memory policy, exemplifying genuine self-evolution that enables continual adaptation and optimization without updating model parameters. Focusing on DR agents, AgentRxiv schmidgall2025agentrxiv further extends this paradigm by enabling autonomous research agents to collaboratively share and access a centralized repository of prior research outputs. This framework allows LLM agent laboratories to upload and retrieve reports from a shared preprint server, simulating an online-updating arXiv-like platform that can be seen as a comprehensive case bank. Such a system empowers agents to enhance their capabilities and knowledge through contextual adaptation without modifying their model parameters.
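A minimal sketch of the retrieve-adapt-reuse loop underlying these CBR systems, operating on trajectory-level cases rather than raw text chunks. Token-overlap scoring keeps the example self-contained; the cited systems use learned retrieval and reward-based memory policies, so every name here is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    task: str                 # natural-language task description
    plan: list[str]           # the workflow that solved it
    reward: float = 0.0       # outcome score, used to rank reusable cases

@dataclass
class CaseBank:
    cases: list[Case] = field(default_factory=list)

    def add(self, case: Case) -> None:
        self.cases.append(case)

    def retrieve(self, task: str, k: int = 1) -> list[Case]:
        """Rank by token overlap with the new task, tie-broken by past reward."""
        words = set(task.lower().split())
        def score(c: Case) -> float:
            overlap = len(words & set(c.task.lower().split()))
            return overlap + 0.1 * c.reward
        return sorted(self.cases, key=score, reverse=True)[:k]

bank = CaseBank()
bank.add(Case("tabular classification on medical data",
              ["profile columns", "impute", "train gradient boosting"],
              reward=0.9))
template = bank.retrieve("classification on financial tabular data")[0]
# The agent adapts the retrieved plan in context, executes it, and writes the
# new (task, plan, reward) triple back into the bank -- no weight updates needed.
print(template.plan)
```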

Compared to prompt-based methods, which encode fixed demonstrations or task heuristics into static input templates, non-parametric methods enable dynamic retrieval and adaptation of structured trajectories, thereby facilitating continual task generalization without manual prompt engineering. Relative to RAG, which typically retrieves unstructured textual content from static corpora, CBR operates at the trajectory level and emphasizes reasoning-centered memory organization. A notable example is the Kaggle Grandmaster Agent grosnit2024largelanguagemodelsorchestrating , which demonstrates how LLMs equipped with modular reasoning components and persistent memory can achieve expert-level structured problem solving, aligning closely with the CBR paradigm. These characteristics make CBR particularly well-suited for agents requiring procedural adaptation and context-sensitive optimization across tasks. Beyond memory-based methods, self-evolution can also arise from dynamic infrastructure adaptation. For example, Alita qiu2025alita monitors task requirements and environmental signals to provision and configure new MCP servers at runtime, seamlessly extending and refining its toolset on demand.


In summary, these self-evolution paradigms in LLM-driven DR agent systems offer substantial promise for structured reasoning and dynamic retrieval, and open new pathways for efficient knowledge reuse and continual learning. Although these methods have not yet received widespread attention, they sidestep the high data and computational demands inherent to parameter-based approaches and therefore represent an attractive direction for future research and practical deployment.

Figure 5: An overview of DR agent evolution over the years.

4 Industrial Applications of Deep Research Agents

4.1 OpenAI Deep Research

OpenAI recently introduced its DR capability openai2025deepresearch , employing a single-agent architecture centered around a reinforcement learning-based, fine-tuned o3 reasoning model. Upon receiving a research query, the system initiates a concise interactive clarification step to accurately define user intent and research objectives. It then autonomously formulates and executes a sophisticated, multi-step research strategy, encompassing multimodal information retrieval, web browsing, and computational tasks such as data analysis and visualization through browser tools. Technologically, this solution delivers three significant advancements: (1) A dynamically adaptive iterative research workflow: Capable of refining its strategy throughout task execution. (2) Enhanced context memory and robust multimodal processing capabilities: Facilitating effective integration of diverse information sources. (3) Comprehensive toolchain integration: Combining web browsing capabilities with built-in programming tools to produce structured, authoritative reports supported by precise citations.

4.2 Gemini Deep Research

Google DeepMind recently introduced Gemini DR geminideepresearch , an advanced DR agent based on its multimodal Gemini 2.0 Flash Thinking model. Gemini’s reinforcement learning-driven fine-tuning, facilitated by a single-agent architecture, has been shown to enhance planning and adaptive research capabilities, enabling the system to autonomously and expeditiously complete complex tasks. Technologically, this solution delivers four significant advancements: (1) Interactive Research Planning: Upon receiving a research query, Gemini autonomously formulates a multi-step investigation plan open to interactive user review and modification. (2) Asynchronous Task Management: Adopts an asynchronous task-management architecture to efficiently handle multiple simultaneous tasks. (3) Large-scale context windows with RAG ensembles: Enabling effective management and coherent synthesis of multimodal data (e.g. text, images) for in-depth professional research analysis. (4) High-speed adaptive retrieval: Implements fast, multi-round adaptive web search that significantly outperforms other agents in retrieval speed and the amount of information gathered per iteration.

4.3 Perplexity Deep Research

Perplexity’s recently developed DR agent perplexitydeepresearch has demonstrated an advanced capability to decompose complex queries into well-defined subtasks. The system is capable of conducting targeted web searches iteratively, critically evaluating authoritative sources, and synthesizing structured, comprehensive reports. Technologically, this solution delivers two significant advancements: (1) Iterative Information Retrieval: Conducts successive rounds of targeted web searches with dynamic adjustments based on interim insights, ensuring comprehensive information coverage and accuracy. (2) Dynamic Prompt-Guided Model Selection: Uses a hybrid architecture to autonomously select the optimal combination of specialized models based on the requirements and context of specific tasks, thereby enhancing adaptability and effectiveness across research scenarios.

4.4 Grok DeepSearch

Grok DeepSearch grokdeepresearch , developed by xAI, is a computational framework that combines real-time information retrieval with multimodal reasoning to dynamically solve complex, information-rich problems. Technologically, this solution delivers two significant advancements: (1) Segment-level module processing pipeline: Upon receiving a query, Grok3 initiates the credibility assessment module to identify and filter out low-quality information. The system’s real-time data acquisition engine then gathers multimodal inputs (e.g. text, images, and code) from various sources. Employing a sparse attention mechanism, the system executes key reasoning subtasks, including data cleaning, cross-source verification, and multimodal integration, concurrently. Finally, the iterative optimization process culminates in structured outputs encompassing analysis summaries, advanced visualizations (e.g. 3D trajectories), and verifiable citations. (2) Dynamic resource allocation: The system adaptively alternates between lightweight retrieval and intensive analysis modes, further augmented by a secure sandbox environment for computational verification.

4.5 Microsoft Copilot Researcher and Analyst

Microsoft recently introduced two innovative reasoning agents within Microsoft 365 Copilot: Researcher and Analyst spataro_introducing_2025 . These agents securely and compliantly access users’ work data (such as emails, meeting notes, documents, and chats) as well as web information, delivering on-demand expert knowledge.

Researcher is designed to assist users in tackling complex, multi-step research tasks, delivering insights with unprecedented quality and accuracy. It combines OpenAI’s advanced research models with Microsoft 365 Copilot’s sophisticated orchestration and deep search capabilities. Users can employ Researcher to craft detailed market entry strategies, identify market opportunities for new products by integrating internal and external data, or prepare comprehensive quarterly reports for client reviews. Additionally, Researcher enhances its insights through connectors to third-party data sources such as Salesforce, ServiceNow, and Confluence.

Analyst is built as an advanced data analytics agent that rapidly transforms raw data into valuable insights within minutes. It leverages OpenAI’s o3-mini inference model, specifically optimized for advanced analytical tasks in professional environments. Analyst uses a chain-of-thought reasoning approach, solving problems step-by-step, generating high-quality responses that closely mirror human analytical thinking.

4.6 Qwen Deep Research

Alibaba Qwen recently launched Qwen Deep Research, an advanced research agent powered by its flagship multimodal model Qwen3-235B-A22B. Through reinforcement learning-optimized task scheduling within a unified agent framework, the system demonstrates enhanced autonomous planning and adaptive execution capabilities, enabling rapid completion of complex research workflows. Key technological advancements include: (1) Dynamic Research Blueprinting with interactive plan refinement. (2) Concurrent Task Orchestration enabling parallel retrieval, validation, and synthesis.


In addition to the pioneering DR services previously discussed, major technology corporations such as Microsoft and ByteDance, alongside emerging startups including Jina AI jina_deepsearch , H2O h2oai , and Zhipu AI zhipu2025autoglm , have also introduced their proprietary DR platforms. The advent of these solutions has spurred considerable global interest, reflected by their rapid proliferation, thereby underscoring both the technological attractiveness and substantial market potential of DR applications. Looking forward, continuous advancements in LLM reasoning, retrieval integration techniques, and multimodal generation are expected to enable DR agents to transcend traditional information retrieval and basic tool invocation tasks. Consequently, DR systems are anticipated to tackle increasingly sophisticated reasoning and complex knowledge-construction challenges, ultimately positioning DR as a foundational technological pillar for next-generation intelligent collaborative research platforms.

5 Benchmarks for DR Agents

Evaluating DR agents requires benchmarks that capture their full research workflow, including multi-step information retrieval, cross-source synthesis, dynamic tool invocation, and structured evidence-grounded report generation. Existing evaluations fall into two main categories. Question-Answering (QA) benchmarks range from single-turn factual queries to complex research-style problems, assessing agents’ factual knowledge, domain-specific reasoning, and ability to locate and integrate relevant information. Task Execution benchmarks evaluate broader capabilities such as long-horizon planning, multimodal understanding, tool usage, and environment interaction by measuring how well agents carry out end-to-end research tasks. Although long-form generation datasets such as Qasper dasigi2021dataset and ELI5 fan2019eli5 provide tests of extended output coherence, their free-form nature does not align with the structured evidence-based reporting expected of DR agents. Consequently, there is a pressing need for specialized benchmarks that reflect the multi-stage, multimodal characteristics of DR workflows and ensure rigorous and relevant assessment of agent performance across all phases of autonomous research.

Table 4: Performance of DR agents on major QA benchmarks. The best performance is highlighted in bold, and the second-best is indicated with an underline. "–" = not reported.

| DR Agent | Base Model | Hotpot | 2Wiki | NQ | TQ | GPQA | Release |
|---|---|---|---|---|---|---|---|
| Search-o1 li2025search | QwQ-32B-preview | 57.3 | 71.4 | 49.7 | 74.1 | 57.9 | Jan-2025 |
| Agentic Reasoning wu2025agentic | DeepSeek-R1, Qwen2.5 | – | – | – | – | 67.0 | Feb-2025 |
| Grok DeepSearch grokdeepresearch | Grok3 | – | – | – | – | 84.6 | Feb-2025 |
| AgentRxiv schmidgall2025agentrxiv | GPT-4o-mini | – | – | – | – | 41.0 | Mar-2025 |
| R1-Searcher song2025r1 | Qwen2.5-7B-Base | 71.9 | 63.8 | – | – | – | Mar-2025 |
| ReSearch chen2025learning | Qwen2.5-7B-Base | 30.0 | 29.7 | – | – | – | Mar-2025 |
| ReSearch chen2025learning | Qwen2.5-7B-Inst | 63.6 | 54.2 | – | – | – | Mar-2025 |
| ReSearch chen2025learning | Qwen2.5-32B-Base | 64.3 | 45.6 | – | – | – | Mar-2025 |
| ReSearch chen2025learning | Qwen2.5-32B-Inst | 67.7 | 50.0 | – | – | – | Mar-2025 |
| Search-R1 jin2025search | Llama3.2-3B-Base | 30.0 | 29.7 | 43.1 | 61.2 | – | Mar-2025 |
| Search-R1 jin2025search | Llama3.2-3B-Inst | 31.4 | 23.3 | 35.7 | 57.8 | – | Mar-2025 |
| Search-R1 jin2025search | Qwen2.5-7B-Base | 28.3 | 27.3 | 39.6 | 58.2 | – | Mar-2025 |
| Search-R1 jin2025search | Qwen2.5-7B-Inst | 34.5 | 36.9 | 40.9 | 55.2 | – | Mar-2025 |
| DeepResearcher zheng2025deepresearcherscalingdeepresearch | Qwen2.5-7B-Inst | 64.3 | 66.6 | 61.9 | 85.0 | – | Apr-2025 |
| WebThinker Li2025webthinker | QwQ-32B | – | – | – | – | 68.7 | Apr-2025 |
| SimpleDeepSearch SimpleDeepSearcher | Qwen2.5-7B-Inst | – | – | – | – | 68.1 | Apr-2025 |
| SimpleDeepSearch SimpleDeepSearcher | Qwen2.5-32B-Inst | – | – | – | – | 70.5 | Apr-2025 |
| SimpleDeepSearch SimpleDeepSearcher | DeepSeek-R1-Distill-Qwen-32B | – | – | – | – | 68.1 | Apr-2025 |
| SimpleDeepSearch SimpleDeepSearcher | QwQ-32B | – | – | – | – | 73.5 | Apr-2025 |
| SWIRL goldie2025synthetic | Gemma-2-27B | 72.0 | – | – | – | – | Apr-2025 |
Table 5: Performance of DR agents on the GAIA test set, the GAIA validation (dev) set, and the HLE benchmark. The best performance is highlighted in bold, and the second-best is indicated with an underline. "–" = not reported.

| DR Agent | Split | Base Model | GAIA Level-1 | Level-2 | Level-3 | Avg. | HLE | Release |
|---|---|---|---|---|---|---|---|---|
| MMAC-Copilot song2024mmac | Test | GPT-3.5, GPT-4 | 45.16 | 20.75 | 6.12 | 25.91 | – | Mar-2024 |
| H2O.ai DR h2oai | Test | Claude3.7-Sonnet | 89.25 | 79.87 | 61.22 | 79.73 | – | Mar-2025 |
| Alita qiu2025alita | Test | Claude-Sonnet-4, GPT-4o | 92.47 | 71.7 | 55.1 | 75.42 | – | May-2025 |
| AutoAgent tang2025autoagentfullyautomatedzerocodeframework | Dev | Claude-Sonnet-3.5 | 71.7 | 53.5 | 26.9 | 55.2 | – | Feb-2025 |
| OpenAI DR openai2025deepresearch | Dev | GPT-o3-customized | 78.7 | 73.2 | 58.0 | 67.4 | 26.6 | Feb-2025 |
| Perplexity DR perplexitydeepresearch | Dev | Flexible | – | – | – | – | 21.1 | Feb-2025 |
| Manus manus2025 | Dev | Claude3.5, GPT-4o | 86.5 | 70.1 | 57.7 | 71.4 | – | Mar-2025 |
| OWL owl2025 | Dev | Claude-3.7-Sonnet | 84.9 | 68.6 | 42.3 | 69.7 | – | Mar-2025 |
| H2O.ai DR h2oai | Dev | h2ogpt-oasst1-512-12b | 67.92 | 67.44 | 42.31 | 63.64 | – | Mar-2025 |
| Genspark Super Agent genspark | Dev | Claude 3 Opus | 87.8 | 72.7 | 58.8 | 73.1 | – | Apr-2025 |
| WebThinker Li2025webthinker | Dev | QwQ-32B | 53.8 | 44.2 | 16.7 | 44.7 | 13.0 | Apr-2025 |
| SimpleDeepSearch SimpleDeepSearcher | Dev | QwQ-32B | 50.5 | 45.8 | 13.8 | 43.9 | – | Apr-2025 |
| Alita qiu2025alita | Dev | Claude-Sonnet-4, GPT-4o | – | – | – | 75.15 / 87.27 | – | May-2025 |
QA Benchmarks.
Table 6: Overview of nine widely used QA benchmark datasets employed in recent DR-agent studies. The first group covers single-hop QA tasks, while the second group focuses on multi-hop and multi-turn reasoning.

| Benchmark | Release | Size | Task & Context | Domain | Hops |
|---|---|---|---|---|---|
| TriviaQA joshi2017triviaqalargescaledistantly | 2017 | 95k | Single-hop retrieval (long web/Wiki docs) | Open | 1 |
| Natural Questions kwiatkowski2019natural | 2019 | 307k | Document answer extraction (full Wikipedia article) | Open | 1 |
| PopQA mallen2023trustlanguagemodelsinvestigating | 2023 | 14k | Single-hop parametric recall (none) | Open | 1 |
| TELEQnA maatouk2023teleqnabenchmarkdatasetassess | 2023 | 10k | Domain factual QA (telecom standards/articles) | Telecom | 1 |
| SimpleQA wei2024measuringshortformfactualitylarge | 2024 | 4.3k | Single-hop factual recall (none / parametric) | Open | 1 |
| HotpotQA yang2018hotpotqadatasetdiverseexplainable | 2018 | 113k | Multi-hop reasoning (2 Wikipedia paragraphs) | Open | 2 |
| 2WikiMultihopQA ho2020constructingmultihopqadataset | 2020 | 192k | Multi-hop reasoning (retrieval across Wikipedia) | Open | 2+ |
| Bamboogle aksitov2023restmeetsreactselfimprovement | 2023 | 125 | Compositional reasoning (online search) | Open | 2–3 |
| Humanity’s Last Exam phan2025humanity | 2025 | 2.5k | Expert-level multi-turn (mixed external sources) | Multi-discipline | 2+ |

QA benchmarks span a spectrum of complexity, from simple factual recall to multi-hop reasoning and research-style question answering. At the lower end, datasets such as SimpleQA wei2024measuringshortformfactualitylarge , TriviaQA joshi2017triviaqalargescaledistantly , and PopQA mallen2023trustlanguagemodelsinvestigating focus on parametric or single-hop factual recall, evaluating whether models can retrieve short factual answers from memory or minimal context. Natural Questions (NQ) kwiatkowski2019natural and TELEQnA maatouk2023teleqnabenchmarkdatasetassess add complexity by requiring answer extraction from long documents or domain-specific sources. Benchmarks like HotpotQA yang2018hotpotqadatasetdiverseexplainable , 2WikiMultihopQA ho2020constructingmultihopqadataset , and Bamboogle aksitov2023restmeetsreactselfimprovement emphasize multi-hop reasoning and supporting-evidence selection across documents. At the highest level of difficulty lies Humanity’s Last Exam (HLE) phan2025humanity , which targets expert-level, open-domain scientific questions crafted by leading professors across fields; these questions often require multi-turn retrieval, complex inference, and even multimodal understanding. Additionally, BrowseComp browsercomp is another challenging benchmark, proposed by OpenAI to measure the ability of AI agents to locate hard-to-find information. It retains the answer verifiability of the SimpleQA benchmark while filtering out questions that LLMs can easily solve with web search, thus testing agents’ information retrieval and synthesis capabilities. Despite recent advances, leading DR agents still perform well below human experts on HLE and BrowseComp, marking these two benchmarks as the most critical unresolved challenges in DR agent evaluation.
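Most of the QA suites above score agents with normalized exact match, plus token-level F1 on the multi-hop sets. The sketch below reproduces the standard SQuAD-style normalization they broadly share; individual benchmarks differ in details, so treat it as the common pattern rather than any suite's official scorer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1("Paris, France", "Paris"), 2))           # 0.67
```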

Task Execution Benchmarks.

Task execution benchmarks evaluate an agent’s integrated capabilities in tool use, environment perception, and information filtering. These can be grouped into two subcategories. The first category comprises general-purpose assistant tasks such as GAIA mialon2023gaia , AssistantBench yoran2024assistantbenchwebagentssolve , and Magentic-One fourney2024magenticonegeneralistmultiagentsolving . These tasks require agents to plan and execute tool-based workflows (for example, searching, browsing, or form filling) within environments that are open-ended and often web-based. Among them, GAIA has emerged as the most important benchmark, offering diverse, realistic tasks that are easily human-solvable but remain highly challenging for current agents. The second subcategory focuses on research and code-oriented tasks, including SWE-bench jimenez2024swebenchlanguagemodelsresolve , HumanEvalFix muennighoff2024octopackinstructiontuningcode , MLGym nathani2025mlgymnewframeworkbenchmark , MLE-bench chan2025mlebenchevaluatingmachinelearning , MLBench tang2024mlbenchevaluatinglargelanguage , MLAgentBench huang2024mlagentbenchevaluatinglanguageagents , and ScienceAgentBench chen2025scienceagentbenchrigorousassessmentlanguage , which test agents on completing machine learning pipelines, repairing real-world code, or replicating scientific experiments. These tasks require long-horizon planning, precise tool invocation, and often code generation and validation. Additionally, benchmarks like RE-Bench wijk2024rebenchevaluatingfrontierai and RESEARCHTOWN yu2024researchtownsimulatorhumanresearch simulate multi-agent research environments, evaluating how well agents collaborate and iterate in multi-role scientific workflows.


As DR agents continue to integrate more interactive tools, future evaluation may expand into GUI-based manipulation environments. Benchmarks such as OSWorld xie2024osworldbenchmarkingmultimodalagents , WebArena zhou2024webarenarealisticwebenvironment , and SpaBench chen2025spabenchcomprehensivebenchmarksmartphone allow agents to control applications or web interfaces directly, opening new avenues for testing embodied research capabilities in realistic, user-facing scenarios.

6 Challenge and Future Directions

Despite the rapid evolution of DR agents and their demonstrated efficacy in automating multi-step information discovery and synthesis, two overarching challenges persist, defining the roadmap for future innovation. First, the breadth and depth of accessible information remain tightly constrained by reliance on static knowledge repositories or conventional search interfaces. Second, the efficiency and robustness of execution workflows and system architectures are limited by linear planning paradigms and monolithic agent designs. Addressing these challenges will be critical to enabling DR agents to function as truly autonomous, adaptable research assistants capable of navigating complex, heterogeneous data landscapes and orchestrating high-throughput, parallelized reasoning processes.

Broadening Information Sources.

To meet the information needs of complex tasks, current DR agents either adopt static knowledge bases (as in RAG methods) or rely exclusively on search engines and browsers; the former is insufficient, while the latter is confined to publicly available web content, significantly constraining information-acquisition capabilities. This inherent limitation renders them incapable of retrieving information concealed behind applications, proprietary interfaces, or specialised databases. For example, conventional browsing and search techniques cannot penetrate enterprise software, mobile applications, or subscription-only services such as the Bloomberg Terminal, thereby precluding access to critical, real-time market intelligence. To surmount this limitation, it is imperative to integrate a more granular and extensive range of modular tools via MCPs. This approach enables agents to dynamically access specialised tools and resources beyond the scope of standard browsers or search engines, including proprietary applications, databases, and APIs, thereby facilitating retrieval of previously inaccessible data. Consequently, DR agents can deliver more precise, adaptive, and context-aware interactions, effectively fulfilling diverse and complex user requirements.

Following the integration of proprietary APIs and databases, the rate-limiting factor in the workflow shifts from data acquisition to webpage-interaction efficiency. Conventional human-centred browsers create a further bottleneck for agents. Because they optimise for visual rendering rather than programmatic control, they suffer from sluggish page loads, fragile element locators that shift with every layout change, and aggressive anti-bot defences that often break automated sessions. These shortcomings translate into high latency, unstable scraping, and limited parallelism whenever DR agents try to harvest data at scale. To address this bottleneck, researchers have begun to design AI-native browsers such as Browserbase browserbase2024 , Browser Use browser_use2024 , Dia, Fellou fellou-2025 , and Comet comet-perplexity-2025 from Perplexity. These browsers expose a stable, structured DOM view that agents can traverse programmatically browserbase2024 ; browser_use2024 ; comet-perplexity-2025 , and several supply explicit API hooks for clicking elements and filling forms browserbase2024 ; fellou-2025 , removing the need for brittle coordinate-based actions. Browserbase further executes pages asynchronously in headless containers, reducing load-time variance and avoiding the overhead of a visible interface, and embeds a vision-language model that tracks dynamic page changes and automatically resolves log-in gates and anti-bot challenges browserbase2024 . Browser Use and Comet coordinate dozens of tabs in parallel, allowing DR agents to interact with private dashboards, single-page applications, and interactive visualisations at scale browser_use2024 ; comet-perplexity-2025 . In combination, these capabilities eliminate the delays and fragility that arise when conventional, human-centred browsers sit between the agent and newly unlocked proprietary data sources.
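For contrast with human-centred browsing, the sketch below uses the open-source Playwright library (not one of the AI-native browsers above) to load a page headlessly and read structured DOM content programmatically, which is the style of access those products expose natively. It assumes Playwright and its browser binaries are installed; the URL is a placeholder.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # no visible interface
    page = browser.new_page()
    page.goto("https://example.com")             # placeholder URL
    title = page.title()
    # Structured access to the DOM instead of pixel coordinates:
    links = page.eval_on_selector_all("a", "els => els.map(e => e.href)")
    body_text = page.inner_text("body")
    browser.close()

print(title, len(links), len(body_text))
```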

Fact Checking.

To further boost factual accuracy, the latest methods add a structured verification loop and self-reflection abilities on top of multi-step retrieval. Concretely, once an agent has drafted a preliminary answer, it does not rush to deliver a verdict. Instead, it proactively launches cross-checks: it looks for independent sources that confirm the same fact and searches for evidence of contradictions. Grok DeepSearch, for example, follows this strategy: it rates the credibility of every source, inspects consistency through as many as seven layers of depth, and verifies each key claim across multiple origins grokdeepresearch . This multi-source cross-validation sharply reduces single-source errors and raises answer reliability. At the same time, agents have begun to reflect on their own reasoning. During inference, they inspect and test intermediate results, much like a human researcher’s reflective thinking. Zhipu’s Rumination model zhipu2025autoglm , for instance, pauses after concluding, keeps searching to check whether that conclusion holds, and only then finalizes the answer. Such introspection is typically encouraged by adding correctness-oriented rewards in reinforcement learning. If the model detects conflict or uncertainty, it replans its retrieval strategy and, when necessary, backtracks to revise earlier inferences openai2025deepresearch . Through this blend of structured verification and self-reflection, research agents now attain an unprecedented level of rigor in fact-checking: they not only supply an answer but also explain why it is trustworthy, dramatically lowering factual errors and hallucinations. In short, modern agents can lay out a search plan, adapt queries as intermediate evidence comes in, and, where needed, rewind prior steps to recover missing information openai2025deepresearch .
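Schematically, the verification loop wraps every draft claim in a cross-checking pass before it reaches the report. In the sketch below, `search` and `llm_judge_consistency` are hypothetical stand-ins for the agent's retrieval tool and its self-reflection step, and the thresholds are arbitrary.

```python
def verify_claim(claim: str, search, llm_judge_consistency,
                 min_confirmations: int = 2, max_rounds: int = 3) -> dict:
    """Cross-check one claim against independent sources before reporting it."""
    confirmations, contradictions = [], []
    for round_id in range(max_rounds):
        # Reformulate the query each round to reach independent sources.
        results = search(f"{claim} (verification round {round_id})")
        for source in results:
            verdict = llm_judge_consistency(claim, source)  # "support" / "contradict" / "unrelated"
            if verdict == "support":
                confirmations.append(source)
            elif verdict == "contradict":
                contradictions.append(source)
        if contradictions:
            # Conflict detected: the agent should replan retrieval and
            # backtrack to revise the draft instead of emitting the claim.
            return {"claim": claim, "status": "conflicted",
                    "evidence": confirmations + contradictions}
        if len(confirmations) >= min_confirmations:
            return {"claim": claim, "status": "verified",
                    "evidence": confirmations}
    return {"claim": claim, "status": "unverified", "evidence": confirmations}

# Toy demonstration with fake tools:
fake_search = lambda q: ["source A", "source B"]
fake_judge = lambda claim, src: "support"
print(verify_claim("GRPO removes PPO's value network",
                   fake_search, fake_judge)["status"])  # verified
```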

Asynchronous Parallel Execution.

To address the limitation that most existing DR agents rely exclusively on linear task planning, i.e. the sequential execution of subtasks, we introduce two possible methodologies. These methods overcome the inherent efficiency and robustness constraints of purely linear strategies and enable both the exploitation of parallelism and the implementation of dynamic adjustments during task execution. Firstly, an asynchronous, parallel architecture leveraging advanced task-modeling structures, such as directed acyclic graphs (DAGs), presents a promising future direction which could enable parallel execution and dynamic prioritisation of subtasks, effectively managing complex interdependencies among tasks and facilitating potentially sophisticated planning capabilities such as replanning. Secondly, a learned scheduling agent, trained via reinforcement learning to allocate subtasks and adjust execution order based on runtime performance signals (e.g. execution latency), could be proposed. By treating scheduling decisions as actions in an RL environment, the agent progressively discovers policies that balance parallelism, resource utilisation, and task criticality, yielding more robust and efficient end-to-end research workflows.
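A minimal sketch of the DAG idea, assuming subtasks are async callables with declared dependencies: every subtask launches immediately and blocks only on its own parents, so independent branches run concurrently instead of in a fixed linear order. A learned scheduler would replace this fixed dependency-driven policy with one conditioned on runtime signals; cycle detection is omitted.

```python
import asyncio

async def run_dag(tasks: dict) -> dict:
    """tasks maps name -> (list of dependency names, async fn taking a dict
    of parent results). All tasks are created up front, so each one blocks
    only on its declared parents."""
    running = {}

    async def run(name: str):
        deps, fn = tasks[name]
        parents = {d: await running[d] for d in deps}
        return await fn(parents)

    for name in tasks:                       # eager launch of every subtask
        running[name] = asyncio.create_task(run(name))
    return {name: await task for name, task in running.items()}

async def search(_):  await asyncio.sleep(0.1); return "retrieved docs"
async def analyze(_): await asyncio.sleep(0.1); return "data table"
async def report(p):  return f"report from {p['search']} + {p['analyze']}"

results = asyncio.run(run_dag({
    "search":  ([], search),
    "analyze": ([], analyze),                    # overlaps with "search"
    "report":  (["search", "analyze"], report),  # waits for both parents
}))
print(results["report"])
```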

Tool-Integrated Reasoning.

A fundamental challenge in developing effective DR agents lies in the implementation of Tool-Integrated Reasoning (TIR), a paradigm that extends beyond simple tool usage to encompass complex, multi-step reasoning with dynamic tool integration. TIR requires agents to not only invoke appropriate tools in logical sequence but also to adaptively adjust their reasoning pathways based on intermediate results. Traditional supervised fine-tuning approaches have demonstrated limited generalization capabilities in tool-based reasoning tasks, often leading to over-reasoning or inappropriate tool selection. Recent research by qiancheng2025toolrl has shown that reinforcement learning frameworks with carefully designed reward structures can significantly enhance models’ tool reasoning abilities. By incorporating fine-grained rewards that evaluate not only final answer correctness but also tool selection appropriateness, parameter specification accuracy, and reasoning efficiency, TIR-optimized agents have demonstrated performance improvements of 15-17% across multiple benchmarks. Furthermore, these agents exhibit superior generalization to unseen tools and tasks, more rational invocation patterns, and better balance between tool utilization and self-knowledge. Implementing TIR effectively within DR agents represents a critical step toward achieving truly autonomous research assistants capable of navigating complex information landscapes with minimal human intervention.
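In the spirit of the fine-grained rewards described by qiancheng2025toolrl , the sketch below decomposes a rollout's reward into answer correctness, tool-selection appropriateness, parameter validity, and an efficiency penalty on unnecessary calls. The rollout format, field names, and weights are illustrative assumptions, not the paper's exact design.

```python
def tir_reward(rollout: dict, gold_answer: str, allowed_tools: dict) -> float:
    """rollout = {"answer": str, "calls": [{"tool": str, "args": dict}, ...]}
    allowed_tools maps tool name -> set of required argument names."""
    r_answer = float(rollout["answer"].strip().lower()
                     == gold_answer.strip().lower())

    calls = rollout["calls"]
    valid_selection = [c for c in calls if c["tool"] in allowed_tools]
    r_select = len(valid_selection) / len(calls) if calls else 0.0

    # Parameter reward: fraction of valid calls whose args cover the schema.
    ok = [c for c in valid_selection
          if set(c["args"]) >= allowed_tools[c["tool"]]]
    r_params = len(ok) / len(valid_selection) if valid_selection else 0.0

    r_efficiency = -0.05 * max(0, len(calls) - 4)  # discourage over-reasoning

    return 0.6 * r_answer + 0.2 * r_select + 0.2 * r_params + r_efficiency

rollout = {"answer": "Paris",
           "calls": [{"tool": "search", "args": {"query": "capital of France"}}]}
print(tir_reward(rollout, "Paris", {"search": {"query"}}))  # 1.0
```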

Benchmark Misalignment.

Most public DR evaluations remain anchored in traditional QA suites whose items are harvested chiefly from static corpora such as Wikipedia. Since a considerable amount of this content is now embedded in backbone model parameters, current competitive agents can often answer directly from memory, bypassing any research procedure and thus inflating their performance. To probe genuine capabilities of retrieval, reasoning and tool usage, the field of DR urgently needs open-web, time-sensitive benchmarks. From this perspective, BrowseComp browsercomp constitutes a meaningful step forward by filtering out questions solvable with parametric knowledge and forcing agents to locate hard-to-find information online. Besides, a complementary direction is a continually refreshed leaderboard that updates problems from the latest web environment and events, deterring benchmark hacking through parametric memorisation.


Beyond parametric-knowledge hacking of QA benchmarks, the metrics of most existing DR research still collapse open-ended research workflows into narrowly scoped QA prompts or rudimentary GUI-control tasks, overlooking the paradigm’s defining outcome: a structured, multi-modal research report that weaves together textual narrative, tables, figures, and citations. Since the metrics of these benchmarks centre almost exclusively on information retrieval, extraction, and tool invocation, they under-assess higher-level competencies such as evidence aggregation across heterogeneous sources, cross-modal synthesis, and discourse-level organization. Thus, a key research priority is the development of comprehensive benchmarks that evaluate DR agents’ capacity for end-to-end report generation, encompassing long-form narrative, integrated tables and figures, and multimodal coherence, thereby assessing factual accuracy, discourse structure, and cross-modal alignment within a single task.

Parametric Optimization of Multi-Agent Architectures.

End-to-end RL has been demonstrated, by OpenAI openai2025deepresearch and Agent-R1 Agent-R1 among others, to significantly enhance the reasoning capabilities of backbone models for DR tasks, a result successfully replicated by several open-source initiatives. However, current implementations predominantly utilize single-agent architectures, requiring the backbone model to simultaneously manage planning, tool invocation, and report generation. This multitasking places excessive computational and cognitive demands on backbone models, thereby reducing their efficiency and robustness. Distributing workloads across multiple specialized agents has shown promising improvements in system performance wang2024mobile , yet achieving effective end-to-end training and efficient coordination among multiple agents remains a critical open challenge.

To optimize multi-agent architectures for DR tasks, we propose two promising future directions: (i) adopting hierarchical reinforcement learning (HRL), which introduces layered internal reward mechanisms that facilitate efficient feedback propagation and foster cooperative learning among agents; or implementing a post-training optimization pipeline consisting of multiple refinement stages specifically tailored for DR tasks, which could iteratively enhance inter-agent interactions and thus improve overall system stability and adaptability; and (ii) employing an RL-based dedicated scheduling agent designed to dynamically allocate subtasks and adjust execution order based on real-time performance metrics. By modeling scheduling decisions as actions within an RL framework, this method progressively learns adaptive policies that optimally balance parallel execution, resource utilization, and task prioritization, enhancing both the robustness and efficiency of end-to-end research workflows.
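To illustrate direction (ii), the toy Q-learning sketch below casts scheduling as an MDP: the state is the set of pending subtasks, the action is which subtask to dispatch next, and the reward favors high task value per unit cost. The subtask table, reward shape, and hyperparameters are illustrative assumptions, not a proposal from any cited system.

    import random
    from collections import defaultdict

    # Toy subtasks: name -> (expected_cost, value). A real scheduler would
    # derive these features from live performance metrics.
    SUBTASKS = {"search": (2, 3.0), "code": (4, 5.0), "report": (1, 2.0)}

    Q = defaultdict(float)  # Q[(frozenset_of_pending_subtasks, action)]

    def step(pending, action):
        # Dispatch one subtask; the reward is its value per unit cost.
        cost, value = SUBTASKS[action]
        return pending - {action}, value / cost

    def train(episodes=2000, eps=0.1, alpha=0.5, gamma=0.9):
        for _ in range(episodes):
            pending = frozenset(SUBTASKS)
            while pending:
                acts = list(pending)
                if random.random() < eps:
                    a = random.choice(acts)                       # explore
                else:
                    a = max(acts, key=lambda x: Q[(pending, x)])  # exploit
                nxt, r = step(pending, a)
                best_next = max((Q[(nxt, b)] for b in nxt), default=0.0)
                Q[(pending, a)] += alpha * (r + gamma * best_next - Q[(pending, a)])
                pending = nxt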

Self-Evolving Language Model Agents.

Although initial attempts at self-evolution methods for DR agents have emerged, exemplified by simulated collaborative platforms such as AgentRxiv schmidgall2025agentrxiv that facilitate online sharing and reuse of structured research experiences, the paradigm remains underdeveloped and narrowly focused on the case-based reasoning paradigm. Similarly, CycleResearcher weng2024cycleresearcher enables simulation of the entire research process (research-evaluation-refinement) through iterative preference learning with a robust verifier zhu2025deepreview, representing a significant step toward fully automated scientific inquiry and sharing a similar self-evolution concept with AlphaEvolve novikov2025alphaevolve.

To fully realize the potential of self-evolution in DR agents, future research should expand self-evolution methods along two complementary directions. (i) Comprehensive case-based reasoning frameworks. Case-based reasoning approaches aamodt1994case leverage hierarchical experience traces, including planning trajectories and structured tool-invocation logs, and employ advanced retrieval and selection mechanisms to enable fine-grained, context-specific adaptation. (ii) Autonomous workflow evolution promises enhanced efficiency and flexibility. By representing agent workflows as mutable structures such as trees or graphs, researchers can apply evolutionary algorithms or adaptive graph optimization to explore, modify, and refine execution plans dynamically, as sketched below. Pursuing both directions in tandem will strengthen framework robustness and reduce reliance on data and computational resources.
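As a concrete instance of direction (ii), the sketch below encodes a workflow as a small DAG and applies one mutation operator (splicing an unused stage into an existing edge) inside a plain evolutionary loop. The stage vocabulary and fitness function are stand-ins chosen for illustration, not a fixed DR-agent design.

    import copy
    import random

    # A workflow as a DAG: stage -> list of successor stages.
    SEED = {"plan": ["search"], "search": ["synthesize"], "synthesize": []}
    STAGES = ["plan", "search", "verify", "synthesize"]

    def mutate(wf):
        # Splice one unused stage into a randomly chosen edge src -> dst.
        wf = copy.deepcopy(wf)
        missing = [s for s in STAGES if s not in wf]
        sources = [n for n in wf if wf[n]]
        if missing and sources:
            src = random.choice(sources)
            dst = random.choice(wf[src])
            stage = random.choice(missing)
            wf[src][wf[src].index(dst)] = stage
            wf[stage] = [dst]
        return wf

    def fitness(wf):
        # Toy objective: reward a verification stage, penalize workflow
        # length as a stand-in for execution cost.
        return (1.0 if "verify" in wf else 0.0) - 0.1 * len(wf)

    def evolve(generations=50, pop_size=8):
        pop = [SEED] + [mutate(SEED) for _ in range(pop_size - 1)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            pop = pop[: pop_size // 2]  # truncation selection
            pop += [mutate(random.choice(pop)) for _ in range(pop_size - len(pop))]
        return max(pop, key=fitness)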

7 Conclusion

LLM-driven Deep Research Agents represent an emerging paradigm for automated research support, integrating advanced techniques such as iterative information retrieval, long-form content generation, autonomous planning, and sophisticated tool utilization. In this survey, we systematically reviewed recent advancements in DR agents, categorizing existing methodologies into prompt-based, fine-tuning-based, and reinforcement-learning-based approaches from the perspectives of information retrieval and report generation. Prompt-based, non-parametric methods use LLMs with carefully designed prompts to achieve efficient and cost-effective deployment, making them suitable for rapid prototyping. In contrast, fine-tuning and reinforcement-learning approaches explicitly optimize model parameters, significantly enhancing the agents’ reasoning and decision-making capabilities. We also examined prominent DR agent systems developed by industry leaders and discussed their technical implementations, strengths, and limitations.

Limitation

Despite notable progress, key challenges remain, including limited generalization across diverse tasks, inflexible task workflows, difficulty in integrating granular external tools, and the substantial computational complexity associated with advanced planning and optimization. Future research directions thus emphasize broader and more flexible tool integration through modular capability providers (e.g., Operator-based architectures), the development of asynchronous and parallel planning frameworks (e.g., Directed-Acyclic-Graph-based approaches), and sophisticated end-to-end optimization methods for multi-agent architectures, such as hierarchical reinforcement learning and multi-stage fine-tuning pipelines. With continued advancements in LLM technologies, DR agents have significant potential to transform complex research workflows, enhance human productivity, and drive innovation across academic and industrial domains.

References

  • [1] Agnar Aamodt and Enric Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI communications, 7(1):39–59, 1994.
  • [2] Adam McQuilkin, Anirudh Kamath, Sean McGuire, and Sophie Chance. Browserbase: A web browser for your ai, 2024.
  • [3] Jina AI. Jina ai. https://jina.ai/deepsearch/, 2025. Accessed: 2025-04-28.
  • [4] Kortix AI. Suna: Open source generalist ai agent. https://github.com/kortix-ai/suna, 2025. Accessed: 2025-05-28.
  • [5] Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, and Sanjiv Kumar. Rest meets react: Self-improvement for multi-step reasoning llm agent, 2023.
  • [6] Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dalton, and Leif Azzopardi. Trec ikat 2023: The interactive knowledge assistance track overview. arXiv preprint arXiv:2401.01330, 2024.
  • [7] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023.
  • [8] Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. Longalign: A recipe for long context alignment of large language models. arXiv preprint arXiv:2401.18058, 2024.
  • [9] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.
  • [10] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • [11] CAMEL-AI.org. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation. https://github.com/camel-ai/owl, 2025. Accessed: 2025-03-07.
  • [12] Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, Taeyoon Kwon, Minjin Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, and Jinyoung Yeo. Dialogue chain-of-thought distillation for commonsense-aware conversational agents. arXiv preprint arXiv:2310.09343, 2023.
  • [13] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. Mle-bench: Evaluating machine learning agents on machine learning engineering, 2025.
  • [14] Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, and Kun Shao. Spa-bench: A comprehensive benchmark for smartphone agent evaluation. arXiv preprint arXiv:2410.15164, 2025.
  • [15] Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Fan Yang, Zenan Zhou, Weipeng Chen, Haofen Wang, Jeff Z Pan, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025.
  • [16] Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, and Jiaxin Mao. Improving retrieval-augmented generation through multi-agent reinforcement learning. arXiv preprint arXiv:2501.15228, 2025.
  • [17] Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, and Tianyi Zhou. Atlas: Agent tuning via learning critical steps. arXiv preprint arXiv:2503.02197, 2025.
  • [18] Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2025.
  • [19] DanielWalnut. Deerflow. https://github.com/bytedance/deer-flow, 2025. Accessed: 2025-05-28.
  • [20] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.
  • [21] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. Eli5: Long form question answering. arXiv preprint arXiv:1907.09190, 2019.
  • [22] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501, 2024.
  • [23] Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, and Bing Qin. Retrieval-generation synergy augmented large language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11661–11665. IEEE, 2024.
  • [24] Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468, 2024.
  • [25] Tong Fu, Liquan Chen, Zhangjie Fu, Kunliang Yu, and Yu Wang. Ccnet: Cnn model with channel attention and convolutional pooling mechanism for spatial image steganalysis. Journal of Visual Communication and Image Representation, 88:103633, 2022.
  • [26] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023.
  • [27] Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D Manning. Synthetic data generation & multi-step rl for reasoning & tool use. arXiv preprint arXiv:2504.04736, 2025.
  • [28] Peiyuan Gong, Jiamian Li, and Jiaxin Mao. Cosearchagent: a lightweight collaborative search agent with large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2729–2733, 2024.
  • [29] Google. Announcing the Agent2Agent Protocol (A2A), 2025. Accessed: 2025-04-22.
  • [30] Google Team. Introducing gemini deep research. https://gemini.google/overview/deep-research/, 2025. Accessed: 2025-04-06.
  • [31] Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, and Jun Wang. Large language models orchestrating structured reasoning achieve kaggle grandmaster level, 2024.
  • [32] Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. Deeprag: Thinking to retrieval step by step for large language models. arXiv preprint arXiv:2502.01142, 2025.
  • [33] Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning, 2024.
  • [34] Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, and Jun Wang. Optimizing case-based reasoning system for functional test script generation with large language models. arXiv preprint arXiv:2503.20576, 2025.
  • [35] H2O.ai. H2o.ai, 2025. Accessed: 2025-04-28.
  • [36] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.
  • [37] Sheryl Hsu, Omar Khattab, Chelsea Finn, and Archit Sharma. Grounding by trying: Llms with reinforcement learning-enhanced retrieval. arXiv preprint arXiv:2410.23214, 2024.
  • [38] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.
  • [39] Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Julia Hockenmaier, and Tong Zhang. Rag-rl: Advancing retrieval-augmented generation via rl and curriculum learning. arXiv preprint arXiv:2503.12759, 2025.
  • [40] Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2024.
  • [41] Tatsuro Inaba, Hirokazu Kiyomaru, Fei Cheng, and Sadao Kurohashi. Multitool-cot: Gpt-3 can use multiple external tools with chain of thought prompting. arXiv preprint arXiv:2305.16896, 2023.
  • [42] Shayekh Bin Islam, Md Asib Rahman, KSM Hossain, Enamul Hoque, Shafiq Joty, and Md Rizwan Parvez. Open-rag: Enhanced retrieval-augmented reasoning with open-source large language models. arXiv preprint arXiv:2410.01782, 2024.
  • [43] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43, 2023.
  • [44] Pengcheng Jiang. Deepretrieval: Powerful query generation for information retrieval with reinforcement learning. arXiv preprint arXiv:2503.00223, 2025.
  • [45] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023.
  • [46] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2024.
  • [47] Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
  • [48] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
  • [49] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 996–1009, 2023.
  • [50] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
  • [51] Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, and Dayiheng Liu. Start: Self-taught reasoner with tools. arXiv preprint arXiv:2503.04625, 2025.
  • [52] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366, 2025.
  • [53] Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. https://www.notion.so/WebThinker-Empowering-Large-Reasoning-Models-with-Deep-Research-Capability-d13158a27d924a4b9df7f9ab94066b64, 2025. Notion Blog.
  • [54] Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, and Sirui Hong. Openmanus: An open-source framework for building general ai agents. https://github.com/mannaandpoem/OpenManus, 2025. Accessed: 2025-04-06.
  • [55] Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, and Jiebo Luo. Facilitating long context understanding via supervised chain-of-thought reasoning. arXiv preprint arXiv:2502.13127, 2025.
  • [56] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Richard James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.
  • [57] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.
  • [58] Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, and Zhi-Quan Luo. Teleqna: A benchmark dataset to assess large language models telecommunications knowledge, 2023.
  • [59] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, 2023.
  • [60] Manus AI. Leave it to manus. https://manus.im/, 2025. Accessed: 2025-04-06.
  • [61] Martin. Agenticseek: Private, local manus alternative. https://github.com/Fosowl/agenticSeek, 2025. Accessed: 2025-05-28.
  • [62] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.
  • [63] Microsoft. Introducing researcher and analyst in microsoft 365 copilot. https://www.microsoft.com/en-us/microsoft-365/blog/2025/03/25/introducing-researcher-and-analyst-in-microsoft-365-copilot/, March 2025. Accessed: 2025-04-28.
  • [64] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2024.
  • [65] Magnus Müller and Gregor Žunič. Browser use: Enable ai to control your browser, 2024.
  • [66] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  • [67] Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. Mlgym: A new framework and benchmark for advancing ai research agents. arXiv preprint arXiv:2502.14499, 2025.
  • [68] Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. Google DeepMind, 2025.
  • [69] OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, 2025. Accessed: 2025-04-06.
  • [70] OpenAI Team. Browsecomp: a benchmark for browsing agents. https://openai.com/index/browsecomp/, 2025. Accessed: 2025-04-29.
  • [71] Jie Ouyang, Ruiran Yan, Yucong Luo, Mingyue Cheng, Qi Liu, Zirui Liu, Shuo Yu, and Daoyu Wang. Training powerful llm agents with end-to-end reinforcement learning. https://github.com/0russwest0/Agent-R1, 2025. Accessed: 2025-04-06.
  • [72] Perplexity Team. Introducing perplexity deep research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research, 2025. Accessed: 2025-04-06.
  • [73] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025.
  • [74] Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs, 2025.
  • [75] Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, et al. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution. arXiv preprint arXiv:2505.20286, 2025.
  • [76] Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. Frontiers of Computer Science, 19(8):198343, 2025.
  • [77] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
  • [78] Samuel Schmidgall and Michael Moor. Agentrxiv: Towards collaborative autonomous research. arXiv preprint arXiv:2503.18102, 2025.
  • [79] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants. arXiv preprint arXiv:2501.04227, 2025.
  • [80] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [81] Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam. Assisting in writing wikipedia-like articles from scratch with large language models. arXiv preprint arXiv:2402.14207, 2024.
  • [82] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • [83] Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Damien Graux, Dandan Tu, Zeren Jiang, Ruofei Lai, Yang Ren, et al. Gear: Graph-enhanced agent for retrieval-augmented generation. arXiv preprint arXiv:2412.18431, 2024.
  • [84] Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lifeng Shang, Fisher Yu, et al. Pangu deepdiver: Adaptive search intensity scaling via open-web reinforcement learning. arXiv preprint arXiv:2505.24332, 2025.
  • [85] Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag. arXiv preprint arXiv:2501.09136, 2025.
  • [86] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025.
  • [87] Zirui Song, Yaohang Li, Meng Fang, Zhenhao Chen, Zecheng Shi, Yuan Huang, and Ling Chen. Mmac-copilot: Multi-modal agent collaboration operating system copilot. arXiv preprint arXiv:2404.18074, 2024.
  • [88] Jared Spataro. Introducing Researcher and Analyst in Microsoft 365 Copilot, March 2025.
  • [89] Md Arafat Sultan, Jatin Ganhotra, and Ramón Fernandez Astudillo. Structured chain-of-thought prompting for few-shot generation of content-grounded qa conversations. arXiv preprint arXiv:2402.11770, 2024.
  • [90] Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Lei Fang, Zhongyuan Wang, Wayne Xin Zhao, and Ji-Rong Wen. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis. 2025.
  • [91] Jiabin Tang, Tianyu Fan, and Chao Huang. Autoagent: A fully-automated and zero-code framework for llm agents, 2025.
  • [92] Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. Ml-bench: Evaluating large language models and agents for machine learning tasks on repository-level code. arXiv preprint arXiv:2311.09835, 2024.
  • [93] Fellou AI Team. Fellou: The world’s first agentic browser, 2025.
  • [94] Genspark Team. Genspark super agent with enhancements in mixture of agents, 2025.
  • [95] Perplexity Team. Comet: A browser for agentic search by perplexity, 2024.
  • [96] Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.01014, 2024.
  • [97] Keheng Wang, Feiyu Duan, Peiguang Li, Sirui Wang, and Xunliang Cai. Llms know what they need: Leveraging a missing information guided framework to empower retrieval-augmented generation. arXiv preprint arXiv:2404.14043, 2024.
  • [98] Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain-of-retrieval augmented generation. arXiv preprint arXiv:2501.14342, 2025.
  • [99] Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Armaghan Eshaghi. Beyond the limits: A survey of techniques to extend the context length in large language models. arXiv preprint arXiv:2402.02244, 2024.
  • [100] Yaoxiang Wang, Zhiyong Wu, Junfeng Yao, and Jinsong Su. Tdag: A multi-agent framework based on dynamic task decomposition and agent generation. Neural Networks, page 107200, 2025.
  • [101] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024.
  • [102] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • [103] Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review. arXiv preprint arXiv:2411.00816, 2024.
  • [104] Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024.
  • [105] Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research. arXiv preprint arXiv:2502.04644, 2025.
  • [106] Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec, and James Zou. Avatar: Optimizing llm agents for tool usage via contrastive reasoning. arXiv preprint arXiv:2406.11200, 2024.
  • [107] xAI Team. Introducing grok deepsearch. https://x.ai/news/grok-3, 2025. Accessed: 2025-04-06.
  • [108] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.
  • [109] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
  • [110] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023.
  • [111] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024.
  • [112] Haofei Yu, Zhaochen Hong, Zirui Cheng, Kunlun Zhu, Keyang Xuan, Jinwei Yao, Tao Feng, and Jiaxuan You. Researchtown: Simulator of human research community. arXiv preprint arXiv:2412.17767, 2024.
  • [113] Tian Yu, Shaolei Zhang, and Yang Feng. Auto-rag: Autonomous retrieval-augmented generation for large language models. arXiv preprint arXiv:2411.19443, 2024.
  • [114] Saber Zerhoudi and Michael Granitzer. Personarag: Enhancing retrieval-augmented generation systems with user-centric agents. arXiv preprint arXiv:2407.09394, 2024.
  • [115] Liang Zhang, Katherine Jijo, Spurthi Setty, Eden Chung, Fatima Javid, Natan Vidra, and Tommy Clifford. Enhancing large language model performance to answer questions and extract information more accurately. arXiv preprint arXiv:2402.01722, 2024.
  • [116] Zhebin Zhang, Xinyu Zhang, Yuanhang Ren, Saijiang Shi, Meng Han, Yongkang Wu, Ruofei Lai, and Zhao Cao. Iag: Induction-augmented generation framework for answering reasoning questions. arXiv preprint arXiv:2311.18397, 2023.
  • [117] Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025.
  • [118] Yuxiang Zheng, Shichao Sun, Lin Qiu, Dongyu Ru, Cheng Jiayang, Xuefeng Li, Jifan Lin, Binjie Wang, Yun Luo, Renjie Pan, et al. Openresearcher: Unleashing ai for accelerated scientific research. arXiv preprint arXiv:2408.06941, 2024.
  • [119] Zhipu AI. Autoglm rumination. https://autoglm-research.zhipuai.cn/, 2025. Accessed: 2025-04-06.
  • [120] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2024.
  • [121] Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. Deepreview: Improving llm-based paper review with human-like deep thinking process. arXiv preprint arXiv:2503.08569, 2025.