
A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications

RENJUN XU* and JINGWEN PENG, Zhejiang University, China

This survey examines the rapidly evolving field of Deep Research systems: AI-powered applications that automate complex research workflows through the integration of large language models, advanced information retrieval, and autonomous reasoning capabilities. We analyze more than 80 commercial and non-commercial implementations that have emerged since 2023, including OpenAI/DeepResearch, Gemini/DeepResearch, Perplexity/DeepResearch, and numerous open-source alternatives. Through comprehensive examination, we propose a novel hierarchical taxonomy that categorizes systems according to four fundamental technical dimensions: foundation models and reasoning engines, tool utilization and environmental interaction, task planning and execution control, and knowledge synthesis and output generation. We explore the architectural patterns, implementation approaches, and domain-specific adaptations that characterize these systems across academic, scientific, business, and educational applications. Our analysis reveals both the significant capabilities of current implementations and the technical and ethical challenges they present regarding information accuracy, privacy, intellectual property, and accessibility. The survey concludes by identifying promising research directions in advanced reasoning architectures, multimodal integration, domain specialization, human-AI collaboration, and ecosystem standardization that will likely shape the future evolution of this transformative technology. By providing a comprehensive framework for understanding Deep Research systems, this survey contributes to both the theoretical understanding of AI-augmented knowledge work and the practical development of more capable, responsible, and accessible research technologies. The paper resources can be viewed at https://github.com/scienceaix/deepresearch.
CCS Concepts: • Computing methodologies → Artificial intelligence; Natural language processing; • Computer systems organization → Embedded and cyber-physical systems; • Information systems → Information retrieval; • Human-centered computing → Collaborative and social computing.
Additional Key Words and Phrases: Deep Research, Large Language Models, Autonomous Agents, AI Systems, Research Automation, Information Retrieval, Knowledge Synthesis, Human-AI Collaboration, Multi-Agent Systems, Tool-Using Agents

Contents

Abstract
1 Introduction
1.1 Definition and Scope of Deep Research
1.2 Historical Context and Technical Evolution
1.3 Significance and Practical Implications
1.4 Research Questions and Contribution of this Survey
2 The Evolution and Technical Framework of Deep Research
2.1 Foundation Models and Reasoning Engines: Evolution and Advances
2.2 Tool Utilization and Environmental Interaction: Evolution and Advances
2.3 Task Planning and Execution Control: Evolution and Advances
2.4 Knowledge Synthesis and Output Generation: Evolution and Advances
3 Comparative Analysis and Evaluation of Deep Research Systems
3.1 Cross-Dimensional Technical Comparison
3.2 Application-Based System Suitability Analysis
3.3 Performance Metrics and Benchmarking
4 Implementation Technologies and Challenges
4.1 Architectural Implementation Patterns
4.2 Infrastructure and Computational Optimization
4.3 System Integration and Interoperability
4.4 Technical Challenges and Solutions
5 Evaluation Methodologies and Benchmarks
5.1 Functional Evaluation Frameworks
5.2 Non-Functional Evaluation Metrics
5.3 Cross-Domain Evaluation Benchmarks
5.4 Emerging Evaluation Approaches
5.5 Comparative Evaluation Methodology
6 Applications and Use Cases
6.1 Academic Research Applications
6.2 Scientific Discovery Applications
6.3 Business Intelligence Applications
6.4 Financial Analysis Applications
6.5 Educational Applications
6.6 Personal Knowledge Management Applications
7 Ethical Considerations and Limitations
7.1 Information Accuracy and Hallucination Concerns
7.2 Privacy and Data Security
7.3 Source Attribution and Intellectual Property
7.4 Accessibility and Digital Divide
8 Future Research Directions
8.1 Advanced Reasoning Architectures
8.2 Multi-Modal Deep Research
8.3 Domain-Specific Optimization
8.4 Human-AI Collaboration and Standardization
9 Conclusion
9.1 Key Findings and Contributions
9.2 Limitations and Outlook
9.3 Broader Implications
9.4 Final Thoughts
References

1 Introduction

The rapid advancement of artificial intelligence has precipitated a paradigm shift in how knowledge is discovered, validated, and utilized across academic and industrial domains. Traditional research methodologies, reliant on manual literature reviews, experimental design, and data analysis, are increasingly supplemented, and in some cases supplanted, by intelligent systems capable of automating end-to-end research workflows. This evolution has given rise to a novel domain we term “Deep Research”, which signifies the convergence of large language models (LLMs), advanced information retrieval systems, and automated reasoning frameworks to redefine the boundaries of scholarly inquiry and practical problem-solving.

1.1 Definition and Scope of Deep Research

Deep Research refers to the systematic application of AI technologies to automate and enhance research processes through three core dimensions:

(1) Intelligent Knowledge Discovery: Automating literature search, hypothesis generation, and pattern recognition across heterogeneous data sources

(2) End-to-End Workflow Automation: Integrating experimental design, data collection, analysis, and result interpretation into unified AI-driven pipelines

(3) Collaborative Intelligence Enhancement: Facilitating human-AI collaboration through natural language interfaces, visualizations, and dynamic knowledge representation
To clearly delineate the boundaries of Deep Research, we distinguish it from adjacent AI systems as follows:
  • Differentiating from General AI Assistants: While general AI assistants like ChatGPT can answer research questions, they lack the autonomous workflow capabilities, specialized research tools, and end-to-end research orchestration that define Deep Research systems. Recent surveys have highlighted this crucial distinction between specialized research systems and general AI capabilities [73, 76], with particular emphasis on how domain-specific tools fundamentally transform research workflows compared to general-purpose assistants [213, 318].
  • Differentiating from Single-Function Research Tools: Specialized tools like citation managers, literature search engines, or statistical analysis packages address isolated research functions but lack the integrated reasoning and cross-functional orchestration of Deep Research systems. Tools like scispace [242] and You.com [313] represent earlier attempts at research assistance but lack the end-to-end capabilities that define true Deep Research systems.
  • Differentiating from Pure LLM Applications: Applications that simply wrap LLMs with research-oriented prompts lack the environmental interaction, tool integration, and workflow automation capabilities that characterize true Deep Research systems.
This survey specifically examines systems that exhibit at least two of the three core dimensions, with a focus on those incorporating large language models as their foundational reasoning engine. Our scope encompasses commercial offerings such as OpenAI/DeepResearch [197], Google’s Gemini/DeepResearch [89], and Perplexity/DeepResearch [209], alongside open-source implementations including dzhng/deepresearch [321], HKUDS/Auto-Deep-Research [112], and numerous others detailed in subsequent sections. We exclude purely bibliometric tools and single-stage automation systems lacking integrated cognitive capabilities, such as the research assistance tools Elicit [74], ResearchRabbit [228], and Consensus [63], and citation tools like Scite [243]. Specialized tools like STORM [278], which focuses on scientific text retrieval and organization, are valuable but lack the end-to-end deep research capabilities central to our survey scope.

1.2 Historical Context and Technical Evolution

The trajectory of Deep Research can be mapped through three evolutionary stages that reflect both technological advancements and implementation approaches:

1.2.1 Origin and Early Exploration (2023 - February 2025). Workflow automation frameworks such as n8n [183] and QwenLM/Qwen-Agent [224] existed well before the deep research boom; their early establishment shows that the field did not arise solely with the emergence of deep research but drew on a more diverse and earlier-rooted technological base. The concept of Deep Research itself emerged from the shift of AI assistants toward intelligent agents. In December 2024, Google Gemini pioneered this functionality with its initial Deep Research implementation, focusing on basic multi-step reasoning and knowledge integration [60]. This phase laid the groundwork for subsequent, more sophisticated AI-driven research tools. Many of these advances built upon agent frameworks such as AutoGPT [250] and BabyAGI [311] that had already established foundations for autonomous task execution. Other early contributions to this ecosystem include cline2024 [61], which pioneered integrated research workflows, and open_operator [36], which developed foundational browser automation capabilities essential for web-based research.

1.2.2 Technological Breakthrough and Competitive Rivalry (February - March 2025). The rise of DeepSeek’s open-source models [68] revolutionized the market with efficient reasoning and cost-effective solutions. In February 2025, OpenAI’s release of Deep Research marked a significant leap forward [197]. Powered by the o3 model, it demonstrated advanced capabilities such as autonomous research planning, cross-domain analysis, and high-quality report generation, achieving accuracy rates exceeding previous benchmarks on complex tasks. Concurrently, Perplexity launched its free-to-use Deep Research in February 2025 [209], emphasizing rapid response and accessibility to capture the mass market. Open-source projects such as nickscamara/open-deepresearch [42], mshumer/OpenDeepResearcher [249], btahir_open_deep_research [37], and GPT-researcher [16] emerged as community-driven alternatives to commercial platforms. The ecosystem continued to expand with lightweight implementations like Automated-AI-Web-Researcher-Ollama [267], designed for local execution with limited resources, and modular frameworks such as Langchain-AI/Open_deep_research [131] that provide composable components for custom research workflows.

1.2.3 Ecosystem Expansion and Multi-modal Integration (March 2025 - Present). The third stage is characterized by the maturation of a diverse ecosystem. Open-source projects like Jina-AI/node-DeepResearch [121] enable localized deployment and customization, while commercial closed-source versions from OpenAI and Google continue to push boundaries with multi-modal support and multi-agent collaboration capabilities. The integration of advanced search technologies and report generation frameworks further enhances utility across academic research, financial analysis, and other fields. Meanwhile, platforms such as Manus [164], AutoGLM-Research [330], MGX [171], and Devin [62] are incorporating advanced AI research capabilities to enhance their services. Concurrently, Anthropic launched Claude/Research [13] in April 2025, introducing agentic search capabilities that systematically explore multiple angles of a query and deliver comprehensive answers with verifiable citations. Agent frameworks such as OpenManus [193], Camel-AI/OWL [43], and TARS [39] further expand the ecosystem with specialized capabilities and domain-specific optimizations.

Fig. 1. Evolution Timeline of Deep Research Systems (2024-2025)

1.3 Significance and Practical Implications

Deep Research demonstrates transformative potential across multiple domains:

(1) Academic Innovation: Accelerating hypothesis validation through automated literature synthesis (e.g., HotpotQA [307] performance benchmarks) and enabling researchers to explore broader interdisciplinary connections that might otherwise remain undiscovered. The transformative potential of Deep Research extends beyond individual applications to fundamentally reshape scientific discovery processes. As Sourati and Evans [256] argue, human-aware artificial intelligence can significantly accelerate science by augmenting researchers’ capabilities while adapting to their conceptual frameworks and methodological approaches. This human-AI synergy represents a fundamental shift from traditional automation toward collaborative intelligence that respects and enhances human scientific intuition. Complementary work by Khalili and Bouchachia [128] further demonstrates how systematic approaches to building science discovery machines can transform hypothesis generation, experimental design, and theory refinement through integrated AI-driven research workflows.

(2) Enterprise Transformation: Enabling data-driven decision-making at scale through systems like Agent-RL/ReSearch [2] and smolagents/open_deep_research [115] that can analyze market trends, competitive landscapes, and strategic opportunities with unprecedented depth and efficiency.

(3) Democratization of Knowledge: Reducing barriers to entry through open-source implementations like grapeot/deep_research_agent [263] and OpenManus [193], making sophisticated research capabilities accessible to individuals and organizations regardless of technical expertise or resource constraints.

1.4 Research Questions and Contribution of this Survey

This survey addresses three fundamental questions:

(1) How do architectural choices (system architecture, implementation approach, functional capabilities) impact Deep Research effectiveness?

(2) What technical innovations have emerged in LLM fine-tuning, retrieval mechanisms, and workflow orchestration across the spectrum of Deep Research implementations?

(3) How do existing systems balance performance, usability, and ethical considerations, and what patterns emerge from comparing approaches like those of n8n [183] and OpenAI/AgentsSDK [199]?
Our contributions manifest in three dimensions:

(1) Methodological: Proposing a novel taxonomy categorizing systems by their technical architecture, from foundation models to knowledge synthesis capabilities

(2) Analytical: Conducting comparative analysis of representative systems across evaluation metrics, highlighting the strengths and limitations of different approaches

(3) Practical: Identifying key challenges and formulating a roadmap for future development, with specific attention to emerging architectures and integration opportunities
The remainder of this paper follows a structured exploration beginning with the evolution and technical framework of Deep Research (Section 2), followed by comparative analysis and evaluation of systems (Section 3), implementation technologies and challenges (Section 4), evaluation methodologies and benchmarks (Section 5), applications and use cases (Section 6), ethical considerations and limitations (Section 7), future research directions (Section 8), and conclusions (Section 9).

2 The Evolution and Technical Framework of Deep Research

This section presents a comprehensive technical taxonomy for understanding Deep Research systems, organized around four fundamental technological capabilities that define these systems. For each capability, we examine the evolutionary trajectory and technical innovations while highlighting representative implementations that exemplify each approach.

2.1 Foundation Models and Reasoning Engines: Evolution and Advances

The foundation of Deep Research systems lies in their underlying AI models and reasoning capabilities, which have evolved from general-purpose language models to specialized research-oriented architectures.

2.1.1 From General-Purpose LLMs to Specialized Research Models. The progression from general LLMs to research-specialized models represents a fundamental shift in deep research capabilities:

Fig. 2. Hierarchical Technical Framework of Deep Research Systems
Technical Evolution Trajectory. Early implementations relied on general-purpose LLMs with minimal task-specific optimization. Current systems feature models specifically enhanced for research tasks through architectural modifications, specialized training corpora, and fine-tuning regimes focused on analytical and reasoning capabilities. The transition from models like GPT-4 to OpenAI’s o3 demonstrates significant improvements in abstraction, multi-step reasoning, and knowledge integration capabilities essential for complex research tasks [198, 200].
Representative Systems. OpenAI/DeepResearch [197] exemplifies this evolution with its o3-based model optimized specifically for web browsing and data analysis. The system leverages chain-of-thought and tree-of-thought reasoning techniques to navigate complex information landscapes. Google’s Gemini/DeepResearch [60] similarly employs Gemini 2.5 Pro with enhanced reasoning capabilities and a million-token context window to process extensive information. These approaches build upon foundational work in reasoning enhancement techniques such as chain-of-thought prompting [291], self-consistency [287], and human preference alignment [205], adapted specifically for research-intensive tasks. In the open-source domain, AutoGLM-Research [330] demonstrates how specialized training regimes can optimize existing models like ChatGLM for research-intensive tasks, achieving significant performance gains through targeted enhancements to reasoning components.
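The self-consistency technique [287] referenced above can be sketched independently of any particular model: sample several reasoning paths for the same question and keep the majority answer. This is a minimal illustration, not any vendor's implementation; `sample_path` is a hypothetical stand-in that returns canned outcomes where a real system would sample an LLM at non-zero temperature.

```python
from collections import Counter

def sample_path(question, sample_idx):
    """Hypothetical stand-in for one sampled chain-of-thought model call.

    A real system would query an LLM at non-zero temperature and parse the
    final answer out of the generated reasoning trace; canned outcomes are
    used here so the voting logic is visible and deterministic.
    """
    canned = ["42", "42", "41", "42", "42"]  # one path makes an arithmetic slip
    return canned[sample_idx % len(canned)]

def self_consistency(question, n_samples=5):
    """Self-consistency: majority vote over independently sampled paths."""
    answers = [sample_path(question, i) for i in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

answer, agreement = self_consistency("What is 6 * 7?")
print(answer, agreement)  # → 42 0.8
```

The agreement ratio doubles as a crude confidence estimate: low agreement across sampled paths is a signal that the question deserves further retrieval or verification.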

2.1.2 Context Understanding and Memory Mechanisms. The ability to process, retain, and utilize extensive contextual information represents a crucial advancement in Deep Research systems:
Technical Evolution Trajectory. Early systems struggled with limited context windows, hampering their ability to synthesize information from multiple sources. Contemporary implementations employ sophisticated memory management techniques including episodic buffers, hierarchical compression, and attention-based retrieval mechanisms that extend effective context far beyond model limitations. The million-token context windows of models like Grok 3 [299] and Gemini 2.5 Pro [60], along with the context optimization in OpenAI’s o3 model [195], have dramatically expanded the information processing capabilities of these systems. Advanced systems now distinguish between working memory (active reasoning context) and long-term memory (knowledge repository), allowing for more human-like research processes.
Representative Systems. Perplexity/DeepResearch [209] has pioneered efficient context processing by leveraging DeepSeek-R1’s capabilities while implementing proprietary mechanisms for structured information management. The system can analyze hundreds of sources while maintaining coherent reasoning threads. Similarly, Camel-AI/OWL [43] employs an innovative open-weight approach to memory management, allowing for dynamic allocation of attention resources based on information relevance and task requirements. Both systems demonstrate how effective memory architectures can significantly enhance research performance even with comparable base model capabilities.
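The working-memory / long-term-memory split described above can be sketched as a bounded buffer for the active reasoning context backed by a searchable store. This is a minimal sketch under stated assumptions: the class and method names are illustrative (not Perplexity's or OWL's actual interfaces), and the token-overlap relevance score stands in for the embedding similarity a production system would use.

```python
from collections import deque

class ResearchMemory:
    """Two-tier memory: bounded working buffer plus searchable long-term store."""

    def __init__(self, working_capacity=4):
        self.working = deque(maxlen=working_capacity)  # active reasoning context
        self.long_term = []                            # full knowledge repository

    def observe(self, note):
        """Record a finding in both tiers; old notes fall out of working memory."""
        self.working.append(note)
        self.long_term.append(note)

    def recall(self, query, k=2):
        """Retrieve long-term notes by naive token-overlap relevance.

        A production system would rank by embedding similarity instead.
        """
        terms = set(query.lower().split())
        scored = sorted(
            self.long_term,
            key=lambda n: len(terms & set(n.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def context(self, query):
        """Assemble a prompt context: working memory plus relevant recalls."""
        recalled = [n for n in self.recall(query) if n not in self.working]
        return list(self.working) + recalled

mem = ResearchMemory(working_capacity=2)
for note in [
    "GPT-4 context window is smaller than Gemini 2.5 Pro",
    "Perplexity DeepResearch builds on DeepSeek-R1",
    "Episodic buffers extend effective context",
    "Attention-based retrieval selects relevant memories",
]:
    mem.observe(note)

print(mem.context("retrieval of relevant context"))
```

Note how the oldest findings survive only in long-term memory and re-enter the context solely when a query makes them relevant, which is the core mechanism that lets such systems analyze hundreds of sources without exceeding model context limits.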

2.1.3 Enhancements in Reasoning Capabilities. Advanced reasoning mechanisms distinguish modern Deep Research systems from conventional LLM applications:
Technical Evolution Trajectory. Early implementations relied primarily on zero-shot or few-shot prompting for reasoning tasks. Current systems integrate explicit reasoning frameworks including chain-of-thought, tree-of-thought, and graph-based reasoning architectures. Recent work by Lang et al. [132] demonstrates how debate-driven reasoning can facilitate weak-to-strong generalization, enabling more robust performance on complex research tasks through structured argumentative processes. These approaches implement reasoning patterns that more closely mirror human scientific discourse, with explicit representation of alternative viewpoints and structured evaluation of competing hypotheses. Advanced implementations like OpenAI’s o3 incorporate self-critique, uncertainty estimation, and recursive reasoning refinement [198, 200]. This evolution enables increasingly sophisticated forms of evidence evaluation, hypothesis testing, and knowledge synthesis essential for high-quality research outputs.
Representative Systems. QwenLM/Qwen-Agent [224] exemplifies advanced reasoning capabilities through its specialized toolkit integration and modular reasoning framework. The system employs a multi-stage reasoning process with explicit planning, information gathering, analysis, and synthesis phases optimized for research workflows. Similar capabilities are evident in smolagents/open_deep_research [115], which implements a flexible reasoning architecture that can adapt to different research domains and methodologies. Systems like CycleResearcher [294] demonstrate how integrating automated review processes into research workflows can enhance accuracy through structured feedback loops; these approaches implement explicit verification steps that identify potential errors and inconsistencies before generating final research outputs. The application of AI to complex domains like mathematics further illustrates this progress: models are increasingly viewed from a cognitive science perspective to enhance their reasoning abilities [320], achieving notable milestones such as silver-medal performance on International Mathematical Olympiad problems [7]. These systems highlight how reasoning enhancements can dramatically improve research quality without requiring the largest or most computationally intensive base models.
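The tree-of-thought pattern mentioned in this subsection can be sketched as a small beam search over partial reasoning states. All names here are illustrative assumptions: `expand` stands in for the model proposing candidate next thoughts, and `score` stands in for the model's self-evaluation of each partial path; the "gather → analyze → synthesize" ordering is a toy objective, not a claim about any surveyed system.

```python
def expand(state):
    """Hypothetical thought generator: propose next-step extensions of a
    partial reasoning path. A real system would ask the model for these."""
    return [state + [step] for step in ("gather", "analyze", "synthesize")]

def score(state):
    """Hypothetical self-evaluation: reward matching the canonical research
    order. A real system would ask the model to critique each partial path."""
    ideal = ["gather", "analyze", "synthesize"]
    return sum(1 for got, want in zip(state, ideal) if got == want)

def tree_of_thought(depth=3, beam_width=2):
    """Breadth-first search over reasoning paths, keeping only the best few
    candidates (the beam) at each depth before expanding further."""
    frontier = [[]]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return frontier[0]

print(tree_of_thought())  # → ['gather', 'analyze', 'synthesize']
```

The beam width is the key lever: width 1 collapses to greedy chain-of-thought, while wider beams keep alternative hypotheses alive long enough for later evidence to promote them, mirroring the structured evaluation of competing hypotheses described above.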

2.2 Tool Utilization and Environmental Interaction: Evolution and Advances

Deep Research systems must effectively interact with external environments to gather and process information, representing a fundamental capability beyond core language model functions [144].

2.2.1 Web Interaction Technology Development. The ability to navigate and extract information from the web represents a foundational capability for deep research:
Technical Evolution Trajectory. Initial implementations relied on simple API-based search queries with limited interaction capabilities. Current systems employ sophisticated web navigation including dynamic content handling, authentication management, and interactive element manipulation. Advanced implementations feature semantic understanding of web structures, allowing for adaptive information extraction and multi-page navigation flows. This evolution has dramatically expanded access to web-based information sources and the ability to extract insights from complex web environments.
Representative Systems. Nanobrowser [184] represents a purpose-built browser environment designed specifically for AI agent use, offering optimized rendering and interaction capabilities for research tasks. It enables fine-grained control of web navigation while maintaining security and performance. Similarly, AutoGLM [330] demonstrates sophisticated GUI interaction capabilities across both web and mobile interfaces, allowing it to access information through interfaces designed for human use. These systems showcase how specialized web interaction technologies can significantly expand the information gathering capabilities of Deep Research systems.
代表性系统。Nanobrowser [184] 是一款专为 AI 代理使用而设计的浏览器环境,为研究任务提供优化的渲染和交互能力。它在保持安全性和性能的同时,实现了对网页导航的精细控制。同样,AutoGLM [330] 展示了跨网页和移动界面的复杂 GUI 交互能力,使其能够通过为人机交互设计的界面获取信息。这些系统展示了专业化的网页交互技术如何显著扩展深度研究系统的信息收集能力。
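The navigation loop such systems implement can be sketched in a few lines. The `LinkExtractor` and `crawl_plan` names below are hypothetical illustrations, not any cited system's API; a real agent browser such as Nanobrowser also renders JavaScript and manages sessions, while this stdlib-only sketch handles static HTML:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects hyperlinks and visible text from a static HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._texts = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self._texts.append(data.strip())

    @property
    def text(self):
        return " ".join(self._texts)

def crawl_plan(base_url, html, keyword):
    """Return absolute URLs of links worth following for a research query."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.links if keyword in h.lower()]
```

Multi-page navigation then amounts to repeatedly fetching the planned URLs and feeding each response back through the same extraction step.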

2.2.2 Content Processing Technology Advancements. Beyond basic navigation, the ability to process diverse content formats is crucial for comprehensive research:
2.2.2 内容处理技术进步。除了基本导航外,处理多样化内容格式的能力对于全面研究至关重要:
Technical Evolution Trajectory. Early systems were limited primarily to text extraction from HTML sources. Modern implementations support multi-modal content processing including structured data tables, embedded visualizations, PDF documents, and interactive applications. Advanced systems like those built on OpenAI’s o3 can extract semantic structure from unstructured content, identify key information from diverse formats, and integrate insights across modalities [201]. This evolution has dramatically expanded the range of information sources that can be incorporated into research processes.
技术演进轨迹。早期系统主要局限于从 HTML 源文件中提取文本。现代实现则支持多模态内容处理,包括结构化数据表格、嵌入式可视化图表、PDF 文档及交互式应用程序。基于 OpenAI o3 等先进系统能够从非结构化内容中提取语义结构,识别多种格式中的关键信息,并实现跨模态的洞察整合[201]。这一演进极大地扩展了可纳入研究过程的信息源范围。
Representative Systems. The dzhng/deep-research [321] project exemplifies advanced content processing through its specialized modules for different document types and formats. It implements custom extraction logic for academic papers, technical documentation, and structured data sources. Similarly, nickscamara/open-deep-research [42] features sophisticated content normalization pipelines that transform diverse formats into consistent knowledge representations suitable for analysis. Both systems demonstrate how specialized content processing can significantly enhance the quality and comprehensiveness of research outputs.
代表性系统。dzhng/deep-research[321]项目通过针对不同文档类型和格式的专用模块,展现了先进的内容处理能力。它实现了针对学术论文、技术文档和结构化数据源的定制化提取逻辑。同样,nickscamara/open-deep-research[42]系统具备复杂的内容规范化流程,能将多种格式转换为适合分析的一致性知识表示。这两个系统共同展示了专业内容处理如何显著提升研究成果的质量和全面性。
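A content-normalization pipeline of the kind described above can be sketched as a format dispatcher that maps heterogeneous inputs onto a uniform record. The `normalize` and `HANDLERS` names are illustrative inventions; production pipelines would add PDF parsing and richer table handling:

```python
import csv, io, json
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Strips markup, keeping only visible text."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def _from_html(raw):
    p = _TextOnly()
    p.feed(raw)
    return " ".join(p.chunks)

def _from_csv(raw):
    # Assumes the first row is a header; emits row records as JSON.
    rows = list(csv.reader(io.StringIO(raw)))
    header, body = rows[0], rows[1:]
    return json.dumps([dict(zip(header, r)) for r in body])

HANDLERS = {".html": _from_html, ".csv": _from_csv, ".txt": lambda raw: raw.strip()}

def normalize(filename, raw):
    """Map a source document onto a uniform {source, text} record."""
    for suffix, handler in HANDLERS.items():
        if filename.endswith(suffix):
            return {"source": filename, "text": handler(raw)}
    raise ValueError(f"unsupported format: {filename}")
```

Downstream analysis then operates on the uniform records regardless of where each document came from.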

2.2.3 Specialized Tool Integration Progress. Integration with domain-specific tools extends Deep Research capabilities beyond general information processing:
2.2.3 专业工具集成进展。通过与领域专用工具的集成,深度研究能力超越了通用信息处理的范畴:
Technical Evolution Trajectory. Initial systems relied on general-purpose web search and basic API integrations. The integration of diverse tools has been dramatically advanced by frameworks like ToolLLM [222], which enables large language models to master over 16,000 real-world APIs, significantly expanding the interaction capabilities of research systems. Similarly, AssistGPT [82] demonstrates how general multimodal assistants can plan, execute, inspect, and learn across diverse environments, creating unified research experiences that seamlessly incorporate varied information sources and interaction modalities. LLaVA-Plus [152] further extends these capabilities through explicit tool learning mechanisms, enabling research assistants to adaptively incorporate specialized tools within multimodal workflows. Current implementations feature complex toolchains including specialized databases, analytical frameworks, and domain-specific services. Advanced systems dynamically select and orchestrate tools based on research requirements, effectively composing custom research workflows from available capabilities. Some implementations like those leveraging OpenAI’s Codex [194] can even generate custom code to process research data or implement analytical models on demand, further extending analytical capabilities. This evolution has enabled increasingly sophisticated analysis and domain-specific research applications.
技术演进轨迹。早期系统依赖于通用网络搜索和基础 API 集成。随着 ToolLLM[222]等框架的出现,多样化工具的整合取得了显著进展,该框架使大语言模型能够掌握超过 16,000 个现实世界 API,极大拓展了研究系统的交互能力。类似地,AssistGPT[82]展示了通用多模态助手如何在多样化环境中进行规划、执行、检查和学习,创建出无缝整合多种信息源与交互模式的统一研究体验。LLaVA-Plus[152]则通过显式工具学习机制进一步扩展这些能力,使研究助手能在多模态工作流中自适应地整合专业工具。当前实施方案包含由专业数据库、分析框架和领域特定服务组成的复杂工具链。先进系统能根据研究需求动态选择和编排工具,有效利用现有能力组合出定制化的研究流程。 一些实现方案(如利用 OpenAI 的 Codex[194]的方案)甚至能生成定制代码来处理研究数据或按需实现分析模型,从而进一步扩展分析能力。这一演进使得越来越复杂的分析和特定领域的研究应用成为可能。
Representative Systems. Manus [164] exemplifies sophisticated tool orchestration through its extensive API integration framework and tool selection mechanisms. The system can incorporate domain-specific research tools and services into unified workflows, significantly expanding its analytical capabilities. Similarly, n8n [183] provides a flexible workflow automation platform that can be configured for research tasks, allowing for integration with specialized data sources and analytical services. Steward extends web interaction capabilities by implementing natural language-driven navigation and operation across websites, overcoming scalability limitations of traditional automation frameworks while maintaining low operational costs [261]. These systems highlight how tool integration can extend Deep Research capabilities into specialized domains and complex analytical workflows.
代表性系统。Manus[164]通过其广泛的 API 集成框架和工具选择机制,展现了复杂的工具编排能力。该系统能将特定领域的研究工具和服务整合到统一的工作流中,显著扩展其分析能力。类似地,n8n[183]提供了一个灵活的工作流自动化平台,可配置用于研究任务,实现与专业数据源和分析服务的集成。Steward 通过实现跨网站的自然语言驱动导航与操作,扩展了网络交互能力,在保持低运营成本的同时克服了传统自动化框架的可扩展性限制[261]。这些系统突显了工具集成如何将深度研究能力延伸至专业领域和复杂分析工作流中。
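Dynamic tool selection of the sort described above can be approximated by matching a task's required capabilities against a registry of advertised tool capabilities. The `Tool`/`select_tool` API below is a hypothetical sketch, not the actual interface of ToolLLM, Manus, or n8n:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    tags: set                      # capabilities this tool advertises
    run: Callable[[str], str]

# Illustrative registry; real systems register hundreds of integrations.
REGISTRY = [
    Tool("web_search", {"search", "web"}, lambda q: f"results for {q!r}"),
    Tool("sql_query", {"database", "structured"}, lambda q: f"rows for {q!r}"),
    Tool("code_exec", {"analysis", "compute"}, lambda q: f"output of {q!r}"),
]

def select_tool(required_tags):
    """Pick the registered tool whose tags best cover the task requirements."""
    scored = [(len(t.tags & required_tags), t) for t in REGISTRY]
    score, best = max(scored, key=lambda pair: pair[0])
    if score == 0:
        raise LookupError("no tool covers the requested capabilities")
    return best

def run_step(task, required_tags):
    return select_tool(required_tags).run(task)
```

Composing a custom workflow then reduces to calling `run_step` once per stage with that stage's capability requirements.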

2.3 Task Planning and Execution Control: Evolution and Advances
2.3 任务规划与执行控制:演进与进展

Effective research requires sophisticated planning and execution mechanisms to coordinate complex, multistage workflows.
高效研究需要复杂的规划与执行机制来协调多阶段工作流程。

2.3.1 Research Task Planning Development. The ability to decompose research objectives into manageable tasks represents a fundamental advancement:
2.3.1 研究任务规划的发展。将研究目标分解为可管理任务的能力代表了一项基础性进步:
Technical Evolution Trajectory. Early approaches employed simple task decomposition with linear execution flows, similar to those found in early agent frameworks like MetaGPT [111] and AgentGPT [230]. Modern systems implement hierarchical planning with dynamic refinement based on intermediate results and discoveries. Advanced planning approaches increasingly incorporate structured exploration methodologies to navigate complex solution spaces efficiently. AIDE [120] demonstrates how tree search algorithms can effectively explore the space of potential code solutions for machine learning engineering, trading computational resources for enhanced performance through strategic reuse and refinement of promising pathways. Advanced implementations incorporate resource-aware planning, considering time constraints, computational limitations, and information availability. However, incorporating AI tools for tasks like automated code review has been observed to lengthen pull-request closure times despite its benefits, as shown by Cihan et al. [59], highlighting the need to account for temporal impacts in such resource-aware systems. This evolution has enabled increasingly sophisticated research strategies adaptive to both task requirements and available resources.
技术演进轨迹。早期方法采用简单的任务分解与线性执行流程,类似于 MetaGPT[111]和 AgentGPT[230]等早期智能体框架中的实现。现代系统则采用基于中间结果和发现的动态优化分层规划。先进的规划方法越来越多地结合结构化探索方法论,以高效遍历复杂解空间。AIDE[120]展示了树搜索算法如何有效探索机器学习工程中潜在代码解决方案的空间,通过策略性地重用和优化有前景的路径,以计算资源换取性能提升。高级实现方案采用资源感知规划,综合考虑时间约束、计算限制和信息可用性。然而研究表明(如 Cihan 等人[59]所示),尽管引入 AI 工具(如自动化代码审查)能带来效益,却会延长拉取请求的关闭周期,这凸显了在此类资源感知系统中考虑时间影响的关键必要性。这一演进使得研究策略能够根据任务需求和可用资源进行自适应调整,从而日趋精密复杂。
Representative Systems. The OpenAI/AgentsSDK [199] provides a comprehensive framework for research task planning, with explicit support for goal decomposition, execution tracking, and adaptive refinement. It enables the development of applications with sophisticated planning capabilities for research workflows. Similarly, Flowith/OracleMode [77] implements specialized planning mechanisms optimized for research tasks, with particular emphasis on information quality assessment and source prioritization. These systems demonstrate how advanced planning capabilities can significantly improve research efficiency and effectiveness.
代表性系统。OpenAI/AgentsSDK [199] 提供了一个全面的研究任务规划框架,明确支持目标分解、执行跟踪和自适应优化。它能够开发具有复杂规划能力的研究工作流应用。同样,Flowith/OracleMode [77] 实现了专为研究任务优化的规划机制,特别强调信息质量评估和来源优先级排序。这些系统展示了高级规划能力如何显著提升研究效率和效果。
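Hierarchical planning with dynamic refinement can be illustrated with a small task tree that is expanded as intermediate findings arrive. The `Task`, `refine`, and `frontier` names are assumptions for illustration; real planners such as AIDE add search heuristics and explicit resource budgets:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    goal: str
    subtasks: list = field(default_factory=list)
    done: bool = False

def refine(task, findings):
    """Dynamically expand a plan node based on intermediate findings."""
    for gap in findings.get(task.goal, []):
        task.subtasks.append(Task(gap))
    return task

def frontier(task):
    """Leaf tasks still to execute, in depth-first order."""
    if not task.subtasks:
        return [] if task.done else [task.goal]
    out = []
    for sub in task.subtasks:
        out.extend(frontier(sub))
    return out
```

Execution alternates between running the current frontier and refining nodes whose results reveal new subgoals.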

2.3.2 Autonomous Execution and Monitoring Advances. Reliable execution of research plans requires sophisticated control and monitoring mechanisms:
2.3.2 自主执行与监控进展。研究计划的可靠执行需要复杂的控制和监控机制:
Technical Evolution Trajectory. Initial systems employed basic sequential execution with limited error handling. Current implementations feature concurrent execution paths, comprehensive monitoring, and dynamic response to execution challenges. Advanced systems implement self-supervision with explicit success criteria, failure detection, and autonomous recovery strategies. This evolution has dramatically improved the reliability and autonomy of Deep Research systems across complex tasks.
技术演进轨迹。早期系统采用基础顺序执行方式,错误处理能力有限。当前实现方案具备并发执行路径、全面监控能力及对执行挑战的动态响应。先进系统实现了自我监督机制,包含明确成功标准、故障检测和自主恢复策略。这一演进显著提升了深度研究系统在复杂任务中的可靠性和自主性。
Representative Systems. Agent-RL/ReSearch [2] exemplifies advanced execution control through its reinforcement learning-based approach to research execution. The system learns effective execution strategies from experience, continuously improving its ability to navigate complex research workflows. Its adaptive execution mechanisms can recover from failures and adjust strategies based on intermediate results, highlighting how sophisticated control mechanisms can enhance research reliability and effectiveness.
代表性系统。Agent-RL/ReSearch [2]通过其基于强化学习的研究执行方法,展现了先进的执行控制能力。该系统从经验中学习有效的执行策略,持续提升驾驭复杂研究流程的能力。其自适应执行机制能够从故障中恢复,并根据中间结果调整策略,彰显了精密的控制机制如何提升研究的可靠性与有效性。
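In the simplest case, the failure-detection and recovery strategies described above reduce to retry-then-fallback logic with an explicit error trail. The sketch below is illustrative only; learned systems like Agent-RL/ReSearch select strategies adaptively rather than from a fixed list:

```python
def run_with_recovery(step, fallbacks, max_attempts=2):
    """Execute a step; on failure retry, then fall back to alternative strategies."""
    errors = []
    for action in [step] + list(fallbacks):
        for attempt in range(max_attempts):
            try:
                return {"result": action(), "errors": errors}
            except Exception as exc:
                errors.append(f"{action.__name__}: {exc}")
    raise RuntimeError("all strategies exhausted: " + "; ".join(errors))
```

The returned error trail doubles as monitoring data, so a supervisor can detect which strategies fail most often.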

2.3.3 Multi-Agent Collaboration Framework Development. Complex research often benefits from specialized agent roles and collaborative approaches:
2.3.3 多智能体协作框架开发。复杂研究往往受益于专业化的智能体角色分工与协作方法:
Technical Evolution Trajectory. Early systems relied on monolithic agents with undifferentiated capabilities. Modern implementations employ specialized agent roles with explicit coordination mechanisms and information sharing protocols. Advanced systems feature dynamic role allocation, consensus-building mechanisms, and sophisticated conflict resolution strategies. This evolution has enabled increasingly complex collaborative research workflows and improved performance on challenging tasks[49]. For instance, frameworks employing multi-agent debate have been shown to improve evaluation consistency [48], while research into generative AI voting demonstrates resilience to model biases in collective decision-making [162].
技术演进轨迹。早期系统依赖功能单一的整体式智能体,而现代实现则采用具有明确协调机制和信息共享协议的专业化智能体角色。先进系统具备动态角色分配、共识构建机制和复杂冲突解决策略,这一演进使得协作研究流程能处理日益复杂的任务,并在挑战性任务上实现性能提升[49]。例如,采用多智能体辩论的框架已被证明能提高评估一致性[48],而关于生成式 AI 投票的研究则展示了集体决策中对模型偏见的抗干扰能力[162]。
Representative Systems. The smolagents/open_deep_research [115] framework demonstrates effective multi-agent collaboration through its modular agent architecture and explicit coordination mechanisms. It enables the composition of specialized research teams with complementary capabilities and shared objectives. Similarly, TARS [39] implements a sophisticated agent collaboration framework within its desktop environment, allowing multiple specialized agents to contribute to unified research workflows. These systems highlight how multi-agent approaches can enhance research capabilities through specialization and collaboration.
代表性系统。smolagents/open_deep_research[115]框架通过其模块化智能体架构和显式协调机制,展示了有效的多智能体协作能力。该框架能够组建具备互补能力和共同目标的专业研究团队。同样,TARS[39]在其桌面环境中实现了一个复杂的智能体协作框架,允许多个专业智能体共同参与统一的研究工作流程。这些系统展示了多智能体方法如何通过专业化和协作来增强研究能力。
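A minimal pipeline of specialized roles conveys the coordination pattern. The `Agent`/`run_team` structure below is a hypothetical sketch; frameworks like smolagents add dynamic role allocation, consensus mechanisms, and shared memory:

```python
class Agent:
    """A specialized role wrapping a single capability."""
    def __init__(self, name, handle):
        self.name, self.handle = name, handle

    def receive(self, message):
        return {"from": self.name, "content": self.handle(message)}

def run_team(agents, question):
    """Route a question through specialized agents, sharing intermediate notes."""
    transcript, note = [], question
    for agent in agents:
        reply = agent.receive(note)
        transcript.append(reply)
        note = reply["content"]        # next agent builds on prior output
    return transcript

team = [
    Agent("searcher", lambda q: f"sources({q})"),
    Agent("analyst", lambda s: f"claims({s})"),
    Agent("writer", lambda c: f"report({c})"),
]
```

The transcript preserves every agent's contribution, which supports the kind of debate and conflict-resolution mechanisms cited above.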

2.4 Knowledge Synthesis and Output Generation: Evolution and Advances
2.4 知识合成与输出生成:演进与进展

The ultimate value of Deep Research systems lies in their ability to synthesize disparate information into coherent, actionable insights.
深度研究系统的最终价值在于其将分散信息综合为连贯、可操作见解的能力。

2.4.1 Information Evaluation Technology Development. Critical assessment of information quality represents a crucial capability for reliable research:
2.4.1 信息评估技术发展。对信息质量的严格评估是确保研究可靠性的关键能力:
Technical Evolution Trajectory. Early systems relied primarily on source reputation heuristics with limited content-based assessment. Modern implementations employ sophisticated evaluation frameworks considering source characteristics, content features, and consistency with established knowledge. Advanced systems implement explicit uncertainty modeling, contradiction detection, and evidential reasoning approaches. This evolution has dramatically improved the reliability and trustworthiness of research outputs. Advances in knowledge retrieval based on generative AI enhance the ability to source and verify information [306].
技术演进轨迹。早期系统主要依赖来源声誉启发式方法,内容评估能力有限。现代实现采用复杂的评估框架,综合考虑来源特征、内容属性以及与既有知识的一致性。先进系统实现了显式不确定性建模、矛盾检测和证据推理方法。这一演进显著提升了研究成果的可靠性与可信度。基于生成式 AI 的知识检索技术进步增强了信息获取与验证能力[306]。
Representative Systems. The grapeot/deep_research_agent [263] implements sophisticated information evaluation mechanisms with explicit quality scoring for diverse source types. It can assess information reliability based on both intrinsic content features and extrinsic source characteristics, enabling more discerning information utilization. These capabilities highlight how advanced evaluation mechanisms can significantly enhance research quality and reliability.
代表性系统。grapeot/deep_research_agent[263]实现了精密的信息评估机制,针对不同来源类型设有明确的质量评分体系。该系统能基于内容内在特征和来源外在特性评估信息可靠性,从而实现更具鉴别力的信息利用。这些能力突显了先进评估机制对提升研究质量与可靠性的显著作用。
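Blending extrinsic source characteristics with intrinsic content signals can be sketched as a weighted score. The domain priors, weights, and penalty below are illustrative assumptions, not values used by any cited system:

```python
# Illustrative reputation priors keyed by top-level domain.
DOMAIN_PRIORS = {"edu": 0.9, "gov": 0.9, "org": 0.7, "com": 0.5}

def score_source(url, n_citations, contradicts_consensus):
    """Blend extrinsic reputation with intrinsic content signals into [0, 1]."""
    tld = url.rstrip("/").rsplit(".", 1)[-1]
    reputation = DOMAIN_PRIORS.get(tld, 0.3)
    evidence = min(n_citations, 20) / 20          # saturating citation bonus
    penalty = 0.4 if contradicts_consensus else 0.0
    return round(max(0.0, 0.5 * reputation + 0.5 * evidence - penalty), 3)
```

A system would then weight or discard retrieved passages according to such scores before synthesis.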

2.4.2 Report Generation Technology Advances. Effective communication of research findings requires sophisticated content organization and presentation:
2.4.2 报告生成技术进展。研究成果的有效传达需要精密的内容组织与呈现:
Technical Evolution Trajectory. Initial systems produced simple text summaries with limited structure or coherence. Current implementations generate comprehensive reports with hierarchical organization, evidence integration, and coherent argumentation. Advanced systems produce adaptive outputs tailored to audience expertise, information needs, and presentation contexts. This evolution has dramatically improved the usability and impact of Deep Research outputs.
技术演进轨迹。早期系统只能生成结构简单、连贯性有限的文本摘要。当前实现已能产出具有层次化组织、证据整合和连贯论证的综合报告。先进系统可生成适应受众专业水平、信息需求和呈现场景的自适应输出。这一演进显著提升了深度研究成果的可用性和影响力。
Representative Systems. The mshumer/OpenDeepResearcher [249] project exemplifies advanced report generation through its structured output framework and evidence integration mechanisms. It produces comprehensive research reports with explicit attribution, structured arguments, and integrated supporting evidence. These capabilities demonstrate how sophisticated report generation can enhance the utility and trustworthiness of Deep Research outputs. Additionally, the MegaWika dataset [22] offers a large-scale multilingual resource consisting of millions of articles and referenced sources, enabling collaborative AI report generation.
代表性系统。mshumer/OpenDeepResearcher[249]项目通过其结构化输出框架和证据整合机制,展现了先进的报告生成能力。该系统能产出包含明确引用、结构化论证和整合支撑证据的全面研究报告。这些功能证明了复杂报告生成技术如何提升深度研究成果的实用性和可信度。此外,MegaWika 数据集[22]提供了包含数百万篇文章和参考来源的大规模多语言资源,支持协作式 AI 报告生成。
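Hierarchical report generation with explicit attribution reduces, at its core, to rendering claims together with their sources. The `build_report` function below is a minimal sketch of this pattern, not any cited system's output format:

```python
def build_report(title, findings):
    """Render findings as a hierarchical report with inline attribution.

    `findings` maps section names to lists of (claim, source) pairs.
    """
    lines = [f"# {title}", ""]
    for section, items in findings.items():
        lines.append(f"## {section}")
        for claim, source in items:
            lines.append(f"- {claim} [{source}]")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Adaptive outputs then amount to varying the section structure and claim granularity for different audiences while keeping attribution intact.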

2.4.3 Interactive Presentation Technology Development. Beyond static reports, interactive result exploration enhances insight discovery and utilization:
2.4.3 交互式呈现技术发展。除静态报告外,交互式结果探索能显著提升洞察发现与利用效率:
Technical Evolution Trajectory. Early systems produced fixed textual outputs with minimal user interaction. Modern implementations support dynamic exploration including drill-down capabilities, source verification, and alternative viewpoint examination. Advanced systems enable collaborative refinement through iterative feedback incorporation and adaptive response to user queries. This evolution has dramatically enhanced the utility and flexibility of Deep Research interfaces.
技术演进轨迹。早期系统仅能生成固定文本输出,用户交互极为有限。现代实现方案支持包括下钻分析、来源验证和多视角审视在内的动态探索功能。先进系统通过迭代反馈整合和用户查询自适应响应,实现协同式结果优化。这一演进极大提升了深度研究界面的实用性与灵活性。
Representative Systems. HKUDS/Auto-Deep-Research [112] implements sophisticated interactive presentation capabilities, allowing users to explore research findings through dynamic interfaces, examine supporting evidence, and refine analysis through iterative interaction. These features highlight how interactive presentation technologies can enhance the utility and accessibility of Deep Research outputs, facilitating more effective knowledge transfer and utilization.
代表性系统。HKUDS/Auto-Deep-Research [112] 实现了复杂的交互式呈现功能,允许用户通过动态界面探索研究发现、查验支撑证据,并通过迭代交互优化分析流程。这些特性彰显了交互式呈现技术如何提升深度研究成果的实用性和可及性,从而促进更高效的知识传递与利用。
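Drill-down interaction can be modeled as on-demand expansion of a claim into its supporting evidence. The report structure and `drill_down` function below are a hypothetical sketch of such an interface:

```python
# Illustrative report object: a summary plus expandable, attributed claims.
REPORT = {
    "summary": "Tool-using agents outperform baselines.",
    "claims": {
        "c1": {"text": "ToolLLM covers 16k+ APIs", "evidence": ["[222]"]},
        "c2": {"text": "Multi-agent debate improves consistency", "evidence": ["[48]"]},
    },
}

def drill_down(report, claim_id):
    """Expand a single claim to its supporting evidence on demand."""
    claim = report["claims"].get(claim_id)
    if claim is None:
        return {"error": f"unknown claim {claim_id!r}"}
    return {"text": claim["text"], "evidence": claim["evidence"]}
```

Iterative refinement then layers user feedback on top of the same structure, regenerating only the claims a user has challenged.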
This technical framework provides a comprehensive foundation for understanding the capabilities and evolution of Deep Research systems. The subsequent sections will build on this framework to analyze implementation approaches, evaluate system performance, and explore applications across diverse domains.
该技术框架为理解深度研究系统的能力与演进提供了全面基础。后续章节将基于此框架分析实现方法、评估系统性能,并探索跨领域应用。

3 Comparative Analysis and Evaluation of Deep Research Systems
3 深度研究系统的比较分析与评估

Building upon the technical framework established in Section 2, this section provides a comprehensive comparative analysis of existing Deep Research systems across multiple dimensions. We examine how different implementations balance technical capabilities, application suitability, and performance characteristics to address diverse research needs.
基于第 2 节建立的技术框架,本节从多个维度对现有深度研究系统进行全面比较分析。我们研究不同实现方案如何平衡技术能力、应用适配性和性能特征,以满足多样化研究需求。

3.1 Cross-Dimensional Technical Comparison
3.1 跨维度技术比较

Deep Research systems demonstrate varying strengths across the four key technical dimensions identified in our framework. This section analyzes how different implementations balance these capabilities and the resulting performance implications.
深度研究系统在我们框架中确定的四个关键技术维度上展现出不同的优势。本节分析了不同实现方案如何平衡这些能力及其对性能的影响。

3.1.1 Foundation Model and Reasoning Efficiency Comparison. The underlying reasoning capabilities of Deep Research systems significantly impact their overall effectiveness:
3.1.1 基础模型与推理效率比较。深度研究系统的底层推理能力会显著影响其整体效能:

Table 1. Comparison of Foundation Model Characteristics
表 1. 基础模型特性对比
| System | Base Model | Context Length | Reasoning Approach |
| :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch [197] | o3 | reportedly up to 200k tokens [195] | Multi-step reasoning |
| Gemini/DeepResearch [60] | Gemini 2.5 Pro | 1M tokens [167] | Chain-of-thought |
| Perplexity/DeepResearch [209] | DeepSeek-R1 | 128K tokens [210] | Iterative reasoning |
| Grok3Beta [299] | Grok 3 | 1M tokens [299] | Chain-of-thought |
| AutoGLM-Research [330] | ChatGLM | DOM | Step-by-step planning |
DOM: Depends On the Model
DOM:取决于模型
Commercial systems from OpenAI and Google leverage proprietary models with extensive context windows and sophisticated reasoning mechanisms, enabling them to process larger volumes of information with greater coherence. OpenAI’s o3 model demonstrates particular strength in complex reasoning tasks, while Gemini 2.5 Pro excels in information integration across diverse sources. In contrast, Perplexity/DeepResearch achieves competitive performance with the open-source DeepSeek-R1 model through optimized implementation and focused use cases.
来自 OpenAI 和 Google 的商业系统采用具有广泛上下文窗口和复杂推理机制的专有模型,使其能够以更高的连贯性处理更大体量的信息。OpenAI 的 o3 模型在复杂推理任务中展现出独特优势,而 Gemini 2.5 Pro 则在跨多源信息整合方面表现卓越。相比之下,Perplexity/DeepResearch 通过优化实现和聚焦特定用例,以开源 DeepSeek-R1 模型实现了具有竞争力的性能。
Open-source implementations like Camel-AI/OWL [43] and QwenLM/Qwen-Agent [224] demonstrate that effective deep research capabilities can be achieved with more accessible models through specialized optimization. The open-weight approach of Camel-AI/OWL [43] enables flexible deployment across computing environments, while QwenLM/Qwen-Agent [224] leverages modular reasoning to compensate for more limited base model capabilities.
Camel-AI/OWL [43]和 QwenLM/Qwen-Agent [224]等开源实现表明,通过专门优化,使用更易获取的模型也能实现强大的深度研究能力。Camel-AI/OWL [43]的开源权重方法支持跨计算环境的灵活部署,而 QwenLM/Qwen-Agent [224]则利用模块化推理来弥补基础模型能力的局限性。

3.1.2 Tool Integration and Environmental Adaptability Comparison. The ability to interact with diverse information environments varies significantly across implementations:
3.1.2 工具集成与环境适应性对比。不同实现方案与多样化信息环境的交互能力存在显著差异:
Table 2. Environmental Interaction Capabilities of Deep Research Systems
表 2. 深度研究系统的环境交互能力
| System | Web Interaction | API Integration | Document Processing | GUI Navigation |
| :--- | :--- | :--- | :--- | :--- |
| Nanobrowser [184] | Headless browsing, JavaScript execution, dynamic content rendering | REST API connectors | Basic HTML parsing | Not implemented |
| AutoGLM [330] | Full browser automation, form interaction | RESTful and GraphQL support | PDF, Office formats, JSON | Element identification, click/input automation |
| dzhng/deep-research [321] | Multi-page navigation, cookie handling | OAuth authentication support | Academic paper extraction, table parsing | Not implemented |
| Manus [164] | JavaScript rendering, session management | 150+ service integrations, webhook support | PDF with layout preservation, CSV processing | Basic element interaction |
| n8n [183] | Limited, via HTTP requests | 200+ integration nodes, custom webhook endpoints | CSV/XML processing | Not implemented |
| TARS [39] | Viewport management, scroll handling | REST/SOAP support | Standard formats processing | Desktop application control, UI element recognition |
Note: Capabilities documented based on system repositories, technical documentation, and published demonstrations as of April 2025.
注:能力文档基于截至 2025 年 4 月的系统存储库、技术文档和已发布的演示。
Specialized tools like Nanobrowser [184] excel in web interaction capabilities, providing sophisticated navigation and content extraction optimized for research workflows. Systems like dzhng/deep-research [321] and nickscamara/open-deep-research [42] complement these capabilities with advanced document processing features that can extract structured information from diverse formats.
Nanobrowser [184]等专业工具在网络交互能力方面表现卓越,提供专为研究工作流程优化的复杂导航和内容提取功能。dzhng/deep-research [321]和 nickscamara/open-deep-research [42]等系统通过高级文档处理功能补充这些能力,可从多种格式中提取结构化信息。
Comprehensive platforms like Manus [164] and AutoGLM [330] offer broader environmental interaction capabilities, balancing web browsing, API integration, and document processing. These systems can adapt to diverse research scenarios but may not match the specialized performance of more focused tools in specific domains. The workflow automation capabilities of n8n [183] provide exceptional flexibility for API integration but offer more limited direct interaction with web and document environments.
Manus [164]和 AutoGLM [330]等综合平台提供更广泛的环境交互能力,平衡了网页浏览、API 集成和文档处理功能。这些系统能适应多样化的研究场景,但在特定领域的专业性能可能不及更专注的工具。n8n [183]的工作流自动化能力为 API 集成提供了极高的灵活性,但与网页和文档环境的直接交互功能较为有限。

3.1.3 Task Planning and Execution Stability Comparison. Effective research requires reliable task planning and execution capabilities:
3.1.3 任务规划与执行稳定性比较。有效研究需要可靠的任务规划与执行能力:
Table 3. Planning and Execution Capabilities of Deep Research Systems
表 3. 深度研究系统的规划与执行能力
| System | Task Planning Mechanisms | Error Handling Features | Collaboration Infrastructure |
| :--- | :--- | :--- | :--- |
| OpenAI/AgentsSDK [199] | Hierarchical task decomposition, goal-oriented planning | Automated retry logic, exception handling | Supervisor-worker architecture |
| Flowith/OracleMode [77] | Constraint-based planning, information quality prioritization | Checkpoint-based recovery | Limited role-based workflow |
| Agent-RL/ReSearch [2] | Reinforcement learning planning, adaptive task ordering | Progressive fallback strategies, state restoration | Standard agent messaging protocol |
| smolagents/open_deep_research [115] | Task queue management, priority-based scheduling | Basic retry mechanisms | Multi-agent configuration, specialized role definitions |
| TARS [39] | Process template architecture, event-driven coordination | State persistence, interruption handling | Team-based agent organization, shared memory |
| grapeot/deep_research_agent [263] | Linear task execution, sequential processing | Timeout handling | Single-agent architecture |
Note: Capabilities documented based on system repositories, technical documentation, and published implementations as of April 2025.
注:文档能力基于截至 2025 年 4 月的系统代码库、技术文档和已发布实现。
The OpenAI/AgentsSDK [199] demonstrates sophisticated planning capabilities with hierarchical task decomposition and adaptive execution, enabling complex research workflows with reliable completion rates. Similarly, Flowith/OracleMode [77] offers advanced planning mechanisms optimized for research tasks, though with more limited error recovery capabilities.
OpenAI/AgentsSDK [199] 展示了具备分层任务分解和自适应执行能力的复杂规划系统,可实现高完成率的复杂研究工作流。类似地,Flowith/OracleMode [77] 提供了专为研究任务优化的高级规划机制,但其错误恢复能力较为有限。
Agent-RL/ReSearch [2] employs reinforcement learning techniques to develop robust execution strategies, enabling exceptional error recovery capabilities that can adapt to unexpected challenges during research workflows. In contrast, smolagents/open_deep_research [115] and TARS [39] focus on multi-agent collaboration, distributing complex tasks across specialized agents to enhance overall research effectiveness.
Agent-RL/ReSearch [2] 采用强化学习技术开发鲁棒执行策略,具备卓越的错误恢复能力,可适应研究工作流中的意外挑战。相比之下,smolagents/open_deep_research [115] 和 TARS [39] 专注于多智能体协作,通过将复杂任务分配给专业化智能体来提升整体研究效率。
Simpler implementations like grapeot/deep_research_agent [263] offer more limited planning and execution capabilities but may provide sufficient reliability for less complex research tasks, demonstrating the range of complexity available across the ecosystem.
更简单的实现如 grapeot/deep_research_agent [263]提供的规划和执行能力较为有限,但对于复杂度较低的研究任务可能已足够可靠,这展现了整个生态系统中不同层级的复杂度选择。

3.1.4 Knowledge Synthesis and Output Quality Comparison. The ability to synthesize findings into coherent, reliable outputs varies significantly:
3.1.4 知识综合与输出质量比较。不同系统将研究发现综合为连贯可靠输出的能力存在显著差异:
Table 4. Knowledge Synthesis Capabilities of Deep Research Systems
表 4.深度研究系统的知识综合能力
| System | Source Evaluation Mechanisms | Output Structuring | User Interaction Features |
| :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch [197] | Source corroboration, authority ranking algorithms | Hierarchical report generation, section organization | Query clarification dialogue, result expansion |
| Perplexity/DeepResearch [209] | Source diversity metrics, publication date filtering | Citation-based organization, inline attribution | Source exploration interface, follow-up questioning |
| mshumer/OpenDeepResearcher [249] | Publication venue filtering, citation count tracking | Template-based document generation, section templating | Minimal interaction, batch processing focus |
| HKUDS/Auto-Deep-Research [112] | Basic source categorization, recency filtering | Standard academic format, heading organization | Interactive result exploration, citation navigation |
| grapeot/deep_research_agent [263] | Evidence classification algorithms, contradictory claim detection | Minimal formatting, raw data presentation | Command-line interface, non-interactive |
| OpenManus [193] | Source type categorization, basic metadata filtering | Markdown formatting, hierarchy-based organization | Basic query refinement, result browsing |
Note: Capabilities documented based on system repositories, technical documentation, and published implementations as of April 2025.
注:所记录功能基于截至 2025 年 4 月的系统代码库、技术文档及已发布实现方案。
System Source Evaluation Mechanisms Output Structuring User Interaction Features OpenAI/DeepResearch [197] Source corroboration, authority ranking algorithms Hierarchical report generation, section organization Query clarification dialogue, result expansion Perplexity/DeepResearch [209] Source diversity metrics, publication date filtering Citation-based organization, inline attribution Source exploration interface, follow-up questioning mshumer/OpenDeepResearcher [249] Publication venue filtering, citation count tracking Template-based document generation, section templating Minimal interaction, batch processing focus HKUDS/Auto-Deep-Research [112] Basic source categorization, recency filtering Standard academic format, heading organization Interactive result exploration, citation navigation grapeot/deep_research_agent [263] Evidence classification algorithms, contradictory claim detection Minimal formatting, raw data presentation Command-line interface, non-interactive OpenManus [193] Source type categorization, basic metadata filtering Markdown formatting, hierarchy-based organization Basic query refinement, result browsing Note: Capabilities documented based on system repositories, technical documentation, and published implementations as of April 2025. 
| System | Source Evaluation Mechanisms | Output Structuring | User Interaction Features | | :--- | :--- | :--- | :--- | | OpenAI/DeepResearch [197] | Source corroboration, authority ranking algorithms | Hierarchical report generation, section organization | Query clarification dialogue, result expansion | | Perplexity/DeepResearch [209] | Source diversity metrics, publication date filtering | Citation-based organization, inline attribution | Source exploration interface, follow-up questioning | | mshumer/OpenDeepResearcher [249] | Publication venue filtering, citation count tracking | Template-based document generation, section templating | Minimal interaction, batch processing focus | | HKUDS/Auto-Deep-Research [112] | Basic source categorization, recency filtering | Standard academic format, heading organization | Interactive result exploration, citation navigation | | grapeot/deep_research_agent [263] | Evidence classification algorithms, contradictory claim detection | Minimal formatting, raw data presentation | Command-line interface, non-interactive | | OpenManus [193] | Source type categorization, basic metadata filtering | Markdown formatting, hierarchy-based organization | Basic query refinement, result browsing | | Note: Capabilities documented based on system repositories, technical documentation, and published implementations as of April 2025. | | | |
Commercial platforms like OpenAI/DeepResearch [197] and Perplexity/DeepResearch [209] demonstrate sophisticated information evaluation capabilities, effectively assessing source credibility and content reliability to produce high-quality syntheses. OpenAI’s implementation excels in report structure and organization, while Perplexity offers particularly strong citation practices for source attribution and verification.
Open-source implementations like mshumer/OpenDeepResearcher [249] focus on report structure and organization, producing well-formatted outputs that effectively communicate research findings. HKUDS/Auto-Deep-Research [112] emphasizes interactive exploration, allowing users to examine evidence and refine analyses through iterative interaction. Specialized tools like grapeot/deep_research_agent [263] prioritize information evaluation over presentation, focusing on reliable content assessment rather than sophisticated output formatting.
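The source-evaluation mechanisms tabulated above (corroboration, authority ranking, recency filtering) can be combined into a single credibility signal. The following is a hedged toy sketch: the weights, the saturation constant, and the function name `score_source` are illustrative assumptions, not any surveyed system's documented scoring rule.

```python
from datetime import date

def score_source(corroborating_sources, authority, published, today=None,
                 weights=(0.5, 0.3, 0.2), half_life_days=365):
    """Toy credibility score in [0, 1] combining three signals:

    - corroboration: how many independent sources agree (saturates toward 1),
    - authority: a prior in [0, 1], e.g. venue reputation,
    - recency: exponential decay of age with the given half-life.
    """
    if today is None:
        today = date.today()
    corroboration = corroborating_sources / (corroborating_sources + 3)
    age_days = max((today - published).days, 0)
    recency = 0.5 ** (age_days / half_life_days)
    w_c, w_a, w_r = weights
    return w_c * corroboration + w_a * authority + w_r * recency
```

With three corroborating sources, full authority, and a same-day publication, the score is 0.5 · 0.5 + 0.3 · 1.0 + 0.2 · 1.0 = 0.75; more corroboration or a newer date monotonically raises it.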

3.2 Application-Based System Suitability Analysis

Beyond technical capabilities, Deep Research systems demonstrate varying suitability for different application contexts. This section examines how system characteristics align with key application domains.

3.2.1 Academic Research Scenario Adaptability Assessment. Academic research requires particular emphasis on comprehensive literature review, methodological rigor, and citation quality. Systems like OpenAI/ DeepResearch [197] excel in this domain through their ability to access academic databases, comprehensively analyze research methodologies, and generate properly formatted citations. Other specialized academic research tools like PaperQA [80] and Scite [243] offer complementary capabilities focused specifically on scientific literature processing, while Google’s NotebookLM [95] provides structured knowledge workspaces for academic exploration.
OpenAI/DeepResearch [197] demonstrates exceptional suitability for academic research through its comprehensive literature coverage, methodological rigor, and high-quality citation practices. The system can effectively navigate academic databases, understand research methodologies, and produce well-structured
Table 5. Academic Research Application Features of Deep Research Systems

| System | Academic Database Integration | Methodology Analysis Features | Citation Management |
| :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch [197] | ArXiv, IEEE Xplore, PubMed, Google Scholar | Statistical method identification, study design classification | IEEE, APA, MLA, Chicago format support |
| Perplexity/DeepResearch [209] | ArXiv, PubMed, JSTOR, ACM Digital Library | Experimental design analysis, sample size assessment | Automated citation generation, DOI resolution |
| dzhng/deep-research [321] | ArXiv, Semantic Scholar, limited database access | Basic methodology extraction | BibTeX export, standard format support |
| Camel-AI/OWL [43] | Custom corpus integration, specialized domain databases | Research design pattern recognition, methodology comparison | Domain-specific citation formatting |
| mshumer/OpenDeepResearcher [249] | Open access databases, PDF repository processing | Methodology summary extraction | Standard citation format generation |
| HKUDS/Auto-Deep-Research [112] | University library integration, institutional repository access | Research approach categorization | Reference management, bibliography generation |

Note: Features documented based on system repositories, technical documentation, and published use cases as of April 2025.
literature reviews with appropriate attribution. Perplexity/DeepResearch [209] offers similarly strong performance for literature coverage and citation quality, though with somewhat less methodological sophistication.
Open-source alternatives like Camel-AI/OWL [43] provide competitive capabilities for specific academic domains, with particular strength in methodological understanding. Systems like dzhng/deep-research [321], mshumer/OpenDeepResearcher [249], and HKUDS/Auto-Deep-Research [112] offer moderate capabilities across all dimensions, making them suitable for less demanding academic research applications or preliminary literature exploration.
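Citation management of the kind tabulated above amounts to rendering shared bibliographic metadata into style-specific strings. A minimal sketch, assuming just two toy styles (real managers, such as those in OpenAI/DeepResearch, support IEEE, APA, MLA, and Chicago with many more edge cases than shown here):

```python
def format_citation(authors, year, title, venue, style="apa"):
    """Render one citation from shared metadata into a style-specific string.

    'authors' is a list of pre-formatted name strings; only two illustrative
    styles are implemented.
    """
    if style == "apa":
        names = " & ".join(authors)
        return f"{names} ({year}). {title}. {venue}."
    if style == "ieee":
        names = ", ".join(authors)
        return f'{names}, "{title}," {venue}, {year}.'
    raise ValueError(f"unsupported style: {style}")
```

The design point this illustrates is separation of metadata from presentation: the same record can be re-rendered when the target venue changes, which is what features like "IEEE, APA, MLA, Chicago format support" in Table 5 amount to.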

3.2.2 Enterprise Decision-Making Scenario Adaptability Assessment. Business intelligence and strategic decision-making emphasize information currency, analytical depth, and actionable insights:
Table 6. Enterprise Decision-Making Application Features of Deep Research Systems

| System | Market Information Sources | Analytical Frameworks | Decision Support Features |
| :--- | :--- | :--- | :--- |
| Gemini/DeepResearch [60] | News API integration, SEC filings access, market data feeds | Competitor analysis templates, trend detection algorithms | Executive summary generation, recommendation formatting |
| Manus [164] | Financial data integrations, news aggregation, industry reports | Market sizing frameworks, SWOT analysis templates | Strategic options presentation, decision matrix generation |
| n8n [183] | CRM integration, marketing platform connectivity, custom data sources | Custom analytics workflow creation, data pipeline automation | Dashboard generation, notification systems |
| Agent-RL/ReSearch [2] | Configurable information source adapters, custom data inputs | Pattern recognition algorithms, causal analysis frameworks | Scenario planning tools, impact assessment matrices |
| Flowith/OracleMode [77] | Real-time data feeds, specialized industry sources | Industry-specific analytical templates, framework application | Strategic briefing generation, insight prioritization |
| TARS [39] | Enterprise system integration, desktop application data access | Basic analytical template application | Standardized reporting, data visualization |

Note: Features documented based on system repositories, technical documentation, and published use cases as of April 2025.
Gemini/DeepResearch [60] demonstrates exceptional suitability for enterprise decision-making through its strong information currency, analytical capabilities, and actionable output formats. The system effectively navigates business information sources, analyzes market trends, and produces insights directly relevant to decision processes. Manus [164] offers similarly strong performance for information acquisition and analysis, though with somewhat less emphasis on actionable recommendation formatting. Microsoft Copilot [173] provides organizations with generative AI capabilities backed by enterprise-grade security and privacy controls. Similarly, the Adobe Experience Platform AI Assistant [181] employs knowledge graph-enhanced retrieval-augmented generation to accurately respond over private enterprise documents, significantly enhancing response relevance while maintaining provenance tracking.
Workflow automation platforms like n8n [183] provide particular strengths in information currency and actionability through their integration with enterprise data sources and business intelligence tools. Research-focused systems like Agent-RL/ReSearch [2] and Flowith/OracleMode [77] offer competitive analytical capabilities but may require additional processing to translate findings into actionable business recommendations.
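Several of the decision-support features in Table 6 — decision matrix generation, insight prioritization — reduce to weighted scoring over options and criteria. A minimal hedged sketch (the function name, weights, and data shapes are illustrative assumptions, not any platform's API):

```python
def decision_matrix(options, criteria):
    """Rank options by weighted criterion scores.

    options:  {option_name: {criterion_name: score in [0, 1]}}
    criteria: {criterion_name: weight}; weights need not be normalized.
    Returns a list of (option_name, weighted_score) pairs, best first.
    """
    total = sum(criteria.values())
    ranked = [
        (name, sum(scores[c] * w for c, w in criteria.items()) / total)
        for name, scores in options.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

For example, with criteria {growth: 2, risk: 1}, an option scoring (0.9, 0.3) receives (0.9 · 2 + 0.3 · 1) / 3 = 0.7 and outranks an option scoring (0.2, 0.9).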

3.2.3 Personal Knowledge Management Adaptability Assessment. Individual knowledge management emphasizes accessibility, personalization, and integration with existing workflows:
Table 7. Personal Knowledge Management Features of Deep Research Systems

| System | User Interface Design | Customization Options | Existing Tool Integration |
| :--- | :--- | :--- | :--- |
| Perplexity/DeepResearch [209] | Web-based interface, mobile application support | Topic preference settings, information filtering options | Browser extension, sharing functionality |
| nickscamara/open-deep-research [42] | Command-line interface, web interface option | Modular configuration, source priority adjustment | Local file system integration, note-taking exports |
| OpenManus [193] | Desktop application, local web interface | Template customization, workflow configuration | Note application exports, knowledge base connections |
| Nanobrowser [184] | Programmatic interface, developer-focused API | Full configuration access, component-level customization | Browser automation framework compatibility |
| smolagents/open_deep_research [115] | Technical interface, Python library integration | Architecture-level customization, agent behavior configuration | Python ecosystem integration, custom adapter support |
| Jina-AI/node-DeepResearch [121] | Node.js integration, API-driven interface | Component-level configuration, pipeline customization | Node.js application ecosystem, JavaScript framework support |
Perplexity/DeepResearch [209] offers strong accessibility for personal knowledge management through its consumer-friendly interface and free access tier, though with more limited personalization capabilities. Open-source implementations like nickscamara/open-deep-research [42] and OpenManus [193] provide greater personalization possibilities through local deployment and customization, enabling adaptation to individual information management preferences.
Infrastructure tools like Nanobrowser [184] and Jina-AI/node-DeepResearch [121] offer particular strengths in workflow integration, allowing seamless incorporation into existing personal knowledge management systems and processes. More complex frameworks like smolagents/open_deep_research [115] provide sophisticated capabilities but may present accessibility challenges for non-technical users.

3.3 Performance Metrics and Benchmarking

Beyond qualitative comparisons, quantitative performance metrics provide objective assessment of Deep Research capabilities across systems.

3.3.1 Quantitative Evaluation Metrics. Standard benchmarks enable comparative evaluation of core research capabilities:
Table 8. Performance on Standard Evaluation Benchmarks

| System | HLE Score* [212] | MMLU** Score [33] | HotpotQA Score [307] | GAIA Score (pass@1)*** [172] |
| :--- | :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch [197] | 26.6% | - | - | 67.36% |
| Gemini-2.5 [60, 293] | 18.8% | - | - | - |
| Gemini-2.0-Flash [89, 93] | - | 77.9% | - | - |
| Perplexity/DeepResearch [209] | 21.1% | - | - | - |
| Grok3Beta [299] | - | 79.9% | - | - |
| Manus [164] | - | - | - | 86.5% |
| Agent-RL/ReSearch [2] | - | - | 37.51% | - |
OpenAI/DeepResearch [30, 123, 197] demonstrates leading performance across various benchmark categories, particularly excelling in Humanity’s Last Exam (HLE) [212], which measures advanced research and reasoning capabilities. Gemini/DeepResearch [60] shows comparable performance. According to the introduction of Google Deep Research with Gemini 2.5 Pro Experimental [60, 126], the new model demonstrated superior user preference over OpenAI/DeepResearch across four key metrics: instruction following (60.6% vs. 39.4%), comprehensiveness (76.9% vs. 23.1%), completeness (73.3% vs. 26.7%), and writing quality (58.2% vs. 41.8%). These results suggest Gemini 2.5 Pro’s enhanced capability in synthesizing structured, high-fidelity research outputs. This capability is further amplified in fullstack applications, where the integration of Gemini
Table 9. Documented Performance Metrics from Deep Research Systems

| System | Benchmark | Reported Score | Evaluation Context | Source |
| :--- | :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch | HLE | 26.6% | Humanity's Last Exam | [197] |
| OpenAI/DeepResearch | GAIA (pass@1) | 67.36% | General AI assistant tasks | [197] |
| Perplexity/DeepResearch | HLE | 21.1% | Humanity's Last Exam | [209] |
| Perplexity/DeepResearch | SimpleQA | 93.9% | Factual question answering | [209] |
| Grok3Beta | MMLU | 92.7% | Multitask language understanding | [299] |
| Manus | GAIA (pass@1) | 86.5% | General AI assistant tasks | [164] |
| Agent-RL/ReSearch | HotpotQA | 37.51% | Multi-hop question answering | [2] |
| AutoGLM | WebArena-Lite | 55.2% (59.1% retry) | Web navigation tasks | [330] |
| AutoGLM | OpenTable | 96.2% | Restaurant booking tasks | [330] |
Note: Reported scores reflect differing evaluation methodologies and task specifications; direct cross-system comparisons should be interpreted cautiously.

models with frameworks like LangGraph facilitates research-augmented conversational AI for comprehensive query handling, as demonstrated in Google-Gemini/Gemini-Fullstack-Langgraph-Quickstart [94]. Perplexity/DeepResearch [209] achieves competitive results despite utilizing the open-source DeepSeek-R1 model, highlighting the importance of implementation quality beyond raw model capabilities.
Open-source implementations show progressively lower benchmark scores, though many still achieve respectable performance suitable for practical applications. Systems like AutoGLM-Research [330], HKUDS/ Auto-Deep-Research [112], and Camel-AI/OWL [43] demonstrate that effective research capabilities can be achieved with more accessible models and frameworks, though with some performance trade-offs compared to leading commercial implementations.
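The GAIA results above are reported as pass@1. A commonly used unbiased estimator (popularized by code-generation benchmarks) generalizes this to pass@k: given n sampled attempts of which c succeed, it gives the probability that at least one of k draws (without replacement) succeeds, and pass@1 reduces to the raw success rate c/n. A short sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: total sampled attempts, c: number of correct attempts, k: draws.
    Returns 1 - C(n - c, k) / C(n, k), the probability that at least one
    of k attempts drawn from the n samples is correct.
    """
    if n - c < k:
        return 1.0  # not enough failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 correct out of 10 attempts, pass@1 = 0.4 exactly, while pass@5 = 1 − C(6,5)/C(10,5) ≈ 0.976.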
Recent benchmark development has expanded evaluation to more specialized aspects of research assistance. The AAAR-1.0 benchmark [157] specifically evaluates AI’s potential to assist research through 150 multi-domain tasks designed to test both retrieval and reasoning capabilities. Domain-specific approaches include DSBench [122], which evaluates data science agent capabilities across 20 real-world tasks [182, 283], SciCode [268] for scientific code generation, MASSW [323] for scientific workflow assistance, and MMSci [147] for multimodal scientific understanding across graduate-level materials. ScienceQA [160] offers a comprehensive multimodal science benchmark with chain-of-thought explanations for evaluating reasoning capabilities. Domain-specific benchmarks like TPBench [58] for theoretical physics and AAAR-1.0 [157] for research assistance capabilities offer additional targeted evaluation approaches for specialized research applications. Multi-domain code generation benchmarks like DomainCodeBench [328] systematically assess large language models across 12 software application domains and 15 programming languages. Interactive evaluation frameworks like LatEval [114] specifically assess systems’ capabilities in handling incomplete information through lateral thinking puzzles, providing insight into research abilities under uncertainty and ambiguity. Complementary approaches like Mask-DPO [100] focus on generalizable fine-grained factuality alignment, addressing a critical requirement for reliable research outputs. Domain-specific benchmarks such as GMAI-MMBench [51] provide comprehensive multimodal evaluation frameworks specifically designed for medical AI applications, while AutoBench [52] offers automated evaluation of scientific discovery capabilities, providing standardized assessment of core research functions. Other broad evaluation frameworks, including HELM [149], BIG-bench [88], and AGIEval [331], provide complementary assessment dimensions.
Specialized multimodal benchmarks like INQUIRE [279] extend this landscape to ecological challenges, rigorously evaluating expert-level text-to-image retrieval tasks critical for accelerating biodiversity research.
Table 10. Specialized Deep Research Benchmarks

| Benchmark | Focus Area | Evaluation Approach | Key Metrics |
| :--- | :--- | :--- | :--- |
| AAAR-1.0 [157] | Research assistance | 150 multi-domain tasks | Retrieval and reasoning capability |
| DSBench [122] | Data science | 20 real-world tasks | End-to-end completion rate |
| SciCode [268] | Scientific coding | Curated by scientists | Code quality, task completion |
| MASSW [323] | Scientific workflows | Benchmarking tasks | Workflow orchestration quality |
| MMSci [147] | Multimodal science | Graduate-level questions | Cross-modal understanding |
| TPBench [58] | Theoretical physics | Physics reasoning | Problem-solving accuracy |

Note: These benchmarks represent domain-specific evaluation frameworks for specialized research capabilities.

3.3.2 Qualitative Assessment Frameworks. Beyond numeric benchmarks, qualitative evaluation provides insight into practical effectiveness:
Table 11. Documented Output Characteristics of Deep Research Systems

| System | Content Organization | Information Diversity | Verification Features | Novel Connection Mechanisms |
| :--- | :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch [197] | Hierarchical structure with 5+ sections, executive summaries | Cross-domain source integration (reported in [197]) | Statement-level citation linking, contradiction flagging | Cross-domain connection identification |
| Gemini/DeepResearch [60] | Multi-level heading organization, standardized formatting | Multi-perspective source inclusion (documented in [60]) | Source credibility metrics, confidence indicators | Thematic pattern identification |
| Perplexity/DeepResearch [209] | Progressive information disclosure, expandable sections | Real-time source aggregation across platforms | Direct quote attribution, inline source linking | Timeline-based relationship mapping |
| mshumer/OpenDeepResearcher [249] | Template-based document structure, consistent formatting | Topic-based categorization of sources | Basic citation framework, reference listing | Topic cluster visualization |
| grapeot/deep_research_agent [263] | Minimal formatting, content-focused presentation | Source type categorization, domain tracking | Source credibility scoring system based on metadata | Not implemented per repository documentation |
| Agent-RL/ReSearch [2] | Adaptive content organization based on information types | Exploratory search patterns documented in repository | Contradiction detection algorithms | Pattern-based insight generation documented in [2] |

Note: Characteristics documented based on system technical documentation, published demonstrations, repository analysis, and official descriptions as of April 2025. Specific feature implementations may vary across system versions.
Commercial systems generally demonstrate stronger qualitative performance, particularly in output coherence and factual accuracy. OpenAI/DeepResearch [197] produces exceptionally well-structured reports with reliable factual content, while also achieving moderate innovation in connecting disparate sources. Gemini/DeepResearch [60] shows similar strengths in coherence and accuracy, with slightly less emphasis on novel insights.
Some open-source implementations show particular strengths in specific dimensions. Agent-RL/ReSearch [2] achieves notable performance in insight novelty through its exploration-focused approach, while grapeot/ deep_research_agent [263] demonstrates strong factual accuracy through its emphasis on information verification. These specialized capabilities highlight the diversity of approaches within the Deep Research ecosystem.

3.3.3 Efficiency and Resource Utilization Metrics. Practical deployment considerations include computational requirements and operational efficiency:
Commercial cloud-based services offer optimized performance with moderate response times, though with dependency on external infrastructure and associated costs. Perplexity/DeepResearch [209] achieves particularly strong efficiency metrics, with relatively quick response times and high token efficiency despite its competitive output quality.
Open-source implementations present greater variability in efficiency metrics. Systems like AutoGLM-Research [330] and QwenLM/Qwen-Agent [224] require substantial computational resources but can be deployed in local environments, offering greater control and potential cost savings for high-volume usage.
Table 12. Efficiency and Resource Utilization

| System | Response Time* | Compute Requirements | Token Efficiency** |
| :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch [197] | 5-30 min | Cloud-only | High (detailed, citation-rich) |
| Perplexity/DeepResearch [209] | 2 min 59 s | Cloud-only | - |
| Grok3Beta [299] | - | Cloud-only | - |
| Nanobrowser [184] | - | User-defined via LLM API key | - |
| n8n [183] | - | Self-hosted or cloud-based; scalable | - |
*Typical response time for moderately complex research tasks
**Efficiency of token utilization relative to output quality
Lighter-weight implementations like nickscamara/open-deep-research [42] can operate with more limited resources but typically demonstrate longer response times and lower token efficiency.
This comparative analysis highlights the diversity of approaches and capabilities across the Deep Research ecosystem. While commercial implementations currently demonstrate leading performance on standard benchmarks, open-source alternatives offer competitive capabilities in specific domains and use cases, with particular advantages in customization, control, and potential cost efficiency for specialized applications. The subsequent sections will build on this analysis to examine implementation technologies, evaluation methodologies, and application domains in greater detail.

4 Implementation Technologies and Challenges

The practical realization of Deep Research systems involves numerous technical challenges spanning infrastructure design, system integration, and safeguard implementation. This section examines the key implementation technologies that enable effective Deep Research capabilities and the challenges that must be addressed for reliable, efficient operation.

4.1 Architectural Implementation Patterns

The diverse systems analyzed in this survey reveal several distinct architectural patterns that represent different approaches to implementing Deep Research capabilities. This section examines four fundamental architectural patterns: monolithic, pipeline-based, multi-agent, and hybrid implementations. For each pattern, we analyze the underlying structural principles, component interactions, information flow mechanisms, and representative systems.

4.1.1 Monolithic Architecture Pattern. Monolithic implementations integrate all Deep Research capabilities within a unified architectural framework centered around a core reasoning engine. As illustrated in Figure 4, these systems employ a centralized control mechanism with direct integration of specialized modules.
The defining characteristics of this architecture include:
  • Centralized Control Flow: All operations route through a primary reasoning engine that maintains global state and execution context
  • Tightly Coupled Integration: Specialized modules (web browsing, document processing, etc.) are directly integrated with the central controller
  • Shared Memory Architecture: Information state is maintained in a centralized memory system accessible to all components
  • Sequential Reasoning Processes: Operations typically follow a structured sequence defined by the central controller

Fig. 3. Implementation Architecture of Deep Research Systems

This architectural pattern offers strong coherence and reasoning consistency through its unified control structure. However, it presents challenges for extensibility and can struggle with parallelization of complex operations. Representative implementations include OpenAI/DeepResearch [197] and grapeot/deep_research_agent [263], which demonstrate how this architecture enables coherent reasoning across diverse information sources while maintaining implementation simplicity.
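The centralized control flow described above can be sketched in a few lines. The controller class, module names, and step loop below are illustrative assumptions, not any vendor's actual implementation:

```python
# Minimal sketch of a monolithic Deep Research loop (illustrative only).
# A single controller owns every module and a shared memory dict, so all
# state and sequencing live in one place.

class MonolithicResearcher:
    def __init__(self, reasoner, modules):
        self.reasoner = reasoner          # core reasoning engine
        self.modules = modules            # tightly coupled tools, e.g. {"search": ...}
        self.memory = {"findings": []}    # centralized memory shared by all steps

    def run(self, question, max_steps=5):
        for _ in range(max_steps):        # sequential reasoning process
            action, arg = self.reasoner.next_action(question, self.memory)
            if action == "finish":
                break
            result = self.modules[action](arg)          # direct, in-process call
            self.memory["findings"].append((action, arg, result))
        return self.reasoner.summarize(question, self.memory)
```

The tight coupling is visible in the direct dictionary lookup of modules: swapping a tool means editing the controller's configuration, and every step reads and writes the same memory object.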

4.1.2 Pipeline-Based Architecture Pattern. Pipeline architectures implement Deep Research capabilities through a sequence of specialized processing stages connected through well-defined interfaces. As shown in Figure 5, these systems decompose research workflows into discrete processing components with explicit data transformations between stages.
The key characteristics of pipeline implementations include:

Fig. 4. Monolithic Deep Research Architecture
  • Sequential Component Organization: Research tasks flow through a predefined sequence of specialized processing modules
  • Standardized Interfaces: Clear data transformation specifications between pipeline stages enable modular component replacement
  • Staged Processing Logic: Each component implements a specific transformation, with minimal dependence on global state
  • Configurable Workflow Paths: Advanced implementations enable conditional routing between alternative processing paths based on intermediary results

Pipeline architectures excel in workflow customization and component reusability but may struggle with complex reasoning tasks requiring iterative refinement across components. Systems like n8n [183] and dzhng/deep-research [321] exemplify this approach, demonstrating how explicit workflow sequencing enables sophisticated research automation through composition of specialized components.
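A minimal sketch of this staged composition, assuming a dict-based state as the standardized interface between stages (the stage names `plan`, `retrieve`, and `synthesize` are illustrative, not taken from any of the systems above):

```python
# Illustrative pipeline composition: each stage is a function with a
# standardized dict-in/dict-out interface, so stages can be swapped or
# reordered without touching global state.

def plan(state):
    state["queries"] = [f"{state['topic']} overview", f"{state['topic']} critique"]
    return state

def retrieve(state):
    state["documents"] = [f"doc for '{q}'" for q in state["queries"]]
    return state

def synthesize(state):
    state["report"] = f"{state['topic']}: synthesized {len(state['documents'])} documents"
    return state

def run_pipeline(topic, stages=(plan, retrieve, synthesize)):
    state = {"topic": topic}
    for stage in stages:          # predefined sequential flow
        state = stage(state)
    return state["report"]
```

Because every stage obeys the same interface, modular replacement is a one-line change to the `stages` tuple, which is the extensibility property the pattern trades reasoning depth for.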

Fig. 5. Pipeline-Based Deep Research Architecture

4.1.3 Multi-Agent Architecture Pattern. Multi-agent architectures implement Deep Research capabilities through ecosystems of specialized autonomous agents coordinated through explicit communication protocols. Figure 6 illustrates how these systems distribute research functionality across collaborating agents with differentiated roles and responsibilities.

The defining elements of multi-agent implementations include:
  • Distributed Functional Decomposition: Research capabilities are distributed across specialized agents with defined roles (searcher, analyst, critic, etc.)
  • Explicit Coordination Mechanisms: Standardized message passing and task delegation protocols enable inter-agent collaboration
  • Autonomous Decision Logic: Individual agents maintain independent reasoning capabilities within their designated domains
  • Dynamic Task Allocation: Advanced implementations employ flexible task assignment based on agent capabilities and current workload
Multi-agent architectures excel in complex research tasks requiring diverse specialized capabilities and parallel processing. Their distributed nature enables exceptional scaling for complex research workflows but introduces challenges in maintaining overall coherence and consistent reasoning across agents. Representative implementations include smolagents/open_deep_research [115] and TARS [39], which demonstrate how multi-agent coordination enables sophisticated research workflows through specialized agent collaboration.
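The explicit coordination mechanisms above can be illustrated with a minimal message-passing sketch; the roles (`searcher`, `analyst`) and the message schema are assumptions for illustration, not the protocol of any particular system:

```python
# Sketch of message-passing coordination between role-specialized agents.
# Each agent only sees its inbox and outbox, so reasoning stays isolated
# within its designated domain.
from queue import Queue

def searcher(inbox, outbox):
    task = inbox.get()
    outbox.put({"role": "searcher", "sources": [f"source for {task}"]})

def analyst(inbox, outbox):
    msg = inbox.get()
    outbox.put({"role": "analyst",
                "insight": f"insight from {len(msg['sources'])} source(s)"})

def coordinate(task):
    to_search, to_analyze, results = Queue(), Queue(), Queue()
    to_search.put(task)
    searcher(to_search, to_analyze)   # agents could equally run in threads
    analyst(to_analyze, results)
    return results.get()
```

Running the agents in separate threads (rather than sequentially, as here) gives the parallelism the pattern is known for, at the cost of the coherence challenges noted above.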

Fig. 6. Multi-Agent Deep Research Architecture

4.1.4 Hybrid Architecture Pattern. Hybrid architectures combine elements from multiple architectural patterns to balance their respective advantages within unified implementations. As shown in Figure 7, these systems employ strategic integration of architectural approaches to address specific research requirements.
Key characteristics of hybrid implementations include:
  • Tiered Architectural Organization: Different architectural patterns are employed at different system levels based on functional requirements
  • Domain-Specific Optimization: Architectural approaches are selected based on domain-specific processing requirements
  • Flexible Integration Mechanisms: Standardized interfaces enable communication between components employing different architectural patterns
  • Adaptive Execution Frameworks: Control mechanisms dynamically adjust processing approaches based on task characteristics
Hybrid architectures offer exceptional flexibility and optimization opportunities but introduce implementation complexity and potential integration challenges. Systems like Perplexity/DeepResearch [209] and Camel-AI/OWL [43] exemplify this approach, combining centralized reasoning with distributed information gathering and specialized processing pipelines to achieve sophisticated research capabilities with balanced performance characteristics.

Fig. 7. Hybrid Deep Research Architecture

4.1.5 Emerging Agent Framework Ecosystems. Beyond the core architectural patterns described above, the Deep Research ecosystem has been significantly enhanced by specialized agent frameworks that provide standardized components for agent development. Emerging systems incorporate specialized agent frameworks [54, 142, 301] that structure reasoning in ways particularly suited to complex research tasks requiring both depth and breadth of analysis. As detailed in comprehensive analyses of agent frameworks [133, 304], these systems offer varying approaches to agent orchestration, execution control, and reasoning orchestration.
Key frameworks include LangGraph [134], which provides graph-based control flow for language model applications, enabling complex reasoning patterns through explicit state management and transition logic. Google’s Agent Development Kit (ADK) [91] offers a comprehensive framework for agent development with standardized interfaces for tool integration, planning, and execution monitoring. CrewAI [64] implements an agent collaboration framework designed specifically for multi-specialist workflows, enabling role-based task distribution with explicit coordination mechanisms. More experimental frameworks like Agno [3] explore agentic autonomy through self-improvement and meta-reasoning capabilities.
The TapeAgents framework [19] provides a particularly comprehensive approach to agent development and optimization, with explicit support for iterative refinement through systematic recording and analysis of agent behavior. These frameworks collectively demonstrate an ongoing shift toward standardized agent components that enhance development efficiency while enabling more complex reasoning and execution patterns.
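As a rough illustration of graph-based control flow in the spirit of LangGraph, the hand-rolled sketch below wires nodes (functions) and edges (transition rules) around explicit state. It is not the LangGraph API itself, and the `draft`/`review` nodes are invented for the example:

```python
# Hand-rolled graph control flow: nodes transform state, edges inspect
# state to pick the next node, and None terminates the run. This mirrors
# the explicit state management these frameworks provide.

def build_graph():
    def draft(state):
        state["draft"] = f"draft of {state['topic']}"
        return state

    def review(state):
        state["approved"] = state["draft"].startswith("draft of")
        return state

    nodes = {"draft": draft, "review": review}
    edges = {
        "draft": lambda s: "review",
        "review": lambda s: None if s["approved"] else "draft",  # loop back on rejection
    }
    return nodes, edges

def run_graph(topic):
    nodes, edges = build_graph()
    state, current = {"topic": topic}, "draft"
    while current is not None:            # explicit state transitions
        state = nodes[current](state)
        current = edges[current](state)
    return state
```

The conditional edge out of `review` is what distinguishes graph control flow from a fixed pipeline: the same machinery supports cycles, retries, and branching that a linear stage sequence cannot express.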

4.1.6 Architectural Pattern Comparison. Table 13 provides a comparative analysis of these architectural patterns across key performance dimensions:

Table 13. Architectural Pattern Characteristics in Deep Research Systems

| Characteristic | Monolithic | Pipeline | Multi-Agent | Hybrid |
| :--- | :--- | :--- | :--- | :--- |
| Control Structure | Centralized | Sequential | Distributed | Mixed |
| Component Coupling | Tight | Loose | Moderate | Variable |
| Failure Propagation | System-wide | Stage-limited | Agent-isolated | Component-dependent |
| Development Complexity | Minimal | Moderate | Substantial | Maximal |
| Deployment Flexibility | Limited | Moderate | Moderate | High |
| Representative Systems | grapeot/deep_research_agent | n8n, dzhng/deep-research | smolagents, TARS | Perplexity, Camel-AI/OWL |

benchmarking across identical tasks and environments.
Each architectural pattern presents distinct advantages and limitations that influence its suitability for specific Deep Research applications. Monolithic architectures excel in reasoning coherence and implementation simplicity, making them appropriate for focused research applications with well-defined workflows. Pipeline architectures offer exceptional extensibility and component reusability, enabling customized research workflows through modular composition. Multi-agent architectures provide superior parallelization and fault tolerance, supporting complex research tasks requiring diverse specialized capabilities. Hybrid architectures balance these characteristics through strategic integration, offering flexible optimization for diverse research requirements.
The architectural pattern selection significantly influences system capabilities, performance characteristics, and application suitability. As the Deep Research ecosystem continues to evolve, we anticipate further architectural innovation combining elements from these foundational patterns to address emerging application requirements and technical capabilities.

4.2 Infrastructure and Computational Optimization

Deep Research systems require sophisticated infrastructure to support their complex reasoning and information processing capabilities.

4.2.1 Distributed Reasoning Architectures. Effective reasoning across expansive information landscapes requires specialized architectural approaches. Frameworks like AutoChain [78] and AutoGen [298] have pioneered distributed agent paradigms that can be applied to research workflows.
Implementation approaches increasingly leverage specialized frameworks for efficient LLM serving, including LightLLM [177], Ollama [192], VLLM [281], and Web-LLM [176] for browser-based deployment.
These frameworks enable more efficient utilization of computational resources, particularly important for resource-intensive research workflows requiring extensive model inference. Such optimizations are especially critical for open-source implementations operating with more constrained computational resources compared to commercial cloud-based alternatives.
Parallel Reasoning Pathways. Advanced systems employ distributed reasoning architectures that decompose complex queries into parallel processing paths. OpenAI/DeepResearch [197] implements a hierarchical reasoning framework that distributes analytical tasks across multiple execution threads while maintaining coherent central coordination. Similar approaches are evident in Gemini/DeepResearch [60], which leverages Google’s distributed computing infrastructure to parallelize information analysis while preserving reasoning consistency.
Open-source implementations like HKUDS/Auto-Deep-Research [112] and Agent-RL/ReSearch [2] demonstrate more accessible distributed reasoning approaches, utilizing task decomposition and asynchronous processing to enhance performance within more constrained computational environments. These systems show that effective parallelization can be achieved even without the extensive infrastructure of commercial platforms.
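The task decomposition and asynchronous processing described above can be sketched with standard-library threading; the decomposition heuristic, the sub-question templates, and the merge step below are placeholders for model and tool calls, not any system's actual logic:

```python
# Sketch of decomposing a research query into sub-questions answered in
# parallel, then merged by a central coordination step.
from concurrent.futures import ThreadPoolExecutor

def decompose(question):
    # Stand-in for an LLM-driven planner that splits the query.
    return [f"{question}: background", f"{question}: evidence", f"{question}: open problems"]

def answer(subquestion):
    # Placeholder for a model/tool call executed on its own thread.
    return f"notes on {subquestion}"

def research(question):
    subs = decompose(question)
    with ThreadPoolExecutor(max_workers=len(subs)) as pool:
        notes = list(pool.map(answer, subs))   # parallel processing paths
    return " | ".join(notes)                   # coherent central merge
```

`pool.map` preserves input order, which keeps the merged output deterministic even though the sub-questions complete asynchronously, a small-scale analogue of the "coherent central coordination" the commercial systems emphasize.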
Memory and State Management. Distributed reasoning introduces significant challenges in memory coherence and state management. Commercial systems implement sophisticated state synchronization mechanisms that maintain consistent reasoning contexts across distributed components. OpenAI’s implementation utilizes a hierarchical memory architecture with explicit coordination protocols [200], while Google’s approach leverages its existing distributed computing frameworks adapted for reasoning workflows.
Open-source alternatives like Camel-AI/OWL [43] employ simplified but effective memory management approaches, including centralized knowledge repositories with controlled access patterns. These implementations demonstrate pragmatic solutions to state management challenges within more constrained technical environments.

4.2.2 Parallel Search and Information Retrieval. Information acquisition represents a primary bottleneck in Deep Research performance:
Concurrent Query Execution. Advanced systems implement sophisticated parallel search infrastructures to accelerate information gathering. Perplexity/DeepResearch [209] employs a multi-threaded search architecture that dispatches dozens of concurrent queries across different information sources, significantly accelerating the research process. Similar capabilities are evident in dzhng/deep-research [321], which implements a specialized scheduler for concurrent web queries with adaptive rate limiting to avoid service restrictions.
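A minimal sketch of concurrent query dispatch with a fixed concurrency cap, one simple form of the rate limiting described above. The URLs and the `fetch` body are placeholders, not the cited systems' actual interfaces:

```python
import asyncio

MAX_CONCURRENT = 8  # cap on in-flight requests, so bursts do not trip per-service limits

async def search_all(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def fetch(url: str) -> str:
        # Placeholder for an HTTP GET; the semaphore bounds concurrency.
        async with semaphore:
            await asyncio.sleep(0)
            return f"content of {url}"

    # All queries are scheduled at once; at most MAX_CONCURRENT run simultaneously.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(search_all([f"https://example.org/doc{i}" for i in range(20)]))
```

Adaptive variants would adjust the semaphore size per source based on observed error or throttling responses.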
Infrastructure tools like Nanobrowser [184] provide optimized platforms for parallel browsing operations, enabling multiple concurrent page loads with shared resource management. These specialized components enhance the information gathering capabilities of integrated systems like Manus [164] and Flowith/OracleMode [77], which leverage concurrent browsing to accelerate their research workflows.
Query Coordination and Deduplication. Effective parallel search requires sophisticated coordination to avoid redundancy and ensure comprehensive coverage. Commercial systems implement advanced query planning that dynamically adapts to intermediate results, adjusting search strategies based on discovered information. OpenAI’s implementation includes explicit deduplication mechanisms that identify and consolidate redundant sources, while Perplexity employs source diversification techniques to ensure broad coverage.
Open-source tools like nickscamara/open-deep-research [42] implement pragmatic approaches to query coordination, including simple but effective caching mechanisms and result fingerprinting to avoid redundant processing. These techniques demonstrate that effective coordination can be achieved with relatively straightforward implementation approaches.
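Result fingerprinting of this kind can be implemented with a content hash over normalized text. The class below is an illustrative sketch under that assumption, not code from the cited repository:

```python
import hashlib

class ResultCache:
    """Deduplicate retrieved documents by content fingerprint."""

    def __init__(self):
        self._seen: set[str] = set()

    def fingerprint(self, text: str) -> str:
        # Normalize whitespace and case so trivially reformatted copies collide.
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode()).hexdigest()

    def add(self, text: str) -> bool:
        """Return True if the document is new, False if it duplicates a prior result."""
        fp = self.fingerprint(text)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

cache = ResultCache()
first = cache.add("Deep Research systems automate workflows.")
dup = cache.add("Deep  Research\nsystems automate workflows.")
```

Production systems would likely use fuzzier fingerprints (shingling, MinHash) to catch near-duplicates, but the exact-hash cache already removes verbatim redundancy cheaply.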

4.2.3 Resource Allocation and Efficiency Optimization. Computational efficiency significantly impacts both performance and operational economics:
Adaptive Resource Allocation. Advanced systems implement dynamic resource allocation based on task characteristics and complexity. Gemini/DeepResearch [60] employs sophisticated workload prediction to provision computational resources adaptively, allocating additional capacity for more complex research tasks. Similar approaches are emerging in open-source implementations like QwenLM/Qwen-Agent [224], which incorporates task complexity estimation to guide resource allocation decisions.
Progressive Processing Strategies. Efficiency-focused implementations employ progressive processing approaches that incrementally refine results based on available information. Perplexity/DeepResearch [209] utilizes a staged analysis approach that provides preliminary findings quickly while continuing deeper analysis in the background. This strategy enhances perceived responsiveness while ensuring comprehensive results for complex queries.
Open-source alternatives like mshumer/OpenDeepResearcher [249] implement simpler but effective progressive strategies, including early result previews and incremental report generation. These approaches demonstrate pragmatic solutions to efficiency challenges without requiring sophisticated infrastructure.
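Incremental report generation of this style can be expressed as a generator that yields a quick preview first and then progressively refined drafts. The function and its messages are hypothetical:

```python
from typing import Iterator

def progressive_report(sources: list[str]) -> Iterator[str]:
    # Yield an early preview, then refine as each source is analyzed,
    # mirroring staged analysis that surfaces preliminary findings quickly.
    findings: list[str] = []
    yield "Preliminary answer pending; analyzing %d sources..." % len(sources)
    for src in sources:
        findings.append(f"evidence from {src}")
        yield "Interim report: " + "; ".join(findings)

updates = list(progressive_report(["source-A", "source-B"]))
```

A user interface would stream each yielded draft immediately, so perceived latency is bounded by the first source rather than the whole analysis.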

4.3 System Integration and Interoperability

Deep Research systems must effectively coordinate diverse components and external services to deliver comprehensive capabilities.

4.3.1 API Design and Standardization. Consistent interfaces enable modular development and component interoperability:
Component Interface Standardization. Current Deep Research implementations employ largely incompatible architectures and interfaces. Future research could build upon emerging standardization efforts like Anthropic’s Model Context Protocol (MCP) [12] and Google’s Agent2Agent Protocol (A2A) [90, 92] to establish truly universal component interfaces. MCP provides a structured framework for model-tool interaction, enabling consistent integration patterns across diverse LLM applications, while A2A focuses on standardized agent-to-agent communication to facilitate multi-agent systems. These complementary approaches could form the foundation for comprehensive standardization enabling modular development and interchangeable components across implementations. Early steps in this direction appear in frameworks like OpenAI/AgentsSDK [199], which provides standardized agent definitions, but more comprehensive standardization would require broader industry adoption of common protocols.
Workflow Automation. Several workflow automation platforms like Dify [259], Coze [38], and Flowise [5] have emerged as low-code environments for building LLM-powered applications, potentially offering standardized frameworks for Deep Research components. Advanced workflow orchestration platforms including Temporal [265], Restate [229], and Orkes [203] provide robust infrastructure for complex, stateful workflows with explicit support for long-running processes and reliability patterns crucial for sophisticated research applications. Implementation approaches might include defining standard message passing protocols between research components, establishing common data structures for research tasks and results, developing compatibility layers between competing standards, extending existing protocols with research-specific interaction patterns, and establishing common evaluation frameworks for component interoperability. These advances could accelerate ecosystem development by enabling specialized components from diverse developers to work seamlessly within unified frameworks, significantly enhancing the pace of innovation through componentization and reuse.
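As one illustration of a standard message structure for passing research tasks between components, a minimal JSON-serializable schema might look as follows. The field names are assumptions for the sketch, not part of MCP, A2A, or any cited platform:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ResearchTask:
    # Hypothetical common data structure exchanged between research components.
    task_id: str
    query: str
    sender: str       # component emitting the task, e.g. a planner
    recipient: str    # component expected to execute it, e.g. a retriever
    artifacts: list[str] = field(default_factory=list)  # attached intermediate results

    def to_json(self) -> str:
        return json.dumps(asdict(self))

msg = ResearchTask("t-001", "survey LLM agents", "planner", "retriever")
wire = msg.to_json()                       # serialized for transport between components
decoded = ResearchTask(**json.loads(wire)) # round-trips losslessly
```

A shared schema like this is what lets specialized components from different developers interoperate: each side only needs to agree on the wire format, not on internals.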
External Service Integration. Access to specialized external services significantly enhances research capabilities. Advanced retrieval frameworks like LlamaIndex [235] provide standardized interfaces for retrieval augmentation, enabling consistent integration patterns across diverse information sources and document formats. Systems like n8n [183] excel in external service integration through their comprehensive connector library and standardized authentication mechanisms. This capability enables access to specialized information sources and analytical services that extend beyond basic web search.
Open-source frameworks like Jina-AI/node-DeepResearch [121] implement simplified but effective API integration patterns, providing standardized wrappers for common services while maintaining extensibility for custom integrations. These approaches balance standardization with flexibility for diverse research requirements.

4.3.2 Tool Integration Frameworks. Effective orchestration of diverse tools enhances overall system capabilities:
Tool Selection and Composition. Advanced systems implement sophisticated tool selection based on task requirements and information context. Manus [164] features an adaptive tool selection framework that identifies appropriate tools for specific research subtasks, dynamically composing workflows based on available capabilities. Similar approaches are emerging in open-source implementations like grapeot/deep_research_agent [263], which includes basic tool selection heuristics based on task classification.
Tool Execution Monitoring. Reliable tool usage requires effective execution monitoring and error handling. Commercial systems implement sophisticated monitoring frameworks that track tool execution, detect failures, and implement recovery strategies. OpenAI’s implementation includes explicit success criteria verification and fallback mechanisms for tool failures, ensuring reliable operation even with unreliable external components.
Open implementations like Agent-RL/ReSearch [2] demonstrate more accessible monitoring approaches, including simplified execution tracking and basic retry mechanisms for common failure modes. These implementations show that effective monitoring can be achieved with relatively straightforward implementation strategies.
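The simplified monitoring pattern described here, explicit success-criteria verification plus bounded retries with exponential backoff, can be sketched as below. `flaky_search` is a hypothetical tool used only to exercise the wrapper:

```python
import time

def run_with_retry(tool, args, check, retries=3, base_delay=0.01):
    """Execute a tool call, verify its output, and retry on failure."""
    last_error = None
    for attempt in range(retries):
        try:
            result = tool(*args)
            if check(result):          # explicit success-criteria verification
                return result
            last_error = ValueError("success criteria not met")
        except Exception as exc:       # transient tool failure
            last_error = exc
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError(f"tool failed after {retries} attempts: {last_error}")

calls = {"n": 0}
def flaky_search(q):
    # Hypothetical unreliable tool: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("service unavailable")
    return ["result for " + q]

hits = run_with_retry(flaky_search, ("deep research",), check=bool)
```

More elaborate frameworks add fallbacks (switching to an alternative tool after retries are exhausted), but the verify-then-retry loop is the core of reliable tool execution.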
Recent advances in agent collaboration frameworks [145, 221] highlight significant challenges in agent coordination [46], particularly for complex research tasks requiring diverse, specialized capabilities working in concert toward unified research objectives.

4.3.3 Cross-Platform Compatibility. Deployment flexibility requires careful attention to environmental dependencies:
Platform Abstraction Layers. Cross-platform implementations employ abstraction layers to isolate core logic from environmental dependencies. TARS [39] implements a sophisticated abstraction architecture that separates its core reasoning framework from platform-specific integration components, enabling deployment across diverse environments. Similar approaches are evident in Nanobrowser [184], which provides consistent browsing capabilities across different operating systems.
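The separation such abstraction layers achieve can be illustrated with an interface that core logic depends on while platform backends vary. The class names here are hypothetical, not taken from TARS or Nanobrowser:

```python
from abc import ABC, abstractmethod

class Browser(ABC):
    """Platform abstraction: core research logic depends only on this interface."""
    @abstractmethod
    def open(self, url: str) -> str: ...

class HeadlessBrowser(Browser):
    # One platform-specific backend; mobile or remote backends plug in the same way.
    def open(self, url: str) -> str:
        return f"<html>rendered {url}</html>"

def collect(browser: Browser, urls: list[str]) -> list[str]:
    # Core logic is identical regardless of which backend is injected.
    return [browser.open(u) for u in urls]

pages = collect(HeadlessBrowser(), ["https://example.org"])
```

Swapping deployment targets then means implementing one new `Browser` subclass rather than touching the reasoning pipeline.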
Containerization and Deployment Standardization. Modern implementations leverage containerization to ensure consistent deployment across environments. OpenManus [193] provides explicit container configurations that encapsulate all dependencies, enabling reliable deployment across diverse infrastructures. Similar approaches are employed by AutoGLM-Research [330], which provides standardized deployment configurations for different environments. Alongside containerization, modern cloud platforms such as Vercel [280] offer streamlined, standardized deployment workflows for the web-based interfaces of many research applications.

4.3.4 Research-Oriented Coding Assistance Integration. The integration of AI-powered coding assistants represents an increasingly important dimension of Deep Research system capabilities, particularly for computational research workflows requiring custom analysis scripts, data processing pipelines [108], and research automation tools.
Coding Assistant Integration Patterns. Modern research workflows increasingly depend on custom code development for data analysis, visualization, and automation tasks. AI coding assistants have emerged as crucial tools for enhancing researcher productivity in these computational aspects. The landscape of coding assistance tools demonstrates varying approaches to integration with research workflows, from IDE-native completion systems to conversational code generation interfaces. Systems like GitHub Copilot [20, 86] provide seamless integration within development environments, enabling context-aware code completion for research scripts and analysis workflows. Complementary approaches like ChatGPT-based code generation [309] offer conversational interfaces that can translate research requirements into executable implementations. More specialized frameworks like AutoDev [275], DSPy [257], and Pydantic-AI [216] enable end-to-end automated development workflows particularly suited for research prototype generation and experimental tool creation. Additionally, tools like Bolt [32] allow researchers to create web applications directly from text descriptions, handling the coding process while they focus on their vision. Evolutionary coding agents like AlphaEvolve [190] further enhance capabilities by iteratively optimizing algorithms using autonomous pipelines of LLMs and evolutionary feedback mechanisms. Recent research explores the synergy between generative AI and software engineering, leveraging techniques like zero-shot prompting to enhance coding assistants and streamline development processes [41]. However, research has revealed limitations in these assistants’ capabilities, such as ambiguous beliefs regarding research claims and a lack of credible evidence to support their responses [35]. A large-scale survey demonstrates that developers frequently decline initial suggestions, citing unmet functional or non-functional requirements and challenges in controlling the tool to generate desired outputs [148]. User resistance behaviors documented in such surveys highlight the need for comprehensive adoption strategies, including providing active support during initial use, clearly communicating system capabilities, and adhering to predefined collaboration rules to mitigate low acceptance rates [252]. This underscores the need for adaptive hint systems, which can provide personalized support for bug finding and fixing by tailoring to user understanding levels and program representations to improve accuracy in debugging tasks [226]. Pioneering studies employ physiological measurements such as EEG and eye tracking to quantify developers’ cognitive load during AI-assisted programming tasks, addressing critical gaps in understanding actual usage patterns and productivity impacts [106]. Furthermore, tools like CodeScribe address challenges in AI-driven code translation for scientific computing by combining prompt engineering with user supervision to automate conversion processes while ensuring correctness [69]. Similarly, CodeCompose’s multi-line suggestion feature deployed at Meta demonstrates substantial productivity improvements, saving 17% of keystrokes through optimized latency solutions despite initial usability challenges [72].
Moreover, for debugging tasks, ChatDBG [139] enhances debugging capabilities by enabling programmers to engage in collaborative dialogues for root cause analysis and bug resolution, leveraging LLMs to provide domain-specific reasoning. Intelligent QA assistants are also being developed to streamline bug resolution processes [308], and grey literature reviews indicate a growing trend in AI-assisted test automation [231]. Additionally, benchmarks like CodeMMLU [163] evaluate code understanding and reasoning across diverse tasks, revealing significant comprehension gaps in current models despite advanced generative capabilities. Empirical evaluations of ACATs through controlled development scenarios demonstrate nuanced variations in acceptance patterns, modification reasons, and effectiveness based on task characteristics and user expertise [260]. Generative AI tools significantly enhance developer productivity by accelerating learning processes and altering collaborative team workflows through reduced repetitive tasks, fundamentally transforming development paradigms [277]. To realize the vision of next-generation AI coding assistants, it is crucial to address integration gaps and establish robust design principles such as setting clear usage expectations and employing extendable backend architectures [186].
Table 14. Qualitative Assessment of AI Coding Assistants for Research Applications
| System | Documented Capabilities | Integration Approach | Evaluation Evidence | Research-Specific Features |
| :--- | :--- | :--- | :--- | :--- |
| GitHub Copilot [86, 319] | Code completion, documentation | IDE-native integration | User study on practices [319] | Limited domain specialization |
| Amazon CodeWhisperer [175] | Security-focused suggestions | AWS ecosystem integration | Comparative evaluation [309] | Cloud research workflows |
| ChatGPT Code [309] | Conversational code generation | API-based interaction | Code quality assessment [309] | Natural language specification |
| Cursor [65] | Context-aware completion | Codebase integration | No published evaluation | Repository-level understanding |
| Codeium [206] | Multi-language support | Editor extensions | Comparative benchmark [206] | Analysis workflow support |
| AutoDev [275] | Automated development | Task automation pipeline | Empirical evaluation [275] | End-to-end implementation |
| GPT-Pilot [217] | Project scaffolding | Guided development process | Repository demonstrations | Research prototype generation |

Note: Capabilities and evaluations based on published studies and documented features. Comparative performance requires standardized evaluation across identical tasks.
The diversity of coding assistance approaches highlights the importance of integration flexibility within Deep Research systems. While some implementations benefit from tightly integrated coding assistance that understands research context, others require more flexible interfaces that can accommodate diverse development workflows and programming paradigms. This integration dimension becomes particularly crucial as research increasingly requires custom computational tools and analysis pipelines that extend beyond pre-existing software packages [75, 244, 295]. Recent work by Chen et al. [53] demonstrates that proactive programming assistants, which automatically provide suggestions to enhance productivity and user experience, represent a key advancement in this domain. Additionally, ChatDev [220] exemplifies how linguistic communication serves as a unifying bridge for multi-agent collaboration in software development, streamlining the entire lifecycle from design to testing. Moreover, research on integrating AI assistants in Agile meetings reveals critical links to team collaboration dynamics and provides roadmaps for facilitating their adoption in development contexts [40]. As demonstrated by Talissa Dreossi [70], this hybrid approach bridges the gap between the high performance of deep learning models and the transparency of symbolic reasoning, advancing AI by providing interpretable and trustworthy applications.
Research Workflow Code Generation. Advanced coding assistants specifically optimized for research contexts demonstrate particular value in translating research methodologies into executable implementations. Systems like GPT-Pilot [217] enable guided development of complete research applications, while domainspecific tools can generate analysis scripts aligned with particular research methodologies or data types. These capabilities enhance research efficiency by reducing the technical barriers between research design and computational implementation.
Implementation patterns typically involve integration with research data management systems, version control workflows, and collaborative development environments that support reproducible research practices. The effectiveness of such integration depends significantly on the coding assistant’s understanding of research-specific requirements including documentation standards, reproducibility considerations, and domain-specific libraries and frameworks commonly used in particular research fields [124].

4.4 Technical Challenges and Solutions

Deep Research systems face numerous technical challenges that must be addressed for reliable, trustworthy operation.

4.4.1 Hallucination Control and Factual Consistency. Maintaining factual accuracy represents a fundamental challenge for LLM-based research systems:
Source Grounding Techniques. Advanced implementations employ explicit source grounding to enhance factual reliability. Perplexity/DeepResearch [209] implements strict attribution requirements that link all generated content to specific sources, reducing unsupported assertions. Similar approaches are evident in OpenAI/DeepResearch [197], which maintains explicit provenance tracking throughout the reasoning process.
Open-source implementations like grapeot/deep_research_agent [263] demonstrate more accessible grounding approaches, including simple but effective citation tracking and verification mechanisms. These techniques show that meaningful improvements in factual reliability can be achieved with straightforward implementation strategies.
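A straightforward citation-tracking mechanism of the kind these open-source systems employ might require a source for every claim and render numbered references. The sketch below is illustrative, with invented class and method names, not code from the cited repository:

```python
class CitedDraft:
    """Attach a source to every claim so each statement stays verifiable."""

    def __init__(self):
        self.claims: list[tuple[str, str]] = []  # (claim text, source URL)

    def add_claim(self, text: str, source_url: str):
        if not source_url:
            raise ValueError("unsupported assertion: every claim needs a source")
        self.claims.append((text, source_url))

    def render(self) -> str:
        # Number sources in order of first appearance, then emit body + bibliography.
        refs = {u: i + 1 for i, u in enumerate(dict.fromkeys(u for _, u in self.claims))}
        body = " ".join(f"{t} [{refs[u]}]" for t, u in self.claims)
        bib = "\n".join(f"[{i}] {u}" for u, i in refs.items())
        return body + "\n" + bib

d = CitedDraft()
d.add_claim("LLM agents parallelize retrieval.", "https://example.org/a")
d.add_claim("Grounding reduces hallucination.", "https://example.org/b")
report = d.render()
```

Refusing to accept a claim without a source is the programmatic analogue of the strict attribution requirements described above.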
Contradiction Detection and Resolution. Effective research requires identification and resolution of contradictory information. Commercial systems implement sophisticated contradiction detection mechanisms that identify inconsistencies between sources and implement resolution strategies [296]. Gemini/DeepResearch [60] includes explicit uncertainty modeling and conflicting evidence presentation, enhancing transparency when definitive conclusions cannot be reached.
Open implementations like HKUDS/Auto-Deep-Research [112] employ simpler but useful contradiction identification approaches, flagging potential inconsistencies for user review. These implementations demonstrate that even basic contradiction handling can significantly enhance research reliability.
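Basic contradiction flagging for user review can be as simple as grouping extracted claims by topic and reporting topics where sources disagree. The data and function below are hypothetical illustrations, not the cited system's logic:

```python
from collections import defaultdict

def find_contradictions(claims):
    """Flag topics where sources report different values, for user review."""
    by_topic = defaultdict(set)
    for topic, value, source in claims:
        by_topic[topic].add((value, source))
    flagged = {}
    for topic, pairs in by_topic.items():
        values = {v for v, _ in pairs}
        if len(values) > 1:            # at least two sources disagree on this topic
            flagged[topic] = sorted(pairs)
    return flagged

# Synthetic extracted claims: (topic, claimed value, source).
claims = [
    ("GPT-4 release year", "2023", "site-A"),
    ("GPT-4 release year", "2022", "site-B"),
    ("GPT-4 modality", "multimodal", "site-A"),
]
conflicts = find_contradictions(claims)
```

Even this coarse set-comparison surfaces disagreements that would otherwise be silently averaged away during synthesis.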

4.4.2 Privacy Protection and Security Design. Research systems must safeguard sensitive information and protect against potential misuse:
Query and Result Isolation. Secure implementations employ strict isolation between user queries to prevent information leakage. Commercial platforms implement sophisticated tenant isolation that ensures complete separation between different users’ research activities. Similar concerns motivate open-source implementations like OpenManus [193], which enables local deployment for sensitive research applications.
Source Data Protection. Responsible implementation requires careful handling of source information. Systems like Flowith/OracleMode [77] implement controlled data access patterns that respect source restrictions including authentication requirements and access limitations. These approaches enhance compliance with source terms of service while ensuring comprehensive information access. Recent advancements include benchmarking frameworks such as CI-Bench [56], which evaluates how well systems adhere to contextual norms and privacy expectations.

4.4.3 Explainability and Transparency. The scientific context places particularly stringent requirements on explanation quality. Mengaldo [170] argues that transparent explanation is not merely a feature but a fundamental requirement for scientific applications, emphasizing that black-box approaches fundamentally contradict scientific methodology’s requirement for transparent reasoning and reproducible results. This perspective suggests that explanation capabilities may require different standards in scientific Deep Research applications compared to general AI systems. Trustworthy research systems must provide insight into their reasoning processes and sources:
Reasoning Trail Documentation. Advanced implementations maintain explicit documentation of the reasoning process. OpenAI/DeepResearch [197] includes comprehensive reasoning traces that expose the analytical steps leading to specific conclusions. Similar capabilities are emerging in open-source alternatives like mshumer/OpenDeepResearcher [249], which includes basic reasoning documentation to enhance result interpretability.
Source Attribution and Verification. Transparent systems provide clear attribution for all information and enable verification. Perplexity/DeepResearch [209] implements comprehensive citation practices with explicit links to original sources, enabling direct verification of all claims. Similar approaches are employed by dzhng/deep-research [321], which maintains rigorous source tracking throughout the research process.
These implementation technologies and challenges highlight the complex engineering considerations involved in creating effective Deep Research systems. While commercial platforms benefit from extensive infrastructure and specialized components, open-source implementations demonstrate that effective research capabilities can be achieved through pragmatic approaches to the same fundamental challenges. The diversity of implementation strategies across the ecosystem reflects different priorities in balancing capability, efficiency, reliability, and accessibility.
实现策略的多样性反映了在能力、效率、可靠性和可访问性之间权衡时的不同优先级。

5 Evaluation Methodologies and Benchmarks

Rigorous evaluation of Deep Research systems presents unique challenges due to their complex capabilities and diverse application contexts. This section examines established frameworks for assessment, identifies emerging evaluation standards, and analyzes the strengths and limitations of current approaches.

Fig. 8. Multi-dimensional Evaluation Framework for Deep Research Systems

5.1 Functional Evaluation Frameworks

Functional evaluation assesses core capabilities essential to effective research performance.

5.1.1 Task Completion Capability Assessment. The ability to successfully complete research tasks represents a fundamental evaluation dimension:
Task Success Rate Metrics. Quantitative assessment of task completion provides objective performance measures. Standardized evaluation suites like WebArena [332] measure successful completion of web-based research tasks. For instance, AutoGLM [330] achieves a 55.2% success rate on VAB-WebArena-Lite (improving to 59.1% on a second attempt) and 96.2% on OpenTable evaluation tasks. Similarly, benchmarks like MobileArena evaluate successful completion of mobile interface tasks, where AutoGLM [330] demonstrates a 36.2% success rate on AndroidLab and 89.7% on common tasks in popular Chinese apps [153]. Domain-specific benchmarks, such as AutoPenBench for generative agents in penetration testing [85], provide further targeted assessments.
These benchmarks provide meaningful comparative metrics, though with limitations in representing real-world research complexity. Perplexity/DeepResearch [209] explicitly highlights this distinction, noting that while benchmark performance provides comparative indicators, practical effectiveness depends significantly on task characteristics and domain specifics.
Multi-Attempt Resolution Rates. Effective research often involves iterative refinement with multiple attempts. Advanced evaluation frameworks incorporate multi-attempt metrics that assess system resilience and adaptability. AutoGLM [154] demonstrates significant performance improvement with second attempts (55.2% to 59.1% on WebArena-Lite), highlighting the importance of error recovery and adaptive strategies in practical research contexts.
Open-source frameworks like Agent-RL/ReSearch [2] explicitly emphasize iterative improvement through reinforcement learning approaches, demonstrating how evaluation methods that consider adaptability provide more comprehensive assessment than single-attempt metrics alone.
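The single- and multi-attempt figures above can be made concrete with a small metric helper. The sketch below (all trial outcomes are invented for illustration) computes the cumulative success rate after k attempts, the quantity behind the reported improvement from 55.2% to 59.1% on a second attempt:

```python
# Minimal sketch of first-attempt vs. cumulative (multi-attempt) success rates.
# trials[i][j] is True if task i succeeded on attempt j (0-indexed).

def success_rates(trials: list[list[bool]], max_attempts: int) -> list[float]:
    """Return cumulative success rates after 1..max_attempts attempts."""
    rates = []
    for k in range(1, max_attempts + 1):
        solved = sum(any(t[:k]) for t in trials)  # solved within first k tries
        rates.append(solved / len(trials))
    return rates

# Hypothetical log: 4 tasks, up to 2 attempts each.
log = [[True, True], [False, True], [False, False], [True, False]]
print(success_rates(log, 2))  # [0.5, 0.75]
```

The gap between the first and later entries is exactly the error-recovery benefit that multi-attempt metrics are designed to expose.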

5.1.2 Information Retrieval Quality Evaluation. Effective information gathering forms the foundation of successful research:
Search Effectiveness Metrics. Information retrieval quality significantly impacts overall research performance. Evaluation frameworks employ metrics including precision (relevance of retrieved information), recall (comprehensiveness of coverage), and F1 scores (balanced measure of both). Systems like Perplexity/DeepResearch [209] demonstrate particular strength in recall metrics, effectively identifying comprehensive information across diverse sources.
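The precision, recall, and F1 measures named above have standard set-based definitions, sketched here for a single query (the document IDs are hypothetical):

```python
# Standard set-based retrieval metrics for one query.

def retrieval_metrics(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Compute precision, recall, and F1 from retrieved vs. relevant ID sets."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

metrics = retrieval_metrics(
    retrieved={"doc1", "doc2", "doc3", "doc4"},
    relevant={"doc2", "doc3", "doc5"},
)
print(metrics)  # precision 0.5, recall ~0.667, F1 ~0.571
```

In practice these would be averaged over a query set; a system strong on recall, as claimed for Perplexity/DeepResearch above, trades some precision for broader coverage.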
Specialized information retrieval benchmarks like TREC [214] provide standardized assessment of search effectiveness. However, to the best of our knowledge, there is no specific evidence that the Deep Research systems from OpenAI, Google, Perplexity, or any of the open-source projects listed in this survey have been formally evaluated on TREC benchmarks [214]. This limitation motivates domain-specific evaluation approaches that better reflect particular research requirements.
Source Diversity Assessment. Comprehensive research requires balanced information from diverse perspectives and sources. Advanced evaluation frameworks incorporate explicit diversity metrics that assess the breadth of source utilization. Commercial systems like Gemini/DeepResearch [60] emphasize source diversity as a key performance indicator, while open implementations like dzhng/deep-research [321] incorporate specific mechanisms to ensure balanced source consideration.
Emerging evaluation approaches include explicit source spectra analysis that examines distribution across domains, perspectives, and publication types. These methods provide more nuanced assessment of information-gathering quality beyond simple relevance metrics, addressing concerns about potential bias in automated research processes.
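One way such a source-diversity metric could be operationalized is normalized Shannon entropy over the distribution of cited source domains. This is our own illustrative sketch, not a metric documented for any of the systems above, and the domains are invented:

```python
# Normalized Shannon entropy as a simple source-diversity score in [0, 1]:
# 0 means all citations come from one domain, 1 means a uniform spread.
import math
from collections import Counter

def source_diversity(domains: list[str]) -> float:
    counts = Counter(domains)
    n = len(domains)
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))  # divide by maximum entropy

cites = ["nature.com", "arxiv.org", "arxiv.org", "gov.uk", "nature.com"]
print(round(source_diversity(cites), 3))  # ~0.96
```

A fuller "source spectra" analysis would compute such distributions separately per perspective and publication type rather than over raw domains alone.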

5.1.3 Knowledge Synthesis Accuracy Assessment. Transforming information into accurate, coherent insights represents a crucial capability:
Factual Consistency Metrics. Reliable research requires accurate synthesis without introducing errors or misrepresentations. Evaluation frameworks employ fact verification techniques that compare generated content against source materials, identifying potential inaccuracies or unsupported claims. Systems like grapeot/deep_research_agent [263] emphasize factual verification through explicit source linking, enabling direct accuracy assessment. Benchmark suites like TruthfulQA [151] assess the truthfulness of language models under challenging conditions. While specific accuracy figures for OpenAI/DeepResearch [197] and Perplexity/DeepResearch [209] on TruthfulQA [151] are not publicly available, these systems have demonstrated notable performance on other rigorous benchmarks. For instance, OpenAI/DeepResearch [197] achieved 26.6% accuracy on Humanity's Last Exam (HLE) [212], while Perplexity/DeepResearch [209] attained 21.1% accuracy on the same benchmark. The development of unified, fine-grained, and multi-dimensional evaluation frameworks for summarization further advances the ability to assess the quality of synthesized content from LLMs [137]. These metrics provide standardized comparison points, though with recognized limitations in representing the complexity of real-world research synthesis.
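As a deliberately simplified illustration of checking claims against linked sources, the sketch below scores lexical support of a generated claim by its source snippet. Production fact-verification typically relies on entailment models rather than this overlap heuristic, and all strings here are invented:

```python
# Toy factual-consistency check: fraction of claim tokens found in the
# linked source snippet. A stand-in for entailment-based verification.

def support_score(claim: str, source: str) -> float:
    """Return the fraction of (lowercased) claim tokens present in the source."""
    claim_tokens = set(claim.lower().split())
    source_tokens = set(source.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & source_tokens) / len(claim_tokens)

claim = "the model scored 26.6 percent on the benchmark"
source = "on the benchmark the model scored 26.6 percent overall"
print(support_score(claim, source))  # 1.0
```

A low score flags a claim for human review or for a stronger verification pass; it does not by itself prove the claim wrong.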
Logical Coherence Assessment. Effective research requires logically sound integration of information into coherent analyses. Sophisticated evaluation approaches employ reasoning validity assessment that examines logical structures and inference patterns in research outputs. This dimension proves particularly challenging for automated assessment, often requiring expert human evaluation for reliable scoring.
Commercial systems like OpenAI/DeepResearch [197] and Gemini/DeepResearch [60] emphasize logical coherence in their evaluation frameworks, while open-source alternatives like mshumer/OpenDeepResearcher [249] incorporate simplified but useful logical consistency checks. These approaches highlight the importance of sound reasoning in effective research outputs beyond simple factual accuracy.

5.2 Non-Functional Evaluation Metrics

Beyond core functionality, practical effectiveness depends on operational characteristics that impact usability and deployment.

5.2.1 Performance and Efficiency Metrics. Operational efficiency significantly impacts practical utility:

Response Time Profiling. Timeliness represents a crucial dimension of research effectiveness. Evaluation frameworks incorporate response time metrics that measure completion duration across standardized tasks. Commercial systems demonstrate varying performance characteristics, with Perplexity/DeepResearch [209] achieving relatively quick response times (2-5 minutes for moderate tasks) while OpenAI/DeepResearch [197] typically requires longer processing (5-10 minutes) for similar complexity.
Open-source implementations generally demonstrate longer response times, though with significant variation based on implementation approaches and deployment environments. Systems like nickscamara/open-deep-research [42] emphasize accessibility over performance optimization, while QwenLM/Qwen-Agent [224] incorporates specific optimizations to enhance response times within resource constraints.
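A minimal response-time profiling harness of the kind implied above might look as follows; `run_task` is a hypothetical stand-in for a call into a research system, replaced here by a trivial computation:

```python
# Wall-clock latency profiling: repeat a task and report median and p90.
import time
import statistics

def profile(run_task, runs: int = 5) -> dict[str, float]:
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_task()
        durations.append(time.perf_counter() - start)
    deciles = statistics.quantiles(durations, n=10)  # 9 cut points
    return {"median": statistics.median(durations), "p90": deciles[8]}

print(profile(lambda: sum(range(100_000))))
```

Reporting percentiles rather than a single mean matters for Deep Research systems, whose tail latencies (complex tasks) can differ from typical ones by minutes.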
Resource Utilization Assessment. Computational efficiency enables broader deployment and accessibility. Comprehensive evaluation includes resource profiling that measures memory consumption, computational requirements, and energy utilization across standardized workloads. Specialized benchmarks like Minerva assess programmable memory capabilities of language models, offering insights into their efficiency in handling long-context information [300]. Commercial cloud-based systems obscure some of these metrics due to their managed infrastructure, though with operational costs providing indirect resource indicators. Open implementations like Camel-AI/OWL [43] and AutoGLM-Research [330] provide more transparent resource profiles, enabling direct assessment of deployment requirements and operational economics. These metrics highlight significant variation in efficiency across the ecosystem, with implications for practical deployment scenarios and accessibility.
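The transparent resource profiles that open implementations enable can be sketched with the Python standard library alone, here measuring peak heap allocation during a placeholder task (real workloads would be full research runs):

```python
# Peak Python heap allocation during a task, via the stdlib tracemalloc module.
import tracemalloc

def peak_memory_mb(task) -> float:
    """Run `task` and return its peak traced allocation in MiB."""
    tracemalloc.start()
    try:
        task()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)

# A million-element list allocates on the order of several MiB.
print(f"{peak_memory_mb(lambda: [0] * 1_000_000):.1f} MiB")
```

Comparable numbers across standardized workloads give the deployment-cost transparency that managed commercial platforms, as noted above, only expose indirectly through pricing.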

5.2.2 Reliability and Stability Metrics. Consistent performance under diverse conditions ensures practical usability:
Error Rate Analysis. Reliability under challenging conditions significantly impacts user trust and adoption. Robust evaluation frameworks incorporate error rate metrics that measure failure frequency across diverse scenarios. Commercial systems generally demonstrate lower error rates compared to open-source alternatives, though with remaining challenges in complex or novel research contexts.
Specialized reliability testing employs adversarial scenarios designed to trigger failure modes, providing insight into system robustness. Systems like OpenAI/DeepResearch [197] and Agent-RL/ReSearch [2] incorporate explicit error recovery mechanisms that enhance reliability under challenging conditions, highlighting the importance of resilience in practical research applications.
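Because adversarial test suites are often small, raw error rates benefit from an uncertainty estimate. A sketch using the standard Wilson 95% score interval follows (the counts are invented):

```python
# Error rate with a Wilson score confidence interval, to avoid
# over-interpreting failure counts from small test suites.
import math

def error_rate_ci(errors: int, trials: int, z: float = 1.96):
    """Return (rate, low, high) via the Wilson score interval."""
    p = errors / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return p, center - half, center + half

rate, low, high = error_rate_ci(errors=7, trials=50)
print(f"{rate:.2f} [{low:.2f}, {high:.2f}]")  # 0.14 [0.07, 0.26]
```

Two systems whose intervals overlap on a 50-case adversarial suite cannot meaningfully be ranked on that suite alone.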
Long-Term Stability Assessment. Consistent performance over extended operation provides crucial deployment confidence. Comprehensive evaluation includes stability metrics that measure performance consistency across extended sessions and repeated executions. This dimension proves particularly relevant for open-source implementations that must operate in diverse deployment environments with varying infrastructure stability.
Systems like Flowith/OracleMode [77] and TARS [39] emphasize operational stability through robust error handling and recovery mechanisms, enabling reliable performance in production environments. These capabilities highlight the importance of engineering quality beyond core algorithmic performance in practical research applications.

5.2.3 User Experience and Usability Metrics. Effective interaction significantly impacts practical utility:

Interface Usability Assessment. Intuitive interfaces enhance accessibility and effective utilization. Usability evaluation frameworks employ standardized usability metrics including System Usability Scale (SUS) [140] scores and task completion time measurements. Commercial systems typically demonstrate stronger usability characteristics, with Perplexity/DeepResearch [209] particularly emphasizing intuitive interaction for non-technical users. Open-source alternatives show greater variability, with implementations like HKUDS/Auto-Deep-Research [112] incorporating specific interface enhancements to improve accessibility.
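The SUS score cited above follows a fixed scoring procedure: ten responses on a 1-5 scale, where odd-numbered items contribute (score - 1) and even-numbered items contribute (5 - score), with the sum scaled by 2.5 onto a 0-100 range. A direct implementation (the responses are invented):

```python
# Standard System Usability Scale (SUS) scoring for one respondent.

def sus_score(responses: list[int]) -> float:
    """Score ten 1-5 Likert responses on the 0-100 SUS scale."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum(r - 1 if i % 2 == 0 else 5 - r  # index 0,2,4,... = items 1,3,5,...
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0
```

Scores are typically averaged across respondents, with values around 68 conventionally treated as average usability.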
User studies provide more nuanced usability assessment beyond standardized metrics. Evaluations of systems like Manus [164] and Flowith/OracleMode [77] incorporate explicit user feedback to identify interaction challenges and improvement opportunities. These approaches highlight the importance of human-centered design in practical research applications beyond technical performance. Similarly, frameworks such as AdaptoML-UX [87] enable HCI researchers to employ automated ML pipelines without specialized expertise, facilitating robust model development and customization.
Learning Curve Assessment. Approachability for new users significantly impacts adoption and effective utilization. Comprehensive evaluation includes learning curve metrics that measure time-to-proficiency across user segments with varying technical backgrounds. Commercial systems generally demonstrate gentler learning curves, with Perplexity/DeepResearch [209] explicitly designed for accessibility to non-technical users.
Open implementations show greater variability, with systems like n8n [183] requiring more technical expertise for effective deployment and utilization. More accessible alternatives like nickscamara/open-deep-research [42] incorporate simplified interfaces designed for broader accessibility, highlighting diverse approaches to the accessibility-sophistication balance across the ecosystem.

5.3 Cross-Domain Evaluation Benchmarks

Standardized benchmarks enable objective comparison across systems and domains.

5.3.1 Academic Research Task Benchmarks. Specialized benchmarks assess capabilities relevant to scholarly research:
Literature Review Benchmarks. Comprehensive literature synthesis represents a fundamental academic research task requiring sophisticated information retrieval, critical analysis, and synthesis capabilities. To the best of our knowledge, no benchmark suite is specifically designed to evaluate systems' ability to identify relevant literature, synthesize key findings, and highlight research gaps across scientific domains. We propose leveraging existing high-quality literature reviews published in Nature Reviews journals as gold standards. Citation networks from academic knowledge graphs, such as Microsoft Academic Graph, Semantic Scholar Academic Graph, and Open Academic Graph, could provide complementary evaluation data by measuring a system's ability to traverse citation relationships and identify seminal works [1, 31].
While direct literature review benchmarks remain underdeveloped, several indirect benchmarks offer insight into related capabilities. OpenAI/DeepResearch [197] demonstrates leading performance, achieving 26.6% accuracy on Humanity's Last Exam (HLE) [212] and averaging 72.57% on the GAIA benchmark [172], reflecting strong performance in complex reasoning tasks essential for literature synthesis. Similarly, Perplexity/DeepResearch [209] achieves 21.1% accuracy on HLE [212] and 93.9% on SimpleQA [290], indicating robust factual retrieval capabilities.
These benchmarks include challenging cases requiring integration across multiple disciplines, identification of methodological limitations, and disambiguation of conflicting findings, all crucial for effective literature review. Such tasks demonstrate the importance of sophisticated reasoning capabilities beyond simple information retrieval. While specific performance metrics for systems like Camel-AI/OWL [43] are not publicly available, their specialized academic optimization suggests potential effectiveness in handling complex synthesis tasks.
Methodology Evaluation Benchmarks. Critical assessment of research methodology requires sophisticated analytical capabilities. To the best of our knowledge, no benchmark is specifically designed for quantitative methodology assessment of strengths and limitations. A comprehensive methodology evaluation benchmark would need to assess a system’s ability to identify flaws in research design, statistical approaches, sampling methods, and interpretive limitations across diverse disciplines. An effective benchmark might incorporate multi-layered evaluation criteria including: reproducibility assessment, identification of confounding variables, appropriate statistical power analysis, and proper handling of uncertainty. Future benchmarks could utilize expert-annotated corpora of research papers with methodological strengths and weaknesses clearly marked, creating a gold standard against which systems’ analytical capabilities can be measured while minimizing bias through diverse evaluation metrics that reflect methodological best practices across different fields of inquiry.
Beyond standard benchmarks, case study evaluations of complete AI scientist systems provide valuable insights into current capabilities. Beel et al. [24] conduct a detailed assessment of Sakana’s AI Scientist for autonomous research, examining whether current implementations represent genuine progress toward “Artificial Research Intelligence” or remain limited in fundamental ways, highlighting the gap between current benchmarks and comprehensive research capability evaluation.

5.3.2 Business Analysis Task Benchmarks. Standardized evaluation for business intelligence applications:

Market Analysis Benchmarks. Strategic decision support necessitates a comprehensive understanding of market dynamics. Advanced AI systems, such as OpenAI/DeepResearch [197], are designed to analyze competitive landscapes, identify market trends, and generate strategic recommendations based on diverse business information. OpenAI/DeepResearch has demonstrated significant capabilities in handling complex, multi-domain data analysis tasks, providing detailed insights and personalized recommendations. Similarly, Google’s Gemini/DeepResearch [60] offers robust performance in processing extensive datasets, delivering concise and factual reports efficiently.
These benchmarks include challenging scenarios requiring integration of quantitative financial data with qualitative market dynamics and regulatory considerations. Such tasks highlight the importance of both analytical depth and domain knowledge, with systems like Manus [164] demonstrating strong performance through specialized business intelligence capabilities.
Financial Analysis Benchmarks. Economic assessment requires sophisticated quantitative reasoning combined with contextual understanding of market dynamics. The FinEval benchmark [103] provides a standardized framework for measuring systems' capabilities in analyzing financial statements, evaluating investment opportunities, and assessing economic risk factors across diverse scenarios. To our knowledge, no Deep Research projects have yet published official FinEval benchmark results, though several commercial demonstrations suggest strong performance in this domain. OpenAI/DeepResearch [197] has demonstrated particular strength in quantitative financial analysis through its ability to process complex numerical data while incorporating relevant market context. Meanwhile, open-source implementations show more variable performance, though specialized systems like n8n [183] achieve competitive results through strategic integration with financial data sources and analytical tools. These patterns highlight the critical importance of domain-specific integrations and data accessibility in financial analysis applications, extending beyond core language model capabilities to create truly effective analytical systems.

5.3.3 General Knowledge Management Benchmarks. Broad applicability assessment across general research domains:
Factual Research Benchmarks. Accurate information gathering forms the foundation of effective research. The SimpleQA benchmark [290] evaluates language models’ ability to answer short, fact-seeking questions with a single, indisputable answer. Perplexity/DeepResearch [209] demonstrates exceptional performance on this benchmark, achieving an accuracy of 93.9% [209]. OpenAI’s Deep Research tool, integrated into ChatGPT, offers comprehensive research capabilities, though specific accuracy metrics on SimpleQA [290] are not publicly disclosed [197]. Similarly, Google’s Gemini/DeepResearch provides robust information synthesis features, but detailed performance data on SimpleQA [290] is not available.
These metrics provide useful baseline performance indicators, though with recognized limitations in representing more complex research workflows. Comparative evaluation highlights the importance of information quality beyond simple factual recall, with sophisticated systems demonstrating more nuanced performance profiles across complex tasks.
Humanities and Social Sciences Benchmarks. Comprehensive evaluation requires assessment beyond STEM domains. The MMLU benchmark [33] evaluates systems’ performance across humanities and social science research tasks, including historical analysis, ethical evaluation, and social trend identification. Performance shows greater variability compared to STEM-focused tasks, with generally lower accuracy across all systems while maintaining similar relative performance patterns. These benchmarks highlight remaining challenges in domains requiring nuanced contextual understanding and interpretive reasoning. Commercial systems maintain performance leads, though with open alternatives like smolagents/open_deep_research [115] demonstrating competitive capabilities in specific humanities domains through specialized component design.

5.4 Emerging Evaluation Approaches
5.4 新兴评估方法

Beyond established benchmarks, novel evaluation methods address unique aspects of Deep Research performance.
Interactive Evaluation Frameworks. Traditional static benchmarks often fail to capture the dynamic and interactive nature of real-world research workflows. To address this gap, interactive evaluation frameworks have been developed to assess AI systems' abilities to iteratively refine research strategies through multiple interaction rounds. Notably, QuestBench [141] is a novel benchmark that specifically assesses an AI system's ability to identify missing information and ask appropriate clarification questions, a crucial skill for real-world research scenarios where problems are often underspecified. To the best of our knowledge, no Deep Research system examined in this survey has yet been publicly evaluated using QuestBench. Nonetheless, these systems have demonstrated strong performance in other interactive evaluations, highlighting their effectiveness in supporting iterative research processes.
Multimodal Research Evaluation. Comprehensive research increasingly involves diverse content modalities. Advanced evaluation frameworks incorporate multimodal assessment that measures systems’ ability to integrate information across text, images, data visualizations, and structured content. Commercial systems generally demonstrate stronger multimodal capabilities, with Gemini/DeepResearch [60] particularly excelling in image-inclusive research tasks.
Open implementations show emerging multimodal capabilities, with systems like Jina-AI/node-DeepResearch [121] incorporating specific components for multimodal content processing. These approaches highlight the growing importance of cross-modal integration in practical research applications beyond text-centric evaluation.
Ethical and Bias Assessment. Responsible research requires careful attention to ethical considerations and potential biases. Comprehensive evaluation increasingly incorporates explicit assessment of ethical awareness, bias detection, and fairness in information processing. Commercial systems implement sophisticated safeguards, with OpenAI/DeepResearch [197] incorporating explicit ethical guidelines and bias mitigation strategies. Open implementations show varied approaches to these considerations, with systems like grapeot/deep_research_agent [263] emphasizing transparency in source selection and attribution.
These evaluation dimensions highlight the importance of responsibility beyond technical performance, addressing growing concerns about potential amplification of existing information biases through automated research systems. Ongoing development of standardized ethical evaluation frameworks represents an active area of research with significant implications for system design and deployment.
The diverse evaluation approaches outlined in this section highlight both the complexity of comprehensive assessment and the ongoing evolution of evaluation methodologies alongside system capabilities. While standard benchmarks provide useful comparative metrics, practical effectiveness depends on alignment between system capabilities, evaluation criteria, and specific application requirements. This alignment represents a key consideration for both system developers and adopters seeking to integrate Deep Research capabilities into practical workflows.

5.5 Comparative Evaluation Methodology

To ensure systematic and consistent evaluation across diverse Deep Research systems, we have developed a comprehensive evaluation framework. This section outlines our methodological approach, evaluation criteria selection, and application consistency across systems.

5.5.1 Systems Selection Criteria. Our evaluation encompasses various Deep Research systems selected based on the following criteria:
  • Functional Completeness: Systems must implement at least two of the three core dimensions of Deep Research as defined in Section 1.1
  • Public Documentation: Sufficient technical documentation must be available to enable meaningful analysis
  • Active Development: Systems must have demonstrated active development or usage within the past 12 months
  • Representational Balance: Selection ensures balanced representation of commercial, open-source, general-purpose, and domain-specialized implementations

5.5.2 Evaluation Dimensions and Metrics Application. Our evaluation employs a consistent set of dimensions across all systems, though the specific benchmarks within each dimension vary based on system focus and available performance data. Table 15 presents the evaluation coverage across representative systems.
Table 15. Evaluation Metrics Application Across Systems

| System | Functional Benchmarks | Performance Metrics | Efficiency Metrics | Domain-Specific Benchmarks | Usability Assessment |
| :--- | :--- | :--- | :--- | :--- | :--- |
| OpenAI/DeepResearch | HLE, GAIA | Factual accuracy | Response time | Academic citation | User interface |
| Gemini/DeepResearch | MMLU | Output coherence | Cloud compute | Market analysis | Mobile support |
| Perplexity/DeepResearch | HLE, SimpleQA | Source diversity | Response time | Legal search | Multi-device |
| Grok3Beta | MMLU | Source verification | Cloud efficiency | Financial analysis | Voice interface |
| Manus | GAIA | Cross-domain | API latency | Business analysis | Dashboard |
| Agent-RL/ReSearch | HotpotQA | Planning efficiency | Local compute | Scientific research | CLI interface |
| AutoGLM-Research | WebArena | GUI navigation | Mobile efficiency | Domain adaptation | Accessibility |
| n8n | Workflow | API integration | Self-hosted | Enterprise workflow | No-code design |
5.5.3 Data Collection Methods. Our evaluation data comes from four primary sources:

(1) Published Benchmarks: Performance metrics reported in peer-reviewed literature or official system documentation

(2) Technical Documentation Analysis: Capabilities and limitations outlined in official documentation, APIs, and technical specifications

(3) Repository Examination: Analysis of open-source code repositories for architectural patterns and implementation approaches

(4) Experimental Verification: Where inconsistencies exist, we conducted direct testing of publicly available systems to verify capabilities
When benchmark results are unavailable for specific systems, we indicate this gap explicitly rather than extrapolating performance. This approach ensures transparency regarding the limits of our comparative analysis while maintaining the integrity of available evaluation data.

5.5.4 Cross-System Comparison Challenges. Several methodological challenges exist in comparing Deep Research systems:
  • Benchmark Diversity: Different systems emphasize different benchmarks based on their focus areas
  • Implementation Transparency: Commercial systems often provide limited details about internal architectures
  • Rapid Evolution: Systems undergo frequent updates, potentially rendering specific benchmark results obsolete
  • Domain Specialization: Domain-specific systems excel on targeted benchmarks but may perform poorly on general evaluations
We address these challenges through qualitative architectural analysis alongside quantitative benchmarks, enabling meaningful comparison despite data limitations. Section 3.3 presents the resulting comparative analysis, highlighting both performance differentials and the limitations of direct comparison across heterogeneous implementations.
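The gap-handling policy described in Section 5.5.3 — reporting missing benchmark results explicitly rather than extrapolating — can be sketched as follows. The system names, benchmarks, and scores below are illustrative placeholders, not measured values from any surveyed system.

```python
# Hypothetical benchmark scores; None marks results that were never published.
scores = {
    "SystemA": {"HLE": 26.6, "GAIA": 67.4, "HotpotQA": None},
    "SystemB": {"HLE": None, "GAIA": None, "HotpotQA": 71.0},
}

def comparison_row(system, benchmarks):
    """Render one comparison row, flagging gaps as 'n/a' instead of guessing."""
    row = {"system": system}
    for bench in benchmarks:
        value = scores[system].get(bench)
        row[bench] = f"{value:.1f}" if value is not None else "n/a"
    return row

rows = [comparison_row(s, ["HLE", "GAIA", "HotpotQA"]) for s in scores]
```

The explicit "n/a" entries keep the comparison honest about coverage limits while still allowing side-by-side presentation of the data that does exist.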

6 Applications and Use Cases

The technical capabilities of Deep Research systems enable transformative applications across diverse domains. This section examines implementation patterns, domain-specific adaptations, and representative use cases that demonstrate the practical impact of these technologies.


Fig. 9. Deep Research Application Domains and Use Cases

6.1 Academic Research Applications

Deep Research systems offer significant enhancements to scholarly research workflows.

6.1.1 Literature Review and Synthesis. Comprehensive literature analysis forms the foundation of effective research:
Systematic Review Automation. Deep Research systems demonstrate particular effectiveness for systematic literature reviews requiring exhaustive coverage of existing research. Systems like Google’s Gemini/DeepResearch [60] can efficiently analyze thousands of research papers, a capability that has significant implications for fields like biomedicine where the volume of literature makes comprehensive manual review increasingly challenging [289]. OpenAI/DeepResearch [197] has been successfully deployed for medical research reviews, analyzing thousands of publications to identify intervention efficacy patterns with significantly reduced human effort compared to traditional methods. Similar capabilities are evident in Perplexity/DeepResearch [209] and Gemini/DeepResearch [60], which enable rapid synthesis of research findings across disciplinary boundaries. Generative AI frameworks integrating retrieval-augmented generation further automate systematic reviews by expanding user queries to retrieve relevant scholarly articles and reduce time and resource burdens [234].
Open-source implementations like dzhng/deep-research [321] have found adoption in academic settings where local deployment and customization are prioritized. Specialized scientific implementations like AIResearcher [109] extend these capabilities with domain-specific optimizations for academic literature processing and analysis. These systems enable literature review automation with greater control over search scope and synthesis methods, particularly valuable for specialized research domains with unique requirements. Implementation patterns typically involve customization of search strategies, source weightings, and output formats to align with disciplinary conventions.
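The retrieval-augmented review pattern mentioned above — expanding a user query into multiple search queries before retrieval — can be sketched with a toy keyword retriever. In a real system, the expansion would come from an LLM and the retrieval from a scholarly search API; both are stubbed here, and the synonym table and corpus are invented for illustration.

```python
def expand_query(query):
    """Stub for LLM-based query expansion: derive related subqueries."""
    synonyms = {"efficacy": ["effectiveness"]}  # hypothetical expansion table
    extra = [query.replace(k, s) for k, subs in synonyms.items() if k in query for s in subs]
    return [query] + extra

def retrieve(corpus, queries, top_k=2):
    """Score each document by best keyword overlap with any expanded query."""
    def score(doc):
        terms = set(doc.lower().split())
        return max(len(terms & set(q.lower().split())) for q in queries)
    return sorted(corpus, key=score, reverse=True)[:top_k]

corpus = [
    "trial of drug effectiveness in adults",
    "survey of deep learning architectures",
    "meta analysis of treatment outcomes",
]
queries = expand_query("drug efficacy outcomes")
papers = retrieve(corpus, queries)
```

Expansion lets the retriever catch the "effectiveness" paper that the literal query would have ranked poorly, which is the core time-saving mechanism behind automated review pipelines.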
Research Gap Identification. Beyond simple synthesis, advanced systems effectively identify unexplored areas and research opportunities. Gemini/DeepResearch [60] has demonstrated this capability in interdisciplinary contexts, identifying connection opportunities between distinct research domains that might otherwise remain undiscovered. This application leverages the system’s ability to process extensive literature across fields while identifying patterns and absences in existing research coverage.
Open implementations like HKUDS/Auto-Deep-Research [112] incorporate specific mechanisms for gap analysis, including explicit detection of methodological limitations and underexplored variables across research corpora. These capabilities highlight the potential for automated systems to not only synthesize existing knowledge but actively contribute to research direction through systematic gap identification.

6.1.2 Hypothesis Generation and Testing. AI-assisted hypothesis development enhances research creativity and validation:
Hypothesis Formulation Support. Deep Research systems effectively generate testable hypotheses based on existing literature and theoretical frameworks. OpenAI/DeepResearch [197] provides explicit hypothesis generation capabilities, identifying potential causal relationships and testable predictions derived from literature synthesis. These features enable researchers to explore broader possibility spaces than might be practical through manual review alone.
Specialized frameworks like Camel-AI/OWL [43] implement domain-specific hypothesis generation for scientific applications, incorporating field-specific constraints and validation criteria. These approaches highlight how domain adaptation enhances the practical utility of hypothesis generation capabilities beyond generic formulation. Implementation patterns typically involve iterative refinement with researcher feedback to align generated hypotheses with specific research objectives.
Preliminary Validation Assessment. Advanced systems support hypothesis validation through evidence assessment and methodological planning. Gemini/DeepResearch [60] enables preliminary hypothesis testing through automated data source identification, statistical power analysis, and potential confound identification. These capabilities streamline the transition from hypothesis formulation to empirical testing, reducing manual effort in research design.
Open implementations like Agent-RL/ReSearch [2] incorporate specific validation planning components, guiding researchers through experimental design considerations based on hypothesis characteristics. These approaches demonstrate how Deep Research capabilities extend beyond information gathering to actively support the complete research workflow from conception through validation planning.

6.1.3 Interdisciplinary Research Support. Cross-domain integration represents a particular strength of automated research systems:
Cross-Domain Knowledge Translation. Deep Research systems effectively bridge terminological and conceptual gaps between disciplines. Perplexity/DeepResearch [209] demonstrates this capability through explicit concept mapping between fields, enabling researchers from diverse backgrounds to explore unfamiliar domains with reduced onboarding barriers. This application leverages the system’s broad knowledge base to identify conceptual parallels across disciplinary boundaries.
Open frameworks like smolagents/open_deep_research [115] implement specialized agents for disciplinary translation, with explicit focus on terminological mapping and concept alignment. These approaches highlight how multi-agent architectures can effectively address the challenges of interdisciplinary communication through specialized component design [117].
Methodology Transfer Facilitation. Advanced systems enable effective adaptation of research methods across domains. OpenAI/DeepResearch [197] supports methodology transfer through explicit identification of adaptation requirements and implementation guidance when applying techniques from one field to another. This capability accelerates methodological innovation by facilitating cross-pollination between research traditions. Implementation patterns typically involve specialized methodological components like those in QwenLM/Qwen-Agent [224], which incorporates explicit methodology modeling to identify transfer opportunities and adaptation requirements. This is particularly relevant in fields like engineering, where AI is beginning to impact established design procedures for complex dynamical systems [67]. These approaches demonstrate how Deep Research systems can actively contribute to methodological innovation beyond simple information retrieval and synthesis.

6.2 Scientific Discovery Applications

Deep Research technologies enable enhanced scientific investigation across disciplines.

6.2.1 Data Analysis and Pattern Recognition. Automated analysis enhances insight extraction from complex scientific data:
Large-Scale Data Synthesis. Deep Research systems effectively integrate findings across extensive datasets to identify broader patterns. Gemini/DeepResearch [60] has been applied to climate science research, synthesizing findings across hundreds of climate models and observational datasets to identify consistent patterns and outliers. This application leverages the system’s ability to process and integrate diverse data formats while maintaining analytical coherence. Open implementations like n8n [183] enable similar capabilities through workflow automation that coordinates specialized analytical tools across complex data processing pipelines. Furthermore, SqlCompose [161] enhances analytical workflows by automating SQL authoring to reduce syntax barriers and improve efficiency in large-scale data operations, as demonstrated through enterprise deployment and user feedback. Systems like DataInquirer quantitatively measure workflow patterns and task execution consistency, revealing significant variations across practitioners while also assessing AI tool impacts on aligning novice approaches with expert practices [325]. AI assistants specifically designed for data wrangling tasks can provide semi-automated support in transforming and cleaning data through interactive recommendations, thereby enhancing workflow efficiency [211]. Other systems assist domain experts in making sense of multi-modal personal tracking data through visualization and human-in-the-loop LLM agents [143]. Additionally, no-code machine-readable documentation frameworks support responsible dataset evaluation by facilitating quality assessment and accuracy verification during large-scale data synthesis [233]. These approaches demonstrate how tool integration capabilities extend analytical reach beyond the core language model’s native capabilities, particularly valuable for quantitative scientific applications.
Anomaly Detection and Investigation. Advanced systems effectively identify unexpected patterns and facilitate targeted investigation. OpenAI/DeepResearch [197] demonstrates this capability in pharmacological contexts, identifying unexpected drug interaction patterns across clinical literature and proposing mechanistic explanations for further investigation. This application combines pattern recognition with explanatory hypothesis generation to enhance scientific discovery.
Specialized tools like grapeot/deep_research_agent [263] implement focused anomaly detection capabilities, with particular emphasis on statistical outlier identification and contextual explanation. These approaches highlight how targeted optimization can enhance specific scientific workflows beyond general-purpose research capabilities [125].
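The statistical-outlier step described above can be illustrated with a simple z-score filter over, say, reported drug-interaction rates. The threshold and data are invented for illustration, and production systems would typically use more robust estimators (e.g. median absolute deviation) than the plain mean and standard deviation used here.

```python
import statistics

def flag_outliers(values, z_threshold=2.0):
    """Return indices of values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > z_threshold]

# Hypothetical interaction rates from ten studies; the last one is anomalous.
rates = [0.021, 0.019, 0.023, 0.020, 0.022, 0.018, 0.021, 0.020, 0.019, 0.095]
anomalies = flag_outliers(rates)
```

The flagged index would then be passed to the contextual-explanation stage, which attempts a mechanistic account of why that study deviates.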

6.2.2 Experiment Design and Simulation. AI assistance enhances experimental planning and virtual testing:
Experimental Protocol Optimization. Deep Research systems support experimental design through comprehensive protocol development and optimization. Gemini/DeepResearch [60] provides explicit protocol generation capabilities, incorporating existing methodological best practices while identifying potential confounds and control strategies. These features streamline experimental planning while enhancing methodological rigor.
Open implementations like Agent-RL/ReSearch [2] incorporate specialized experimental design components with particular emphasis on statistical power optimization and confound control. These approaches demonstrate how focused optimization can enhance specific scientific workflows through specialized component design targeting critical research phases.
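The statistical-power optimization such design components perform can be illustrated with the textbook normal-approximation formula for a two-sided two-sample t-test, n ≈ 2·((z₁₋α/₂ + z₁₋β)/d)² per group, where d is the standardized effect size. This is a standard approximation, not the specific formula any surveyed system uses.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided two-sample t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

n = sample_size_per_group(effect_size=0.5)  # medium effect (Cohen's d)
```

The approximation slightly undercounts relative to the exact t-distribution calculation (63 vs. the usual 64 per group for d = 0.5), which is why design tools typically add a small correction or round up.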
Despite these capabilities, significant gaps remain between current systems and truly autonomous scientific discovery. Yu et al. [314] identify critical missing elements in current AI research systems, particularly highlighting limitations in open-ended exploration, creative hypothesis generation, and experimental design optimization that constrain their effectiveness in leading scientific discovery processes.
Theoretical Model Testing. Advanced systems enable accelerated testing of theoretical models through simulation and virtual experimentation. OpenAI/DeepResearch [197] supports this application through integration with computational modeling tools, enabling rapid assessment of theoretical predictions against existing evidence. This capability accelerates theory refinement by identifying empirical constraints and validation opportunities more efficiently than manual methods.
Implementation patterns typically involve specialized tool integration like that found in Manus [164], which provides sophisticated orchestration of computational modeling and simulation tools within research workflows. Systems like AgentLaboratory [237] further enhance these capabilities through specialized experimental design components that generate statistically rigorous protocols based on research objectives and methodological best practices. These approaches highlight how tool integration capabilities significantly enhance scientific applications beyond the language model’s native capabilities.

6.2.3 Scientific Literature Integration. Comprehensive knowledge integration enhances scientific understanding:
Cross-Modal Scientific Content Analysis. Deep Research systems effectively integrate information across text, data, and visualizations prevalent in scientific literature. Gemini/DeepResearch [60] demonstrates particular strength in this application, extracting and synthesizing information from scientific figures, tables, and text into cohesive analyses. This capability enables more comprehensive literature utilization than text-only approaches.
Open implementations like Jina-AI/node-DeepResearch [121] incorporate specialized components for multimodal scientific content processing, enabling similar capabilities in customizable frameworks. These approaches highlight the growing importance of multimodal processing in scientific applications, reflecting the diverse information formats prevalent in scientific communication.
Conflicting Evidence Resolution. Advanced systems help navigate contradictory findings common in scientific literature. Perplexity/DeepResearch [209] provides explicit conflict identification and resolution guidance, identifying methodological differences, contextual factors, and potential reconciliation approaches when faced with contradictory evidence. This capability enhances scientific understanding by providing structured approaches to evidence integration rather than simple aggregation.
Implementation patterns typically involve sophisticated evidence modeling like that found in HKUDS/Auto-Deep-Research [112], which implements explicit evidence weighting and confidence estimation mechanisms. These approaches demonstrate how specialized components for scientific evidence handling enhance the practical utility of Deep Research systems in complex scientific contexts.
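The evidence-weighting mechanism described above can be sketched as inverse-variance pooling, the standard fixed-effect meta-analysis estimator: precise studies contribute more to the pooled estimate, and the pooled standard error quantifies overall confidence. The study effects and standard errors below are made up for illustration.

```python
def pool_evidence(studies):
    """Inverse-variance weighted mean of study effects; returns (estimate, SE)."""
    weights = [1.0 / se ** 2 for _, se in studies]   # precision = 1 / variance
    effects = [eff for eff, _ in studies]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = (1.0 / total) ** 0.5  # SE of the pooled estimate
    return pooled, pooled_se

# (effect estimate, standard error) for three hypothetical studies
studies = [(0.30, 0.10), (0.10, 0.20), (0.25, 0.05)]
pooled, pooled_se = pool_evidence(studies)
```

Because the third study is the most precise, the pooled estimate sits close to its value, and the pooled standard error is smaller than any single study's — the quantitative basis for "weighting" contradictory evidence rather than simply averaging it.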

6.2.4 Autonomous Scientific Discovery. Fully autonomous research systems represent an emerging direction that extends current Deep Research capabilities toward greater autonomy. Recent work in this area includes the AI Scientist system [159] that implements an automated discovery loop with hypothesis generation, experimentation, and theory revision capacities. Similarly, the Dolphin system [316] demonstrates how closed-loop auto-research can integrate thinking, practice, and feedback mechanisms to implement systematic scientific discovery processes.
This evolution toward more autonomous operation represents a significant advancement beyond traditional tool-based approaches, enabling continuous research cycles with minimal human intervention while maintaining scientific rigor through structured validation processes. Systems like CycleResearcher [294] further enhance this approach by incorporating automated peer review mechanisms [150] that improve output quality through systematic feedback loops mimicking scientific review processes.
Practical implementation of these concepts appears in systems like AgentLaboratory [240], which demonstrates how LLM agents can function as effective research assistants within structured laboratory environments. Complementing these approaches, the concept of self-maintainability (SeM) addresses critical gaps in laboratory automation by enabling systems to autonomously adapt to disturbances and maintain operational readiness [191]. In addition, strategies such as BOLAA [156] orchestrate multiple specialized agents by employing a controller to manage communication among them, enhancing the resolution of complex tasks. Moreover, Automated Capability Discovery (ACD) [158] automates the evaluation of foundation models by designating one model as a scientist to propose open-ended tasks that systematically uncover unexpected capabilities and failures. Similarly, SeqMate [178] utilizes large language models to automate RNA sequencing data preparation and analysis, enabling user-friendly one-click analytics and report generation for biologists. The FutureHouse Platform [253] broadens accessibility by delivering the first publicly available superintelligent AI agents for scientific discovery through web interfaces and APIs. These implementations highlight both the significant potential and current limitations of autonomous scientific discovery systems, suggesting an evolutionary path toward increasingly capable research automation while maintaining appropriate human oversight and validation.

6.3 Business Intelligence Applications

Deep Research technologies enable enhanced strategic decision support in commercial contexts.

6.3.1 Market Research and Competitive Analysis. Comprehensive market understanding supports strategic planning:
Competitor Landscape Mapping. Deep Research systems effectively synthesize comprehensive competitive intelligence across diverse sources. Gemini/DeepResearch [60] enables detailed competitor analysis across financial disclosures, product announcements, market reception, and strategic positioning to identify competitive dynamics and market opportunities. This application leverages the system’s ability to integrate information across public and specialized business sources with current market context.
Open implementations like n8n [183] support similar capabilities through workflow automation that integrates specialized business intelligence data sources. These approaches demonstrate how effective tool integration can create sophisticated business intelligence applications by coordinating specialized components within consistent analytical frameworks.
Emerging Trend Identification. Advanced systems effectively identify early-stage market trends and potential disruptions. OpenAI/DeepResearch [197] demonstrates this capability through temporal pattern analysis across industry publications, startup activity, and technology development indicators. This application combines historical pattern recognition with current signal detection to anticipate market evolution with greater lead time than manual methods alone.
Implementation patterns typically involve specialized analytical components like those in Flowith/OracleMode [77], which incorporates explicit trend modeling and weak signal amplification techniques. These approaches highlight how specialized optimization enhances business intelligence applications through components targeting specific analytical requirements.
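One simple form of the trend modeling mentioned above is comparing a recent moving average of topic mentions against a longer baseline, flagging topics whose recent activity outpaces their history. The mention counts and the 1.5× threshold are invented for illustration; real weak-signal pipelines would add smoothing and significance testing on top of a ratio like this.

```python
def trend_score(counts, recent=3):
    """Ratio of the recent average mentions to the earlier baseline average."""
    head, tail = counts[:-recent], counts[-recent:]
    baseline = sum(head) / len(head)
    return (sum(tail) / recent) / baseline

# Hypothetical monthly mention counts for two topics
topics = {
    "topic_a": [5, 6, 5, 6, 7, 14, 18, 25],    # accelerating
    "topic_b": [40, 41, 39, 40, 42, 41, 40, 39],  # flat
}
emerging = [name for name, counts in topics.items() if trend_score(counts) > 1.5]
```

Note that the ratio deliberately favors relative growth over absolute volume: the low-volume but accelerating topic is flagged while the high-volume flat one is not, which is the essence of weak-signal detection.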

6.3.2 Strategic Decision Support. AI-enhanced analysis informs high-stakes business decisions:
Investment Opportunity Assessment. Deep Research systems support investment analysis through comprehensive opportunity evaluation. Perplexity/DeepResearch [209] enables detailed investment analysis incorporating financial metrics, market positioning, competitive dynamics, and growth indicators within unified analytical frameworks. This application integrates quantitative financial assessment with qualitative market understanding to support more comprehensive investment evaluation.
Open frameworks like mshumer/OpenDeepResearcher [249] implement investment analysis components with particular emphasis on structured evaluation frameworks and comprehensive source integration. These approaches demonstrate how domain-specific optimization enhances practical utility for specialized business applications beyond generic research capabilities.
Risk Factor Identification. Advanced systems support risk management through comprehensive threat identification and assessment. Gemini/DeepResearch [60] provides explicit risk analysis capabilities, identifying potential threats across regulatory, competitive, technological, and market dimensions with associated impact and likelihood estimation. These features enable more comprehensive risk management than might be practical through manual analysis alone.
Implementation patterns typically involve specialized risk modeling components like those found in Manus [164], which incorporates explicit risk categorization and prioritization mechanisms. These approaches highlight how targeted optimization enhances specific business workflows through specialized components addressing critical decision support requirements.
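Manus's risk components are not publicly documented; a minimal sketch of explicit risk categorization and prioritization, under the common assumption that severity is approximated by likelihood times impact, might look like the following. The `Risk` fields, categories, and example register are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    category: str      # e.g. "regulatory", "competitive", "technological", "market"
    likelihood: float  # probability estimate, 0.0-1.0
    impact: int        # 1 (minor) to 5 (severe)

def prioritize(risks):
    """Rank risks by expected severity (likelihood x impact), highest first."""
    return sorted(risks, key=lambda r: r.likelihood * r.impact, reverse=True)

def by_category(risks):
    """Group risks by category for dimension-level reporting."""
    groups = {}
    for r in risks:
        groups.setdefault(r.category, []).append(r)
    return groups

register = [
    Risk("new data-privacy rule", "regulatory", 0.7, 4),
    Risk("entrant undercuts pricing", "competitive", 0.5, 3),
    Risk("core platform deprecated", "technological", 0.2, 5),
]
ranked = prioritize(register)
print([r.name for r in ranked])  # most severe expected risk first
```

A production system would replace the point estimates with ranges elicited from sources, but the categorize-score-rank skeleton stays the same.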

6.3.3 Business Process Optimization. Research-driven insights enhance operational effectiveness:
Best Practice Identification. Deep Research systems effectively synthesize operational best practices across industries and applications. OpenAI/DeepResearch [197] enables comprehensive process benchmarking against industry standards and innovative approaches from adjacent sectors, identifying optimization opportunities that might otherwise remain undiscovered. This application leverages the system’s broad knowledge base to facilitate cross-industry learning and adaptation.
Open implementations like TARS [39] support similar capabilities through workflow analysis and recommendation components designed for business process optimization. These approaches demonstrate how domain adaptation enhances practical utility for specific business applications beyond general research capabilities.
Implementation Planning Support. Advanced systems support process change through comprehensive implementation guidance. Gemini/DeepResearch [60] provides detailed implementation planning incorporating change management considerations, resource requirements, and risk mitigation strategies derived from similar initiatives across industries. This capability accelerates organizational learning by leveraging broader implementation experience than is typically available within a single organization.
Implementation patterns typically involve specialized planning components like those in QwenLM/Qwen-Agent [224], HuggingGPT [246], XAgent [202], Mastra [168], Letta [138], and SemanticKernel [174], which incorporate explicit process modeling and change management frameworks. These approaches highlight how targeted optimization enhances specific business workflows through specialized components addressing critical implementation challenges.

6.4 Financial Analysis Applications

Deep Research technologies enable enhanced financial assessment and decision support.

6.4.1 Investment Research and Due Diligence. AI-enhanced analysis supports investment decisions across asset classes:
Comprehensive Asset Evaluation. Deep Research systems enable detailed asset analysis across financial and contextual dimensions. Perplexity/DeepResearch [209] supports investment research through integration of financial metrics, market positioning, competitive dynamics, and growth indicators within unified analytical frameworks. This application enhances investment decision quality through more comprehensive information integration than typically practical through manual methods alone.
Open implementations like n8n [183] enable similar capabilities through workflow automation that integrates specialized financial data sources and analytical tools. These approaches demonstrate how effective tool orchestration creates sophisticated financial applications by coordinating specialized components within consistent analytical frameworks.
Management Quality Assessment. Advanced systems support leadership evaluation through comprehensive background analysis. OpenAI/DeepResearch [197] enables detailed management assessment incorporating historical performance, leadership approach, strategic consistency, and reputation across diverse sources. This capability enhances investment evaluation by providing deeper leadership insights than typically available through standard financial analysis.
Implementation patterns typically involve specialized entity analysis components like those found in Manus [164], which incorporates explicit leadership evaluation frameworks. These approaches highlight how targeted optimization enhances specific financial workflows through specialized components addressing critical evaluation dimensions.

6.4.2 Financial Trend Analysis. Pattern recognition across financial data informs strategic positioning:
Multi-Factor Trend Identification. Deep Research systems effectively identify complex patterns across financial indicators and contextual factors. Gemini/DeepResearch [60] demonstrates this capability through integrated analysis of market metrics, macroeconomic indicators, sector-specific factors, and relevant external trends. This application enhances trend identification through more comprehensive factor integration than typically practical through manual analysis alone.
Open frameworks like grapeot/deep_research_agent [263] implement specialized trend analysis components with particular emphasis on statistical pattern detection and causal factor identification. However, research indicates that the effectiveness of such AI systems may be limited in tasks requiring deep domain understanding, as their generated outputs can exhibit redundancy or inaccuracies [254]. These approaches demonstrate how domain-specific optimization enhances practical utility for specialized financial applications beyond generic analytical capabilities.
Scenario Development and Testing. Advanced systems support financial planning through structured scenario analysis. OpenAI/DeepResearch [197] enables detailed scenario development incorporating varied assumptions, historical precedents, and system dependencies with coherent projection across financial impacts. This capability enhances strategic planning by facilitating more comprehensive scenario exploration than typically practical through manual methods.
Implementation patterns typically involve specialized scenario modeling components like those in Agent-RL/ReSearch [2], which incorporates explicit dependency modeling and consistency verification mechanisms. These approaches highlight how targeted optimization enhances specific financial workflows through specialized components addressing critical planning requirements.
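Agent-RL/ReSearch's dependency modeling and consistency verification mechanisms are not specified here; one minimal way to realize the idea is to encode each dependency as a predicate over scenario variables and report which predicates a candidate scenario violates. The drivers, rules, and example scenarios below are illustrative assumptions.

```python
# Each scenario assigns values to named drivers; each dependency is a
# predicate that must hold for the scenario to be internally consistent.
def check_scenario(scenario, dependencies):
    """Return the names of violated dependencies (empty list = consistent)."""
    return [name for name, rule in dependencies.items() if not rule(scenario)]

dependencies = {
    # hypothetical rule: high inflation should not coincide with near-zero rates
    "rates_track_inflation": lambda s: not (s["inflation"] > 0.05 and s["policy_rate"] < 0.02),
    # hypothetical rule: a recession scenario implies falling revenue
    "recession_hits_revenue": lambda s: s["revenue_growth"] < 0 if s["recession"] else True,
}

base_case = {"inflation": 0.03, "policy_rate": 0.04, "recession": False, "revenue_growth": 0.06}
stress_case = {"inflation": 0.08, "policy_rate": 0.01, "recession": True, "revenue_growth": 0.02}

print(check_scenario(base_case, dependencies))    # consistent: []
print(check_scenario(stress_case, dependencies))  # both rules violated
```

Expressing dependencies as named predicates keeps the verification step auditable: an analyst can read exactly which assumption a rejected scenario breaks.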

6.4.3 Risk Assessment and Modeling. Comprehensive risk analysis informs financial decisions:

Multi-Dimensional Risk Analysis. Deep Research systems enable integrated risk assessment across diverse risk categories. Perplexity/DeepResearch [209] supports comprehensive risk evaluation incorporating market, credit, operational, regulatory, and systemic risk factors within unified analytical frameworks. This application enhances risk management through more comprehensive factor integration than typically practical through compartmentalized analysis.
Open implementations like nickscamara/open-deep-research [42] implement risk analysis components with particular emphasis on integrated factor assessment and interaction modeling. These approaches demonstrate how domain adaptation enhances practical utility for specific financial applications beyond general analytical capabilities. Evaluations such as RedCode-Exec [101] show that agents are less likely to reject executing technically buggy code, a high-risk behavior that underscores the need for stringent safety evaluations of diverse code agents.
Stress Testing and Resilience Assessment. Advanced systems support financial stability through sophisticated stress scenario analysis. Gemini/DeepResearch [60] provides detailed stress testing capabilities incorporating historical crisis patterns, theoretical risk models, and system dependency analysis to identify potential vulnerabilities. These features enable more comprehensive resilience assessment than might be practical through standardized stress testing alone.
Implementation patterns typically involve specialized stress modeling components like those found in Flowith/OracleMode [77], which incorporates explicit extreme scenario generation and impact propagation mechanisms. These approaches highlight how targeted optimization enhances specific financial workflows through specialized components addressing critical stability assessment requirements.
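An impact-propagation mechanism of the kind attributed to Flowith/OracleMode could, under simplifying assumptions, be modeled as iterative shock transmission over an exposure graph: each round, every entity absorbs a weighted share of the stress on the entities it depends on. The entities, exposure weights, and fixed round count below are hypothetical.

```python
def propagate(shocks, exposures, rounds=3):
    """Propagate initial shocks through a dependency graph.

    exposures[a][b] is the fraction of b's stress transmitted to a
    (a depends on b). Returns cumulative stress per entity after the
    given number of propagation rounds.
    """
    stress = dict(shocks)
    for _ in range(rounds):
        nxt = dict(shocks)  # restart from exogenous shocks each round
        for a, deps in exposures.items():
            nxt[a] = nxt.get(a, 0.0) + sum(w * stress.get(b, 0.0) for b, w in deps.items())
        stress = nxt
    return stress

# hypothetical exposure network: bank_a holds assets of fund_x, and
# bank_b is exposed to both bank_a and fund_x
exposures = {
    "bank_a": {"fund_x": 0.4},
    "bank_b": {"bank_a": 0.3, "fund_x": 0.2},
}
result = propagate({"fund_x": 1.0}, exposures)
print(result)  # second-order stress reaches bank_b via bank_a
```

Real stress-testing components would calibrate exposures from filings and historical crisis data and check convergence instead of using a fixed round count.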

6.5 Educational Applications

Deep Research technologies enable enhanced learning and knowledge development. Educational approaches to research automation have shown particular promise in scientific education [236] and data science pedagogy [274], with systems like DS-Agent automating machine learning workflows through case-based reasoning to reduce learners' technical barriers [102], highlighting the dual role of these systems in both conducting research and developing research capabilities in human learners. Smart AI reading assistants are also being developed to enhance reading comprehension through interactive support [266]. However, adoption challenges remain significant in educational contexts, where user resistance and ineffective system utilization can impede learning progress, requiring strategies such as active support during initial use and clear communication of system capabilities [252]. Specifically in data science education, learners encounter challenges similar to those faced by data scientists when interacting with conversational AI systems, such as difficulties in formulating prompts for complex tasks and adapting generated code to local environments [57]. Structured empirical evaluations of LLMs for data science tasks, such as the work by Nathalia Nascimento et al. [185], demonstrate their effectiveness in coding challenges and provide guidance for model selection in educational tools.

6.5.1 Personalized Learning Support. AI-enhanced research supports individualized educational experiences:
Adaptive Learning Path Development. Deep Research systems effectively generate customized learning pathways based on individual interests and knowledge gaps. OpenAI/DeepResearch [197] enables detailed learning plan development incorporating knowledge structure mapping, prerequisite relationships, and diverse learning resources tailored to individual learning styles and objectives. This application enhances educational effectiveness through more personalized learning journeys than typically available through standardized curricula.
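The prerequisite-relationship mapping described above can be sketched with a topological sort over a prerequisite graph: collect only the topics the learner's goal transitively requires, then order them so every prerequisite precedes its dependents. The topic names and prerequisite map are hypothetical; this illustrates the general technique rather than OpenAI/DeepResearch's implementation.

```python
from graphlib import TopologicalSorter

# hypothetical prerequisite map: each topic lists the topics it depends on
prerequisites = {
    "linear-regression": {"statistics", "linear-algebra"},
    "neural-networks": {"linear-regression", "calculus"},
    "statistics": set(),
    "linear-algebra": set(),
    "calculus": set(),
}

def learning_path(goal, prereqs):
    """Order only the topics the goal actually requires, prerequisites first."""
    needed, stack = set(), [goal]
    while stack:  # collect the goal's transitive prerequisites
        t = stack.pop()
        if t not in needed:
            needed.add(t)
            stack.extend(prereqs.get(t, ()))
    sub = {t: prereqs.get(t, set()) & needed for t in needed}
    return list(TopologicalSorter(sub).static_order())

path = learning_path("neural-networks", prerequisites)
print(path)  # every prerequisite appears before the topics that need it
```

Personalization would then prune topics the learner already knows and attach resources matched to learning style, but the ordering backbone is this sort.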
Open implementations like OpenManus [193] implement personalized learning components with particular emphasis on interest-driven exploration and adaptive difficulty adjustment. These approaches demonstrate how educational adaptation enhances practical utility beyond general research capabilities.
Comprehensive Question Answering. Advanced systems provide detailed explanations tailored to learner context and prior knowledge. Perplexity/DeepResearch [209] demonstrates this capability through multilevel explanations that adjust detail and terminology based on learner background, providing conceptual scaffolding appropriate to individual knowledge levels. This capability enhances learning effectiveness by providing precisely targeted explanations rather than generic responses.
Implementation patterns typically involve specialized educational components like those in HKUDS/Auto-Deep-Research [112], which incorporates explicit knowledge modeling and explanation generation mechanisms. These approaches highlight how targeted optimization enhances educational applications through specialized components addressing critical learning support requirements.

6.5.2 Educational Content Development. Research-driven content creation enhances learning materials:
Curriculum Development Support. Deep Research systems effectively synthesize educational best practices and domain knowledge into coherent curricula. Gemini/DeepResearch [60] enables comprehensive curriculum development incorporating learning science principles, domain structure mapping, and diverse resource integration. This application enhances educational design through more comprehensive knowledge integration than typically practical for individual educators.
Open frameworks like smolagents/open_deep_research [115] implement curriculum development components with particular emphasis on learning progression modeling and resource alignment. These approaches demonstrate how specialized adaptation enhances practical utility for educational applications beyond generic content generation.
Multi-Modal Learning Material Creation. Advanced systems generate diverse educational content formats tailored to learning objectives. OpenAI/DeepResearch [197] supports creation of integrated learning materials incorporating explanatory text, conceptual visualizations, practical examples, and assessment activities aligned with specific learning outcomes. This capability enhances educational effectiveness through more comprehensive content development than typically practical through manual methods alone.
Implementation patterns typically involve specialized content generation components like those in QwenLM/Qwen-Agent [224], which incorporates explicit learning objective modeling and multi-format content generation. These approaches highlight how targeted optimization enhances educational applications through specialized components addressing diverse learning modalities.

6.5.3 Academic Research Training. AI-assisted research skill development supports scholarly advancement:
Research Methodology Instruction. Deep Research systems effectively teach research methods through guided practice and feedback. Perplexity/DeepResearch [209] provides explicit methodology training, demonstrating effective research processes while explaining rationale and providing structured feedback on learner attempts. This application enhances research skill development through more interactive guidance than typically available through traditional instruction.
Open implementations like Jina-AI/node-DeepResearch [121] support similar capabilities through research practice environments with explicit guidance and feedback mechanisms. These approaches demonstrate how educational adaptation enhances practical utility for research training beyond simple information provision.
Critical Evaluation Skill Development. Maintaining critical thinking skills while leveraging AI research assistance presents unique educational challenges. Drosos et al. [71] demonstrate that carefully designed “provocations” can help restore critical thinking in AI-assisted knowledge work, suggesting important educational approaches for developing research skills that complement rather than rely entirely on AI capabilities. Advanced systems support critical thinking through guided source evaluation and analytical practice. OpenAI/DeepResearch [197] enables critical evaluation training, demonstrating source assessment, evidence weighing, and analytical reasoning while guiding learners through similar processes. This capability enhances critical thinking development through structured practice with sophisticated feedback.
Implementation patterns typically involve specialized educational components like those in grapeot/deep_research_agent [263], which incorporates explicit critical thinking modeling and guided practice mechanisms. These approaches highlight how targeted optimization enhances educational applications through specialized components addressing crucial scholarly skill development.

6.6 Personal Knowledge Management Applications

Deep Research technologies enable enhanced individual information organization and utilization.

6.6.1 Information Organization and Curation. AI-enhanced systems support personal knowledge development:
Personalized Knowledge Base Development. Deep Research systems effectively organize diverse information into coherent personal knowledge structures. Perplexity/DeepResearch [209] supports knowledge base development through automated information organization, connection identification, and gap highlighting tailored to individual interests and objectives. This application enhances personal knowledge management through more sophisticated organization than typically practical through manual methods alone.
Open implementations like nickscamara/open-deep-research [42] implement knowledge organization components with particular emphasis on personalized taxonomy development and relationship mapping. These approaches demonstrate how individual adaptation enhances practical utility for personal applications beyond generic information management.
Content Summarization and Abstraction. Advanced systems transform complex information into accessible personal knowledge. OpenAI/DeepResearch [197] provides multi-level content abstraction capabilities, generating overview summaries, detailed analyses, and conceptual maps from complex source materials tailored to individual comprehension preferences. This capability enhances information accessibility by providing precisely targeted representations rather than generic summaries.
Implementation patterns typically involve specialized content processing components like those in Nanobrowser [184], which incorporates explicit knowledge distillation and representation generation mechanisms. These approaches highlight how targeted optimization enhances personal knowledge applications through specialized components addressing individual information processing needs.

6.6.2 Personal Learning and Development. Research-driven insights support individual growth:

Interest-Driven Exploration. Deep Research systems effectively support curiosity-driven learning through guided exploration. Gemini/DeepResearch [60] enables interest-based knowledge discovery, identifying connections, extensions, and practical applications related to individual curiosities. This application enhances personal learning through more sophisticated guidance than typically available through standard search alone.
Open frameworks like OpenManus [193] implement exploration components with particular emphasis on interest mapping and discovery facilitation. These approaches demonstrate how personalization enhances practical utility for individual learning beyond generic information retrieval.
Skill Development Planning. Advanced systems support personal growth through comprehensive development guidance. Perplexity/DeepResearch [209] provides detailed skill development planning, incorporating learning resource identification, progression mapping, and practice guidance tailored to individual objectives and constraints. This capability enhances personal development through more comprehensive planning support than typically available through generic guidance.
Implementation patterns typically involve specialized planning components like those in TARS [39], which incorporates explicit skill modeling and development path generation. These approaches highlight how targeted optimization enhances personal growth applications through specialized components addressing individual development needs.

6.6.3 Decision Support for Individual Users. Research-enhanced decision making improves personal outcomes:
Complex Decision Analysis. Deep Research systems effectively support personal decisions through comprehensive option evaluation. OpenAI/DeepResearch [197] enables detailed decision analysis, incorporating multiple criteria, preference weighting, and consequence projection tailored to individual values and constraints. This application enhances decision quality through more sophisticated analysis than typically practical through manual methods alone.
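The decision analysis described above, with multiple criteria and preference weighting, reduces in its simplest form to a weighted-sum model: score each option on each criterion, weight by the user's stated preferences, and rank. The criteria, weights, and options below are invented for illustration and are not drawn from any particular system.

```python
def rank_options(options, weights):
    """Weighted-sum evaluation: each option scores criteria on a 0-10 scale."""
    def score(scores):
        return sum(weights[c] * scores[c] for c in weights)
    return sorted(options, key=lambda o: score(o[1]), reverse=True)

# hypothetical personal decision: choosing between two job offers
weights = {"salary": 0.3, "growth": 0.4, "commute": 0.3}  # preferences sum to 1.0
options = [
    ("offer-a", {"salary": 8, "growth": 5, "commute": 9}),
    ("offer-b", {"salary": 6, "growth": 9, "commute": 5}),
]
ranked = rank_options(options, weights)
print([name for name, _ in ranked])  # → ['offer-a', 'offer-b']
```

The value a research system adds over this skeleton is in eliciting the weights and grounding the per-criterion scores in evidence rather than guesses.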
Open implementations like Agent-RL/ReSearch [2] implement decision support components with particular emphasis on preference elicitation and consequence modeling. These approaches demonstrate how personalization enhances practical utility for individual decision making beyond generic information provision.
Life Planning and Optimization. Advanced systems support long-term planning through integrated life domain analysis. Gemini/DeepResearch [60] provides comprehensive life planning support, integrating career, financial, health, and personal considerations within coherent planning frameworks tailored to individual values and objectives. This capability enhances life optimization through more integrated planning than typically achievable through domain-specific approaches alone.
Implementation patterns typically involve specialized planning components like those in Flowith/OracleMode [77], which incorporates explicit value modeling and multi-domain integration mechanisms. These approaches highlight how targeted optimization enhances personal planning applications through specialized components addressing holistic life considerations.
The diverse applications outlined in this section demonstrate the broad practical impact of Deep Research technologies across domains. While specific implementation approaches vary across commercial and open-source ecosystems, common patterns emerge in domain adaptation, specialized component design, and integration with existing workflows. These patterns highlight how technical capabilities translate into practical value through thoughtful application design aligned with domain-specific requirements and user needs.

7 Ethical Considerations and Limitations

The integration of Deep Research systems into knowledge workflows introduces significant ethical considerations and technical limitations that must be addressed for responsible deployment. This section examines key challenges across four fundamental dimensions (see Figure 10): information integrity, privacy protection, source attribution and intellectual property, and accessibility.

7.1 Information Accuracy and Hallucination Concerns

Deep Research systems face fundamental challenges in maintaining factual reliability despite their sophisticated capabilities.

7.1.1 Factual Verification Mechanisms. Recent studies have highlighted significant challenges in reliable uncertainty communication [55], with particular concerns for research contexts where uncertainty boundaries may be unclear or contested. Some researchers have raised concerns about excessive reliance on AI-generated content in scholarly writing [27, 45, 104, 119, 146, 207, 282, 286, 324, 335], particularly when verification mechanisms are inadequate or bypassed. These limitations are further complicated by tendencies toward misleading responses in conversation [113], presenting particular challenges for interactive research workflows where iterative refinement may inadvertently amplify initial inaccuracies. AI support systems designed for evidence-based expository writing tasks, such as literature reviews, offer frameworks to enhance verification through structured sensemaking over source documents [247]. Addressing these challenges requires technical advances in uncertainty representation, improvements in decision workflow design [107], and interface designs that effectively communicate confidence boundaries to research users [270].

Fig. 10. Ethical Dimensions of Deep Research Systems
Ensuring information accuracy requires explicit verification strategies:

Source Verification Approaches. Leading implementations incorporate explicit source validation mechanisms to enhance factual reliability. OpenAI/DeepResearch [197] implements multi-level verification that confirms information across multiple independent sources before incorporation into research outputs, with detailed guidelines outlined in their system documentation [196]. Similarly, Perplexity/DeepResearch [209] implements automated fact-checking that independently verifies key claims against trusted reference sources before inclusion in final reports.
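The multi-source confirmation pattern described above can be sketched as a simple acceptance rule. The `Evidence` type, the `confirmed` function, and the two-source threshold are illustrative assumptions for this sketch, not details of any cited system.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str     # identifier of an independent source, e.g. a URL
    supports: bool  # whether this source confirms the claim

def confirmed(evidence: list[Evidence], min_independent_sources: int = 2) -> bool:
    """Accept a claim only when enough distinct sources support it."""
    supporting = {e.source for e in evidence if e.supports}
    return len(supporting) >= min_independent_sources
```

Counting distinct supporting sources, rather than raw evidence items, prevents a single source quoted repeatedly from masquerading as independent confirmation.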
Open-source alternatives demonstrate varied approaches to verification. Systems like grapeot/deep_research_agent [263] emphasize explicit citation mechanisms that maintain direct links between claims and sources, enabling straightforward verification. More sophisticated implementations like HKUDS/Auto-Deep-Research [112] incorporate specialized verification modules that assess source credibility and content consistency before information utilization.
Hallucination Detection and Prevention. Mitigating fabricated information represents a crucial challenge for LLM-based research systems. Commercial implementations employ advanced hallucination reduction techniques including strict grounding requirements and consistency verification. Gemini/DeepResearch [60] implements explicit uncertainty modeling that distinguishes between confirmed information and speculative extensions, enhancing transparency when definitive answers cannot be provided. Emerging paradigms like those proposed by Silver and Sutton [251] suggest a fundamental shift toward experience-driven learning, potentially transforming how research systems acquire and refine capabilities through interaction with information environments. Such approaches could enable more human-like research development through continuous improvement based on research experiences rather than static training alone, and could fundamentally mitigate hallucinations.
Open implementations demonstrate pragmatic approaches to hallucination reduction within more constrained technical environments. Systems like Agent-RL/ReSearch [2] employ preventative strategies including explicit sourcing requirements and conservative synthesis guidelines that prioritize factual reliability over comprehensive coverage. Complementary approaches like Mask-DPO [100] focus on generalizable fine-grained factuality alignment, addressing a critical requirement for reliable research outputs. Recent work from the GAIR NLP team on DeepResearcher [81] has advanced these capabilities through integrated neural verification and knowledge graph alignment techniques that significantly enhance factual reliability. These approaches highlight diverse strategies for addressing a fundamental challenge that impacts all LLM-based research systems.

7.1.2 Uncertainty Communication Approaches. Transparent uncertainty representation enhances result interpretation and appropriate utilization:
Confidence Estimation Methods. Advanced systems implement explicit confidence assessment for research findings and recommendations. OpenAI/DeepResearch [197] incorporates graduated confidence scoring that reflects evidence quality, consistency across sources, and reasoning reliability. This capability enhances result interpretation by clearly distinguishing between well-supported conclusions and more speculative findings.
Open-source implementations demonstrate simplified but effective confidence communication approaches. Systems like mshumer/OpenDeepResearcher [249] incorporate basic confidence indicators that signal information reliability through explicit markers in research outputs. These approaches highlight the importance of transparent uncertainty communication regardless of implementation sophistication.
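A graduated confidence score of the kind described above can be approximated as a combination of the factors named: evidence quality, cross-source consistency, and reasoning reliability. The equal weighting, thresholds, and labels below are illustrative assumptions, not a documented formula from any system.

```python
def confidence_score(evidence_quality: float,
                     cross_source_consistency: float,
                     reasoning_reliability: float) -> str:
    """Combine factors in [0, 1] into a graduated confidence label.

    The equal weighting is an illustrative choice for this sketch.
    """
    score = (evidence_quality + cross_source_consistency + reasoning_reliability) / 3
    if score >= 0.75:
        return "well-supported"
    if score >= 0.5:
        return "moderately supported"
    return "speculative"
```

Emitting a discrete label alongside findings lets a report clearly distinguish well-supported conclusions from speculative ones, as the commercial systems above are described as doing.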
Evidence Qualification Standards. Responsible systems clearly communicate limitations and contextual factors affecting result interpretation. Commercial implementations like Perplexity/DeepResearch [209] incorporate explicit evidence qualification that highlights contextual limitations, conflicting viewpoints, and temporal constraints affecting research findings. This practice enhances appropriate utilization by providing necessary context for result interpretation.
Open-source alternatives demonstrate varied approaches to evidence qualification. Systems like dzhng/deep-research [321] implement explicit limitation statements that identify key constraints affecting research reliability. More sophisticated implementations like Camel-AI/OWL [43] incorporate structured evidence models that represent both supporting and contradicting information within unified frameworks.

7.1.3 Quality Control Frameworks. Systematic approaches to quality assurance enhance overall reliability:
Pre-Release Verification Standards. Leading implementations employ comprehensive validation processes before result delivery. Gemini Deep Research implements structured quality verification including automated consistency checking, source validation, and reasoning verification before providing research outputs. These practices enhance overall reliability through systematic error identification and correction.
Open-source implementations demonstrate more varied quality control approaches. Systems like nickscamara/open-deep-research [42] incorporate simplified validation processes focusing on critical reliability factors including source verification and logical consistency. These approaches highlight how even basic quality control mechanisms can significantly enhance research reliability.
Feedback Integration Systems. Continuous improvement requires effective incorporation of accuracy feedback. As Deep Research systems advance toward greater autonomy, broader safety considerations become increasingly important. Bengio et al. [26] highlight potential risks from superintelligent agents and propose approaches like “Scientist AI” that balance capability with safer development paths, emphasizing the importance of integrated safety mechanisms in advanced research systems. Commercial systems implement sophisticated feedback integration including explicit accuracy reporting channels and systematic error pattern analysis. OpenAI/DeepResearch [197] includes dedicated correction mechanisms that incorporate verified accuracy feedback into system improvements, creating virtuous improvement cycles.
Open implementations demonstrate more community-oriented feedback approaches. Systems like smolagents/open_deep_research [115] incorporate collaborative improvement frameworks that enable distributed error identification and correction through community contributions. These approaches highlight diverse strategies for enhancing reliability through user engagement across implementation contexts.

7.2 Privacy and Data Security

Research systems must carefully protect sensitive information throughout the research process.

7.2.1 User Data Protection Mechanisms. Safeguarding user information requires comprehensive protection strategies:
Query Isolation Practices. Leading implementations employ strict isolation between user research sessions. Commercial systems like OpenAI/DeepResearch [197] and Gemini/DeepResearch [60] implement comprehensive tenant isolation that prevents information leakage between distinct users or organizations. These practices are particularly crucial for sensitive research applications in corporate or governmental contexts.
Open-source implementations demonstrate varied isolation approaches depending on deployment models. Systems designed for local deployment like OpenManus [193] enable complete isolation within organizational boundaries, enhancing privacy for sensitive applications. Cloud-dependent implementations typically incorporate more limited isolation mechanisms, highlighting deployment considerations for privacy-sensitive applications.
Data Minimization Strategies. Responsible systems limit sensitive data collection and retention. Commercial implementations increasingly emphasize data minimization, collecting only information necessary for service provision and applying appropriate retention limitations. These practices enhance privacy protection by reducing potential exposure of sensitive information through either security incidents or authorized access.
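A retention limitation of this kind reduces, in its simplest form, to a periodic pruning pass over stored records. The record shape, the `created` field, and the 30-day default below are illustrative assumptions for this sketch.

```python
from datetime import datetime, timedelta

def prune_expired(records: list[dict], now: datetime,
                  retention_days: int = 30) -> list[dict]:
    """Drop stored records older than the retention window (data minimization)."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["created"] >= cutoff]
```

Running such a pass on a schedule bounds how long sensitive research data remains exposed to either security incidents or authorized access.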
Open implementations demonstrate diverse approaches to data management. Systems like Nanobrowser [184] enable complete local control of browsing data, preventing external exposure of research activities. Infrastructure frameworks like Jina-AI/node-DeepResearch [121] provide flexible configuration options that enable deployment-specific privacy controls aligned with organizational requirements.

7.2.2 Sensitive Information Handling. Special safeguards are required for particularly sensitive content categories:
Personal Identifier Management. Advanced systems implement specific protections for personally identifiable information. Commercial implementations like Perplexity/DeepResearch [209] incorporate automatic detection and redaction of personal identifiers from research outputs unless specifically relevant to research objectives. These practices prevent inadvertent exposure of personal information through research activities.
Open implementations demonstrate more varied approaches to identifier management. Systems like TARS [39] incorporate basic identifier detection focused on common patterns like email addresses and phone numbers. More sophisticated implementations like QwenLM/Qwen-Agent [224] provide configurable sensitivity controls that enable context-appropriate protection aligned with specific deployment requirements.
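Pattern-based detection of common identifiers such as email addresses and phone numbers can be sketched with two regular expressions. These patterns are deliberately simple illustrations and would miss many real-world identifier formats; they are not the patterns used by any system named above.

```python
import re

# Illustrative patterns for two common identifier types; production systems
# use broader detectors (names, addresses, account numbers, ...).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace detected personal identifiers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Replacing matches with typed placeholders, rather than deleting them, preserves the structure of the output so readers can see that redaction occurred.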
Protected Category Safeguards. Responsible systems implement enhanced protections for specially regulated information categories. Commercial implementations increasingly incorporate specialized handling for information categories including health data, financial records, and other regulated content types. These practices enhance compliance with domain-specific regulatory requirements governing sensitive information.
Open-source alternatives demonstrate more varied regulatory alignment. Systems like n8n [183] provide specialized workflow components for handling regulated data categories, enabling compliance-oriented implementations in sensitive domains. These approaches highlight how specialized components can address domain-specific regulatory requirements within flexible implementation frameworks.

7.2.3 Compliance with Regulatory Frameworks. Adherence to applicable regulations ensures legally appropriate operation:
Jurisdictional Compliance Adaptation. Advanced systems implement regionally appropriate operational standards. Commercial implementations increasingly incorporate jurisdiction-specific adaptations that align with regional privacy regulations including GDPR, CCPA, and other frameworks. These practices enhance legal compliance across diverse deployment environments with varying regulatory requirements.
Open implementations demonstrate more deployment-dependent compliance approaches. Systems designed for flexible deployment like Flowith/OracleMode [77] provide configurable privacy controls that enable adaptation to specific regulatory environments. These approaches highlight the importance of adaptable privacy frameworks that can address diverse compliance requirements across implementation contexts.
Transparency and Control Mechanisms. Responsible systems provide appropriate visibility and user authority over information processing. Emerging regulatory frameworks are increasingly focusing on AI agents with autonomous capabilities. Osogami [204] proposes that regulation of autonomous AI systems should specifically consider action sequence patterns rather than individual actions in isolation, which has particular implications for Deep Research systems that execute complex multi-step research workflows. Commercial implementations increasingly emphasize transparency through explicit processing disclosures and user control mechanisms aligned with regulatory requirements. These practices enhance both regulatory compliance and user trust through appropriate information governance.
Open-source alternatives demonstrate varied transparency approaches. Systems like HKUDS/Auto-Deep-Research [112] provide detailed logging of information access and processing activities, enabling appropriate oversight and verification. These approaches highlight how transparent operation can enhance both compliance and trust across implementation contexts.

7.3 Source Attribution and Intellectual Property

Proper acknowledgment of information sources and respect for intellectual property rights are essential for ethical information utilization.

7.3.1 Citation Generation and Verification. Accurate source attribution requires reliable citation mechanisms:
Automated Citation Systems. Advanced implementations incorporate sophisticated citation generation for research outputs. Commercial systems like OpenAI/DeepResearch [197] and Perplexity/DeepResearch [209] implement automatic citation generation in standard academic formats, enhancing attribution quality and consistency. These capabilities support appropriate source acknowledgment without manual effort.
Open implementations demonstrate varied citation approaches. Systems like mshumer/OpenDeepResearcher [249] incorporate basic citation generation focused on fundamental bibliographic information. More sophisticated alternatives like dzhng/deep-research [321] provide enhanced citation capabilities including format customization and citation verification against reference databases.
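At its simplest, automated citation generation reduces to rendering structured source metadata through a style template. The `Source` fields and the APA-like template below are illustrative assumptions; real systems support many styles and, as noted above, may verify entries against reference databases.

```python
from dataclasses import dataclass

@dataclass
class Source:
    authors: str
    year: int
    title: str
    url: str

def cite(source: Source) -> str:
    """Render one bibliographic entry via a simple APA-like template."""
    return f"{source.authors} ({source.year}). {source.title}. {source.url}"
```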
Citation Completeness Verification. Responsible systems ensure comprehensive attribution for all utilized information. Commercial implementations increasingly incorporate citation coverage verification that identifies unsupported claims requiring additional attribution. These practices enhance attribution reliability by ensuring all significant claims maintain appropriate source connections.
Open-source alternatives demonstrate pragmatic approaches to attribution verification. Systems like grapeot/deep_research_agent [263] implement explicit source-claim mapping that maintains clear relationships between information and origins. These approaches highlight the importance of systematic attribution regardless of implementation sophistication.
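Given an explicit source-claim mapping, the citation coverage verification described above becomes a straightforward scan for claims with no attached sources. The dictionary representation and function name below are illustrative assumptions for this sketch.

```python
def unattributed_claims(claim_sources: dict[str, list[str]]) -> list[str]:
    """Return claims lacking any source, flagging them for additional attribution."""
    return [claim for claim, sources in claim_sources.items() if not sources]
```

Flagged claims can then be routed back for additional sourcing, or explicitly marked as system-generated synthesis rather than attributed fact.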

7.3.2 Intellectual Attribution Challenges. Special attribution considerations apply to complex intellectual contributions:
Idea Attribution Practices. Research systems must appropriately acknowledge conceptual contributions beyond factual information. Commercial implementations increasingly emphasize concept-level attribution that acknowledges intellectual frameworks and theoretical approaches beyond simple facts. These practices enhance ethical information utilization by appropriately recognizing intellectual contributions.
Open implementations demonstrate varied idea attribution approaches. Systems like Camel-AI/OWL [43] incorporate explicit concept attribution that identifies theoretical frameworks and analytical approaches utilized in research outputs. These approaches highlight the importance of comprehensive attribution beyond basic factual sources.
Synthesized Knowledge Attribution. Attribution becomes particularly challenging for insights synthesized across multiple sources. Advanced systems implement specialized attribution approaches for synthetic insights that acknowledge multiple contributing sources while clearly identifying novel connections. These practices enhance attribution accuracy for the increasingly common scenario of cross-source synthesis.
Open-source alternatives demonstrate pragmatic approaches to synthesis attribution. Systems like Agent-RL/ReSearch [2] implement explicit synthesis markers that distinguish between directly sourced information and system-generated connections. These approaches highlight the importance of transparent derivation even when direct attribution becomes challenging.

7.3.3 Copyright and Fair Use Considerations. Research activities interact with copyright protections in multiple dimensions:
Fair Use Evaluation Mechanisms. Research systems must navigate appropriate utilization of copyrighted materials. Commercial implementations increasingly incorporate fair use evaluation that considers purpose, nature, amount, and market impact when utilizing copyrighted content. These practices enhance legal compliance while enabling appropriate information utilization for legitimate research purposes.
Open implementations demonstrate varied copyright approaches. Systems like Jina-AI/node-DeepResearch [121] incorporate basic copyright acknowledgment focusing on proper attribution, while more sophisticated alternatives like Manus [164] provide enhanced copyright handling including content transformation assessment and restricted access mechanisms for sensitive materials.
Content Licensing Compliance. Responsible systems respect diverse license terms applicable to utilized content. Advanced implementations increasingly incorporate license-aware processing that adapts information utilization based on specific terms governing particular sources. These practices enhance compliance with varied license requirements across the information ecosystem.
Open implementations demonstrate more standardized licensing approaches. Systems like grapeot/deep_research_agent [263] incorporate simplified license categorization focusing on common frameworks including Creative Commons and commercial restrictions. These approaches highlight pragmatic strategies for license navigation within resource constraints.

7.3.4 Output Intellectual Property Frameworks. Clear rights management for research outputs enhances downstream utilization:
Output License Assignment. Complex questions arise regarding intellectual property in research outputs. Commercial systems increasingly implement explicit license assignment for generated content, clarifying intellectual property status for downstream utilization. These practices enhance transparency regarding usage rights for research outputs created through automated systems.
Open-source alternatives demonstrate varied approaches to output rights. Systems like OpenManus [193] incorporate explicit license designation for research outputs aligned with organizational policies and source restrictions. These approaches highlight the importance of clear intellectual property frameworks regardless of implementation context.
Derivative Work Management. Research systems must address whether outputs constitute derivative works of source materials. Commercial systems increasingly implement derivative assessment frameworks that evaluate the nature and extent of source transformation in research outputs. These practices enhance appropriate categorization for downstream utilization aligned with source licenses.
Open-source alternatives demonstrate varied derivation approaches. Systems such as QwenLM/Qwen-Agent [224] incorporate a basic transformation assessment focusing on content reorganization and analytical addition. These approaches highlight the importance of thoughtful derivative consideration regardless of implementation sophistication.

7.4 Accessibility and Digital Divide

Equitable access to research capabilities requires addressing systematic barriers.

7.4.1 Technology Access Disparities. Recent work has highlighted both adoption barriers and opportunities for making Deep Research systems more accessible. Bianchini et al. [29] and Zhuang et al. [334] identify specific organizational and individual factors affecting AI adoption in scientific research contexts, with implications for Deep Research deployment. Accessibility-focused approaches like those presented by Mowar et al. [179] demonstrate how AI coding assistants can be specifically designed to support accessible development practices, suggesting parallel opportunities for accessibility-centered Deep Research systems. Extending this, systems such as ResearchAgent [18] showcase how AI can lower barriers to scientific innovation by enabling iterative refinement of research ideas through collaborative feedback mechanisms, thus democratizing access to complex ideation processes.
Resource requirements create potential exclusion for various user segments:

Computational Requirement Considerations. Resource-intensive systems may exclude users without substantial computing access. Commercial cloud-based implementations address this challenge through shared infrastructure that reduces local requirements, though with associated cost barriers. Open-source alternatives demonstrate varied resource profiles, with systems like Camel-AI/OWL [43] emphasizing efficiency to enable broader deployment on limited hardware.
Cost Barrier Mitigation. Financial requirements create systematic access disparities across socioeconomic dimensions. Commercial implementations demonstrate varied pricing approaches, with systems like Perplexity/DeepResearch [209] offering limited free access alongside premium tiers. Open-source alternatives like HKUDS/Auto-Deep-Research [112] and nickscamara/open-deep-research [42] eliminate direct cost barriers while potentially introducing technical hurdles.

7.4.2 User Expertise Requirements. Technical complexity creates additional access barriers beyond resource considerations:
Technical Expertise Dependencies. Complex system deployment and operation may exclude users without specialized knowledge. Commercial implementations address this challenge through managed services that eliminate deployment complexity, though with reduced customization flexibility. Open-source alternatives demonstrate varied usability profiles, with systems like OpenManus [193] emphasizing simplified deployment to enhance accessibility despite local operation.
Domain Knowledge Prerequisites. Effective research still requires contextual understanding for appropriate utilization. Both commercial and open-source implementations increasingly incorporate domain guidance that assists users with limited background knowledge in specific research areas. These capabilities enhance accessibility by reducing domain expertise barriers to effective research utilization.

7.4.3 Inclusivity and Universal Design Approaches. Deliberate inclusive design can address systematic access barriers:
Linguistic and Cultural Inclusivity. Language limitations create significant barriers for non-dominant language communities. Commercial implementations increasingly offer multilingual capabilities, though with persistent quality disparities across languages. Open-source alternatives demonstrate varied language support, with systems like Flowith/OracleMode [77] emphasizing extensible design that enables community-driven language expansion beyond dominant languages.
Disability Accommodation Approaches. Accessible design ensures appropriate access for users with diverse abilities. Commercial implementations increasingly incorporate accessibility features including screen reader compatibility, keyboard navigation, and alternative format generation. Open-source alternatives demonstrate more varied accessibility profiles, highlighting an area for continued community development to ensure equitable access across implementation contexts.
The ethical considerations explored in this section highlight the complex responsibilities associated with Deep Research technologies beyond technical performance. While current implementations demonstrate varying approaches to these challenges across commercial and open-source ecosystems, consistent patterns emerge in the importance of factual verification, attribution quality, privacy protection, intellectual property respect, and accessible design. Addressing these considerations represents a critical priority for responsible development and deployment of these increasingly influential research technologies.

8 Future Research Directions

The rapidly evolving field of Deep Research presents numerous opportunities for technical advancement and application expansion. Recent work by Zheng et al. [329] proposes scaling deep research capabilities via reinforcement learning in real-world environments, while Wu et al. [297] explore enhancing reasoning capabilities of LLMs with tools specifically for deep research applications. The comprehensive framework for building effective agents outlined by Anthropic [11] provides additional design principles that could inform future Deep Research systems. This section examines promising research directions (illustrated in Figure 11) that could significantly enhance capabilities, address current limitations, and expand practical impact across domains, focusing on four key areas: advanced reasoning architectures, multimodal integration, domain specialization, and human-AI collaboration with standardization.

8.1 Advanced Reasoning Architectures

Enhanced reasoning capabilities represent a fundamental advancement opportunity for next-generation systems.

8.1.1 Context Window Optimization and Management. The information-intensive nature of deep research tasks presents fundamental challenges for context window utilization:
Information Compression and Prioritization. Current systems struggle with context window exhaustion when processing extensive research materials. Future architectures could incorporate sophisticated compression mechanisms that maintain semantic content while reducing token consumption. Early steps in this direction appear in systems like OpenAI/DeepResearch [197], which implements basic summarization for lengthy sources. Recent work on academic paper review systems demonstrates how hierarchical processing of extended research content can maintain coherence while managing context limitations [333]. Semantic navigation techniques offer complementary approaches by enabling efficient exploration of problem-solution spaces within constrained domains, optimizing context usage through input filtering while enhancing generation quality [238]. More advanced approaches could develop adaptive compression that preserves crucial details while condensing secondary information based on query relevance.

Fig. 11. Research Directions for Deep Research Systems
Implementation opportunities include developing hierarchical summarization techniques that maintain multi-level representations of sources, implementing information relevance scoring that prioritizes context allocation to critical content, and designing dynamic context management that continuously optimizes window utilization throughout research workflows. These advances could significantly enhance information processing capabilities without requiring proportional increases in context length.
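As a concrete illustration of relevance scoring for context allocation, the following sketch ranks candidate sources by relevance and fills a fixed token budget, falling back to truncated stubs for lower-priority material. It is a hypothetical, simplified example: the whitespace word count standing in for tokens and the ten-word stub are assumptions, not any cited system's actual mechanism.

```python
# Hypothetical sketch: allocate a fixed token budget across candidate
# sources by relevance score, keeping a short stub of low-priority
# sources instead of dropping them entirely. All names are illustrative.

def allocate_context(sources, budget):
    """sources: list of (text, relevance) pairs; budget: max total
    "tokens" (approximated here as whitespace-delimited words)."""
    ranked = sorted(sources, key=lambda s: s[1], reverse=True)
    selected, used = [], 0
    for text, rel in ranked:
        cost = len(text.split())
        if used + cost <= budget:
            selected.append(text)                # include in full
            used += cost
        else:
            stub = " ".join(text.split()[:10])   # crude stand-in summary
            if used + 10 <= budget:
                selected.append(stub)
                used += min(10, cost)
    return selected

docs = [(("alpha " * 50).strip(), 0.9),
        (("beta " * 50).strip(), 0.7),
        (("gamma " * 50).strip(), 0.2)]
kept = allocate_context(docs, budget=110)
```

A production implementation would replace the word-count proxy with real tokenizer counts and the truncation stub with model-generated hierarchical summaries.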
External Memory Architectures. Beyond compression, architectural innovations could fundamentally transform context window utilization. Future systems could implement sophisticated external memory frameworks that maintain rich information representations outside the primary context window, accessing them through efficient retrieval mechanisms when needed. Systems like Camel-AI/OWL [43] demonstrate early steps with basic retrieval-augmented generation, but more comprehensive approaches could enable effectively unlimited knowledge integration.
Research directions include developing differentiable retrieval mechanisms that seamlessly integrate external knowledge within reasoning flows, implementing structured memory hierarchies that organize information for efficient access, and designing memory-aware reasoning processes that explicitly consider information availability when planning analytical approaches. These architectures could fundamentally address context limitations while enhancing reasoning transparency and reliability.
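The store/retrieve interface such an external memory implies can be sketched as follows. The lexical-overlap scoring is a stand-in for learned embedding retrieval, and all class and method names are illustrative rather than drawn from any cited system.

```python
# Illustrative external-memory sketch: notes live outside the "context
# window" and are pulled back in by a simple lexical-overlap retriever.
# A production system would use learned embeddings; the store/retrieve
# interface is the point here, not the scoring function.
from collections import Counter
import math

class ExternalMemory:
    def __init__(self):
        self.notes = []  # full knowledge lives here, not in the prompt

    def store(self, text):
        self.notes.append((text, Counter(text.lower().split())))

    def retrieve(self, query, k=2):
        q = Counter(query.lower().split())
        def score(vec):
            dot = sum(q[w] * vec[w] for w in q)
            norm = (math.sqrt(sum(v * v for v in q.values())) *
                    math.sqrt(sum(v * v for v in vec.values()))) or 1.0
            return dot / norm          # cosine similarity over word counts
        ranked = sorted(self.notes, key=lambda n: score(n[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

mem = ExternalMemory()
mem.store("transformer context windows limit long documents")
mem.store("bayesian updating combines prior and evidence")
mem.store("retrieval augmented generation fetches external passages")
hits = mem.retrieve("how does retrieval help with context limits", k=2)
```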

8.1.2 Hybrid Symbolic-Neural Approaches. Integration of complementary reasoning paradigms offers significant potential:
Neuro-Symbolic Integration. Current Deep Research systems rely primarily on neural approaches with limited explicit reasoning structures. Future systems could integrate symbolic reasoning components that provide formal logical capabilities alongside neural flexibility, enhancing both reliability and explainability. Early examples of this direction appear in systems like Camel-AI/OWL [43], which incorporates structured knowledge representation within primarily neural architectures. Future research could develop more sophisticated integration approaches that leverage the complementary strengths of both paradigms.
Implementation approaches might include explicit logical verification layers that validate neural-generated reasoning, hybrid architectures that select appropriate reasoning mechanisms based on task characteristics, or integrated systems that translate between symbolic and neural representations as needed throughout complex workflows. These approaches could address current challenges in reliability and consistency while maintaining the flexibility and generalization capabilities of neural foundations.
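A minimal example of an explicit logical verification layer: symbolic rules check claims that a neural component might emit. Here the only rule is antisymmetry of a comparative relation, and the claim triples are invented for illustration; a real system would use a theorem prover or constraint solver.

```python
# Toy neuro-symbolic sketch: a symbolic layer validates the logical
# consistency of generated claims. The "logic" here is a single rule,
# antisymmetry of a "larger_than" relation. All claim data is invented.

def verify_claims(claims):
    """claims: iterable of (a, 'larger_than', b) triples.
    Returns the subset that contradicts an earlier accepted claim."""
    accepted, contradictions = set(), []
    for a, rel, b in claims:
        if rel != "larger_than":
            continue
        if (b, a) in accepted:          # rule: not (a > b and b > a)
            contradictions.append((a, rel, b))
        else:
            accepted.add((a, b))
    return contradictions

claims = [("jupiter", "larger_than", "earth"),
          ("earth", "larger_than", "moon"),
          ("earth", "larger_than", "jupiter")]   # conflicts with claim 1
bad = verify_claims(claims)
```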
Advanced Knowledge Graph Integration. While current systems already incorporate basic knowledge graph capabilities, future approaches could implement more sophisticated integration with dynamic, contextually aware knowledge structures. Beyond the entity relationship modeling seen in systems like HKUDS/Auto-DeepResearch [112], next-generation implementations could enable bidirectional updates where research findings automatically refine and expand knowledge graphs while simultaneously leveraging them for reasoning. Such approaches could incorporate uncertainty representation within graph structures, probabilistic reasoning across knowledge networks, and adaptive abstraction hierarchies that transform between detailed and high-level conceptual representations based on reasoning requirements.

Research opportunities include developing dynamic knowledge graph construction techniques that automatically build and refine structured representations from unstructured sources, implementing graph-aware attention mechanisms that incorporate relationship structures into neural reasoning, and designing hybrid querying approaches that combine graph traversal with neural generation. These advances could enhance precision for complex reasoning tasks requiring structured relationship understanding.

8.1.3 Causal Reasoning Enhancement. Moving beyond correlation to causal understanding represents a crucial capability advancement:
Causal Inference Mechanisms. Current systems excel at identifying correlations but struggle with robust causal analysis. Future research could develop specialized causal reasoning components that systematically identify potential causal relationships, evaluate evidence quality, and assess alternative explanations. Recent work in healthcare research by Schuemie et al. [241] demonstrates the challenges of establishing confident observational findings, highlighting the need for more sophisticated causal reasoning in research systems. Early steps in this direction appear in systems like OpenAI/DeepResearch [197], which incorporates basic causal language in relationship descriptions. Other research explores the use of AI to assist in mining causality, for instance, by searching for instrumental variables in economic analysis [105]. More sophisticated approaches could enable reliable causal analysis across domains.

Implementation opportunities include developing causal graph construction techniques that explicitly model intervention effects and counterfactuals, implementing causal uncertainty quantification that represents confidence in causal assertions, and designing specialized prompt structures that guide causal reasoning through structured analytical patterns. These advances could enhance research quality for domains where causal understanding is particularly crucial, including medicine, social sciences, and policy analysis.
Intervention Modeling Techniques. Advanced causal understanding requires sophisticated intervention and counterfactual reasoning capabilities. Future systems could incorporate explicit intervention modeling that simulates potential actions and outcomes based on causal understanding, enhancing both explanatory and predictive capabilities. Early examples of this direction appear in systems like Agent-RL/ReSearch [2], which implements basic intervention simulation within reinforcement learning frameworks. More comprehensive approaches could enable sophisticated what-if analysis across domains.
Research directions include developing counterfactual generation techniques that systematically explore alternative scenarios based on causal models, implementing intervention optimization algorithms that identify high-leverage action opportunities, and designing domain-specific intervention templates that embed field-specific causal knowledge for common analysis patterns. These advances could enhance practical utility for decision support applications requiring sophisticated action planning and outcome prediction.
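The intervention-modeling idea can be made concrete with a toy structural causal model, where a do()-style intervention overrides a variable's mechanism and downstream outcomes are compared against the baseline. The variables and linear coefficients below are invented for illustration only.

```python
# Minimal structural-causal-model sketch of intervention modeling: a
# do()-style intervention forces a variable's value, cutting its normal
# causal mechanism, and we compare downstream outcomes. The mechanisms
# (exercise -> fitness -> health) and coefficients are invented.

def simulate(do=None):
    """Evaluate the tiny SCM; do={'var': value} forces a variable."""
    v = {}
    v["exercise"] = do.get("exercise", 2.0) if do else 2.0
    v["fitness"] = (do.get("fitness") if do and "fitness" in do
                    else 3.0 * v["exercise"])
    v["health"] = (do.get("health") if do and "health" in do
                   else 10.0 + 2.0 * v["fitness"])
    return v

baseline = simulate()
intervened = simulate(do={"fitness": 0.0})  # counterfactual: fitness forced to 0
effect = baseline["health"] - intervened["health"]
```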

8.1.4 Uncertainty Representation and Reasoning. Sophisticated uncertainty handling enhances both accuracy and trustworthiness:
Multi-Dimensional Uncertainty Modeling. Current systems employ relatively simplistic uncertainty representations that inadequately capture different uncertainty types. Future research could develop multi-dimensional uncertainty frameworks that separately represent epistemic uncertainty (knowledge limitations), aleatoric uncertainty (inherent randomness), and model uncertainty (representation limitations). Early steps in this direction appear in systems like Perplexity/DeepResearch [209], which distinguishes between source uncertainty and integration uncertainty. More comprehensive approaches could enable more nuanced and reliable uncertainty communication.
Implementation opportunities include developing uncertainty propagation mechanisms that track distinct uncertainty types throughout reasoning chains, implementing uncertainty visualization techniques that effectively communicate multi-dimensional uncertainty to users, and designing uncertainty-aware planning algorithms that appropriately balance different uncertainty types in decision contexts. These advances could enhance both system reliability and appropriate user trust calibration.
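Keeping uncertainty dimensions separate through a reasoning chain might look like the following sketch. The independent-survival combination rule is an assumption chosen for simplicity, not an established standard, and the numeric values are illustrative.

```python
# Sketch of multi-dimensional uncertainty tracked through a reasoning
# chain: epistemic, aleatoric, and model uncertainty are kept separate
# rather than collapsed into a single confidence score.
from dataclasses import dataclass

@dataclass
class Uncertainty:
    epistemic: float   # knowledge gaps, reducible with more evidence
    aleatoric: float   # inherent randomness, irreducible
    model: float       # representation/approximation error

    def combine(self, other):
        # assumption: each dimension behaves like an independent
        # failure probability, so "survival" probabilities multiply
        merge = lambda a, b: 1.0 - (1.0 - a) * (1.0 - b)
        return Uncertainty(merge(self.epistemic, other.epistemic),
                           merge(self.aleatoric, other.aleatoric),
                           merge(self.model, other.model))

step1 = Uncertainty(epistemic=0.10, aleatoric=0.05, model=0.02)
step2 = Uncertainty(epistemic=0.20, aleatoric=0.00, model=0.02)
chain = step1.combine(step2)
```

Because the dimensions never mix, a downstream planner could, for example, seek more evidence when epistemic uncertainty dominates but not when the residual uncertainty is aleatoric.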
Bayesian Reasoning Integration. Probabilistic reasoning frameworks offer principled approaches to uncertainty handling and knowledge integration. Future systems could incorporate explicit Bayesian reasoning components that systematically update beliefs based on evidence strength and prior knowledge, enhancing both accuracy and explainability. Early examples of this direction appear in systems like grapeot/deep_research_agent [263], which implements basic evidence weighting within research workflows. More sophisticated integration could enable principled uncertainty handling across domains.
Research directions include developing scalable Bayesian inference techniques compatible with large-scale language models, implementing belief update explanation mechanisms that communicate reasoning in understandable terms, and designing domain-specific prior models that incorporate field-specific background knowledge for common analysis patterns. These advances could enhance reasoning quality for domains with inherent uncertainty or limited evidence.
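The core belief-update mechanism is straightforward to sketch in odds form: each piece of evidence carries a likelihood ratio P(evidence | H) / P(evidence | not H), and the posterior follows from sequential multiplication. The prior and likelihood ratios below are illustrative numbers, not derived from any cited system.

```python
# Sequential Bayesian updating in odds form: evidence with a likelihood
# ratio > 1 supports the hypothesis, < 1 counts against it.

def update_belief(prior, likelihood_ratios):
    """prior: P(H) in (0, 1); likelihood_ratios: one ratio per piece
    of evidence. Returns the posterior probability P(H | evidence)."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:      # Bayes' rule, applied sequentially
        odds *= lr
    return odds / (1.0 + odds)

# three pieces of evidence: supporting, weakly opposing, supporting
posterior = update_belief(prior=0.5, likelihood_ratios=[3.0, 0.5, 4.0])
```

An explanation layer could report each factor separately ("this source tripled the odds"), which is one route to the understandable belief-update explanations discussed above.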

8.2 Multi-Modal Deep Research

Expanding beyond text to incorporate diverse information modalities represents a significant advancement opportunity.

8.2.1 Visual Information Integration. Image understanding dramatically expands information access and analysis capabilities:
Scientific Image Analysis. Current systems demonstrate limited capabilities for extracting and interpreting visual scientific content. Future research could develop specialized visual understanding components for scientific images including graphs, diagrams, experimental images, and visualizations across domains. Early steps in this direction appear in systems like Gemini/DeepResearch [60], which incorporates basic chart extraction capabilities. Frameworks such as ChartCitor [96] provide fine-grained bounding box citations to enhance explainability for complex chart understanding, improving user trust and productivity. Specialized models like LHRS-Bot [180] demonstrate sophisticated reasoning capabilities for remote sensing imagery by leveraging geographic information and multimodal learning. The development of large-scale, domain-specific multimodal datasets for areas like entomology [272] and seafloor geology [188] is crucial for training more capable models. More comprehensive approaches could enable sophisticated analysis of visual scientific communication.

Implementation opportunities include developing specialized scientific visualization parsers that extract quantitative data from diverse chart types, implementing diagram understanding systems that interpret complex scientific illustrations across domains, and designing domain-specific visual analysis components optimized for field-specific imagery like medical scans or astronomical observations. These advances could dramatically expand information access beyond text-centric sources.
Visual Evidence Integration. Effective research increasingly requires integration of visual evidence alongside textual sources. Future systems could implement sophisticated multimodal reasoning that incorporates visual evidence within comprehensive analytical frameworks, enabling true multimodal research synthesis. Recent analyses have identified multi-modal integration as a key missing capability in current AI research systems [315], highlighting the critical importance of cross-modal reasoning for scientific applications. Early examples of this direction appear in systems like Gemini/DeepResearch [60], which provides basic integration of image-derived information. More sophisticated approaches could enable balanced evidence integration across modalities.
Research directions include developing evidence alignment techniques that match textual and visual information addressing common questions, implementing cross-modal consistency verification that identifies conflicts between textual claims and visual evidence, and designing multimodal synthesis mechanisms that generate integrated understanding across information types. These advances could enhance research quality for domains with significant visual information components.

8.2.2 Multimodal Source Analysis. Comprehensive understanding requires integrated analysis across diverse information formats:
Video Content Processing. Video represents an increasingly important but currently underutilized information source. Future research could develop specialized video understanding components that extract and interpret temporal visual information, including presentations, interviews, demonstrations, and dynamic processes. Initial steps in this direction are emerging in systems like OpenAI’s DALL-E 3, though not yet integrated into Deep Research workflows. Comprehensive integration could enable access to the extensive knowledge embedded in video content.
Implementation opportunities include developing lecture understanding systems that extract structured knowledge from educational videos, implementing process analysis components that interpret demonstrations and procedures, and designing integrated audio-visual analysis that combines visual information with spoken content for comprehensive understanding. These advances could expand information access to the rapidly growing corpus of video knowledge.
Audio Content Integration. Spoken information in podcasts, lectures, interviews, and discussions represents a valuable knowledge source. Future systems could incorporate sophisticated audio processing that extracts, interprets, and integrates spoken information within research workflows. Early examples of speech processing appear in transcription services, but comprehensive research integration remains limited. Advanced approaches could enable seamless incorporation of spoken knowledge alongside traditional text sources.
Research directions include developing speaker identification and attribution systems that maintain appropriate source tracking for spoken content, implementing domain-specific terminology extraction that accurately captures specialized vocabulary in varied acoustic conditions, and designing temporal alignment techniques that connect spoken information with related textual or visual content. These advances could expand information access while maintaining appropriate attribution and context.

8.2.3 Cross-Modal Reasoning Techniques. Effective multimodal research requires specialized reasoning approaches across information types:
Multi-Modal Chain of Thought Reasoning. Current reasoning processes typically operate primarily within single modalities despite handling diverse information types. Future systems could implement true multimodal reasoning chains that explicitly incorporate diverse information types throughout the analytical process, not just in final outputs. Early steps appear in systems like Gemini/DeepResearch [60], which demonstrates basic visual incorporation in reasoning steps. More sophisticated approaches could enable reasoning flows that seamlessly transition between textual analysis, visual processing, numerical computation, and spatial reasoning based on task requirements.
Research opportunities include developing explicit multi-modal reasoning protocols that formalize information transfer between modalities, implementing cross-modal verification techniques that leverage complementary information types throughout reasoning chains, and designing unified representation frameworks that enable coherent reasoning across diverse information formats. These advances could significantly enhance reasoning quality for complex research tasks requiring integrated understanding across modalities, moving beyond the current text-centric reasoning paradigms to more human-like analytical processes that naturally leverage the most appropriate modality for each reasoning component.
Cross-Modal Consistency Verification. Integrating diverse information modalities introduces new consistency challenges. Future research could develop specialized verification mechanisms that assess consistency across textual, visual, numerical, and temporal information, enhancing overall reliability. Early steps in this direction appear in systems like Gemini/DeepResearch [60], which implements basic cross-format validation. More sophisticated approaches could enable reliable integration of increasingly diverse information types.
Implementation opportunities include developing cross-modal contradiction detection algorithms that identify conflicts between information expressed in different formats, implementing uncertainty alignment techniques that reconcile confidence estimates across modalities, and designing multimodal fact verification systems that leverage complementary evidence types for enhanced reliability. These advances could address emerging challenges in multimodal information integration.
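A toy version of cross-modal contradiction detection: a numeric claim extracted from text is compared against a value notionally extracted from a chart. Real systems would need OCR and chart parsing; here the text extraction is reduced to a regular expression, the chart value is passed in directly, and the tolerance threshold is an arbitrary assumption.

```python
# Toy cross-modal consistency check between a textual percentage claim
# and a value taken from a chart. Extraction is deliberately simplistic.
import re

def check_consistency(text_claim, chart_value, tolerance=0.05):
    m = re.search(r"(\d+(?:\.\d+)?)\s*%", text_claim)
    if not m:
        return "no_numeric_claim"
    text_value = float(m.group(1))
    rel_err = abs(text_value - chart_value) / max(chart_value, 1e-9)
    return "consistent" if rel_err <= tolerance else "contradiction"

verdict = check_consistency("Revenue grew by 12% year over year",
                            chart_value=19.0)
```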
Multimodal Explanation Generation. Effective communication often requires coordinated explanation across modalities. Future systems could generate truly multimodal research outputs that combine textual, visual, and interactive components to enhance understanding and persuasiveness. Early examples of this direction appear in systems like mshumer/OpenDeepResearcher [249], which implements basic report visualization. More comprehensive approaches could enable sophisticated multimodal communication tailored to content requirements.
Research directions include developing coordinated generation architectures that produce aligned content across modalities, implementing adaptive format selection algorithms that identify optimal representation formats for different content types, and designing multimodal narrative structures that effectively combine diverse formats within coherent explanatory frameworks. These advances could enhance communication effectiveness across application domains.

8.3 Domain-Specific Optimization

Tailored enhancement for particular fields offers significant performance improvements for specialized applications.

8.3.1 Scientific Domain Adaptation. Scientific research presents unique requirements and opportunities for specialization:
Field-Specific Model Adaptation. Current systems employ relatively general architectures across scientific domains. Future research could develop specialized adaptation techniques that optimize performance for particular scientific fields including physics, chemistry, biology, and others with distinct knowledge structures and reasoning patterns. Early steps in this direction appear in systems like AutoGLM-Research [330], which implements domain-specific prompting. Domain-specialized research agents have demonstrated particular promise in physics [305], chemistry [6, 34, 50, 326], materials science [189], oceanography [28], geospatial analysis [165], patent research [227, 285], and broader scientific discovery workflows [84]. These specialized implementations highlight the value of domain adaptation beyond general research capabilities. More comprehensive adaptation could enable significant performance improvements for scientific applications.
Implementation approaches might include domain-specific fine-tuning regimes that emphasize field-relevant reasoning patterns, specialized architectural modifications that enhance performance for domain-characteristic tasks, or hybrid systems that incorporate symbolic components for domain-specific formal reasoning. These approaches could address current limitations in scientific reasoning while maintaining general capabilities for cross-domain research.
Scientific Workflow Integration. Effective scientific application requires integration with existing research methodologies and tools. Future systems could implement specialized interfaces for scientific workflows including experimental design, data analysis, literature integration, and theory development. Early examples of this direction appear in systems like n8n [183], which provides workflow automation for data processing. Platforms designed to support machine learning development in fundamental science also illustrate this trend, enabling research in federated cloud environments [9]. More comprehensive integration could enable seamless incorporation within scientific research processes. Research assistant tools employing prompt-based templates demonstrate domain-agnostic support for tasks such as enhanced literature search queries and preliminary peer review, facilitating standardized assistance across diverse scientific fields [245]. User studies highlight varying automation needs across DS/ML workflows, suggesting targeted rather than complete end-to-end automation aligns with researcher preferences [284].

Research opportunities include developing experimental design assistants that generate and refine research protocols based on literature and objectives, implementing integrated analysis pipelines that combine automated and human analytical components, and designing theory development frameworks that link empirical findings with formal theoretical structures. These advances could enhance practical scientific impact beyond general information access [44, 288].
8.3.2 Legal and Regulatory Research Support. Legal applications present distinct requirements for precision, structured reasoning, and comprehensive regulatory coverage:

Legal Reasoning Enhancement. Current systems struggle with the precision and structure of legal analysis. Future research could develop specialized legal reasoning components that incorporate case-based reasoning, statutory interpretation, and doctrinal analysis within coherent legal frameworks. Early steps in this direction appear in systems like OpenAI/DeepResearch [197], which incorporates basic legal language handling. More comprehensive specialization could enable sophisticated legal applications across practice areas.
Implementation opportunities include developing case analysis systems that extract and apply relevant precedent principles, implementing statutory interpretation frameworks that apply established analytical methodologies to legislative text, and designing multi-jurisdictional reasoning approaches that navigate conflicts of law across legal boundaries. These advances could enhance practical utility for legal research and analysis applications.
Regulatory Compliance Specialization. Compliance applications require comprehensive coverage with exceptional precision. Future systems could implement specialized compliance components that ensure complete regulatory coverage, systematic obligation identification, and reliable guidance across complex regulatory landscapes. Early examples of this direction appear in general information retrieval, but true compliance optimization remains limited. Advanced approaches could enable reliable automation of currently labor-intensive compliance processes.
Research directions include developing regulatory change tracking systems that monitor and interpret evolving requirements, implementing obligation extraction techniques that identify and classify compliance requirements across regulatory texts, and designing responsibility mapping approaches that connect regulatory obligations with organizational functions and processes. These advances could enhance practical utility for compliance-intensive industries facing complex regulatory environments.

8.3.3 Medical and Healthcare Research Support. Healthcare applications present unique requirements and ethical considerations:
Clinical Evidence Synthesis. Medical applications require exceptional precision and comprehensive evidence integration. Future research could develop specialized medical components that synthesize clinical evidence across studies, guidelines, and practice observations while maintaining rigorous evaluation standards. Recent efforts such as Google’s co-scientist project [97] demonstrate the potential for AI to assist in scientific research including medical domains. Early steps in this direction appear in systems like Perplexity/DeepResearch [209], which implements enhanced citation for medical claims. More comprehensive specialization could enable reliable clinical decision support.
Implementation approaches might include evidence grading systems that apply established frameworks like GRADE [21] to clinical research, meta-analysis components that systematically integrate quantitative findings across studies, and guideline alignment techniques that map evidence to established clinical recommendations. These advances could enhance practical utility for evidence-based medicine while maintaining appropriate caution for this high-stakes domain.
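As a concrete illustration of the meta-analysis components described above, the following sketch pools effect estimates by inverse-variance weighting, the simplest fixed-effect model. The study values are hypothetical, and a deployable clinical component would additionally need random-effects modeling, heterogeneity diagnostics, and GRADE-style certainty ratings.

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse-variance fixed-effect meta-analysis.

    Each study's effect estimate is weighted by the reciprocal of its
    sampling variance, so more precise studies dominate the pooled value.
    Returns (pooled_effect, pooled_standard_error).
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    standard_error = math.sqrt(1.0 / sum(weights))
    return pooled, standard_error

# Two hypothetical studies: effect estimates with their sampling variances.
effect, se = fixed_effect_pool([0.2, 0.5], [0.04, 0.01])
print(round(effect, 3), round(se, 3))  # the more precise study dominates
```

Even this toy version makes the quantitative behavior of "systematically integrating findings across studies" concrete: the second, lower-variance study pulls the pooled estimate toward its own value.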
Patient-Specific Research Adaptation. Personalized medicine requires adapting general knowledge to individual patient contexts. Future systems could implement specialized personalization components that adapt research findings based on patient characteristics, comorbidities, preferences, and other individual factors. Early examples of this direction appear in basic filtering of contraindications, but comprehensive personalization remains limited. Advanced approaches could enable truly personalized evidence synthesis for clinical applications.
Research opportunities include developing comorbidity reasoning systems that adjust recommendations based on condition interactions, implementing preference integration frameworks that incorporate patient values in evidence synthesis, and designing personalized risk-benefit analysis approaches that quantify individual trade-offs for treatment options. These advances could enhance clinical utility while respecting the complexity of individual patient contexts.
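The basic contraindication filtering noted above as an early step toward personalization can be sketched as set intersection between a treatment's contraindications and a patient's comorbidities. The treatment names and rules below are invented for illustration only and are not clinical guidance.

```python
# Hypothetical contraindication table: treatment -> conditions ruling it out.
# Names and rules are invented for illustration, not clinical guidance.
CONTRAINDICATIONS = {
    "nsaid": {"chronic_kidney_disease", "peptic_ulcer"},
    "beta_blocker": {"severe_asthma"},
    "ace_inhibitor": {"pregnancy", "angioedema_history"},
}

def filter_treatments(candidates, patient_conditions):
    """Drop candidates contraindicated by the patient's comorbidities."""
    conditions = set(patient_conditions)
    return [t for t in candidates
            if not CONTRAINDICATIONS.get(t, set()) & conditions]

options = filter_treatments(
    ["nsaid", "beta_blocker", "ace_inhibitor"],
    ["chronic_kidney_disease"],
)
print(options)  # the contraindicated option is removed for this patient
```

The comorbidity reasoning and risk-benefit analysis proposed above would go far beyond such static lookup, but the sketch shows the minimal data structure personalization starts from.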

8.4 Human-AI Collaboration and Standardization

Enhancing human-AI partnership and establishing common standards represent crucial directions for practical research impact and ecosystem development.

8.4.1 Interactive Research Workflows. Effective collaboration requires sophisticated interaction throughout the research process:
Adaptive Query Refinement. Current systems offer limited interaction during query formulation and refinement. Future research could develop sophisticated refinement interfaces that collaboratively develop research questions through iterative clarification, expansion, and focusing based on initial results and user feedback. Early steps in this direction appear in systems like HKUDS/Auto-Deep-Research [112], which implements basic clarification dialogues, and benchmarks such as QuestBench [141], which evaluates AI systems’ ability to identify missing information and formulate appropriate clarification questions in underspecified reasoning tasks. More comprehensive approaches could enable truly collaborative question development. Frameworks like AutoAgent [262] demonstrate how zero-code interfaces can enable nontechnical users to effectively guide deep research processes through intuitive interaction patterns, while other systems are exploring methods that go beyond standard retrieval-augmented generation to better handle question identification in real-time conversations [4]. Implementation opportunities include developing intent clarification systems that identify potential ambiguities and alternatives in research questions, implementing scope adjustment interfaces that dynamically expand or narrow research focus based on initial findings, and designing perspective diversification tools that suggest alternative viewpoints relevant to research objectives. These advances could enhance research quality by improving question formulation through human-AI collaboration.
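A slot-based sketch of the intent clarification idea above, assuming slot extraction has already been performed upstream (in a real system by an LLM or parser). The scope dimensions and question templates are illustrative, not drawn from any particular system.

```python
def clarification_questions(query, detected_slots):
    """Return follow-up questions for scope dimensions the query leaves open.

    `detected_slots` maps dimension name -> value extracted from `query`
    (None when absent); in a real system an LLM or parser would fill it.
    """
    templates = {
        "time_range": "Which time period should the research cover?",
        "domain": "Which field or application area should be prioritized?",
        "output_form": "Do you want a summary, a detailed report, or a source list?",
    }
    return [q for slot, q in templates.items() if not detected_slots.get(slot)]

qs = clarification_questions(
    "Survey work on agent benchmarks",
    {"time_range": None, "domain": "AI agents", "output_form": None},
)
print(qs)  # asks only about the dimensions the query left underspecified
```

Benchmarks like QuestBench evaluate precisely this behavior, identifying which information is missing and asking for it, though against much richer reasoning tasks than fixed slot templates.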
Interactive Exploration Interfaces. Current systems typically present relatively static research outputs. Future research could develop sophisticated exploration interfaces that enable dynamic navigation, drilling down, and expansion across research findings based on evolving interests. Early examples of this direction appear in systems like OpenManus [193], which provides basic exploration capabilities. Advanced approaches could enable truly interactive research experiences tailored to discovery patterns.
Research directions include developing information visualization techniques specifically designed for research navigation, implementing adaptive detail management that expands or collapses content areas based on user interest signals, and designing seamless source transition mechanisms that enable smooth movement between synthesis and original sources. These advances could enhance discovery by enabling more exploratory and serendipitous research experiences.

8.4.2 Expertise Augmentation Models. Effective augmentation requires adaptation to user expertise and objectives:
Expertise-Adaptive Interaction. Current systems offer limited adaptation to user knowledge levels and expertise. Future research could develop sophisticated adaptation mechanisms that tailor research approaches, explanations, and outputs based on user domain knowledge and research sophistication. Early steps in this direction appear in systems like Perplexity/DeepResearch [209], which implements basic terminology adjustment. More comprehensive adaptation could enable truly personalized research assistance aligned with individual expertise.
Implementation approaches might include expertise inference systems that dynamically assess user knowledge through interaction patterns, explanation adaptation mechanisms that adjust detail and terminology based on expertise models, and knowledge gap identification tools that highlight potentially unfamiliar concepts within research contexts. Furthermore, mechanisms that learn to strategically request expert assistance when encountering gaps exceeding autonomous capability - as formalized in the Learning to Yield and Request Control (YRC) coordination problem [66] - are crucial for optimizing intervention timing and resolution effectiveness. These advances could enhance research effectiveness across diverse user populations with varying domain familiarity.
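A minimal sketch of expertise inference and explanation adaptation as described above. The interaction signals and thresholds are invented placeholders for what would, in practice, be a learned model over much richer interaction data.

```python
def infer_expertise(signals):
    """Map crude interaction signals to an expertise level.

    The signal names and thresholds are illustrative; a deployed system
    would learn this mapping from interaction histories.
    """
    score = signals["domain_terms_used"] - 2 * signals["definitions_requested"]
    if score >= 5:
        return "expert"
    if score >= 1:
        return "intermediate"
    return "novice"

def adapt_explanation(level, technical, plain):
    """Expand technical terminology for non-expert users."""
    return technical if level == "expert" else f"{technical} ({plain})"

level = infer_expertise({"domain_terms_used": 6, "definitions_requested": 3})
print(level, adapt_explanation(level, "HR", "hazard ratio"))
```

The same inferred level could also gate the knowledge-gap highlighting and expert-escalation mechanisms mentioned above, deciding when a concept needs glossing versus when the system should yield control entirely.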
Complementary Capability Design. Optimal augmentation leverages complementary human and AI strengths. Future systems could implement specialized interfaces designed around capability complementarity, emphasizing AI contributions in information processing while prioritizing human judgment for subjective evaluation and contextual understanding. Early examples of this direction appear in systems like Agent-RL/ReSearch [2], which implements basic division of analytical responsibilities. More sophisticated approaches could enable truly synergistic human-AI research partnerships.
Research opportunities include developing explanation components specifically designed to facilitate human judgment rather than replace it, implementing confidence signaling mechanisms that highlight areas particularly requiring human evaluation, and designing interactive critique frameworks that enable efficient human feedback on system reasoning. Feng Xiong et al. [303] redefine the collaborative dynamics between human researchers and AI systems. These advances could enhance collaborative effectiveness by optimizing around natural capability distributions.

8.4.3 Framework Standardization Efforts. Common architectures enable modular development and component interoperability:
Component Interface Standardization. Advanced implementations employ standardized interfaces between major system components. The OpenAI/AgentsSDK [199] defines explicit interface standards for agent components, enabling modular development and component substitution. Emerging industry standards like Anthropic’s Model Context Protocol (MCP) [12] provide standardized interaction frameworks for large language models and tools, enabling consistent integration patterns across implementations. Similarly, Google’s Agent2Agent Protocol (A2A) [90, 92] establishes standardized communication patterns between autonomous agents, facilitating reliable multi-agent coordination. Open-source alternatives like smolagents/open_deep_research [115] implement comparable messaging protocols between agent components, highlighting industry convergence toward standardized interaction patterns. Projects like Open_deep_search [8] further demonstrate how standardized protocols enable effective collaboration between specialized research agents. Integration of diverse API interactions, as explored in Toolllm [223], provides additional standardization opportunities for managing external tool usage within research workflows.
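The value of standardized component interfaces can be sketched with a minimal message envelope and a shared handler protocol. This shape is purely illustrative; it does not reproduce the actual MCP or A2A schemas, which define far richer, versioned message formats.

```python
from dataclasses import dataclass, field
from typing import List, Protocol

@dataclass
class Message:
    """Minimal message envelope; real protocols such as MCP or A2A
    define richer, versioned schemas."""
    sender: str
    content: str
    metadata: dict = field(default_factory=dict)

class ResearchComponent(Protocol):
    """Any component implementing this interface is substitutable."""
    name: str

    def handle(self, msg: Message) -> Message: ...

class EchoSearcher:
    """Stand-in component; a real one would call a search backend."""
    name = "searcher"

    def handle(self, msg: Message) -> Message:
        return Message(self.name, f"results for: {msg.content}")

def run_pipeline(components: List[ResearchComponent], query: str) -> Message:
    msg = Message("user", query)
    for component in components:  # interchangeable behind the shared interface
        msg = component.handle(msg)
    return msg

out = run_pipeline([EchoSearcher()], "agent benchmarks")
print(out.sender, out.content)
```

The point of the sketch is modularity: because every component speaks the same envelope, a retriever, verifier, or summarizer from a different project can be dropped into the pipeline without changing the orchestration code, which is the substitution property that standards like MCP and A2A aim to guarantee at ecosystem scale.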
Evaluation Metric Standardization. Current evaluation practices vary widely across implementations. Future research could establish standardized evaluation frameworks that enable consistent assessment and comparison across systems and components. Early examples of this direction appear in benchmarks like HLE [212] and MMLU [33], but comprehensive standardization remains limited. Advanced standardization could enable more efficient development through reliable quality signals and clear improvement metrics.
Research opportunities include developing standardized benchmark suites targeting specific research capabilities, implementing common evaluation methodologies across research domains and applications, and designing multi-dimensional assessment frameworks that provide nuanced performance profiles beyond simple accuracy metrics. These advances could enhance ecosystem quality by establishing clear standards and highlighting genuine improvements.
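A multi-dimensional assessment framework of the kind proposed above might be sketched as follows. The dimensions and weights are invented; the design point is that the full per-dimension profile is reported alongside any composite, since a single composite score hides exactly the trade-offs such frameworks exist to expose.

```python
def assessment_profile(scores, weights):
    """Combine per-dimension scores (0-1) into a weighted composite
    while keeping the full profile visible alongside it."""
    assert scores.keys() == weights.keys(), "dimension sets must match"
    total = sum(weights.values())
    composite = sum(scores[d] * weights[d] for d in scores) / total
    return {"dimensions": dict(scores), "composite": composite}

profile = assessment_profile(
    {"factual_accuracy": 0.9, "source_coverage": 0.6, "citation_quality": 0.8},
    {"factual_accuracy": 3, "source_coverage": 1, "citation_quality": 1},
)
print(profile["composite"])  # weighted mean; dimensions stay visible
```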

8.4.4 Cross-Platform Research Protocols. Interoperability across diverse systems enhances collective capabilities:
Research Result Exchange Formats. Current systems typically produce outputs in incompatible formats. Future research could develop standardized exchange formats that enable seamless sharing of research results across platforms and systems, enhancing collective capabilities. Early steps in this direction appear in basic document formats, but true research-specific standardization remains limited. Comprehensive standardization could enable research workflows spanning multiple specialized systems.
Implementation opportunities include defining standard structures for research findings with appropriate attribution and confidence metadata, establishing common formats for evidence representation across systems, and developing shared schemas for research questions and objectives to enable distributed processing. These advances could enhance capability through specialization and complementary system utilization.
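A hypothetical wire format for the research result exchange discussed above, carrying attribution and confidence metadata. The field names are illustrative only; as the text notes, no research-specific exchange standard currently exists.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class Source:
    url: str
    retrieved: str  # ISO-8601 date of retrieval

@dataclass
class Finding:
    """One research finding with attribution and confidence metadata.

    The schema is a sketch of what a cross-platform exchange format
    might carry, not an established standard.
    """
    claim: str
    confidence: float  # calibrated 0-1 estimate
    sources: List[Source] = field(default_factory=list)

def to_wire(finding: Finding) -> str:
    """Serialize a finding to JSON for exchange between systems."""
    return json.dumps(asdict(finding), sort_keys=True)

wire = to_wire(Finding(
    claim="Deep Research systems emerged rapidly after 2023",
    confidence=0.9,
    sources=[Source("https://example.org/survey", "2025-01-15")],
))
round_trip = json.loads(wire)
print(round_trip["claim"])
```

Any system that can parse this envelope can consume another system's findings with provenance intact, which is the precondition for the distributed workflows discussed next.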
Distributed Research Coordination. Advanced interoperability enables coordinated research across systems with complementary capabilities. Future research could develop sophisticated coordination frameworks that enable multi-system research workflows with appropriate task allocation, result integration, and process management. Early examples of this direction appear in workflows like those enabled by n8n [183], but comprehensive research-specific coordination remains limited. Advanced approaches could enable truly distributed research ecosystems with specialized components addressing distinct process elements.
Research directions include developing distributed search coordination protocols that efficiently leverage specialized search capabilities, implementing cross-system result verification techniques that ensure consistency across distributed findings, and designing efficient coordination protocols that minimize communication overhead in distributed research workflows. These advances could enhance collective capability through specialization and parallelization across the ecosystem.

8.4.5 Joint Human-AI Knowledge Creation. Moving beyond information retrieval to collaborative insight generation:

Collaborative Creation Environments. Advanced collaboration requires sophisticated content co-creation capabilities. Future research could develop specialized collaborative environments that enable fluid transition between human and AI contributions within unified document development. Early steps in this direction appear in systems like mshumer/OpenDeepResearcher, which implements basic collaborative document generation. Advanced interfaces like those explored in Self-Explanation in Social AI Agents [23] demonstrate how explanation capabilities can enhance collaborative research through more transparent reasoning processes. Similarly, innovative interaction paradigms like AI-Instruments [232] show how prompts can be embodied as instruments to abstract and reflect commands as general-purpose tools, suggesting novel approaches to research interface design that enhance collaborative capabilities through intuitive interaction patterns. Approaches where AI agents learn to assist other agents by observing them also show promise for developing more effective collaborative behaviors [127]. Effidit demonstrates comprehensive writing support through multifunctional capabilities including text polishing and context-aware phrase refinement, extending collaborative editing beyond basic generation [248]. More comprehensive approaches could enable truly integrated co-creation experiences.
Implementation opportunities include developing section suggestion systems that propose potential content expansions based on document context, implementing stylistic adaptation mechanisms that align AI-generated content with established document voice and approach, incorporating implicit feedback mechanisms that interpret rejected suggestions as negative signals to refine outputs while preserving original intent [271], and designing seamless revision interfaces that enable efficient editing across human and AI contributions, such as the iterative human-AI co-editing demonstrated by REVISE [302], a framework that lets writers dynamically modify summary segments through fill-in-the-middle generation. These advances could enhance collaborative productivity by reducing friction in joint content development [116].
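The implicit-feedback idea above, where rejected suggestions act as negative signals, can be sketched as a simple exponential preference update. The category name and learning rate are illustrative; real systems would model feedback far more richly.

```python
def update_preference(weights, category, accepted, lr=0.2):
    """Nudge a suggestion category's weight toward 1 on acceptance and
    toward 0 on rejection; the weight gates how often the category is
    proposed. Category names and the learning rate are illustrative."""
    current = weights.get(category, 0.5)
    target = 1.0 if accepted else 0.0
    weights[category] = current + lr * (target - current)
    return weights

prefs = {"stylistic_edit": 0.5}
update_preference(prefs, "stylistic_edit", accepted=False)
update_preference(prefs, "stylistic_edit", accepted=False)
print(round(prefs["stylistic_edit"], 2))  # repeated rejections suppress the category
```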
Mixed-Initiative Research Design. Sophisticated collaboration includes shared determination of research direction and approach. Future systems could implement mixed-initiative frameworks that dynamically balance direction setting between human preferences and AI-identified opportunities throughout the research process. Early examples of this direction appear in systems like smolagents/open_deep_research [115], which implements basic suggestion mechanisms. Advanced approaches could enable truly collaborative research planning with balanced initiative distribution.
Research directions include developing opportunity identification systems that highlight promising but unexplored research directions, implementing trade-off visualization techniques that communicate potential research path alternatives and implications, and designing preference elicitation frameworks that efficiently capture evolving research priorities throughout the process, and integrating explainable reward function mechanisms to enhance human understanding of AI’s decision logic, thereby improving collaborative efficiency in value alignment contexts [239]. These advances could enhance discovery by combining human insight with AI-identified opportunities in balanced partnerships.
The future research directions outlined in this section highlight both the significant potential for advancement and the multi-faceted nature of Deep Research development. Progress will likely emerge through complementary advances across reasoning architectures, multimodal capabilities, domain specialization, human-AI collaboration, and ecosystem standardization. While commercial implementations like OpenAI/DeepResearch [197], Gemini/DeepResearch [60], and Perplexity/DeepResearch [209] will undoubtedly drive significant innovation, open-source alternatives and academic research will play crucial roles in expanding the boundaries of what’s possible and ensuring broad participation in this rapidly evolving field.

9 Conclusion

This survey has examined the rapidly evolving domain of Deep Research systems, tracing their development from initial implementations in 2023 through the sophisticated ecosystem emerging in 2025. Through comprehensive analysis of commercial offerings like OpenAI/DeepResearch [197], Gemini/DeepResearch [60], and Perplexity/DeepResearch [209], alongside open-source alternatives including HKUDS/Auto-DeepResearch [112], dzhng/deep-research [321], and numerous others, we have identified key technical patterns, implementation approaches, and application opportunities that characterize this transformative technology domain.

9.1 Key Findings and Contributions

Our analysis reveals several fundamental insights about the current state and trajectory of Deep Research systems:
Technical Architecture Patterns. Effective Deep Research implementations demonstrate consistent architectural patterns across foundation models, environmental interaction, task planning, and knowledge synthesis dimensions. Commercial implementations like OpenAI/DeepResearch [197] and Gemini/DeepResearch [60] typically leverage proprietary foundation models with extensive context lengths and sophisticated reasoning capabilities, while open-source alternatives like Camel-AI/OWL [43] and QwenLM/Qwen-Agent [224] demonstrate how effective research capabilities can be achieved with more accessible models through specialized optimization.
Environmental interaction capabilities show greater diversity, with specialized tools like Nanobrowser [184] and dzhng/deep-research [321] demonstrating exceptional effectiveness in web navigation and content extraction, while comprehensive platforms like Manus [164] and AutoGLM-Search [330] offer broader interaction capabilities across multiple environments. These patterns highlight both the value of specialization and the importance of comprehensive environmental access for effective research.
Task planning and execution approaches reveal similar diversity, with frameworks like OpenAI/AgentsSDK [199] and Flowith/OracleMode [77] providing sophisticated planning capabilities, while systems like AgentRL/ReSearch [2] and smolagents/open_deep_research [115] emphasize execution reliability and collaborative approaches respectively. Knowledge synthesis capabilities demonstrate consistent emphasis on information evaluation, though with varied approaches to presentation and interactivity across implementations like HKUDS/Auto-Deep-Research [112] and mshumer/OpenDeepResearcher [249].
Implementation Approach Distinctions. Our analysis highlights meaningful distinctions between commercial and open-source implementation approaches. Commercial platforms typically offer optimized performance, sophisticated interfaces, and comprehensive capabilities, though with associated costs and customization limitations. Systems like OpenAI/DeepResearch [197] and Perplexity/DeepResearch [209] demonstrate exceptional performance on standard benchmarks, though with significant variation in application focus and interaction models.
Open-source implementations demonstrate greater architectural diversity and customization flexibility, though typically with increased deployment complexity and more limited performance on standard benchmarks. Projects like dzhng/deep-research [321], nickscamara/open-deep-research [42], and HKUDS/Auto-Deep-Research [112] offer complete research pipelines with varied architectural approaches, while specialized components like Jina-AI/node-DeepResearch [121] and Nanobrowser [184] enable customized workflows addressing specific requirements. Frameworks such as AutoChain [78] provide lightweight tools to simplify the creation and evaluation of custom generative agents, enabling rapid iteration for specialized applications.
These distinctions highlight complementary roles within the ecosystem, with commercial implementations offering accessibility and performance for general users, while open-source alternatives enable customization, control, and potentially lower operational costs for specialized applications and high-volume usage. This diversity enhances overall ecosystem health through competition, specialization, and diverse innovation paths.
Application Domain Adaptations. Our examination of application patterns reveals meaningful adaptations across domains including academic research [118, 273, 276], scientific discovery [6, 10, 25, 47, 79, 83, 98, 99, 110, 129, 130, 135, 155, 166, 169, 218, 255, 258, 264, 269, 310, 312, 322, 327], business intelligence [187], financial analysis, education [14, 215, 219, 317], and personal knowledge management [136, 336]. Academic applications exemplified by systems like OpenAI/DeepResearch [197] and Camel-AI/OWL [43] demonstrate particular emphasis on comprehensive literature coverage, methodological understanding, and citation quality. Scientific implementations like Gemini/DeepResearch [60] and Agent-RL/ReSearch [2] emphasize experimental design, data analysis, and theory development capabilities.
Business applications leveraging systems like Manus [164] and n8n [183] show stronger focus on information currency, competitive analysis, and actionable insight generation. Educational implementations demonstrate adaptations for learning support, content development, and research skill training across systems like Perplexity/DeepResearch [209] and OpenManus [193]. These patterns highlight how general deep research capabilities translate into domain value through specialized adaptation addressing field-specific requirements and workflows.
Ethical Consideration Approaches. Our analysis reveals both common patterns and implementation diversity in addressing crucial ethical dimensions including information accuracy, privacy protection, intellectual property respect, and accessibility. Commercial implementations typically demonstrate sophisticated approaches to factual verification, with systems like OpenAI/DeepResearch [197] and Perplexity/DeepResearch [209] implementing multi-level verification and explicit attribution, while open-source alternatives like grapeot/deep_research_agent [263] and HKUDS/Auto-Deep-Research [112] demonstrate pragmatic approaches within more constrained technical environments.
Privacy protection shows similar patterns, with commercial systems implementing comprehensive safeguards appropriate to their cloud-based operation, while open-source alternatives like OpenManus [193] emphasize local deployment for sensitive applications. Attribution and intellectual property approaches demonstrate consistent emphasis on source transparency and appropriate utilization boundaries, though with varied implementation sophistication across the ecosystem.
These patterns highlight both shared ethical priorities across the ecosystem and implementation diversity reflecting different technical constraints, deployment models, and user requirements. This diversity represents a strength in addressing multi-faceted ethical challenges through complementary approaches and continuous innovation.

9.2 Limitations and Outlook

While this survey provides comprehensive analysis of current Deep Research systems and emerging trends, several limitations warrant acknowledgment:
Rapidly Evolving Landscape. The accelerating pace of development in this domain presents inherent challenges for comprehensive analysis. New systems and capabilities continue to emerge, with commercial offerings like OpenAI/DeepResearch [197], Gemini/DeepResearch [60], and Perplexity/DeepResearch [209] receiving frequent updates, while the open-source ecosystem continuously expands through new projects and enhancements to existing frameworks like dzhng/deep-research [321] and HKUDS/Auto-Deep-Research [112].
This survey captures the state of the art as of early 2025, but both technical capabilities and implementation approaches will continue to evolve rapidly. The classification framework and analysis methodology provided here offer a structural foundation for continued assessment as the field progresses through subsequent development phases.
Implementation Detail Limitations. Comprehensive technical analysis faces challenges due to limited implementation transparency, particularly for commercial systems. While open-source implementations like nickscamara/open-deep-research [42] and Agent-RL/ReSearch [2] enable detailed architectural examination, commercial systems like OpenAI/DeepResearch [197] and Gemini/DeepResearch [60] reveal limited internal details, restricting comprehensive comparative analysis of certain technical dimensions.
Our approach addresses this limitation through behavioral analysis, publicly available documentation examination, and consistent evaluation across standardized benchmarks and qualitative assessment frameworks. These methods enable meaningful comparison despite transparency variations, though complete architectural analysis remains challenging for proprietary implementations.
Application Impact Assessment. Evaluating real-world impact presents persistent challenges given the early deployment stage of many Deep Research systems. While initial applications demonstrate promising capabilities across domains including academic research[17, 208, 225, 292], business intelligence, and education [ 14 , 215 , 317 ] [ 14 , 215 , 317 ] [14,215,317][14,215,317], a comprehensive long-term impact assessment requires extended observation beyond the scope of this survey. Potential transformative effects on research methodologies, knowledge work, and information access patterns remain partially speculative despite encouraging early indications.
Future research should incorporate longitudinal analysis of deployment patterns, usage evolution, and organizational integration to assess realized impact beyond technical capabilities and early applications. Such analysis would complement the technical and architectural focus of the current survey with valuable perspectives on practical significance and societal implications.

9.3 Broader Implications

Beyond specific findings, this survey highlights several broader implications for the future of knowledge work and information access:
Research Methodology Transformation. Deep Research systems demonstrate potential to fundamentally transform research methodologies across domains. The comprehensive information access, advanced reasoning capabilities, and efficient knowledge synthesis demonstrated by systems like OpenAI/DeepResearch [197], Gemini/DeepResearch [60], and their open-source alternatives suggest significant opportunities to accelerate discovery, enhance comprehensiveness, and enable novel cross-domain connections beyond traditional research approaches.
Rather than simply automating existing processes, these systems enable fundamentally new research approaches, leveraging capabilities that exceed human information processing in scale while complementing human insight, creativity, and contextual understanding. This complementarity suggests an evolution toward collaborative research models rather than the replacement of human researchers, with significant potential for productivity enhancement and discovery acceleration. However, Ashktorab et al. [15] highlight that in human-AI collaboration, users may exhibit overreliance behaviors, appending AI-generated responses to their own even when the two conflict, which can compromise data quality.
Knowledge Access Democratization. The emergence of accessible Deep Research implementations across commercial and open-source ecosystems demonstrates potential for broader knowledge democratization. Systems like Perplexity/DeepResearch [209] with free access tiers and open-source alternatives like nickscamara/open-deep-research [42] and HKUDS/Auto-Deep-Research [112] enable sophisticated research capabilities previously requiring specialized expertise and substantial resources, potentially reducing barriers to high-quality information access and analysis.
This democratization carries significant implications for education, entrepreneurship, civic participation, and individual knowledge development. While accessibility challenges remain, particularly regarding technical expertise requirements and computational resources, the overall trajectory suggests broadening access to advanced research capabilities with potential positive impacts on knowledge equity across society.
Collective Intelligence Enhancement. Beyond individual applications, Deep Research systems demonstrate potential for collective intelligence enhancement through improved knowledge integration, insight sharing, and collaborative discovery. The capabilities demonstrated by systems like Manus [164], Flowith/OracleMode [77], and smolagents/open_deep_research [115] suggest opportunities for enhanced knowledge synthesis across organizational and disciplinary boundaries, potentially addressing fragmentation challenges in increasingly complex knowledge domains.
Rather than viewing these systems as isolated tools, their integration into collaborative knowledge ecosystems highlights potential for systemic enhancement of collective sense-making, evidence-based decision making, and shared understanding development. This perspective emphasizes the social and organizational dimensions of Deep Research impact beyond technical capabilities and individual productivity enhancement.

9.4 Final Thoughts

The rapid emergence and evolution of Deep Research systems represent a significant advancement in the application of artificial intelligence to knowledge discovery and utilization. While technical implementations will continue to evolve and specific systems will emerge and recede, the fundamental capability shift enabled by these technologies appears likely to persist and expand.
The diverse ecosystem spanning commercial platforms like OpenAI/DeepResearch [197], Gemini/DeepResearch [60], and Perplexity/DeepResearch [209], alongside open-source alternatives like dzhng/deep-research [321], HKUDS/Auto-Deep-Research [112], and numerous specialized components, demonstrates innovation across multiple technical dimensions, implementation approaches, and application domains. This diversity enhances overall ecosystem health through competition, specialization, and complementary development trajectories.
As research continues across advanced reasoning architectures, multimodal capabilities, domain specialization, human-AI collaboration, and ecosystem standardization, we anticipate continued rapid advancement building on the foundation established by current implementations. This evolution will likely yield increasingly sophisticated research capabilities with significant implications for knowledge work across domains, potentially transforming how information is discovered, validated, synthesized, and utilized throughout society.
The responsible development of these powerful capabilities requires continued attention to ethical considerations including information accuracy, privacy protection, intellectual property respect, and accessibility. By addressing these considerations alongside technical advancement, the Deep Research ecosystem can fulfill its potential for positive impact on knowledge discovery and utilization while minimizing potential harms or misuse.
In conclusion, Deep Research represents both a fascinating technical domain for continued research and a potentially transformative capability for practical knowledge work across society. The frameworks, analysis, and directions presented in this survey provide a foundation for continued examination of this rapidly evolving field with significant implications for the future of information access, knowledge synthesis, and discovery processes.

References

[1] Adilzhan Adilkhanov, Amir Yelenov, Assylkhan Seitzhanov, Ayan Mazhitov, Azamat Abdikarimov, Danissa Sandykbayeva, Daryn Kenzhebek, Dinmukhammed Mukashev, Ilyas Umurbekov, Jabrail Chumakov, Kamila Spanova, Karina Burunchina, Madina Yergibay, Margulan Issa, Moldir Zabirova, Nurdaulet Zhuzbay, Nurlan Kabdyshev, Nurlan Zhaniyar, Rasul Yermagambet, Rustam Chibar, Saltanat Seitzhan, Soibkhon Khajikhanov, Tasbolat Taunyazov, Temirlan Galimzhanov, Temirlan Kaiyrbay, Tleukhan Mussin, Togzhan Syrymova, Valeriya Kostyukova, Yerkebulan Massalim, Yermakhan Kassym, Zerde Nurbayeva, and Zhanat Kappassov. 2025. Survey on Vision-Language-Action Models. arXiv:2502.06851 [cs.CL] https://arxiv.org/abs/2502.06851
[2] Agent-RL. 2024. ReSearch. https://github.com/Agent-RL/ReSearch.
[3] Agno-AGI. 2025. Agno. https://github.com/agno-agi/agno.
[4] Garima Agrawal, Sashank Gummuluri, and Cosimo Spera. 2024. Beyond-RAG: Question Identification and Answer Generation in Real-Time Conversations. arXiv:2410.10136 [cs.CL] https://arxiv.org/abs/2410.10136
[5] Flowise AI. 2023. Flowise: Low-code LLM Application Building Tool. https://flowiseai.com/.
[6] Nawaf Alampara, Mara Schilling-Wilhelmi, Martiño Ríos-García, Indrajeet Mandal, Pranav Khetarpal, Hargun Singh Grover, N. M. Anoop Krishnan, and Kevin Maik Jablonka. 2025. Probing the limitations of multimodal language models for chemistry and materials research. arXiv:2411.16955 [cs.LG] https://arxiv.org/abs/2411.16955
[7] AlphaProof and AlphaGeometry teams. 2024. AI achieves silver-medal standard solving International Mathematical Olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/.
[8] Salaheddin Alzubi, Creston Brooks, Purva Chiniya, Edoardo Contente, Chiara von Gerlach, Lucas Irwin, Yihan Jiang, Arda Kaz, Windsor Nguyen, Sewoong Oh, Himanshu Tyagi, and Pramod Viswanath. 2025. Open Deep Search: Democratizing Search with Open-source Reasoning Agents. arXiv:2503.20201 [cs.LG] https://arxiv.org/abs/2503.20201
[9] Lucio Anderlini, Matteo Barbetti, Giulio Bianchini, Diego Ciangottini, Stefano Dal Pra, Diego Michelotto, Carmelo Pellegrino, Rosa Petrini, Alessandro Pascolini, and Daniele Spiga. 2025. Supporting the development of Machine Learning for fundamental science in a federated Cloud with the AI_INFN platform. arXiv:2502.21266 [cs.DC] https://arxiv.org/abs/2502.21266
[10] Mehrad Ansari and Seyed Mohamad Moosavi. 2023. Agent-based Learning of Materials Datasets from Scientific Literature. https://github.com/AI4ChemS/Eunomia. arXiv:2312.11690 [cs.AI] https://arxiv.org/abs/2312.11690
[11] Anthropic. 2024. Building effective agents. https://www.anthropic.com/engineering/building-effective-agents.
[12] Anthropic. 2024. Model Context Protocol (MCP). https://docs.anthropic.com/en/docs/agents-and-tools/mcp.
[13] Anthropic. 2025. Claude takes research to new places. https://www.anthropic.com/news/research.
[14] Prakash Aryan. 2024. LLMs as Debate Partners: Utilizing Genetic Algorithms and Adversarial Search for Adaptive Arguments. arXiv:2412.06229 [cs.AI] https://arxiv.org/abs/2412.06229
[15] Zahra Ashktorab, Qian Pan, Werner Geyer, Michael Desmond, Marina Danilevsky, James M. Johnson, Casey Dugan, and Michelle Bachman. 2024. Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions. arXiv:2409.08937 [cs.HC] https://arxiv.org/abs/2409.08937
[16] assafelovic. 2023. GPT-Researcher. https://github.com/assafelovic/gpt-researcher/.
[17] Ahmet Yasin Aytar, Kemal Kilic, and Kamer Kaya. 2024. A Retrieval-Augmented Generation Framework for Academic Literature Navigation in Data Science. arXiv:2412.15404 [cs.IR] https://arxiv.org/abs/2412.15404
[18] Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. 2025. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models. arXiv:2404.07738 [cs.CL] https://arxiv.org/abs/2404.07738
[19] Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, and Quaizar Vohra. 2024. TapeAgents: a Holistic Framework for Agent Development and Optimization. arXiv:2412.08445 [cs.AI] https://arxiv.org/abs/2412.08445
[20] Gal Bakal, Ali Dasdan, Yaniv Katz, Michael Kaufman, and Guy Levin. 2025. Experience with GitHub Copilot for Developer Productivity at Zoominfo. arXiv:2501.13282 [cs.SE] https://arxiv.org/abs/2501.13282
[21] Howard Balshem, Mark Helfand, Holger J Schünemann, Andrew D Oxman, Regina Kunz, Jan Brozek, Gunn E Vist, Yngve Falck-Ytter, Joerg Meerpohl, Susan Norris, and Gordon H Guyatt. 2011. GRADE guidelines: 3. Rating the quality of evidence. https://pubmed.ncbi.nlm.nih.gov/21208779/.
[22] Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White, Jordan Boyd-Graber, and Benjamin Van Durme. 2023. MegaWika: Millions of reports and their sources across 50 diverse languages. arXiv:2307.07049 [cs.CL] https://arxiv.org/abs/2307.07049
[23] Rhea Basappa, Mustafa Tekman, Hong Lu, Benjamin Faught, Sandeep Kakar, and Ashok K. Goel. 2024. Social AI Agents Too Need to Explain Themselves. Springer Nature Switzerland, 351-360. doi:10.1007/978-3-031-63028-6_29
[24] Joeran Beel, Min-Yen Kan, and Moritz Baumgart. 2025. Evaluating Sakana’s AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality Towards ‘Artificial Research Intelligence’ (ARI)? arXiv:2502.14297 [cs.IR] https://arxiv.org/abs/2502.14297
[25] Morad Behandish, John Maxwell III, and Johan de Kleer. 2022. AI Research Associate for Early-Stage Scientific Discovery. arXiv:2202.03199 [cs.AI] https://arxiv.org/abs/2202.03199
[26] Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. 2025. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? arXiv:2502.15657 [cs.AI] https://arxiv.org/abs/2502.15657
[27] Karim Benharrak, Tim Zindulka, and Daniel Buschek. 2024. Deceptive Patterns of Intelligent and Interactive Writing Assistants. arXiv:2404.09375 [cs.HC] https://arxiv.org/abs/2404.09375
[28] Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, and Huajun Chen. 2024. OceanGPT: A Large Language Model for Ocean Science Tasks. http://oceangpt.zjukg.cn/. arXiv:2310.02031 [cs.CL] https://arxiv.org/abs/2310.02031
[29] Stefano Bianchini, Moritz Müller, and Pierre Pelletier. 2024. Drivers and Barriers of AI Adoption and Use in Scientific Research. arXiv:2312.09843 [cs.CY] https://arxiv.org/abs/2312.09843
[30] bindAI. 2025. ChatGPT Deep Research vs Perplexity - Which One Is Better? https://blog.getbind.co/2025/02/03/chatgpt-deep-research-is-it-better-than-perplexity/.
[31] Francisco Bolanos, Angelo Salatino, Francesco Osborne, and Enrico Motta. 2024. Artificial Intelligence for Literature Reviews: Opportunities and Challenges. arXiv:2402.08565 [cs.AI] https://arxiv.org/abs/2402.08565
[32] Bolt. 2024. Bolt. https://bolt.new/.
[33] bracai. 2025. MMLU benchmark: Testing LLMs multi-task capabilities. https://www.bracai.eu/post/mmlu-benchmark.
[34] Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2023. ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376 [physics.chem-ph] https://arxiv.org/abs/2304.05376
[35] Chris Brown and Jason Cusati. 2024. Exploring the Evidence-Based Beliefs and Behaviors of LLM-Based Programming Assistants. arXiv:2407.13900 [cs.SE] https://arxiv.org/abs/2407.13900
[36] browserbase. 2025. Open-operator. https://github.com/browserbase/open-operator.
[37] btahir. 2024. open_deep_research. https://github.com/btahir/open-deep-research.
[38] ByteDance. 2024. Coze Space. https://www.coze.cn/space-preview.
[39] ByteDance. 2025. agent-tars. https://github.com/bytedance/UI-TARS-desktop/tree/main/apps/agent-tars.
[40] Beatriz Cabrero-Daniel, Tomas Herda, Victoria Pichler, and Martin Eder. 2024. Exploring Human-AI Collaboration in Agile: Customised LLM Meeting Assistants. arXiv:2404.14871 [cs.SE] https://arxiv.org/abs/2404.14871
[41] Filipe Calegario, Vanilson Burégio, Francisco Erivaldo, Daniel Moraes Costa Andrade, Kailane Felix, Nathalia Barbosa, Pedro Lucas da Silva Lucena, and César França. 2023. Exploring the intersection of Generative AI and Software Development. arXiv:2312.14262 [cs.SE] https://arxiv.org/abs/2312.14262
[42] Nicholas Camara. 2025. open-deep-research. https://github.com/nickscamara/open-deep-research.
[43] Camel AI. 2025. OWL. https://github.com/camel-ai/owl.
[44] Franck Cappello, Sandeep Madireddy, Robert Underwood, Neil Getty, Nicholas Lee-Ping Chia, Nesar Ramachandra, Josh Nguyen, Murat Keceli, Tanwi Mallick, Zilinghan Li, Marieme Ngom, Chenhui Zhang, Angel Yanguas-Gil, Evan Antoniuk, Bhavya Kailkhura, Minyang Tian, Yufeng Du, Yuan-Sen Ting, Azton Wells, Bogdan Nicolae, Avinash Maurya, M. Mustafa Rafique, Eliu Huerta, Bo Li, Ian Foster, and Rick Stevens. 2025. EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants. arXiv:2502.20309 [cs.AI] https://arxiv.org/abs/2502.20309
[45] Peter Cardon, Carolin Fleischmann, Jolanta Aritz, Minna Logemann, and Jeanette Heidewald. 2023. The Challenges and Opportunities of AI-Assisted Writing: Developing AI Literacy for the AI Age. https://journals.sagepub.com/doi/abs/10.1177/23294906231176517.
[46] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. Why Do Multi-Agent LLM Systems Fail? arXiv:2503.13657 [cs.AI] https://arxiv.org/abs/2503.13657
[47] Eric Chamoun, Michael Schlichktrull, and Andreas Vlachos. 2024. Automated Focused Feedback Generation for Scientific Writing Assistance. arXiv:2405.20477 [cs.CL] https://arxiv.org/abs/2405.20477
[48] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. https://github.com/thunlp/ChatEval. arXiv:2308.07201 [cs.CL] https://arxiv.org/abs/2308.07201
[49] Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. 2024. From Persona to Personalization: A Survey on Role-Playing Language Agents. arXiv:2404.18231 [cs.CL] https://arxiv.org/abs/2404.18231
[50] Kexin Chen, Hanqun Cao, Junyou Li, Yuyang Du, Menghao Guo, Xin Zeng, Lanqing Li, Jiezhong Qiu, Pheng Ann Heng, and Guangyong Chen. 2024. An Autonomous Large Language Model Agent for Chemical Literature Data Mining. arXiv:2402.12993 [cs.IR] https://arxiv.org/abs/2402.12993
[51] Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, and Yu Qiao. 2024. GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. https://uni-medical.github.io/GMAI-MMBench.github.io/. arXiv:2408.03361 [eess.IV] https://arxiv.org/abs/2408.03361
[52] Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, and Dianbo Liu. 2025. AutoBench: An Automated Benchmark for Scientific Discovery in LLMs. https://github.com/AutoBench/AutoBench. arXiv:2502.15224 [cs.LG] https://arxiv.org/abs/2502.15224
[53] Valerie Chen, Alan Zhu, Sebastian Zhao, Hussein Mozannar, David Sontag, and Ameet Talwalkar. 2025. Need Help? Designing Proactive AI Assistants for Programming. arXiv:2410.04596 [cs.HC] https://arxiv.org/abs/2410.04596
[54] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. arXiv:2308.10848 [cs.CL] https://arxiv.org/abs/2308.10848
[55] Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wenwei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Zhengfu He, Kai Chen, and Xipeng Qiu. 2024. Can AI Assistants Know What They Don’t Know? arXiv:2401.13275 [cs.CL] https://arxiv.org/abs/2401.13275
[56] Zhao Cheng, Diane Wan, Matthew Abueg, Sahra Ghalebikesabi, Ren Yi, Eugene Bagdasarian, Borja Balle, Stefan Mellem, and Shawn O’Banion. 2024. CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data. https://www.aimodels.fyi/papers/arxiv/ci-bench-benchmarking-contextual-integrity-ai-assistants. arXiv:2409.13903 [cs.AI] https://arxiv.org/abs/2409.13903
[57] Bhavya Chopra, Ananya Singha, Anna Fariha, Sumit Gulwani, Chris Parnin, Ashish Tiwari, and Austin Z. Henley. 2023. Conversational Challenges in AI-Powered Data Science: Obstacles, Needs, and Design Opportunities. arXiv:2310.16164 [cs.HC] https://arxiv.org/abs/2310.16164
[58] Daniel J. H. Chung, Zhiqi Gao, Yurii Kvasiuk, Tianyi Li, Moritz Münchmeyer, Maja Rudolph, Frederic Sala, and Sai Chaitanya Tadepalli. 2025. Theoretical Physics Benchmark (TPBench) - a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics. https://tpbench.org/. arXiv:2502.15815 [cs.LG] https://arxiv.org/abs/2502.15815
[59] Umut Cihan, Vahid Haratian, Arda İçöz, Mert Kaan Gül, Ömercan Devran, Emircan Furkan Bayendur, Baykal Mehmet Uçar, and Eray Tüzün. 2024. Automated Code Review In Practice. arXiv:2412.18531 [cs.SE] https://arxiv.org/abs/2412.18531
[60] Dave Citron. 2025. Deep Research is now available on Gemini 2.5 Pro Experimental. https://blog.google/products/gemini/deep-research-gemini-2-5-pro-experimental/.
[61] Cline. 2024. Cline. https://github.com/cline/cline.
[62] Cognition Labs. 2025. Devin.ai. https://devin.ai
[63] Consensus. 2025. Consensus. https://consensus.app/.
[64] crewAIInc. 2023. CrewAI. https://github.com/crewAIInc/crewAI.
[65] Cursor. 2023. Cursor. https://www.cursor.com/.
[66] Mohamad H. Danesh, Tu Trinh, Benjamin Plaut, and Nguyen X. Khanh. 2025. Learning to Coordinate with Experts. https://github.com/modanesh/YRC-Bench. arXiv:2502.09583 [cs.LG] https://arxiv.org/abs/2502.09583
[67] Kristin M. de Payrebrune, Kathrin Flaßkamp, Tom Ströhla, Thomas Sattel, Dieter Bestle, Benedict Röder, Peter Eberhard, Sebastian Peitz, Marcus Stoffel, Gulakala Rutwik, Borse Aditya, Meike Wohlleben, Walter Sextro, Maximilian Raff, C. David Remy, Manish Yadav, Merten Stender, Jan van Delden, Timo Lüddecke, Sabine C. Langer, Julius Schultz, and Christopher Blech. 2024. The impact of AI on engineering design procedures for dynamical systems. arXiv:2412.12230 [eess.SY] https://arxiv.org/abs/2412.12230
[68] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948 [cs.CL] https://arxiv.org/abs/2501.12948
[69] Akash Dhruv and Anshu Dubey. 2025. Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing. arXiv:2410.24119 [cs.SE] https://arxiv.org/abs/2410.24119
[70] Talissa Dreossi. 2025. Bridging Logic Programming and Deep Learning for Explainability through ILASP. Electronic Proceedings in Theoretical Computer Science 416 (Feb. 2025), 314-323. doi:10.4204/eptcs.416.31
[71] Ian Drosos, Advait Sarkar, Xiaotong Xu, and Neil Toronto. 2025. “It makes you think”: Provocations Help Restore Critical Thinking to AI-Assisted Knowledge Work. arXiv:2501.17247 [cs.HC] https://arxiv.org/abs/2501.17247
[72] Omer Dunay, Daniel Cheng, Adam Tait, Parth Thakkar, Peter C Rigby, Andy Chiu, Imad Ahmad, Arun Ganesan, Chandra Maddila, Vijayaraghavan Murali, Ali Tayyebi, and Nachiappan Nagappan. 2024. Multi-line AI-assisted Code Authoring. arXiv:2402.04141 [cs.SE] https://arxiv.org/abs/2402.04141
[73] Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, and Tristan Miller. 2025. Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation. arXiv:2502.05151 [cs.CL] https://arxiv.org/abs/2502.05151
[74] Elicit. 2025. Elicit. https://elicit.com/.
[75] Michael D. Ernst. 2017. Natural Language is a Programming Language: Applying Natural Language Processing to Software Development. https://drops.dagstuhl.de/storage/00lipics/lipics-vol071-snapl2017/LIPIcs.SNAPL.2017.4/LIPIcs.SNAPL.2017.4.pdf.
[76] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. arXiv:2405.06211 [cs.CL] https://arxiv.org/abs/2405.06211
[77] Flowith. 2024. Flowith Oracle Mode. https://flowith.net/.
[78] Forethought-Technologies. 2023. AutoChain. https://github.com/Forethought-Technologies/AutoChain.
[79] César França. 2023. AI empowering research: 10 ways how science can benefit from AI. arXiv:2307.10265 [cs.GL] https://arxiv.org/abs/2307.10265
[80] Future-House. 2023. PaperQA. https://github.com/Future-House/paper-qa.
[81] GAIR-NLP. 2025. DeepResearcher. https://github.com/GAIR-NLP/DeepResearcher.
[82] Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. 2023. AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn. arXiv:2306.08640 [cs.CV] https://arxiv.org/abs/2306.08640
[83] Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. 2024. Empowering Biomedical Discovery with AI Agents. arXiv:2404.02831 [cs.AI] https://arxiv.org/abs/2404.02831
[84] Alireza Ghafarollahi and Markus J. Buehler. 2024. SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv:2409.05556 [cs.AI] https://arxiv.org/abs/2409.05556
[85] Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, and Roberto Bifulco. 2024. AutoPenBench: Benchmarking Generative Agents for Penetration Testing. https://github.com/lucagioacchini/auto-pen-bench. arXiv:2410.03225 [cs.CR] https://arxiv.org/abs/2410.03225
[86] GitHub. 2021. GitHub Copilot. https://github.com/features/copilot.
[87] Amr Gomaa, Michael Sargious, and Antonio Krüger. 2024. AdaptoML-UX: An Adaptive User-centered GUI-based AutoML Toolkit for Non-AI Experts and HCI Researchers. https://github.com/MichaelSargious/AdaptoML_UX. arXiv:2410.17469 [cs.HC] https://arxiv.org/abs/2410.17469
[88] Google. 2021. BIG-bench. https://github.com/google/BIG-bench.
[89] Google. 2024. Try Deep Research and our new experimental model in Gemini, your AI assistant. https://blog.google/products/gemini/google-gemini-deep-research/.
[90] Google. 2025. A2A. https://github.com/google/A2A.
[91] Google. 2025. Agent Development Kit. https://google.github.io/adk-docs/.
[92] Google. 2025. Announcing the Agent2Agent Protocol (A2A). https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/.
[93] Google. 2025. Gemini 2.0 Flash (Feb '25): Intelligence, Performance and Price Analysis. https://artificialanalysis.ai/models/gemini-2-0-flash.
[94] Google. 2025. gemini-fullstack-langgraph-quickstart. https://github.com/google-gemini/gemini-fullstack-langgraph-quickstart.
[95] Google. 2025. NotebookLm. https://notebooklm.google/.
[96] Kanika Goswami, Puneet Mathur, Ryan Rossi, and Franck Dernoncourt. 2025. ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution. arXiv:2502.00989 [cs.CL] https://arxiv.org/abs/2502.00989
[97] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni,
Nenad Tomasev, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, and Vivek Natarajan. 2025. Towards an AI co-scientist. https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf.
[98] Alexander H. Gower, Konstantin Korovin, Daniel Brunnsåker, Filip Kronström, Gabriel K. Reder, Ievgeniia A. Tiukova, Ronald S. Reiserer, John P. Wikswo, and Ross D. King. 2024. The Use of AI-Robotic Systems for Scientific Discovery. arXiv:2406.17835 [cs.LG] https://arxiv.org/abs/2406.17835
[99] Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. 2025. LLMs can Realize Combinatorial Creativity: Generating Creative Ideas via LLMs for Scientific Research. arXiv:2412.14141 [cs.AI] https://arxiv.org/abs/2412.14141
[100] Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. 2025. Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs. arXiv:2503.02846 [cs.CL] https://arxiv.org/abs/2503.02846
[101] Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. 2024. RedCode: Risky Code Execution and Generation Benchmark for Code Agents. https://github.com/AI-secure/RedCode. arXiv:2411.07781 [cs.SE] https://arxiv.org/abs/2411.07781
[102] Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. 2024. DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning. https://github.com/guosyjlu/DS-Agent. arXiv:2402.17453 [cs.LG] https://arxiv.org/abs/2402.17453
[103] Xin Guo, Haotian Xia, Zhaowei Liu, Hanyang Cao, Zhi Yang, Zhiqiang Liu, Sizhe Wang, Jinyi Niu, Chuqi Wang, Yanhui Wang, Xiaolong Liang, Xiaoming Huang, Bing Zhu, Zhongyu Wei, Yun Chen, Weining Shen, and Liwen Zhang. 2024. FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models. arXiv:2308.09975 [cs.CL] https://arxiv.org/abs/2308.09975
[104] Hilda Hadan, Derrick Wang, Reza Hadi Mogavi, Joseph Tu, Leah Zhang-Kennedy, and Lennart E. Nacke. 2024. The Great AI Witch Hunt: Reviewers Perception and (Mis)Conception of Generative AI in Research Writing. https://arxiv.org/abs/2407.12015.
[105] Sukjin Han. 2024. Mining Causality: AI-Assisted Search for Instrumental Variables. arXiv:2409.14202 [econ.EM] https://arxiv.org/abs/2409.14202
[106] Ebtesam Al Haque, Chris Brown, Thomas D. LaToza, and Brittany Johnson. 2025. Towards Decoding Developer Cognition in the Age of AI Assistants. arXiv:2501.02684 [cs.HC] https://arxiv.org/abs/2501.02684
[107] Gaole He, Patrick Hemmer, Michael Vössing, Max Schemmer, and Ujwal Gadiraju. 2025. Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition. arXiv:2501.10909 [cs.AI] https://arxiv.org/abs/2501.10909
[108] Kaveen Hiniduma, Suren Byna, Jean Luca Bez, and Ravi Madduri. 2024. AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI. In Proceedings of the 36th International Conference on Scientific and Statistical Database Management (SSDBM 2024). ACM, 1-12. doi:10.1145/3676288.3676296
[109] HKUDS. 2025. AI-Researcher. https://github.com/HKUDS/AI-Researcher.
[110] Brendan Hogan, Anmol Kabra, Felipe Siqueira Pacheco, Laura Greenstreet, Joshua Fan, Aaron Ferber, Marta Ummus, Alecsander Brito, Olivia Graham, Lillian Aoki, Drew Harvell, Alex Flecker, and Carla Gomes. 2024. AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification. arXiv:2410.21480 [cs.LG] https://arxiv.org/abs/2410.21480
[111] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. https://github.com/geekan/MetaGPT. arXiv:2308.00352 [cs.AI] https://arxiv.org/abs/2308.00352
[112] Hong Kong University Data Science Lab. 2024. Auto-Deep-Research. https://github.com/HKUDS/Auto-Deep-Research.
[113] Betty Li Hou, Kejian Shi, Jason Phang, James Aung, Steven Adler, and Rosie Campbell. 2024. Large Language Models as Misleading Assistants in Conversation. arXiv:2407.11789 [cs.CL] https://arxiv.org/abs/2407.11789
[114] Shulin Huang, Shirong Ma, Yinghui Li, Mengzuo Huang, Wuhe Zou, Weidong Zhang, and Hai-Tao Zheng. 2024. LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles. https://github.com/THUKElab/LatEval. arXiv:2308.10855 [cs.CL] https://arxiv.org/abs/2308.10855
[115] HuggingFace. 2025. smolagents: open_deep_research. https://github.com/huggingface/smolagents/tree/main/ examples/open_deep_research.
[116] Faria Huq, Abdus Samee, David Chuan-En Lin, Alice Xiaodi Tang, and Jeffrey P Bigham. 2025. NoTeeline: Supporting Real-Time, Personalized Notetaking with LLM-Enhanced Micronotes. In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25). ACM, 1064-1081. doi:10.1145/3708359.3712086
[117] Kurando Iida and Kenjiro Mimura. 2024. CATER: Leveraging LLM to Pioneer a Multidimensional, Reference-Independent Paradigm in Translation Quality Evaluation. arXiv:2412.11261 [cs.CL] https://arxiv.org/abs/2412.11261
[118] Seyed Mohammad Ali Jafari. 2024. Streamlining the Selection Phase of Systematic Literature Reviews (SLRs) Using AI-Enabled GPT-4 Assistant API. arXiv:2402.18582 [cs.DL] https://arxiv.org/abs/2402.18582
[119] Rishab Jain and Aditya Jain. 2023. Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work. arXiv:2312.10057 [cs.CY] https://arxiv.org/abs/2312.10057
[120] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. 2025. AIDE: AI-Driven Exploration in the Space of Code. arXiv:2502.13138 [cs.AI] https://arxiv.org/abs/2502.13138
[121] Jina AI. 2025. node-DeepResearch. https://github.com/jina-ai/node-DeepResearch.
[122] Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. 2025. DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? https://github.com/LiqiangJing/DSBench. arXiv:2409.07703 [cs.AI] https://arxiv.org/abs/2409.07703
[123] Nicola Jones. 2025. OpenAI’s ‘deep research’ tool: is it useful for scientists? https://www.nature.com/articles/d41586-025-00377-9.
[124] Vijay Joshi and Iver Band. 2024. Disrupting Test Development with AI Assistants: Building the Base of the Test Pyramid with Three AI Coding Assistants. (Oct. 2024). doi:10.36227/techrxiv.173014488.82191966/v1
[125] Majeed Kazemitabaar, Jack Williams, Ian Drosos, Tovi Grossman, Austin Zachary Henley, Carina Negreanu, and Advait Sarkar. 2024. Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST ‘24). ACM, 1-19. doi:10.1145/3654777.3676345
[126] CTOL Editors Ken. 2025. Gemini Launches Deep Research on 2.5 Pro Aiming to Redefine AI-Powered Analysis with Strong Lead Over OpenAI. https://www.ctol.digital/news/gemini-deep-research-launch-2-5-pro-vs-openai/.
[127] Antti Keurulainen, Isak Westerlund, Samuel Kaski, and Alexander Ilin. 2021. Learning to Assist Agents by Observing Them. arXiv:2110.01311 [cs.AI] https://arxiv.org/abs/2110.01311
[128] Abdullah Khalili and Abdelhamid Bouchachia. 2022. Toward Building Science Discovery Machines. arXiv:2103.15551 [cs.AI] https://arxiv.org/abs/2103.15551
[129] Stefan Kramer, Mattia Cerrato, Sašo Džeroski, and Ross King. 2023. Automated Scientific Discovery: From Equation Discovery to Autonomous Discovery Systems. arXiv:2305.02251 [cs.AI] https://arxiv.org/abs/2305.02251
[130] Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, Tom Hope, Dirk Hovy, Jonathan K. Kummerfeld, Anne Lauscher, Kevin Leyton-Brown, Sheng Lu, Mausam, Margot Mieskes, Aurélie Névéol, Danish Pruthi, Lizhen Qu, Roy Schwartz, Noah A. Smith, Thamar Solorio, Jingyan Wang, Xiaodan Zhu, Anna Rogers, Nihar B. Shah, and Iryna Gurevych. 2024. What Can Natural Language Processing Do for Peer Review? arXiv:2405.06563 [cs.CL] https://arxiv.org/abs/2405.06563
[131] Martin Lance. 2024. open_deep_research. https://github.com/langchain-ai/open_deep_research.
[132] Hao Lang, Fei Huang, and Yongbin Li. 2025. Debate Helps Weak-to-Strong Generalization. arXiv:2501.13124 [cs.CL] https://arxiv.org/abs/2501.13124
[133] LangChain. 2025. How to think about agent frameworks. https://blog.langchain.dev/how-to-think-about-agent-frameworks/. https://docs.google.com/spreadsheets/d/1B37VxTBuGLeTSPVWtz7UMsCdtXrqV5hCjWkbHN8tfAo/
[134] LangChain AI. 2024. LangGraph. https://github.com/langchain-ai/langgraph.
[135] Andrew Laverick, Kristen Surrao, Inigo Zubeldia, Boris Bolliet, Miles Cranmer, Antony Lewis, Blake Sherwin, and Julien Lesgourgues. 2024. Multi-Agent System for Cosmological Parameter Analysis. arXiv:2412.00431 [astro-ph.IM] https://arxiv.org/abs/2412.00431
[136] Eunhae Lee. 2024. Towards Ethical Personal AI Applications: Practical Considerations for AI Assistants with Long-Term Memory. arXiv:2409.11192 [cs.CY] https://arxiv.org/abs/2409.11192
[137] Yuho Lee, Taewon Yun, Jason Cai, Hang Su, and Hwanjun Song. 2024. UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs. https://github.com/DISL-Lab/UniSumEval-v1.0. arXiv:2409.19898 [cs.CL] https://arxiv.org/abs/2409.19898
[138] Letta-AI. 2023. Letta. https://github.com/letta-ai/letta.
[139] Kyla Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund. 2025. ChatDBG: An AI-Powered Debugging Assistant. arXiv:2403.16354 [cs.SE] https://arxiv.org/abs/2403.16354
[140] James R. Lewis. 2018. The System Usability Scale: Past, Present, and Future. International Journal of Human-Computer Interaction 34, 7 (2018), 577-590. doi:10.1080/10447318.2018.1455307
[141] Belinda Z. Li, Been Kim, and Zi Wang. 2025. QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? arXiv:2503.22674 [cs.AI] https://arxiv.org/abs/2503.22674
[142] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. arXiv:2303.17760 [cs.AI] https://arxiv.org/abs/2303.17760
[143] Jiachen Li, Xiwen Li, Justin Steinberg, Akshat Choube, Bingsheng Yao, Xuhai Xu, Dakuo Wang, Elizabeth Mynatt, and Varun Mishra. 2025. Vital Insight: Assisting Experts’ Context-Driven Sensemaking of Multi-modal Personal Tracking Data Using Visualization and Human-In-The-Loop LLM Agents. arXiv:2410.14879 [cs.HC] https://arxiv.org/abs/2410.14879
[144] Xuefeng Li, Haoyang Zou, and Pengfei Liu. 2025. ToRL: Scaling Tool-Integrated RL. https://github.com/GAIR-NLP/ToRL. https://arxiv.org/pdf/2503.23383
[145] Yuan Li, Yixuan Zhang, and Lichao Sun. 2023. MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents. arXiv:2310.06500 [cs.AI] https://arxiv.org/abs/2310.06500
[146] Zhuoyan Li, Chen Liang, Jing Peng, and Ming Yin. 2024. How Does the Disclosure of AI Assistance Affect the Perceptions of Writing? arXiv:2410.04545 [cs.CL] https://arxiv.org/abs/2410.04545
[147] Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, and William Yang Wang. 2025. MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding. https://github.com/Leezekun/MMSci. arXiv:2407.04903 [cs.CL] https://arxiv.org/abs/2407.04903
[148] Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2023. A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges. arXiv:2303.17125 [cs.SE] https://arxiv.org/abs/2303.17125
[149] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic Evaluation of Language Models. arXiv:2211.09110 [cs.CL] https://arxiv.org/abs/2211.09110
[150] Jialiang Lin, Jiaxin Song, Zhangping Zhou, Yidong Chen, and Xiaodong Shi. 2023. Automated scholarly paper review: Concepts, technologies, and challenges. Information Fusion 98 (Oct. 2023), 101830. doi:10.1016/j.inffus.2023.101830

[151] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs.CL] https://arxiv.org/abs/2109.07958

[152] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, and Chunyuan Li. 2023. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. arXiv:2311.05437 [cs.CV] https://arxiv.org/abs/2311.05437

[153] Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and Jie Tang. 2024. AutoGLM: Autonomous Foundation Agents for GUIs. arXiv:2411.00820 [cs.HC] https://arxiv.org/abs/2411.00820

[154] Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and Jie Tang. 2024. AutoGLM: Autonomous Foundation Agents for GUIs. arXiv:2411.00820 [cs.HC] https://arxiv.org/abs/2411.00820

[155] Zijun Liu, Kaiming Liu, Yiqi Zhu, Xuanyu Lei, Zonghan Yang, Zhenhe Zhang, Peng Li, and Yang Liu. 2024. AIGS: Generating Science from AI-Powered Automated Falsification. arXiv:2411.11910 [cs.LG] https://arxiv.org/abs/2411.11910

[156] Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. 2023. BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents. https://github.com/salesforce/BOLAA. arXiv:2308.05960 [cs.AI] https://arxiv.org/abs/2308.05960

[157] Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, and Wenpeng Yin. 2025. AAAR-1.0: Assessing AI’s Potential to Assist Research. https://renzelou.github.io/AAAR-1.0/. arXiv:2410.22394 [cs.CL] https://arxiv.org/abs/2410.22394

[158] Cong Lu, Shengran Hu, and Jeff Clune. 2025. Automated Capability Discovery via Model Self-Exploration. arXiv:2502.07577 [cs.LG] https://arxiv.org/abs/2502.07577

[159] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292 [cs.AI] https://arxiv.org/abs/2408.06292

[160] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. https://scienceqa.github.io/. arXiv:2209.09513 [cs.CL] https://arxiv.org/abs/2209.09513

[161] Chandra Maddila, Negar Ghorbani, Kosay Jabre, Vijayaraghavan Murali, Edwin Kim, Parth Thakkar, Nikolay Pavlovich Laptev, Olivia Harman, Diana Hsu, Rui Abreu, and Peter C. Rigby. 2024. AI-Assisted SQL Authoring at Industry Scale. arXiv:2407.13280 [cs.SE] https://arxiv.org/abs/2407.13280

[162] Srijoni Majumdar, Edith Elkind, and Evangelos Pournaras. 2025. Generative AI Voting: Fair Collective Choice is Resilient to LLM Biases and Inconsistencies. arXiv:2406.11871 [cs.AI] https://arxiv.org/abs/2406.11871

[163] Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, and Nghi D. Q. Bui. 2025. CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs. https://github.com/FSoft-AI4Code/CodeMMLU. arXiv:2410.01999 [cs.SE] https://arxiv.org/abs/2410.01999

[164] Manus. 2025. Manus. https://manus.im/.

[165] Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David Lobell, and Stefano Ermon. 2024. GeoLLM: Extracting Geospatial Knowledge from Large Language Models. arXiv:2310.06213 [cs.CL] https://arxiv.org/abs/2310.06213

[166] David M. Markowitz. 2024. From Complexity to Clarity: How AI Enhances Perceptions of Scientists and the Public’s Understanding of Science. arXiv:2405.00706 [cs.CL] https://arxiv.org/abs/2405.00706

[167] Jonathan Mast. 2025. ChatGPT’s Deep Research vs. Google’s Gemini 1.5 Pro with Deep Research: A Detailed Comparison. https://whitebeardstrategies.com/ai-prompt-engineering/chatgpts-deep-research-vs-googles-gemini-1-5-pro-with-deep-research-a-detailed-comparison/.

[168] Mastra-AI. 2025. Mastra. https://github.com/mastra-ai/mastra.

[169] Shray Mathur, Noah van der Vleuten, Kevin Yager, and Esther Tsai. 2024. VISION: A Modular AI Assistant for Natural Human-Instrument Interaction at Scientific User Facilities. arXiv:2412.18161 [cs.AI] https://arxiv.org/abs/2412.18161

[170] Gianmarco Mengaldo. 2025. Explain the Black Box for the Sake of Science: the Scientific Method in the Era of Generative Artificial Intelligence. arXiv:2406.10557 [cs.AI] https://arxiv.org/abs/2406.10557

[171] MGX Technologies. 2025. MGX.dev. https://mgx.dev.

[172] Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. GAIA: A Benchmark for General AI Assistants. https://huggingface.co/gaia-benchmark. https://arxiv.org/pdf/2311.12983

[173] Microsoft. 2023. Microsoft Copilot. https://www.microsoft.com/en-us/microsoft-copilot/organizations.
[174] Microsoft. 2023. Semantic-kernel. https://github.com/microsoft/semantic-kernel.
[175] mirayayerdem. 2022. Github-Copilot-Amazon-Whisperer-ChatGPT. https://github.com/mirayayerdem/Github-Copilot-Amazon-Whisperer-ChatGPT.
[176] Mlc-ai. 2023. web-llm. https://github.com/mlc-ai/web-llm.
[177] ModelTC. 2025. lightllm. https://github.com/ModelTC/lightllm.
[178] Devam Mondal and Atharva Inamdar. 2024. SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing. arXiv:2407.03381 [q-bio.GN] https://arxiv.org/abs/2407.03381

[179] Peya Mowar, Yi-Hao Peng, Jason Wu, Aaron Steinfeld, and Jeffrey P. Bigham. 2025. CodeA11y: Making AI Coding Assistants Useful for Accessible Web Development. arXiv:2502.10884 [cs.HC] https://arxiv.org/abs/2502.10884

[180] Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. 2024. LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model. https://github.com/NJU-LHRS/LHRS-Bot. arXiv:2402.02544 [cs.CV] https://arxiv.org/abs/2402.02544

[181] Manisha Mukherjee, Sungchul Kim, Xiang Chen, Dan Luo, Tong Yu, and Tung Mai. 2025. From Documents to Dialogue: Building KG-RAG Enhanced AI Assistants. arXiv:2502.15237 [cs.IR] https://arxiv.org/abs/2502.15237

[182] Sheshera Mysore, Mahmood Jasim, Haoru Song, Sarah Akbar, Andre Kenneth Chase Randall, and Narges Mahyar. 2023. How Data Scientists Review the Scholarly Literature. In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval (CHIIR '23). ACM, 137-152. doi:10.1145/3576840.3578309

[183] n8n. 2023. n8n. https://github.com/n8n-io/n8n.

[184] Nanobrowser Team. 2024. Nanobrowser. https://github.com/nanobrowser/nanobrowser.

[185] Nathalia Nascimento, Everton Guimaraes, Sai Sanjna Chintakunta, and Santhosh Anitha Boominathan. 2024. LLM4DS: Evaluating Large Language Models for Data Science Code Generation. https://github.com/DataForScience/LLM4DS. arXiv:2411.11908 [cs.SE] https://arxiv.org/abs/2411.11908

[186] Khanh Nghiem, Anh Minh Nguyen, and Nghi D. Q. Bui. 2024. Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals. arXiv:2403.14592 [cs.SE] https://arxiv.org/abs/2403.14592

[187] Alex Nguyen, Zilong Wang, Jingbo Shang, and Dheeraj Mekala. 2024. DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering. arXiv:2404.00439 [cs.CL] https://arxiv.org/abs/2404.00439

[188] Kien X. Nguyen, Fengchun Qiao, Arthur Trembanis, and Xi Peng. 2024. SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey. https://github.com/deep-real/SeafloorAI. arXiv:2411.00172 [cs.CV] https://arxiv.org/abs/2411.00172

[189] Ziqi Ni, Yahao Li, Kaijia Hu, Kunyuan Han, Ming Xu, Xingyu Chen, Fengqi Liu, Yicong Ye, and Shuxin Bai. 2024. MatPilot: an LLM-enabled AI Materials Scientist under the Framework of Human-Machine Collaboration. arXiv:2411.08063 [physics.soc-ph] https://arxiv.org/abs/2411.08063

[190] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. 2025. AlphaEvolve: A coding agent for scientific and algorithmic discovery. https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/. https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf

[191] Koji Ochiai, Yuya Tahara-Arai, Akari Kato, Kazunari Kaizu, Hirokazu Kariyazaki, Makoto Umeno, Koichi Takahashi, Genki N. Kanda, and Haruka Ozaki. 2025. Automating Care by Self-maintainability for Full Laboratory Automation. arXiv:2501.05789 [q-bio.QM] https://arxiv.org/abs/2501.05789

[192] Ollama. 2023. Ollama. https://github.com/ollama/ollama.

[193] Open Manus Team. 2025. OpenManus. https://github.com/mannaandpoem/OpenManus.

[194] OpenAI. 2025. codex. https://github.com/openai/codex.

[195] OpenAI. 2025. Compare models - OpenAI API. https://platform.openai.com/docs/models/compare?model=o3.

[196] OpenAI. 2025. Deep Research System Card. https://cdn.openai.com/deep-research-system-card.pdf.

[197] OpenAI. 2025. Introducing Deep Research. https://openai.com/index/introducing-deep-research/.

[198] OpenAI. 2025. Introducing OpenAI o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/.

[199] OpenAI. 2025. OpenAI Agents SDK. https://github.com/openai/openai-agents-python.

[200] OpenAI. 2025. OpenAI o3 and o4-mini System Card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf.

[201] OpenAI. 2025. Thinking with images. https://openai.com/index/thinking-with-images/.

[202] OpenBMB. 2023. XAgent. https://github.com/OpenBMB/XAgent.
[203] Orkes. 2022. Orkes. https://orkes.io/use-cases/agentic-workflows.
[204] Takayuki Osogami. 2025. Position: AI agents should be regulated based on autonomous action sequences. arXiv:2503.04750 [cs.CY] https://arxiv.org/abs/2503.04750

[205] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL] https://arxiv.org/abs/2203.02155

[206] Md Sultanul Islam Ovi, Nafisa Anjum, Tasmina Haque Bithe, Md. Mahabubur Rahman, and Mst. Shahnaj Akter Smrity. 2024. Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants. arXiv:2409.19922 [cs.SE] https://arxiv.org/abs/2409.19922

[207] Carlos Alves Pereira, Tanay Komarlu, and Wael Mobeirek. 2023. The Future of AI-Assisted Writing. arXiv:2306.16641 [cs.HC] https://arxiv.org/abs/2306.16641

[208] Mike Perkins and Jasper Roe. 2024. Generative AI Tools in Academic Research: Applications and Implications for Qualitative and Quantitative Research Methodologies. arXiv:2408.06872 [cs.HC] https://arxiv.org/abs/2408.06872

[209] Perplexity. 2025. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research.

[210] Perplexity. 2025. Sonar by Perplexity. https://docs.perplexity.ai/guides/model-cards#research-models.

[211] Tomas Petricek, Gerrit J. J. van den Burg, Alfredo Nazábal, Taha Ceritli, Ernesto Jiménez-Ruiz, and Christopher K. I. Williams. 2022. AI Assistants: A Framework for Semi-Automated Data Wrangling. arXiv:2211.00192 [cs.DB] https://arxiv.org/abs/2211.00192

[212] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. 2025. Humanity's Last Exam. arXiv:2501.14249 [cs.LG] https://arxiv.org/abs/2501.14249

[213] Evangelos Pournaras. 2023. Science in the Era of ChatGPT, Large Language Models and Generative AI: Challenges for Research Ethics and How to Respond. arXiv:2305.15299 [cs.CY] https://arxiv.org/abs/2305.15299

[214] Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track. arXiv:2406.16828 [cs.IR] https://arxiv.org/abs/2406.16828

[215] James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, David H. Smith IV, Sven Strickroth, and Daniel Zingaro. 2024. Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools. arXiv:2412.14732 [cs.CY] https://arxiv.org/abs/2412.14732

[216] Pydantic. 2024. Pydantic-AI. https://github.com/pydantic/pydantic-ai.
[217] Pythagora-io. 2024. gpt-pilot. https://github.com/Pythagora-io/gpt-pilot.
[218] Jingyuan Qi, Zian Jia, Minqian Liu, Wangzhi Zhan, Junkai Zhang, Xiaofei Wen, Jingru Gan, Jianpeng Chen, Qin Liu, Mingyu Derek Ma, Bangzheng Li, Haohui Wang, Adithya Kulkarni, Muhao Chen, Dawei Zhou, Ling Li, Wei Wang, and Lifu Huang. 2024. MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design. arXiv:2412.16270 [cs.AI] https://arxiv.org/abs/2412.16270

[219] Laryn Qi, J. D. Zamfirescu-Pereira, Taehan Kim, Björn Hartmann, John DeNero, and Narges Norouzi. 2024. A Knowledge-Component-Based Methodology for Evaluating AI Assistants. arXiv:2406.05603 [cs.CY] https://arxiv.org/abs/2406.05603

[220] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. https://github.com/OpenBMB/ChatDev. https://aclanthology.org/2024.acl-long.810.pdf

[221] Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2025. Benchmarking Agentic Workflow Generation. https://github.com/zjunlp/WorfBench. arXiv:2410.07869 [cs.CL] https://arxiv.org/abs/2410.07869

[222] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789 [cs.AI] https://arxiv.org/abs/2307.16789

[223] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789 [cs.AI] https://arxiv.org/abs/2307.16789

[224] Qwen LM. 2024. Qwen-Agent. https://github.com/QwenLM/Qwen-Agent.

[225] Joaquin Ramirez-Medina, Mohammadmehdi Ataei, and Alidad Amirfazli. 2025. Accelerating Scientific Research Through a Multi-LLM Framework. arXiv:2502.07960 [physics.app-ph] https://arxiv.org/abs/2502.07960

[226] Ruchit Rawal, Victor-Alexandru Pădurean, Sven Apel, Adish Singla, and Mariya Toneva. 2024. Hints Help Finding and Fixing Bugs Differently in Python and Text-based Program Representations. arXiv:2412.12471 [cs.SE] https://arxiv.org/abs/2412.12471

[227] Runtao Ren, Jian Ma, and Jianxi Luo. 2025. Large language model for patent concept generation. Advanced Engineering Informatics 65 (May 2025), 103301. doi:10.1016/j.aei.2025.103301

[228] ResearchRabbit. 2025. ResearchRabbit. https://www.researchrabbit.ai/.

[229] Restate. 2024. Restate. https://restate.dev/.

[230] reworkd. 2023. AgentGPT. https://github.com/reworkd/AgentGPT.

[231] Filippo Ricca, Alessandro Marchetto, and Andrea Stocco. 2025. A Multi-Year Grey Literature Review on AI-assisted Test Automation. https://arxiv.org/pdf/2408.06224.

[232] Nathalie Riche, Anna Offenwanger, Frederic Gmeiner, David Brown, Hugo Romat, Michel Pahud, Nicolai Marquardt, Kori Inkpen, and Ken Hinckley. 2025. AI-Instruments: Embodying Prompts as Instruments to Abstract & Reflect Graphical Interface Commands as General-Purpose Tools. https://arxiv.org/abs/2502.18736.

[233] Anthony Cintron Roman, Jennifer Wortman Vaughan, Valerie See, Steph Ballard, Jehu Torres, Caleb Robinson, and Juan M. Lavista Ferres. 2024. Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments. arXiv:2312.06153 [cs.LG] https://arxiv.org/abs/2312.06153

[234] Kaushik Roy, Vedant Khandelwal, Harshul Surana, Valerie Vera, Amit Sheth, and Heather Heckman. 2023. GEAR-Up: Generative AI and External Knowledge-based Retrieval Upgrading Scholarly Article Searches for Systematic Reviews. arXiv:2312.09948 [cs.IR] https://arxiv.org/abs/2312.09948

[235] Run-llama. 2023. LlamaIndex. https://github.com/run-llama/llama_index.
[236] Sergey V Samsonau, Aziza Kurbonova, Lu Jiang, Hazem Lashen, Jiamu Bai, Theresa Merchant, Ruoxi Wang, Laiba Mehnaz, Zecheng Wang, and Ishita Patil. 2024. Artificial Intelligence for Scientific Research: Authentic Research Education Framework. arXiv:2210.08966 [cs.CY] https://arxiv.org/abs/2210.08966

[237] SamuelSchmidgall. 2025. AgentLaboratory. https://github.com/SamuelSchmidgall/AgentLaboratory.

[238] Thomas Sandholm, Sarah Dong, Sayandev Mukherjee, John Feland, and Bernardo A. Huberman. 2024. Semantic Navigation for AI-assisted Ideation. arXiv:2411.03575 [cs.HC] https://arxiv.org/abs/2411.03575

[239] Lindsay Sanneman and Julie Shah. 2021. Explaining Reward Functions to Humans for Better Human-Robot Collaboration. arXiv:2110.04192 [cs.RO] https://arxiv.org/abs/2110.04192

[240] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. 2025. Agent Laboratory: Using LLM Agents as Research Assistants. arXiv:2501.04227 [cs.HC] https://arxiv.org/abs/2501.04227

[241] Martijn J. Schuemie, M. Soledad Cepeda, Marc A. Suchard, Jianxiao Yang, Yuxi Tian, Alejandro Schuler, Patrick B. Ryan, David Madigan, and George Hripcsak. 2020. How Confident Are We About Observational Findings in Healthcare: A Benchmark Study. Harvard Data Science Review 2, 1 (2020). doi:10.1162/99608f92.147cc28e

[242] Scispace. 2024. Scispace. https://scispace.com/.

[243] Scite. 2025. Scite. https://scite.ai/.
[244] Agnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin, and Iftekhar Ahmed. 2025. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward. Information and Software Technology 178 (Feb. 2025), 107610. doi:10.1016/j.infsof.2024.107610

[245] Mahsa Shamsabadi and Jennifer D’Souza. 2024. A FAIR and Free Prompt-based Research Assistant. arXiv:2405.14601 [cs.CL] https://arxiv.org/abs/2405.14601

[246] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. https://github.com/microsoft/JARVIS. https://arxiv.org/pdf/2303.17580

[247] Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, and David Sontag. 2023. Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks. arXiv:2304.02623 [cs.CL] https://arxiv.org/abs/2304.02623

[248] Shuming Shi, Enbo Zhao, Duyu Tang, Yan Wang, Piji Li, Wei Bi, Haiyun Jiang, Guoping Huang, Leyang Cui, Xinting Huang, Cong Zhou, Yong Dai, and Dongyang Ma. 2022. Effidit: Your AI Writing Assistant. arXiv:2208.01815 [cs.CL] https://arxiv.org/abs/2208.01815

[249] Michael Shumer. 2025. OpenDeepResearcher. https://github.com/mshumer/OpenDeepResearcher.

[250] Significant-Gravitas. 2023. AutoGPT. https://github.com/Significant-Gravitas/AutoGPT.

[251] David Silver and Richard Sutton. 2025. Welcome to the Era of Experience. https://storage.googleapis.com/deepmind-media/Era-of-Experience/The Era of Experience Paper.pdf.

[252] Auste Simkute, Ewa Luger, Michael Evans, and Rhianne Jones. 2024. “It is there, and you need it, so why do you not use it?” Achieving better adoption of AI systems by domain experts, in the case study of natural science research. arXiv:2403.16895 [cs.HC] https://arxiv.org/abs/2403.16895

[253] Michael Skarlinski, Tyler Nadolski, James Braza, Remo Storni, Mayk Caldas, Ludovico Mitchener, Michaela Hinks, Andrew White, and Sam Rodriques. 2025. FutureHouse Platform: Superintelligent AI Agents for Scientific Discovery. https://www.futurehouse.org/research-announcements/launching-futurehouse-platform-ai-agents.

[254] Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, and Yili Hong. 2025. Performance Evaluation of Large Language Models in Statistical Programming. arXiv:2502.13117 [stat.AP] https://arxiv.org/abs/2502.13117

[255] Jamshid Sourati and James Evans. 2021. Accelerating science with human versus alien artificial intelligences. arXiv:2104.05188 [cs.AI] https://arxiv.org/abs/2104.05188

[256] Jamshid Sourati and James Evans. 2023. Accelerating science with human-aware artificial intelligence. arXiv:2306.01495 [cs.AI] https://arxiv.org/abs/2306.01495

[257] StanfordNLP. 2024. DSPy. https://github.com/stanfordnlp/dspy.

[258] Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong. 2025. Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System. arXiv:2410.09403 [cs.AI] https://arxiv.org/abs/2410.09403

[259] Suzhou Yuling Artificial Intelligence Technology Co., Ltd. 2023. Dify: Open-source LLM Application Development Platform. https://dify.ai/.

[260] Xin Tan, Xiao Long, Xianjun Ni, Yinghao Zhu, Jing Jiang, and Li Zhang. 2024. How far are AI-powered programming assistants from meeting developers’ needs? arXiv:2404.12000 [cs.SE] https://arxiv.org/abs/2404.12000

[261] Brian Tang and Kang G. Shin. 2024. Steward: Natural Language Web Automation. arXiv:2409.15441 [cs.AI] https://arxiv.org/abs/2409.15441

[262] Jiabin Tang, Tianyu Fan, and Chao Huang. 2025. AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents. arXiv:2502.05957 [cs.AI] https://arxiv.org/abs/2502.05957

[263] Yan Tang. 2025. deep_research_agent. https://github.com/grapeot/deep_research_agent.

[264] Tadahiro Taniguchi, Shiro Takagi, Jun Otsuka, Yusuke Hayashi, and Hiro Taiyo Hamada. 2024. Collective Predictive Coding as Model of Science: Formalizing Scientific Activities Towards Generative Science. arXiv:2409.00102 [physics.soc-ph] https://arxiv.org/abs/2409.00102

[265] Temporalio. 2020. Temporal. https://github.com/temporalio/temporal.

[266] Enkeleda Thaqi, Mohamed Omar Mantawy, and Enkelejda Kasneci. 2024. SARA: Smart AI Reading Assistant for Reading Comprehension. In Proceedings of the 2024 Symposium on Eye Tracking Research and Applications (ETRA '24). ACM, 1-3. doi:10.1145/3649902.3655661

[267] TheBlewish. 2024. Automated-AI-Web-Researcher-Ollama. https://github.com/TheBlewish/Automated-AI-Web-Researcher-Ollama.

[268] Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, and Hao Peng. 2024. SciCode: A Research Coding Benchmark Curated by Scientists. https://scicode-bench.github.io/. arXiv:2407.13168 [cs.AI] https://arxiv.org/abs/2407.13168

[269] Ievgeniia A. Tiukova, Daniel Brunnsåker, Erik Y. Bjurström, Alexander H. Gower, Filip Kronström, Gabriel K. Reder, Ronald S. Reiserer, Konstantin Korovin, Larisa B. Soldatova, John P. Wikswo, and Ross D. King. 2024. Genesis: Towards the Automation of Systems Biology Research. arXiv:2408.10689 [cs.AI] https://arxiv.org/abs/2408.10689

[270] Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Aleksandr Gordeev, Vladimir Dokholyan, and Maksim Kuprashevich. 2024. GigaCheck: Detecting LLM-generated Content. arXiv:2410.23728 [cs.CL] https://arxiv.org/abs/2410.23728

[271] Benjamin Towle and Ke Zhou. 2024. Enhancing AI Assisted Writing with One-Shot Implicit Negative Feedback. arXiv:2410.11009 [cs.CL] https://arxiv.org/abs/2410.11009

[272] Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, and Khoa Luu. 2025. InsectFoundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding. https://uark-cviu.github.io/projects/insect-foundation/. arXiv:2502.09906 [cs.CV] https://arxiv.org/abs/2502.09906

[273] Joseph Tu, Hilda Hadan, Derrick M. Wang, Sabrina A Sgandurra, Reza Hadi Mogavi, and Lennart E. Nacke. 2024. Augmenting the Author: Exploring the Potential of AI Collaboration in Academic Writing. arXiv:2404.16071 [cs.HC] https://arxiv.org/abs/2404.16071

[274] Xinming Tu, James Zou, Weijie J. Su, and Linjun Zhang. 2023. What Should Data Science Education Do with Large Language Models? arXiv:2307.02792 [cs.CY] https://arxiv.org/abs/2307.02792

[275] Michele Tufano, Anisha Agarwal, Jinu Jang, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2024. AutoDev: Automated AI-Driven Development. arXiv:2403.08299 [cs.SE] https://arxiv.org/abs/2403.08299

[276] Aleksei Turobov, Diane Coyle, and Verity Harding. 2024. Using ChatGPT for Thematic Analysis. arXiv:2405.08828 [cs.HC] https://arxiv.org/abs/2405.08828

[277] Rasmus Ulfsnes, Nils Brede Moe, Viktoria Stray, and Marianne Skarpen. 2024. Transforming Software Development with Generative AI: Empirical Insights on Collaboration and Workflow. arXiv:2405.01543 [cs.SE] https://arxiv.org/abs/2405.01543

[278] Stanford University. 2025. STORM. https://storm.genie.stanford.edu/.

[279] Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, and Grant Van Horn. 2024. INQUIRE: A Natural World Text-to-Image Retrieval Benchmark. https://inquirebenchmark.github.io/. arXiv:2411.02537 [cs.CV] https://arxiv.org/abs/2411.02537

[280] Vercel. 2020. Vercel. https://vercel.com/.
[281] Vllm-project. 2023. vllm. https://github.com/vllm-project/vllm.
[282] Thiemo Wambsganss, Xiaotian Su, Vinitra Swamy, Seyed Parsa Neshaei, Roman Rietsche, and Tanja Käser. 2023. Unraveling Downstream Gender Bias from Large Language Models: A Study on AI Educational Writing Assistance. arXiv:2311.03311 [cs.CL] https://arxiv.org/abs/2311.03311

[283] April Yi Wang, Dakuo Wang, Jaimie Drozdal, Michael Muller, Soya Park, Justin D. Weisz, Xuye Liu, Lingfei Wu, and Casey Dugan. 2022. Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks. ACM Transactions on Computer-Human Interaction 29, 2 (Jan. 2022), 1-33. doi:10.1145/3489465

[284] Dakuo Wang, Q. Vera Liao, Yunfeng Zhang, Udayan Khurana, Horst Samulowitz, Soya Park, Michael Muller, and Lisa Amini. 2021. How Much Automation Does a Data Scientist Want? arXiv:2101.03970 [cs.LG] https://arxiv.org/abs/2101.03970

[285] Suyuan Wang, Xueqian Yin, Menghao Wang, Ruofeng Guo, and Kai Nan. 2024. EvoPat: A Multi-LLM-based Patents Summarization and Analysis Agent. arXiv:2412.18100 [cs.DL] https://arxiv.org/abs/2412.18100

[286] Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, Yibin Liu, Jialong Wu, Shengwei Ding, Long Li, Zhiwei Huang, Xinle Deng, Teng Yu, Gangan Ma, Han Xiao, Zixin Chen, Danjun Xiang, Yunxia Wang, Yuanyuan Zhu, Yi Xiao, Jing Wang, Yiru Wang, Siran Ding, Jiayang Huang, Jiayi Xu, Yilihamu Tayier, Zhenyu Hu, Yuan Gao, Chengfeng Zheng, Yueshu Ye, Yihang Li, Lei Wan, Xinyue Jiang, Yujie Wang, Siyu Cheng, Zhule Song, Xiangru Tang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang, and Wangchunshu Zhou. 2024. Weaver: Foundation Models for Creative Writing. arXiv:2401.17268 [cs.CL] https://arxiv.org/abs/2401.17268

[287] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL] https://arxiv.org/abs/2203.11171

[288] Yao Wang, Mingxuan Cui, and Arthur Jiang. 2025. Enabling AI Scientists to Recognize Innovation: A Domain-Agnostic Algorithm for Assessing Novelty. arXiv:2503.01508 [cs.AI] https://arxiv.org/abs/2503.01508

[289] Ying-Mei Wang and Tzeng-J Chen. 2025. AI’s deep research revolution: Transforming biomedical literature analysis. https://journals.lww.com/jcma/citation/9900/ai_s_deep_research_revolution___transforming.508.aspx.

[290] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. 2024. Measuring short-form factuality in large language models. https://cdn.openai.com/papers/simpleqa.pdf.

[291] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL] https://arxiv.org/abs/2201.11903

[292] Shufa Wei, Xiaolong Xu, Xianbiao Qi, Xi Yin, Jun Xia, Jingyi Ren, Peijun Tang, Yuxiang Zhong, Yihao Chen, Xiaoqin Ren, Yuxin Liang, Liankai Huang, Kai Xie, Weikang Gui, Wei Tan, Shuanglong Sun, Yongquan Hu, Qinxian Liu, Nanjin Li, Chihao Dai, Lihua Wang, Xiaohui Liu, Lei Zhang, and Yutao Xie. 2023. AcademicGPT: Empowering Academic Research. arXiv:2311.12315 [cs.CL] https://arxiv.org/abs/2311.12315

[293] Sarah Welsh. 2025. AI Benchmark Deep Dive: Gemini 2.5 and Humanity’s Last Exam. https://arize.com/blog/ai-benchmark-deep-dive-gemini-humanitys-last-exam/.

[294] Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. 2025. CycleResearcher: Improving Automated Research via Automated Review. arXiv:2411.00816 [cs.CL] https://arxiv.org/abs/2411.00816

[295] Man Fai Wong, Shangxin Guo, Ching Nam Hang, Siu Wai Ho, and Chee Wei Tan. 2023. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 25, 6 (June 2023), 888. doi:10.3390/e25060888

[296] Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F. Wong, and Lidia S. Chao. 2024. A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions. arXiv:2310.14724 [cs.CL] https://arxiv.org/abs/2310.14724

[297] Junde Wu, Jiayuan Zhu, and Yuyuan Liu. 2025. Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research. arXiv:2502.04644 [cs.AI] https://arxiv.org/abs/2502.04644

[298] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. https://github.com/microsoft/autogen. arXiv:2308.08155 [cs.AI] https://arxiv.org/abs/2308.08155

[299] xAI. 2025. Grok 3 Beta - The age of reasoning agents. https://x.ai/news/grok-3.

[300] Menglin Xia, Victor Ruehle, Saravan Rajmohan, and Reza Shokri. 2025. Minerva: A Programmable Memory Test Benchmark for Language Models. https://github.com/gkamradt/LLMTest_NeedleInAHaystack. arXiv:2502.03358 [cs.CL] https://arxiv.org/abs/2502.03358

[301] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. OpenAgents: An Open Platform for Language Agents in the Wild. arXiv:2310.10634 [cs.CL] https://arxiv.org/abs/2310.10634

[302] Yujia Xie, Xun Wang, Si-Qing Chen, Wayne Xiong, and Pengcheng He. 2023. Interactive Editing for Text Summarization. arXiv:2306.03067 [cs.CL] https://arxiv.org/abs/2306.03067

[303] Feng Xiong, Xinguo Yu, and Hon Wai Leong. 2024. AI-Empowered Human Research Integrating Brain Science and Social Sciences Insights. arXiv:2411.12761 [cs.HC] https://arxiv.org/abs/2411.12761

[304] Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. 2024. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. arXiv:2412.14161 [cs.CL] https://arxiv.org/abs/2412.14161

[305] Xin Xu, Qiyun Xu, Tong Xiao, Tianhao Chen, Yuchen Yan, Jiaxin Zhang, Shizhe Diao, Can Yang, and Yang Wang. 2025. UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models. https://github.com/YangLabHKUST/UGPhysics. arXiv:2502.00334 [cs.CL] https://arxiv.org/abs/2502.00334

[306] Te-Lun Yang, Jyi-Shane Liu, Yuen-Hsien Tseng, and Jyh-Shing Roger Jang. 2025. Knowledge Retrieval Based on Generative AI. arXiv:2501.04635 [cs.IR] https://arxiv.org/abs/2501.04635

[307] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600 [cs.CL] https://arxiv.org/abs/1809.09600

[308] Yi Yao, Jun Wang, Yabai Hu, Lifeng Wang, Yi Zhou, Jack Chen, Xuming Gai, Zhenming Wang, and Wenjun Liu. 2024. BugBlitz-AI: An Intelligent QA Assistant. arXiv:2406.04356 [cs.SE] https://arxiv.org/abs/2406.04356

[309] Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating the Code Quality of AIAssisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778 [cs.SE] https://arxiv.org/abs/2304.10778

[310] Xiaoxin Yin. 2024. “Turing Tests” For An AI Scientist. https://github.com/MatthewFilipovich/pycharge. arXiv:2405.13352 [cs.AI] https://arxiv.org/abs/2405.13352

[311] yoheinakajima. 2024. BabyAGI. https://github.com/yoheinakajima/babyagi.

[312] Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, and Tae In Ahn. 2025. Knowledge Synthesis of Photosynthesis Research Using a Large Language Model. arXiv:2502.01059 [cs.CL] https://arxiv.org/abs/2502.01059

[313] You.com. 2023. You.com. https://you.com/about.
[314] Hengjie Yu and Yaochu Jin. 2025. Unlocking the Potential of AI Researchers in Scientific Discovery: What Is Missing? arXiv:2503.05822 [cs.CY] https://arxiv.org/abs/2503.05822

[315] Hengjie Yu and Yaochu Jin. 2025. Unlocking the Potential of AI Researchers in Scientific Discovery: What Is Missing? arXiv:2503.05822 [cs.CY] https://arxiv.org/abs/2503.05822

[316] Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, and Bowen Zhou. 2025. Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback. arXiv:2501.03916 [cs.AI] https://arxiv.org/abs/2501.03916

[317] Siyu Yuan, Cheng Jiayang, Lin Qiu, and Deqing Yang. 2024. Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models? arXiv:2406.11375 [cs.CL] https://arxiv.org/abs/2406.11375

[318] Hector Zenil, Jesper Tegnér, Felipe S. Abrahão, Alexander Lavin, Vipin Kumar, Jeremy G. Frey, Adrian Weller, Larisa Soldatova, Alan R. Bundy, Nicholas R. Jennings, Koichi Takahashi, Lawrence Hunter, Saso Dzeroski, Andrew Briggs, Frederick D. Gregory, Carla P. Gomes, Jon Rowe, James Evans, Hiroaki Kitano, and Ross King. 2023. The Future of Fundamental Science Led by Generative Closed-Loop Artificial Intelligence. arXiv:2307.07522 [cs.AI] https://arxiv.org/abs/2307.07522

[319] Beiqi Zhang, Peng Liang, Xiyu Zhou, Aakash Ahmad, and Muhammad Waseem. 2023. Practices and Challenges of Using GitHub Copilot: An Empirical Study. In Proceedings of the 35th International Conference on Software Engineering and Knowledge Engineering (SEKE2023, Vol. 2023). KSI Research Inc., 124-129. doi:10.18293/seke2023-077

[320] Cedegao E. Zhang, Katherine M. Collins, Adrian Weller, and Joshua B. Tenenbaum. 2023. AI for Mathematics: A Cognitive Science Perspective. arXiv:2310.13021 [q-bio.NC] https://arxiv.org/abs/2310.13021

[321] David Zhang. 2025. deep-research. https://github.com/dzhng/deep-research.

[322] Kevin Zhang and Hod Lipson. 2024. Aligning AI-driven discovery with human intuition. arXiv:2410.07397 [cs.LG] https://arxiv.org/abs/2410.07397

[323] Xingjian Zhang, Yutong Xie, Jin Huang, Jinge Ma, Zhaoying Pan, Qijia Liu, Ziyang Xiong, Tolga Ergen, Dongsub Shim, Honglak Lee, and Qiaozhu Mei. 2024. MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows. https://github.com/xingjian-zhang/massw. arXiv:2406.06357 [cs.CL] https://arxiv.org/abs/2406.06357

[324] Zheng Zhang, Jie Gao, Ranjodh Singh Dhaliwal, and Toby Jia-Jun Li. 2023. VISAR: A Human-AI Argumentative Writing Assistant with Visual Programming and Rapid Draft Prototyping. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23). ACM, 1-30. doi:10.1145/3586183.3606800

[325] Jinjin Zhao, Avidgor Gal, and Sanjay Krishnan. 2024. A System for Quantifying Data Science Workflows with Fine-Grained Procedural Logging and a Pilot Study. arXiv:2405.17845 [cs.HC] https://arxiv.org/abs/2405.17845

[326] Zihan Zhao, Bo Chen, Jingpiao Li, Lu Chen, Liyang Wen, Pengyu Wang, Zichen Zhu, Danyang Zhang, Yansi Li, Zhongyang Dai, Xin Chen, and Kai Yu. 2024. ChemDFM-X: towards large multimodal model for chemistry. https://github.com/OpenDFM/ChemDFM-X. Science China Information Sciences 67, 12 (Dec. 2024). doi:10.1007/s11432-024-4243-0

[327] Raigul Zheldibayeva. 2025. The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars. arXiv:2503.05820 [cs.CY] https://arxiv.org/abs/2503.05820

[328] Dewu Zheng, Yanlin Wang, Ensheng Shi, Xilin Liu, Yuchi Ma, Hongyu Zhang, and Zibin Zheng. 2025. Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark. https://github.com/DeepSoftwareAnalytics/MultiCodeBench. arXiv:2412.18573 [cs.SE] https://arxiv.org/abs/2412.18573

[329] Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. 2025. DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. arXiv:2504.03160 [cs.AI] https://arxiv.org/abs/2504.03160

[330] Zhipu AI. 2025. AutoGLM-Research. https://autoglm-research.zhipuai.cn/.

[331] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364 [cs.CL] https://arxiv.org/abs/2304.06364

[332] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. https://github.com/web-arena-x/webarena. arXiv:2307.13854 [cs.AI] https://arxiv.org/abs/2307.13854

[333] Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. 2025. DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process. arXiv:2503.08569 [cs.CL] https://arxiv.org/abs/2503.08569

[334] Tonghe Zhuang and Zhicheng Lin. 2024. The why, what, and how of AI-based coding in scientific research. arXiv:2410.02156 [cs.CY] https://arxiv.org/abs/2410.02156

[335] Dennis Zyska, Nils Dycke, Jan Buchmann, Ilia Kuznetsov, and Iryna Gurevych. 2023. CARE: Collaborative AI-Assisted Reading Environment. arXiv:2302.12611 [cs.CL] https://arxiv.org/abs/2302.12611

[336] Tolga Çöplü, Arto Bendiken, Andrii Skomorokhov, Eduard Bateiko, Stephen Cobb, and Joshua J. Bouw. 2024. Prompt-Time Symbolic Knowledge Capture with Large Language Models. https://github.com/HaltiaAI/paper-PTSKC. arXiv:2402.00414 [cs.CL] https://arxiv.org/abs/2402.00414

  1. *Corresponding author: rux@zju.edu.cn

    Authors’ Contact Information: Renjun Xu; Jingwen Peng, Zhejiang University, China.
  2. *Humanity’s Last Exam: Tests frontier research capabilities. **Massive Multitask Language Understanding: Tests general knowledge. ***GAIA Score (pass@1): Average score.