Agent AI:
Surveying the Horizons of Multimodal Interaction
Abstract
Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
Figure 1: Overview of an Agent AI system that can perceive and act across different domains and applications. Agent AI is emerging as a promising route toward Artificial General Intelligence (AGI). Its training has demonstrated the capacity for multimodal understanding of the physical world, and it provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Agent-related foundation models trained on cross-reality data can be applied to both the physical and the virtual world. The overview of the Agent AI system presented in this paper illustrates its potential as a path toward AGI.
1 Introduction
1.1 Motivation
Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, Minsky’s MIT group built a robotics system in 1970, called the “Copy Demo,” that observed “blocks world” scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems is highly challenging and that further research was necessary. The AI field subsequently fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.
To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability. It explores the potential completion of this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle’s Final Cause, the teleological “why the system exists”, which may have been overlooked in previous rounds of AI development.
With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise OpenAI (2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules Xi et al. (2023). These scenarios, characterized by complex reasoning, understanding of the agent’s role and its environment, along with multi-step planning, test the agent’s ability to make highly nuanced and intricate decisions within its environmental constraints Wu et al. (2023); Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022).
Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments.
In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability.
Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that Agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.
1.2 Background
We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.
Large Foundation Models:
LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a, b; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.
Embodied AI:
A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically drawing on the LLMs’ WWW-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instruction into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, these works incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).
Interactive Learning:
AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. These training models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback Li et al. (2023b); Yu et al. (2023a); Parakh et al. (2023); Zha et al. (2023); Wake et al. (2023a, b, c). For example, if a user corrects the AI’s response, the AI can use this information to improve future responses Zha et al. (2023); Liu et al. (2023a). 2) Observational Learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. This allows the AI agent to understand and process human language in multi-modal settings, interpret cross-reality contexts, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent’s performance generally improves continuously. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
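As a minimal illustration of the feedback-based learning described above, the sketch below keeps a small store of explicit user corrections and prefers them on repeated queries; the `FeedbackAgent` class, its methods, and the correction store are hypothetical placeholders and do not correspond to any specific system cited here.

```python
# A minimal, hypothetical sketch of feedback-based interactive learning:
# the agent records explicit user corrections and reuses them on repeat queries.

class FeedbackAgent:
    def __init__(self, base_model):
        self.base_model = base_model        # any callable: prompt -> response
        self.corrections = {}               # query -> user-corrected response

    def respond(self, query: str) -> str:
        # Prefer a response that a user previously corrected (feedback-based learning).
        if query in self.corrections:
            return self.corrections[query]
        return self.base_model(query)

    def give_feedback(self, query: str, corrected_response: str) -> None:
        # Direct user feedback is stored and reused to improve future responses.
        self.corrections[query] = corrected_response


if __name__ == "__main__":
    agent = FeedbackAgent(base_model=lambda q: f"[model answer to: {q}]")
    print(agent.respond("What is Agent AI?"))
    agent.give_feedback("What is Agent AI?", "An embodied, interactive multimodal AI system.")
    print(agent.respond("What is Agent AI?"))  # now returns the corrected answer
```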
1.3 Overview
Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge on the common concerns discussed in these fields. As a result, we expect readers to learn the fundamentals of MAA and gain insights to further advance its research. Specific learning outcomes include:
• MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.
• Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.
• Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.
• Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.
• Emerging Trends and Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions.
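Returning to the definition of MAA at the start of this overview, the sketch below shows one possible minimal interface for such a system: it consumes a multimodal observation and returns an action. The `Observation` fields, the `MultimodalAgent` protocol, and the trivial `EchoAgent` are illustrative assumptions rather than an interface defined by this paper.

```python
# A hypothetical minimal interface for a Multimodal Agent AI (MAA) system:
# it maps multimodal sensory input from an environment to an embodied action.

from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass
class Observation:
    text: Optional[str] = None          # language input (e.g., an instruction)
    image: Optional[Any] = None         # visual input (e.g., a camera frame)
    audio: Optional[Any] = None         # audio input (e.g., a speech waveform)
    state: dict = field(default_factory=dict)   # other environment-grounded data


class MultimodalAgent(Protocol):
    def act(self, obs: Observation) -> str:
        """Return the next embodied action given a multimodal observation."""
        ...


class EchoAgent:
    """Trivial stand-in agent: acts only on the language channel."""

    def act(self, obs: Observation) -> str:
        return f"say({obs.text!r})" if obs.text else "wait()"


if __name__ == "__main__":
    agent = EchoAgent()
    print(agent.act(Observation(text="pick up the red block")))
```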
Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, it must be natural to interact with and able to generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe this Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community and shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and experiment discussions, ensuring a comprehensive and engaging learning experience for all researchers.
This paper aims to provide general and comprehensive knowledge about the current research in the field of Agent AI. To this end, the rest of the paper is organized as follows.
Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models.
Section 3 describes a new paradigm and framework that we propose for training Agent AI.
Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI.
Section 5 categorizes and discusses various types of agents.
Section 6 introduces Agent AI applications in gaming, robotics, and healthcare.
Section 7 explores the research community’s efforts to develop a versatile Agent AI, capable of being applied across various modalities, domains, and bridging the sim-to-real gap.
Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users.
Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI.
Section 11 discusses the ethical considerations of AI agents, as well as the limitations and societal impact of our paper.
2 Agent AI Integration
Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios Huang et al. (2023a); Zeng et al. (2023). Consequently, these limitations lead to sub-optimal outputs from AI agents.
Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g. text or string representations of the world state) and generally use domain and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment Xi et al. (2023); Wang et al. (2023c); Gong et al. (2023a); Wu et al. (2023).
In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results indicating that knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions Huang et al. (2023a). By integrating an Agent AI framework, large foundation models are able to understand user input more deeply and form a complex and adaptive HCI system. The emergent abilities of LLMs and VLMs appear across generative AI, embodied AI, knowledge augmentation for multi-modal learning, mixed-reality generation, text-to-vision editing, and human interaction for 2D/3D simulation in gaming or robotics tasks. Recent Agent AI progress in foundation models presents an imminent catalyst for unlocking general intelligence in embodied agents. Large action models, or agent-vision-language models, open new possibilities for general-purpose embodied systems, such as planning, problem-solving, and learning in complex environments. Agent AI is a further testing step toward the metaverse and a route toward an early version of AGI.
Figure 2: A multimodal Agent AI for 2D/3D embodied generation and editing interaction in cross-reality environments.
2.1 Infinite AI agent
AI agents have the capacity to interpret, predict, and respond based on their training and input data. While these capabilities are advanced and continually improving, it is important to recognize their limitations and the influence of the underlying data on which they are trained. AI agent systems generally possess the following abilities:
1) Predictive Modeling: AI agents can predict likely outcomes or suggest next steps based on historical data and trends. For instance, they might predict the continuation of a text, the answer to a question, the next action for a robot, or the resolution of a scenario. 2) Decision Making: In some applications, AI agents can make decisions based on their inferences. Generally, the agent will base its decision on what is most likely to achieve a specified goal. For AI applications like recommendation systems, an agent can decide what products or content to recommend based on its inferences about user preferences. 3) Handling Ambiguity: AI agents can often handle ambiguous input by inferring the most likely interpretation based on context and training. However, their ability to do so is limited by the scope of their training data and algorithms. 4) Continuous Improvement: While some AI agents have the ability to learn from new data and interactions, many large language models do not continuously update their knowledge-base or internal representation after training. Their inferences are usually based solely on the data that was available up to the point of their last training update.
We show augmented interactive agents for multi-modality and cross reality-agnostic integration with an emergence mechanism in Fig. 2. An AI agent requires collecting extensive training data for every new task, which can be costly or impossible for many domains. In this study, we develop an infinite agent that learns to transfer memory information from general foundation models (e.g., GPT-X, DALL-E) to novel domains or scenarios for scene understanding, generation, and interactive editing in physical or virtual worlds.
An application of such an infinite agent in robotics is RoboGen Wang et al. (2023d). In this study, the authors propose a pipeline that autonomously runs cycles of task proposition, environment generation, and skill learning. RoboGen is an effort to transfer the knowledge embedded in large models to robotics.
2.2 Agent AI with Large Foundation Models
Recent studies have indicated that large foundation models play a crucial role in creating data that act as benchmarks for determining the actions of agents within environment-imposed constraints, for example, using foundation models for robotic manipulation (Black et al., 2023; Ko et al., 2023) and navigation (Shah et al., 2023a; Zhou et al., 2023a). To illustrate, Black et al. employed an image-editing model as a high-level planner to generate images of future sub-goals, thereby guiding low-level policies Black et al. (2023). For robot navigation, Shah et al. proposed a system that employs an LLM to identify landmarks from text and a VLM to associate these landmarks with visual inputs, enhancing navigation through natural language instructions Shah et al. (2023a).
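The navigation recipe described above can be sketched roughly as follows: an LLM proposes landmark phrases from a natural-language instruction, and a VLM scores how well each camera frame matches each landmark. The function names and the `llm` and `vlm_similarity` callables are placeholders assumed for illustration; this is not the implementation of Shah et al. (2023a).

```python
# A hedged sketch of language-driven navigation in the spirit of the systems above:
# an LLM proposes landmark phrases from the instruction, and a VLM scores each
# camera frame against each landmark to pick the next navigation subgoal.
# `llm` and `vlm_similarity` are placeholder callables, not real library APIs.

from typing import Callable, List, Sequence, Tuple


def extract_landmarks(llm: Callable[[str], str], instruction: str) -> List[str]:
    prompt = (
        "List the physical landmarks mentioned in this navigation instruction, "
        f"one per line:\n{instruction}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def ground_landmarks(
    vlm_similarity: Callable[[object, str], float],
    frames: Sequence[object],
    landmarks: Sequence[str],
) -> List[Tuple[str, int]]:
    """For each landmark, return the index of the camera frame that matches it best."""
    grounded = []
    for landmark in landmarks:
        scores = [vlm_similarity(frame, landmark) for frame in frames]
        grounded.append((landmark, max(range(len(frames)), key=scores.__getitem__)))
    return grounded
```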
There is also growing interest in the generation of conditioned human motions in response to language and environmental factors. Several AI systems have been proposed to generate motions and actions that are tailored to specific linguistic instructions Kim et al. (2023); Zhang et al. (2022); Tevet et al. (2022) and to adapt to various 3D scenes Wang et al. (2022a). This body of research emphasizes the growing capabilities of generative models in enhancing the adaptability and responsiveness of AI agents across diverse scenarios.
2.2.1 Hallucinations
Agents that generate text are often prone to hallucinations, which are instances where the generated text is nonsensical or unfaithful to the provided source content Raunak et al. (2021); Maynez et al. (2020). Hallucinations can be split into two categories, intrinsic and extrinsic Ji et al. (2023). Intrinsic hallucinations are hallucinations that are contradictory to the source material, whereas extrinsic hallucinations are when the generated text contains additional information that was not originally included in the source material.
Some promising routes for reducing the rate of hallucination in language generation involve using retrieval-augmented generation Lewis et al. (2020); Shuster et al. (2021) or other methods for grounding natural language outputs via external knowledge retrieval Dziri et al. (2021); Peng et al. (2023). Generally, these methods seek to augment language generation by retrieving additional source material and by providing mechanisms to check for contradictions between the generated response and the source material.
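A minimal sketch of this retrieval-augmented pattern is given below: candidate passages are retrieved for a query, prepended to the prompt, and the model is asked to answer only from the retrieved material. The token-overlap retrieval and the prompt wording are simplifications assumed for illustration, and `llm` is a placeholder callable.

```python
# Hedged sketch of retrieval-augmented generation (RAG) for grounding:
# retrieve relevant passages, then condition the generator on them so the
# response can be checked against explicit source material.

from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    # Toy relevance score: number of lowercase tokens shared with the query.
    q_tokens = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(q_tokens & set(doc.lower().split())), reverse=True)
    return scored[:k]


def grounded_answer(llm: Callable[[str], str], query: str, corpus: List[str]) -> str:
    passages = retrieve(query, corpus)
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say so.\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)
```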
Within the context of multi-modal agent systems, VLMs have been shown to hallucinate as well Zhou et al. (2023b). One common cause of hallucination for vision-based language-generation is due to the over-reliance on co-occurrence of objects and visual cues in the training data Rohrbach et al. (2018). AI agents that exclusively rely upon pretrained LLMs or VLMs and use limited environment-specific finetuning can be particularly vulnerable to hallucinations since they rely upon the internal knowledge-base of the pretrained models for generating actions and may not accurately understand the dynamics of the world state in which they are deployed.
2.2.2 Biases and Inclusivity
AI agents based on LLMs or LMMs (large multimodal models) have biases due to several factors inherent in their design and training process. When designing these AI agents, we must be mindful of being inclusive and aware of the needs of all end users and stakeholders. In the context of AI agents, inclusivity refers to the measures and principles employed to ensure that the agent’s responses and interactions are inclusive, respectful, and sensitive to a wide range of users from diverse backgrounds. We list key aspects of agent biases and inclusivity below.
• Training Data: Foundation models are trained on vast amounts of text data collected from the internet, including books, articles, websites, and other text sources. This data often reflects the biases present in human society, and the model can inadvertently learn and reproduce these biases. This includes stereotypes, prejudices, and slanted viewpoints related to race, gender, ethnicity, religion, and other personal attributes. In particular, by training on internet data and often only English text, models implicitly learn the cultural norms of Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies Henrich et al. (2010) who have a disproportionately large internet presence. However, it is essential to recognize that datasets created by humans cannot be entirely devoid of bias, since they frequently mirror the societal biases and the predispositions of the individuals who generated and/or compiled the data initially.
• Historical and Cultural Biases: AI models are trained on large datasets sourced from diverse content. Thus, the training data often includes historical texts or materials from various cultures. In particular, training data from historical sources may contain offensive or derogatory language representing a particular society’s cultural norms, attitudes, and prejudices. This can lead to the model perpetuating outdated stereotypes or not fully understanding contemporary cultural shifts and nuances.
• Language and Context Limitations: Language models might struggle with understanding and accurately representing nuances in language, such as sarcasm, humor, or cultural references. This can lead to misinterpretations or biased responses in certain contexts. Furthermore, there are many aspects of spoken language that are not captured by pure text data, leading to a potential disconnect between human understanding of language and how models understand language.
• Policies and Guidelines: AI agents operate under strict policies and guidelines to ensure fairness and inclusivity. For instance, in generating images, there are rules to diversify depictions of people, avoiding stereotypes related to race, gender, and other attributes.
• Overgeneralization: These models tend to generate responses based on patterns seen in the training data. This can lead to overgeneralizations, where the model might produce responses that seem to stereotype or make broad assumptions about certain groups.
• Constant Monitoring and Updating: AI systems are continuously monitored and updated to address any emerging biases or inclusivity issues. Feedback from users and ongoing research in AI ethics play a crucial role in this process.
• Amplification of Dominant Views: Since the training data often includes more content from dominant cultures or groups, the model may be more biased towards these perspectives, potentially underrepresenting or misrepresenting minority viewpoints.
• Ethical and Inclusive Design: AI tools should be designed with ethical considerations and inclusivity as core principles. This includes respecting cultural differences, promoting diversity, and ensuring that the AI does not perpetuate harmful stereotypes.
• User Guidelines: Users are also guided on how to interact with AI in a manner that promotes inclusivity and respect. This includes refraining from requests that could lead to biased or inappropriate outputs. Furthermore, it can help mitigate models learning harmful material from user interactions.
Despite these measures, AI agents still exhibit biases. Ongoing efforts in agent AI research and development are focused on further reducing these biases and enhancing the inclusivity and fairness of agent AI systems. Efforts to Mitigate Biases:
• Diverse and Inclusive Training Data: Efforts are made to include a more diverse and inclusive range of sources in the training data.
• Bias Detection and Correction: Ongoing research focuses on detecting and correcting biases in model responses.
• Ethical Guidelines and Policies: Models are often governed by ethical guidelines and policies designed to mitigate biases and ensure respectful and inclusive interactions.
• Diverse Representation: Ensuring that the content generated or the responses provided by the AI agent represent a wide range of human experiences, cultures, ethnicities, and identities. This is particularly relevant in scenarios like image generation or narrative construction.
• Bias Mitigation: Actively working to reduce biases in the AI’s responses. This includes biases related to race, gender, age, disability, sexual orientation, and other personal characteristics. The goal is to provide fair and balanced responses that do not perpetuate stereotypes or prejudices.
• Cultural Sensitivity: The AI is designed to be culturally sensitive, acknowledging and respecting the diversity of cultural norms, practices, and values. This includes understanding and appropriately responding to cultural references and nuances.
• Accessibility: Ensuring that the AI agent is accessible to users with different abilities, including those with disabilities. This can involve incorporating features that make interactions easier for people with visual, auditory, motor, or cognitive impairments.
• Language-based Inclusivity: Providing support for multiple languages and dialects to cater to a global user base, and being sensitive to the nuances and variations within a language Liu et al. (2023b).
• Ethical and Respectful Interactions: The agent is programmed to interact ethically and respectfully with all users, avoiding responses that could be deemed offensive, harmful, or disrespectful.
• User Feedback and Adaptation: Incorporating user feedback to continually improve the inclusivity and effectiveness of the AI agent. This includes learning from interactions to better understand and serve a diverse user base.
• Compliance with Inclusivity Guidelines: Adhering to established guidelines and standards for inclusivity in AI agents, which are often set by industry groups, ethical boards, or regulatory bodies.
Despite these efforts, it’s important to be aware of the potential for biases in responses and to interpret them with critical thinking. Continuous improvements in AI agent technology and ethical practices aim to reduce these biases over time. One of the overarching goals for inclusivity in agent AI is to create an agent that is respectful and accessible to all users, regardless of their background or identity.
2.2.3 Data Privacy and Usage
One key ethical consideration of AI agents involves comprehending how these systems handle, store, and potentially retrieve user data. We discuss key aspects below:
Data Collection, Usage and Purpose.
When using user data to improve model performance, model developers access the data the AI agent has collected while in production and interacting with users. Some systems allow users to view their data through user accounts or by making a request to the service provider. It is important to recognize what data the AI agent collects during these interactions. This could include text inputs, user usage patterns, personal preferences, and sometimes more sensitive personal information. Users should also understand how the data collected from their interactions is used. If, for some reason, the AI holds incorrect information about a particular person or group, there should be a mechanism for users to help correct this once identified. This is important for both accuracy and to be respectful of all users and groups. Common uses for retrieving and analyzing user data include improving user interaction, personalizing responses, and system optimization. It is extremely important for developers to ensure the data is not used for purposes that users have not consented to, such as unsolicited marketing.
Storage and Security.
Developers should know where the user interaction data is stored and what security measures are in place to protect it from unauthorized access or breaches. This includes encryption, secure servers, and data protection protocols. It is extremely important to determine if agent data is shared with third parties and under what conditions. This should be transparent and typically requires user consent.
Data Deletion and Retention.
It is also important for users to understand how long user data is stored and how users can request its deletion. Many data protection laws give users the right to be forgotten, meaning they can request their data be erased. AI agents must adhere to data protection laws like GDPR in the EU or CCPA in California. These laws govern data handling practices and user rights regarding their personal data.
Data Portability and Privacy Policy.
Furthermore, developers must create the AI agent’s privacy policy to document and explain to users how their data is handled. This should detail data collection, usage, storage, and user rights. Developers should ensure that they obtain user consent for data collection, especially for sensitive information. Users typically have the option to opt-out or limit the data they provide. In some jurisdictions, users may even have the right to request a copy of their data in a format that can be transferred to another service provider.
Anonymization.
For data used in broader analysis or AI training, it should ideally be anonymized to protect individual identities. Developers must understand how their AI agent retrieves and uses historical user data during interactions. This could be for personalization or improving response relevance.
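As one deliberately simple illustration of the anonymization mentioned above, interaction logs can replace raw user identifiers with salted hashes before the data is used for analysis or training; the field names and salting scheme below are assumptions, and production systems typically require stronger pseudonymization and governance.

```python
# A minimal, hypothetical sketch of pseudonymizing interaction logs before
# they are used for analysis or training: raw user IDs are replaced with
# salted hashes so records can be linked without exposing identities.

import hashlib


def pseudonymize_record(record: dict, salt: str) -> dict:
    cleaned = dict(record)
    user_id = str(cleaned.pop("user_id", ""))
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    cleaned["user_pseudonym"] = digest[:16]   # truncated hash as a stable pseudonym
    return cleaned


if __name__ == "__main__":
    log = {"user_id": "alice@example.com", "query": "book a table", "timestamp": "2024-01-01"}
    print(pseudonymize_record(log, salt="per-deployment-secret"))
```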
In summary, understanding data privacy for AI agents involves being aware of how user data is collected, used, stored, and protected, and ensuring that users understand their rights regarding accessing, correcting, and deleting their data. Awareness of the mechanisms for data retrieval, both by users and the AI agent, is also crucial for a comprehensive understanding of data privacy.
2.2.4 Interpretability and Explainability
Imitation Learning Decoupling.
Agents are typically trained using a continuous feedback loop in Reinforcement Learning (RL) or Imitation Learning (IL), starting with a randomly initialized policy. However, this approach faces challenges in obtaining initial rewards in unfamiliar environments, particularly when rewards are sparse or only available at the end of a long-step interaction.
Thus, a superior solution is to use an infinite-memory agent trained through IL, which can learn policies from expert data, improving exploration and utilization of unseen environmental space with an emergent infrastructure, as shown in Fig. 3. Expert characteristics help the agent explore better and utilize the unseen environmental space, and the Agent AI can learn policies and new paradigm flows directly from expert data.
Traditional IL has an agent mimicking an expert demonstrator’s behavior to learn a policy. However, learning the expert policy directly may not always be the best approach, as the agent may not generalize well to unseen situations. To tackle this, we propose learning an agent with an in-context prompt or an implicit reward function that captures key aspects of the expert’s behavior, as shown in Fig. 3. This equips the infinite-memory agent with physical-world behavior data for task execution, learned from expert demonstrations. It helps overcome existing imitation learning drawbacks such as the need for extensive expert data and potential errors in complex tasks.
The key idea behind the Agent AI has two parts: 1) the infinite agent that collects physical-world expert demonstrations as state-action pairs and 2) the virtual environment that imitates the agent generator. The imitating agent produces actions that mimic the expert’s behavior, while the agent learns a policy mapping from states to actions by minimizing a loss function of the disparity between the expert’s actions and the actions generated by the learned policy.
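In its simplest form, the setup above, learning a policy that maps states to actions by minimizing the disparity between expert actions and policy actions, reduces to behavior cloning. The sketch below shows that reduction with a small PyTorch-style training loop; the dimensions, network, and synthetic demonstrations are placeholders, not the architecture proposed in this paper.

```python
# A hedged behavior-cloning sketch of the imitation setup described above:
# the policy is trained to minimize the disparity between expert actions and
# the actions produced by the learned policy on the same states.

import torch
import torch.nn as nn

state_dim, action_dim = 16, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # measures the disparity between expert and policy actions

# Placeholder expert demonstrations, stored as state-action pairs.
expert_states = torch.randn(256, state_dim)
expert_actions = torch.randn(256, action_dim)

for epoch in range(100):
    predicted_actions = policy(expert_states)
    loss = loss_fn(predicted_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```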
Decoupling Generalization.
Rather than relying on a task-specific reward function, the agent learns from expert demonstrations, which provide a diverse set of state-action pairs covering various task aspects. The agent then learns a policy that maps states to actions by imitating the expert’s behavior. Decoupling in imitation learning refers to separating the learning process from the task-specific reward function, allowing the policy to generalize across different tasks without explicit reliance on the task-specific reward function. By decoupling, the agent can learn from expert demonstrations and learn a policy that is adaptable to a variety of situations. Decoupling enables transfer learning, where a policy learned in one domain can adapt to others with minimal fine-tuning. By learning a general policy that is not tied to a specific reward function, the agent can leverage the knowledge it acquired in one task to perform well in other related tasks. Since the agent does not rely on a specific reward function, it can adapt to changes in the reward function or environment without the need for significant retraining. This makes the learned policy more robust and generalizable across different environments. Decoupling in this context refers to the separation of two tasks in the learning process: learning the reward function and learning the optimal policy.
Figure 3: Example of an emergent interaction mechanism in which the agent identifies, from candidate texts, the content relevant to an image. The task integrates external world information by using multimodal AI agents sourced from the web together with human-annotated knowledge interaction samples.
Generalization Emergent Behavior.
Generalization explains how emergent properties or behaviors can arise from simpler components or rules. The key idea lies in identifying the basic elements or rules that govern the behavior of the system, such as individual neurons or basic algorithms, and then observing how these simple components or rules interact with one another. The interactions of these components often lead to the emergence of complex behaviors, which are not predictable by examining individual components alone. Generalization across different levels of complexity allows a system to learn general principles applicable across these levels, leading to emergent properties. This enables the system to adapt to new situations, demonstrating the emergence of more complex behaviors from simpler rules. Furthermore, the ability to generalize across different complexity levels facilitates knowledge transfer from one domain to another, which contributes to the emergence of complex behaviors in new contexts as the system adapts.
2.2.5 Inference Augmentation
The inference ability of an AI agent lies in its capacity to interpret, predict, and respond based on its training and input data. While these capabilities are advanced and continually improving, it’s important to recognize their limitations and the influence of the underlying data they are trained on. Particularly, in the context of large language models, it refers to the capacity to draw conclusions, make predictions, and generate responses based on the data the model has been trained on and the input it receives. Inference augmentation in AI agents refers to enhancing the AI’s natural inference abilities with additional tools, techniques, or data to improve its performance, accuracy, and utility. This can be particularly important in complex decision-making scenarios or when dealing with nuanced or specialized content. We describe particularly important sources for inference augmentation below:
Data Enrichment.
Incorporating additional, often external, data sources to provide more context or background can help the AI agent make more informed inferences, especially in areas where its training data may be limited. For example, AI agents can infer meaning from the context of a conversation or text. They analyze the given information and use it to understand the intent and relevant details of user queries. These models are proficient at recognizing patterns in data. They use this ability to make inferences about language, user behavior, or other relevant phenomena based on the patterns they’ve learned during training.
Algorithm Enhancement.
Improving the AI’s underlying algorithms to make better inferences. This could involve using more advanced machine learning models, integrating different types of AI (like combining NLP with image recognition), or updating algorithms to better handle complex tasks. Inference in language models involves understanding and generating human language. This includes grasping nuances like tone, intent, and the subtleties of different linguistic constructions.
Human-in-the-Loop (HITL).
Involving human input to augment the AI’s inferences can be particularly useful in areas where human judgment is crucial, such as ethical considerations, creative tasks, or ambiguous scenarios. Humans can provide guidance, correct errors, or offer insights that the agent would not be able to infer on its own.
Real-Time Feedback Integration.
Using real-time feedback from users or the environment to enhance inferences is another promising method for improving performance during inference. For example, an AI might adjust its recommendations based on live user responses or changing conditions in a dynamic system. Or, if the agent is taking actions in a simulated environment that break certain rules, the agent can be dynamically given feedback to help correct itself.
Cross-Domain Knowledge Transfer.
Leveraging knowledge or models from one domain to improve inferences in another can be particularly helpful when producing outputs within a specialized discipline. For instance, techniques developed for language translation might be applied to code generation, or insights from medical diagnostics could enhance predictive maintenance in machinery.
Customization for Specific Use Cases.
Tailoring the AI’s inference capabilities for particular applications or industries can involve training the AI on specialized datasets or fine-tuning its models to better suit specific tasks, such as legal analysis, medical diagnosis, or financial forecasting. Since the particular language or information within one domain can greatly contrast with the language from other domains, it can be beneficial to finetune the agent on domain-specific information.
Ethical and Bias Considerations.
It is important to ensure that the augmentation process does not introduce new biases or ethical issues. This involves careful consideration of the sources of additional data or the impact of the new inference augmentation algorithms on fairness and transparency. When making inferences, especially about sensitive topics, AI agents must sometimes navigate ethical considerations. This involves avoiding harmful stereotypes, respecting privacy, and ensuring fairness.
Continuous Learning and Adaptation.
Regularly updating and refining the AI’s capabilities to keep up with new developments, changing data landscapes, and evolving user needs.
In summary, inference augmentation in AI agents involves methods by which their natural inference abilities can be enhanced through additional data, improved algorithms, human input, and other techniques. Depending on the use case, this augmentation is often essential for dealing with complex tasks and ensuring accuracy in the agent’s outputs.
2.2.6 Regulation
Recently, Agent AI has made significant advancements, and its integration into embodied systems has opened new possibilities for interacting with agents via more immersive, dynamic, and engaging experiences. To expedite the process and ease the cumbersome work of agent AI development, we propose developing a next-generation AI-empowered pipeline for agent interaction: a human-machine collaboration system in which humans and machines can communicate and interact meaningfully. The system can leverage the dialog capabilities and vast repertoire of actions of LLMs or VLMs to talk with human players and identify their needs. It then performs appropriate actions to help human players upon request.
When employing LLMs/VLMs for a human-machine collaboration system, it is essential to note that these models operate as black boxes and generate unpredictable output. This uncertainty can become crucial in a physical setup, such as operating actual robots. An approach to address this challenge is to constrain the focus of the LLM/VLM through prompt engineering. For instance, in robotic task planning from instructions, providing environmental information within the prompt has been reported to yield more stable outputs than relying solely on text Gramopadhye and Szafir (2022). This finding is supported by Minsky’s frame theory of AI Minsky (1975), which suggests that the problem space to be solved by LLMs/VLMs is defined by the given prompts. Another approach is designing prompts that make LLMs/VLMs include explanatory text, allowing users to understand what the model has focused on or recognized. Additionally, implementing a higher layer that allows for pre-execution verification and modification under human guidance can facilitate the operation of systems working under such guidance (Fig. 4).
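One lightweight way to apply the prompt-engineering constraints described above is to serialize the environment state into the prompt and request a fixed-format plan plus a short explanation that a human can verify before execution. The prompt template, skill list, and `llm` callable below are assumptions for illustration and do not reproduce the prompts of the cited works.

```python
# Hedged sketch: constrain an LLM/VLM task planner by embedding environment
# information in the prompt and requesting an explanation for human review
# before any plan is executed (pre-execution verification).

from typing import Callable, List


def build_planning_prompt(instruction: str, objects: List[str], skills: List[str]) -> str:
    return (
        "You are a robot task planner.\n"
        f"Objects visible in the environment: {', '.join(objects)}\n"
        f"Available skills: {', '.join(skills)}\n"
        f"Instruction: {instruction}\n"
        "Output a numbered list of skill calls using only the objects and skills above, "
        "followed by one sentence explaining what you focused on."
    )


def plan_with_review(llm: Callable[[str], str], instruction: str,
                     objects: List[str], skills: List[str]) -> str:
    plan = llm(build_planning_prompt(instruction, objects, skills))
    print("Proposed plan (requires human approval before execution):\n", plan)
    return plan
```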
Figure 4: A robot teaching system developed by Wake et al. (2023c). (Left) The workflow of the system, which consists of three steps: task planning, in which ChatGPT plans robotic tasks from instructions and environmental information, and action demonstration, in which the user visually demonstrates the action sequence. All steps are reviewed by the user, and if any step fails or is insufficient, the preceding steps can be re-executed as needed. (Right) A web application that supports uploading demonstration data and enables interaction between the user and ChatGPT.
2.3 Agent AI for Emergent Abilities
Despite the growing adoption of interactive agent AI systems, the majority of proposed methods still face a challenge in terms of their generalization performance in unseen environments or scenarios.
Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new. To address this issue, we build interactive agents that leverage the knowledge-memory of general-purpose foundation models (ChatGPT, Dall-E, GPT-4, etc.) for a novel scenario, specifically for generating a collaboration space between humans and agents. We discover an emergent mechanism, which we name Mixed Reality with Knowledge Inference Interaction, that facilitates collaboration with humans to solve challenging tasks in complex real-world environments and enables the exploration of unseen environments for adaptation to virtual reality. For this mechanism, the agent learns i) micro-reactions in cross-modality: collecting relevant individual knowledge for each interaction task (e.g., understanding unseen scenes) from explicit web sources and by implicitly inferring from the output of pretrained models; and ii) macro-behaviors in a reality-agnostic manner: improving interactive dimensions and patterns in the language and multi-modality domains, and making changes based on characterized roles, certain target variables, and the resulting diversification of collaborative information in mixed reality and LLMs. We investigate the task of knowledge-guided, interactive, and synergistic collaborative scene generation by combining various OpenAI models, and show promising results on how the interactive agent system can further boost large foundation models in our setting. It integrates and improves the depth of generalization, awareness, and interpretability of complex adaptive AI systems.
3 Agent AI Paradigm 3 智能体 AI 范式
In this section, we discuss a new paradigm and framework for training Agent AI. We seek to accomplish several goals with our proposed framework:
本节我们将探讨训练智能体 AI 的新范式与框架。该框架旨在实现以下目标:
• Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with an understanding of important modalities, such as text or visual inputs.
利用现有预训练模型及策略,使智能体快速掌握文本、视觉等重要模态的理解能力。
• Support sufficient long-term task-planning capabilities.
支持长期任务规划能力的充分实现。
• Incorporate a framework for memory that allows learned knowledge to be encoded and retrieved later.
构建可编码和检索习得知识的内存框架。
• Allow environmental feedback to be used to effectively train the agent to learn which actions to take.
通过环境反馈有效训练智能体的决策能力。
We show a high-level diagram of the new agent paradigm, outlining the important submodules of such a system, in Fig. 5.
图 5 展示了该系统的顶层架构图,概述了重要子模块。
图 5:我们提出的多模态通用智能体新范式。如图所示包含 5 个核心模块:1)具备任务规划与技能观察的环境感知系统;2)智能体学习模块;3)记忆模块;4)智能体行为模块;5)认知模块。
3.1 LLMs and VLMs 3.1 大语言模型与视觉语言模型
We can use LLMs or VLMs to bootstrap the components of the agent as shown in Fig. 5. In particular, LLMs have been shown to perform well for task-planning Gong et al. (2023a), contain significant world knowledge Yu et al. (2023b), and display impressive logical reasoning capabilities Creswell et al. (2022). Additionally, VLMs such as CLIP Radford et al. (2021) provide a general, language-aligned visual encoder as well as zero-shot visual recognition capabilities. For example, state-of-the-art open-source multi-modal models such as LLaVA Liu et al. (2023c) and InstructBLIP Dai et al. (2023) rely upon frozen CLIP models as visual encoders.
如图 5 所示,我们可以采用 LLM 或 VLM 模型来启动智能体组件。特别是 LLMs 已展现出卓越的任务规划能力(Gong 等人,2023a)、丰富的世界知识(Yu 等人,2023b)和出色的逻辑推理能力(Creswell 等人,2022)。此外,诸如 CLIP(Radford 等人,2021)等 VLM 模型不仅提供语言对齐的通用视觉编码器,还具备零样本视觉识别能力。例如 LLaVA(Liu 等人,2023c)和 InstructBLIP(Dai 等人,2023)等顶尖开源多模态模型均采用冻结的 CLIP 模型作为视觉编码器。
图 6:当前通过结合大型语言模型(LLM)与大型视觉模型(LVM)构建多模态 AI 智能体的范式。这些模型通常接收视觉或语言输入,使用预训练且冻结的视觉/语言模型,通过学习连接多模态的小型子网络实现跨模态交互。典型代表包括 Flamingo(Alayrac 等人,2022)、BLIP-2(Li 等人,2023c)、InstructBLIP(Dai 等人,2023)和 LLaVA(Liu 等人,2023c)。
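As a hedged illustration of the zero-shot recognition capability described above, the sketch below scores an image against a few candidate text labels with a frozen CLIP checkpoint through the Hugging Face transformers library; the image path and label set are placeholders.

```python
# Zero-shot visual recognition with a frozen, language-aligned CLIP encoder.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kitchen_scene.jpg")            # placeholder: any local image
candidate_labels = ["a photo of a mug", "a photo of a kettle", "a photo of a robot arm"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores turned into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(candidate_labels, probs[0].tolist())))
```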
3.2 Agent Transformer Definition
3.2 智能体 Transformer 定义
Instead of using frozen LLMs and VLMs for the AI agent, it is also possible to use a single agent transformer model that takes visual tokens and language tokens as input, similar to Gato Reed et al. (2022). In addition to vision and language, we add a third general type of input, which we denote as agent tokens. Conceptually, agent tokens are used to reserve a specific subspace of the input and output space of the model for agentic behaviors. For robotics or game playing, this may be represented as the input action space of the controller. Agent tokens can also be used when training agents to use specific tools, such as image-generation or image-editing models, or for other API calls. As shown in Fig. 7, we can combine the agent tokens with visual and language tokens to generate a unified interface for training multi-modal agent AI. Compared to using large, proprietary LLMs as agents, there are several advantages to using an agent transformer. Firstly, the model can be easily customized to very specific agentic tasks that may be difficult to represent in natural language (e.g., controller inputs or other specific actions); thus, the agent can learn from environmental interactions and domain-specific data to improve performance. Secondly, it can be easier to understand why the model does or does not take specific actions by having access to the probabilities of the agent tokens. Thirdly, certain domains, such as healthcare and law, have strict data privacy requirements that an agent transformer is better suited to satisfy. Finally, a relatively small agent transformer can potentially be significantly cheaper than a larger proprietary language model.
除了为 AI 智能体使用固定的 LLMs 和 VLMs 外,也可以采用类似 Gato Reed 等人(2022)提出的单智能体 Transformer 模型,该模型同时接收视觉标记和语言标记作为输入。除视觉与语言模态外,我们引入了第三种通用输入类型——智能体标记。从概念上讲,智能体标记用于在模型的输入输出空间中预留特定子空间,以承载智能体行为。对于机器人控制或游戏场景,这可以表现为控制器的输入动作空间。在训练智能体使用特定工具(如图像生成/编辑模型)或进行 API 调用时,亦可运用智能体标记。如图 7 所示,通过将智能体标记与视觉、语言标记相结合,我们能构建用于训练多模态智能体 AI 的统一接口。相较于采用大型专有 LLMs 作为智能体,使用智能体 Transformer 具备多重优势:其一,该模型可轻松定制于那些难以用自然语言描述的特定智能体任务(如控制器输入或其他特定动作),从而通过环境交互和领域数据提升性能;其二,借助智能体标记的概率分布,能更直观理解模型采取或放弃特定动作的决策依据;其三,在医疗、法律等对数据隐私有严格要求的领域更具适用性;最后,较小规模的智能体 Transformer 在成本效益上显著优于大型专有语言模型。
图 7:统一智能体多模态 Transformer 模型。我们提出了一种端到端的统一训练范式用于智能体系统,而非连接冻结子模块并使用现有基础模型作为构建块。如图 6 所示,我们仍可以 LLMs 和 LVMs 初始化子模块,但同时利用智能体令牌——这些专用令牌用于训练模型在特定领域(如机器人学)中执行自主行为。关于智能体令牌的更多细节,请参阅第 3.2 节
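The following is a minimal, illustrative sketch (not the architecture proposed here) of how a single transformer could expose a unified token interface in which a reserved slice of the vocabulary serves as agent tokens; all vocabulary sizes, offsets, and hyperparameters are assumptions for demonstration.

```python
# Unified token interface: text, visual, and reserved agent tokens share one vocabulary.
import torch
import torch.nn as nn

class AgentTransformer(nn.Module):
    def __init__(self, text_vocab=32000, num_visual_tokens=8192,
                 num_agent_tokens=256, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Agent tokens occupy a reserved slice at the end of the shared vocabulary.
        self.vocab_size = text_vocab + num_visual_tokens + num_agent_tokens
        self.agent_token_offset = text_vocab + num_visual_tokens
        self.embed = nn.Embedding(self.vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # The output head predicts the next token, including agent (action) tokens.
        self.head = nn.Linear(d_model, self.vocab_size)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.head(h)            # logits over text, visual, and agent tokens

model = AgentTransformer()
tokens = torch.randint(0, model.vocab_size, (1, 16))          # mixed-modality token sequence
agent_logits = model(tokens)[:, -1, model.agent_token_offset:]  # logits for the agent-token slice
```

Reading off the logits over the reserved agent-token slice is what makes the agent's action choices directly inspectable, as noted in the second advantage above.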
3.3 Agent Transformer Creation
3.3 智能体 Transformer 创建
As shown above in Fig. 5, we can use the new agent paradigm with LLM- and VLM-bootstrapped agents, as well as leverage data generated from large foundation models, to train the agent transformer model to learn to execute specific goals. Within this process, the agent model is trained to be specialized and tailored for specific tasks and domains. This approach allows one to leverage a pre-existing foundation model’s learned features and knowledge. We show a simplified overview of the process in two steps below:
如图 5 所示,我们可以采用基于 LLM 和视觉语言模型(VLM)引导的新型智能体范式,同时利用大型基础模型生成的数据来训练智能体 Transformer 模型,使其学会执行特定目标。在此过程中,智能体模型通过训练可专精于特定任务和领域。这种方法能充分利用预训练基础模型已习得的特征与知识。以下我们通过两个步骤简要说明该流程:
Define Objectives within the Domain.
定义领域目标
In order to train the agent transformer, the objectives and the action space of the agent within the context of each specific environment need to be clearly defined. This includes determining which specific tasks or actions the agent needs to perform and assigning a unique agent token for each. Furthermore, any automatic rules or procedures that can be used to identify successful completion of tasks can significantly increase the amount of data available for training. Otherwise, foundation-model-generated or human-annotated data will be required for training the model. Once the data has been collected and it is possible to evaluate the performance of the agent, the process of continuous improvement can begin.
为训练智能体 Transformer,需要明确定义其在特定环境背景下的目标与行动空间。这包括确定智能体需执行的具体任务或动作,并为每个动作分配专属标记符。此外,建立用于识别任务成功完成的自动化规则或程序,能显著增加可用训练数据量。否则,就需要依赖基础模型生成或人工标注的数据进行训练。当数据收集完毕且能评估智能体性能后,即可启动持续改进流程。
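A small sketch of this first step, under assumed names, might look as follows: the domain's action space is enumerated, each action is assigned a reserved agent token, and an automatic success-check rule is attached to a task so that completed episodes can be labeled without human annotation.

```python
# Illustrative only: domain actions, their reserved agent tokens, and a success rule.
from dataclasses import dataclass
from typing import Callable

ACTIONS = ["pick", "place", "open_gripper", "close_gripper", "move_to"]
AGENT_TOKEN_OFFSET = 50_000                     # assumed start of the reserved token range
ACTION_TO_TOKEN = {a: AGENT_TOKEN_OFFSET + i for i, a in enumerate(ACTIONS)}

@dataclass
class TaskSpec:
    name: str
    instruction: str
    success_check: Callable[[dict], bool]       # rule applied to the final environment state

# Example: the task succeeds if the cup ends up on the shelf.
put_away_task = TaskSpec(
    name="put_cup_on_shelf",
    instruction="Put the cup on the shelf.",
    success_check=lambda state: state.get("cup_location") == "shelf",
)

print(ACTION_TO_TOKEN["pick"], put_away_task.success_check({"cup_location": "shelf"}))
```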
Continuous Improvement. 持续改进
Continuous monitoring of the model’s performance and collection of feedback are essential steps in the process. Feedback should be used for further fine-tuning and updates. It is also crucial to ensure that the model does not perpetuate biases or unethical outcomes. This necessitates a careful examination of the training data, regular checks for biases in outputs, and, if needed, training the model to recognize and avoid biases. Once the model achieves satisfactory performance, it can be deployed for the intended application. Continuous monitoring remains vital to ensure that the model performs as expected and to facilitate necessary adjustments. More details on this process, sources of training data, and continuous learning for agent AI can be found in Section 8.
持续监控模型表现与收集反馈是该流程的关键环节。反馈数据应用于进一步的微调与模型更新。同时必须确保模型不会延续偏见或产生不道德结果,这要求对训练数据进行严格审查、定期检测输出偏见,必要时还需训练模型识别并规避偏见。当模型达到满意性能后,即可部署至目标应用场景。持续的监控对于保障预期表现和及时调整仍然至关重要。更多关于该流程、训练数据来源及智能体 AI 持续学习的细节详见第 8 章。
4 Agent AI Learning 4 智能体 AI 学习
4.1 Strategy and Mechanism
4.1 策略与机制
Our strategy for interactive AI across different domains extends the paradigm of calling large foundation models by training an agent that actively collects user feedback, action information, and useful knowledge for generation and interaction. In some cases, the LLM/VLM models do not need to be retrained; instead, we improve their performance by providing the agent with improved contextual prompts at test time. On the other hand, the strategy generally involves interactive knowledge/reasoning/commonsense/inference modeling through a combination of three systems: one performing knowledge retrieval from multi-modal queries, a second performing interactive generation from the relevant agent, and a third training a new, more informative model via self-supervised training or pre-training with reinforcement learning or improved imitation learning.
跨领域交互式 AI 策略扩展了调用大型基础模型的范式,通过训练主动收集用户反馈、行为信息和实用知识的智能体来实现。某些情况下,LLM/VLM 模型无需重新训练,我们通过为测试阶段的智能体提供优化后的上下文提示来提升其表现。另一方面,该策略始终涉及知识/推理/常识/推断的交互式建模,通过三重系统组合实现:第一个系统执行多模态查询的知识检索,第二个系统由相关智能体进行交互式生成,最后一个系统则通过强化学习或改进的模仿学习方式进行信息量更大的自监督训练或预训练。
4.1.1 Reinforcement Learning (RL)
4.1.1 强化学习(RL)
There is a rich history of leveraging reinforcement learning (RL) to train interactive agents that exhibit intelligent behaviors. RL is a methodology for learning the optimal relationship between states and actions based on rewards (or penalties) received as a result of the agent's actions. RL is a highly scalable framework that has been applied to numerous applications, including robotics; however, it generally faces several challenges, and LLM/VLMs have shown their potential to mitigate or overcome some of those difficulties:
利用强化学习(RL)训练具有智能行为的交互式智能体已有丰富研究历史。RL 是通过行为导致的奖励(或惩罚)来学习状态与动作间最优关系的方法论。作为高度可扩展的框架,RL 已应用于机器人等众多领域,但通常面临若干挑战。当前领先的 LLM/VLM 模型已展现出缓解或克服部分困难的潜力:
• Reward design. The efficiency of policy learning greatly depends on the design of the reward function. Designing the reward function requires not only knowledge of RL algorithms but also a deep understanding of the nature of the task, and thus often necessitates crafting the function based on expert experience. Several studies have explored the use of LLM/VLMs for designing reward functions Yu et al. (2023a); Katara et al. (2023); Ma et al. (2023).
奖励设计 策略学习效率极大依赖于奖励函数的设计。设计奖励函数不仅需要 RL 算法知识,还需深刻理解任务本质,因此往往需基于专家经验构建。多项研究探索了使用 LLM/VLM 设计奖励函数的方法(Yu 等人,2023a;Katara 等人,2023;Ma 等人,2023)。
• Data collection and efficiency. Given its exploratory nature, RL-based policy learning requires a significant amount of data Padalkar et al. (2023). The necessity for extensive data becomes particularly evident when the policy involves managing long sequences or integrating complex actions, because these scenarios demand more nuanced decision-making and learning from a wider range of situations. In recent studies, efforts have been directed towards enhancing data generation to support policy learning Kumar et al. (2023); Du et al. (2023). Additionally, in some studies, these models have been integrated into the reward function to improve policy learning Sontakke et al. (2023). Parallel to these developments, another strand of research has focused on achieving parameter efficiency in learning processes using VLMs Tang et al. (2023); Li et al. (2023d) and LLMs Shi et al. (2023).
数据收集与效率 基于强化学习的策略学习因其探索性质需要大量数据(Padalkar 等人,2023)。当策略涉及管理长序列或整合复杂动作时,对海量数据的需求尤为明显,因为这些场景需要更精细的决策能力,并要求从更广泛的情境中学习。近期研究致力于通过增强数据生成来支持策略学习(Kumar 等人,2023;Du 等人,2023)。部分研究还将这些模型整合至奖励函数中以优化策略学习(Sontakke 等人,2023)。与此同时,另一研究方向聚焦于利用视觉语言模型(Tang 等人,2023;Li 等人,2023d)和 LLMs(Shi 等人,2023)实现学习过程的参数高效化。
• Long-horizon steps. Related to the issue of data efficiency, RL becomes more challenging as the length of action sequences increases. This is due to the ambiguity in the relationship between actions and rewards, known as the credit assignment problem, and the increase in the number of states to be explored, both of which necessitate a significant amount of time and data. One typical approach for long and complex tasks is to break them down into a sequence of subgoals and apply pretrained policies to solve each subgoal (e.g., Takamatsu et al. (2022)). This idea falls within the framework called task and motion planning (TAMP) Garrett et al. (2021). TAMP is composed of two primary components: task planning, which entails identifying sequences of high-level actions, and motion planning, which involves finding physically consistent, collision-free trajectories to achieve the objectives of the task plan.
长时程步骤 关于数据效率问题,随着动作序列长度的增加,强化学习面临更大挑战。这源于动作与奖励关联的模糊性(即信用分配问题),以及需探索状态数量的增长,需要耗费大量时间和数据。针对长周期复杂任务的典型解决方案是将其分解为一系列子目标,并应用预训练策略逐个解决(如 Takamatsu 等人 2022 年研究)。这一思路属于任务与运动规划(TAMP)框架范畴(Garrett 等人 2021 年)。TAMP 由两大核心组件构成:任务规划(确定高层动作序列)和运动规划(寻找物理合理、无碰撞的轨迹以实现任务规划目标)。
LLMs are well-suited to TAMP, and recent research has often adopted an approach where LLMs are used to execute high-level task planning, while low-level controls are addressed with RL-based policies Xu et al. (2023); Sun et al. (2023a); Li et al. (2023b); Parakh et al. (2023). The advanced capabilities of LLMs enable them to effectively decompose even abstract instructions into subgoals Wake et al. (2023c), contributing to the enhancement of language understanding abilities in robotic systems.
LLMs 天然适配 TAMP 框架,近期研究多采用 LLMs 执行高层任务规划,同时用基于强化学习的策略处理底层控制(Xu 等人 2023;Sun 等人 2023a;Li 等人 2023b;Parakh 等人 2023)。LLMs 的先进能力使其能有效将抽象指令分解为子目标(Wake 等人 2023c),从而提升机器人系统的语言理解能力。
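A minimal sketch of this division of labor, with placeholder components rather than any specific system's API, is shown below: an LLM decomposes an instruction into subgoals, and each subgoal is dispatched to a pretrained low-level policy.

```python
# High-level planning with an LLM, low-level execution with pretrained skills (illustrative).

def llm_decompose(instruction: str) -> list[str]:
    """Placeholder: prompt an LLM to return an ordered list of subgoals."""
    # In practice this would be a chat-completion call with the instruction and scene context.
    return ["move_to(cup)", "grasp(cup)", "move_to(shelf)", "release(cup)"]

class LowLevelPolicy:
    """Placeholder for an RL-trained skill (e.g., a grasping or navigation policy)."""
    def __init__(self, skill: str):
        self.skill = skill
    def run(self, subgoal: str, env) -> bool:
        print(f"[{self.skill}] executing {subgoal}")
        return True                              # would return task success from the environment

POLICIES = {
    "move_to": LowLevelPolicy("navigation"),
    "grasp": LowLevelPolicy("grasping"),
    "release": LowLevelPolicy("release"),
}

def run_task(instruction: str, env=None) -> bool:
    for subgoal in llm_decompose(instruction):
        skill = subgoal.split("(")[0]            # map each subgoal to its low-level policy
        if not POLICIES[skill].run(subgoal, env):
            return False                         # replanning could be triggered here
    return True

run_task("Put the cup on the shelf.")
```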
4.1.2 Imitation Learning (IL)
4.1.2 模仿学习(IL)
While RL aims to train a policy based on exploratory behavior and maximizing rewards through interactions with the environment, imitation learning (IL) seeks to leverage expert data to mimic the actions of experienced agents or experts. For example, in robotics, one of the major frameworks based on IL is Behavioral Cloning (BC). BC is an approach where a robot is trained to mimic the actions of an expert by directly copying them. In this approach, the expert’s actions in performing specific tasks are recorded, and the robot is trained to replicate these actions in similar situations. Recent BC-based methods often incorporate technologies from LLM/VLMs, enabling more advanced end-to-end models. For example, Brohan et al. proposed RT-1 Brohan et al. (2022) and RT-2 Brohan et al. (2023), transformer-based models that output an action sequence for the base and arm, taking a series of images and language as input. These models are reported to show high generalization performance as a result of training on large amounts of data.
强化学习(RL)旨在通过与环境交互的探索行为来训练策略并最大化奖励,而模仿学习(IL)则试图利用专家数据来模仿经验丰富的智能体或专家的行为。例如在机器人领域,基于 IL 的主要框架之一是行为克隆(BC)。该方法通过直接复制专家动作来训练机器人进行模仿:专家执行特定任务时的动作被记录下来,机器人被训练在类似情境中复现这些动作。近期基于 BC 的方法常整合 LLM/VLMs 技术,实现了更先进的端到端模型。例如 Brohan 等人提出的 RT-1(2022)和 RT-2(2023)是基于 Transformer 的模型,以系列图像和语言作为输入,输出底座和机械臂的动作序列。据报道,这些模型通过海量训练数据展现出优异的泛化性能。
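To illustrate the core of behavioral cloning, the following compact sketch trains a policy with supervised learning to reproduce expert actions from observations; the network size, dimensions, and random stand-in demonstrations are purely illustrative.

```python
# Behavioral cloning: supervised regression from observations to expert actions.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7                       # e.g., encoded image + proprioception, arm command
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                         # continuous actions; use cross-entropy if discrete

# Stand-in for recorded expert demonstrations as (observation, action) pairs.
expert_obs = torch.randn(1024, obs_dim)
expert_act = torch.randn(1024, act_dim)

for epoch in range(10):
    pred = policy(expert_obs)
    loss = loss_fn(pred, expert_act)           # imitate the expert's recorded action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final BC loss:", loss.item())
```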
4.1.3 Traditional RGB 4.1.3 传统 RGB 技术
Learning intelligent agent behavior leveraging image inputs has been of interest for many years Mnih et al. (2015). The inherent challenge of using RGB input is the curse of dimensionality. To solve this problem, researchers either use more data Jang et al. (2022); Ha et al. (2023) or introduce inductive biases into the model design to improve sample efficiency. In particular, several authors incorporate 3D structures into the model architecture for manipulation tasks Zeng et al. (2021); Shridhar et al. (2023); Goyal et al. (2023); James and Davison (2022). For robot navigation, Chaplot et al. (2020a, b) leverage maps as a representation. Maps can either be learned by a neural network that aggregates all previous RGB inputs or constructed through 3D reconstruction methods such as Neural Radiance Fields Rosinol et al. (2022).
利用图像输入学习智能体行为多年来备受关注(Mnih 等人,2015)。使用 RGB 输入的本质挑战在于维度灾难。为解决该问题,研究者要么使用更多数据(Jang 等人,2022;Ha 等人,2023),要么在模型设计中引入归纳偏置以提高样本效率。特别是有些作者将 3D 结构整合到模型架构中以处理操作任务(Zeng 等人,2021;Shridhar 等人,2023;Goyal 等人,2023;James 和 Davison,2022)。在机器人导航领域,Chaplot 等人(2020a,b)利用地图作为表征方式——地图既可通过神经网络聚合所有历史 RGB 输入来学习,也能通过 Neural Radiance Fields 等 3D 重建方法获得(Rosinol 等人,2022)。
To obtain more data, researchers synthesize synthetic data using graphics simulators Mu et al. (2021); Gong et al. (2023b), and try to close the sim2real gap Tobin et al. (2017); Sadeghi and Levine (2016); Peng et al. (2018). Recently, there has been a collective effort to curate large-scale datasets that aim to resolve the data scarcity problem Padalkar et al. (2023); Brohan et al. (2023). On the other hand, to improve sample efficiency, data augmentation techniques have also been extensively studied Zeng et al. (2021); Rao et al. (2020); Haarnoja et al. (2023); Lifshitz et al. (2023).
为获取更多数据,研究者使用图形模拟器合成人工数据(Mu 等人,2021;Gong 等人,2023b),并试图缩小仿真与现实间的差距(Tobin 等人,2017;Sadeghi 和 Levine,2016;Peng 等人,2018)。近期学界开始协力构建旨在解决数据稀缺问题的大规模数据集(Padalkar 等人,2023;Brohan 等人,2023)。另一方面,为提高样本复杂度,数据增强技术也得到广泛研究(Zeng 等人,2021;Rao 等人,2020;Haarnoja 等人,2023;Lifshitz 等人,2023)。
4.1.4 In-context Learning 4.1.4 上下文学习
In-context learning was shown to be an effective method for solving tasks in NLP with the advent of large language models like GPT-3 Brown et al. (2020); Min et al. (2022). Few-shot prompts were seen to be an effective way to contextualize model outputs across a variety of tasks in NLP by providing examples of the task within the context of the LLM prompt. Factors like the diversity and quality of the examples shown in the in-context demonstrations may improve the quality of model outputs An et al. (2023); Dong et al. (2022). Within the context of multi-modal foundation models, models like Flamingo and BLIP-2 Alayrac et al. (2022); Li et al. (2023c) have been shown to be effective at a variety of visual understanding tasks when given only a small number of examples. In-context learning can be further improved for agents within environments by incorporating environment-specific feedback when certain actions are taken Gong et al. (2023a).
随着 GPT-3 等大型语言模型的出现,上下文学习已被证明是 NLP 任务的有效解决方法(Brown 等,2020;Min 等,2022)。少量示例提示能通过 LLM 上下文中提供的任务样本,使模型输出适配各类 NLP 任务。研究表明,上下文演示中示例的多样性与质量等因素可提升模型输出质量(An 等,2023;Dong 等,2022)。在多模态基础模型领域,Flamingo 和 BLIP-2 等模型(Alayrac 等,2022;Li 等,2023c)仅需少量示例即可高效完成多种视觉理解任务。通过引入环境特异性行为反馈(Gong 等,2023a),可进一步优化环境中智能体的上下文学习能力。
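As an illustration of few-shot, environment-grounded prompting, the sketch below assembles a prompt from in-context demonstrations, the current observation, and feedback from the previous action; the example tasks, actions, and formatting are hypothetical placeholders rather than any particular system's prompt.

```python
# Few-shot prompt construction with environment-specific feedback folded into the context.
FEW_SHOT_EXAMPLES = [
    {"observation": "mug on table, drawer closed", "instruction": "put the mug away",
     "action": "open(drawer)"},
    {"observation": "mug on table, drawer open", "instruction": "put the mug away",
     "action": "place(mug, drawer)"},
]

def build_prompt(observation: str, instruction: str, feedback: str = "") -> str:
    lines = ["You control a household robot. Choose the next action."]
    for ex in FEW_SHOT_EXAMPLES:                       # in-context demonstrations
        lines.append(f"Observation: {ex['observation']}\nInstruction: {ex['instruction']}\n"
                     f"Action: {ex['action']}")
    lines.append(f"Observation: {observation}\nInstruction: {instruction}")
    if feedback:
        # Environment feedback on the previous action, so the model can adjust its next step.
        lines.append(f"Feedback on previous action: {feedback}")
    lines.append("Action:")
    return "\n\n".join(lines)

# The resulting string would be sent to any LLM/VLM chat endpoint.
print(build_prompt("mug on table, drawer closed", "put the mug away",
                   feedback="drawer handle is out of reach"))
```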
4.1.5 Optimization in the Agent System
4.1.5 智能体系统优化
The optimization of agent systems can be divided into spatial and temporal aspects. Spatial optimization considers how agents operate within a physical space to execute tasks. This includes inter-robot coordination, resource allocation, and keeping an organized space.
智能体系统优化可分为空间与时间两个维度。空间优化关注智能体在物理环境中执行任务的方式,包括机器人间协同、资源分配及空间组织管理。
In order to effectively optimize agent AI systems, especially systems with large numbers of agents acting in parallel, previous works have focused on using large batch reinforcement learning Shacklett et al. (2023). Since datasets of multi-agent interactions for specific tasks are rare, self-play reinforcement learning enables a team of agents to improve over time. However, this may also lead to very brittle agents that can only work under self-play and not with humans or other independent agents since they over-fit to the self-play training paradigm. To address this issue, we can instead discover a diverse set of conventions Cui et al. (2023); Sarkar et al. (2023), and train an agent that is aware of a wide range of conventions. Foundation models can further help to establish conventions with humans or other independent agents, enabling smooth coordination with new agents.
为有效优化 Agent AI 系统(尤其是多智能体并行系统),前人研究主要采用大批量强化学习(Shacklett 等,2023)。由于特定任务的多智能体交互数据集稀缺,自我对弈强化学习能使智能体团队持续进化,但可能导致智能体过度适应训练范式而无法与人类或其他独立智能体协作。针对此问题,可采用多样化策略发现(Cui 等,2023;Sarkar 等,2023)来训练具备广谱策略认知的智能体。基础模型能进一步帮助建立人机或智能体间协作规范,实现与新智能体的无缝协调。
Temporal optimization, on the other hand, focuses on how agents execute tasks over time. This encompasses task scheduling, sequencing, and timeline efficiency. For instance, optimizing the trajectory of a robot’s arm is an example of efficiently optimizing movement between consecutive tasks Zhou et al. (2023c). At the level of task scheduling, methods like LLM-DP Dagan et al. (2023) and ReAct Yao et al. (2023a) have been proposed to solve efficient task planning by incorporating environmental factors interactively.
时间优化则侧重于智能体如何随时间推移执行任务,这包括任务调度、序列安排和时间线效率。例如,优化机械臂的运动轨迹就是高效协调连续任务间动作的典型案例(Zhou 等人,2023c)。在任务调度层面,LLM-DP(Dagan 等人,2023)和 ReAct(Yao 等人,2023a)等方法通过动态融合环境因素,提出了高效的交互式任务规划解决方案。
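A simplified sketch of such interactive planning, loosely in the spirit of ReAct Yao et al. (2023a) rather than its exact prompt format, is shown below: the model alternates reasoning and acting, and each observation is fed back into the context so environmental factors inform the next step. The `llm_step` and `environment_step` functions are placeholders.

```python
# An interleaved reason-act-observe loop for temporally extended task planning (illustrative).

def llm_step(context: str) -> str:
    """Placeholder: ask the LLM for the next 'Thought: ... Action: ...' pair."""
    raise NotImplementedError

def environment_step(action: str) -> str:
    """Placeholder: execute the action in the environment and return an observation string."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 8) -> str:
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm_step(context)                 # e.g., "Thought: ... Action: open(drawer)"
        context += step + "\n"
        if "Action: finish" in step:             # assumed termination convention
            break
        action = step.split("Action:")[-1].strip()
        observation = environment_step(action)
        context += f"Observation: {observation}\n"   # feedback becomes part of the next prompt
    return context
```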
4.2 Agent Systems (zero-shot and few-shot level)
4.2 智能体系统(零样本与小样本层级)
4.2.1 Agent Modules 4.2.1 智能体模块
Our foray into the agent paradigm involves the development of Agent AI "Modules" for interactive multi-modal agents using LLMs or VLMs. Our initial Agent Modules facilitate training or in-context learning and adopt a minimalist design for the purpose of demonstrating the agent’s ability to schedule and coordinate effectively. We also explored initial prompt-based memory techniques that facilitate better planning and inform future action approaches within the domain. To illustrate, our "MindAgent" infrastructure comprises 5 main modules: 1) environment perception with task planning, 2) agent learning, 3) memory, 4) general agent action prediction, and 5) cognition, as shown in Figure 5.
我们对智能体范式的探索涉及开发基于 LLMs 或 VLMs 的多模态交互式 Agent AI"模块"。初始的智能体模块采用极简设计,支持训练或上下文学习,旨在展示智能体有效调度与协调的能力。我们还研究了基于提示的初级记忆技术,这些技术能促进领域内更优的规划并为未来行动方法提供依据。以"MindAgent"架构为例,其包含五大核心模块:1) 带任务规划的环境感知,2) 智能体学习,3) 记忆,4) 通用智能体行为预测,以及 5) 认知系统,如图 5 所示。
4.2.2 Agent Infrastructure
4.2.2 智能体基础设施
Agent-based AI is a large and fast-growing community within the domains of entertainment, research, and industry. The development of large foundation models has significantly improved the performance of agent AI systems. However, creating agents in this vein is limited by the increasing effort necessary to create high-quality datasets and by the overall cost. At Microsoft, building high-quality agent infrastructure has significantly impacted multi-modal agent copilots by using advanced hardware, diverse data sources, and powerful software libraries. As Microsoft continues to push the boundaries of agent technology, AI agent platforms are poised to remain a dominant force in the world of multimodal intelligence for years to come. Nevertheless, agent AI interaction is currently still a complex process that requires a combination of multiple skills. The recent advancements in the space of large generative AI models have the potential to greatly reduce the current high cost and time required for interactive content, both for large studios and by empowering smaller independent content creators to design high-quality experiences beyond what they are currently capable of. The current human-machine interaction systems inside multi-modal agents are primarily rule-based. They do exhibit intelligent behaviors in response to human/user actions and possess web knowledge to some extent. However, these interactions are often limited by the software development costs required to enable specific behaviors in the system. In addition, current models are not designed to help humans achieve a goal when users are unable to accomplish specific tasks. Therefore, there is a need for agent AI system infrastructure that can analyze user behaviors and provide proper support when needed.
基于智能体的 AI 在娱乐、研究和工业领域已形成庞大且快速发展的生态。大型基础模型的发展显著提升了 Agent AI 系统的性能,但这类智能体的创建正面临高质量数据集构建成本攀升的限制。微软通过先进硬件、多元数据源和强大软件库构建的高质量智能体基础设施,对多模态智能体协作者系统产生了深远影响。随着微软持续突破智能体技术边界,AI 智能体平台有望在未来多年保持多模态智能领域的主导地位。然而当前 Agent AI 交互仍是需要多技能结合的复杂过程——大型生成式 AI 模型的最新进展,既能帮助大型工作室大幅降低交互内容的高成本与耗时,也能赋能独立创作者突破现有能力边界设计高质量体验。现有多模态智能体中的人机交互系统主要基于规则设计,虽能响应用户行为并具备一定网络知识,但其交互能力常受限于实现特定系统行为的开发成本。更重要的是,当用户无法完成特定任务时,现有模型缺乏辅助达成目标的机制,因此亟需能分析用户行为并在必要时提供支持的 Agent AI 系统基础设施。
4.3 Agentic Foundation Models (pretraining and finetune level)
4.3 智能体基础模型(预训练与微调层面)
The use of pre-trained foundation models offers a significant advantage in their wide applicability across diverse use cases. The integration of these models enables the development of customized solutions for various applications, circumventing the need for extensive labeled datasets for each specific task.
预训练基础模型的优势在于其广泛的跨场景适用性。这些模型的整合应用能够为多样化场景开发定制化解决方案,避免了为每个特定任务构建大规模标注数据集的必要性。
A notable example in the field of navigation is the LM-Nav system Shah et al. (2023a), which incorporates GPT-3 and CLIP in a novel approach. It effectively uses textual landmarks generated by the language model, anchoring them in images acquired by robots for navigation. This method demonstrates a seamless fusion of textual and visual data, significantly enhancing the capabilities of robotic navigation, while maintaining wide applicability.
导航领域的代表性案例是 LM-Nav 系统(Shah 等人,2023a),其创新性地融合了 GPT-3 与 CLIP 模型。该系统通过语言模型生成文本地标,并将其锚定至机器人采集的环境图像中实现导航。这种方法展现了文本与视觉数据的无缝融合,在保持广泛适用性的同时显著提升了机器人导航能力。
In robot manipulation, several studies have proposed the use of off-the-shelf LLMs (e.g., ChatGPT) together with open-vocabulary object detectors. The combination of LLMs and advanced object detectors (e.g., Detic Zhou et al. (2022)) facilitates the understanding of human instructions while grounding the textual information in scene information Parakh et al. (2023). Furthermore, the latest advancements showcase the potential of using prompt engineering with advanced multi-modal models such as GPT-4V(ision) Wake et al. (2023b). This technique opens avenues for multi-modal task planning, underscoring the versatility and adaptability of pre-trained models in a variety of contexts.
在机器人操作领域,多项研究提出结合商用 LLM(如 ChatGPT)与开放词汇目标检测器的方案。LLM 与先进目标检测器(如 Detic,Zhou 等人 2022)的协同工作,既能理解人类指令,又能将文本信息映射至场景要素(Parakh 等人,2023)。最新进展还展示了通过提示工程调用 GPT-4V(ision)等多模态模型的潜力(Wake 等人,2023b),该技术为多模态任务规划开辟了新途径,凸显预训练模型在不同情境下的适应能力。
5 Agent AI Categorization 5 智能体 AI 分类体系
5.1 Generalist Agent Areas
5.1 通用型智能体领域
Computer-based action and generalist agents (GAs) are useful for many tasks. Recent progress in the field of large foundation models and interactive AI has enabled new functionalities for GAs. However, for a GA to become truly valuable to its users, it must be natural to interact with, and generalize to a broad range of contexts and modalities.
We provide extended, in-depth chapters on agent foundation AI in Sec. 6, especially in areas relevant to the general themes of these topics:
基于计算机的动作智能体与通用型智能体(GA)可胜任多种任务。大型基础模型和交互式 AI 领域的最新进展为通用型智能体赋予了新功能。但要使通用型智能体真正具有实用价值,必须具备自然的交互能力,并能适应广泛的应用场景与模态。我们将在第 6 章重点扩展论述智能体基础 AI,特别是涉及以下主题的相关领域:
Multimodal Agent AI (MMA) is an upcoming forum (current URL: https://multimodalagentai.github.io/) for our research and industry communities to engage with each other and with the broader research and technology communities in Agent AI. Recent progress in the field of large foundation models and interactive AI has enabled new functionalities for generalist agents (GAs), such as predicting user actions and task planning in constrained settings (e.g., MindAgent Gong et al. (2023a), fine-grained multimodal video understanding Luo et al. (2022), robotics Ahn et al. (2022b); Brohan et al. (2023)), or providing a chat companion for users that incorporates knowledge feedback (e.g., website customer support for healthcare systems Peng et al. (2023)). More details about representative and most recent works are given below.
We hope to discuss our vision for the future of MMA and inspire future researchers to work in this space.
This article and our forum cover, but are not limited exclusively to, the following main topics:
多模态智能体 AI(MMA)是我们为学界与产业界打造的交流平台 111Current URL: https://multimodalagentai.github.io/ ,旨在促进智能体 AI 领域与更广泛科研技术社区的互动。大型基础模型和交互式 AI 的进展已为通用型智能体(GA)带来新功能,例如在受限环境中预测用户行为与任务规划(如 MindAgent Gong 等 2023a,细粒度多模态视频理解 Luo 等 2022,机器人技术 Ahn 等 2022b;Brohan 等 2023),或整合知识反馈的聊天伴侣(如医疗健康系统的网站客户支持 Peng 等 2023)。下文将展示代表性研究成果的最新进展。我们期望通过本文讨论 MMA 的未来愿景,激励更多研究者投身该领域。本文及论坛涵盖但不限于以下主题:
• Primary Subject Topics: Multimodal Agent AI, General Agent AI
核心主题:多模态智能体 AI,通用型智能体 AI
• Secondary Subject Topics: Embodied Agents, Action Agents, Language-based Agents, Vision & Language Agents, Knowledge and Inference Agents, Agents for Gaming, Robotics, Healthcare, etc.
次级主题:具身智能体,动作智能体,基于语言的智能体,视觉与语言智能体,知识与推理智能体,游戏/机器人/医疗等领域智能体
• Extended Subject Topics: Visual Navigation, Simulation Environments, Rearrangement, Agentic Foundation Models, VR/AR/MR, Embodied Vision & Language.
扩展主题:视觉导航,仿真环境,场景重组,智能体基础模型,虚拟/增强/混合现实,具身视觉与语言
Next, we present a specific list of representative agent categories, as follows:
接下来我们将具体列举代表性智能体类别如下:
5.2 Embodied Agents 5.2 具身智能体
Our biological minds live in bodies, and our bodies move through a changing world. The goal of embodied artificial intelligence is to create agents, such as robots, which learn to creatively solve challenging tasks requiring interaction with the environment. While this is a significant challenge, important advances in deep learning and the increasing availability of large datasets like ImageNet have enabled superhuman performance on a variety of AI tasks previously thought intractable. Computer vision, speech recognition, and natural language processing have experienced transformative revolutions at passive input-output tasks like language translation and image classification, and reinforcement learning has similarly achieved world-class performance at interactive tasks like game playing. These advances have supercharged embodied AI, enabling a growing collection of users to make rapid progress towards intelligent agents that can interact with machines.
人类智能依托于生物躯体,在动态环境中实现移动与交互。具身人工智能的目标正是创造能通过环境交互创造性解决复杂任务的智能体(如机器人)。尽管面临重大挑战,但深度学习的突破性进展与 ImageNet 等大型数据集的普及,已使 AI 在诸多原认为不可解的领域实现超人类表现。计算机视觉、语音识别和自然语言处理已在翻译、图像分类等被动输入输出任务中完成变革性突破,强化学习同样在游戏等交互任务中达到世界级水平。这些进步为具身 AI 注入强劲动力,使越来越多的研究者得以快速推进能与机器交互的智能体开发。
5.2.1 Action Agents 5.2.1 行动代理
Action agents refer to agents that need to execute physical actions in a simulated physical environment or the real world. In particular, they need to actively engage in activities with the environment. We broadly classify action agents into two different categories based on their application domains: gaming AI and robotics.
动作智能体是指在模拟物理环境或现实世界中需要执行物理动作的智能体。这类智能体需主动与环境进行互动。我们根据应用领域将动作智能体大致分为两类:游戏人工智能和机器人技术。
In gaming AI, the agents will interact with the game environment and other independent entities. In these settings, natural language can enable smooth communication between agents and humans. Depending on the game, there may be a specific task to accomplish, providing a true reward signal. For instance, in the competitive Diplomacy game, training a language model using human conversation data along with an action policy with RL enables human-level play Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022).
在游戏 AI 中,智能体将与游戏环境及其他独立实体互动。这类场景中,自然语言能实现智能体与人类的流畅沟通。根据游戏类型,可能需要完成特定任务以获取真实奖励信号。例如在竞技游戏《外交》中,通过结合人类对话数据训练语言模型,并采用强化学习策略,可实现人类水平的游戏表现(Meta 基础 AI 研究外交团队等,2022)。
There are also settings where agents act as normal residents in a town Park et al. (2023a), without trying to optimize a specific goal. Foundation models are useful in these settings because they can model interactions that appear more natural by mimicking human behavior. When augmented with external memory, they produce convincing agents that can have conversations, follow daily schedules, form relationships, and have a virtual life.
在某些场景中,我们让智能体作为普通居民生活在虚拟城镇里(Park 等人,2023a),而无需优化特定目标。基础模型在此类场景中极具价值,因为它们能通过模拟人类行为来呈现更自然的交互。当辅以外置记忆功能时,这些模型能生成具备对话能力、日常作息、社交关系乃至完整虚拟人生的可信智能体。
5.2.2 Interactive Agents 5.2.2 交互代理
Interactive agents simply refer to agents that can interact with the world, a broader class of agents than action agents. Their forms of interaction do not necessarily require physical actions, but may involve communicating information to users or modifying the environment. For instance, an embodied interactive agent may answer a user’s questions about a topic through dialogue or help users parse through existing information similar to a chatbot. By extending an agent’s capabilities to include information sharing, the core designs and algorithms of Agent AI can be effectively adapted for a range of applications, such as diagnostic Lee et al. (2023) and knowledge-retrieval Peng et al. (2023) agents.
交互式智能体简指能与世界互动的智能体,这是比行动智能体更广泛的类别。它们的互动形式不一定需要物理动作,可能包括向用户传递信息或修改环境。例如,一个具身交互智能体可以通过对话回答用户关于某个主题的疑问,或像聊天机器人那样帮助用户解析现有信息。通过扩展智能体的信息共享能力,Agent AI 的核心设计与算法可有效适配多种应用场景,如诊断类(Lee 等人,2023)和知识检索类(Peng 等人,2023)智能体。
5.3 Simulation and Environments Agents
5.3 仿真与环境智能体
An effective approach for AI agents to learn how to act in an environment is to go through trial-and-error experiences via interactions with the environment. A representative method is RL, which requires extensive experience of failures to train an agent. Although there exist approaches that use physical agents Kalashnikov et al. (2018), using physical agents is time-consuming and costly. Furthermore, training in the physical environment is often infeasible when failure in actual environments can be dangerous (e.g., autonomous driving, underwater vehicles). Hence, using simulators to learn policies is a common approach.
AI 智能体学习环境行为策略的有效方法是通过与环境互动的试错过程。代表性方法是强化学习(RL),该方法需要大量失败经验来训练智能体。虽然存在使用实体智能体的方法(Kalashnikov 等人,2018),但实体智能体耗时且成本高昂。此外,当实际环境中的失败可能造成危险时(如自动驾驶、水下航行器),物理环境中的训练往往难以实施。因此,使用仿真器学习策略成为常见方案。
Many simulation platforms have been proposed for research in embodied AI, ranging from navigation Tsoi et al. (2022); Deitke et al. (2020); Kolve et al. (2017) to object manipulation Wang et al. (2023d); Mees et al. (2022); Yang et al. (2023a); Ehsani et al. (2021). One example is Habitat Savva et al. (2019); Szot et al. (2021), which provides a 3D indoor environment where human and robotic agents can perform various tasks such as navigation, instruction following, and question answering. Another representative simulation platform is VirtualHome Puig et al. (2018), supporting human avatars for object manipulation in 3D indoor environments. In the field of gaming, Carroll et al. have introduced "Overcooked-AI," a benchmark environment designed to study cooperative tasks between humans and AI Carroll et al. (2019). Along similar lines, several works aim to incorporate real human intervention beyond the focus of interaction between agents and the environment Puig et al. (2023); Li et al. (2021a); Srivastava et al. (2022). These simulators contribute to policy learning in practical settings involving agent and robot interactions, as well as to IL-based policy learning that utilizes human demonstrations.
目前已有许多仿真平台支持具身 AI 研究,涵盖导航(Tsoi 等人,2022;Deitke 等人,2020;Kolve 等人,2017)到物体操控(Wang 等人,2023d;Mees 等人,2022;Yang 等人,2023a;Ehsani 等人,2021)。典型代表是 Habitat(Savva 等人,2019;Szot 等人,2021),该平台提供 3D 室内环境,人类和机器人智能体可执行导航、指令跟随、问答等任务。另一代表性平台 VirtualHome(Puig 等人,2018)支持人类化身在 3D 室内环境中操控物体。游戏领域,Carroll 等人开发的"Overcooked-AI"(Carroll 等人,2019)成为研究人机协作任务的基准环境。类似地,部分研究致力于在智能体与环境互动之外融入真实人类干预(Puig 等人,2023;Li 等人,2021a;Srivastava 等人,2022)。这些仿真器促进了涉及智能体与机器人交互的实际场景中的策略学习,以及基于模仿学习(IL)的人类示范动作策略学习。
In certain scenarios, the process of learning a policy may necessitate the integration of specialized features within simulators. For example, in the case of learning image-based policies, realistic rendering is often required to facilitate adaptability to real environments Mittal et al. (2023); Zhong et al. (2023). Utilizing a realistic rendering engine is effective for generating images that reflect various conditions, such as lighting environments. Moreover, simulators employing physics engines are required to simulate physical interactions with objects Liu and Negrut (2021). The integration of physics engines in simulation has been shown to facilitate the acquisition of skills that are applicable in real-world scenarios Saito et al. (2023).
某些场景中,策略学习过程可能需要集成仿真器的专门功能。例如学习基于图像的策略时,常需真实感渲染以提升现实环境适应能力(Mittal 等人,2023;Zhong 等人,2023)。采用真实感渲染引擎能有效生成反映光照等多样条件的图像。此外,模拟物体物理交互需要搭载物理引擎的仿真器(Liu 和 Negrut,2021)。研究表明,物理引擎的集成有助于获取适用于现实场景的技能(Saito 等人,2023)。
5.4 Generative Agents 5.4 生成型智能体
The recent advancements in the space of large generative AI models have the potential to greatly reduce the current high cost and time required for interactive content, both for large gaming studios, as well as empower smaller independent studios to create high quality experiences beyond what they are currently capable of. Additionally, embedding large AI models within a sandbox environment will allow users to author their own experiences and express their creativity in ways that are currently out of reach.
大型生成式 AI 模型领域的最新进展,有望大幅降低当前互动内容创作的高成本与耗时问题——既能为大型游戏工作室赋能,也可帮助小型独立工作室突破现有能力边界,打造更高质量的内容体验。此外,将大型 AI 模型嵌入沙盒环境后,用户将能以当前技术无法实现的方式自主创作内容并释放创造力。
The goals of this agent go beyond simply adding interactive 3D content to scenes; they also include:
该智能体的目标不仅限于为场景添加交互式 3D 内容,还包括:
• Adding arbitrary behavior and rules of interaction to objects, allowing the user to create their own VR rules with minimal prompting.
为物体赋予任意交互行为与规则,让用户仅需简单指令即可创建个性化 VR 交互规则
• Generating whole level geometry from a sketch on a piece of paper, by using the multimodal GPT-4V model as well as other chains of models involving vision AI models.
通过多模态 GPT4-v 模型及视觉 AI 模型链,将纸面草图直接生成完整关卡几何结构
• Retexturing content in scenes using diffusion models.
运用扩散模型实现场景内容的实时材质替换
• Creating custom shaders and visual special effects from simple user prompts.
根据用户简易指令生成定制着色器与视觉特效
One potential application in the short term is a VR storyboarding/prototyping tool allowing a single user to create a rough (but functional) sketch of an experience/game an order of magnitude faster than currently feasible. Such a prototype could then be expanded and made more polished using these tools as well.
短期内的潜在应用是开发 VR 故事板/原型工具,使单人用户能以当前十倍速度创建可运行的体验/游戏草稿原型,后续还可利用这些工具进行扩展打磨。
5.4.1 AR/VR/mixed-reality Agents
5.4.1 AR/VR/混合现实智能体
AR/VR/mixed-reality (jointly referred to as XR) settings currently require skilled artists and animators to create characters, environments, and objects to be used to model interactions in virtual worlds. This is a costly process that involves concept art, 3D modeling, texturing, rigging, and animation. XR agents can assist in this process by facilitating interactions between creators and building tools to help build the final virtual environment.
当前 AR/VR/混合现实(统称 XR)领域需要专业美术师和动画师创建角色、场景与交互物体,这一过程涉及概念设计、3D 建模、贴图、骨骼绑定和动画制作等高成本环节。XR 智能体可通过促进创作者间协作、开发虚拟环境构建工具来优化这一流程。
Our early experiments have already demonstrated that GPT models can be used in the few-shot regime inside the Unity engine (without any additional fine-tuning) to call engine-specific methods, use API calls to download 3D models from the internet and place them into the scene, and assign state trees of behavior and animations to them Huang et al. (2023a). This behavior likely emerges due to the presence of similar code in open-source game repositories that use Unity. Therefore, GPT models are capable of building rich visual scenes by loading many objects into the scene from a simple user prompt.
我们早期的实验已经证明,GPT 模型可以在 Unity 引擎中以少量示例模式(无需任何额外微调)调用引擎特定方法,使用 API 调用从互联网下载 3D 模型并放置到场景中,还能为其分配行为状态树和动画(Huang 等人,2023a)。这种行为可能源于开源游戏仓库中存在使用 Unity 的类似代码。因此,GPT 模型能够根据简单的用户提示,通过向场景加载大量对象来构建丰富的视觉场景。
The aim of this category of agents is to build a platform and a set of tools that provide an efficient interface between large AI models (both GPT-family ones as well as diffusion image models) and a rendering engine. We explore two primary avenues here:
这类智能体的目标是构建一个平台和工具集,为大型 AI 模型(包括 GPT 系列模型和扩散图像模型)与渲染引擎之间提供高效接口。我们在此探索两条主要路径:
• Integration of large models into the various editor tools in the agent infrastructure, allowing for significant speedups in development.
将大型模型集成到智能体基础设施的各类编辑工具中,从而显著提升开发效率。
• Controlling the rendering engine from within a user experience by generating code that follows user instruction and then compiling it at runtime, allowing users to potentially edit the VR/simulation they are interacting with in arbitrary ways, even by introducing new agent mechanics.
通过在用户体验中生成遵循用户指令的代码并实时编译,实现对渲染引擎的操控。这使得用户能以任意方式(甚至通过引入新的智能体机制)编辑他们正在交互的 VR/模拟环境。
Introducing an AI copilot focused on XR settings would be useful for XR creators, who can use the copilot to complete tedious tasks, like providing simple assets or writing code boilerplate, freeing creators to focus on their creative vision and quickly iterate on ideas.
为 XR 创作者引入专注于扩展现实场景的 AI 协作者将大有裨益。创作者可利用该协作者完成繁琐任务(如提供简单素材或编写代码模板),从而解放精力专注于创意构想并快速迭代想法。
Furthermore, agents can help users interactively modify the environment by adding new assets, changing the dynamics of the environment, or building new settings. This form of dynamic generation during runtime can also be specified by a creator, enabling the user’s experience to feel fresh and continue evolving over time.
此外,智能体还能帮助用户通过添加新素材、改变环境动态或构建新场景来交互式修改环境。这种运行时动态生成模式也可由创作者预先设定,使用户体验保持新鲜感并持续进化。
5.5 Knowledge and Logical Inference Agents
5.5 知识与逻辑推理代理
The capacity to infer and apply knowledge is a defining feature of human cognition, particularly evident in complex tasks such as logical deduction and understanding theory of mind (see https://plato.stanford.edu/entries/cognitive-science). Making inferences on knowledge ensures that the AI’s responses and actions are consistent with known facts and logical principles. This coherence is a crucial mechanism for maintaining trust and reliability in AI systems, especially in critical applications like medical diagnosis or legal analysis. Here, we introduce agents that incorporate the interplay between knowledge and inference, addressing specific facets of intelligence and reasoning.
推理和应用知识的能力是人类认知的显著特征,这一点在逻辑演绎和心理理论理解等复杂任务中尤为明显 222https://plato.stanford.edu/entries/cognitive-science 。基于知识进行推理能确保人工智能的响应和行为与已知事实及逻辑原则保持一致,这种连贯性是维持 AI 系统可信度与可靠性的关键机制,在医疗诊断或法律分析等关键应用中尤为重要。在此,我们介绍一种融合知识与推理交互的智能体,它们能够处理智能与推理的特定维度。
5.5.1 Knowledge Agent 5.5.1 知识代理
Knowledge Agents reason over their acquired knowledge systems in two directions: implicit and explicit. Implicit knowledge is typically what large-scale language models like the GPT series Brown et al. (2020); OpenAI (2023) encapsulate after being trained on vast amounts of text data. These models can generate responses that give the impression of understanding, as they draw on patterns and information implicitly learned during training. Explicit knowledge, conversely, is structured and can be directly queried, such as the information found in knowledge bases or databases, which was traditionally used to enhance AI reasoning capabilities by referencing verifiable external resources.
知识智能体通过隐性和显性两种方式对其掌握的知识体系进行推理。隐性知识通常是像 GPT 系列这样的大规模语言模型(Brown 等人,2020;OpenAI,2023)通过海量文本数据训练后所内化的内容。这些模型能生成看似理解语义的回应,因为它们利用了训练过程中隐性习得的模式与信息。相反,显性知识具有结构化特征且可直接查询,例如知识库或数据库中的信息——传统上正是通过引用可验证的外部资源来增强 AI 推理能力。
Despite the advancements in language models, their implicit knowledge is static and becomes outdated as the world evolves Lewis et al. (2020); Peng et al. (2023). This limitation necessitates the integration of explicit knowledge sources that are updated continuously, ensuring that AI systems can provide accurate and current responses. The fusion of implicit and explicit knowledge equips AI agents with a more nuanced understanding and the ability to apply knowledge contextually, akin to human intelligence Gao et al. (2022). Such integration is crucial for crafting knowledge-centric AI agents that not only possess information but can also understand, explain, and employ it, thereby narrowing the chasm between extensive learning and profound knowledge (Marcus and Davis, 2019; Gao et al., 2020). These agents are designed to reason with flexibility and dynamic information about the world, enhancing their robustness and adaptability (Marcus, 2020).
尽管语言模型取得了长足进步,但其隐性知识具有静态性,会随着世界发展而过时(Lewis 等人,2020;Peng 等人,2023)。这种局限性要求整合持续更新的显性知识源,以确保 AI 系统能提供准确且与时俱进的响应。隐性知识与显性知识的融合赋予 AI 智能体更细腻的认知能力,使其能像人类智能那样进行情境化知识应用(Gao 等人,2022)。这种集成对于构建以知识为中心的 AI 智能体至关重要——它们不仅掌握信息,更能理解、解释并运用知识,从而弥合广博学习与深刻认知之间的鸿沟(Marcus 和 Davis,2019;Gao 等人,2020)。这类智能体被设计成能灵活推理动态世界信息,从而增强其鲁棒性与适应性(Marcus,2020)。
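One common way to realize this fusion is retrieval augmentation: explicit, continuously updated facts are retrieved and prepended to the prompt so that the model's implicit knowledge is grounded in verifiable sources. The sketch below is illustrative only; the toy knowledge base, word-overlap retriever, and `llm` callable are assumptions, not a specific system's design.

```python
# Fusing implicit model knowledge with an explicit, updatable knowledge store (illustrative).
KNOWLEDGE_BASE = [
    {"fact": "Clinic X updated its triage protocol in March 2024.", "source": "internal wiki"},
    {"fact": "Drug Y is contraindicated with drug Z.", "source": "formulary database"},
]

def retrieve(query: str, k: int = 2) -> list:
    """Toy retriever: rank stored facts by word overlap with the query."""
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda e: -len(set(query.lower().split()) & set(e["fact"].lower().split())),
    )
    return scored[:k]

def answer_with_explicit_knowledge(question: str, llm) -> str:
    facts = retrieve(question)
    context = "\n".join(f"- {f['fact']} (source: {f['source']})" for f in facts)
    prompt = (
        "Use the facts below when answering; say so if they are insufficient.\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)          # llm is any callable mapping a prompt to a completion
```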
5.5.2 Logic Agents 5.5.2 逻辑智能体
Generally, a logic agent is a component of a system designed to apply logical reasoning to process data or solve tasks specific to logical inference or logical reasoning. Logic agents, within the context of large foundation models like GPT-4, refer to specialized components or submodules designed to handle logical reasoning tasks. These tasks often involve understanding and manipulating abstract concepts, deducing conclusions from given premises, or solving problems that require a structured, logical approach. Broadly, foundation models like GPT-4 are trained on a vast corpus of text data and learn to perform a wide range of tasks, including those that require some form of logical reasoning. Thus, their capability for logical reasoning is integrated into the overall architecture, and they generally do not possess a distinct, isolated "logic agent". While GPT-4 and similar models can perform tasks that involve logic, their approach is fundamentally different from how humans or traditional logic-based systems operate. They do not follow formal logical rules or have an explicit understanding of logic; rather, they generate responses based on patterns learned from the training data. As a result, their performance on logical tasks can be impressive, but it can also be inconsistent or limited by the nature of the training data and the inherent limitations of the model’s design. One example of embedding a separate logical submodule into the architecture is (Wang et al., 2023e), which modifies the token embedding process used by LLMs during pre-training by parsing text into logical segments and explicitly modeling logical hierarchies in the token embeddings.
通常而言,逻辑智能体是系统的一个组件,旨在运用逻辑推理来处理数据或解决与逻辑推断、逻辑推理相关的特定任务。在 GPT-4 等大型基础模型的语境下,逻辑智能体指的是专门处理逻辑推理任务的组件或子模块。这类任务通常涉及理解与操作抽象概念、从给定前提推导结论,或解决需要结构化逻辑方法的问题。广义上说,GPT-4 等基础模型通过海量文本数据训练获得多样化任务处理能力,包括需要某种形式逻辑推理的任务。因此,其逻辑推理能力已融入整体架构,通常不存在独立分离的"逻辑智能体"。虽然 GPT-4 及类似模型能执行涉及逻辑的任务,但其运作方式与人类或传统逻辑系统存在本质差异。这些模型不遵循形式逻辑规则,也不具备对逻辑的显式理解,而是基于训练数据中的模式生成响应。因此,它们在逻辑任务中的表现可能令人惊艳,但也可能因训练数据特性和模型设计固有的局限性而出现不一致或受限的情况。Wang 等人(2023e)的研究展示了在架构中嵌入独立逻辑子模块的实例:通过将文本解析为逻辑片段并在词嵌入中显式建模逻辑层级,改进了 LLMs 预训练阶段的词嵌入过程。
5.5.3 Agents for Emotional Reasoning
5.5.3 情感推理代理
Emotional understanding and empathy are important skills for agents in many human-machine interactions. To illustrate, one important goal for creating engaging dialogue agents is to have the agents act with increased emotion and empathy while minimizing socially inappropriate or offensive outputs. To advance towards this goal for dialogue agents, we released the Neural Image Commenting with Empathy (NICE) dataset Chen et al. (2021), consisting of almost two million images with the corresponding human-generated comments and a set of human emotion annotations. We also provided a novel pre-training model - Modeling Affect Generation for Image Comments (MAGIC) Chen et al. (2021) - which aims to generate comments for images, conditioned on linguistic representations that capture style and affect, and to help generate more empathetic, emotional, engaging, and socially appropriate comments. Our experiments show that the approach is effective in training a more human-like and engaging image comment agent. Developing empathy-aware agents is a promising direction for interactive agents, and it is important to create agents with emotional understanding capabilities across a wide range of groups and populations, especially considering that many current language models exhibit bias in their emotional understanding and empathetic reasoning capabilities Mao et al. (2022); Wake et al. (2023d).
情感理解与共情能力是许多人机交互场景中智能体的重要技能。以对话智能体为例,其核心目标之一是在最大限度减少社交失当或冒犯性输出的同时,增强情感表达与共情能力。为此,我们发布了神经共情图像评论数据集 NICE(Chen 等,2021),包含近两百万张图片及对应的人类评论与情感标注。同时提出了创新预训练模型 MAGIC(Chen 等,2021),该模型基于捕捉风格与情感的语言表征生成图像评论,有助于产出更具同理心、情感丰富且社交得体的内容。实验证明该方法能有效训练出更拟人化、更具吸引力的图像评论智能体。发展具备共情意识的智能体是交互代理的重要方向,尤其需注意当前许多语言模型在情感理解与共情推理方面存在偏见(Mao 等,2022;Wake 等,2023d),因此创建具有跨群体情感理解能力的代理至关重要。
5.5.4 Neuro-Symbolic Agents
5.5.4 神经符号智能体
Neuro-Symbolic agents operate on a hybrid system of neurons and symbols d’Avila Garcez and Lamb (2020). Solving problems stated in natural language is a challenging task because it requires explicitly capturing the discrete symbolic structural information implicit in the input. However, most general neural sequence models do not explicitly capture such structural information, limiting their performance on these tasks. Chen et al. (2020) propose a new encoder-decoder model based on a structured neural representation agent, TP-N2F. The encoder of TP-N2F employs TPR "binding" to encode natural-language symbolic structure in vector space, and the decoder uses TPR "unbinding" to generate, in symbolic space, a sequential program represented by relational tuples, each consisting of a relation (or operation) and a number of arguments.
神经符号智能体基于神经元与符号的混合系统运作(d’Avila Garcez 与 Lamb,2020)。解决自然语言表述的问题具有挑战性,因为这需要显式捕捉输入中隐含的离散符号结构信息。然而,大多数通用神经序列模型并未显式捕获此类结构信息,限制了其在这些任务上的表现。Chen 等人(2020)提出了一种基于结构化神经表征智能体的新型编码器-解码器模型。TP-N2F 的编码器采用 TPR"绑定"技术在向量空间中编码自然语言的符号结构,解码器则通过 TPR"解绑"在符号空间生成由关系元组构成的序列程序,每个元组包含一个关系(或操作)及若干参数。
Instruction-following vision-language (VL) models like GPT-4 offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. In (Park et al., 2023b), we build a Localized Visual Commonsense model which allows users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.
诸如 GPT-4 等遵循指令的视觉语言(VL)模型提供了灵活接口,支持以零样本方式处理多种多模态任务。但基于完整图像操作的接口无法让用户直接"指向"并访问图像中的特定区域。这种能力不仅对支持基于参照的 VL 基准测试至关重要,也对需要精确图像内推理的实际应用具有重要意义。Park 等人(2023b)构建的局部视觉常识模型允许用户指定(多个)输入区域。我们通过从大型语言模型(LLM)中采样局部常识知识进行训练:具体而言,我们提示 LLM 根据 VL 模型自动生成的全局字面图像描述和局部字面区域描述来收集常识知识。该流程具有可扩展性和全自动性,无需人工对齐或标注的图文配对数据。借助单独训练的选择高质量样本的评判模型,我们发现仅通过图像扩展的局部常识语料库训练,能有效蒸馏现有 VL 模型以支持参照输入接口。零样本设置下的实证结果与人工评估表明,相比传递生成参照表达的基线方法,我们的蒸馏方法能产生更精确的 VL 推理模型。
5.6 LLMs and VLMs Agent 5.6 LLMs 与 VLMs 智能体
A number of works leverage LLMs as agents to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), exploiting the LLMs’ internet-scale domain knowledge and zero-shot planning abilities to perform agentic tasks like planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, and then using a low-level controller to execute these subtasks. Additionally, Huang et al. (2022b), Liang et al. (2022), and Wang et al. (2023a) also incorporate environmental feedback to improve task performance. There have also been a number of works that demonstrate the ability of general-purpose, visually-aligned large language models trained on large-scale text, image, and video data to serve as a foundation for creating multi-modal agents that are embodied and can act in various environments Baker et al. (2022); Driess et al. (2023); Brohan et al. (2023).
多项研究利用 LLMs 作为智能体进行任务规划(Huang 等,2022a;Wang 等,2023b;Yao 等,2023a;Li 等,2023a),充分发挥其互联网级领域知识与零样本规划能力,执行包括规划推理在内的代理任务。近期机器人研究也通过 LLMs 将自然语言指令分解为自然语言或 Python 代码形式的子任务序列(Ahn 等,2022a;Huang 等,2022b;Liang 等,2022),再由底层控制器执行。此外,Huang 等(2022b)、Liang 等(2022)与 Wang 等(2023a)还引入环境反馈机制提升任务性能。另有研究表明,基于大规模文本、图像和视频数据训练的通用视觉对齐大语言模型,可作为构建具身多模态智能体的基础,使其能在多样化环境中行动(Baker 等,2022;Driess 等,2023;Brohan 等,2023)。
6 Agent AI Application Tasks
6 智能体 AI 应用任务
图 8:通过 GPT-4V 实现用户交互游戏动作预测与交互编辑的具身智能体,包含《我的世界:地下城》游戏感知模拟与生成功能。
6.1 Agents for Gaming 6.1 游戏代理
Games provide a unique sandbox to test the agentic behavior of LLMs and VLMs, pushing the boundaries of their collaborative and decision-making abilities. We describe three areas in particular that highlight agents' abilities to interact with human players and other agents, as well as their ability to take meaningful actions within an environment.
游戏为测试 LLMs 和 VLMs 的智能体行为提供了独特的沙盒环境,能够不断拓展其协作与决策能力的边界。我们特别阐述了三个关键领域,它们突显了智能体与人类玩家及其他智能体互动的能力,以及其在环境中采取有意义行动的潜力。
6.1.1 NPC Behavior 6.1.1 非玩家角色行为
In modern gaming systems, the behavior of Non-Player Characters (NPCs) is predominantly dictated by predefined scripts crafted by developers. These scripts encompass a range of reactions and interactions based on various triggers or player actions within the gaming environment. However, this scripted nature often results in predictable or repetitive NPC behavior which fails to evolve in response to player’s actions or the dynamic environment of the game. This rigidity hampers the immersive experience intended in a dynamic gaming environment. Therefore, there is a burgeoning interest in leveraging LLMs to induce autonomy and adaptability in NPC behavior, making interactions more nuanced and engaging. AI-driven NPCs can learn from player behavior, adapt to varying strategies, and provide a more challenging and less predictable gameplay experience. Large Language Models (LLMs) can significantly contribute to evolving NPC behavior in games. By processing vast amounts of text, LLMs can learn patterns and generate responses that are more varied and human-like. They can be utilized to create dynamic dialogue systems, making interactions with NPCs more engaging and less predictable. Furthermore, LLMs can be trained on player feedback and in-game data to continually refine NPC behaviors, making them more attuned to player expectations and game dynamics.
在现代游戏系统中,非玩家角色(NPC)的行为主要受开发者预设脚本的支配。这些脚本包含基于游戏环境中各种触发条件或玩家行为的一系列反应与互动机制。然而,这种脚本化的本质往往导致 NPC 行为呈现可预测性或重复性,无法根据玩家行为或游戏动态环境进行演进。这种僵化特性阻碍了动态游戏环境本该具有的沉浸式体验。因此,业界日益关注利用 LLMs 赋予 NPC 行为自主性与适应性,使互动更具层次感和吸引力。由 AI 驱动的 NPC 能够学习玩家行为模式,适应不同策略,从而提供更具挑战性且难以预测的游戏体验。大型语言模型(LLMs)对游戏 NPC 行为的演进具有重要价值:通过处理海量文本数据,LLMs 能学习行为模式并生成更丰富、更拟人化的反馈;可用于构建动态对话系统,使 NPC 互动更具吸引力且难以预料;此外,LLMs 还能基于玩家反馈和游戏内数据进行持续训练,不断优化 NPC 行为,使其更符合玩家预期和游戏动态。
6.1.2 Human-NPC Interaction
6.1.2 人类与非玩家角色互动
The interaction between human players and NPCs is a crucial aspect of the gaming experience. The conventional interaction paradigm is primarily one-dimensional, with NPCs reacting in a preset manner to player inputs. This limitation stifles the potential for a more organic and enriching interaction, akin to human-human interaction within the virtual realm. The advent of LLM and VLM technologies holds the promise of transforming this paradigm. By employing these technologies, gaming systems can analyze and learn from human behavior to provide more human-like interactions. This not only enhances the realism and engagement of the game but also provides a platform for exploring and understanding human-machine interaction in a controlled yet complex setting.
玩家与 NPC 之间的互动是游戏体验的关键要素。传统互动模式本质上是一维的,NPC 仅以预设方式对玩家输入做出反应。这种局限性抑制了虚拟世界中更自然、更丰富的互动潜力——这种潜力本应接近人与人之间的交互水平。LLM 和 VLM 技术的出现为改变这一范式带来了希望。通过运用这些技术,游戏系统可以分析并学习人类行为,从而提供更具人性化的互动。这不仅提升了游戏的真实感和沉浸感,也为在受控而复杂的环境中探索和理解人机交互提供了平台。
图 9:当提示中包含"动作历史"和"游戏目标"时,GPT-4V 能有效预测高层级的后续动作。此外,GPT-4V 准确识别出玩家手中持有木料,并能将这一感知信息纳入其未来行动规划。虽然 GPT-4V 似乎能预测某些低层级动作(如按'E'键打开物品栏),但该模型的输出本质上并不适用于原始低层级动作预测(包括鼠标移动),可能需要附加模块来实现低层级动作控制。
6.1.3 Agent-based Analysis of Gaming
6.1.3 基于代理的游戏分析
Gaming is an integral part of daily life, estimated to engage half of the world’s population (https://www.dfcint.com/global-video-game-audience-reaches-3-7-billion/). Additionally, it exhibits a positive impact on mental health (https://news.microsoft.com/source/features/work-life/mind-games-how-gaming-can-play-a-positive-role-in-mental-health/). However, contemporary game systems exhibit a deficiency in interactions with human players since their behaviors are primarily hand-crafted by game developers. These pre-programmed behaviors frequently fail to adapt to players’ needs. Consequently, there exists a need for new AI systems in games that can analyze player behaviors and furnish appropriate support when necessary. Intelligent interactive systems bear the potential to revolutionize how gamers interact with gaming systems in general. NPCs’ interactions with gamers are no longer confined by the restricted rule sets designed by game developers. They have the potential to adapt seamlessly to gamers’ experiences, providing timely feedback to enrich the gaming experience and elevate the synergy of human-machine interaction.
电子游戏已成为日常生活的重要组成部分,据估计全球半数人口参与其中 333https://www.dfcint.com/global-video-game-audience-reaches-3-7-billion/ 。此外,游戏对心理健康具有积极影响 444https://news.microsoft.com/source/features/work-life/mind-games-how-gaming-can-play-a-positive-role-in-mental-health/ 。然而,当代游戏系统存在与人类玩家互动的缺陷,因为其行为主要由游戏开发者手工设计。这些预设行为往往难以适应玩家需求。因此,游戏领域需要新型 AI 系统来分析玩家行为,并在必要时提供适当支持。智能交互系统有望彻底改变玩家与游戏系统的互动方式——非玩家角色(NPC)与玩家的互动不再受限于开发者设计的固定规则集,而是能够无缝适应玩家体验,通过实时反馈来丰富游戏乐趣并提升人机协同水平。
LLMs can serve as a robust tool for analyzing in-game text data, including chat logs, player feedback, and narrative content. They can help in identifying patterns of player behavior, preferences, and interactions which can be invaluable for game developers to improve game mechanics and narratives. Additionally, VLMs can parse through large quantities of image and video data from gaming sessions to help analyze user intent and actions within the game world. Moreover, LLMs and VLMs can facilitate the development of intelligent agents within games that can communicate with players and other agents in a sophisticated and human-like manner, enhancing the overall gaming experience. Beyond LLMs and VLMs, user input data provides a promising avenue for creating game-playing agents that model perception, game playing, and game understanding by imitating human players. By incorporating a combination of player interactions and feedback, pixel inputs, and natural language planning and understanding, agent models can assist in the continuous improvement of game dynamics, driving a more player-centric evolution of the gaming environment.
LLMs 可作为分析游戏内文本数据(包括聊天记录、玩家反馈和叙事内容)的强大工具,帮助识别玩家行为模式、偏好与互动特征,这些对开发者改进游戏机制和叙事极具价值。视觉语言模型(VLMs)则可解析游戏过程中产生的大量图像和视频数据,辅助分析玩家在游戏世界中的意图与行为。此外,LLMs 与 VLMs 能促进游戏内智能体开发,使其能以复杂拟人的方式与玩家及其他智能体交流,从而提升整体游戏体验。除这两种模型外,用户输入数据为创建游戏代理提供了新途径——通过模仿人类玩家来建模感知、游戏操作与游戏理解。结合玩家互动反馈、像素输入以及自然语言规划理解,智能体模型能持续优化游戏动态,推动游戏环境向以玩家为中心的方向演进。
6.1.4 Scene Synthesis for Gaming
6.1.4 游戏场景合成
图 10:未见过的 Minecraft 视频上的掩码视频预测。从左至右分别为:原始帧、掩码帧、重建帧以及带补丁的重建帧。
Scene synthesis is a vital component in the creation and enhancement of immersive gaming environments. It entails the automatic or semi-automatic generation of three-dimensional (3D) scenes and environments within a game. This process includes the generation of terrain, placement of objects, creation of realistic lighting, and sometimes even dynamic weather systems.
场景合成是创建和增强沉浸式游戏环境的核心环节,涉及游戏中三维场景与环境的全自动/半自动生成,包括地形生成、物体布局、真实光照创建,有时甚至包含动态天气系统构建。
Modern games often feature vast, open-world environments. Manually designing these landscapes can be incredibly time-consuming and resource-intensive. Automated terrain generation, often leveraging procedural or AI-driven techniques, can produce complex, realistic landscapes with less manual effort. LLMs and VLMs can utilize internet-scale knowledge to formulate rules for designing non-repeating landscapes that are visually impressive and unique. Additionally, LLMs and VLMs can be used to ensure the semantic consistency and variability of generated assets. Placing objects such as buildings, vegetation, and other elements within a scene in a realistic and aesthetically pleasing manner is crucial for immersion.
现代游戏常以广袤的开放世界为特色,而人工设计这些场景耗时耗力。采用程序化或 AI 驱动技术的自动化地形生成,能以较少人工投入创造出复杂逼真的地貌。LLMs 与 VLMs 可运用互联网规模知识制定规则,设计出视觉震撼且永不重复的地形景观。同时,这些模型还能确保生成资源的语义一致性与多样性。在场景中以真实美观的方式布置建筑、植被等元素,对维持沉浸感至关重要。
图 11:游戏《我的世界》场景中使用小型代理预训练模型进行低级下一步动作预测。
VLMs and LLMs can assist in object placement by adhering to predefined or learned rules and aesthetics, thus speeding up the level design process. VLMs and LLMs can be further trained to understand the principles of design and aesthetics, aiding in the procedural generation of content. They can help formulate rules or guidelines that procedural algorithms can follow to generate objects and scenes that are both visually appealing and contextually appropriate.
视觉语言模型(VLMs)和大型语言模型(LLMs)能够通过遵循预设或习得的规则与美学标准来辅助物体摆放,从而加速关卡设计流程。这些模型经过进一步训练后可以理解设计与美学原理,辅助内容的程序化生成。它们能帮助制定规则或指导方针,供程序化算法遵循以生成既具有视觉吸引力又符合情境需求的对象和场景。
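As an illustration of how this could work in practice, the sketch below assumes an LLM is asked, offline or at design time, to emit placement rules as JSON, which a conventional procedural placer then enforces; the rule schema and the propose_position and terrain_at helpers are hypothetical and not tied to any specific engine.

    import json
    import random

    # Rules of this form could be produced by prompting an LLM at design time,
    # e.g., "Emit JSON placement rules for a coastal village biome."
    RULES_JSON = """
    {
      "house": {"min_spacing": 8.0,  "allowed_terrain": ["grass", "dirt"]},
      "tree":  {"min_spacing": 3.0,  "allowed_terrain": ["grass"]},
      "boat":  {"min_spacing": 12.0, "allowed_terrain": ["water"]}
    }
    """
    RULES = json.loads(RULES_JSON)

    def propose_position(world_size):
        # Hypothetical sampler; a real engine would query its terrain system here.
        return (random.uniform(0, world_size), random.uniform(0, world_size))

    def terrain_at(pos):
        # Stand-in for an engine terrain lookup.
        return random.choice(["grass", "dirt", "water"])

    def place_objects(kind, count, world_size=100.0, max_tries=1000):
        rule, placed = RULES[kind], []
        for _ in range(max_tries):
            if len(placed) == count:
                break
            pos = propose_position(world_size)
            if terrain_at(pos) not in rule["allowed_terrain"]:
                continue
            too_close = any(((pos[0] - p[0]) ** 2 + (pos[1] - p[1]) ** 2) ** 0.5 < rule["min_spacing"]
                            for p in placed)
            if not too_close:
                placed.append(pos)
        return placed

    print(len(place_objects("house", 10)), "houses placed")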
Realistic lighting and atmospheric effects are fundamental for creating a believable and engaging gaming environment. Advanced algorithms can simulate natural lighting conditions and dynamic weather effects, enhancing the realism and mood of the scene. LLMs can help develop systems to achieve more realistic lighting and atmospheric effects in several innovative ways. VLMs can analyze vast datasets from real-world lighting and atmospheric conditions to help develop more realistic algorithms for simulating these effects in games. By understanding the patterns and intricacies of natural lighting and weather, these models can contribute to the development of algorithms that mimic reality closely. LLMs and VLMs could also be used to develop systems that adjust lighting and atmospheric effects in real-time based on player actions, game states, or external inputs. They can process natural language commands from players to modify the game environment, providing a more interactive and immersive experience.
逼真的光照与大气效果是构建可信且引人入胜游戏环境的基础。先进算法能模拟自然光照条件与动态天气效果,增强场景的真实感与氛围营造。LLMs 可通过多种创新方式协助开发实现更真实光照与大气效果的系统。VLMs 能分析来自现实世界光照与大气条件的海量数据集,协助开发更真实的游戏效果模拟算法。通过理解自然光照与天气的规律和复杂性,这些模型有助于开发能精准模拟现实的算法。LLMs 和 VLMs 还可用于开发基于玩家操作、游戏状态或外部输入实时调整光照与大气效果的系统,并能处理玩家的自然语言指令来修改游戏环境,提供更具交互性和沉浸感的体验。
6.1.5 Experiments and Results
6.1.5 实验与结果
Zero-shot/Few-shot Learning with LLM or LVM.
零样本/少样本学习与 LLM 或 LVM。
As shown in Fig. 8 and Fig. 9, we used GPT-4V for high-level description and action prediction. Fig. 8 shows some qualitative examples of action description generation and editing with GPT-4V. Agent-enhanced text opens up a novel method of generating 3D scenes with game action priors to help improve the naturalness of the scene. Consequently, GPT-4V generates relevant high-level descriptions that are appropriate for the gaming videos.
如图 8 和图 9 所示,我们使用 GPT-4V 进行高层级描述与动作预测。图 8 展示了 GPT-4V 生成和编辑动作描述的定性案例。通过智能体增强文本,我们开创了一种结合游戏动作先验生成 3D 场景的新方法,从而提升场景的自然度。最终 GPT-4V 能为游戏视频生成贴切的高层级描述。
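For reference, a prompt for this kind of high-level action prediction might be assembled roughly as sketched below; the template wording and the call_gpt4v stub are illustrative assumptions, and the exact prompts we used are given in the Appendix.

    def call_gpt4v(text_prompt: str, image_paths: list) -> str:
        # Hypothetical stub for a multimodal model call (text prompt + game frames).
        return "(predicted high-level action)"

    def build_action_prediction_prompt(game_goal: str, action_history: list, observation: str) -> str:
        history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(action_history))
        return (
            f"Game goal: {game_goal}\n"
            f"Action history:\n{history}\n"
            f"Current observation: {observation}\n"
            "Predict the next high-level action the player should take, and explain briefly."
        )

    prompt = build_action_prediction_prompt(
        game_goal="craft a wooden pickaxe",
        action_history=["chop tree", "collect wood"],
        observation="player is holding wood; crafting table nearby",
    )
    print(call_gpt4v(prompt, image_paths=["frame_0001.png"]))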
Small Agent Pretraining Model.
小型智能体预训练模型
To showcase our agent vision-language architecture, we first study its application in a widely used domain for gaming agents by pretraining on Minecraft data. As shown in Fig. 7, given an input agent action, a key frame of the video, and the corresponding text, a standard encoder-decoder can be employed to convert the agent action and image into action text tokens and image patch tokens, which the agent-vision-language decoder then converts into an action prediction sentence. The overall architecture is depicted in Fig. 7. We evaluate our approach with several Minecraft demonstrations. The Minecraft video data consists of 5-minute clips; the pretraining set contains 78K videos, of which we used 5K videos (6% of the pretraining data) for the first round of pretraining. We train a 250M-parameter model on 16 NVIDIA V100 GPUs for one day and visualize our model outputs in Fig. 10 and Fig. 11. Fig. 10 shows that our relatively small agent architecture can produce reasonable outputs for Minecraft scenes unseen during training. Fig. 11 shows the model's predictions compared to the ground-truth human player actions, indicating potential low-level understanding in our small agent model.
为展示智能体视觉语言架构,我们首先通过在《我的世界》数据上进行预训练,研究其在游戏智能体领域的应用。如图 7 所示,给定输入动作智能体、视频关键帧及对应文本,标准编码器-解码器可将智能体动作和图像转换为动作文本标记与图像分块标记,再经由智能体视觉语言解码器生成动作预测语句。整体架构如图 7 所示。我们在多个《我的世界》演示中评估该方法:预训练使用的 78,000 段视频包含 5 分钟片段,首轮预训练选用 5,000 段视频(占预训练数据 6%)。基于 16 块 NVIDIA v100 显卡训练 2.5 亿参数模型一天后,图 10 与图 11 展示了模型输出。图 10 表明我们的小型智能体架构能为训练中未见的《我的世界》场景生成合理输出。图 11 显示模型预测与真实玩家动作的对比,揭示小型智能体模型具备潜在的低层级理解能力。
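A highly simplified PyTorch sketch of the kind of encoder-decoder described above is given below; the layer sizes, patch dimension, tokenization, and absence of a causal decoder mask are placeholders chosen for brevity and do not reproduce the actual 250M-parameter model.

    import torch
    import torch.nn as nn

    class AgentVisionLanguageModel(nn.Module):
        """Sketch: encode action-text tokens plus image-patch tokens, decode an action-prediction sentence."""
        def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4, patch_dim=768):
            super().__init__()
            self.text_embed = nn.Embedding(vocab_size, d_model)
            self.patch_proj = nn.Linear(patch_dim, d_model)   # project image patches to model width
            enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
            self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, action_tokens, image_patches, target_tokens):
            # Concatenate embedded action-text tokens and projected image-patch tokens, then encode.
            src = torch.cat([self.text_embed(action_tokens), self.patch_proj(image_patches)], dim=1)
            memory = self.encoder(src)
            # Decode the action-prediction sentence conditioned on the multimodal memory
            # (a causal mask would be added for real autoregressive training).
            tgt = self.text_embed(target_tokens)
            return self.lm_head(self.decoder(tgt, memory))

    model = AgentVisionLanguageModel()
    logits = model(torch.randint(0, 32000, (2, 8)),    # action-text tokens
                   torch.randn(2, 16, 768),             # image-patch features
                   torch.randint(0, 32000, (2, 12)))    # target sentence tokens
    print(logits.shape)                                 # torch.Size([2, 12, 32000])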
图 12:情境学习游戏框架 MindAgent 规划技能与工具使用:游戏环境需要多样化规划能力与工具操作来完成任务。该系统生成相关游戏信息,并将数据转换为 LLMs 可处理的结构化文本格式。 LLM:作为框架核心决策单元,承担多智能体系统调度中枢职能。 记忆存储:关键信息的存储模块。 动作模块:从文本输入提取动作,转换为领域专用语言并验证其可执行性。
Multi-Agent Infrastructure.
多智能体基础设施。
As shown in the agent paradigm in Fig. 5, we designed a novel infrastructure for a new gaming scenario called "CuisineWorld" Gong et al. (2023a). We detail our approach in Fig. 12. Our infrastructure allows for multi-agent collaboration by leveraging GPT-4 as a central planner and works across multiple gaming domains. We investigated our system's multi-agent planning capabilities, and we deployed the infrastructure into real-world video games to demonstrate its multi-agent and human-AI collaboration effectiveness. Additionally, we presented "CuisineWorld", a text-based multi-agent collaboration benchmark that provides a new auto-metric, the Collaboration Score (CoS), to quantify collaboration efficiency.
如图 5 所示的智能体范式所示,我们为名为"烹饪世界"的新游戏场景设计了一套创新基础设施(Gong 等人,2023a)。我们在图 12 中详细阐述了该方法。该基础设施利用 GPT-4 作为中央规划器,支持跨多个游戏领域的多智能体协作。我们研究了系统的多智能体规划能力,并将该基础设施部署到现实电子游戏中,验证了其多智能体协作及人机协作效能。此外,我们推出了基于文本的多智能体协作基准测试"烹饪世界",通过新型自动度量指标协作分数(CoS)来量化协作效率。
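The dispatch loop at the heart of such an infrastructure can be sketched roughly as follows; the structured game-state text, the JSON assignment format, the action set, and the call_gpt4 stub are assumptions for illustration and do not reproduce the exact prompts of the system.

    import json

    VALID_ACTIONS = {"goto", "get", "put", "activate", "noop"}   # domain-specific action set

    def call_gpt4(prompt: str) -> str:
        # Hypothetical central-planner call; expected to return a JSON list of per-agent actions.
        return '[{"agent": "agent_1", "action": "goto", "target": "stove"}]'

    def plan_step(game_state_text: str, num_agents: int) -> list:
        prompt = (
            "You are the central planner for a cooking game.\n"
            f"Game state:\n{game_state_text}\n"
            f"Assign one action to each of the {num_agents} agents.\n"
            'Answer as JSON: [{"agent": ..., "action": ..., "target": ...}, ...]'
        )
        assignments = json.loads(call_gpt4(prompt))
        # Action module: keep only assignments that the game can actually execute.
        return [a for a in assignments if a.get("action") in VALID_ACTIONS]

    for step in range(3):
        actions = plan_step("pending order: tomato soup; agent_1 idle; agent_2 idle", num_agents=2)
        print(f"step {step}: {actions}")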
Please refer to the Appendix for more examples and details for gaming description, high-level action prediction, and GPT-4V prompting. We show examples for Bleeding Edge in Fig. 32 and Appendix B, Microsoft Flight Simulator in Fig. 33 and Appendix C, Assassin's Creed Odyssey in Fig. 34 and Appendix D, Gears of War 4 in Fig. 35 and Appendix E, and Starfield in Fig. 36 and Appendix F. We also provide a detailed screenshot of the prompting process for GPT-4V used to generate Minecraft examples in Fig. 31 in Appendix A.
关于游戏场景描述、高层动作预测及 GPT-4V 提示的更多案例与细节请参阅附录。我们展示了以下游戏的示例:《Bleeding Edge》(图 32 及附录 B)、《微软模拟飞行》(图 33 及附录 C)、《刺客信条:奥德赛》(图 34 及附录 D)、《战争机器 4》(图 35 及附录 E)以及《星空》(图 36 及附录 F)。附录 A 中的图 31 还提供了用于生成《我的世界》示例的 GPT-4V 提示过程的详细截图。
6.2 Robotics 6.2 机器人学
Robots are representative agents that necessitate effective interaction with their environment. In this section, we will introduce key elements essential for efficient robotic operation, review research topics where the latest LLM/VLM technologies have been applied, and share findings from our most recent studies.
机器人是需要与环境进行有效交互的典型智能体。本节将介绍实现高效机器人操作的关键要素,综述 LLM/VLM 前沿技术已应用的研究方向,并分享我们最新研究中的发现。
Visual Motor Control. 视觉运动控制
Visual Motor Control refers to the integration of visual perception and motor action to execute tasks effectively in a robotic system. This integration is paramount as it enables robots to interpret the visual data from their environment and accordingly adjust their motor actions to interact with the environment accurately. For instance, in an assembly line, a robot equipped with visual motor control can perceive the position and orientation of objects and accurately align its manipulator to interact with these objects. This capability is essential for ensuring the precision and effectiveness of robotic operations across a myriad of applications, ranging from industrial automation to assisting the elderly in their daily chores. Moreover, visual motor control facilitates robots in adapting to dynamic environments where the state of the environment may change rapidly, requiring real-time adjustments to motor actions based on visual feedback.
视觉运动控制是指通过整合视觉感知与运动执行,使机器人系统高效完成任务的机制。这种整合至关重要,它使机器人能够解读环境中的视觉数据,并相应调整运动动作以实现精准的环境交互。例如在装配线上,具备视觉运动控制能力的机器人可感知物体的位置与朝向,并精确调整机械臂与物体进行交互。该能力对于确保机器人操作在各类应用中的精准性与有效性具有决定性作用,其应用场景涵盖工业自动化至老年人日常辅助等广泛领域。此外,视觉运动控制使机器人能适应快速变化的动态环境,根据视觉反馈实时调整运动动作。
Additionally, within the context of safe operation, visual information is crucial for detecting execution errors and confirming the pre- and post-conditions of each robot action. In uncontrolled environments, such as unknown domestic settings, robots are more likely to face unexpected outcomes due to unpredictable factors like changing furniture shapes, varied lighting, and slippage. Executing a pre-planned action plan solely in a feedforward manner can pose significant risks in these settings. Therefore, utilizing visual feedback to continually verify outcomes at each step is key to ensuring robust and reliable operation of robotic systems.
在安全操作层面,视觉信息对于检测执行错误及确认每个机器人动作的前置与后置条件同样至关重要。在未知家庭环境等非受控场景中,由于家具形态变化、光线差异及打滑等不可预测因素,机器人更易遭遇意外状况。在此类环境中仅以前馈方式执行预设动作方案可能带来重大风险。因此,利用视觉反馈逐步验证每个动作的执行结果,是确保机器人系统稳定可靠运行的关键所在。
Language Conditioned Manipulation.
语言条件化操控。
Language Conditioned Manipulation entails the ability of a robotic system to interpret and execute tasks based on language instructions. This aspect is particularly crucial for creating intuitive and user-friendly interfaces for human-robot interaction. Through natural language commands, users can specify goals and tasks to robots in a manner similar to human-human communication, thereby lowering the barrier to operating robotic systems. In a practical scenario, for instance, a user could instruct a service robot to “pick up the red apple from the table,” and the robot would parse this instruction, identify the referred object and execute the task of picking it up Wake et al. (2023c). The core challenge lies in developing robust natural language processing and understanding algorithms that can accurately interpret a wide array of instructions, ranging from direct commands to more abstract directives, and enable the robot to convert these instructions into actionable tasks. Furthermore, ensuring that robots can generalize these instructions across diverse tasks and environments is critical for enhancing their versatility and utility in real-world applications.
The use of language input to guide robot’s task planning has gained attention in the context of a robot framework called Task and Motion Planning Garrett et al. (2021).
语言条件操控要求机器人系统具备根据语言指令解析并执行任务的能力。这一特性对于创建直观友好的人机交互界面尤为重要。通过自然语言指令,用户可以用类似人类交流的方式向机器人指定目标和任务,从而降低操作机器人系统的门槛。例如在实际场景中,用户可以指示服务机器人"从桌上拿起红苹果",机器人将解析该指令、识别目标物体并执行抓取动作(Wake 等人,2023c)。核心挑战在于开发鲁棒的自然语言处理与理解算法,使其能准确解析从直接指令到抽象指示的各类语言输入,并将之转化为可执行任务。此外,确保机器人能将指令泛化应用于不同任务和环境,对提升其在实际应用中的多功能性与实用性至关重要。在名为"任务与运动规划"的机器人框架中(Garrett 等人,2021),使用语言输入指导机器人任务规划的方法已受到关注。
Skill Optimization. 技能优化
Recent studies highlight the effectiveness of LLMs in robotic task planning. However, the optimal execution of tasks, especially those involving physical interactions like grasping, requires a deeper understanding of the environment that goes beyond simply interpreting human instructions. For example, robot grasping necessitates precise contact points Wake et al. (2023e) and arm posture Sasabuchi et al. (2021) to efficiently execute subsequent actions. While these elements—precise contact points and arm posture—are intuitive for humans, articulating them through language is challenging. Despite advances in internet-scale VLMs, capturing these nuanced indirect cues from scenes and translating them effectively into robotic skills remains a significant challenge. In response, the robotics community is increasingly focusing on collecting enhanced datasets (e.g., Wang et al. (2023d); Padalkar et al. (2023)) or developing methodologies for direct skill acquisition from human demonstrations Wake et al. (2021a). Frameworks including Learning-from-Demonstration and Imitation Learning are leading these developments, playing a crucial role in the optimization of physical skills.
近期研究凸显了 LLMs 在机器人任务规划中的有效性。但任务的最优执行——特别是涉及抓取等物理交互时——需要超越简单理解人类指令的环境深度认知。例如机器人抓取需要精确的接触点(Wake 等人,2023e)和手臂姿态(Sasabuchi 等人,2021)才能高效执行后续动作。虽然这些要素对人类而言是直觉性的,但用语言准确描述却存在困难。尽管互联网级视觉语言模型(VLMs)取得进展,从场景中捕捉这些微妙间接线索并将其有效转化为机器人技能仍面临重大挑战。为此,机器人学界正着力收集增强数据集(如 Wang 等人 2023d;Padalkar 等人 2023)或开发从人类示范直接获取技能的方法(Wake 等人 2021a)。包括示范学习与模仿学习在内的框架正引领这些发展,在物理技能优化中发挥关键作用。
6.2.1 LLM/VLM Agent for Robotics.
6.2.1 机器人应用的 LLM/VLM 智能体
Recent research has demonstrated the potential of LLMs/VLMs for robotic agents that interact with humans in an environment. Research topics that aim to leverage the latest LLM/VLM technologies include:
最新研究表明,LLM/VLMs 在需与环境及人类交互的机器人智能体领域具有潜力。旨在利用前沿 LLM/VLM 技术的研究方向包括:
Multimodal Systems: Recent research has been actively focusing on developing end-to-end systems that incorporate the latest LLM and VLM technologies as encoders for input information. Particularly, there is a significant trend towards modifying these foundation models to process multimodal information Jiang et al. (2022); Brohan et al. (2023, 2022); Li et al. (2023d); Ahn et al. (2022b); Shah et al. (2023b); Li et al. (2023e). This adaptation aims to guide robotic actions based on both linguistic instructions and visual cues, thus achieving an effective embodiment.
多模态系统:当前研究正积极开发以内置最新 LLM 和 VLM 技术作为输入信息编码器的端到端系统。尤其值得注意的是改造这些基础模型以处理多模态信息的显著趋势(Jiang 等人 2022;Brohan 等人 2023,2022;Li 等人 2023d;Ahn 等人 2022b;Shah 等人 2023b;Li 等人 2023e)。这种改造旨在通过语言指令和视觉线索共同引导机器人行为,从而实现有效具身化。
Task Planning and Skill Training:
In contrast to end-to-end systems, Task And Motion Planning (TAMP) based systems first compute a high-level task plan and then achieve it with low-level robot control, known as skills.
任务规划与技能训练:与端到端系统不同,基于任务与运动规划(TAMP)的系统先计算高层任务计划,再通过被称为技能的底层机器人控制实现目标。
The advanced language processing abilities of LLMs have demonstrated the capability to interpret instructions and decompose them into robot action steps, greatly advancing task planning technologies Ni et al. (2023); Li et al. (2023b); Parakh et al. (2023); Wake et al. (2023c). For skill training, several studies have explored the use of LLMs/VLMs for designing reward functions Yu et al. (2023a); Katara et al. (2023); Ma et al. (2023), generating data to facilitate policy learning Kumar et al. (2023); Du et al. (2023), or serving as part of a reward function Sontakke et al. (2023). Together with training frameworks such as RL and IL, these efforts will contribute to the development of efficient robot controllers.
LLMs 先进的语言处理能力已展现出解析指令并将其分解为机器人行动步骤的潜力,极大推进了任务规划技术的发展(Ni 等人 2023;Li 等人 2023b;Parakh 等人 2023;Wake 等人 2023c)。在技能训练方面,多项研究探索了使用 LLMs/VLMs 设计奖励函数(Yu 等人 2023a;Katara 等人 2023;Ma 等人 2023)、生成促进策略学习的数据(Kumar 等人 2023;Du 等人 2023)或作为奖励函数组成部分(Sontakke 等人 2023)。这些努力与强化学习(RL)、模仿学习(IL)等训练框架相结合,将推动高效机器人控制器的发展。
On-site Optimization: Executing long task steps in robotics can be difficult due to unexpected and unpredictable environmental conditions. Therefore, a significant challenge in the field of robotics involves dynamically adapting and refining robotic skills by integrating task plans with real-time environmental data. For instance, (Ahn et al., 2022b) proposed an approach that calculates the feasibility of actions (i.e., affordance) from visual information and compares it with planned tasks. Additionally, there are approaches that focus on enabling LLMs to output the pre-conditions and post-conditions (e.g., states of objects and their interrelationships) of task steps to optimize their execution Zhou et al. (2023c) and detect pre-condition errors for necessary revisions to the task plan Raman et al. (2023). These strategies seek to achieve environment-grounded robot execution by integrating environmental information and adjusting the robot’s actions at the task plan or controller level.
现场优化:由于环境中存在意外且不可预测的条件,在机器人技术中执行长时任务步骤可能具有挑战性。因此,该领域的一大难题在于如何通过将任务规划与实时环境数据相结合,动态调整并完善机器人技能。例如,(Ahn 等人,2022b)提出了一种方法,通过视觉信息计算动作的可行性(即可供性)并与规划任务进行对比。此外,还有研究致力于让 LLMs 输出任务步骤的前置条件与后置条件(如物体状态及其相互关系)以优化执行过程(Zhou 等人,2023c),以及检测前置条件错误从而对任务规划进行必要修正(Raman 等人,2023)。这些策略旨在通过整合环境信息并在任务规划或控制器层面调整机器人动作,实现基于环境感知的机器人执行。
Conversation Agents: In creating conversational robots, LLMs can contribute to natural, context-sensitive interactions with humans Ye et al. (2023a); Wake et al. (2023f). These models process and generate responses that mimic human conversation, allowing robots to participate in meaningful dialogues. Additionally, LLMs play a significant role in the estimation of conceptual Hensel et al. (2023); Teshima et al. (2022) and emotional attributes Zhao et al. (2023); Yang et al. (2023b); Wake et al. (2023d) of utterances. Those attributes facilitate the understanding of human intent and meaningful gesture generation, thus contributing to the naturalness and efficacy of human-robot communication.
会话代理:在开发对话机器人时,LLMs 能够助力实现与人类自然、情境敏感的交互(Ye 等, 2023a; Wake 等, 2023f)。这些模型通过处理并生成拟人化回应,使机器人能够参与有意义的对话。此外,LLMs 在话语概念属性(Hensel 等, 2023; Teshima 等, 2022)和情感特征(Zhao 等, 2023; Yang 等, 2023b; Wake 等, 2023d)的评估中发挥关键作用。这些特性有助于理解人类意图并生成符合语境的动作,从而提升人机交互的自然度与有效性。
Navigation Agents:
Robot navigation has a long history of research, focusing on core aspects such as map-based path planning and Simultaneous Localization and Mapping (SLAM) for creating environmental maps. These functionalities have become standard in widely used robot middleware like the Robot Operating System (ROS) Guimarães et al. (2016).
导航智能体:机器人导航研究历史悠久,其核心聚焦于基于地图的路径规划以及用于构建环境地图的同步定位与建图(SLAM)技术。这些功能已在机器人操作系统(ROS)等广泛使用的机器人中间件中成为标准配置(Guimarães 等人,2016)。
While classic navigation techniques remain prevalent in many robotics applications, they typically rely on static or pre-created maps. Recently, there has been an increased interest in advanced technologies that enable robots to navigate in more challenging environments, leveraging breakthroughs in fields like computer vision and natural language processing. One representative task is object navigation Chaplot et al. (2020a); Batra et al. (2020); Gervet et al. (2023); Ramakrishnan et al. (2022); Zhang et al. (2021), where robots use object names for navigation instead of map coordinates, requiring the visual grounding of object names in the environment. Furthermore, recent attention has been given to technologies that navigate robots in entirely unfamiliar new environments on a zero-shot basis, on top of foundation models, so-called zero-shot object navigation Gadre et al. (2023); Dorbala et al. (2023); Cai et al. (2023). Additionally, Vision-Language Navigation (VLN) Anderson et al. (2018a) is a representative task, where the task involves navigating an agent by natural language instructions in previously unseen, real-world environments Shah et al. (2023a); Zhou et al. (2023a); Dorbala et al. (2022); Liang et al. (2023); Huang et al. (2023b). VLN interprets sentences rather than object names, such as "go to the bathroom on your left," and thus requires a higher level of functionality to parse the input text Wang et al. (2019).
虽然经典导航技术仍在多数机器人应用中占主导地位,但它们通常依赖静态或预构建地图。近年来,随着计算机视觉和自然语言处理领域的突破,针对复杂环境导航的先进技术受到越来越多的关注。代表性任务包括以物体名称而非地图坐标进行导航的物体导航(Chaplot 等, 2020a; Batra 等, 2020; Gervet 等, 2023; Ramakrishnan 等, 2022; Zhang 等, 2021),这需要将物体名称与环境视觉信息进行锚定。更前沿的是基于基础模型的零样本物体导航技术(Gadre 等, 2023; Dorbala 等, 2023; Cai 等, 2023),使机器人能在完全陌生的新环境中实现零样本导航。视觉语言导航(VLN)(Anderson 等, 2018a)则是另一项代表性任务,要求智能体根据自然语言指令(如"向左转进入浴室")在未知的真实环境中导航(Shah 等, 2023a; Zhou 等, 2023a; Dorbala 等, 2022; Liang 等, 2023; Huang 等, 2023b),这需要更高阶的文本解析能力(Wang 等, 2019)。
The advent of foundation models contributes to the development of such adaptive, on-the-fly navigation technologies by enhancing the understanding of human language instructions and the visual interpretation of environmental information. More detailed explanations of representative VLN research are provided in Section 6.2.2.
基础模型的出现推动了自适应实时导航技术的发展,通过增强对人类语言指令的理解和环境视觉信息的解读能力。关于视觉语言导航(VLN)代表性研究的详细阐述见 6.2.2 章节。
6.2.2 Experiments and Results.
6.2.2 实验与结果
图 13:集成 ChatGPT 任务规划器的机器人教学系统概览。该流程包含两个步骤:任务规划阶段用户通过规划器生成动作序列并根据反馈调整结果;演示阶段用户通过视觉化演示为机器人操作提供必要信息。视觉系统采集的参数将用于机器人执行环节。
An accumulating body of evidence suggests that recent VLMs and LLMs have promising capabilities for symbolic task planning (e.g., what-to-do). However, each task requires a low-level control policy (e.g., how-to-do) to achieve successful interaction with the environment. While reinforcement learning and imitation learning are promising approaches to learn policies in a data-driven manner, another promising approach is to obtain the strategy directly from humans through on-site demonstration, an approach called Learning-from-Observation Wake et al. (2021a); Ikeuchi et al. In this section, we introduce a study where we employ ChatGPT for task planning and enrich the plan by parameterizing it with affordance information to facilitate effective and precise execution (Fig. 13).
现有研究表明,最新视觉语言模型(VLMs)和 LLMs 在符号化任务规划(如"做什么")方面展现潜力。然而每个任务都需要底层控制策略(如"如何做")来实现与环境的有效交互。虽然强化学习和模仿学习是以数据驱动方式获取策略的有效途径,另一种可行方案是通过现场演示直接从人类获取操作策略——即"观察学习"方法(Wake 等, 2021a; Ikeuchi 等)。本节介绍我们利用 ChatGPT 进行任务规划,并通过功能可供性信息参数化来增强计划可执行性的研究(见图 13)。
The pipeline was composed of two modules: task planning and parameterization. In task planning, the system is fed with language instructions and a description of the working environment. These instructions, along with a predefined set of robot actions and output specifications, are compiled into a comprehensive prompt provided to ChatGPT, which then generates a sequence of decomposed tasks with their textual descriptions (Fig. 13; left pane). Notably, we employ a few-shot approach, meaning ChatGPT is not trained on this task, which offers an advantage in applicability as it eliminates the need for hardware-dependent data collection and model training. Additionally, the textual descriptions in the output enable the user to check and adjust the results as necessary, which is a crucial feature for safe and robust operation. Fig. 14 shows qualitative results of an agentic simulation on top of VirtualHome Puig et al. (2018). The results demonstrate a reasonable task plan and its flexibility in adjusting outputs, indicating the broad applicability of our approach.
该系统流程包含任务规划与参数化两大模块。任务规划模块接收语言指令和工作环境描述,结合预定义的机器人动作集与输出规范,编译成综合提示词输入 ChatGPT,后者生成带文本描述的任务分解序列(图 13 左窗格)。值得注意的是,我们采用小样本学习方式,ChatGPT 无需针对该任务进行训练,这种硬件无关的特性显著提升了适用性。输出中的文本描述允许用户核查和调整结果,这对确保系统安全稳健运行至关重要。图 14 展示了在 VirtualHome 平台(Puig 等, 2018)上进行的智能体仿真定性结果,验证了任务规划的合理性及其输出调整的灵活性,表明该方法具有广泛适用性。
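A rough sketch of this planning-and-feedback loop is shown below; the prompt wording, the JSON output format, and the call_chatgpt stub are illustrative assumptions rather than the exact prompts used in the study.

    import json

    ROBOT_ACTIONS = ["move_to(object)", "grasp(object)", "release(object)", "open(object)", "close(object)"]

    def call_chatgpt(prompt: str) -> str:
        # Hypothetical wrapper around a chat-completion call.
        return '{"plan": [{"action": "move_to", "target": "pie"}, {"action": "grasp", "target": "pie"}]}'

    def plan_tasks(instruction: str, environment: str) -> dict:
        prompt = (
            "You are a robot task planner.\n"
            f"Available actions: {ROBOT_ACTIONS}\n"
            f"Environment: {environment}\n"
            f"Instruction: {instruction}\n"
            'Output JSON: {"plan": [{"action": ..., "target": ...}, ...]} with a short description per step.'
        )
        return json.loads(call_chatgpt(prompt))

    plan = plan_tasks("Take the pie on the table and warm it using the stove.",
                      "kitchen with a table, a pie on the table, a stove")
    print(json.dumps(plan, indent=2))
    # The user reviews the textual plan and, if needed, sends corrective feedback
    # (e.g., "open the stove before placing the pie") as an additional message.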
图 14:通过自动生成反馈调整输出序列的案例。实验采用开源仿真器 VirtualHome。给定指令"拿走桌上的派并用炉子加热",任务规划器生成 VirtualHome 支持的函数序列。当检测到执行错误时,规划器会根据自动生成的错误信息修正输出。
While the task planner guarantees coherency between the task sequences, successful operation in reality requires detailed parameters. For example, grasp type is crucial for carrying a container without spilling its contents, yet such a parameter is often ignored in simulators (see Fig. 14, grasping a pie). In our robot system, therefore, users are asked to demonstrate each action visually (Fig. 13; right pane). The tasks had predefined parameters necessary for execution, which our vision system extracts from the videos Wake et al. (2021b). Notably, our robotic system is not designed for exact replication of human motions (i.e., teleoperation) but rather to handle varying real-world conditions, such as changes in object locations. Hence, the parameters extracted from human demonstrations encompass not precise motion paths but affordance information that dictates effective movement in the environment (e.g., waypoints for collision avoidance Wake et al. (2023a), grasp types Wake et al. (2023e), and upper-limb postures Sasabuchi et al. (2021); Wake et al. (2021a)). Upper-limb posture is critical in robots with high degrees of freedom and is constrained so that the robot assumes postures that are predictable to the humans coexisting with it. The task sequence endowed with affordances is transformed into a sequence of reusable robot skills acquired through reinforcement learning and executed by the robot Takamatsu et al. (2022).
虽然任务规划器能确保任务序列间的连贯性,但在现实中的成功操作需要详细参数。例如,抓取类型对于搬运容器同时避免内容物洒出至关重要,这类参数在仿真环境中常被忽略(见图 14 抓取馅饼的示例)。因此,在我们的机器人系统中,用户需通过视觉演示每个动作(图 13 右侧面板)。任务包含执行所需的预定义参数,这些参数由我们的视觉系统从视频中提取(Wake 等人,2021b)。值得注意的是,我们的机器人系统并非设计用于精确复制人类动作(即遥操作),而是为了应对现实世界中多变的条件,例如物体位置变化。因此,从人类示范中提取的参数并非精确的运动路径,而是决定有效环境移动的功用信息(如避碰航点(Wake 等人,2023a)、抓取类型(Wake 等人,2023e)和上肢姿势(Sasabuchi 等人,2021;Wake 等人,2021a))。上肢姿势对于高自由度机器人至关重要,其设计需确保与操作机器人共处的人类能够预判其姿态。赋予功用信息的任务序列会被转化为通过强化学习获得的可复用机器人技能序列,并由机器人执行(Takamatsu 等人,2022)。
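As a concrete illustration, the affordance information attached to each task step might be organized along the following lines; the field names and values are hypothetical and simply mirror the parameters listed above (grasp type, collision-avoidance waypoints, upper-limb posture).

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Affordance:
        grasp_type: str                               # e.g., "precision_pinch" or "power_grasp"
        waypoints: List[Tuple[float, float, float]]   # collision-avoidance waypoints in the robot frame
        upper_limb_posture: str                       # symbolic posture label predictable to humans

    @dataclass
    class TaskStep:
        action: str                                   # symbolic action from the task planner
        target: str                                   # object the action operates on
        affordance: Affordance                        # parameters extracted from the human demonstration

    step = TaskStep(
        action="grasp",
        target="pie",
        affordance=Affordance(
            grasp_type="power_grasp",
            waypoints=[(0.40, 0.10, 0.25), (0.42, 0.05, 0.20)],
            upper_limb_posture="elbow_down",
        ),
    )
    print(step)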
图 15:展示了利用 GPT-4V 和 GPT-4 的多模态任务规划器概览。该系统通过处理视频演示和文本指令,生成供机器人执行的任务方案。
图 16:视频分析器的输出示例。系统以固定间隔抽取五帧画面输入 GPT-4V,完整流程说明详见第 6.2.2 节。
LLM-empowered task planning can be extended to a more versatile robotic system by integrating it with VLMs. Here, we show an example where we use GPT-4V(ision) to broaden the aforementioned task planner to a multimodal input context (Fig. 15), in which a human performs actions that are intended to be replicated by the robot. In this paper, only part of the prompt is shown. The whole prompt is available at microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts.
通过将 LLM 驱动的任务规划与视觉语言模型(VLM)结合,可将其扩展至更通用的机器人系统。如图 15 所示,我们在多模态输入场景中使用 GPT-4V(ision)拓展了前述任务规划器——人类演示的动作将被机器人复现。本文仅展示部分提示词,完整提示详见 microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts。
This pipeline takes demonstration videos and text, then outputs a sequence of robot actions. A vision analyzer aims to understand the actions performed by humans in the video. We used GPT-4V and provided a prompt to generate text instructions in a style typical of human-to-human communication. Fig. 16 demonstrates how the use of text input allows the user to give feedback on GPT-4V's recognition results for correction purposes. Such a feature, aimed at improving the accuracy of the recognition results, also enables more robust operation.
该流程接收演示视频和文本输入,随后输出机器人动作序列。视觉分析模块旨在理解视频中人类的操作行为。我们采用 GPT-4V 模型,通过提示指令生成符合人际交流风格的文本说明。图 16 展示了如何通过文本输入让用户对 GPT-4V 的识别结果提供修正反馈。这种旨在提升识别准确度的功能,同时也能实现更稳定的系统运作。
图 17:利用 GPT-4V 的场景分析器输出示例。我们将在第 6.2.2 节详述整个流程。
Next, the scene analyzer compiles the expected work environment into text information based on the instructions and the first frame of the video data (or an image of the environment). This environmental information includes a list of object names recognized by GPT-4V, the graspable properties of objects, and the spatial relationships between objects. Although these computational processes are a black box within GPT-4V, the information is output based on the knowledge of GPT-4V and the image/text input. Fig. 17 shows example outputs of our scene analyzer. As shown in the figure, GPT-4V successfully selects the objects that are related to the manipulation. For example, a table is included in the output when the human is relocating a spam container on the table, while the table is ignored for the fridge-opening task. These results suggest that the scene analyzer encodes the scene information with respect to the human's actions. We prompted GPT-4V to explain the results of the object selection process and the reasons behind those choices. In practice, we found this approach resulted in reasonable outputs. Finally, based on the given text instructions and environmental information, the task planner outputs a sequence of tasks Wake et al. (2023c).
接下来,场景分析器会根据指令和视频数据的第一帧(或环境图像)将预期工作环境编译为文本信息。这些环境信息包括 GPT-4V 识别的物体名称列表、物体的可抓取属性以及物体间的空间关系。尽管这些计算过程是 GPT-4V 内部的黑箱操作,但信息输出基于 GPT-4V 的知识库和图像/文本输入。图 17 展示了我们场景分析器的示例输出结果。如图所示,GPT-4V 能成功筛选出与操作相关的物体。例如当人类移动桌面的垃圾箱时,输出结果包含桌子;而在执行冰箱开门任务时,桌子信息则被忽略。这些结果表明场景分析器会根据人类动作来编码场景信息。我们提示 GPT-4V 解释物体选择过程的结果及其决策依据,实践中发现该方法能产生合理输出。最后,基于给定的文本指令和环境信息,任务规划器会输出一系列任务序列(Wake 等人,2023c)。
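Putting the pieces together, the overall GPT-4V-based pipeline can be summarized by the sketch below; the three stub functions stand in for the prompted GPT-4V/GPT-4 calls and are assumptions for illustration, with the full prompts available at the URL above.

    def analyze_video(frames: list) -> str:
        # Video analyzer: prompt GPT-4V with sampled frames to describe the human action
        # in a human-to-human instruction style; the user may correct this text.
        return "The person picks up the spam container on the table and moves it to the shelf."

    def analyze_scene(first_frame: str, instruction: str) -> dict:
        # Scene analyzer: prompt GPT-4V with the first frame plus the instruction to list
        # task-relevant objects, their graspable properties, and spatial relations.
        return {"objects": ["spam container", "table", "shelf"],
                "relations": ["spam container is on the table"]}

    def plan_tasks(instruction: str, scene: dict) -> list:
        # Task planner: prompt the language model with the corrected instruction and scene description.
        return [{"action": "move_to", "target": "table"},
                {"action": "grasp", "target": "spam container"},
                {"action": "move_to", "target": "shelf"},
                {"action": "release", "target": "spam container"}]

    frames = ["frame_000.png", "frame_030.png", "frame_060.png", "frame_090.png", "frame_120.png"]
    instruction = analyze_video(frames)             # step 1: what did the human do?
    scene = analyze_scene(frames[0], instruction)   # step 2: what does the workspace contain?
    print(plan_tasks(instruction, scene))           # step 3: executable task sequence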
Embodied Agents for Robotics Navigation.
机器人导航中的具身智能体
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. Navigation in 3D environments Zhu et al. (2017a); Mirowski et al. (2016); Mousavian et al. (2018); Hemachandra et al. (2015) is an essential capability of a mobile intelligent system that functions in the physical world. In the past few years, a plethora of tasks and evaluation protocols Savva et al. (2017); Kolve et al. (2017); Song et al. (2017); Xia et al. (2018); Anderson et al. (2018a) have been proposed as summarized in Anderson et al. (2018b). VLN Anderson et al. (2018a) focuses on language-grounded navigation in the real 3D environment.
In order to solve the VLN task, (Anderson et al., 2018a) set up an attention-based sequence-to-sequence baseline model. Then (Wang et al., 2018) introduced a hybrid approach that combines model-free and model-based reinforcement learning (RL) to improve the model's generalizability. Lastly, (Fried et al., 2018) proposed a speaker-follower model that adopts data augmentation, a panoramic action space, and modified beam search for VLN, establishing the current state-of-the-art performance on the Room-to-Room dataset. Extending prior work, we propose a Reinforced Cross-Modal Matching (RCM) approach for VLN in (Wang et al., 2019). The RCM model is built upon Fried et al. (2018) but differs in many significant aspects: (1) RCM combines a novel multi-reward RL with imitation learning for VLN, while Speaker-Follower models Fried et al. (2018) only use supervised learning as in (Anderson et al., 2018a). (2) The RCM reasoning navigator performs cross-modal grounding rather than the temporal attention mechanism on single-modality input.
(3) The RCM matching critic is similar to the Speaker in terms of the architecture design, but the former is used to provide the cycle-reconstruction intrinsic reward for both RL and SIL training while the latter is used to augment training data for supervised learning.
In (Wang et al., 2019), we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and the generalization problem. As shown in Fig. 18, we propose a novel Reinforced Cross-Modal Matching approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves a new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating the agent's own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
Moreover, in Wang et al. (2019) we introduce a self-supervised imitation learning method for exploration in order to explicitly address the generalization issue, which is a problem not well-studied in prior work.
Concurrently with our work, Thomason et al. (2018); Ke et al. (2019); Ma et al. (2019a, b) study the VLN task from various aspects, and Nguyen et al. (2018) introduce a variant of the VLN task in which the agent finds objects by requesting language assistance when needed. Note that we are the first to propose exploring unseen environments for the VLN task.
视觉语言导航(VLN)是指具身智能体在真实三维环境中执行自然语言指令的导航任务。在三维环境中的导航(Zhu 等人,2017a;Mirowski 等人,2016;Mousavian 等人,2018;Hemachandra 等人,2015)是移动智能系统在物理世界中运作的核心能力。过去几年涌现了大量相关任务与评估方案(Savva 等人,2017;Kolve 等人,2017;Song 等人,2017;Xia 等人,2018;Anderson 等人,2018a),如 Anderson 等人(2018b)所综述。VLN 任务(Anderson 等人,2018a)专注于真实三维环境中基于语言的导航。为解决 VLN 任务,(Anderson 等人,2018a)建立了基于注意力机制的序列到序列基线模型;(Wang 等人,2018)随后提出结合无模型与基于模型强化学习(RL)的混合方法以提升模型泛化能力;(Fried 等人,2018)则提出采用数据增强、全景动作空间和改进束搜索的说话者-跟随者模型,在 Room-to-Room 数据集上取得当时最佳性能。我们在(Wang 等人,2019)中提出强化跨模态匹配(RCM)模型进一步推进该领域:1)RCM 创新性地将多奖励 RL 与模仿学习结合,而说话者-跟随者模型(Fried 等人,2018)仅采用(Anderson 等人,2018a)中的监督学习;2)RCM 推理导航器执行跨模态 grounding 而非单模态输入的时序注意力机制;3)RCM 匹配评判器与说话者架构相似,但前者为 RL 和 SIL 训练提供循环重构内在奖励,后者用于监督学习的数据增强。如图 18 所示,我们通过强化学习实现局部与全局的跨模态 grounding:匹配评判器提供指令与轨迹全局匹配的内在奖励,推理导航器执行局部视觉场景的跨模态 grounding。在 VLN 基准测试中,RCM 模型的 SPL 指标显著超越前人方法 10%,达到新标杆。为提升策略泛化性,我们提出自监督模仿学习(SIL)方法,通过模仿自身历史优质决策探索未知环境,将已知与未知环境的成功率差距从 30.7%大幅缩减至 11.7%。同期研究中,Thomason 等人(2018)、Ke 等人(2019)、Ma 等人(2019a,b)从多角度研究 VLN 任务,Nguyen 等人(2018)提出需要时请求语言协助的 VLN 变体任务。需注意的是,我们首次提出针对 VLN 任务的未知环境探索方法。
图 18:VLN 任务中具身智能体的演示(Wang 等人,2019)。展示俯视图中的指令、局部视觉场景和全局轨迹,但智能体无法获取俯视视角。路径 A 是指令要求的示范路径,路径 B 和 C 是智能体执行的两种不同路径。
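At a schematic level, the per-step learning signal in RCM combines an extrinsic, goal-oriented reward with the intrinsic cycle-reconstruction reward from the matching critic; the sketch below treats the critic as a black-box probability and the weighting as a free hyperparameter, so it is a simplified illustration rather than the exact formulation in Wang et al. (2019).

    def extrinsic_reward(dist_before: float, dist_after: float, success: bool) -> float:
        # Progress toward the goal location, plus a terminal bonus on success.
        return (dist_before - dist_after) + (2.0 if success else 0.0)

    def intrinsic_reward(match_prob: float) -> float:
        # Matching critic's probability that the executed trajectory reconstructs
        # the original instruction (cycle-reconstruction score in [0, 1]).
        return match_prob

    def total_reward(dist_before, dist_after, success, match_prob, delta=0.5):
        # delta balances local goal progress against global instruction-trajectory matching.
        return extrinsic_reward(dist_before, dist_after, success) + delta * intrinsic_reward(match_prob)

    print(total_reward(dist_before=6.0, dist_after=4.5, success=False, match_prob=0.8))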
6.3 Healthcare 6.3 医疗保健
In healthcare, LLMs and VLMs can act as diagnostic agents, patient care assistants, or even therapy aids, but they come with unique challenges and responsibilities. With the tremendous potential for AI agents to improve patient care and save lives comes an equally dangerous possibility that their misuse or hasty deployment could endanger thousands or millions of people worldwide. We discuss some of the promising routes for AI agents within the context of healthcare and also discuss some of the key challenges they face.
在医疗保健领域,LLMs 和 VLMs 可作为诊断代理、患者护理助手甚至治疗辅助工具,但它们也伴随着独特的责任与挑战。人工智能代理在改善患者护理和拯救生命方面具有巨大潜力,但同样危险的是,它们的滥用或仓促部署可能危及全球数千甚至数百万人的生命。我们探讨了人工智能代理在医疗保健领域的一些前景广阔的路径,同时也讨论了面临的关键挑战。
Diagnostic Agents. 诊断代理。
Using LLMs as medical chatbots for patient diagnosis has recently attracted great attention due to the high demand for medical experts and the potential for LLMs to help triage and diagnose patients Lee et al. (2023). Dialogue agents, especially those that can effectively communicate important medical information to a broad range of people from diverse patient populations, have the potential to provide equitable healthcare access to historically disadvantaged or marginalized groups. Furthermore, doctors and healthcare systems across the world are largely over-burdened and under-resourced, resulting in insufficient access to medical care for hundreds of millions of people worldwide World Health Organization and World Bank (2015). Diagnostic agents provide a particularly advantageous pathway to improve healthcare for millions since they can be built with the capability to understand a variety of languages, cultures, and health conditions. Initial results have shown that healthcare-knowledgeable LMMs can be trained by utilizing large-scale web data Li et al. (2023f). Although an exciting direction, the promise of diagnostic agents does not come without risks. We highlight the risks of hallucination within medical contexts, as well as potential pathways for solutions, in the following section.
利用 LLMs 作为医疗聊天机器人进行患者诊断,近期因医疗专家资源紧缺以及 LLMs 在患者分诊和诊断方面的潜力而备受关注(Lee 等人,2023 年)。对话代理——尤其是那些能够向来自多元化患者群体的普罗大众有效传达重要医疗信息的系统——有望为历史上处于弱势或被边缘化的群体提供公平的医疗资源。此外,全球医生和医疗系统普遍面临超负荷运转与资源不足的双重压力,导致数亿人无法获得充分的医疗服务(世界卫生组织与世界银行,2015 年)。诊断代理因其可构建多语言、跨文化和全病种理解能力,成为改善数百万人医疗服务的优势路径。初步研究表明,具备医疗知识的 LLMs 可通过大规模网络数据进行训练(Li 等人,2023f)。尽管这一方向令人振奋,但诊断代理的应用仍伴随风险。我们将在下节重点探讨医疗场景下的幻觉风险及其潜在解决方案。
Knowledge Retrieval Agents.
知识检索代理
Within the medical context, model hallucinations are particularly dangerous and may even result in serious patient harm or death, depending on the severity of the error. For instance, if a patient mistakenly receives a diagnosis suggesting they are free of a condition they actually have, it can lead to catastrophic outcomes. These include postponed or inappropriate treatments, or in some cases, a total lack of necessary medical intervention. The gravity of undiagnosed or misdiagnosed conditions can lead to escalated healthcare expenses, extended therapies causing further physical strain, and in extreme scenarios, severe harm or even death. Thus, approaches that can use agents to more reliably retrieve knowledge Peng et al. (2023) or generate text in a retrieval-based manner Guu et al. (2020) are promising directions. Pairing a diagnostic agent with a medical knowledge retrieval agent has the potential to significantly reduce hallucinations while simultaneously improving the quality and preciseness of the responses of the diagnostic dialogue agent.
在医疗场景中,模型幻觉尤为危险,根据错误严重程度甚至可能导致患者重伤或死亡。例如若患者被误诊为未患实际存在的疾病,可能引发灾难性后果——包括延误治疗、错误处置,甚至完全缺失必要医疗干预。漏诊或误诊的严重性会导致医疗费用激增、长期治疗带来的身体损耗,极端情况下将造成重大伤害乃至死亡。因此,采用代理进行更可靠的知识检索(Peng 等人,2023 年)或以检索为基础生成文本(Guu 等人,2020 年)的方法极具前景。将诊断代理与医疗知识检索代理结合使用,既能显著降低幻觉发生,又可同步提升诊断对话代理的响应质量和精确度。
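A minimal sketch of such a pairing is shown below: a retrieval agent first pulls supporting passages from a vetted medical knowledge store, and the dialogue agent is then constrained to answer only from those passages. The two stub functions are hypothetical placeholders rather than a production system, and any real deployment would require clinical validation.

    def retrieve_passages(query: str, top_k: int = 3) -> list:
        # Hypothetical retrieval over a curated medical knowledge base
        # (e.g., dense or keyword search over clinical guidelines).
        return ["Guideline excerpt 1 ...", "Guideline excerpt 2 ...", "Guideline excerpt 3 ..."]

    def call_llm(prompt: str) -> str:
        # Hypothetical dialogue-model call.
        return "(grounded draft answer, with citations to the retrieved excerpts)"

    def grounded_answer(patient_question: str) -> str:
        passages = retrieve_passages(patient_question)
        numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        prompt = (
            "Answer the patient's question using ONLY the retrieved excerpts below. "
            "Cite the excerpt numbers you rely on, and say 'I don't know' if they are insufficient.\n"
            f"Excerpts:\n{numbered}\n"
            f"Question: {patient_question}"
        )
        return call_llm(prompt)

    print(grounded_answer("Can I take ibuprofen with my blood pressure medication?"))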
Telemedicine and Remote Monitoring.
远程医疗与远程监护。
Agent-based AI also has great potential within the world of Telemedicine and Remote Monitoring by improving the access to healthcare, improving communications between healthcare providers and patients, as well as improving the efficiency and reducing the costs of frequent doctor-patient interactions Amjad et al. (2023). Primary care clinicians spend significant amounts of time sifting through patient messages, reports, and emails that are often irrelevant or unnecessary for them to view. There is significant potential to allow for support agents to help triage messages from doctors, patients, and other healthcare providers and to help highlight important messages for all parties. By enabling agentic AI systems to coordinate with patients, clinicians, and other AI agents, there is a massive potential to revolutionize the remote healthcare and digital health industry.
基于智能体的 AI 在远程医疗和远程监护领域同样具有巨大潜力,可通过提升医疗可及性、优化医患沟通效率、降低高频医患互动成本等方式革新医疗体系(Amjad 等人,2023)。基层医疗从业者常需耗费大量时间筛选无关或非必要的患者消息、报告和邮件。支持性智能体在帮助分诊医患消息、为各方突出重要信息方面存在显著价值。通过使智能 AI 系统与患者、临床医生及其他 AI 代理协同工作,远程医疗和数字健康产业将迎来革命性变革机遇。
6.3.1 Current Healthcare Capabilities
6.3.1 当前医疗健康领域能力
图 19:GPT-4V 在医疗影像理解领域的示例提示与响应。从左至右分别为:(1) 护士与医生操作 CT 扫描的实拍图像,(2) 不规则心电图扫描的合成图像,(3) ISIC Codella 等人(2018)皮肤病变数据集图像。可见 GPT-4V 具备显著医学知识储备并能对医疗影像进行推理分析,但由于安全训练限制,其无法对部分医学影像做出诊断。
图 20:GPT-4V 在医疗视频理解领域的示例提示与响应。我们以 2x2 网格形式输入示例视频,叠加文字标明帧序列。前两个示例中,我们要求 GPT-4V 通过视频帧检测志愿者患者接受的临床床旁操作。最后一个示例尝试让 GPT-4V 评估超声心动图视频,但因安全训练限制未获详细响应。为清晰起见,我们加粗标注目标活动描述文本,并缩写非必要模型响应。对人脸进行灰度处理以保护隐私。
Image understanding. 图像理解。
We demonstrate the current capabilities and limitations of modern multimodal agents such as GPT-4V within the context of healthcare in Fig. 19. We can see that although GPT-4V possesses significant internal knowledge of the equipment and procedures involved in hospital care, it does not always respond to more prescriptive or diagnostic queries by the user.
我们在图 19 中展示了 GPT-4V 等现代多模态智能体在医疗健康领域的当前能力与局限。虽然 GPT-4V 对医院护理涉及的设备流程具备丰富的内在知识,但面对用户更具处方性或诊断性的查询时并不总是予以响应。
Video understanding. 视频理解。
We investigate the performance of VLM agents for medical video understanding in two contexts. First, we investigate the ability of VLM agents to identify important patient care activities in clinical spaces. Second, we explore the usage of VLMs for more technical videos such as ultrasounds. Specifically, in Figure 20, we demonstrate some of the current capabilities and limitations of GPT-4V for hospital care and medical video analysis.
我们在两种情境下研究了视觉语言模型(VLM)代理在医疗视频理解中的表现。首先,我们考察了 VLM 代理识别临床环境中重要患者护理活动的能力。其次,我们探索了 VLM 在超声等技术类视频中的应用。具体而言,在图 20 中,我们展示了 GPT-4V 模型在医院护理和医疗视频分析方面当前的部分能力与局限性。
6.4 Multimodal Agents 6.4 多模态智能体
图 21:交互式多模态智能体包含四大支柱:交互、语音、视觉与语言。协作型智能体由不同服务模块构成:1) 交互服务为自动化行动、认知与决策提供统一平台;2) 音频服务将语音处理功能整合至应用程序;3) 视觉服务识别分析图像、视频及数字墨迹内容;4) 语言服务从结构化与非结构化文本中提取语义。
The integration of visual and linguistic understanding is crucial for developing sophisticated multimodal AI agents. This includes tasks such as image captioning, visual question answering, video language generation, and video understanding, amongst others. We aim to delve into these visual-language tasks, exploring the challenges and opportunities they present in the context of AI agents.
视觉与语言理解的融合对开发复杂多模态 AI 智能体至关重要,这包括图像描述生成、视觉问答、视频语言生成及视频理解等任务。我们将深入探讨这些视觉-语言任务,分析其在 AI 智能体领域的技术榜单与发展机遇。
6.4.1 Image-Language Understanding and Generation
6.4.1 图像-语言理解与生成
Image-language understanding is a task that involves the interpretation of visual content in a given image with language and the generation of associated linguistic descriptions. This task is critical to the development of AI agents that can interact with the world in a more human-like manner. Some of the most popular tasks are image captioning Lin et al. (2014); Sharma et al. (2018); Young et al. (2014); Krishna et al. (2016), referring expression Yu et al. (2016); Karpathy et al. (2014), and visual question answering Antol et al. (2015); Ren et al. (2015); Singh et al. (2019).
图像-语言理解是指通过语言解释给定图像中的视觉内容并生成相关文字描述的任务。该任务对于开发能以更类人方式与世界交互的 AI 智能体至关重要。其中最典型的任务包括:图像描述生成(Lin 等人,2014;Sharma 等人,2018;Young 等人,2014;Krishna 等人,2016)、指代表达(Yu 等人,2016;Karpathy 等人,2014)以及视觉问答(Antol 等人,2015;Ren 等人,2015;Singh 等人,2019)。
More recently, knowledge-intensive Visual Question Answering tasks such as OKVQA Marino et al. (2019), KB-VQA Wang et al. (2015), FVQA Wang et al. (2017), and WebQA Chang et al. (2021) have been introduced. Multimodal agents should be capable of identifying objects in an image, comprehending their spatial relationships, generating accurate descriptive sentences about the scene, and utilizing reasoning skills to handle knowledge-intensive visual reasoning. This requires not just object recognition capabilities, but also a deep understanding of spatial relationships, visual semantics, and the ability to map these visual elements to linguistic constructs while integrating world knowledge.
近年来,研究者提出了知识密集型视觉问答任务,如 OKVQA(Marino 等人,2019)、KB-VQA(Wang 等人,2015)、FVQA(Wang 等人,2017)和 WebQA(Chang 等人,2021)。多模态智能体应当具备识别图像中的物体、理解其空间关系、生成准确场景描述句以及运用推理技能处理知识密集型视觉推理的能力。这不仅需要物体识别能力,还需要深刻理解空间关系、视觉语义,并能结合世界知识将这些视觉元素映射到语言结构中。
6.4.2 Video and Language Understanding and Generation
6.4.2 视频与语言理解及生成
Video-language generation.
视频-语言生成
Video captioning or video storytelling is the task of generating a sequence of coherent sentences for a stream of video frames. Inspired by the successful use of recurrent large foundation models in video and language tasks, variants of agent-driven enhanced models have shown promising results on the task of video-language generation. The fundamental challenge is that the strong performance of neural encoder-decoder models does not generalize well to visual storytelling, because the task requires a full understanding of the content of each image as well as the relation among different frames. One important goal for the field is to create an agent-aware text-synthesis model that can efficiently encode the sequence of frames and generate a topically coherent multi-sentence paragraph.
视频描述或视频叙事任务旨在为连续视频帧生成连贯的语句序列。受视频与语言任务中循环大型基础模型成功应用的启发,智能体驱动增强模型的变体在视频-语言生成任务中展现出良好前景。核心挑战在于:尽管神经编码器-解码器模型表现出色,但其性能无法直接迁移到视觉叙事任务,因为该任务需要完整理解每帧图像内容及帧间关系。该领域的重要目标是创建具备智能体意识的文本合成模型,能高效编码帧序列并生成主题连贯的多句段落。
Video Understanding. 视频理解
Video understanding extends the scope of image understanding to dynamic visual content. This involves interpretation and reasoning about the sequence of frames in a video, often in conjunction with accompanying audio or textual information. An agent should be able to interact with the visual, text, and audio modalities to demonstrate advanced comprehension of video content. Tasks in this domain include video captioning, video question answering, and activity recognition, amongst others. The challenges in video understanding are manifold. They include the temporal alignment of visual and linguistic content, the handling of long sequences of frames, and the interpretation of complex activities that unfold over time. Regarding audio, the agent could process spoken words, background noises, music, and tone of voice to comprehend the mood, setting, and subtleties of the video content.
视频理解将图像理解的范畴扩展至动态视觉内容。这涉及对视频帧序列的解释与推理,通常需要结合伴随的音频或文本信息。智能体应能通过视觉、文本和音频等多模态交互,展现其对视频内容的深层理解。该领域任务包括视频描述生成、视频问答和活动识别等。视频理解的难点主要体现在:视觉与语言内容的时间对齐、长帧序列处理以及随时间展开的复杂活动解析。在音频方面,智能体需处理语音、背景噪音、音乐和声调,以理解视频内容的情绪、场景和微妙细节。
Previous works have focused on employing existing video-language training data available online for establishing video foundational models Li et al. (2020, 2021b); Fu et al. (2022); Bain et al. (2021); Zellers et al. (2021, 2022); Fu et al. (2023). Supporting such training pipelines and functionalities is, however, difficult due to the limited and often inconsistent nature of these datasets. Video foundational models are designed with masked and contrastive pretraining objectives and later tuned on their respective tasks. Despite showing remarkable results in multimodal benchmarks, these models encounter difficulties in video-only tasks such as action recognition due to their dependency on limited video-text data built from noisy audio transcriptions. This limitation also leads to the lack of robustness and fine-grained reasoning skills that large language models generally possess.
先前研究主要利用现有在线视频-语言训练数据构建视频基础模型(Li 等人,2020,2021b;Fu 等人,2022;Bain 等人,2021;Zellers 等人,2021,2022;Fu 等人,2023)。但由于数据集有限且质量参差,支持这类训练流程和功能具有挑战性。视频基础模型采用掩码和对比预训练目标设计,后期针对特定任务微调。尽管在多模态基准测试中表现优异,这些模型在纯视频任务(如动作识别)中仍存在困难,因其依赖于从嘈杂音频转录构建的有限视频-文本数据。这种局限性也导致其缺乏大型语言模型通常具备的鲁棒性和细粒度推理能力。
Other methods, similar to those used in image-language understanding, have drawn on the strong reasoning skills and broad knowledge of large language models to improve different facets of video interpretation. The task of video understanding is simplified by language only models like ChatGPT and GPT4 or image-language models like GPT4-V, which treat the audio, video, and language modalities as individual interpretable input data types and position the agents as strong open-source models. For example, Huang et al. (2023c); Li et al. (2023g) transformed video understanding into a natural language processing (NLP) question-answering formulation by textualizing video content with open-source vision classification/detection/caption models. Lin et al. (2023) integrated GPT4-V with specialized tools in vision, audio, and speech, to facilitate complex video understanding tasks, such as scripting character movements and actions in long-form videos.
其他方法借鉴了图像语言理解领域的技术,利用大语言模型强大的推理能力和广博知识来提升视频解读的各个方面。像 ChatGPT、GPT4 这类纯语言模型,或 GPT4-V 这类图像语言模型,通过将音频、视频和语言模态视为独立可解释的输入数据类型,并将智能体定位为强大的开源模型,从而简化了视频理解任务。例如,Huang 等人(2023c)和 Li 等人(2023g)借助开源视觉分类/检测/描述模型将视频内容文本化,将视频理解转化为自然语言处理(NLP)的问答形式。Lin 等人(2023)则整合 GPT4-V 与视觉、音频、语音领域的专用工具,以处理复杂视频理解任务,例如为长视频中的人物动作和行为编写脚本。
Parallel research explores generating scaled datasets from large models, then applying visual instruction tuning Liu et al. (2023c); Li et al. (2023c); Zhu et al. (2023) on the generated data. A range of audio, speech, and visual expert perception models are subsequently used to verbalize videos. Speech is transcribed with automatic speech recognition tools, and video descriptions and related data are produced with various tagging, grounding, and captioning models Li et al. (2023g); Maaz et al. (2023); Chen et al. (2023); Wang et al. (2023f). These techniques demonstrate how instruction tuning video-language models on generated datasets may lead to enhanced video-reasoning and communication abilities.
并行研究探索从大模型生成规模化数据集,再对生成数据进行视觉指令微调(Liu 等人 2023c;Li 等人 2023c;Zhu 等人 2023)。随后采用大量音频、语音和视觉专家感知模型对视频内容进行语言化处理:通过自动语音识别工具转录语音,并运用各类标注、定位和描述模型(Li 等人 2023g;Maaz 等人 2023;Chen 等人 2023;Wang 等人 2023f)生成视频描述及相关数据。这些技术表明,在生成数据集上对视频语言模型进行指令微调,可有效增强视频推理与交互能力。
6.4.3 Experiments and Results
6.4.3 实验与结果
• Knowledge-Intensive Models: As introduced in INK Park et al. (2022) and KAT Gui et al. (2022a), an intensive neural knowledge task incorporates required knowledge annotated by humans to support knowledge-intensive retrieval tasks.
知识密集型模型:如 INK Park 等人(2022 年)和 KAT Gui 等人(2022a 年)所提出的,这是一种融入人类标注的必要知识以支持知识密集型检索任务的密集型神经知识任务。
图 22:密集神经知识(INK)示例(Park 等人,2022 年),该任务利用知识从一组候选文本中识别与图像相关的文本。我们的任务涉及利用从网络和人工标注知识中检索到的视觉与文本知识。
Knowledge-Intensive Agent.
知识密集型智能体
As shown in Fig. 22 and Fig. 23, knowledge-based visual question answering and vision-language retrieval tasks are challenging tasks in multimodal machine learning that require outside knowledge beyond the image content. Recent studies on large-scale transformers have primarily focused on maximizing the efficiency of the model's parameters to store information. This line of research explores a different aspect: whether multimodal transformers can use explicit knowledge in their decision-making process. Pretraining methods based on transformers have shown remarkable success in implicitly learning knowledge representations across multiple modalities. However, traditional methods, mainly unimodal, have investigated knowledge retrieval and subsequent answer prediction, raising questions about the quality and relevance of the knowledge retrieved and the integration of reasoning processes using both implicit and explicit knowledge. To tackle these issues, we introduce the Knowledge Augmented Transformer (KAT), which outperforms others by 6% on the 2022 OK-VQA open-domain multimodal task. KAT combines implicit knowledge from GPT-3 with explicit knowledge from websites using an encoder-decoder structure, and allows for concurrent reasoning with both knowledge types during answer generation. Furthermore, incorporating explicit knowledge enhances the interpretability of the model's predictions. The code and pre-trained models are available at https://github.com/guilk/KAT.
如图 22 和图 23 所示,基于知识的视觉问答与视觉语言检索任务是多模态机器学习中极具挑战性的任务,需要理解图像内容之外的背景知识。近期关于大规模 Transformer 的研究主要聚焦于最大化模型参数存储信息的效率,而本研究则探索了另一个维度:多模态 Transformer 能否在决策过程中利用显性知识。基于 Transformer 的预训练方法已展现出跨模态隐式学习知识表征的卓越能力,但传统方法(主要是单模态方法)研究了知识检索与后续答案预测的关系,这引发了关于检索知识质量、相关性以及如何结合隐性与显性知识进行推理的疑问。为解决这些问题,我们提出了知识增强型 Transformer(KAT),该模型在 2022 年 OK-VQA 开放领域多模态任务中以 6%的优势超越其他模型。KAT 通过编码器-解码器架构,将 GPT3 的隐性知识与网络显性知识相结合,并在答案生成过程中支持两种知识的并行推理。此外,引入显性知识还增强了模型预测的可解释性。代码与预训练模型详见 https://github.com/guilk/KAT。
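The data flow can be summarized schematically as follows: explicit knowledge entries retrieved by a contrastive module and implicit candidates (with supporting evidence) drawn from GPT-3 are encoded separately and then jointly reasoned over by the answer decoder. The sketch below only mirrors that flow with stubs and is not the released KAT implementation (see the GitHub link above for the real code).

    def retrieve_explicit_knowledge(image_regions: list, top_k: int = 5) -> list:
        # Contrastive retrieval of entity descriptions from an explicit knowledge base (stub).
        return [f"entity description {i}" for i in range(top_k)]

    def query_implicit_knowledge(question: str, caption: str, n: int = 3) -> list:
        # GPT-3-style tentative answers with supporting evidence (stub).
        return [{"candidate": f"answer {i}", "evidence": f"evidence {i}"} for i in range(n)]

    def answer_question(question: str, image_regions: list, caption: str) -> str:
        # In KAT, the two knowledge streams go through separate encoders and the decoder
        # attends to both jointly; here we simply show what the decoder gets to see.
        decoder_inputs = {
            "question": question,
            "explicit": retrieve_explicit_knowledge(image_regions),
            "implicit": query_implicit_knowledge(question, caption),
        }
        return (f"(answer generated from {len(decoder_inputs['explicit'])} explicit and "
                f"{len(decoder_inputs['implicit'])} implicit knowledge entries)")

    print(answer_question("What country do these flags belong to?", ["region_0", "region_1"], "flags on a pole"))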
图 23:KAT 模型(Gui 等人,2022a)采用基于对比学习的模块从显式知识库中检索知识条目,并利用 GPT-3 检索带有佐证依据的隐式知识。知识整合过程由各自编码器 Transformer 处理,通过端到端训练与推理模块及解码器 Transformer 协同工作以生成答案。
图 24:VLC 模型整体架构(Gui 等人,2022b)。我们的模型包含三个模块:(1) 模态特定投影。使用简单线性投影嵌入图像块,采用词嵌入层处理分词文本;(2) 多模态编码器。以 12 层 ViT(Dosovitskiy 等人,2021)为骨干网络,该网络由 MAE(He 等人,2022,无标签的 ImageNet-1K)初始化;(3) 任务特定解码器。通过掩码图像/语言建模和图文匹配学习多模态表示(仅用于预训练阶段)。在下游任务微调时,采用 2 层 MLP 调整多模态编码器。关键发现表明,掩码图像建模目标在整个第二阶段预训练中都至关重要,而不仅限于视觉 Transformer 的初始化阶段。
Vision-language Transformer Agent.
视觉语言 Transformer 智能体。
Next, we introduce the "Training Vision-Language Transformers from Captions" (VLC) model Gui et al. (2022b), a transformer that has been pretrained exclusively with image-caption pairs. Despite using just a simple linear projection layer for image embeddings, VLC attains competitive results across various vision-language tasks, in contrast to other methods that depend on object detectors or supervised CNN/ViT networks. Through extensive analysis, we explore the potential of VLC as a vision-language transformer agent. For instance, we show that VLC’s visual representations are highly effective for ImageNet-1K classification, and our visualizations confirm that VLC can accurately match image patches to corresponding text tokens. The scalability of performance with more training data highlights the promising potential for developing large-scale, weakly-supervised, open-domain vision-language models.
接下来我们介绍"基于字幕训练的视觉语言 Transformer"(VLC)模型(Gui 等人,2022b),这是一种仅通过图像-字幕对进行预训练的 Transformer。尽管仅使用简单的线性投影层来处理图像嵌入,与其他依赖目标检测器或有监督 CNN/ViT 网络的方法相比,VLC 在各种视觉语言任务中仍能取得具有竞争力的结果。通过大量分析,我们探索了 VLC 作为视觉语言 Transformer 智能体的潜力。例如,我们证明 VLC 的视觉表征对 ImageNet-1K 分类非常有效,可视化结果也证实 VLC 能准确将图像块与对应文本标记相匹配。随着训练数据增加而提升的性能扩展性,彰显了开发大规模、弱监督、开放域视觉语言模型的广阔前景。
6.5 Video-language Experiments
6.5 视频语言实验
To understand the practicality of converting pre-trained image-LLMs for video understanding, we temporally expand and fine-tune InstructBLIP Dai et al. (2023) for video captioning. Specifically, we expand the visual encoder of InstructBLIP (EVA-CLIP-G Sun et al. (2023b)) using the same divided space-time attention scheme as Frozen in Time Bain et al. (2021) and keep the Q-former and LLM (Flan-T5-XL Chung et al. (2022)) frozen during training. We freeze all spatial layers of the visual encoder, while keeping the temporal layers unfrozen during captioning training. This allows our model to take images and videos as input (matching the image-level performance of InstructBLIP). We train on a 5 million video-caption subset of WebVid10M Bain et al. (2021). We visualize two example outputs in Figure 25. However, existing agents fail to fully comprehend precise, fine-grained visual details in the video content. A similar limitation is seen in visual instruction tuning methods, which lack the general, human-level perception abilities that remain to be achieved by multimodal models and agents.
为探究将预训练图像 LLM 转化为视频理解模型的实用性,我们采用时间维度扩展方法对 InstructBLIP(Dai 等人,2023)进行微调以适用于视频描述任务。具体而言,我们采用与 Frozen in Time(Bain 等人,2021)相同的时空分离注意力机制扩展 InstructBLIP 的视觉编码器(EVA-CLIP-G,Sun 等人,2023b),并在训练过程中保持 Q-former 和 LLM(Flan-T5-XL,Chung 等人,2022)参数冻结。在视频描述训练阶段,我们冻结视觉编码器所有空间层参数,仅解冻时间层参数,使模型能同时处理图像与视频输入(保持与 InstructBLIP 相当的图像级性能)。训练数据采用 WebVid10M(Bain 等人,2021)中 500 万视频-字幕子集,图 25 展示了两个输出示例。然而现有智能体仍无法完全理解视频内容中精确的细粒度视觉细节,这与视觉指令调优方法面临的局限性类似——它们缺乏人类水平的通用感知能力,这仍是多模态模型与智能体亟待解决的问题。
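The freezing scheme described above (spatial layers, Q-former, and LLM frozen; temporal attention layers trainable) can be expressed in a few lines of PyTorch; the name-matching convention below ("temporal" in the parameter name) is an assumption about how the expanded layers are registered, not the exact code used.

    import torch.nn as nn

    def freeze_all_but_temporal(model: nn.Module) -> None:
        """Freeze every parameter except those belonging to temporal-attention layers."""
        for name, param in model.named_parameters():
            param.requires_grad = "temporal" in name   # assumes temporal layers are named accordingly

    def count_trainable(model: nn.Module) -> int:
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Usage (illustrative): after temporally expanding the visual encoder,
    #   freeze_all_but_temporal(video_instructblip.visual_encoder)
    #   print(count_trainable(video_instructblip.visual_encoder), "trainable parameters")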
图 25:使用视频微调版 InstructBLIP(方法详见第 6.5 节)时的提示示例与响应结果。我们的模型能够生成描述场景的长文本回复,并能回答与视频中事件时序性相关的问题。
The instruction-tuned models show promise in accurately summarizing visible actions within videos and identifying actions like "person sitting on a bench" effectively in Fig. 25. However, they sometimes add incorrect details, such as "person smiling to the camera," revealing a shortfall in capturing conversation topics or the video’s ambiance, elements that are readily apparent to human observers. This shortfall underscores another key limitation: the omission of audio and speech modalities that would enrich the video understanding with context, aiding in more accurate interpretation and preventing such misrepresentations. Bridging this gap requires a holistic integration of available modalities, allowing multimodal agents to reach a level of comprehension akin to human perception and ensuring a fully multimodal approach to video interpretation.
指令调优模型在准确总结视频中可见动作方面展现出潜力,例如能有效识别图 25 中"人坐在长椅上"这类动作。然而这些模型偶尔会添加错误细节,比如"人对镜头微笑",暴露出其在捕捉对话主题或视频氛围方面的不足——这些元素对人类观察者而言显而易见。这一缺陷揭示了另一个关键局限:缺失了能够通过上下文丰富视频理解的音频和语音模态,这些模态本可帮助实现更精准的解读并避免此类误判。弥合这一鸿沟需要对现有模态进行整体整合,使多模态智能体能够达到类似人类感知的理解水平,从而确保视频解读真正实现全模态化。
Audio-Video-Language Agents with GPT-4V.
基于 GPT-4V 的视听语言智能体
图 26:第 6.5 节描述的音频多模态智能体。幻觉内容以红色高亮显示。我们使用 GPT-4V 生成:1)带视频帧的视讯聊天摘要;2)带帧描述的影像摘要;3)结合帧描述与音频信息的影像摘要
图 27:一个结合视觉、音频和文本模态以实现视频理解的交互式多模态智能体。我们的流程通过挖掘困难负样本幻觉,为 VideoAnalytica 挑战赛生成复杂查询。关于交互式音视频语言智能体数据集的更多细节详见第 9.2 节。
We then evaluate the capabilities of GPT-4V as a multimodal agent that integrates vision, audio, and speech for a nuanced and precise understanding of videos, following the methodology outlined in Lin et al. (2023). Results depicted in Fig. 26 compare the performance of various video agents on the task of video summarization. The video-instruction tuned model (Li et al., 2023g) provides accurate content but falls short on comprehensiveness and detail, missing specific actions like the methodical use of a broomstick to measure a tree’s height.
我们随后按照 Lin 等人(2023)提出的方法,评估了 GPT-4V 作为多模态代理的能力,该代理整合了视觉、音频和语音功能,以实现对视频细致而精确的理解。图 26 所示结果比较了各种视频代理在视频摘要任务上的表现。经过视频指令调优的模型(Li 等人,2023g)能提供准确内容,但在全面性和细节方面有所欠缺,例如遗漏了系统使用扫帚测量树高等具体动作。
To enhance the accuracy of video descriptions, we employ GPT-4V to caption frames, while audio and its transcriptions are sourced from the OpenAI Whisper model. We then prompt GPT-4V to create video summaries using only frame captions and then using both frame captions and audio transcriptions. Initially, we observe that frame captions alone can lead to fabricated events, such as a person biting down on a stick in the third segment. These inaccuracies persist in the video summary, with descriptions like "in a playful twist, he bites down on it while holding it horizontally." Without audio input, the agent cannot correct these captioning errors, resulting in descriptions that are semantically correct but visually misleading.
为提高视频描述的准确性,我们采用 GPT-4V 生成画面字幕,同时通过 OpenAI Whisper 模型获取音频及其文字转录。随后我们引导 GPT-4V 分别仅使用画面字幕、以及结合画面字幕与音频转录来生成视频摘要。初期观察发现,仅依赖画面字幕会导致虚构事件,例如第三片段中描述"人物水平咬住棍子"这类失实内容。这些错误会延续至视频摘要中,形成"他以俏皮的动作横向咬住棍子"等语义正确但视觉误导的描述。缺乏音频输入时,智能体无法修正这些字幕错误。
However, when we provide the audio transcriptions to the agent, it manages to accurately depict the content, even capturing detailed physical actions like "holding the broomstick perpendicular to the body and rotating it downwards." This level of detail is significantly more informative and gives viewers a clearer understanding of the video’s purpose and key details. These findings highlight the importance of integrating audio, video, and language interactions to develop high-quality multimodal agents. GPT-4V emerges as a promising foundation for such advanced multimodal understanding and interaction.
当向智能体提供音频转录后,其能准确呈现视频内容,甚至捕捉到"将扫帚杆垂直身体持握并向下旋转"等细节动作。这种精细描述显著提升了信息量,使观众能更清晰理解视频主旨与关键细节。这些发现凸显了整合音频、视频与语言交互对开发高质量多模态智能体的重要性。GPT-4V 展现出作为先进多模态理解与交互基础模型的巨大潜力。
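The following is a minimal sketch of this frame-captioning plus Whisper-transcription workflow using the OpenAI Python client. The model names, prompts, and file paths are illustrative placeholders rather than the exact configuration used in our experiments.

```python
import base64
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()

def transcribe_audio(audio_path: str) -> str:
    # Whisper transcription of the video's audio track.
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def caption_frame(jpeg_path: str) -> str:
    # Per-frame caption from GPT-4V; the model name is illustrative.
    with open(jpeg_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this video frame in one sentence."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

def summarize(frame_captions: list[str], transcript: str | None = None) -> str:
    # Summary from frame captions alone, or from frame captions plus the
    # audio transcription (the setting that corrected the caption errors).
    context = "\n".join(f"[frame {i}] {c}" for i, c in enumerate(frame_captions))
    if transcript is not None:
        context += f"\n[audio transcription]\n{transcript}"
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": f"Summarize the video given these annotations:\n{context}"}],
    )
    return resp.choices[0].message.content
```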
Embodied Multi-modal Agents with GPT-4V.
具身多模态智能体与 GPT-4V
As shown in Fig. 27, we first used StackOverflow to obtain the initial question, and then used the "Bing search" API to retrieve a related video and its corresponding audio. Next, we mainly use GPT-4V to obtain the relevant text information and a high-level video description. In parallel, we convert the key-frame audio into low-level segment descriptions of the key frames via ASR. Finally, we use GPT-4V to generate convincing "hallucinations" that serve as hard-negative queries for video question-answering tasks. We support interactions and question answering on the current frame of the video, as well as summarization of the overall high-level video description. During inference, we also incorporate external knowledge via web search to improve answering capabilities.
如图 27 所示,我们首先通过 StackOverflow 获取初始问题,继而使用"Bing 搜索"API 检索与问题相关的视频及音频。接着主要利用 GPT-4V 提取相关文本信息和高阶视频描述,同时通过自动语音识别(ASR)将关键帧音频转换为帧级细分描述。最后运用 GPT-4V 生成具有说服力的"幻觉"内容,作为视频问答任务的困难负样本。我们支持视频当前帧的交互问答,以及整体高阶视频描述的总结。在推理过程中,还通过网页搜索结合外部知识以提升回答能力。
The main prompt used for GPT-4V is described below; the entire prompt is over one page long.
GPT-4V 的主要提示信息描述如下。为清晰起见,整个提示内容进行了缩进处理;其长度超过一页。
GPT-4V is an assistant that provides descriptive, informative, and fully comprehensive details of a video for visually impaired users who can hear the video but cannot see it. The job is to create high-quality, dense descriptions of the video by synthesizing the given annotations and to output them as JSON. Specifically, GPT-4V will be given the original query used to search for the video, the video title, description, audio transcription, and potentially noisy descriptions for specific times in the video. Different segments of the same video are annotated as "[time start - time end (in seconds)] ’text’". Utilize the transcriptions and descriptions together to reason about the exact details and visual demonstrations that might be happening in the video. GPT-4V should combine or segment the timestamps as necessary to provide the best segmentation of the video.
GPT-4V 是专为视觉障碍人士设计的辅助工具,能够为可听见但无法观看视频的用户提供详尽、信息丰富且全面完整的视频描述。其核心任务是通过整合给定的视频标注信息,生成高质量、密集的视频内容描述,并以 JSON 格式输出。具体而言,GPT-4V 将接收用于搜索视频的原始查询、视频标题、描述、音频转录文本,以及视频中特定时间点可能存在的噪声描述。同一视频的不同片段会标注为"[开始时间-结束时间(秒)] '文本'"格式。系统需综合运用文字转录和描述内容,推理视频中可能发生的具体细节和视觉呈现。GPT-4V 将根据需要合并或分割时间戳,以实现最佳的视频分段效果。
Expectations for GPT-4V Output:
对 GPT-4V 输出的预期要求:
1. Action-Oriented Descriptions: Prioritize plausible actions, motions, and physical demonstrations that the audio implies, enriching your narrative with dynamic visual cues.
1. 动作导向描述:优先呈现音频隐含的合理动作、运动轨迹和物理演示,用动态视觉线索丰富叙述内容。
2. Complete Video Coverage: Provide a continuous and consistent audio-descriptive experience that covers every moment of the video’s duration, ensuring no content is left undescribed.
2. 完整视频覆盖:提供持续连贯的音频描述体验,确保视频时长内的每个时刻都有对应描述,杜绝内容遗漏。
3. Concise Segmentation: Construct your descriptions in focused, succinct segments of 1-2 sentences each to effectively communicate visual actions without overwhelming detail.
3. 精炼分段处理:将描述内容组织成每段 1-2 句的聚焦式简洁段落,既能有效传达视觉动作,又避免细节堆砌。
4. Contextual Audio-Visual Synthesis: Seamlessly blend the spoken audio content with inferred visual elements to form a narrative that reflects potential onscreen activities.
4. 音画语境融合:将口述音频内容与推断的视觉元素无缝结合,构建反映潜在屏幕活动的叙述流。
5. Imaginative and Plausible Speculation: Infuse your descriptions with creative yet believable visual details that correspond with the audio, enhancing scene comprehension.
5. 合理想象推测:在描述中融入与音频对应的创造性但可信的视觉细节,强化场景理解。
6. Accurate Timecode Correspondence: Align your descriptive segments with corresponding timecodes, ensuring that speculative visual details synchronize with the audio narrative’s timeline.
6. 精准时间轴对应:确保描述段落与对应时间码严格对齐,使推测的视觉细节与音频叙述时间线同步。
7. Confident Narrative Delivery: Present the descriptions with assurance, as though the speculated visuals are occurring, to instill confidence in the listener.
7. 自信叙述表达:以确信无疑的语气呈现描述内容,使听众确信推测的视觉画面正在发生。
8. Omit Implausible Details: Exclude descriptions of objects or events that do not reasonably fit within the context established by the audio and visual information provided.
8. 排除不合理细节:剔除音频和现有视觉信息背景下明显不符的对象或事件描述。
The final output should be structured in a JSON format containing a list of dictionaries, each detailing a segment of the video.
最终输出应采用 JSON 格式,包含详细描述视频片段的字典列表。
[ ‘start’: <start-time-in-seconds>, ‘end’: <end-time-in-seconds>, ‘text’: “<Your detailed single-sentence, audio-visual description here>" ]
[ ‘开始’: <开始时间(秒)>, ‘结束’: <结束时间(秒)>, ‘文本’: “<此处填写详细的单句视听描述>" ]
For MC Creation: our task is to create multiple-choice questions for video-to-text retrieval tasks that are trivially solved by looking at the title and reading through the audio transcriptions. To do so, we will be given the original query used to get the video, the description, the audio transcription, and potentially noisy descriptions for specific times in the video.
多选题生成任务:我们的目标是为视频转文本检索任务创建多选题,这类题目仅通过查看标题和阅读音频转录即可轻易解答。为此,我们将获得原始查询以获取视频、描述、音频转录,以及视频中特定时间点可能存在的噪声描述。
- Format of audio transcription: -[start-end time in seconds] “transcription"
- 音频转录格式:-[开始-结束时间(秒)] "转录文本"
- Format of noisy description: - [time in seconds] “description"
- 噪声描述格式:- [时间点(秒)] "描述内容"
We kindly ask GPT-4V to generate four queries, where the primary query is aligned with the video content and the other three negatives are subtly different from the primary one. Selecting the primary one should not simply involve listening to the audio transcriptions (e.g., when the original query text is contained in the audio transcriptions). The negatives should be closely related but not fully aligned with the video content, requiring visual understanding of the video to differentiate. For example, modify the semantics in a nuanced way so that one needs to watch the video, rather than just listen, to select the original query. Compile the four queries as caption-like statements, with the first one being the rephrased original.
我们要求 GPT-4V 生成四个查询选项,其中首要选项需与视频内容严格匹配,其余三个干扰项则需与首要选项存在细微差异。正确选择首要选项不应仅通过听取音频转录即可完成(例如原始查询文本直接出现在音频转录中)。干扰项应当与视频内容高度相关但未完全吻合,需通过视觉理解才能区分。例如通过微妙调整语义,使得用户必须观看视频而非仅凭听力来选择原始查询。最终以字幕式陈述句呈现四个选项,首项为改写后的原始查询。
Think step by step about how to come up with negative statements using the information from the video. Justify why the negative queries are incorrect but still compelling choices that demand a nuanced understanding of the video, and explain why humans would not accidentally choose the negatives over the original query.
请逐步思考如何利用视频信息构建干扰项。需论证这些干扰项虽然不正确,但仍是具有迷惑性的选项,要求对视频内容有细致理解。同时说明人类用户为何不会误选干扰项而非原始查询。
Finally, we present the work in the following format of analyses and 4 queries. There is no need to explain how the original query was rephrased.
最后,我们以下列分析框架和 4 项查询的形式呈现研究成果。无需说明原始查询的具体翻译过程。
- Video Analysis: xxx 视频分析:xxx
- Queries: [query1, query2, query3, query4]
  查询:[查询 1, 查询 2, 查询 3, 查询 4]
- Justification: xxx 论证依据:xxx
6.6 Agent for NLP 6.6 面向自然语言处理的智能体
6.6.1 LLM agent 6.6.1 LLM 智能体
Recognizing task directives and taking action has been a fundamental challenge in interactive AI and natural language processing for decades.
With the recent advances in deep learning, there is a growing interest in studying these areas jointly to improve human-agent collaboration.
We identify three specific directions, among others, to improve language-grounded agents:
数十年来,识别任务指令并采取相应行动始终是交互式人工智能和自然语言处理领域的核心挑战。随着深度学习的最新进展,联合研究这些领域以提升人机协作效能正引发日益广泛的关注。我们特别提出三个改进语言基础智能体的发展方向:
-
•
Tool use and querying from knowledge bases. This direction emphasizes the importance of integrating external knowledge bases, web search, or other helpful tools into the reasoning processes of AI agents. By leveraging structured and unstructured data from various sources, agents can enhance their understanding and provide more accurate and context-aware responses. Furthermore, it fosters the agent’s ability to proactively seek out information when faced with unfamiliar scenarios or queries, ensuring more comprehensive and informed responses. Examples include Toolformer Schick et al. (2023) and Retrieve What You Need Wang et al. (2023g).
工具调用与知识库查询。该方向强调将外部知识库、网络搜索或其他辅助工具整合至 AI 智能体推理流程的重要性。通过利用来自多源的结构化与非结构化数据,智能体可深化理解能力,提供更精准且情境感知的响应。此外,这能增强智能体在遭遇陌生场景或查询时主动获取信息的能力,确保回应的全面性与可靠性。典型案例如 Schick 等人(2023)的 Toolformer 和王等人(2023g)的 Retrieve What You Need。 -
•
Improved agent reasoning and planning. Enhancing the agent’s ability to reason and plan is pivotal for effective human-agent collaboration. This involves the development of models that can understand complex instructions, infer user intentions, and predict potential future scenarios. This can be accomplished by asking the agent to reflect on past actions and failures, as in ReAct Yao et al. (2023a), or by structuring the agent’s thought process as a form of search Yao et al. (2023b). By simulating different outcomes and assessing the ramifications of various actions, agents can make more informed, context-aware decisions (a minimal sketch of such a reason-act loop is given after this list).
改进智能体推理与规划能力。提升智能体的推理与规划能力对于实现高效人机协作至关重要,这需要开发能够理解复杂指令、推断用户意图并预测未来潜在场景的模型。实现路径包括让智能体像 ReAct(Yao 等人,2023a)研究那样反思过往行动与失败经验,或是如 Yao 等人(2023b)提出的将智能体思维过程构建为搜索机制。通过模拟不同结果并评估各类行动的潜在影响,智能体能够做出更具环境感知力的明智决策。 -
•
Incorporating system and human feedback. AI agents can frequently operate in two primary contexts: environments that provide explicit signals about the effectiveness of their actions (system feedback), and settings where they collaborate with humans who can offer verbal critiques (human feedback). This direction underscores the need for adaptive learning mechanisms that allow agents to refine their strategies and rectify mistakes, as in AutoGen Wu et al. (2023). The ability to continuously learn and adapt from diverse feedback sources ensures that agents remain helpful and aligned with user needs.
融入系统和人类反馈。AI 智能体通常可在两种主要情境下运作:一种是环境提供明确行动效果信号(系统反馈),另一种是与能提供口头评价的人类协作(人类反馈)。这一研究方向强调需要自适应学习机制,使智能体能够优化策略并纠正错误,例如 AutoGen Wu 等人(2023)所示例。通过持续从多元反馈源学习与适应的能力,确保智能体始终贴合用户需求并保持实用性。
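Combining the first two directions above, the sketch below shows a minimal ReAct-style reason-act loop in which the agent interleaves short reasoning steps with calls to external tools before producing an answer. The `llm` and tool callables are placeholders supplied by the caller, and the text format is illustrative; this is a sketch of the idea, not the original ReAct or Toolformer implementation.

```python
from typing import Callable

def react_agent(question: str,
                llm: Callable[[str], str],
                tools: dict[str, Callable[[str], str]],
                max_steps: int = 5) -> str:
    """Interleave model reasoning with tool calls until the model finishes."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for the next thought and action in a fixed format.
        step = llm(transcript + "Thought + Action (use Tool[args] or Finish[answer]):")
        transcript += step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        # If the model invoked a known tool, run it and append the observation.
        for name, tool in tools.items():
            if f"{name}[" in step:
                args = step.split(f"{name}[", 1)[1].split("]", 1)[0]
                transcript += f"Observation: {tool(args)}\n"
                break
    return llm(transcript + "Give the final answer:")
```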
6.6.2 General LLM agent 6.6.2 通用 LLM 智能体
Recognizing and understanding agent content and natural language has been a fundamental challenge in interactive AI and natural language processing for decades. With the recent advances in deep learning, there is growing interest in studying these two areas jointly, for a deep understanding of both agent planning and human feedback, in support of knowledge inference and natural language generation. These are the key components of many human-machine interaction agents, such as “AutoGen" Wu et al. (2023) and “Retrieve What You Need" Wang et al. (2023g).
数十年来,识别和理解智能体内容与自然语言一直是交互式人工智能和自然语言处理领域的核心挑战。随着深度学习技术的最新突破,学界日益关注将这两个领域相结合进行深入研究,以期实现智能体规划/人类知识推理反馈与自然语言生成的深度融合。这些技术构成了诸多"人机交互智能体"(如吴等人 2023 年提出的"AutoGen"与王等人 2023g 提出的"Retrieve What You Need")的关键组件。
6.6.3 Instruction-following LLM agents
6.6.3 指令遵循型 LLM 智能体
Furthermore, the creation of LLM agents that can be trained to effectively follow human instructions has become an important area of research. Initial models used human feedback to train a proxy reward model that simulates human preferences, through a process known as Reinforcement Learning from Human Feedback (RLHF) Ouyang et al. (2022). This process produced models such as InstructGPT and ChatGPT. To train instruction-following LLM agents more efficiently and without needing human labels, researchers developed instruction-tuning methods that train the LLM agent directly on instruction/response pairs, generated either by humans, as with Dolly 2.0, or automatically from LLMs, as with Alpaca Taori et al. (2023). We show the overall Alpaca training pipeline in Figure 28.
此外,训练能够有效遵循人类指令的 LLM 智能体已成为重要研究领域。早期模型通过人类反馈训练代理奖励模型来模拟人类偏好,这一过程被称为基于人类反馈的强化学习(RLHF)(Ouyang 等人,2022)。该方法催生了 InstructGPT 和 ChatGPT 等模型。为更高效地训练指令跟随型 LLM 智能体而无需人工标注,研究者开发了更高效的指令微调方法——直接使用指令/响应对训练 LLM 智能体,这些数据对或由人类生成(如 Dolly 2.0),或通过 LLMs 自动生成(如 Alpaca,Taori 等人,2023)。图 28 展示了 Alpaca 的整体训练流程。
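As an illustration of this kind of instruction tuning, the snippet below formats instruction/response pairs into training strings using the widely used Alpaca-style prompt template; the template wording and the field names ("instruction", "output") are assumptions about the dataset schema rather than a verbatim reproduction of the Alpaca codebase.

```python
# Formatting instruction/response pairs for supervised instruction tuning,
# in the spirit of the Alpaca recipe described above.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(example: dict) -> str:
    """Concatenate the prompt and the target response into one training string."""
    prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    return prompt + example["output"]

# Example:
# format_example({"instruction": "List three uses of a broomstick.",
#                 "output": "Sweeping, reaching high shelves, measuring height."})
```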
图 28:Alpaca 模型的训练方案(Taori 等人,2023)。简而言之,该方法利用现有 LLMs 从小规模种子任务中生成大量指令跟随样本,随后将这些生成的指令样本用于对权重可获取的基础 LLM 模型进行指令微调。
图 29:逻辑转换器代理模型(Wang 等人,2023e)。我们在基于转换器的抽象摘要模型中集成了逻辑推理模块,使逻辑代理具备文本与对话逻辑的推理能力,从而生成更高质量的抽象摘要并减少事实性错误。
6.6.4 Experiments and Results
6.6.4 实验与结果
Despite the growing adoption of conversational and self-feedback systems, these forms of AI still do not perform well at generating factually correct responses from their own implicit knowledge, and therefore often use external tools such as web search and knowledge retrieval mechanisms at inference time to augment their responses. Addressing this would help create more engaging experiences for users in many real-life applications. In social conversations (such as those on social media platforms like Instagram and Facebook), or on Q&A websites (such as Ask or Quora), people usually engage with others through a series of comments and by web-searching for information and knowledge relevant to the discussion. Thus, the task of generating conversational turns in this context is not to simply bootstrap upon traditional NLP models and tasks, but to use agents to generate dialogue through intelligent behaviors that reflect knowledge search and acquisition Peng et al. (2023). In this way, intelligent agents for NLP tasks extend the task description and improve the interpretability of the response by adding an explicit knowledge search and retrieval step during dialogue. Incorporating these web search and retrieval agents as feedback during dialogue will help foster deeper and more engaged social interactions between humans and agents Wang et al. (2023e). As shown in Fig. 29, we introduced a new modeling paradigm for transformer language models that detects and extracts important logical structures and information from input texts and then integrates them into the input embeddings through carefully designed multi-layer hierarchical logical projections, infusing logical structures into pre-trained language models as one kind of NLP agent. Wang et al. (2023e) propose a novel approach to construct logic-aware input embeddings for transformer language models through a combination of logic detection, logic mapping, and hierarchical logical projections, and then develop a corresponding new modeling paradigm that can upgrade all existing transformer language models into logical transformers to consistently boost their performance. The proposed logical transformer agent consistently achieves superior performance over its baseline transformer models through a deeper understanding of the logical structures of texts. To human users, it is often these aspects that matter most for delivering a meaningful and interesting conversation via an agent-based coordination between dialogue and information retrieval.
Delving deeper into natural language processing, this topic discusses the advancements and challenges in making LLMs more agentic and better suited for various language-centered tasks.
尽管对话式系统和自我反馈系统的应用日益广泛,但这类人工智能在基于自身隐性知识生成事实性正确回应方面仍表现欠佳,因此常在推理阶段依赖网络搜索和知识检索机制等外部工具来增强响应能力。解决这一问题将有助于在诸多现实应用中为用户创造更具吸引力的交互体验。在社交媒体平台(如 Instagram 和 Facebook)的社交对话,或问答网站(如 Ask 和 Quora)的互动中,人们通常通过系列评论和网络信息检索参与讨论。因此,这类场景下的对话轮次生成任务并非简单套用传统 NLP 模型,而是需要智能体通过反映知识搜索与获取的智能行为来产生对话(Peng 等,2023)。这种面向 NLP 任务的智能体通过显式增加对话过程中的知识检索步骤,既扩展了任务描述范畴,又提升了响应结果的可解释性。将网络搜索与检索智能体作为对话反馈机制,能促进人类与智能体之间更深入的社会化互动(Wang 等,2023e)。如图 29 所示,我们提出了一种新型 Transformer 语言模型建模范式:通过检测并提取输入文本中的关键逻辑结构与信息,再经精心设计的层级逻辑投影将其融入输入嵌入,从而将逻辑结构注入预训练语言模型作为 NLP 智能体的一种实现。Wang 等(2023e)提出了通过逻辑检测、逻辑映射和层级逻辑投影构建逻辑感知输入嵌入的新方法,继而开发出可将现有 Transformer 语言模型升级为逻辑 Transformer 的新型建模范式,持续提升模型性能。这种逻辑 Transformer 智能体凭借对文本逻辑结构的深层理解,始终优于基线 Transformer 模型。对人类用户而言,正是这种基于智能体的对话-信息检索协同机制,对实现有意义且有趣的对话至关重要。本专题将深入探讨自然语言处理领域如何通过增强 LLMs 的智能体特性,使其更胜任各类以语言为中心的任务,并分析相关技术进展与性能排行榜。
An open-domain question answering (QA) system usually follows a retrieve-then-read paradigm, in which a retriever is used to retrieve relevant passages from a large corpus, and then a reader generates answers based on the retrieved passages and the original question. In Wang et al. (2023g), we propose a simple and novel mutual learning framework to improve the performance of retrieve-then-read-style models via an intermediate module named the knowledge selector agent, which we train with reinforcement learning. This fine-grained knowledge selector is inserted into the retrieve-then-read paradigm, and its goal is to construct a small subset of passages that retain question-relevant information. As shown in Figure 30, the knowledge selector agent is trained as a component of our novel mutual learning framework, which iteratively trains the knowledge selector and the reader.
We adopt a simple and novel approach employing policy gradients to optimize the knowledge selector agent, using feedback from the reader to train it to select a small and informative set of passages. This approach avoids brute-force search and manually designed heuristics, and it does not require any annotated query-document pairs for supervision.
We show that iteratively training the reader and the knowledge selector agent leads to better predictive performance on some public open-domain question answering benchmarks.
开放域问答系统通常采用"检索-阅读"范式:先通过检索器从大规模语料库获取相关段落,再由阅读器基于检索结果和原始问题生成答案。在 Wang 等(2023g)的研究中,我们提出了一种新颖的互学习框架,通过名为"知识选择器智能体"的中间模块(采用强化学习训练)来提升该范式的性能。该细粒度知识选择器旨在构建保留问题相关信息的精简段落子集。如图 30 所示,知识选择器智能体作为互学习框架的组件,与阅读器进行迭代训练。我们采用策略梯度法优化知识选择器,利用阅读器的反馈训练其选择信息量高的小规模段落集合。这种方法避免了暴力搜索或人工设计启发式规则,且无需监督式的查询-文档标注对。实验表明,阅读器与知识选择器智能体的迭代训练能有效提升多个公开开放域问答基准的预测性能。
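A minimal sketch of the policy-gradient idea is given below: the selector samples a small passage subset and is updated with a REINFORCE-style loss weighted by the reader's feedback. The exact reward definition and subset-sampling scheme of Wang et al. (2023g) may differ; this only illustrates the training signal.

```python
import torch

def reinforce_selector_step(selector_logits: torch.Tensor,
                            reader_reward: torch.Tensor,
                            k: int = 5):
    """One REINFORCE-style update signal for a passage-selector agent.

    selector_logits: shape (num_passages,), the selector's scores for the
                     retrieved passages.
    reader_reward:   scalar feedback from the reader (e.g. the likelihood it
                     assigns to the gold answer given the selected passages).
    Returns the sampled passage indices and a loss to backpropagate through
    the selector only."""
    probs = torch.softmax(selector_logits, dim=-1)
    dist = torch.distributions.Categorical(probs)
    picks = dist.sample((k,))              # sample a small passage set
                                           # (with replacement, for simplicity)
    log_prob = dist.log_prob(picks).sum()  # log-probability of the sampled set
    loss = -reader_reward.detach() * log_prob  # reward-weighted policy-gradient loss
    return picks, loss
```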
图 30:Wang 等人(2023g)提出的 NLP 智能体互学习框架架构。每个训练周期中,阶段 1 和阶段 2 交替执行。在阶段 1 期间,阅读器模型的参数保持固定,仅更新知识选择器的权重;而在阶段 2 期间,则调整阅读器模型的参数,同时冻结知识选择器的权重。
7 Agent AI Across Modalities, Domains, and Realities
7 跨模态、跨领域与跨现实的智能体 AI
7.1 Agents for Cross-modal Understanding
7.1 跨模态理解智能体
Multi-modal understanding is a significant challenge for creating generalist AI agents due to the lack of large-scale datasets that contain vision, language, and agent behavior. More generally, training data for AI agents is often modality-specific. This results in most modern multi-modal systems using a combination of frozen submodules. Some notable examples are Flamingo Alayrac et al. (2022), BLIP-2 Li et al. (2023c), and LLaVA Liu et al. (2023c), all of which utilize a frozen LLM and a frozen visual encoder. These submodules are trained individually on separate datasets, and adaptation layers are then trained to encode the visual encoder's outputs into the LLM embedding space. In order to make further progress in cross-modal understanding for AI agents, the strategy of using frozen LLMs and visual encoders will likely need to change. Indeed, RT-2, a recent visual-language model that is capable of taking actions within the domain of robotics, showed significantly improved performance when jointly tuning the visual encoder and LLM for robotics and visual-language tasks Brohan et al. (2023).
由于缺乏包含视觉、语言和智能体行为的大规模数据集,多模态理解成为构建通用人工智能智能体的重大挑战。更普遍的情况是,AI 智能体的训练数据往往具有模态特异性。这导致大多数现代多模态系统采用冻结子模块组合的方案,例如 Flamingo Alayrac 等人(2022)、BLIP-2 Li 等人(2023c)和 LLaVA Liu 等人(2023c)等典型案例,这些系统都使用了冻结的 LLM 和冻结的视觉编码器。这些子模块分别在独立数据集上训练,然后通过适配层训练将视觉编码器映射到 LLM 的嵌入空间。为了推进 AI 智能体的跨模态理解研究,当前这种使用冻结 LLM 和视觉编码器的策略可能需要改变。事实上,最新研究表明,在机器人视觉语言模型 RT-2 中,当对视觉编码器和 LLM 进行机器人任务与视觉语言任务的联合调优时,其性能获得了显著提升 Brohan 等人(2023)。
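The frozen-submodule recipe can be sketched as follows: a small trainable adapter projects features from a frozen visual encoder into the token-embedding space of a frozen LLM. The dimensions, module interfaces, and output shapes below are placeholders, not the actual Flamingo / BLIP-2 / LLaVA implementations.

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """A trainable projection from a frozen visual encoder into the
    token-embedding space of a frozen LLM (placeholder dimensions)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        for module in (self.vision_encoder, self.llm):
            for p in module.parameters():           # both submodules stay frozen
                p.requires_grad = False
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable parameters

    def visual_tokens(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.vision_encoder(images)     # assumed (batch, patches, vision_dim)
        return self.proj(feats)                     # soft "visual tokens" fed to the LLM
```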
7.2 Agents for Cross-domain Understanding
7.2 跨领域理解智能体
A key challenge for creating generalist agents is the distinctive visual appearance and disparate action spaces across different domains. Humans possess the capability to interpret images and videos from various sources, including the real world, video games, and specialized domains such as robotics and healthcare, once they become familiar with the specific details of these areas. However, existing LLMs and VLMs often demonstrate significant differences between the data they were trained on and the varied domains in which they are applied. Notably, training agent models to predict specific actions presents a considerable challenge when trying to develop a single policy that can effectively learn multiple control systems across domains. Generally, the approach most modern works take when applying systems within specific domains is to start from a pretrained foundation model and then finetune a separate model for each specific domain. This fails to capture any commonalities between domains and results in a smaller total amount of training data being used, rather than leveraging each domain's data.
构建通用智能体的核心挑战在于不同领域间显著的视觉表现差异和动作空间区隔。人类一旦熟悉特定领域细节,就能解读来自现实世界、电子游戏及机器人学/医疗等专业领域的图像视频。但现有 LLMs 和 VLMs 的训练数据与其应用领域常存在明显差异。值得注意的是,要开发能有效学习跨领域多控制系统的单一策略时,训练智能体模型预测特定动作面临巨大挑战。当前主流做法是从预训练基础模型出发,为每个特定领域微调独立模型,这既无法捕捉领域间共性,也导致训练数据总量受限而无法充分利用各领域数据。
7.3 Interactive agent for cross-modality and cross-reality
7.3 跨模态与跨现实的交互式智能体
Developing AI agents that can successfully understand and perform tasks across different realities is an on-going challenge that has seen some recent success for image and scene generation Huang et al. (2023a). In particular, it is challenging for agents to simultaneously understand real-world and virtual reality environments due to their visual dissimilarities and separate environment physics. Within the context of cross-reality, Sim to Real transfer is a particularly important problem when using simulation-trained policies for real-world data, which we discuss in the next section.
开发能够成功理解并执行跨现实任务的 AI 智能体是一个持续存在的挑战,目前在图像和场景生成领域已取得一些成果(Huang 等,2023a)。由于视觉差异和独立的环境物理特性,智能体要同时理解真实世界与虚拟现实环境尤为困难。在跨现实语境下,当使用仿真训练策略处理真实世界数据时,"仿真到现实"迁移成为关键问题,我们将在下节详细讨论。
7.4 Sim to Real Transfer 7.4 仿真到现实迁移
Sim-to-real transfer refers to techniques that enable models trained in simulation to be deployed in the real world.
Embodied agents, especially those based on RL policies, are typically trained in simulated environments. These simulations do not fully replicate the characteristics of the real world (e.g., disturbances, light, gravity, and other physical properties). Due to this discrepancy between simulation and reality, models trained in simulation often struggle to perform well when applied in the real world. This issue is known as the “sim-to-real” problem. To solve this problem, several approaches can be taken:
这类技术使得在仿真环境中训练的模型能够部署到现实世界。具身智能体(尤其是基于强化学习策略的)通常先在仿真环境中训练,但这些仿真无法完全复现真实世界的特性(如干扰、光照、重力等物理属性)。由于仿真与现实间的这种差异,仿真训练的模型在现实应用中往往表现不佳,此现象被称为"仿真到现实"问题。现有解决方案包括:
-
•
Domain randomization: domain randomization is a technique that trains a model while randomly varying parameters within a simulation environment (e.g., object appearance, sensor noise, and optical properties) in anticipation of the uncertainties and variations of the real world Tobin et al. (2017). For instance, in the context of training RL-based grasping skills, introducing randomness in the shapes of objects can lead to a policy capable of adapting to objects with somewhat different shapes Saito et al. (2022). A minimal sketch of this idea is given after this list.
领域随机化:该技术通过随机改变仿真环境参数(如物体外观、传感器噪声和光学特性)来训练模型,以应对现实世界的不确定性和变化(Tobin 等,2017)。例如在训练基于强化学习的抓取技能时,引入物体形状的随机性可使策略适应不同形状的物体(Saito 等,2022)。 -
•
Domain adaptation: Domain adaptation, or domain transfer is a technique that bridges the gap between simulated and real-world domains by training models with a large number of simulated images and a smaller set of real-world images. In practical settings, unpaired image-to-image translation methods such as CycleGAN Zhu et al. (2017b) are employed due to the difficulty in preparing paired images across domains. Several enhanced versions exist for reinforcement learning, including RL-CycleGAN Rao et al. (2020), and for imitation learning, such as RetinaGAN Ho et al. (2021).
领域适应:该技术通过使用大量仿真图像和少量真实图像训练模型,弥合仿真与真实领域间的差距。实践中由于难以准备跨领域配对图像,常采用无配对图像转换方法如 CycleGAN(Zhu 等,2017b)。强化学习领域有 RL-CycleGAN(Rao 等,2020)等改进版本,模仿学习领域则有 RetinaGAN(Ho 等,2021)。 -
•
Improvement of simulation: Realistic simulation is key for sim-to-real transfer. Part of this effort is achieved by system identification techniques Zhu et al. (2017c); Allevato et al. (2020), which aim to identify simulation parameters that mimic the real-world environment. Additionally, the use of photorealistic simulators can be effective for image-based reinforcement learning Martinez-Gonzalez et al. (2020); Müller et al. (2018); Shah et al. (2018); Sasabuchi et al. (2023).
仿真改进:高拟真度仿真是实现仿真到现实迁移的关键。系统辨识技术(Zhu 等,2017c;Allevato 等,2020)通过识别仿真参数来模拟真实环境。此外,基于图像的强化学习可受益于照片级仿真器(Martinez-Gonzalez 等,2020;Müller 等,2018;Shah 等,2018;Sasabuchi 等,2023)。
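As a concrete illustration of domain randomization, the sketch below resamples a few simulator parameters at the start of every episode; the attribute names are hypothetical, since real simulators expose such parameters through their own configuration APIs.

```python
import random

def randomize_sim(env) -> None:
    """Resample a few simulator parameters at the start of each episode.
    The attribute names are hypothetical placeholders."""
    env.light_intensity = random.uniform(0.4, 1.6)    # lighting conditions
    env.object_scale = random.uniform(0.8, 1.2)       # object size / shape variation
    env.sensor_noise_std = random.uniform(0.0, 0.05)  # camera / sensor noise
    env.friction = random.uniform(0.5, 1.5)           # physical properties

# Typical use: call randomize_sim(env) inside env.reset() so the RL policy
# never sees exactly the same appearance or dynamics twice.
```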
Sim-to-real transfer remains a central challenge in the study of embodied agents, and approaches keep evolving. Both theoretical and empirical research are essential to advance these technologies further.
随着方法论的持续演进,仿真到现实迁移仍是具身智能体研究的核心挑战。要推动技术进步,理论探索与实践验证都不可或缺。
8 Continuous and Self-improvement for Agent AI
8 智能体 AI 的持续与自我改进
Currently, foundation-model-based AI agents have the capacity to learn from multiple different data sources, which allows for more flexible sources of data for training. Two key consequences of this are that (1) user and human-based interaction data can be used to further refine and improve the agent, and (2) existing foundation models and model artifacts can be used to generate training data. We discuss each of these in more detail in the following sections, but we note that since current AI agents are largely tied to existing pretrained foundation models, they generally do not learn from continuous interaction with their environments. We think this is an exciting future direction, and initial work by Bousmalis et al. has shown that self-improving agents for robotic control are able to continuously learn and improve through environmental interactions without supervision Bousmalis et al. (2023).
目前,基于基础模型的 AI 智能体能够从多种不同数据源中学习,这为训练数据提供了更灵活的来源。由此产生的两个关键影响是:(1) 用户与人机交互数据可用于进一步优化和改进智能体;(2) 现有基础模型及模型产物可用于生成训练数据。我们将在后续章节详细讨论这两点,但需要指出的是,由于当前 AI 智能体主要依赖预训练基础模型,它们通常无法通过与环境的持续交互进行学习。我们认为这是一个极具前景的未来研究方向——Bousmalis 等人(2023)的初步工作已证明,用于机器人控制的自改进型智能体能够在无监督条件下通过环境交互持续学习与提升。
8.1 Human-based Interaction Data
8.1 人机交互数据
The core idea behind using human-based interaction data is to leverage a large number of agent-human interactions to train and improve future iterations of the agent. There are several strategies used to improve agents from human-agent interactions.
利用人机交互数据的核心理念是通过大量智能体-人类交互来训练和改进后续迭代版本。现有多种策略可用于基于人机交互优化智能体:
-
•
Additional training data Perhaps the simplest usage of human-agent interactions is to use the interaction examples themselves as training data for a future iteration of the agent. This generally requires filtering strategies to differentiate successful agent examples from unsuccessful interaction examples (a minimal filtering sketch is given after this list). Filtering can be rules-based (e.g., reaching some desired end goal state), model-based (e.g., classifying successful vs. unsuccessful interactions), or manually selected after a post-hoc inspection and/or modification of the interaction examples.
附加训练数据 最直接的应用是将人机交互案例本身作为下一代智能体的训练数据。这通常需要筛选策略来区分成功与失败的交互案例,筛选方式可基于规则(如达成特定目标状态)、模型(如对交互成功与否进行分类)或人工检查/修改后的手动选择。 -
•
Human preference learning During interaction with the user, the agent system can prompt the user with several different model outputs and allow for the user to select the best output. This is commonly used by LLMs like ChatGPT and GPT-4, whereby users can select one output (out of several) that aligns best with their preferences.
人类偏好学习 在用户交互过程中,智能体系统可呈现多种不同模型输出供用户选择最佳方案。ChatGPT 和 GPT-4 等 LLMs 常采用这种方法,用户可从多个输出中选择最符合偏好的结果。 -
•
Safety training (red-teaming) Red-teaming within the context of Agent AI refers to having a dedicated team of adversaries (either human or computer) that seek to exploit and expose weaknesses and vulnerabilities within the Agent AI system. Although adversarial in nature, red-teaming is commonly used as a means for understanding how to improve AI safety measures and reduce the occurrence of harmful outputs. The core principle is to discover consistent methods for inducing unwanted agent outputs so that the model can be trained on data that explicitly corrects this behavior.
安全训练(红队测试) 在 Agent AI 领域中,红队测试指由专门对抗团队(人类或计算机)主动探测并暴露智能体系统的弱点与漏洞。虽然本质上是对抗行为,但该方法常被用于改进 AI 安全措施并减少有害输出。其核心原理是通过系统性诱发不良输出来获取修正数据,从而针对性训练模型。
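As a simple illustration of the filtering step mentioned in the first strategy above, the function below keeps only interaction episodes that satisfy a rule-based success check; the field names are hypothetical.

```python
def filter_interactions(episodes: list[dict]) -> list[dict]:
    """Rule-based filtering of agent-human interaction logs into training
    data: keep only episodes that reached the desired goal state and were
    not flagged by the user.  Field names are hypothetical."""
    return [ep for ep in episodes
            if ep.get("reached_goal") and not ep.get("user_flagged")]
```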
8.2 Foundation Model Generated Data
8.2 基础模型生成数据
With the advent of powerful foundation model artifacts produced by academia and industry, a variety of methods have been developed to extract and generate meaningful training data from these artifacts using various prompting and data-pairing techniques.
随着学术界和产业界开发出强大的基础模型,人们已发展出多种方法,通过提示技术和数据配对手段从这些模型中提取并生成有意义的训练数据。
-
•
LLM Instruction-tuning Methods for generating instruction-following training data from LLMs have allowed for the finetuning of smaller, open-source models based on the outputs of larger proprietary LLMs Wang et al. (2022b). For example, Alpaca Taori et al. (2023) and Vicuna Zheng et al. (2023) are LLMs based on the open-source LLaMA family Touvron et al. (2023) that have been tuned on various outputs from ChatGPT and human participants. This method of instruction tuning can be viewed as a form of knowledge distillation, where the larger LLM serves as a teacher model to a smaller student model. Importantly, although LLM instruction-tuning has been shown to transfer the writing style and some instruction-following capabilities of the teacher model to the student model, significant gaps still exist between the factuality and capabilities of the teacher and student models Gudibande et al. (2023).
基于 LLM 的指令微调方法能够根据专有大模型输出生成指令跟随训练数据,从而实现对小型开源模型的精调(Wang 等人,2022b)。例如 Alpaca(Taori 等人,2023)和 Vicuna(Zheng 等人,2023)这类基于开源 LLaMA 系列(Touvron 等人,2023)的 LLMs,通过 ChatGPT 输出与人类参与者数据进行指令微调。这种指令微调可视为知识蒸馏的一种形式,即以大模型作为教师模型指导小规模学生模型。值得注意的是,虽然 LLM 指令微调已被证实能转移教师模型的文风及部分指令跟随能力,但师生模型在事实准确性及综合能力上仍存在显著差距(Gudibande 等人,2023)。 -
•
Vision-language pairs A number of recent works have sought to increase the amount and diversity of pretraining data available to visual-language models by automatically generating captions and other text for visual content. For example, LLaVA Liu et al. (2023c) uses 150,000 examples of instruction-following behavior from textual and visual inputs that are mainly LLM-generated. Other work has shown that using VLMs to re-caption images can improve the training data and the subsequent quality of image generation models Segalis et al. (2023). Within the realm of video understanding, using VLMs and LLMs to recaption videos has been shown to improve the performance and quality of subsequent VLMs trained on the recaptioned videos Wang et al. (2023f); Zhao et al. (2022).
视觉-语言配对数据 近期多项研究致力于通过自动生成视觉内容描述文本,提升视觉语言模型预训练数据的多样性规模。例如 LLaVA(Liu 等人,2023c)采用了 15 万例主要由 LLM 生成的图文指令跟随样本。另有研究表明,使用视觉语言模型(VLM)重标注图像能优化训练数据,进而提升图像生成模型质量(Segalis 等人,2023)。在视频理解领域,利用 VLM 和 LLM 重标注视频已被证实能提升后续训练模型的性能表现(Wang 等人,2023f;Zhao 等人,2022)。
9 Agent Dataset and Leaderboard
9 代理数据集与排行榜
To accelerate research in this domain, we propose two benchmarks respectively for multi-agent gaming and agentic visual language tasks. We will release two new datasets - “CuisineWorld” and “VideoAnalytica” - and a set of baseline models, encouraging participants to explore new models, systems, and submit their results on the test set of our leaderboard.
为加速该领域研究,我们分别针对多智能体游戏与具身视觉语言任务提出两项基准测试。我们将发布"CuisineWorld"和"VideoAnalytica"两个新数据集及一组基线模型,鼓励研究者探索新模型与系统,并在我们排行榜的测试集上提交结果。
9.1 “CuisineWorld” Dataset for Multi-agent Gaming
9.1 多智能体游戏数据集"CuisineWorld"
CuisineWorld is a text-based game reminiscent of Overcooked! It offers a platform for AI-powered agents to cooperate and play in tandem. This dataset will test the collaboration efficiency of multi-agent systems, offering insights into how well LLMs and other systems can work together in dynamic scenarios. In particular, the dataset will focus on how well the agents understand goals, and how well the agents can coordinate among themselves. Two types of modes are supported in this dataset: a centralized dispatcher mode and a decentralized mode. Participants can choose a play mode and make a submission to our leaderboard.
CuisineWorld 是一款受《胡闹厨房》启发的文本游戏,为 AI 智能体提供了协同操作平台。该数据集将测试多智能体系统的协作效率,揭示 LLMs 等系统在动态场景中的协同表现,尤其关注智能体对目标的理解能力与相互协调能力。数据集支持两种模式:集中调度模式与去中心化模式,参与者可选择玩法并向排行榜提交结果。
9.1.1 Benchmark 9.1.1 基准测试
For our competition, we will release the CuisineWorld benchmark, which includes a text interface with extendable task definition files, as well as interfaces for multi-agent interaction and human-machine interaction. We introduce the gaming interaction task, in which the goal is to generate relevant, appropriate multi-agent collaboration strategies that maximize collaboration efficiency. We evaluate collaboration efficiency with the proposed evaluation metric, CoS.
竞赛将发布 CuisineWorld 基准测试,包含可扩展任务定义的文本接口、多智能体交互接口及人机交互接口。我们设计了游戏交互任务,目标是生成能最大化协作效率的相关多智能体策略,并使用创新评估指标 CoS 衡量协作效率。
The “CuisineWorld" dataset was collected by Microsoft, UCLA, and Stanford University. The goal of the competition is to explore how different, existing and novel, grounded-LLM and interactive techniques perform with this benchmark and establish strong baselines for the task of multi-agent gaming infrastructure.
该数据集由微软、加州大学洛杉矶分校和斯坦福大学联合采集,旨在探索现有及新型具身 LLM 技术在此基准测试中的表现,为多智能体游戏基础设施建立强基线。
The dataset of CuisineWorld includes:
数据集包含:
- A selection of well-defined multi-agent collaboration tasks.
- 精选的多智能体协作任务
- An API system to facilitate agent interactions.
- 支持智能体交互的 API 系统
- An automatic evaluation system.
- 自动评估系统
(The link for downloading the dataset will soon be made available and this article will be updated to include it here.)
(数据集下载链接即将公布,本文档将同步更新)
9.1.2 Task 9.1.2 任务说明
-
•
We provide a dataset and a related benchmark, called Microsoft MindAgent, and correspondingly release the “CuisineWorld” dataset to the research community.
我们发布名为 Microsoft MindAgent 的基准测试及配套"CuisineWorld"数据集,向研究社区开放。 -
•
We will provide benchmarks to evaluate and rank the submitted “MindAgent" algorithms. We will also provide baseline results generated using popular infrastructures.
将提供基准来评估提交的"MindAgent"算法并排名,同时公布基于主流基础设施生成的基线结果。
9.1.3 Metrics and Judging 9.1.3 指标与评判
The quality of multi-agent collaboration efficiency is determined by the new “CoS" auto-metric (from MindAgent, Gong et al. (2023a)).
The final rating of our metric is calculated as an average over the evaluated collaboration efficiency metrics of the multi-agent system on all tasks.
Human evaluators will be asked to rate individual responses as well as provide subjective judgments of the engagement, breadth, and overall quality of the users’ interactions with the agents.
多智能体协作效率的质量由新型"cos"自动度量标准决定(源自 MindAgent Gong 等人(2023a)的研究)。最终评分将计算为多智能体系统在所有任务上协作效率指标的平均值。人类评估者将被要求对单个响应进行评分,并对用户与智能体交互的参与度、广度及整体质量提供主观判断。
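Assuming the aggregation described above, the final leaderboard rating can be computed as a simple mean of the per-task collaboration efficiency values; the per-task CoS computation itself follows MindAgent Gong et al. (2023a) and is not reproduced here.

```python
def collaboration_score(per_task_efficiency: list[float]) -> float:
    """Final leaderboard rating: the mean of the per-task collaboration
    efficiency values produced by the CoS auto-metric described above."""
    return sum(per_task_efficiency) / len(per_task_efficiency)
```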
9.1.4 Evaluation 9.1.4 评估方案
-
•
Automated Evaluation. We plan to release a leaderboard; starting on the release date (TBA), registered participants will be asked to submit their results on the task associated with the dataset “CuisineWorld" (our publicly released dataset for the leaderboard). Submission of results will close on the end date (TBA). Each team will be required to submit their generated results on the testing set for automated evaluation of the CoS metric.
自动评估。我们计划在发布日期(待定)启动排行榜,要求注册参与者提交其在"CuisineWorld"数据集(我们为排行榜公开发布的专用数据集)相关任务上的结果。结果提交将在截止日期(待定)关闭。每个团队需提交测试集上的生成结果,用于"cos"指标的自动化评估。 -
•
Human Evaluation on our leaderboard. Leaderboard participants will need to provide a submission file generated locally by the evaluation scripts. We will use the EvalAI system to check the submission file and optionally rerun the code for top challenge contenders. Therefore, teams must also submit their code with a README file describing how to run it. Human evaluation will be performed by the organizing team.
排行榜人工评估。参与者需提供本地通过评估脚本生成的提交文件。我们将使用 evalAI 系统核验提交文件,并对顶尖参赛者选择性重运行代码。因此各团队必须同时提交代码及说明运行方式的 Readme 文件。人工评估将由组委会执行。 -
•
Winner Announcement. We will make an announcement of the winners and post the final ratings of the submissions on our leaderboard.
优胜者公告。我们将公布获胜团队名单,并在排行榜上发布所有提交作品的最终评分。
9.2 Audio-Video-Language Pre-training Dataset.
9.2 音视频语言预训练数据集
We introduce VideoAnalytica: a new benchmark for analytical video demonstration comprehension. VideoAnalytica focuses on leveraging video demonstrations as aids to better understand the complex, high-level reasoning embedded within long-form instructional videos. The objective is to evaluate the cognitive reasoning abilities of video language models, pushing them beyond mere recognition tasks and basic comprehension, towards a more sophisticated and nuanced understanding of videos. Crucially, VideoAnalytica emphasizes the integration of multiple modalities, such as audio, video, and language, as well as the ability of models to apply domain-specific knowledge, to contextualize and interpret the information presented in the videos.
Specifically, VideoAnalytica involves two primary tasks:
我们推出 VideoAnalytica:一个用于分析性视频演示理解的新基准。该基准专注于利用视频演示作为辅助工具,以更好地理解长篇幅教学视频中蕴含的复杂高阶推理。其核心目标是评估视频语言模型的认知推理能力,推动它们超越单纯的识别任务和基础理解,实现对视频更精细、更深入的解读。VideoAnalytica 特别强调多模态(如音频、视频和语言)的融合能力,以及模型运用领域专业知识对视频信息进行情境化解读的能力。具体包含两项核心任务:
-
1.
Video Text Retrieval: This task involves accurately retrieving relevant text from the instructional videos. The challenge lies in distinguishing between relevant and irrelevant information, thus requiring a deep understanding of the video content and analysis of the demonstration to retrieve the correct query. To further increase the complexity of these tasks, we introduce hard negatives, generated by large language models, into our datasets. We run human validation on the generated negatives and remove instances that make the task invalid or unfair (e.g., negatives that are actually valid).
视频文本检索:该任务要求从教学视频中准确检索相关文本。难点在于区分相关信息与无关信息,这需要深入理解视频内容并分析演示过程以检索正确查询。为增加任务复杂度,我们在数据集中引入了由大语言模型生成的困难负样本,并通过人工验证剔除会导致任务失效或不公平的样本(例如负样本实际有效的情况)。 -
2.
Video Assisted Informative Question Answering: This task requires the model to answer questions based on the information extracted from the videos. The focus is on complex questions that require analytical reasoning and a thorough comprehension of the video demonstration.
视频辅助信息问答:该任务要求模型基于从视频提取的信息回答问题,重点关注需要分析推理能力和全面理解视频演示的复杂问题。
To facilitate the development of an audio-video-language agent for analytical video understanding, we introduce a benchmark leaderboard for the two tasks from VideoAnalytica.
为促进开发面向分析性视频理解的音视频语言智能体,我们针对 VideoAnalytica 两项任务设立了基准排行榜。
-
•
The leaderboard participants will need to submit their solutions for evaluation. The evaluation will be based on the model’s performance on the two tasks, and the results will be displayed on the leaderboard. Participants are required to submit their code, along with a detailed explanation of their approach and methodology.
排行榜参与者需提交解决方案接受评估。评估将基于模型在两项任务上的表现,结果将在排行榜公示。参与者需同时提交代码及详细的方法论说明。 -
•
Ethical considerations: The leaderboard focuses on understanding and interpreting video content, which could potentially be used in surveillance or other privacy-invasive applications. Therefore, it’s crucial to consider the ethical implications and potential misuse of the technology. We encourage participants to consider these aspects in their submissions and promote the ethical use of AI.
伦理考量:本排行榜聚焦视频内容的理解与解读技术,该技术可能被用于监控等侵犯隐私的应用。因此必须充分考虑技术伦理影响和潜在滥用风险。我们鼓励参与者在提交方案时考量这些因素,倡导 AI 技术的伦理使用。
10 Broader Impact Statement
10 更广泛的影响声明
This article and our associated forum (https://multimodalagentai.github.io) aim to be a catalyst for innovative research, fostering collaborations that will drive the next wave of AI applications. By focusing on multimodal agents, we emphasize the future direction of human-AI interactions, the associated challenges, and potential solutions. We detail three ways in which we make significant contributions to the broader community.
本文及相关论坛 666https://multimodalagentai.github.io 旨在成为创新研究的催化剂,通过促进协作来推动下一代人工智能应用的发展。我们聚焦多模态智能体,着重探讨人机交互的未来方向、技术指标与解决方案,并具体阐述我们为更广泛社区做出重要贡献的三大途径。
Firstly, we hope our forum grounds AI researchers to develop solutions motivated by real-world problems in gaming, robotics, healthcare, and long-video understanding. Specifically, the development of multimodal agents in gaming could lead to more immersive and personalized gaming experiences, thereby transforming the gaming industry. In robotics, the development of adaptive robotic systems could revolutionize industries ranging from manufacturing to agriculture, potentially addressing labor shortages and improving efficiency. In healthcare, the use of LLMs and VLMs as diagnostic agents or patient care assistants could lead to more accurate diagnoses, improved patient care, and increased accessibility to medical services, particularly in underserved areas. Furthermore, the ability of these models to interpret long-form videos could have far-reaching applications, from enhancing online learning to improving technical support services. In general, the topics covered in our forum will have significant downstream effects on a wide range of industries and humans across the world.
首先,我们希望论坛能助力 AI 研究者开发针对现实问题的解决方案,涵盖游戏、机器人、医疗和长视频理解等领域。具体而言,游戏领域多模态智能体的开发将带来更具沉浸感和个性化的游戏体验,从而变革游戏产业;机器人领域自适应系统的突破或将彻底改变从制造业到农业的众多行业,缓解劳动力短缺并提升效率;医疗领域利用 LLMs 和 VLMs 作为诊断助手或护理代理,可提高诊断准确性、改善患者护理,并增强医疗欠发达地区的服务可及性;而长视频解析能力的突破将产生深远影响,从提升在线教育到优化技术支持服务。总体而言,论坛议题将对全球各行业和人类社会产生重大影响。
Secondly, we hope our forum stands as a valuable resource for AI practitioners and researchers alike, serving as a platform to explore and deeply comprehend the diverse and complex challenges that come with implementing AI agents across a wide variety of environments and situations. This exploration includes, for instance, understanding the specific limitations and potential hazards linked to agentic AI systems when they are developed for specialized sectors such as healthcare diagnostics. In this domain, issues like dangerous hallucinations in AI behavior can pose significant risks, highlighting the critical need for meticulous design and testing. However, these specific challenges may not be equally relevant or noticeable when considering AI agents crafted for the gaming industry. In such recreational fields, developers might instead prioritize tackling different hurdles, such as the need for AI to perform more open-ended generation and exhibit creativity, adapting dynamically to unpredictable gameplay scenarios and player interactions. By attending the forum, participants will gain insights into how these varied environments dictate the focus and direction of AI development, and how best to tailor AI solutions to meet these distinct needs and overcome the pertinent challenges.
其次,我们希望论坛成为 AI 从业者与研究人员的宝贵资源平台,助力深入探索和理解在不同环境与情境下部署 AI 智能体所带来的多样化技术挑战。例如在医疗诊断等专业领域,需重点关注智能体系统可能存在的局限性和潜在风险——诸如 AI 危险幻觉行为引发的重大隐患,这要求极其严谨的设计与测试。然而这些挑战在游戏行业可能并不显著,开发者反而需要应对开放式内容生成、创造性表现等不同难题,使 AI 能动态适应不可预测的游戏场景与玩家互动。通过参与论坛,与会者将深入理解不同应用场景如何塑造 AI 研发重点,以及如何定制化解决方案来应对相应挑战。
Thirdly, the various elements of our event, including the expert presentations, informative posters, and notably the winners of our two leaderboards, are set to offer a substantive yet succinct overview of the latest and significant trends, research directions, and innovative concepts in the realm of multimodal agents. These presentations will encapsulate pivotal findings and developments, shining a light on new systems, ideas, and technologies in the field of multimodal agent AI. This assortment of knowledge is not only beneficial for the attendees of our forum who are looking to deepen their understanding and expertise in this domain, but it also serves as a dynamic and rich resource platform. Those visiting our forum’s website can tap into this reservoir of information to discover and understand the cutting-edge advancements and creative ideas steering the future of multimodal agent AI. We strive to serve as a useful knowledge base for both newcomers and veterans in the field. By engaging with these resources, we hope participants and online visitors alike can remain informed of the transformative changes and novel approaches that are shaping the exciting landscape surrounding multimodal agent AI.
第三,本次活动包含多个环节——专家演讲、信息丰富的海报展示,尤其是两个排行榜的获奖者——将精要而全面地概述多模态智能体领域的最新重大趋势、研究方向和创新理念。这些展示将涵盖关键研究成果与发展动态,揭示多模态智能体 AI 领域的新系统、新思想与新技术。这些知识宝库不仅有助于论坛参会者深化对该领域的理解与专长,更构成一个活跃而丰富的资源平台。访问我们论坛网站的观众可从中获取信息,探索并理解引领多模态智能体 AI 未来的尖端进展与创意构想。我们致力于为领域内新老研究者提供实用知识库,通过接触这些资源,希望现场参会者与线上访客都能及时了解正在塑造多模态智能体 AI 精彩前景的变革性突破与创新方法。
11 Ethical Considerations 11 伦理考量
Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to replay or paraphrase content, predict actions, or synthesize 3D or 2D scenarios. Fundamental advances in agent AI help contribute towards these goals, and many would benefit from a greater understanding of how to model embodied and empathetic agents in simulated realities or the real world. Arguably, many of these applications could have positive benefits.
多模态智能体 AI 系统具有广泛应用前景。除交互式 AI 外,基于多模态模型的智能体可助力机器人和 AI 代理的内容生成,赋能生产力应用,协助场景回放、文本复述、行为预测或 2D/3D 场景合成。智能体 AI 的基础性突破将推动这些目标的实现,而如何更好地在模拟或现实世界中构建具身化、共情化的智能体模型,将使众多领域受益。可以说,这些应用大多能产生积极影响。
However, this technology could also be used by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. Therefore, it is very important that this technology is developed in accordance with responsible AI guidelines, for example by explicitly communicating to users that content is generated by an AI system and by providing the user with controls to customize such a system. It is possible that Agent AI could be used to develop new methods to detect manipulative content, partly because it draws on the rich hallucination behavior of large foundation models, and thus help address another real-world problem.
然而该技术也可能被恶意利用。生成内容的智能体 AI 系统可能用于操纵或欺骗人类。因此依据负责任 AI 准则开发这项技术至关重要,例如明确告知用户内容由 AI 系统生成,并提供系统定制化控制选项。值得注意的是,智能体 AI 或可发展出检测操纵性内容的新方法——这部分得益于大基座模型丰富的幻觉表现——从而帮助解决现实世界的另一难题。
For example: 1) In healthcare, the ethical deployment of LLM and VLM agents, especially in sensitive domains, is paramount. AI agents trained on biased data could potentially worsen health disparities by providing inaccurate diagnoses for underrepresented groups. Moreover, the handling of sensitive patient data by AI agents raises significant privacy and confidentiality concerns.
2) In the gaming industry, AI agents could transform the role of developers, shifting their focus from scripting non-player characters to refining agent learning processes. Similarly, adaptive robotic systems could redefine manufacturing roles, necessitating new skill sets rather than replacing human workers. Navigating these transitions responsibly is vital to minimize potential socio-economic disruptions.
具体而言:1)在医疗健康领域,LLM 和 VLM 智能体的伦理部署尤为关键,基于偏见数据训练的 AI 代理可能因对弱势群体做出错误诊断而加剧健康不平等,且 AI 处理敏感患者数据会引发重大隐私问题;2)游戏产业中,AI 代理或将改变开发者角色,使其从编写 NPC 脚本转向优化智能体学习流程;同理,自适应机器人系统可能重塑制造业岗位结构,重点在于培养新技能而非取代人力。负责任地应对这些转型对减少社会经济震荡至关重要。
Furthermore, Agent AI often focuses on learning collaboration policies in simulation, and directly applying such policies to the real world carries some risk due to distribution shift. Robust testing and continual safety monitoring mechanisms should be put in place to minimize the risk of unpredictable behaviors in real-world scenarios.
Our “VideoAnalytica" dataset is collected from the Internet, which is not a fully representative source, so we have already gone through the ethical review and legal processes of both Microsoft and the University of Washington. Even so, we also need to understand the biases that might exist in this corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the agent-level distribution in our dataset differs from that of other existing datasets. However, there is much more than can be included in a single dataset or workshop. We would argue that more approaches and discussion linked to real tasks or topics are needed, and that making these data and systems available will help.
此外,智能体 AI 主要学习模拟环境中的协作策略,若直接应用于现实世界可能因数据分布偏移产生风险。需建立鲁棒测试和持续安全监测机制以降低实际场景中的意外行为风险。我们的"VideoAnalytica"数据集虽经微软和华盛顿大学伦理审查及法律流程,但因采集自互联网而存在代表性局限,仍需警惕数据偏差。单一数据集或研讨会难以涵盖所有分布特征,我们主张通过开放数据或系统,开展更多与实际任务/主题相关的方法探讨。
We will dedicate a segment of our project to discussing these ethical issues, exploring potential mitigation strategies, and deploying a responsible multi-modal AI agent. We hope to help more researchers answer these questions together via this paper.
本项目将设立专题讨论这些伦理问题,探索缓解策略,并部署负责任的多模态 AI 代理。我们希望通过本文助力更多研究者共同解答这些命题。
12 Diversity Statement 12 多样性声明
By examining the adaptability of AI agent models in various domains, we inherently embrace a diversity of challenges, perspectives, and solutions. In this vein, our project aims to build a diverse community by exploring the wide array of subjects in multimodal and agentic AI.
通过考察 AI 代理模型在各领域的适应性,我们自然融合了多元的评估标准、视角和解决方案。基于此,本项目旨在通过探索多模态与智能体 AI 的广阔主题,构建多样化的学术共同体。
With these principles in mind, this project focuses on advanced multimodal systems that interact effectively within both physical and virtual environments and facilitate effective interaction with humans. As such, we intend to engage a broad range of experts and practitioners across a wide-range of technical specialities, cultures, countries, and scholarly fields to discuss important topics, including but not limited to:
秉持这些原则,本项目聚焦于能在物理/虚拟环境中高效交互、并促进人机有效沟通的先进多模态系统。为此我们将汇聚跨技术领域、文化背景、国家和学科的专业人士,共同探讨包括但不限于以下议题:
-
•
Application of foundation models: the development of agents with integrated modalities (audio, image, text, sensor inputs), aiming to enhance their recognition and response capabilities for a wide variety of applications.
基座模型应用:开发整合多模态(音频/图像/文本/传感器输入)的智能体,提升其在多样化场景中的识别与响应能力。 -
•
General-purpose end-to-end systems: the development of end-to-end models that are trained with large-scale data, seeking to create versatile and adaptable AI solutions.
通用端到端系统:利用大规模数据训练端到端模型,创造通用性强、适应性佳的 AI 解决方案。 -
•
Methodologies for grounding modalities: integrating information across various modalities, enhancing the coherence and efficacy of data processing.
多模态 grounding 方法论:整合跨模态信息,增强数据处理的连贯性与有效性。 -
•
Intuitive human interface: the development of effective and meaningful interaction between humans and agents.
直觉化人机界面:打造高效且富有意义的人与智能体交互模式。 -
•
Taming LLM/VLMs: exploring new approaches to address common issues in large-scale models, such as hallucinations and biases in their outputs.
驯服 LLM/VLM:探索新方法以解决大规模模型中的常见问题,如输出中的幻觉和偏见。
We aspire to broaden our collective understanding of the potential and limitations of agentic AI by leveraging our unique and diverse perspectives. We strongly believe that this approach will not only enrich individual perspectives, but will also enhance the community’s collective knowledge and promote a holistic view that is more inclusive of the wide-ranging challenges faced by multimodal AI agents.
我们渴望通过利用独特多元的视角,拓宽对具身化 AI 潜力与局限的集体认知。我们坚信这种方法不仅能丰富个人见解,更能提升社群的集体智慧,促进形成更具包容性的整体视角,以应对多模态 AI 智能体面临的广泛挑战榜单。
References
- Ahn et al. (2022a) M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, and A. Zeng, “Do as i can and not as i say: Grounding language in robotic affordances,” in arXiv preprint arXiv:2204.01691, 2022.
- Ahn et al. (2022b) M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog et al., “Do as i can, not as i say: Grounding language in robotic affordances,” arXiv preprint arXiv:2204.01691, 2022.
- Alayrac et al. (2022) J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
- Allevato et al. (2020) A. Allevato, E. S. Short, M. Pryor, and A. Thomaz, “Tunenet: One-shot residual tuning for system identification and sim-to-real robot task transfer,” in Conference on Robot Learning. PMLR, 2020, pp. 445–455.
- Amjad et al. (2023) A. Amjad, P. Kordel, and G. Fernandes, “A review on innovation in healthcare sector (telehealth) through artificial intelligence,” Sustainability, vol. 15, no. 8, p. 6655, 2023.
- An et al. (2023) S. An, Z. Lin, Q. Fu, B. Chen, N. Zheng, J.-G. Lou, and D. Zhang, “How do in-context examples affect compositional generalization?” arXiv preprint arXiv:2305.04835, 2023.
- Anderson et al. (2018b) P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva et al., “On evaluation of embodied navigation agents,” arXiv preprint arXiv:1807.06757, 2018.
- Anderson et al. (2018a) P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. Van Den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3674–3683.
- Antol et al. (2015) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
- Bain et al. (2021) M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1728–1738.
- Baker et al. (2022) B. Baker, I. Akkaya, P. Zhokov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune, “Video pretraining (vpt): Learning to act by watching unlabeled online videos,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 639–24 654, 2022.
- Batra et al. (2020) D. Batra, A. Gokaslan, A. Kembhavi, O. Maksymets, R. Mottaghi, M. Savva, A. Toshev, and E. Wijmans, “Objectnav revisited: On evaluation of embodied agents navigating to objects,” arXiv preprint arXiv:2006.13171, 2020.
- Black et al. (2023) K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image-editing diffusion models,” arXiv preprint arXiv:2310.10639, 2023.
- Blair-Stanek et al. (2023) A. Blair-Stanek, N. Holzenberger, and B. Van Durme, “Can gpt-3 perform statutory reasoning?” arXiv preprint arXiv:2302.06100, 2023.
- Bousmalis et al. (2023) K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y. Zhou, A. Gupta, A. Raju et al., “Robocat: A self-improving foundation agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023.
- Brohan et al. (2022) A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.
- Brohan et al. (2023) A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
- Brown et al. (2020) T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- Bubeck et al. (2023) S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
- Cai et al. (2023) W. Cai, S. Huang, G. Cheng, Y. Long, P. Gao, C. Sun, and H. Dong, “Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill,” arXiv preprint arXiv:2309.10309, 2023.
- Carroll et al. (2019) M. Carroll, R. Shah, M. K. Ho, T. Griffiths, S. Seshia, P. Abbeel, and A. Dragan, “On the utility of learning about humans for human-ai coordination,” Advances in neural information processing systems, vol. 32, 2019.
- Chang et al. (2021) Y. Chang, M. Narang, H. Suzuki, G. Cao, J. Gao, and Y. Bisk, “WebQA: Multihop and Multimodal QA,” arXiv preprint arXiv:2109.00590, 2021.
- Chaplot et al. (2020a) D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” Advances in Neural Information Processing Systems, vol. 33, pp. 4247–4258, 2020.
- Chaplot et al. (2020b) D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, “Neural topological slam for visual navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 875–12 884.
- Chen et al. (2023) G. Chen, Y.-D. Zheng, J. Wang, J. Xu, Y. Huang, J. Pan, Y. Wang, Y. Wang, Y. Qiao, T. Lu, and L. Wang, “Videollm: Modeling video sequence with large language models,” 2023.
- Chen et al. (2020) K. Chen, Q. Huang, H. Palangi, P. Smolensky, K. D. Forbus, and J. Gao, “Mapping natural-language problems to formal-language solutions using structured neural representations,” in ICML 2020, July 2020.
- Chen et al. (2021) K. Chen, Q. Huang, D. McDuff, X. Gao, H. Palangi, J. Wang, K. Forbus, and J. Gao, “Nice: Neural image commenting with empathy,” in EMNLP 2021, October 2021. [Online]. Available: https://www.microsoft.com/en-us/research/publication/nice-neural-image-commenting-with-empathy/
- Choi et al. (2023) J. H. Choi, K. E. Hickman, A. Monahan, and D. Schwarcz, “Chatgpt goes to law school,” Available at SSRN, 2023.
- Chung et al. (2022) H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
- Codella et al. (2018) N. C. F. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, and A. Halpern, “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic),” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 168–172.
- Creswell et al. (2022) A. Creswell, M. Shanahan, and I. Higgins, “Selection-inference: Exploiting large language models for interpretable logical reasoning,” arXiv preprint arXiv:2205.09712, 2022.
- Cui et al. (2023) B. Cui, A. Lupu, S. Sokota, H. Hu, D. J. Wu, and J. N. Foerster, “Adversarial diversity in hanabi,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=uLE3WF3-H_5
- Dagan et al. (2023) G. Dagan, F. Keller, and A. Lascarides, “Dynamic planning with a llm,” arXiv preprint arXiv:2308.06391, 2023.
- Dai et al. (2023) W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023.
- d’Avila Garcez and Lamb (2020) A. d’Avila Garcez and L. C. Lamb, “Neurosymbolic ai: The 3rd wave,” 2020.
- Deitke et al. (2020) M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford et al., “Robothor: An open simulation-to-real embodied ai platform,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3164–3174.
- Dong et al. (2022) Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint arXiv:2301.00234, 2022.
- Dorbala et al. (2022) V. S. Dorbala, G. Sigurdsson, R. Piramuthu, J. Thomason, and G. S. Sukhatme, “Clip-nav: Using clip for zero-shot vision-and-language navigation,” arXiv preprint arXiv:2211.16649, 2022.
- Dorbala et al. (2023) V. S. Dorbala, J. F. Mullen Jr, and D. Manocha, “Can an embodied agent find your "cat-shaped mug"? llm-based zero-shot object navigation,” arXiv preprint arXiv:2303.03480, 2023.
- Dosovitskiy et al. (2021) A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021.
- Driess et al. (2023) D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
- Du et al. (2023) Y. Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum et al., “Video language planning,” arXiv preprint arXiv:2310.10625, 2023.
- Dziri et al. (2021) N. Dziri, A. Madotto, O. Zaiane, and A. J. Bose, “Neural path hunter: Reducing hallucination in dialogue systems via path grounding,” arXiv preprint arXiv:2104.08455, 2021.
- Ehsani et al. (2021) K. Ehsani, W. Han, A. Herrasti, E. VanderBilt, L. Weihs, E. Kolve, A. Kembhavi, and R. Mottaghi, “Manipulathor: A framework for visual object manipulation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4497–4506.
- Fried et al. (2018) D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, “Speaker-follower models for vision-and-language navigation,” in Advances in Neural Information Processing Systems (NIPS), 2018.
- Fu et al. (2022) T.-J. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu, “Violet : End-to-end video-language transformers with masked visual-token modeling,” 2022.
- Fu et al. (2023) ——, “An empirical study of end-to-end video-language transformers with masked visual modeling,” 2023.
- Gadre et al. (2023) S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 171–23 181.
- Gao et al. (2020) J. Gao, B. Peng, C. Li, J. Li, S. Shayandeh, L. Liden, and H.-Y. Shum, “Robust conversational ai with grounded text generation,” arXiv preprint arXiv:2009.03457, 2020.
- Gao et al. (2022) J. Gao, C. Xiong, P. Bennett, and N. Craswell, “Neural approaches to conversational information retrieval,” arXiv preprint arXiv:2201.05176, 2022.
- Garrett et al. (2021) C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez, “Integrated task and motion planning,” Annual review of control, robotics, and autonomous systems, vol. 4, pp. 265–293, 2021.
- Gervet et al. (2023) T. Gervet, S. Chintala, D. Batra, J. Malik, and D. S. Chaplot, “Navigating to objects in the real world,” Science Robotics, vol. 8, no. 79, p. eadf6991, 2023.
- Gong et al. (2023b) R. Gong, J. Huang, Y. Zhao, H. Geng, X. Gao, Q. Wu, W. Ai, Z. Zhou, D. Terzopoulos, S.-C. Zhu et al., “Arnold: A benchmark for language-grounded task learning with continuous states in realistic 3d scenes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- Gong et al. (2023a) R. Gong, Q. Huang, X. Ma, H. Vo, Z. Durante, Y. Noda, Z. Zheng, S.-C. Zhu, D. Terzopoulos, L. Fei-Fei et al., “Mindagent: Emergent gaming interaction,” arXiv preprint arXiv:2309.09971, 2023.
- Goyal et al. (2023) A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox, “Rvt: Robotic view transformer for 3d object manipulation,” arXiv preprint arXiv:2306.14896, 2023.
- Gramopadhye and Szafir (2022) M. Gramopadhye and D. Szafir, “Generating executable action plans with environmentally-aware language models,” arXiv preprint arXiv:2210.04964, 2022.
- Gudibande et al. (2023) A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu, P. Abbeel, S. Levine, and D. Song, “The false promise of imitating proprietary llms,” arXiv preprint arXiv:2305.15717, 2023.
- Gui et al. (2022b) L. Gui, Q. Huang, A. Hauptmann, Y. Bisk, and J. Gao, “Vlc: Training vision-language transformers from captions,” May 2022.
- Gui et al. (2022a) L. Gui, B. Wang, Q. Huang, A. Hauptmann, Y. Bisk, and J. Gao, “Kat: A knowledge augmented transformer for vision-and-language,” in NAACL 2022. Long paper, Oral., January 2022.
- Guimarães et al. (2016) R. L. Guimarães, A. S. de Oliveira, J. A. Fabro, T. Becker, and V. A. Brenner, “Ros navigation: Concepts and tutorial,” Robot Operating System (ROS) The Complete Reference (Volume 1), pp. 121–160, 2016.
- Guu et al. (2020) K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval augmented language model pre-training,” in International conference on machine learning. PMLR, 2020, pp. 3929–3938.
- Ha et al. (2023) H. Ha, P. Florence, and S. Song, “Scaling up and distilling down: Language-guided robot skill acquisition,” arXiv preprint arXiv:2307.14535, 2023.
- Haarnoja et al. (2023) T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, M. Wulfmeier, J. Humplik, S. Tunyasuvunakool, N. Y. Siegel, R. Hafner et al., “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,” arXiv preprint arXiv:2304.13653, 2023.
- He et al. (2022) K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” CVPR, 2022.
- Hemachandra et al. (2015) S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter, “Learning models for following natural language directions in unknown environments,” arXiv preprint arXiv:1503.05079, 2015.
- Henrich et al. (2010) J. Henrich, S. J. Heine, and A. Norenzayan, “The weirdest people in the world?” Behavioral and Brain Sciences, vol. 33, no. 2-3, p. 61–83, 2010.
- Hensel et al. (2023) L. B. Hensel, N. Yongsatianchot, P. Torshizi, E. Minucci, and S. Marsella, “Large language models in textual analysis for gesture selection,” in INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2023, pp. 378–387.
- Ho et al. (2021) D. Ho, K. Rao, Z. Xu, E. Jang, M. Khansari, and Y. Bai, “Retinagan: An object-aware approach to sim-to-real transfer,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 10 920–10 926.
- Huang et al. (2023b) C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 10 608–10 615.
- Huang et al. (2023a) Q. Huang, J. S. Park, A. Gupta, P. Bennett, R. Gong, S. Som, B. Peng, O. K. Mohammed, C. Pal, Y. Choi et al., “Ark: Augmented reality with knowledge interactive emergent ability,” arXiv preprint arXiv:2305.00970, 2023.
- Huang et al. (2022a) W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17–23 Jul 2022, pp. 9118–9147. [Online]. Available: https://proceedings.mlr.press/v162/huang22a.html
- Huang et al. (2022b) W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” in arXiv preprint arXiv:2207.05608, 2022.
- Huang et al. (2023c) Z. Huang, H. Feng, Z. Chongzhi, L. Sheng, L. Ziwei, and J. Shao, “Dolphin: General video interaction platform based on llms,” 2023, https://github.com/kaleido-lab/dolphin.
- Ikeuchi et al. (0) K. Ikeuchi, N. Wake, K. Sasabuchi, and J. Takamatsu, “Semantic constraints to represent common sense required in household actions for multimodal learning-from-observation robot,” The International Journal of Robotics Research, vol. 0, no. 0, p. 02783649231212929, 0.
- Ikeuchi et al. (2023) K. Ikeuchi, J. Takamatsu, K. Sasabuchi, N. Wake, and A. Kanehiro, “Applying learning-from-observation to household service robots: three common-sense formulation,” arXiv preprint arXiv:2304.09966, 2023.
- Imani et al. (2023) S. Imani, L. Du, and H. Shrivastava, “Mathprompter: Mathematical reasoning using large language models,” arXiv preprint arXiv:2303.05398, 2023.
- James and Davison (2022) S. James and A. J. Davison, “Q-attention: Enabling efficient learning for vision-based robotic manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1612–1619, 2022.
- Jang et al. (2022) E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” in Conference on Robot Learning. PMLR, 2022, pp. 991–1002.
- Ji et al. (2023) Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
- Jiang et al. (2022) Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” arXiv, 2022.
- Kalashnikov et al. (2018) D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning. PMLR, 2018, pp. 651–673.
- Karpathy et al. (2014) A. Karpathy, A. Joulin, and L. F. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” Advances in neural information processing systems, vol. 27, 2014.
- Katara et al. (2023) P. Katara, Z. Xian, and K. Fragkiadaki, “Gen2sim: Scaling up robot learning in simulation with generative models,” arXiv preprint arXiv:2310.18308, 2023.
- Ke et al. (2019) L. Ke, X. Li, B. Yonatan, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi, and S. Srinivasa, “Tactical rewind: Self-correction via backtracking in vision-and-language navigation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Kim et al. (2023) J. Kim, J. Kim, and S. Choi, “Flame: Free-form language-based motion synthesis & editing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 7, 2023, pp. 8255–8263.
- Ko et al. (2023) P.-C. Ko, J. Mao, Y. Du, S.-H. Sun, and J. B. Tenenbaum, “Learning to act from actionless videos through dense correspondences,” arXiv preprint arXiv:2310.08576, 2023.
- Kolve et al. (2017) E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, M. Deitke, K. Ehsani, D. Gordon, Y. Zhu et al., “Ai2-thor: An interactive 3d environment for visual ai,” arXiv preprint arXiv:1712.05474, 2017.
- Krishna et al. (2016) R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” in arXiv:1602.07332, 2016.
- Kumar et al. (2023) K. N. Kumar, I. Essa, and S. Ha, “Words into action: Learning diverse humanoid robot behaviors using language guided iterative motion refinement,” arXiv preprint arXiv:2310.06226, 2023.
- Lee et al. (2023) P. Lee, S. Bubeck, and J. Petro, “Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine,” New England Journal of Medicine, vol. 388, no. 13, pp. 1233–1239, 2023.
- Lewis et al. (2020) P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
- Li et al. (2023b) B. Li, P. Wu, P. Abbeel, and J. Malik, “Interactive task planning with language models,” arXiv preprint arXiv:2310.10645, 2023.
- Li et al. (2021a) C. Li, F. Xia, R. Martín-Martín, M. Lingelbach, S. Srivastava, B. Shen, K. Vainio, C. Gokmen, G. Dharan, T. Jain et al., “igibson 2.0: Object-centric simulation for robot learning of everyday household tasks,” arXiv preprint arXiv:2108.03272, 2021.
- Li et al. (2023f) C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” arXiv preprint arXiv:2306.00890, 2023.
- Li et al. (2023a) G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “Camel: Communicative agents for "mind" exploration of large scale language model society,” arXiv preprint arXiv:2303.17760, 2023.
- Li et al. (2023e) J. Li, Q. Gao, M. Johnston, X. Gao, X. He, S. Shakiah, H. Shi, R. Ghanadan, and W. Y. Wang, “Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning,” arXiv preprint arXiv:2310.09676, 2023.
- Li et al. (2023c) J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023.
- Li et al. (2023g) K. Li, Y. He, W. Yi, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023.
- Li et al. (2020) L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, “Hero: Hierarchical encoder for video+language omni-representation pre-training,” 2020.
- Li et al. (2021b) L. Li, J. Lei, Z. Gan, L. Yu, Y.-C. Chen, R. Pillai, Y. Cheng, L. Zhou, X. E. Wang, W. Y. Wang, T. L. Berg, M. Bansal, J. Liu, L. Wang, and Z. Liu, “Value: A multi-task benchmark for video-and-language understanding evaluation,” 2021.
- Li et al. (2023d) X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu et al., “Vision-language foundation models as effective robot imitators,” arXiv preprint arXiv:2311.01378, 2023.
- Liang et al. (2022) J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in arXiv preprint arXiv:2209.07753, 2022.
- Liang et al. (2023) X. Liang, L. Ma, S. Guo, J. Han, H. Xu, S. Ma, and X. Liang, “Mo-vln: A multi-task benchmark for open-set zero-shot vision-and-language navigation,” arXiv preprint arXiv:2306.10322, 2023.
- Lifshitz et al. (2023) S. Lifshitz, K. Paster, H. Chan, J. Ba, and S. McIlraith, “Steve-1: A generative model for text-to-behavior in minecraft,” arXiv preprint arXiv:2306.00937, 2023.
- Lin et al. (2023) K. Lin, F. Ahmed, L. Li, C.-C. Lin, E. Azarnasab, Z. Yang, J. Wang, L. Liang, Z. Liu, Y. Lu, C. Liu, and L. Wang, “Mm-vid: Advancing video understanding with gpt-4v(ision),” 2023.
- Lin et al. (2014) T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft coco: Common objects in context,” Proceedings of ECCV, 2014.
- Liu and Negrut (2021) C. K. Liu and D. Negrut, “The role of physics-based simulators in robotics,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 35–58, 2021.
- Liu et al. (2023c) H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023.
- Liu et al. (2023a) H. Liu, A. Chen, Y. Zhu, A. Swaminathan, A. Kolobov, and C.-A. Cheng, “Interactive robot learning from verbal correction,” arXiv preprint arXiv:2310.17555, 2023.
- Liu et al. (2023b) Y. Liu, W. Held, and D. Yang, “Dada: Dialect adaptation via dynamic aggregation of linguistic rules,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Lu et al. (2023) P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-and-play compositional reasoning with large language models,” 2023.
- Luo et al. (2022) Z. Luo, Z. Durante, L. Li, W. Xie, R. Liu, E. Jin, Z. Huang, L. Y. Li, J. Wu, J. C. Niebles et al., “Moma-lrg: Language-refined graphs for multi-object multi-actor activity parsing,” Advances in Neural Information Processing Systems, vol. 35, pp. 5282–5298, 2022.
- Ma et al. (2019a) C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong, “Self-monitoring navigation agent via auxiliary progress estimation,” arXiv preprint arXiv:1901.03035, 2019.
- Ma et al. (2019b) C.-Y. Ma, Z. Wu, G. AlRegib, C. Xiong, and Z. Kira, “The regretful agent: Heuristic-aided navigation through progress estimation,” arXiv preprint arXiv:1903.01602, 2019.
- Ma et al. (2023) Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” arXiv preprint arXiv:2310.12931, 2023.
- Maaz et al. (2023) M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” 2023.
- Mao et al. (2022) R. Mao, Q. Liu, K. He, W. Li, and E. Cambria, “The biases of pre-trained language models: An empirical study on prompt-based sentiment analysis and emotion detection,” IEEE Transactions on Affective Computing, 2022.
- Marcus (2020) G. Marcus, “The next decade in ai: four steps towards robust artificial intelligence,” arXiv preprint arXiv:2002.06177, 2020.
- Marcus and Davis (2019) G. Marcus and E. Davis, Rebooting AI: Building artificial intelligence we can trust. Pantheon, 2019.
- Marino et al. (2019) K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “Ok-vqa: A visual question answering benchmark requiring external knowledge,” in CVPR, 2019.
- Martinez-Gonzalez et al. (2020) P. Martinez-Gonzalez, S. Oprea, A. Garcia-Garcia, A. Jover-Alvarez, S. Orts-Escolano, and J. Garcia-Rodriguez, “Unrealrox: an extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation,” Virtual Reality, vol. 24, pp. 271–288, 2020.
- Maynez et al. (2020) J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On faithfulness and factuality in abstractive summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds. Online: Association for Computational Linguistics, Jul. 2020, pp. 1906–1919. [Online]. Available: https://aclanthology.org/2020.acl-main.173
- Mees et al. (2022) O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022.
- Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022) Meta Fundamental AI Research (FAIR) Diplomacy Team, A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu et al., “Human-level play in the game of Diplomacy by combining language models with strategic reasoning,” Science, vol. 378, no. 6624, pp. 1067–1074, 2022.
- Min et al. (2022) S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?” arXiv preprint arXiv:2202.12837, 2022.
- Minsky (1975) M. L. Minsky, “Minsky’s frame system theory,” in Proceedings of the 1975 Workshop on Theoretical Issues in Natural Language Processing, ser. TINLAP ’75. USA: Association for Computational Linguistics, 1975, p. 104–116. [Online]. Available: https://doi.org/10.3115/980190.980222
- Mirchandani et al. (2023) S. Mirchandani, F. Xia, P. Florence, B. Ichter, D. Driess, M. G. Arenas, K. Rao, D. Sadigh, and A. Zeng, “Large language models as general pattern machines,” arXiv preprint arXiv:2307.04721, 2023.
- Mirowski et al. (2016) P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., “Learning to navigate in complex environments,” arXiv preprint arXiv:1611.03673, 2016.
- Mittal et al. (2023) M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar et al., “Orbit: A unified simulation framework for interactive robot learning environments,” IEEE Robotics and Automation Letters, 2023.
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
- Mousavian et al. (2018) A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, and J. Davidson, “Visual representations for semantic target driven navigation,” arXiv preprint arXiv:1805.06066, 2018.
- Mu et al. (2021) T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su, “Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,” arXiv preprint arXiv:2107.14483, 2021.
- Müller et al. (2018) M. Müller, V. Casser, J. Lahoud, N. Smith, and B. Ghanem, “Sim4cv: A photo-realistic simulator for computer vision applications,” International Journal of Computer Vision, vol. 126, pp. 902–919, 2018.
- Nay (2022) J. J. Nay, “Law informs code: A legal informatics approach to aligning artificial intelligence with humans,” Nw. J. Tech. & Intell. Prop., vol. 20, p. 309, 2022.
- Nguyen et al. (2018) K. Nguyen, D. Dey, C. Brockett, and B. Dolan, “Vision-based navigation with language-based assistance via imitation learning with indirect intervention,” arXiv preprint arXiv:1812.04155, 2018.
- Ni et al. (2023) Z. Ni, X.-X. Deng, C. Tai, X.-Y. Zhu, X. Wu, Y.-J. Liu, and L. Zeng, “Grid: Scene-graph-based instruction-driven robotic task planning,” arXiv preprint arXiv:2309.07726, 2023.
- OpenAI (2023) OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep., 2023.
- Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
- Padalkar et al. (2023) A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan et al., “Open x-embodiment: Robotic learning datasets and rt-x models,” arXiv preprint arXiv:2310.08864, 2023.
- Parakh et al. (2023) M. Parakh, A. Fong, A. Simeonov, A. Gupta, T. Chen, and P. Agrawal, “Human-assisted continual robot learning with foundation models,” arXiv preprint arXiv:2309.14321, 2023.
- Park et al. (2023b) J. S. Park, J. Hessel, K. Chandu, P. P. Liang, X. Lu, P. West, Q. Huang, J. Gao, A. Farhadi, and Y. Choi, “Multimodal agent – localized symbolic knowledge distillation for visual commonsense models,” in NeurIPS 2023, October 2023.
- Park et al. (2023c) J. S. Park, J. Hessel, K. Chandu, P. P. Liang, X. Lu, P. West, Y. Yu, Q. Huang, J. Gao, A. Farhadi, and Y. Choi, “Localized symbolic knowledge distillation for visual commonsense models,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=V5eG47pyVl
- Park et al. (2022) J. Park, Q. Huang, Y. Bisk, J. Yang, S. Som, A. Farhadi, Y. Choi, and J. Gao, “Ink: Intensive neural knowledge,” July 2022.
- Park et al. (2023a) J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” arXiv preprint arXiv:2304.03442, 2023.
- Peng et al. (2023) B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen et al., “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” arXiv preprint arXiv:2302.12813, 2023.
- Peng et al. (2018) X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 3803–3810.
- Puig et al. (2018) X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba, “Virtualhome: Simulating household activities via programs,” in 2018 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8494–8502.
- Puig et al. (2023) X. Puig, E. Undersander, A. Szot, M. D. Cote, T.-Y. Yang, R. Partsey, R. Desai, A. W. Clegg, M. Hlavac, S. Y. Min et al., “Habitat 3.0: A co-habitat for humans, avatars and robots,” arXiv preprint arXiv:2310.13724, 2023.
- Radford et al. (2021) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- Ramakrishnan et al. (2022) S. K. Ramakrishnan, D. S. Chaplot, Z. Al-Halah, J. Malik, and K. Grauman, “Poni: Potential functions for objectgoal navigation with interaction-free learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 890–18 900.
- Raman et al. (2023) S. S. Raman, V. Cohen, D. Paulius, I. Idrees, E. Rosen, R. Mooney, and S. Tellex, “Cape: Corrective actions from precondition errors using large language models,” in 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.
- Rao et al. (2020) K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari, “Rl-cyclegan: Reinforcement learning aware simulation-to-real,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 157–11 166.
- Raunak et al. (2021) V. Raunak, A. Menezes, and M. Junczys-Dowmunt, “The curious case of hallucinations in neural machine translation,” arXiv preprint arXiv:2104.06683, 2021.
- Reed et al. (2022) S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg et al., “A generalist agent,” arXiv preprint arXiv:2205.06175, 2022.
- Ren et al. (2015) M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” Advances in neural information processing systems, vol. 28, 2015.
- Rohrbach et al. (2018) A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, “Object hallucination in image captioning,” arXiv preprint arXiv:1809.02156, 2018.
- Rosinol et al. (2022) A. Rosinol, J. J. Leonard, and L. Carlone, “Nerf-slam: Real-time dense monocular slam with neural radiance fields,” arXiv preprint arXiv:2210.13641, 2022.
- Sadeghi and Levine (2016) F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.
- Saito et al. (2022) D. Saito, K. Sasabuchi, N. Wake, J. Takamatsu, H. Koike, and K. Ikeuchi, “Task-grasping from a demonstrated human strategy,” in 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), 2022, pp. 880–887.
- Saito et al. (2023) D. Saito, K. Sasabuchi, N. Wake, A. Kanehira, J. Takamatsu, H. Koike, and K. Ikeuchi, “Constraint-aware policy for compliant manipulation,” 2023.
- Sarkar et al. (2023) B. Sarkar, A. Shih, and D. Sadigh, “Diverse conventions for human-AI collaboration,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Sasabuchi et al. (2021) K. Sasabuchi, N. Wake, and K. Ikeuchi, “Task-oriented motion mapping on robots of various configuration using body role division,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 413–420, 2021.
- Sasabuchi et al. (2023) K. Sasabuchi, D. Saito, A. Kanehira, N. Wake, J. Takamatsu, and K. Ikeuchi, “Task-sequencing simulator: Integrated machine learning to execution simulation for robot manipulation,” arXiv preprint arXiv:2301.01382, 2023.
- Savva et al. (2017) M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun, “Minos: Multimodal indoor simulator for navigation in complex environments,” arXiv preprint arXiv:1712.03931, 2017.
- Savva et al. (2019) M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik et al., “Habitat: A platform for embodied ai research,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9339–9347.
- Schick et al. (2023) T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” 2023.
- Segalis et al. (2023) E. Segalis, D. Valevski, D. Lumen, Y. Matias, and Y. Leviathan, “A picture is worth a thousand words: Principled recaptioning improves image generation,” arXiv preprint arXiv:2310.16656, 2023.
- Shacklett et al. (2023) B. Shacklett, L. G. Rosenzweig, Z. Xie, B. Sarkar, A. Szot, E. Wijmans, V. Koltun, D. Batra, and K. Fatahalian, “An extensible, data-oriented architecture for high-performance, many-world simulation,” ACM Trans. Graph., vol. 42, no. 4, 2023.
- Shah et al. (2023a) D. Shah, B. Osiński, S. Levine et al., “Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,” in Conference on Robot Learning. PMLR, 2023, pp. 492–504.
- Shah et al. (2023b) R. Shah, R. Martín-Martín, and Y. Zhu, “Mutex: Learning unified policies from multimodal task specifications,” arXiv preprint arXiv:2309.14320, 2023.
- Shah et al. (2018) S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics: Results of the 11th International Conference. Springer, 2018, pp. 621–635.
- Sharma et al. (2018) P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
- Shi et al. (2023) R. Shi, Y. Liu, Y. Ze, S. S. Du, and H. Xu, “Unleashing the power of pre-trained language models for offline reinforcement learning,” arXiv preprint arXiv:2310.20587, 2023.
- Shridhar et al. (2023) M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 785–799.
- Shuster et al. (2021) K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, “Retrieval augmentation reduces hallucination in conversation,” arXiv preprint arXiv:2104.07567, 2021.
- Singh et al. (2019) A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8317–8326.
- Song et al. (2017) S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Sontakke et al. (2023) S. A. Sontakke, J. Zhang, S. M. Arnold, K. Pertsch, E. Bıyık, D. Sadigh, C. Finn, and L. Itti, “Roboclip: One demonstration is enough to learn robot policies,” arXiv preprint arXiv:2310.07899, 2023.
- Srivastava et al. (2022) S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, K. Liu et al., “Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments,” in Conference on Robot Learning. PMLR, 2022, pp. 477–490.
- Sun et al. (2023a) J. Sun, Q. Zhang, Y. Duan, X. Jiang, C. Cheng, and R. Xu, “Prompt, plan, perform: Llm-based humanoid control via quantized imitation learning,” arXiv preprint arXiv:2309.11359, 2023.
- Sun et al. (2023b) Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao, “Eva-clip: Improved training techniques for clip at scale,” arXiv preprint arXiv:2303.15389, 2023.
- Szot et al. (2021) A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” in Advances in Neural Information Processing Systems (NeurIPS), 2021.
- Takamatsu et al. (2022) J. Takamatsu, K. Sasabuchi, N. Wake, A. Kanehira, and K. Ikeuchi, “Learning-from-observation system considering hardware-level reusability,” arXiv preprint arXiv:2212.09242, 2022.
- Tang et al. (2023) C. Tang, D. Huang, W. Ge, W. Liu, and H. Zhang, “Graspgpt: Leveraging semantic knowledge from a large language model for task-oriented grasping,” IEEE Robotics and Automation Letters, 2023.
- Taori et al. (2023) R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Teshima et al. (2022) H. Teshima, N. Wake, D. Thomas, Y. Nakashima, H. Kawasaki, and K. Ikeuchi, “Deep gesture generation for social robots using type-specific libraries,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 8286–8291.
- Tevet et al. (2022) G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” arXiv preprint arXiv:2209.14916, 2022.
- Thomason et al. (2018) J. Thomason, D. Gordan, and Y. Bisk, “Shifting the baseline: Single modality performance on visual navigation & qa,” arXiv preprint arXiv:1811.00613, 2018.
- Tobin et al. (2017) J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2017, pp. 23–30.
- Touvron et al. (2023) H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- Tsoi et al. (2022) N. Tsoi, A. Xiang, P. Yu, S. S. Sohn, G. Schwartz, S. Ramesh, M. Hussein, A. W. Gupta, M. Kapadia, and M. Vázquez, “Sean 2.0: Formalizing and generating social situations for robot navigation,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 047–11 054, 2022.
- Wake et al. (2021a) N. Wake, R. Arakawa, I. Yanokura, T. Kiyokawa, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “A learning-from-observation framework: One-shot robot teaching for grasp-manipulation-release household operations,” in 2021 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2021.
- Wake et al. (2021b) N. Wake, I. Yanokura, K. Sasabuchi, and K. Ikeuchi, “Verbal focus-of-attention system for learning-from-observation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 10 377–10 384.
- Wake et al. (2023a) N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “Interactive task encoding system for learning-from-observation,” in 2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2023, pp. 1061–1066.
- Wake et al. (2023d) ——, “Bias in emotion recognition with chatgpt,” arXiv preprint arXiv:2310.11753, 2023.
- Wake et al. (2023f) ——, “Gpt models meet robotic applications: Co-speech gesturing chat system,” arXiv preprint arXiv:2306.01741, 2023.
- Wake et al. (2023b) ——, “Gpt-4v(ision) for robotics: Multimodal task planning from human demonstration,” arXiv preprint arXiv:2311.12015, 2023.
- Wake et al. (2023c) ——, “Chatgpt empowered long-step robot control in various environments: A case application,” IEEE Access, vol. 11, pp. 95 060–95 078, 2023.
- Wake et al. (2023e) N. Wake, D. Saito, K. Sasabuchi, H. Koike, and K. Ikeuchi, “Text-driven object affordance for guiding grasp-type recognition in multimodal robot teaching,” Machine Vision and Applications, vol. 34, no. 4, p. 58, 2023.
- Wang et al. (2023e) B. Wang, Q. Huang, B. Deb, A. L. Halfaker, L. Shao, D. McDuff, A. Awadallah, D. Radev, and J. Gao, “Logical transformers: Infusing logical structures into pre-trained language models,” in Proceedings of ACL 2023, May 2023.
- Wang et al. (2023g) D. Wang, Q. Huang, M. Jackson, and J. Gao, “Retrieve what you need: A mutual learning framework for open-domain question answering,” March 2023. [Online]. Available: https://www.microsoft.com/en-us/research/publication/retrieve-what-you-need-a-mutual-learning-framework-for-open-domain-question-answering/
- Wang et al. (2023b) G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
- Wang et al. (2023c) L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” arXiv preprint arXiv:2308.11432, 2023.
- Wang et al. (2015) P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick, “Explicit knowledge-based reasoning for visual question answering,” arXiv preprint arXiv:1511.02570, 2015.
- Wang et al. (2017) P. Wang, Q. Wu, C. Shen, A. Dick, and A. Van Den Hengel, “Fvqa: Fact-based visual question answering,” TPAMI, vol. 40, no. 10, pp. 2413–2427, 2017.
- Wang et al. (2018) X. Wang, W. Xiong, H. Wang, and W. Y. Wang, “Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation,” in The European Conference on Computer Vision (ECCV), September 2018.
- Wang et al. (2019) X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Weng, W. Y. Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in CVPR 2019, June 2019.
- Wang et al. (2023f) Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Chen, Y. Wang, P. Luo, Z. Liu, Y. Wang, L. Wang, and Y. Qiao, “Internvid: A large-scale video-text dataset for multimodal understanding and generation,” 2023.
- Wang et al. (2022b) Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language model with self generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
- Wang et al. (2023d) Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, “Robogen: Towards unleashing infinite data for automated robot learning via generative simulation,” arXiv preprint arXiv:2311.01455, 2023.
- Wang et al. (2022a) Z. Wang, Y. Chen, T. Liu, Y. Zhu, W. Liang, and S. Huang, “Humanise: Language-conditioned human motion generation in 3d scenes,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 14 959–14 971. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/6030db5195150ac86d942186f4abdad8-Paper-Conference.pdf
- Wang et al. (2023a) Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” arXiv preprint arXiv:2302.01560, 2023.
- Wei et al. (2022) J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
- World Health Organization and World Bank (2015) World Health Organization and World Bank, “Tracking universal health coverage: First global monitoring report,” www.who.int/healthinfo/universal_health_coverage/report/2015/en, Jun 2015.
- Wu et al. (2023) Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang, “Autogen: Enabling next-gen llm applications via multi-agent conversation,” Microsoft, Tech. Rep. MSR-TR-2023-33, August 2023. [Online]. Available: https://www.microsoft.com/en-us/research/publication/autogen-enabling-next-gen-llm-applications-via-multi-agent-conversation-framework/
- Xi et al. (2023) Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
- Xia et al. (2018) F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, “Gibson Env: real-world perception for embodied agents,” in Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
- Xu et al. (2023) M. Xu, P. Huang, W. Yu, S. Liu, X. Zhang, Y. Niu, T. Zhang, F. Xia, J. Tan, and D. Zhao, “Creative robot tool use with large language models,” arXiv preprint arXiv:2310.13065, 2023.
- Yang et al. (2023a) J. Yang, Y. Dong, S. Liu, B. Li, Z. Wang, C. Jiang, H. Tan, J. Kang, Y. Zhang, K. Zhou et al., “Octopus: Embodied vision-language programmer from environmental feedback,” arXiv preprint arXiv:2310.08588, 2023.
- Yang et al. (2023b) K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Ananiadou, “On the evaluations of chatgpt and emotion-enhanced prompting for mental health analysis,” arXiv preprint arXiv:2304.03347, 2023.
- Yang et al. (2023c) Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang, “Mm-react: Prompting chatgpt for multimodal reasoning and action,” 2023.
- Yao et al. (2023b) S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” 2023.
- Yao et al. (2023a) S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” 2023.
- Ye et al. (2023b) Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen, J. Tian, Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Modularization empowers large language models with multimodality,” 2023.
- Ye et al. (2023a) Y. Ye, H. You, and J. Du, “Improved trust in human-robot collaboration with chatgpt,” IEEE Access, 2023.
- Young et al. (2014) P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2014.
- Yu et al. (2023b) J. Yu, X. Wang, S. Tu, S. Cao, D. Zhang-Li, X. Lv, H. Peng, Z. Yao, X. Zhang, H. Li et al., “Kola: Carefully benchmarking world knowledge of large language models,” arXiv preprint arXiv:2306.09296, 2023.
- Yu et al. (2016) L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer, 2016, pp. 69–85.
- Yu et al. (2023a) W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, J. Humplik et al., “Language to rewards for robotic skill synthesis,” arXiv preprint arXiv:2306.08647, 2023.
- Zellers et al. (2021) R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi, “Merlot: Multimodal neural script knowledge models,” 2021.
- Zellers et al. (2022) R. Zellers, J. Lu, X. Lu, Y. Yu, Y. Zhao, M. Salehi, A. Kusupati, J. Hessel, A. Farhadi, and Y. Choi, “Merlot reserve: Neural script knowledge through vision and language and sound,” 2022.
- Zeng et al. (2021) A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V. Sindhwani et al., “Transporter networks: Rearranging the visual world for robotic manipulation,” in Conference on Robot Learning. PMLR, 2021, pp. 726–747.
- Zeng et al. (2023) A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang, “Agenttuning: Enabling generalized agent abilities for llms,” 2023.
- Zha et al. (2023) L. Zha, Y. Cui, L.-H. Lin, M. Kwon, M. G. Arenas, A. Zeng, F. Xia, and D. Sadigh, “Distilling and retrieving generalizable knowledge for robot manipulation via language corrections,” arXiv preprint arXiv:2311.10678, 2023.
- Zhang et al. (2022) M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” arXiv preprint arXiv:2208.15001, 2022.
- Zhang et al. (2021) S. Zhang, X. Song, Y. Bai, W. Li, Y. Chu, and S. Jiang, “Hierarchical object-to-zone graph for object navigation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15 130–15 140.
- Zhao et al. (2023) W. Zhao, Y. Zhao, X. Lu, S. Wang, Y. Tong, and B. Qin, “Is chatgpt equipped with emotional dialogue capabilities?” arXiv preprint arXiv:2304.09582, 2023.
- Zhao et al. (2022) Y. Zhao, I. Misra, P. Krähenbühl, and R. Girdhar, “Learning video representations from large language models,” in arXiv preprint arXiv:2212.04501, 2022.
- Zheng et al. (2023) L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-judge with mt-bench and chatbot arena,” 2023.
- Zhong et al. (2023) Z. Zhong, J. Cao, S. Gu, S. Xie, W. Gao, L. Luo, Z. Yan, H. Zhao, and G. Zhou, “Assist: Interactive scene nodes for scalable and realistic indoor simulation,” arXiv preprint arXiv:2311.06211, 2023.
- Zhou et al. (2023a) G. Zhou, Y. Hong, and Q. Wu, “Navgpt: Explicit reasoning in vision-and-language navigation with large language models,” arXiv preprint arXiv:2305.16986, 2023.
- Zhou et al. (2023c) H. Zhou, M. Ding, W. Peng, M. Tomizuka, L. Shao, and C. Gan, “Generalizable long-horizon manipulations with large language models,” arXiv preprint arXiv:2310.02264, 2023.
- Zhou et al. (2022) X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in ECCV, 2022.
- Zhou et al. (2023b) Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao, “Analyzing and mitigating object hallucination in large vision-language models,” arXiv preprint arXiv:2310.00754, 2023.
- Zhu et al. (2023) D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” 2023.
- Zhu et al. (2017b) J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
- Zhu et al. (2017c) S. Zhu, A. Kimmel, K. E. Bekris, and A. Boularias, “Fast model identification via physics engines for data-efficient policy search,” arXiv preprint arXiv:1710.08893, 2017.
- Zhu et al. (2022) X. Zhu, J. Wang, L. Zhang, Y. Zhang, R. Gan, J. Zhang, and Y. Yang, “Solving math word problem via cooperative reasoning induced language models,” arXiv preprint arXiv:2210.16257, 2022.
- Zhu et al. (2017a) Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3357–3364.
Appendix for
Agent AI
Appendix A GPT-4V Agent Prompt Details
Unless otherwise specified, we use the default system prompt for GPT-4V. Fig. 31 gives a detailed description of the process we use to prompt GPT-4V for Minecraft and of how we generate GPT-4V’s responses.
Figure 31: We provide a code snippet showing the full prompting process for calling GPT-4V on a Minecraft video. The video frames are first encoded and the GPT-4V API is then called; the model’s response is shown at the bottom of the figure.
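To make the encode-frames-then-call-the-API flow described in the caption concrete, below is a minimal sketch, assuming the OpenAI Python client (v1.x), a handful of frames already extracted to JPEG files, and a placeholder vision model identifier. It illustrates the general pattern rather than reproducing the exact pipeline shown in Fig. 31.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_frame(path: str) -> str:
    """Base64-encode one extracted video frame."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical frame files extracted from a Minecraft recording.
frame_paths = ["frame_00.jpg", "frame_01.jpg", "frame_02.jpg"]

# Build a multimodal user message: one text instruction plus the encoded frames.
content = [{"type": "text",
            "text": "These are consecutive frames from a Minecraft video. "
                    "Describe the player's high-level actions."}]
for path in frame_paths:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"},
    })

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model identifier
    messages=[{"role": "user", "content": content}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```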
Appendix B GPT-4V for Bleeding Edge
Bleeding Edge is a third-person, team-based combat game in which players attempt to capture objective points or collect more resources than the enemy team. We show an example input and output when prompting GPT-4V for Bleeding Edge in Fig. 32. Compared to Minecraft, we qualitatively find that GPT-4V has a less thorough understanding of the visual content and game rules. This is likely due to (1) the amount of Minecraft data present in GPT-4V’s training data and (2) the greater visual complexity of Bleeding Edge compared to Minecraft.
Figure 32: GPT-4V can generate meaningful content for games like Bleeding Edge that have a third-person perspective and visually complex scenes. To input a large number of frames (48) to GPT-4V, we arrange the frames into a grid matrix and overlay a number on each frame (as shown in the figure).
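As a concrete illustration of this grid arrangement, the following is a minimal sketch assuming Pillow, 48 pre-extracted frames of equal size, and a 6-by-8 layout; the exact tiling and numbering details used for the experiments may differ.

```python
from PIL import Image, ImageDraw

def make_numbered_grid(frame_paths, rows=6, cols=8):
    """Tile equally sized frames into a rows x cols grid, overlaying an index on each tile."""
    frames = [Image.open(p).convert("RGB") for p in frame_paths]
    tile_w, tile_h = frames[0].size
    grid = Image.new("RGB", (cols * tile_w, rows * tile_h))
    for i, frame in enumerate(frames):
        ImageDraw.Draw(frame).text((10, 10), str(i + 1), fill="yellow")  # frame-number overlay
        r, c = divmod(i, cols)
        grid.paste(frame, (c * tile_w, r * tile_h))
    return grid

# Hypothetical usage: 48 frames sampled from a Bleeding Edge clip, tiled 6 x 8.
paths = [f"bleeding_edge_frame_{i:02d}.jpg" for i in range(48)]
make_numbered_grid(paths).save("bleeding_edge_grid.jpg")
```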
Appendix C GPT-4V for Microsoft Flight Simulator
Figure 33: GPT-4V can generate a meaningful description of gameplay in Microsoft Flight Simulator. A GPT-4V-based agent can provide a high-level action description: the player operates an aircraft in the flight simulator (shown through cockpit and external views), managing various flight controls and instruments to maintain appropriate airspeed and altitude while navigating the virtual airspace. To input a long video, we select several key frames (6) to send to GPT-4V, inputting each key frame individually without using a grid arrangement (as shown in the figure).
As shown in Fig. 33, a GPT-4V-based agent can provide a high-level action description for the player in Microsoft Flight Simulator. It describes how the player is flying an aircraft, as shown by the cockpit perspective and external views of the plane, managing various flight controls and instruments to maintain appropriate airspeed and altitude while navigating through the virtual airspace.
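The key-frame selection described above can be sketched as follows, assuming OpenCV, a local video file, and uniform temporal sampling; the actual frame-selection strategy is not specified beyond the number of frames used.

```python
import cv2

def sample_keyframes(video_path: str, num_frames: int = 6):
    """Uniformly sample key frames from a video and save them as JPEG files."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = (total - 1) / max(num_frames - 1, 1)
    indices = [round(i * step) for i in range(num_frames)]
    saved = []
    for k, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
        ok, frame = cap.read()
        if ok:
            out_path = f"keyframe_{k}.jpg"  # hypothetical output name
            cv2.imwrite(out_path, frame)
            saved.append(out_path)
    cap.release()
    return saved

# Hypothetical usage with a Flight Simulator recording; each saved frame can then be
# attached to the GPT-4V request as a separate image rather than tiled into a grid.
keyframes = sample_keyframes("flight_sim_clip.mp4", num_frames=6)
```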
Appendix D GPT-4V for Assassin’s Creed Odyssey
Figure 34: GPT-4V can generate meaningful content descriptions for Assassin’s Creed Odyssey. A GPT-4V-based agent can provide a high-level action description: "The character in the images is engaged in nighttime combat during a historical battle, wielding a spear and using special abilities such as the Bull Rush against multiple enemies marked with level indicators, while managing health and ability meters in a third-person action role-playing setting." To input a long video, we select several key frames (8) to send to GPT-4V, inputting each key frame individually without using a grid arrangement (as shown in the figure).
As shown in Fig. 34, the GPT-4V agent provides a high-level action description: the character in the images is engaged in nighttime combat during a historical battle, wielding a spear and using special abilities, such as the Bull Rush, to fight multiple enemies marked with level indicators, while managing health and ability meters in a third-person action role-playing setting. GPT-4V also describes the player as engaged in a large-scale melee combat scenario, controlling a character with red accents on their armor, who is prompted to use a "Bull Rush Ability" by the on-screen instructions. The environment is dark and fiery, suggesting a battlefield, and the player’s character faces multiple enemies marked with level indicators and health bars overhead. The action is part of a tutorial or combat sequence in which the player learns and executes combat abilities.
Appendix E GPT-4V for Gears of War 4
Figure 35: GPT-4V can generate meaningful content descriptions for Gears of War 4. The GPT-4V agent provides a high-level action description of the player engaging in combat within a visually rich, futuristic environment reminiscent of an industrial or military complex. The heavily armed, gun-wielding character moves through the area and engages enemies using cover and aiming systems. Captured actions include the player character aiming from behind cover, firing at enemies, taking damage as indicated by on-screen visual cues, and engaging in close-quarters combat. The environment features nighttime lighting, with artificial light sources highlighting key areas or combat scenes. At certain moments, the player character appears as a translucent outline, possibly indicating a special ability or a temporary state such as invulnerability or stealth. To input a large video, we select several key frames (10) to send to GPT-4V, inputting each key frame individually without using a grid arrangement (as shown above).
As shown in Fig. 35, based on the sequence of 10 video frames provided, the GPT-4V agent can provide a general description of the player’s actions in the video game:
“The player is navigating a character through an industrial environment at night, taking strategic cover behind various objects like crates and walls. The character is armed and engages in combat with enemy entities. The player uses firearms and possibly melee attacks, as indicated by close-quarters combat. They appear to manage their ammunition and health, indicated by visual cues such as a flashing red indicator on the screen, which often signifies damage to the character.
"玩家正在夜间工业环境中操控角色行进,策略性地利用板条箱、墙壁等各类物体作为掩体。该角色持有武器并与敌方单位交战。根据近战画面显示,玩家使用了枪械并可能发动了徒手攻击。他们似乎在进行弹药和生命值管理,这通过屏幕上闪烁的红色指示器等视觉线索体现——此类提示通常表示角色受伤。
Additionally, the player seems to be moving tactically, shifting between different cover positions to advance or defend against the enemy. There is evidence of a protective shield being activated at some point, which suggests the player is responding to incoming fire by using defensive abilities. The character also seems to be reloading their weapon, suggesting an element of resource management during combat. Blood and impact effects indicate successful hits on the enemies.
此外,玩家展现出战术移动行为,在不同掩体位置间切换以推进或防御敌袭。有证据表明玩家曾激活防护盾,这显示其通过使用防御能力来应对敌方火力。角色还出现了武器装弹动作,暗示战斗过程中存在资源管理要素。血液和击中特效表明玩家成功对敌人造成了伤害。"
The player’s character collaborates with allies, as other characters are seen in the vicinity, engaging in similar combat actions. The environment has interactive elements, such as exploding barrels, which the player might use to their advantage.
玩家角色与盟友协同作战,可见其他角色也在附近区域进行类似的战斗行动。环境中存在可互动元素,比如爆炸桶,玩家可巧妙利用这些元素获取战术优势。
The scenes depict fast-paced action with a focus on strategic movement, shooting, and situational awareness to overcome adversaries and possibly achieve a specific objective within the game."
画面描绘了快节奏的动作场景,着重展现战略移动、射击技巧和态势感知能力,玩家需通过这些手段击败对手,并可能在游戏中达成特定目标。
Appendix F GPT-4V for Starfield
附录 F GPT-4V 在《星空》中的应用
图 36:GPT-4V 能够为 Xbox 主机游戏《星空》生成有意义的游戏内容描述。这款以太空探索为主题的游戏获得了 GPT-4V 智能体提供的高层次动作描述。为输入大容量视频数据,我们选取若干关键帧(10 张)发送至 GPT-4V 系统。每张关键帧均单独输入,未采用网格处理方式(如上图所示)。
As shown in Fig. 36, based on the sequence of 10 video frames provided for Starfield, the GPT-4V agent provides a general description of the player’s actions in the video game. According to GPT-4V:
"The images depict a series of events within the game Starfield, showcasing the player’s interactions within a mining environment. Here’s a description of the sequence:
如图 36 所示,基于为《星空》游戏提供的 10 帧视频序列,GPT-4V 智能体对玩家在视频游戏中的行为进行了概括性描述。根据 GPT-4V 的分析:"这些图像展现了游戏《星空》中的一系列事件,呈现了玩家在采矿环境中的交互过程。以下是该序列的描述:
1) The player stands in a dimly lit underground cavern with stalactites, accompanied by another character in similar space gear. They face a highlighted waypoint, indicating an objective or location of interest.
1) 玩家站在一个光线昏暗的地下洞窟中,周围悬挂着钟乳石,身旁站着另一位穿着相似太空装备的角色。他们面朝一个高亮显示的路径点,暗示着目标或兴趣地点。
2) Moving through an industrial interior, likely a mining station, the player approaches another character. The HUD (Heads-Up Display) shows a dialogue line from "Supervisor Lin," which suggests NPC (non-player character) interaction and possibly a mission briefing or update.
2) 玩家穿过可能是采矿站的工业设施内部,接近另一名角色。平视显示器(HUD)上显示着来自'主管林'的对话台词,这表明玩家正在与 NPC(非玩家角色)互动,可能是任务简报或进度更新。
3) The player seems to be following the NPC, Supervisor Lin, through the cavern, hinting at a guided segment of gameplay where the NPC leads the player to a specific location or task.
3) 玩家正跟随 NPC 林主管穿过洞穴,这暗示着游戏中的引导环节——NPC 将带领玩家前往特定地点或任务。
4) A dialogue from Supervisor Lin suggests a decision point or mission update, possibly directing the player to a new objective related to a "big vein" of resources.
4) 林主管的对话内容提示了一个决策点或任务更新,可能指引玩家前往与"大型矿脉"资源相关的新目标。
5) The player encounters Supervisor Lin again, who provides instruction or information about metal deposits, implying the start of a mining task or exploration of a new area.
5) 玩家再次遇见林主管,他提供了关于金属矿藏的指导或信息,暗示着采矿任务或新区域探索的开始。
6) In this frame, the player appears to be observing or interacting with a piece of machinery, with another character, Isabelle, indicating an action or dialogue related to equipment operation or technical work.
6) 在此画面中,玩家似乎正在观察或操作某台机械设备,另一名角色伊莎贝尔正在进行与设备操作或技术工作相关的动作或对话。
7) The player looks down at a rough terrain with two glowing objects, which could represent resources to be collected or points of interest for the mission.
7) 玩家俯视崎岖地形,地面两个发光物体可能代表待收集资源或任务兴趣点。
8) The player examines a specific resource, indicated by a HUD marker as "Beryllium Deposit," suggesting a resource gathering or mining activity.
8) 玩家正在检查标记为"铍矿床"的特定资源,HUD 标识表明这是资源收集或采矿活动。
9) The player traverses through the caverns, potentially searching for additional resources or navigating to the next objective, as indicated by another character in space gear ahead.
9) 玩家在洞穴中穿行,可能正在搜寻额外资源或根据前方身着太空服的另一个角色指引前往下一个目标。
10) The player is positioned in a darker area of the cavern, with two glowing markers indicating a path or objective points to be reached.
10) 玩家身处洞穴较暗区域,两个发光标记指示着需要抵达的路径或目标点。
Throughout these images, the player seems to be engaged in a mission involving exploration, NPC interaction, resource collection, and possibly machinery operation within a mining environment. The HUD elements like waypoints, dialogue text, and mission updates provide context to the player’s actions and the game’s objectives."
整个画面序列显示玩家正参与一项包含勘探、NPC 互动、资源采集及可能涉及采矿环境机械操作的任务。诸如路径点、对话文本和任务更新等 HUD 元素为玩家行动和游戏目标提供了上下文。"
Author Biographies 作者简介
Zane Durante 赞恩·杜兰特
PhD Student in Computer Science at Stanford University
斯坦福大学计算机科学博士研究生
Email: durante@stanford.edu
邮箱:durante@stanford.edu
Web: https://zanedurante.github.io/
个人网站:https://zanedurante.github.io/
Zane Durante is a PhD student at the Stanford Vision Lab, advised by Dr. Fei-Fei Li. His research interests include video understanding, connecting computer vision with natural language, and AI applications in hospital care. He is leading an ongoing collaboration between Stanford Medical School and the Stanford Vision Lab to develop a dataset of natural language descriptions of video clips of hospital scenes. His PhD work is supported by NSF’s Graduate Research Fellowship.
赞恩·杜兰特是斯坦福视觉实验室(Stanford Vision Lab)的博士研究生,导师为李飞飞博士。他的研究兴趣包括视频理解、计算机视觉与自然语言的结合,以及人工智能在医疗护理中的应用。他正在主导斯坦福医学院与斯坦福视觉实验室之间的持续合作项目,致力于开发医院场景视频片段的自然语言描述数据集。其博士研究受到美国国家科学基金会研究生科研奖学金资助。
Qiuyuan Huang 黄秋远
Principal Researcher at Microsoft Research, Redmond, WA, USA.
微软研究院(美国华盛顿州雷德蒙德)首席研究员
Email: qihua@microsoft.com
邮箱:qihua@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/qihua/
个人网页:https://www.microsoft.com/en-us/research/people/qihua/
Qiuyuan Huang is a principal researcher in the deep learning group at Microsoft Research (MSR), Redmond, WA. Her current research interests are mainly in the deep learning, multi-modality, and natural language processing areas, specifically on Agent AI for Gaming, Robotics and Healthcare; Knowledge-reasoning Intelligence for Interactive AI; Neuro-symbolic Computation for Inference Reasoning; and Large Foundation models for NLP and Multi-modality.
黄秋远现任微软研究院雷德蒙德分部深度学习组首席研究员,主要研究领域包括深度学习、多模态与自然语言处理,具体方向涉及游戏/机器人/医疗领域的智能体 AI、交互式 AI 的知识推理智能、神经符号推理计算,以及面向 NLP 与多模态的大型基础模型。
Naoki Wake 和気直樹
Researcher at Microsoft, Redmond, WA, USA.
微软公司(美国华盛顿州雷德蒙德)研究员
Email: naoki.wake@microsoft.com
邮箱:naoki.wake@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/nawake/
个人网页:https://www.microsoft.com/en-us/research/people/nawake/
Naoki Wake is a researcher in the Applied Robotics Research group at Microsoft. His current research involves the development of multimodal perception systems for robots and co-speech gesturing systems. His past research has spanned auditory neuroscience, neuro-rehabilitation, and speech processing. Naoki received his B.S. degree in Engineering in 2014, and his Ph.D. in Information Science and Technology in 2019, both from the University of Tokyo.
和気直樹现任微软应用机器人研究组研究员,主要从事机器人多模态感知系统与伴随语音的肢体动作生成系统研发。其过往研究涵盖听觉神经科学、神经康复及语音处理领域。他于 2014 年获东京大学工学学士学位,2019 年获东京大学信息科学与技术博士学位。
Ran Gong 龚然
PhD Student in Computer Science at University of California, Los Angeles.
加州大学洛杉矶分校计算机科学博士研究生
Email: nikepupu@ucla.edu
邮箱:nikepupu@ucla.edu
Web: https://nikepupu.github.io
网站:https://nikepupu.github.io
Ran Gong is a PhD student at the UCLA VCLA Lab. His research lies at the intersection of Robotics, Computer Vision, Computer Graphics, and Machine Learning. His research focuses on embodied simulation and interaction, with the goal of creating intelligent behaviors that can solve diverse tasks in diverse environments as well as collaborate with humans. He received his B.S. degree in Computer Science and Engineering from the University of California, Los Angeles.
龚然,加州大学洛杉矶分校 VCLA 实验室博士生。他的研究横跨机器人学、计算机视觉、计算机图形学与机器学习领域,聚焦具身模拟与交互技术,致力于开发能在多样化环境中执行多样化任务的智能体行为,并实现与人类的高效协作。他本科毕业于加州大学洛杉矶分校计算机科学与工程系。
Jae Sung Park 朴宰成
PhD Student at University of Washington
华盛顿大学博士生
Email: jspark96@cs.washington.edu
邮箱:jspark96@cs.washington.edu
Web: https://homes.cs.washington.edu/~jspark96/
个人网站:https://homes.cs.washington.edu/~jspark96/
Jae Sung is a PhD student advised by Yejin Choi and Ali Farhadi. His research focuses on developing models with multimodal commonsense reasoning. He is interested in grounding linguistic concepts in visual modalities and in having models understand multimedia content the way humans process visual information. Jae Sung received his B.S. degree in Computer Science from the University of California, Berkeley.
朴宰成在 Yejin Choi 与 Ali Farhadi 教授指导下攻读博士学位,研究方向为多模态常识推理模型的开发。他致力于将语言概念锚定到视觉模态,使模型能像人类处理视觉信息那样理解多媒体内容。本科毕业于加州大学伯克利分校计算机科学系。
Bidipta Sarkar 比迪普塔·萨卡尔
Undergraduate Student at Stanford University
斯坦福大学本科生
Email: bidiptas@stanford.edu
邮箱:bidiptas@stanford.edu
Web: https://bsarkar321.github.io/
个人网站:https://bsarkar321.github.io/
Bidipta Sarkar is a senior undergraduate student at Stanford University and a member of Stanford’s ILIAD lab. His research focuses on creating AI agents that can interact with their environment and safely work alongside humans and other autonomous agents.
比迪普塔·萨卡尔是斯坦福大学大四学生,ILIAD 实验室成员。他的研究专注于开发能与环境交互、并安全地与人类及其他自主智能体协同工作的人工智能体。
Rohan Taori 罗翰·陶里
PhD Student in Computer Science at Stanford University
斯坦福大学计算机科学博士生
Email: rtaori@cs.stanford.edu
邮箱:rtaori@cs.stanford.edu
Web: https://www.rohantaori.com/
个人网站:https://www.rohantaori.com/
Rohan Taori is a PhD student at the Stanford AI Lab. His research focuses on studying the foundations of machine learning in the context of real-world systems. Most recently, Rohan has pushed forward the frontier of open-source large language models, finetuning them to be helpful, general-purpose assistants. Rohan is also very interested in augmenting language models with multi-modality capability to allow them to reason over images and videos. Rohan received his Bachelor’s degree in Computer Science at UC Berkeley.
罗翰·陶里是斯坦福人工智能实验室的博士生。他的研究专注于探索现实世界系统中机器学习的基础理论。近期,罗翰在开源大语言模型领域取得突破性进展,通过微调使模型成为实用的通用助手。他还致力于增强语言模型的多模态能力,使其具备图像与视频的推理能力。罗翰本科毕业于加州大学伯克利分校计算机科学专业。
Yusuke Noda 野田祐介
Principal Software Engineer. Microsoft Gaming, Redmond, WA.
首席软件工程师,微软游戏部门(华盛顿州雷德蒙德)
Email: yusuke.noda@microsoft.com
邮箱:yusuke.noda@microsoft.com
Web: https://www.linkedin.com/in/yusuke-noda-908797/
领英主页:https://www.linkedin.com/in/yusuke-noda-908797/
Yusuke Noda is a principal software engineer at Microsoft Gaming Platform. He has led the development of gaming infrastructure for Xbox One and Xbox Cloud Gaming and has over 16 years of experience developing efficient infrastructure for cloud and gaming technologies.
野田祐介现任微软游戏平台首席软件工程师,曾主导 Xbox One 和 Xbox 云游戏的基础架构开发,在云计算与游戏技术的高效基建领域拥有 16 年以上经验。
Demetri Terzopoulos 迪米特里·特佐普洛斯
UCLA Distinguished Professor, Academy Award-winning computer scientist and entrepreneur.
加州大学洛杉矶分校特聘教授,奥斯卡奖得主,计算机科学家与企业家。
Email: dt@cs.ucla.edu
电子邮箱:dt@cs.ucla.edu
Web: https://web.cs.ucla.edu/~dt/
个人网站:https://web.cs.ucla.edu/~dt/
Dr. Demetri Terzopoulos is a Distinguished Professor and Chancellor’s Professor of Computer Science at the University of California, Los Angeles, where he directs the UCLA Computer Graphics & Vision Laboratory. He is also Co-Founder and Chief Scientist of VoxelCloud, Inc., a multinational healthcare AI company. He is or was a Guggenheim Fellow, a Fellow of the Royal Society (FRS) of London and the Royal Society of Canada (FRSC), a Fellow of the Association for Computing Machinery (ACM), a Life Fellow of the Institute of Electrical and Electronics Engineers (IEEE), a Fellow of the Asia-Pacific Artificial Intelligence Association (AAIA), a Distinguished Fellow of the International Engineering and Technology Institute (IETI), a Member of the European Academy of Sciences (EAS) and the New York Academy of Sciences (NYAS), and a Life Member of Sigma Xi.
Demetri Terzopoulos 博士是加州大学洛杉矶分校计算机科学特聘教授兼校长讲席教授,同时担任 UCLA 计算机图形学与视觉实验室主任。他亦是跨国医疗人工智能公司 VoxelCloud 联合创始人兼首席科学家。现任/曾任:古根海姆基金会学者、伦敦皇家学会会士(FRS)、加拿大皇家学会会士(FRSC)、美国计算机协会会士(ACM)、电气电子工程师学会终身会士(IEEE)、亚太人工智能协会会士(AAIA)、国际工程技术研究院杰出会士(IETI)、欧洲科学院(EAS)及纽约科学院(NYAS)院士、Sigma Xi 学会终身会员。
Yejin Choi
Wissner-Slivka Chair and Brett Helsel Professor at the University of Washington, Senior Research Manager at the Allen Institute for Artificial Intelligence, and MacArthur Fellow.
华盛顿大学 Wisnner-Slivaka 讲席教授、Brett Helsel 教授,艾伦人工智能研究所高级研究经理,麦克阿瑟基金会"天才奖"得主。
Email: yejin@cs.washington.edu
电子邮箱:yejin@cs.washington.edu
Web: https://homes.cs.washington.edu/~yejin/
个人主页:https://homes.cs.washington.edu/~yejin/
Yejin is the Wissner-Slivka Chair and Brett Helsel Professor at the University of Washington and a Senior Research Manager at the Allen Institute for Artificial Intelligence. She won the Anita Borg Early Career Award in 2018 and received a MacArthur Foundation Fellowship in 2020. She has received outstanding paper awards at AAAI 2020, NeurIPS 2021, ICML 2022, and ACL 2023, and best paper awards at NAACL 2022 and ACL 2023. She is one of the main organizers of COLM 2024, an academic venue focused on the study of language modeling. Her main research interests are commonsense reasoning in Natural Language Processing, Machine Learning, and Artificial Intelligence, with broader interests in Computer Vision and Digital Humanities.
Yejin 现任华盛顿大学 Wisnner-Slivaka 讲席教授、Brett Helsel 教授,并担任艾伦人工智能研究所高级研究经理。她于 2018 年获得 Anita Borg 早期职业成就奖,2020 年获颁麦克阿瑟基金会"天才奖"。其研究成果曾获 AAAI 2020、NeurIPS 2021、ICML 2022 和 ACL 2023 杰出论文奖,以及 NAACL 2022 与 ACL 2023 最佳论文奖。作为 COLM 2024(语言建模研究学术会议)的主要组织者之一,她的核心研究方向涵盖自然语言处理、机器学习和人工智能领域的常识推理,并延伸至计算机视觉与数字人文领域。
Fei-Fei Li 李飞飞
Professor of Computer Science at Stanford University
斯坦福大学计算机科学系教授
Email: feifeili@stanford.edu
电子邮箱:feifeili@stanford.edu
Web: https://profiles.stanford.edu/fei-fei-li/
个人主页:https://profiles.stanford.edu/fei-fei-li/
Fei-Fei is the inaugural Sequoia Professor in the Computer Science Department at Stanford University and Co-Director of Stanford’s Human-Centered AI Institute. She served as the Director of Stanford’s AI Lab from 2013 to 2018. During her sabbatical from Stanford from January 2017 to September 2018, Dr. Li was a Vice President at Google and served as Chief Scientist of AI/ML at Google Cloud. Since then, she has served as a Board member or advisor of various public and private companies.
李飞飞是斯坦福大学计算机科学系首位红杉冠名教授,兼任斯坦福以人为本人工智能研究院联席主任。2013 至 2018 年担任斯坦福人工智能实验室主任,2017 年 1 月至 2018 年 9 月学术休假期间出任谷歌副总裁兼谷歌云 AI/ML 首席科学家。此后在多家上市公司和私营企业担任董事或顾问职务。
Katsushi Ikeuchi 池内克史
Senior Principal Research Manager at Microsoft Research, Redmond, WA, and an IEEE Life Fellow.
微软首席研究经理,IEEE 终身会士。工作于华盛顿州雷德蒙德微软研究院。
Email: katsuike@microsoft.com
电子邮箱:katsuike@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/katsuike/
个人主页:https://www.microsoft.com/en-us/research/people/katsuike/
Dr. Katsushi Ikeuchi received the BE degree in Mechanical Engineering from Kyoto University in 1973 and the PhD degree in Information Engineering from the University of Tokyo in 1978. After working at the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology for three years, the Electrotechnical Laboratory of Japan’s Ministry of International Trade and Industry for five years, the Robotics Institute at Carnegie Mellon University for 10 years, and the Institute of Industrial Science at the University of Tokyo for 19 years, he joined Microsoft as a Principal Researcher in 2015. During his tenure at CMU and UTokyo, he supervised more than 50 PhD students.
池内克史博士 1973 年获京都大学机械工程学士学位,1978 年获东京大学信息工程博士学位。曾先后任职于麻省理工学院人工智能实验室(3 年)、日本通商产业省电子技术综合研究所(5 年)、卡内基梅隆大学机器人研究所(10 年)及东京大学生产技术研究所(19 年),2015 年加入微软担任首席研究员。在 CMU 和东京大学任职期间指导过 50 余名博士生。
His research interests span computer vision, robotics, and computer graphics. In these fields, he has received several best paper awards, including the David Marr Prize in computational vision and the IEEE Robotics and Automation Society K. S. Fu Memorial Best Transactions Paper Award.
其研究领域横跨计算机视觉、机器人学与计算机图形学,曾获计算视觉领域 David Marr 奖、IEEE 机器人与自动化协会傅京孙纪念最佳期刊论文奖等多项荣誉。
His community service includes: General Chair of IROS95, ITSC99, IV01, ICCV05, ACCV07, and ICCV17; Program Chair of CVPR96, ICCV03, ICRA09, ICPR12, and ICCV15; Editor-in-Chief of IJCV (2000-2017) and IJ ITS (2012-2014); Associate Editor of IEEE Trans. RA and IEEE Trans. PAMI; and Distinguished Lecturer of the IEEE Signal Processing Society (2000-2002), the IEEE Robotics and Automation Society (2004-2006), and the IEEE Computer Society (2008-2010).
学术服务包括:担任 IROS95、ITSC99、IV01、ICCV05、ACCV07、ICCV17 大会主席;CVPR96、ICCV03、ICRA09、ICPR12、ICCV15 程序主席;IJCV 主编(2000-2017)、IJ ITS 主编(2012-2014);IEEE Trans. RA 与 IEEE Trans. PAMI 副编辑;以及 IEEE 信号处理学会(2000-2002)、IEEE 机器人与自动化协会(2004-2006)和 IEEE 计算机学会(2008-2010)杰出讲师。
Through this research and community service, he has been elected a (Life) Fellow of IEEE, IEICE, IPSJ, and RSJ. He received the Distinguished Researcher Award from IEEE-PAMI, the Medal of Honor with Purple Ribbon from the Japanese Emperor, and the Okawa Prize from the Okawa Foundation, as well as several research achievement awards from Japanese professional societies.
凭借学术贡献与社会服务,获选 IEEE、IEICE、IPSJ 和 RSJ 终身会士,荣获 IEEE-PAMI 杰出研究者奖、日本天皇紫绶褒章、大川基金会大川奖及多个日本专业协会研究成就奖。
Hoi Vo
Technical Fellow. Microsoft Gaming and Xbox Emerging Technologies, Redmond, WA.
微软技术院士。工作于华盛顿州雷德蒙德微软游戏与 Xbox 新兴技术部门。
Email: hoiv@microsoft.com
电子邮箱:hoiv@microsoft.com
Web: https://www.linkedin.com/in/hoi-vo-193420/
网址:https://www.linkedin.com/in/hoi-vo-193420/
Hoi Vo is a Technical Fellow in the Microsoft Gaming division focusing on bridging AGI with new gaming IP. He has played a pivotal role in establishing the Xbox Cloud Gaming service and has led cloud and gaming efforts across teams at Microsoft, including Microsoft’s Windows Azure platform. His focus is on optimizing AI models to run efficiently at the edge while leveraging the cloud for scenarios that exceed the hardware’s capabilities, including fine-tuning models to be more engaging across various playing experiences.
Hoi Vo 是微软游戏部门技术院士,专注于将通用人工智能(AGI)与新型游戏 IP 相结合。他在建立 Xbox 云游戏服务中发挥了关键作用,并领导微软跨团队的云与游戏技术研发,涵盖微软 Windows Azure 平台。其研究重点是通过云端协同优化 AI 模型运行效率——当场景需求超出硬件性能时利用云端算力,包括微调模型以适配多样化游戏体验。
Jianfeng Gao 高剑峰
Distinguished Scientist and Vice President at Microsoft Research, Redmond, WA, and an IEEE Fellow.
微软杰出科学家兼副总裁,IEEE 会士。任职于微软研究院(华盛顿州雷德蒙市)。
Email: jfgao@microsoft.com
邮箱:jfgao@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/jfgao/
网址:https://www.microsoft.com/en-us/research/people/jfgao/
Dr. Jianfeng Gao is a Distinguished Scientist and Vice President at Microsoft Research, an IEEE Fellow, an ACM Distinguished Member, and the current head of the Deep Learning Group at Microsoft Research. From 2014 to 2017, he was Partner Research Manager in Business AI at Microsoft AI Research and at the Deep Learning Technology Center (DLTC) at Microsoft Research, Redmond, where he led the development of AI solutions for Predictive Sales and Marketing. He also works on deep learning for text and image processing (see the ACL/SIGIR 2018 Tutorial, the Deep Learning 2017 Tutorial, and the IJCAI 2016 Tutorial, or the MS internal site) and has led the development of AI systems for dialogue, machine reading comprehension (MRC), and question answering (QA). Since 2022, he has led research on self-improving AI, in which LLMs (e.g., ChatGPT/GPT-4) are augmented and adapted for the development of commercial AI systems.
高剑峰博士是微软研究院杰出科学家兼副总裁,IEEE 会士、ACM 杰出会员,现任微软研究院深度学习组负责人。2014 至 2017 年间,他作为合伙研究经理先后任职于微软 AI 研究院商业人工智能部门及微软雷德蒙研究院深度学习技术中心,主导开发了销售与营销预测 AI 解决方案。他深耕文本与图像处理的深度学习技术(参见 ACL/SIGIR 2018 教程、2017 深度学习教程及 IJCAI 2016 教程或微软内网),并领导开发对话系统、机器阅读理解(MRC)及问答(QA)等 AI 系统。自 2022 年起,他主导自进化 AI 研究,通过增强适配 LLMs(如 ChatGPT/GPT4)来开发商业 AI 系统。
Acknowledgements 致谢
We are especially grateful to Peter Lee, Doug Burger, Desney Tan, Johannes Gehrke, Ryen White, Ece Kamar, Subhojit Som, and Kareem Choudhry for their advice, enormous support, and encouragement. We thank Haiyan Zhang, Spencer Perreault, Dave Bignell, Katja Hofmann, Sam Devlin, Shanzheng Tan, Raluca Georgescu, Bill Dolan, Nebojsa Jojic, Sudha Rao, Adrian Brown, Andrzej Banburski-Fahey, and Jianwei Yang for early insightful discussions and their help with the gaming work. We appreciate Kiran Muthabatulla, Antonio Criminisi, Tom Cashman, Nguyen Bach, Jennifer Marsman, and Jaron Lanier from the Mesh team, Microsoft OCTO, and the Microsoft Office team for their mixed-reality and dataset work, and for their generous help and feedback on the project. Special thanks to Paul Bennett, Corby Rosset, Michel Galley, Chenglong Wang, Baolin Peng, Hao Chen, Silviu Cucerzan, Ahmed Awadallah, and Saleema Amershi for their suggestions and comments on the NLP part. The authors gratefully acknowledge Paul Smolensky, Yonatan Bisk, Kezhen Chen, Borui Wang, Liangke Gui, Dingmin Wang, Xin (Eric) Wang, Zhe Gan, Xiaojian Ma, Zilong Zheng, Song-chun Zhu, Dragomir R. Radev, Daniel McDuff, and Harry Shum for their related previous work, comments, suggestions, painstaking reviews of this paper, and pointers to the literature. Finally, we would like to thank the Microsoft HoloLens team, the Microsoft Xbox team, and the Meta Quest team for their generous provision of equipment; the MSR Central Engineering (CE) team and the Microsoft 343 team for data collection and sharing; and the Microsoft AOAI and GCR teams for their Azure OpenAI endpoint support.
我们特别感谢 Peter Lee、Doug Burger、Desney Tan、Johannes Gehrke、Ryen White、Ece Kamar、Subhojit Som 和 Kareem Choudhry 的建议、大力支持与鼓励。感谢 Haiyan Zhang、Spencer Perreault、Dave Bignell、Katja Hofmann、Sam Devlin、Shanzheng Tan、Raluca Georgescu、Bill Dolan、Nebojsa Jojic、Sudha Rao、Adrian Brown、Andrzej Banburski-Fahey、Jianwei Yang 在早期富有洞见的讨论以及对游戏领域的帮助。感谢 Kiran Muthabatulla、Antonio Criminisi、Tom Cashman、Nguyen Bach、Jennifer Marsman、Jaron Lanier 来自 Mesh 团队、微软 OCTO 及微软办公团队的混合现实工作、数据集构建以及对本项目的慷慨帮助与反馈。特别鸣谢 Paul Bennett、Corby Rosset、Michel Galley、Chenglong Wang、Baolin Peng、Hao Chen、Silviu Cucerzan、Ahmed Awadallah、Saleema Amershi 对自然语言处理部分的建议与指正。作者衷心感谢 Paul Smolensky、Yonatan Bisk、Kezhen Chen、Borui Wang、Liangke Gui、Dingmin Wang、Xin (Eric) Wang、Zhe Gan、Xiaojian Ma、Zilong Zheng、Song-chun Zhu、Dragomir R. Radev、Daniel McDuff、Harry Shum 的相关前期工作、评论建议、对本文的细致审阅以及文献指引。最后,我们要特别感谢微软 Hololens 团队、微软 Xbox 团队和 Meta Quest 团队慷慨提供的设备支持;MSR Central Engineering (CE)团队和微软 343 团队的数据采集与共享;微软 AOAI 和 GCR 团队提供的 Azure-OpenAI 终端支持。
We would like to thank our colleagues from Stanford’s Partnership in AI-assisted Care, who helped inform the medical applications explored in this work. In particular, we would like to thank Ehsan Adeli, Paul Tang, Amit Kaushal, Roger Bohn, Kevin Schulman, and Arnold Milstein for their clinical expertise and guidance.
感谢斯坦福大学 AI 辅助医疗合作伙伴项目的同事们为本研究医疗应用方向提供的专业支持。特别感谢 Ehsan Adeli、Paul Tang、Amit Kaushal、Roger Bohn、Kevin Schulman 和 Arnold Milstein 的临床专业知识指导。
This research was supported by the Microsoft Research project Fair 2023, Microsoft HackBox 2023, and the OCTO team.
本研究由微软研究院 Fair 2023 项目、Microsoft HackBox 2023 及 OCTO 团队提供支持。