
LISA: Reasoning Segmentation via Large Language Model

Xin Lai¹*  Zhuotao Tian²*†  Yukang Chen¹  Yanwei Li¹  Yuhui Yuan⁴  Shu Liu³  Jiaya Jia¹,³
¹CUHK  ²HIT (Shenzhen)  ³SmartMore  ⁴MSRA

Abstract

Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task - reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at github.com/dvlab-research/LISA.

1. Introduction

In daily life, users tend to issue direct commands like “Change the TV channel” to instruct a robot, rather than providing explicit step-by-step instructions such as “Go to the table first, find the TV remote, and then press the button to change the channel.” However, existing perception systems consistently rely on humans to explicitly indicate target objects or pre-define categories before executing visual recognition tasks. These systems cannot actively reason and comprehend user intention based on implicit instruction. This reasoning ability is crucial in developing next-generation intelligent perception systems and holds substantial potential for industrial applications, particularly in robotics.
In this work, we introduce a new segmentation task reasoning segmentation, which requires generating a binary segmentation mask based on an implicit query text involving complex reasoning. Notably, the query text is not limited to a straightforward reference (e.g., “the orange”), but a more complicated description involving complex reasoning or world knowledge (e.g., “the food with high Vitamin C”). To accomplish this task, the model must possess two key abilities: 1) reasoning complex and implicit text queries jointly with the image; 2) producing segmentation masks.
在这项工作中,我们引入了一种新的分割任务——推理分割,该任务要求基于涉及复杂推理的隐式查询文本生成二进制分割掩码。值得注意的是,查询文本并不限于简单的引用(例如,“橙子”),而是涉及复杂推理或世界知识的更复杂描述(例如,“富含维生素 C 的食物”)。为了完成这一任务,模型必须具备两个关键能力:1)与图像共同推理复杂和隐式文本查询;2)生成分割掩码。
Inspired by the exceptional capacity of LLMs to reason and comprehend user intentions, we aim to leverage this capability of LLMs to address the aforementioned first challenge. However, while several studies [1, 23, 24, 28, 29, 55, 63] have integrated robust reasoning capabilities into multimodal LLMs to accommodate visual input, the majority of these models primarily concentrate on text generation tasks and still fall short in performing vision tasks that require fine-grained output formats, such as segmentation masks. This leads us to ask: can we enable multimodal LLMs with the capability to output segmentation masks?
To this end, we introduce LISA: a large Language Instructed Segmentation Assistant, a multimodal LLM capable of producing segmentation masks. Specifically, we incorporate an additional token, i.e., <SEG>, into the existing vocabulary. Upon generating the <SEG> token, its hidden embedding is further decoded into the corresponding segmentation mask. By representing the segmentation mask as an embedding, LISA acquires segmentation capabilities and benefits from end-to-end training. Remarkably, LISA demonstrates robust zero-shot abilities. Training the model solely on standard semantic segmentation and referring segmentation datasets yields surprisingly effective performance on the reasoning segmentation task. Furthermore, we find that LISA’s performance can be significantly enhanced by fine-tuning on just 239 reasoning segmentation data samples. As illustrated in Fig. 1, LISA can handle various scenarios involving complex reasoning and world knowledge.

Figure 1. We unlock new segmentation capabilities for existing multimodal LLMs. Our model (i.e., LISA) can deal with cases involving complex reasoning and world knowledge. Also, we demonstrate the cases of explanatory answers in the 3rd row. Additionally, in the 4th row, our model can output multiple segmentation masks in a single answer. More illustrations can be found in the supplementary material.

In addition, to validate the effectiveness, we establish a benchmark for reasoning segmentation evaluation, called ReasonSeg. Comprising over one thousand image-instruction pairs, this benchmark offers persuasive evaluation metrics for the task. To align more closely with practical applications, we annotate the images from OpenImages [21] and ScanNetv2 [10] with implicit text queries that involve complex reasoning.
In summary, our contributions are as follows:
  • We introduce the reasoning segmentation task, which necessitates reasoning based on implicit human instructions. Such reasoning capability is crucial for building a genuinely intelligent perception system.
  • We present our model - LISA, which incorporates new segmentation capabilities. It demonstrates robust zero-shot ability on the reasoning segmentation task when trained solely on reasoning-free datasets, and achieves further performance boost by fine-tuning on just 239 data samples that involve reasoning.
  • We establish a reasoning segmentation benchmark, ReasonSeg, containing over one thousand image-instruction-mask data samples. This benchmark is essential for evaluation and encourages the community to further explore the reasoning ability for vision tasks.

2. Related Work

2.1. Image Segmentation

Semantic segmentation aims to assign a class label to every pixel in an image. Numerous studies [2, 5, 8, 12, 16, 22, 31, 37, 42, 43, 45, 46, 51, 56, 59-61, 64] have proposed diverse designs (such as encoder-decoder, dilated convolution, pyramid pooling module, non-local operator, and more) to effectively encode semantic information. Research on instance segmentation [9, 14, 58] and panoptic segmentation [7, 18, 25, 50] has introduced various architectural innovations for instance-level segmentation, including DETR [4]-based structures, mask attention, and dynamic convolution. In recent years, typical segmentation tasks have made significant progress and become increasingly mature. Consequently, it is imperative to develop more intelligent interaction ways for image segmentation.
The referring segmentation task [17, 36] enables interaction with human language, aiming to segment the target object based on a given explicit text description. Recently, Kirillov et al. [19] introduced SAM, trained with billions of high-quality masks, supporting bounding boxes and points as prompts while demonstrating exceptional segmentation quality. X-Decoder [65] bridges vision and language, unifying multiple tasks within a single model. SEEM [66] further supports various human interaction methods, including text, audio, and scribble. However, these studies primarily focus on addressing multi-task compatibility and unification, neglecting the injection of new capabilities. In this work, we present LISA, which possesses reasoning ability that has not yet been explored in existing segmentors.

Figure 2. Examples of the annotated image-instruction-mask data samples. Left: short phrase query. Right: long sentence query. More examples are given in the supplementary material.

2.2. Multimodal Large Language Model

Motivated by the remarkable reasoning abilities of LLMs, researchers are exploring ways to transfer these capabilities into the vision domain, developing multimodal LLMs. Flamingo [1] employs a cross-attention structure to attend to visual contexts, enabling visual in-context learning. Models such as BLIP-2 [24] and mPLUG-OWL [55] propose encoding image features with a visual encoder, which are then fed into the LLM alongside text embeddings. Otter [23] further incorporates robust few-shot capabilities through in-context instruction tuning on the proposed MIMIC-IT dataset. LLaVA [29] and MiniGPT-4 [63] first conduct image-text feature alignment followed by instruction tuning. Koh et al. [20] also investigates image retrieval for LLMs. Moreover, numerous works [32, 44, 49, 52, 54] utilize prompt engineering, connecting independent modules via API calls, but without the benefits of end-to-end training. Recently, there have been studies examining the intersection between multimodal LLMs and vision tasks. VisionLLM [47] offers a flexible interaction interface for multiple vision-centric tasks through instruction tuning but fails to fully exploit LLMs for complex reasoning. Kosmos-2 [38] constructs large-scale data of grounded image-text pairs, infusing grounding capabilities into LLMs. DetGPT [39] bridges the fixed multimodal LLM and open-vocabulary detector, enabling detection to be performed based on user instruction. GPT4RoI [57] introduces spatial boxes as input and trains the model on region-text pairs. In contrast, our work aims to efficiently inject segmentation capabilities into multimodal LLMs in the manner of end-to-end training.

3. Reasoning Segmentation

3.1. Problem Definition

The reasoning segmentation task is to output a binary segmentation mask $\mathbf{M}$, given an input image $\mathbf{x}_{img}$ and an implicit query text instruction $\mathbf{x}_{txt}$. The task shares a similar formulation with the referring segmentation task [17], but is far more challenging. The key distinction lies in the complexity of the query text in reasoning segmentation. Instead of a straightforward phrase (e.g., “the trash can”), the query text includes more intricate expressions (e.g., “something that the garbage should be put into”) or longer sentences (e.g., “After cooking, consuming food, and preparing for food, where can we throw away the rest of the food and scraps?”) that involve complex reasoning or world knowledge.

3.2. Benchmark

Given the lack of quantitative evaluation, it is imperative to establish a benchmark for the reasoning segmentation task. To ensure reliable assessment, we have collected a diverse set of images from OpenImages [21] and ScanNetv2 [10], annotating them with implicit text instructions and high-quality target masks. To cover different scenarios, our text instructions consist of two types: 1) short phrases; 2) long sentences; as illustrated in Figure 2. The resulting ReasonSeg benchmark comprises a total of 1218 image-instruction-mask data samples. This dataset is further partitioned into three splits: train, val, and test, containing 239, 200, and 779 data samples, respectively. As the primary purpose of the benchmark is evaluation, the validation and testing sets include a larger number of data samples. The details of data annotation are given in the supplementary material.

4. Our Method

In this section, we first introduce the model architecture in Sec. 4.1. After that, we elaborate on the training data preparation and training parameters in Sec. 4.2.

4.1. Architecture

Embedding as Mask. Most current multimodal LLMs (such as LLaVA [29], Flamingo [1], BLIP-2 [24], Otter [23], etc.) support image and text as input, but they can only output text and cannot directly output fine-grained segmentation masks. VisionLLM [47] offers a solution by parsing segmentation masks as sequences of polygons, enabling the representation of segmentation masks as plain text and allowing end-to-end training within the framework of existing multimodal LLMs. However, end-to-end training with the polygon sequences introduces optimization challenges and may compromise generalization ability unless a massive amount of data and computational resources are employed. For instance, training a 7B model, VisionLLM requires 4 × 8 NVIDIA 80G A100 GPUs and 50 epochs, which is computationally prohibitive. In contrast, it takes less than 3 days to train LISA-7B on 8 NVIDIA 24G 3090 GPUs.

Figure 3. The pipeline of LISA. Given the input image and text query, the multimodal LLM (e.g., LLaVA [29]) generates text output. The last-layer embedding for the <SEG> token is then decoded into the segmentation mask via the decoder. We use LoRA [15] for efficient fine-tuning. The choice of vision backbone can be flexible (e.g., SAM [66], Mask2Former [9]).
To this end, we propose the embedding-as-mask paradigm to infuse new segmentation capabilities into the multimodal LLM. The pipeline of our method is illustrated in Fig. 3. Specifically, we first expand the original LLM vocabulary with a new token, i.e., <SEG>, which signifies the request for the segmentation output. Given a text instruction $\mathbf{x}_{txt}$ along with the input image $\mathbf{x}_{img}$, we feed them into the multimodal LLM $\mathcal{F}$, which in turn outputs a text response $\hat{\mathbf{y}}_{txt}$. It can be formulated as

$$\hat{\mathbf{y}}_{txt} = \mathcal{F}(\mathbf{x}_{img}, \mathbf{x}_{txt}).$$
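As a point of reference, the vocabulary expansion itself is a small operation; the snippet below is a hedged sketch assuming a HuggingFace-style tokenizer and causal LM, with the checkpoint path used here as a placeholder rather than the released setup.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a base LLM and its tokenizer (the path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("path/to/base-llm")
model = AutoModelForCausalLM.from_pretrained("path/to/base-llm")

# Add the new <SEG> token and grow the embedding matrix accordingly.
tokenizer.add_tokens(["<SEG>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# The id of <SEG> is later used to locate its last-layer hidden state.
seg_token_id = tokenizer.convert_tokens_to_ids("<SEG>")
```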
When the LLM intends to generate a binary segmentation mask, the output $\hat{\mathbf{y}}_{txt}$ would include a <SEG> token. We then extract the LLM last-layer embedding $\tilde{\mathbf{h}}_{seg}$ corresponding to the <SEG> token and apply an MLP projection layer $\gamma$ to obtain $\mathbf{h}_{seg}$. Simultaneously, the vision backbone $\mathcal{F}_{enc}$ extracts the dense visual features $\mathbf{f}$ from the visual input $\mathbf{x}_{img}$. Finally, $\mathbf{h}_{seg}$ and $\mathbf{f}$ are fed to the decoder $\mathcal{F}_{dec}$ to produce the final segmentation mask $\hat{\mathbf{M}}$. The detailed structure of the decoder $\mathcal{F}_{dec}$ follows [19]. The process can be formulated as

$$\mathbf{h}_{seg} = \gamma(\tilde{\mathbf{h}}_{seg}), \quad \mathbf{f} = \mathcal{F}_{enc}(\mathbf{x}_{img}),$$
$$\hat{\mathbf{M}} = \mathcal{F}_{dec}(\mathbf{h}_{seg}, \mathbf{f}).$$
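The decoding path described by the two equations above can be sketched in PyTorch as follows; the wrapped modules (multimodal LLM, vision encoder, mask decoder), their call signatures, and the layer dimensions are illustrative assumptions rather than the released implementation.

```python
import torch.nn as nn

class EmbeddingAsMask(nn.Module):
    """Minimal sketch of the embedding-as-mask pipeline (module names are illustrative)."""

    def __init__(self, multimodal_llm, vision_encoder, mask_decoder, seg_token_id,
                 llm_dim=4096, dec_dim=256):
        super().__init__()
        self.llm = multimodal_llm          # F: multimodal LLM (HuggingFace-style interface assumed)
        self.enc = vision_encoder          # F_enc: frozen vision backbone (e.g., a SAM image encoder)
        self.dec = mask_decoder            # F_dec: mask decoder conditioned on a prompt embedding
        self.seg_token_id = seg_token_id   # id of the newly added <SEG> token
        # gamma: MLP projection from the LLM hidden size to the decoder prompt size (dims illustrative)
        self.gamma = nn.Sequential(nn.Linear(llm_dim, llm_dim), nn.ReLU(), nn.Linear(llm_dim, dec_dim))

    def forward(self, pixel_values, input_ids, attention_mask):
        # y_txt = F(x_img, x_txt): run the LLM and keep its last-layer hidden states
        out = self.llm(pixel_values=pixel_values, input_ids=input_ids,
                       attention_mask=attention_mask, output_hidden_states=True)
        hidden = out.hidden_states[-1]                 # (B, T, llm_dim)

        # h~_seg: hidden states at the positions of the <SEG> token
        seg_pos = (input_ids == self.seg_token_id)     # (B, T) boolean mask
        h_seg_tilde = hidden[seg_pos]                  # (num_seg_tokens, llm_dim)

        # h_seg = gamma(h~_seg), f = F_enc(x_img), M_hat = F_dec(h_seg, f)
        h_seg = self.gamma(h_seg_tilde)
        f = self.enc(pixel_values)                     # dense visual features
        mask_logits = self.dec(h_seg, f)               # predicted segmentation mask logits
        return out.logits, mask_logits
```

The key point is that the hidden state of the <SEG> token, not the generated text, is what conditions the mask decoder.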
Training Objectives. The model is trained end-to-end using the text generation loss $\mathcal{L}_{txt}$ and the segmentation mask loss $\mathcal{L}_{mask}$. The overall objective $\mathcal{L}$ is the weighted sum of these losses, determined by $\lambda_{txt}$ and $\lambda_{mask}$:

$$\mathcal{L} = \lambda_{txt} \mathcal{L}_{txt} + \lambda_{mask} \mathcal{L}_{mask}.$$

Specifically, $\mathcal{L}_{txt}$ is the auto-regressive cross-entropy loss for text generation, and $\mathcal{L}_{mask}$ is the mask loss, which encourages the model to produce high-quality segmentation results. To compute $\mathcal{L}_{mask}$, we employ a combination of per-pixel binary cross-entropy (BCE) loss and DICE loss, with corresponding loss weights $\lambda_{bce}$ and $\lambda_{dice}$. Given the ground-truth targets $\mathbf{y}_{txt}$ and $\mathbf{M}$, these losses can be formulated as

$$\mathcal{L}_{txt} = \mathbf{CE}(\hat{\mathbf{y}}_{txt}, \mathbf{y}_{txt}),$$
$$\mathcal{L}_{mask} = \lambda_{bce}\,\mathbf{BCE}(\hat{\mathbf{M}}, \mathbf{M}) + \lambda_{dice}\,\mathbf{DICE}(\hat{\mathbf{M}}, \mathbf{M}).$$
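For reference, a minimal sketch of assembling this objective is given below, assuming text logits/labels from the LLM and mask logits/ground-truth masks from the decoder; the default weights follow the values reported in Sec. 5.1, label shifting and padding handling are omitted for brevity, and the helper names are illustrative.

```python
import torch.nn.functional as F

def dice_loss(mask_logits, gt_mask, eps=1e-6):
    """Soft DICE loss computed on per-pixel probabilities."""
    prob = mask_logits.sigmoid().flatten(1)
    gt = gt_mask.flatten(1)
    inter = (prob * gt).sum(-1)
    union = prob.sum(-1) + gt.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def lisa_loss(text_logits, text_labels, mask_logits, gt_mask,
              lambda_txt=1.0, lambda_mask=1.0, lambda_bce=2.0, lambda_dice=0.5):
    # L_txt: auto-regressive cross-entropy over the vocabulary (ignore masked positions)
    l_txt = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100)
    # L_mask: weighted per-pixel BCE + DICE between predicted and ground-truth masks
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    l_mask = lambda_bce * l_bce + lambda_dice * dice_loss(mask_logits, gt_mask)
    # L = lambda_txt * L_txt + lambda_mask * L_mask
    return lambda_txt * l_txt + lambda_mask * l_mask
```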
It is noteworthy that the proposed method endows existing multimodal LLMs with new segmentation capabilities, such that they can generate not only text but also fine-grained output formats. Also, our method is based on an end-to-end training pipeline and connects the LLM and vision modules with hidden embedding representation, which proves significantly more effective than the decoupled two-stage method as discussed in Sec. 5.2.

4.2. Training

Training Data Formulation. As illustrated in Fig. 4, our training data comprises mainly three parts, all of which are derived from widely-used public datasets. The details are as follows:
  • Semantic Segmentation Dataset. Semantic segmentation datasets typically consist of images and the corresponding multi-class labels. During training, we randomly choose several categories for each image. To generate data that matches the format of visual question answering, we employ a question-answer template like "USER: <IMAGE> Can you segment the {class_name} in this image? ASSISTANT: It is <SEG>.", where {class_name} is the chosen category, and <IMAGE> denotes the placeholder for tokens of image patches (see the sketch after this list). The corresponding binary segmentation mask is used as the ground truth to provide mask loss supervision. During training, we also use other templates to generate the QA data to ensure data diversity, as shown in the supplementary material. We adopt ADE20K, COCO-Stuff, and LVIS-PACO part segmentation datasets.
  • Vanilla Referring Segmentation Dataset. Referring segmentation datasets provide an input image and an explicit short description of the target object. Thus, it is easy to convert them into question-answer pairs using a template like “USER: <IMAGE> Can you segment {description} in this image? ASSISTANT: Sure, it is <SEG>.”, where {description} is the given explicit description. For this part, we adopt refCOCO, refCOCO+, refCOCOg, and refCLEF datasets.
  • Visual Question Answering Dataset. To preserve the original Visual Question Answering (VQA) ability of the multimodal LLM, we also include the VQA dataset during training. We use LLaVA-Instruct-150k [29] for LLaVA v1 and LLaVA-v1.5-mix665k for LLaVA v1.5 [28].

Figure 4. The illustration of training data formulation from different types of data, including semantic segmentation data, referring segmentation data, and visual question answering (VQA) data.
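To make the formulation above concrete, the following sketch shows one plausible way to wrap a target phrase and its binary mask into such a conversation; the template strings mirror the examples in this section, while the function name, the template list, and the returned dictionary layout are illustrative assumptions rather than the released preprocessing code.

```python
import random

# A few question templates in the spirit of the examples above (not the exhaustive set).
SEG_QUESTIONS = [
    "Can you segment the {target} in this image?",
    "Please segment the {target} in this image.",
]
SEG_ANSWERS = ["It is <SEG>.", "Sure, it is <SEG>."]

def build_seg_sample(target_phrase, binary_mask):
    """Wrap a class name or referring phrase plus its binary mask into a QA training pair."""
    question = random.choice(SEG_QUESTIONS).format(target=target_phrase)
    conversation = (
        f"USER: <IMAGE> {question} "
        f"ASSISTANT: {random.choice(SEG_ANSWERS)}"
    )
    # The binary mask supervises L_mask at the position of the <SEG> token.
    return {"text": conversation, "mask": binary_mask}
```

A semantic segmentation sample would pass a chosen class name as the target phrase, whereas a referring segmentation sample would pass its explicit description.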

Notably, the above datasets do not include any reasoning segmentation data sample. Instead, they only contain samples where the target objects are explicitly indicated in the query texts. Surprisingly, even without complex reasoning training data, LISA demonstrates impressive zero-shot ability on the ReasonSeg benchmark, as shown in Table 1. Moreover, we find that further performance boost could be yielded by fine-tuning the model on only 239 data samples that involve complex reasoning.
Trainable Parameters. To preserve the learned knowledge of the pre-trained multimodal LLM $\mathcal{F}$ (i.e., LLaVA in our experiments), we leverage LoRA [15] to perform efficient fine-tuning, and completely freeze the vision backbone $\mathcal{F}_{enc}$. The decoder $\mathcal{F}_{dec}$ is fully fine-tuned. Additionally, the LLM token embeddings (embed_tokens), the LLM head (lm_head), and the projection layer $\gamma$ are also trainable.
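A minimal sketch of this parameter grouping is shown below; the attribute names (vision_backbone, mask_decoder, llm, gamma) and the name-based LoRA filter are assumptions for illustration, and in practice a library such as peft would inject the LoRA adapters themselves.

```python
def configure_trainable_params(model):
    """Freeze/unfreeze parameter groups as described above (attribute names are illustrative)."""
    # Vision backbone F_enc: completely frozen.
    for p in model.vision_backbone.parameters():
        p.requires_grad = False

    # Decoder F_dec: fully fine-tuned.
    for p in model.mask_decoder.parameters():
        p.requires_grad = True

    # LLM: only LoRA adapter weights are trained, plus the token embeddings and the LM head.
    for name, p in model.llm.named_parameters():
        p.requires_grad = ("lora_" in name) or ("embed_tokens" in name) or ("lm_head" in name)

    # Projection layer gamma: trainable.
    for p in model.gamma.parameters():
        p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable / 1e6:.1f}M")
```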
It is notable that the resulting model avoids the catastrophic forgetting of the original text generation capability and preserves the conversation ability, as verified in the supplementary material. The potential reasons are: we 1) employ LoRA fine-tuning to reduce the trainable parameters and 2) incorporate the VQA dataset during fine-tuning.
Table 1. Reasoning segmentation results among LISA (ours) and previous related works. ‘ft’ denotes using 239 reasoning segmentation data samples to fine-tune the model. Unless otherwise specified, we use LLaVA v1 [29] as the base model. LLaVA1.5 denotes LLaVA v1.5 [28].
| Method | val (gIoU / cIoU) | test short query (gIoU / cIoU) | test long query (gIoU / cIoU) | test overall (gIoU / cIoU) |
| :--- | :---: | :---: | :---: | :---: |
| OVSeg [26] | 28.5 / 18.6 | 18.0 / 15.5 | 28.7 / 22.5 | 26.1 / 20.8 |
| GRES [27] | 22.4 / 19.9 | 17.6 / 15.0 | 22.6 / 23.8 | 21.3 / 22.0 |
| X-Decoder [65] | 22.6 / 17.9 | 20.4 / 11.6 | 22.2 / 17.5 | 21.7 / 16.3 |
| SEEM [66] | 25.5 / 21.2 | 20.1 / 11.5 | 25.6 / 20.8 | 24.3 / 18.7 |
| Grounded-SAM [30] | 26.0 / 14.5 | 17.8 / 10.8 | 22.4 / 18.6 | 21.3 / 16.4 |
| LISA-7B | 44.4 / 46.0 | 37.6 / 34.4 | 36.6 / 34.7 | 36.8 / 34.1 |
| LISA-7B (ft) | 52.9 / 54.0 | 40.6 / 40.6 | 49.4 / 51.0 | 47.3 / 48.4 |
| LISA-13B | 48.9 / 46.9 | 39.9 / 43.3 | 46.4 / 46.5 | 44.8 / 45.8 |
| LISA-13B (ft) | 56.2 / 62.9 | 44.3 / 42.0 | 54.0 / 54.3 | 51.7 / 51.1 |
| LLaVA1.5-7B + OVSeg | 38.2 / 23.5 | 24.2 / 18.7 | 44.6 / 37.1 | 39.7 / 31.8 |
| LISA-7B-LLaVA1.5 | 53.6 / 52.3 | 47.1 / 48.5 | 49.2 / 48.9 | 48.7 / 48.8 |
| LISA-7B-LLaVA1.5 (ft) | 61.3 / 62.9 | 48.3 / 46.3 | 57.9 / 59.7 | 55.6 / 56.9 |
| LLaVA1.5-13B + OVSeg | 37.9 / 26.4 | 27.1 / 19.4 | 46.1 / 40.6 | 41.5 / 34.1 |
| LISA-13B-LLaVA1.5 | 57.7 / 60.3 | 50.8 / 50.0 | 54.7 / 50.9 | 53.8 / 50.8 |
| LISA-13B-LLaVA1.5 (ft) | 65.0 / 72.9 | 55.4 / 50.6 | 63.2 / 65.3 | 61.3 / 62.2 |

5. Experiment

5.1. Experimental Setting

Network Architecture. Unless otherwise specified, we use LLaVA-7B-v1-1 or LLaVA-13B-v1-1 [29] as the base multimodal LLM $\mathcal{F}$, and adopt the ViT-H SAM [19] backbone as the vision backbone $\mathcal{F}_{enc}$. The projection layer $\gamma$ is an MLP with channels of [256, 4096, 4096].
Implementation Details. We adopt 8 NVIDIA 24G 3090 GPUs for training. The training scripts are based on the deepspeed [41] engine. We use the AdamW [33] optimizer with the learning rate and weight decay set to 0.0003 and 0, respectively. We also adopt WarmupDecayLR as the learning rate scheduler, where the warmup iterations are set to 100. The weights of the text generation loss $\lambda_{txt}$ and the mask loss $\lambda_{mask}$ are set to 1.0 and 1.0, respectively, and those of the BCE loss $\lambda_{bce}$ and the DICE loss $\lambda_{dice}$ are set to 2.0 and 0.5, respectively. Besides, the batch size per device is set to 2, and the gradient accumulation step is set to 10. During training, we select at most 3 categories for each image in semantic segmentation datasets.
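Ignoring the deepspeed-specific configuration, a plain PyTorch stand-in for this optimizer and schedule could look like the following; the linear warmup-then-decay shape is a simplified approximation of WarmupDecayLR, and the function name is illustrative.

```python
import torch

def build_optimizer_and_scheduler(trainable_params, total_steps, warmup_steps=100, base_lr=3e-4):
    """AdamW with linear warmup followed by linear decay (an approximation of WarmupDecayLR)."""
    optimizer = torch.optim.AdamW(trainable_params, lr=base_lr, weight_decay=0.0)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                         # linear warmup to base_lr
        remaining = max(1, total_steps - warmup_steps)
        return max(0.0, (total_steps - step) / remaining)              # linear decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Loss weights from this section: lambda_txt = lambda_mask = 1.0, lambda_bce = 2.0, lambda_dice = 0.5;
# batch size per device 2 with gradient accumulation step 10.
```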
Datasets. As mentioned in Sec. 4.2, our training data is mainly composed of three types of datasets: (1) For the semantic segmentation dataset, we use ADE20K [62] and COCO-Stuff [3]. Besides, to enhance the segmentation result for some part of an object, we also use part semantic segmentation datasets, including PACO-LVIS [40], PartImageNet [13], and PASCAL-Part [6]; (2) For the referring segmentation dataset, we use refCLEF, refCOCO, refCOCO+ [17], and refCOCOg [35]; (3) For the visual question answering (VQA) dataset, we use the datasets of LLaVA-Instruct-150k for LLaVA v1 [29] and LLaVA-v1.5-mix665k for LLaVA v1.5 [28]. In order to avoid data leakage, we exclude the COCO samples whose images are present in the refCOCO(+/g) validation sets during training. Furthermore, we surprisingly find that by fine-tuning the model on only 239 ReasonSeg data samples, the model’s performance can be further boosted.
Evaluation Metrics. We follow most previous works on referring segmentation [17, 35] to adopt two metrics: gIoU and cIoU. gIoU is defined by the average of all per-image Intersection-over-Unions (IoUs), while cIoU is defined by the cumulative intersection over the cumulative union. Since cIoU is highly biased toward large-area objects and it fluctuates too much, gIoU is preferred.
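The two metrics follow directly from these definitions; the sketch below assumes binary NumPy masks and is included only to pin down the difference between averaging per-image IoUs (gIoU) and pooling intersections and unions over the whole split (cIoU).

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks, eps=1e-6):
    """gIoU: mean of per-image IoUs; cIoU: cumulative intersection over cumulative union."""
    per_image_ious, total_inter, total_union = [], 0.0, 0.0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        per_image_ious.append(inter / (union + eps))
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image_ious))
    ciou = float(total_inter / (total_union + eps))
    return giou, ciou
```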

5.2. Reasoning Segmentation Results

The reasoning segmentation results are shown in Table 1. It is worth noting that existing works fail to handle the task, but our model can accomplish the task involving complex reasoning with more than 20% gIoU performance boost. As mentioned before, the reasoning segmentation task is essentially different from the referring segmentation task in that it requires the model to possess reasoning ability or access world knowledge. Only by truly understanding the query can the model do well in the task. The existing works have no proper way to understand an implicit query, but our model exploits multimodal LLMs to reach the goal.
Notably, we also make a comparison with the vanilla two-stage method (LLaVA1.5 + OVSeg). Specifically, the two-stage method refers to first using a multimodal LLM (e.g., LLaVA v1.5) to generate a text output for the input query, and then adopting a referring or open-vocabulary segmentation model (e.g., OVSeg) to generate the segmentation mask. If the intermediate text output remains too long and exceeds the input token length limit of OVSeg, we use GPT-3.5 to further summarize. More details can be found in the supplementary material. The results in Table 1 show that our model outperforms the two-stage method significantly. We explain that the potential reasons are: 1) Our model is trained end-to-end, while the two-stage method is completely decoupled; 2) The two-stage method relies on text as an intermediary to transmit information, while our model utilizes the hidden embedding that is more expressive.

Table 2. Referring segmentation results (cIoU) among LISA (ours) and existing methods.

| Method | refCOCO (val / testA / testB) | refCOCO+ (val / testA / testB) | refCOCOg (val(U) / test(U)) |
| :--- | :---: | :---: | :---: |
| MCN [34] | 62.4 / 64.2 / 59.7 | 50.6 / 55.0 / 44.7 | 49.2 / 49.4 |
| VLT [11] | 67.5 / 70.5 / 65.2 | 56.3 / 61.0 / 50.1 | 55.0 / 57.7 |
| CRIS [48] | 70.5 / 73.2 / 66.1 | 62.3 / 68.1 / 53.7 | 59.9 / 60.4 |
| LAVT [53] | 72.7 / 75.8 / 68.8 | 62.1 / 68.4 / 55.1 | 61.2 / 62.1 |
| ReLA [27] | 73.8 / 76.5 / 70.2 | 66.0 / 71.0 / 57.7 | 65.0 / 66.0 |
| X-Decoder [65] | - / - / - | - / - / - | 64.6 / - |
| SEEM [66] | - / - / - | - / - / - | 65.7 / - |
| LISA-7B | 74.1 / 76.5 / 71.1 | 62.4 / 67.4 / 56.5 | 66.4 / 68.5 |
| LISA-7B (fine-tuned on ReferSeg) | 74.9 / 79.1 / 72.3 | 65.1 / 70.8 / 58.1 | 67.9 / 70.6 |
Another finding is that LISA-13B outperforms the 7B counterpart substantially, especially on the long-query scenarios, which indicates that the current performance bottleneck may still lie in understanding the query text, and a stronger multimodal LLM (e.g., LLaVA v1.5 [28]) leads to even better results.

5.3. Vanilla Referring Segmentation Results

To show that our model is also competent in the vanilla referring segmentation task, we make a comparison with existing state-of-the-art methods in Table 2. We evaluate the methods on refCOCO, refCOCO+, refCOCOg validation and testing sets. Our model achieves state-of-the-art results across various referring segmentation benchmarks.

5.4. Ablation Study

In this section, we conduct an extensive ablation study to reveal the contribution of each component. Unless otherwise specified, we report the metrics of gIoU and cIoU of LISA-7B on the validation set.
Design Choices of Vision Backbone. We emphasize that vision backbones other than SAM are also applicable in our framework. In Table 3, we notice that SAM performs the best, potentially because of the massive high-quality data used in its pre-training phase. Further, we also find that with the Mask2Former backbone, our framework still achieves a decent performance on the reasoning segmentation task, significantly outperforming previous works such as X-Decoder [65]. This reveals the fact that the design choice of vision backbone is flexible and not limited to SAM.

Table 3. Ablation study on the design choice of vision backbone. ‘ft’ denotes fine-tuning on the ReasonSeg training set.

| Vision Backbone | gIoU | cIoU |
| :--- | :---: | :---: |
| Mask2Former-Swin-L | 42.4 | 38.8 |
| SAM (w/ LoRA) | 41.5 | 37.3 |
| SAM | 44.4 | 46.0 |
| Mask2Former-Swin-L (ft) | 50.7 | 52.3 |
| SAM (w/ LoRA) (ft) | 51.8 | 51.9 |
| SAM (ft) | 52.9 | 54.0 |

Table 4. Ablation study on SAM pre-trained weight and rephrasing.

| Exp. ID | Pre-train SAM | rephrasing | gIoU | cIoU |
| :---: | :---: | :---: | :---: | :---: |
| 1 | | ✓ | 35.9 | 44.6 |
| 2 | ✓ | | 50.7 | 51.1 |
| 3 | ✓ | ✓ | 52.9 | 54.0 |
SAM LoRA Fine-tuning. We also investigate the effectiveness of applying LoRA on the SAM backbone. In Table 3, we note that the performance of the LoRA fine-tuned SAM backbone is inferior to that of the frozen one. A potential reason is that fine-tuning impairs the generalization ability of the original SAM model.
SAM Pre-trained Weight. To demonstrate the contribution of SAM pre-trained weight, we make a comparison between Experiments 1 and 3 in Table 4. Without being initialized with SAM pre-trained weight, the vision backbone is trained from scratch. This causes the performance to fall substantially behind that of the baseline model.

Figure 5. Visual comparison among LISA (ours) and existing related methods. More illustrations are given in the supplementary material.
Table 5. Ablation study on training data.
| ID | SemanticSeg | | | ReferSeg | VQA | ReasonSeg | gIoU | cIoU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | ✓ | ✓ | | ✓ | ✓ | ✓ | 48.9 | 53.5 |
| 2 | | | ✓ | ✓ | ✓ | ✓ | 48.5 | 50.8 |
| 3 | | ✓ | | ✓ | ✓ | ✓ | 46.7 | 50.9 |
| 4 | | | ✓ | ✓ | ✓ | ✓ | 46.6 | 46.7 |
| 5 | | | | ✓ | ✓ | ✓ | 30.4 | 20.4 |
| 6 | ✓ | ✓ | ✓ | | ✓ | ✓ | 47.7 | 51.1 |
| 7 | ✓ | ✓ | ✓ | ✓ | ✓ | | 44.4 | 46.0 |
| 8 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 52.9 | 54.0 |
Table 6. Results on the ReasonSeg test set.
| Training splits | # data samples | gIoU | cIoU |
| :--- | :---: | :---: | :---: |
| train | 239 | 51.7 | 51.1 |
| train + val | 439 | 54.0 | 54.9 |
Instruction Rephrasing by GPT-3.5. When fine-tuning the model on the reasoning segmentation data samples, we rephrase the text instruction by GPT-3.5 (the details are shown in the supplementary material), and randomly choose one. The comparison between Experiments 2 and 3 in Table 4 shows that the performance is increased by 2.2% gIoU and 2.9% cIoU. This result verifies the effectiveness of such data augmentation.
Contribution of All Types of Training Data. In Table 5, we show the contribution of each type of data to the performance. We find that in Exp. 5, we do not use any semantic segmentation dataset, and the performance drops a lot. We conjecture that semantic segmentation datasets provide a large amount of ground-truth binary masks for training, since a multi-class label can induce multiple binary masks.
We also notice that adding more reasoning segmentation data samples during training leads to better results. In Table 6, we also add the ReasonSeg val set (200 data samples) during fine-tuning, and it yields better performance in both gIoU and cIoU metrics. This indicates that more reasoning segmentation training samples are beneficial at this moment.

5.5. Qualitative Results

As depicted in Fig. 5, we provide a visual comparison with existing related works, including the model for open-vocabulary semantic segmentation (OVSeg), referring segmentation (GRES), and the generalist models for segmentation (X-Decoder and SEEM). These models fail to handle the displayed cases with various errors, while our approach produces accurate and high-quality segmentation results. More illustrations are given in the supplementary material.

6. Conclusion

In this work, we have proposed a new segmentation task, reasoning segmentation. Also, we have introduced an evaluation benchmark ReasonSeg, which comprises over one thousand data samples. Finally, we have presented our model, LISA. It injects segmentation capabilities into current multimodal LLMs and performs surprisingly effectively on the reasoning segmentation task. We hope our work can shed new light on the direction of combining LLMs and vision tasks in the future.

Acknowledgements

This work is supported in part by the Research Grants Council under the Areas of Excellence scheme grant AoE/E-601/22-R and the Shenzhen Science and Technology Program under No. KQTD20210811090149095.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022. 1, 3

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017. 2

[3] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Cocostuff: Thing and stuff classes in context. In CVPR, 2018. 6

[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 2

[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018. 2

[6] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014. 6

[7] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020. 2

[8] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Perpixel classification is not all you need for semantic segmentation. NeurIPS, 2021. 2

[9] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022. 2, 4

[10] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richlyannotated 3d reconstructions of indoor scenes. In CVPR, 2017. 2, 3

[11] Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vision-language transformer and query generation for referring segmentation. In ICCV, 2021. 7

[12] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019. 2

[13] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. In ECCV, 2022. 6

[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017. 2

[15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan AllenZhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv:2106.09685, 2021. 4, 5

[16] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019. 2

[17] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 2, 3, 6

[18] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019. 2

[19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv:2304.02643, 2023. 2, 4, 6

[20] Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs. 2023. 3

[21] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 2, 3

[22] Xin Lai, Zhuotao Tian, Li Jiang, Shu Liu, Hengshuang Zhao, Liwei Wang, and Jiaya Jia. Semi-supervised semantic segmentation with directional context-aware consistency. In CVPR, 2021. 2

[23] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv:2305.03726, 2023. 1, 3

[24] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023. 1, 3

[25] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation. In CVPR, 2021. 2

[26] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, 2023. 6

[27] Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In CVPR, 2023. 6, 7

[28] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint, 2023. 1, 5, 6, 7

[29] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv:2304.08485, 2023. 1, 3, 4, 5, 6

[30] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pretraining for open-set object detection. arXiv preprint, 2023. 6

[31] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. arXiv preprint, 2015. 2

[32] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Internchat: Solving visioncentric tasks by interacting with chatbots beyond language. arXiv:2305.05662, 2023. 3

[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017. 6

[34] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task collaborative network for joint referring expression comprehension and segmentation. In CVPR, 2020. 7

[35] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016. 6

[36] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016. 2

[37] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015. 2

[38] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023. 3

[39] Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, and Tong Zhang. Detgpt: Detect what you need via reasoning. arXiv:2305.14167, 2023. 3

[40] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. In CVPR, 2023. 6

[41] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In SIGKDD, 2020. 6

[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 2

[43] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. TPAMI, 2017. 2

[44] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv:2303.17580, 2023. 3

[45] Zhuotao Tian, Pengguang Chen, Xin Lai, Li Jiang, Shu Liu, Hengshuang Zhao, Bei Yu, Ming-Chang Yang, and Jiaya Jia. Adaptive perspective distillation for semantic segmentation. TPAMI, 2022. 2

[46] Zhuotao Tian, Jiequan Cui, Li Jiang, Xiaojuan Qi, Xin Lai, Yixin Chen, Shu Liu, and Jiaya Jia. Learning context-aware classifier for semantic segmentation. AAAI, 2023. 2

[47] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv:2305.11175, 2023. 3

[48] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. In CVPR, 2022. 7

[49] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv:2303.04671, 2023. 3
[50] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. Upsnet: A unified panoptic segmentation network. In CVPR, 2019. 2

[51] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018. 2

[52] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv:2305.18752, 2023. 3

[53] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. In CVPR, 2022. 7

[54] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv:2303.11381, 2023. 3

[55] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv:2304.14178, 2023. 1, 3

[56] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 2

[57] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv:2307.03601, 2023. 3

[58] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-net: Towards unified image segmentation. NeurIPS, 2021. 2

[59] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017. 2

[60] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In ECCV, 2018.

[61] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Pointwise spatial attention network for scene parsing. In ECCV, 2018. 2

[62] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017. 6

[63] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023. 1, 3

[64] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In ICCV, 2019. 2

[65] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In CVPR, 2023. 2, 6, 7

[66] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. arXiv:2304.06718, 2023. 2, 4, 6, 7

*Equal Contribution
† Corresponding Author.