
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Figure 1. We present SAM-6D for zero-shot 6D object pose estimation. SAM-6D takes an RGB image (a) and a depth map (b) of a cluttered scene as inputs, and performs instance segmentation (d) and pose estimation (e) for novel objects (c). We present the qualitative results of SAM-6D on the seven core datasets of the BOP benchmark [54], including YCB-V, LM-O, HB, T-LESS, IC-BIN, ITODD and TUD-L, arranged from left to right. Best viewed in the electronic version.

Abstract

Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.

1. Introduction

Object pose estimation is fundamental in many real-world applications, such as robotic manipulation and augmented reality. Its evolution has been significantly influenced by the emergence of deep learning models. The most studied task in this field is Instance-level 6D Pose Estimation [18, 19, 51, 58, 60, 63], which demands annotated training images of the target objects, thereby making the deep models object-specific. Recently, the research emphasis has gradually shifted towards the task of Category-level 6D Pose Estimation [7, 29-32, 56, 61] for handling unseen objects, provided that they belong to certain categories of interest. In this paper, we thus delve into the broader task setting of Zero-shot 6D Object Pose Estimation [5, 28], which aspires to detect all instances of novel objects, unseen during training, and estimate their 6D poses. Despite its significance, this zero-shot setting presents considerable challenges in both object detection and pose estimation.
Recently, the Segment Anything Model (SAM) [26] has garnered attention due to its remarkable zero-shot segmentation performance, which enables promptable segmentation with a variety of prompts, e.g., points, boxes, texts or masks. By prompting SAM with evenly sampled 2D grid points, one can generate potential class-agnostic object proposals, which may be highly beneficial for zero-shot 6D object pose estimation. To this end, we propose a novel framework, named SAM-6D, which employs SAM as an advanced starting point for the focused zero-shot task. Fig. 2 gives an overview illustration of SAM-6D. Specifically, SAM-6D employs an Instance Segmentation Model (ISM) to realize instance segmentation of novel objects by enhancing SAM with a carefully crafted object matching score, and a Pose Estimation Model (PEM) to solve object poses through a two-stage process of partial-to-partial point matching.
The Instance Segmentation Model (ISM) is developed using SAM to take advantage of its zero-shot abilities for generating all possible class-agnostic proposals, and then assigns a meticulously calculated object matching score to each proposal for ascertaining whether it aligns with a given novel object. In contrast to methods that solely focus on object semantics [5, 40], we design the object matching scores considering three terms, including semantics, appearance and geometry. For each proposal, the first term assesses its semantic matching degree to the rendered templates of the object, while the second one further evaluates its appearance similarities to the best-matched template. The final term considers the matching degree based on geometry, such as object shape and size, by calculating the Intersection-over-Union (IoU) value between the bounding boxes of the proposal and the 2D projection of the object transformed by a rough pose estimate.

Figure 2. An overview of our proposed SAM-6D, which consists of an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM) for joint instance segmentation and pose estimation of novel objects in RGB-D images. ISM leverages the Segment Anything Model (SAM) [26] to generate all possible proposals and selectively retains valid ones based on object matching scores. PEM involves two stages of point matching, from coarse to fine, to establish 3D-3D correspondence and calculate object poses for all valid proposals. Best viewed in the electronic version.

The Pose Estimation Model (PEM) is designed to calculate a 6D object pose for each identified proposal that matches the novel object. Initially, we formulate this pose estimation challenge as a partial-to-partial point matching problem between the sampled point sets of the proposal and the target object, considering factors such as occlusions, segmentation inaccuracies, and sensor noises. To solve this problem, we propose a simple yet effective solution that involves the use of background tokens; specifically, for the two point sets, we learn to align their non-overlapped points with the background tokens in the feature space, and thus effectively establish an assignment matrix to build the necessary correspondence for predicting the object pose. Based on the design of background tokens, we further develop PEM with two point matching stages, i.e., Coarse Point Matching and Fine Point Matching. The first stage realizes sparse correspondence to derive an initial object pose, which is subsequently used to transform the point set of the proposal, enabling the learning of positional encodings. The second stage incorporates the positional encodings of the two point sets to inject the initial correspondence, and builds dense correspondence for estimating a more precise object pose. To effectively model dense interactions in the second stage, we propose an innovative design of Sparse-to-Dense Point Transformers, which realize interactions on the sparse versions of the dense features, and subsequently distribute the enhanced sparse features back to the dense ones using Linear Transformers [12, 24].
For the two models of SAM-6D, ISM, built on SAM, does not require any network re-training or fine-tuning, while PEM is trained on the large-scale synthetic images of the ShapeNet-Objects [4] and Google-Scanned-Objects [9] datasets provided by [28]. We evaluate SAM-6D on the seven core datasets of the BOP benchmark [54], including LM-O, T-LESS, TUD-L, IC-BIN, ITODD, HB, and YCB-V. The qualitative results are visualized in Fig. 1. SAM-6D outperforms the existing methods on both tasks of instance segmentation and pose estimation of novel objects, thereby showcasing its robust generalization capabilities.
Our main contributions can be summarized as follows:
  • We propose a novel framework of SAM-6D, which realizes joint instance segmentation and pose estimation of novel objects from RGB-D images, and outperforms the existing methods on the seven core datasets of the BOP benchmark.
  • We leverage the zero-shot capacities of the Segment Anything Model (SAM) to generate all possible proposals, and devise a novel object matching score to identify the proposals corresponding to novel objects.
  • We approach pose estimation as a partial-to-partial point matching problem with a simple yet effective design of background tokens, and propose a two-stage point matching model for novel objects. The first stage realizes coarse point matching to derive initial object poses, which are then refined in the second stage of fine point matching using newly proposed Sparse-to-Dense Point Transformers.

2. Related Work

2.1. Segment Anything

Segment Anything (SA) [26] is a promptable segmentation task that focuses on predicting valid masks for various types of prompts, e.g., points, boxes, text, and masks. To tackle this task, the authors propose a powerful segmentation model called the Segment Anything Model (SAM), which comprises three components, including an image encoder, a prompt encoder and a mask decoder. SAM has demonstrated remarkable zero-shot transfer segmentation performance in real-world scenarios, including challenging situations such as medical images [36, 37, 71], camouflaged objects [22, 55], and transparent objects [13, 23]. Moreover, SAM has exhibited high versatility across numerous vision applications [69], such as image inpainting [35, 62, 64, 67], object tracking [15, 65, 73], 3D detection and segmentation [2, 66, 70], and 3D reconstruction [3, 49, 59].
Recent studies have also investigated semantically segmenting anything due to the critical role of semantics in vision tasks. Semantic Segment Anything (SSA) [6] is proposed on top of SAM, aiming to assign semantic categories to the masks generated by SAM. Both PerSAM [72] and Matcher [34] employ SAM to segment the object belonging to a specific category in a query image by searching for point prompts with the aid of a reference image containing an object of the same category. CNOS [40] is proposed to segment all instances of a given object model, which firstly generates mask proposals via SAM and subsequently filters out proposals with low feature similarities against object templates rendered from the object model.
For efficiency, FastSAM [74] replaces the visual transformers used in SAM with an instance segmentation network built on regular convolutional networks. Additionally, MobileSAM [68] replaces the heavy encoder of SAM with a lightweight one through decoupled distillation.

2.2. Pose Estimation of Novel Objects

Methods Based on Image Matching. Methods within this group [1, 28, 33, 38, 39, 41, 42, 46, 50] often involve comparing object proposals to templates of the given novel objects, which are rendered with a series of object poses, to retrieve the best-matched object poses. For example, Gen6D [33], OVE6D [1], and GigaPose [41] are designed to select the viewpoint rotations via image matching and then estimate the in-plane rotations to obtain the final estimates. MegaPose [28] employs a coarse estimator that treats image matching as a classification problem, and the recognized object poses are further updated by a refiner.

Methods Based on Feature Matching. Methods within this group [5, 10, 11, 17, 20, 53] align the 2D pixels or 3D points of the proposals with the object surface in the feature space [21, 52], thereby building correspondence to compute object poses. OnePose [53] matches the pixel descriptors of proposals with the aggregated point descriptors of the point sets constructed by Structure from Motion (SfM) for 2D-3D correspondence, while OnePose++ [17] further improves it with a keypoint-free SfM and a sparse-to-dense 2D-3D matching model. ZeroPose [5] realizes 3D-3D matching via geometric structures, and GigaPose [41] establishes 2D-2D correspondence to regress in-plane rotation and 2D scale. Moreover, [11] introduces a zero-shot category-level 6D pose estimation task, along with a self-supervised semantic correspondence learning method. Unlike the above one-stage point matching work, the unique contributions of our Pose Estimation Model are: (a) a two-stage pipeline that boosts performance by incorporating coarse correspondence for finer matching, (b) an efficient design of background tokens to eliminate the need for optimal transport with iterative optimization [48], and (c) a Sparse-to-Dense Point Transformer to effectively model dense relationships.

3. Methodology of SAM-6D

We present SAM-6D for zero-shot 6D object pose estimation, which aims to detect all instances of a specific novel object, unseen during training, along with their 6D object poses in the RGB-D images. To realize the challenging task, SAM-6D breaks it down into two steps via two dedicated sub-networks, i.e., an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM), to first segment all instances and then individually predict their 6D poses, as shown in Fig. 2. We detail the architectures of ISM and PEM in Sec. 3.1 and Sec. 3.2, respectively.

3.1. Instance Segmentation Model

SAM-6D uses an Instance Segmentation Model (ISM) to segment the instances of a novel object $\mathcal{O}$. Given a cluttered scene, represented by an RGB image $\mathcal{I}$, ISM leverages the zero-shot transfer capabilities of the Segment Anything Model (SAM) [26] to generate all possible proposals $\mathcal{M}$. For each proposal $m \in \mathcal{M}$, ISM calculates an object matching score $s_m$ to assess the matching degree between $m$ and $\mathcal{O}$ in terms of semantics, appearance, and geometry. The matched instances with $\mathcal{O}$ can then be identified by simply setting a matching threshold $\delta_m$.
In this subsection, we initially provide a brief review of SAM in Sec. 3.1.1 and then explain the computation of the object matching score $s_m$ in Sec. 3.1.2.

3.1.1 Preliminaries of Segment Anything Model

Given an RGB image $\mathcal{I}$, the Segment Anything Model (SAM) [26] realizes promptable segmentation with various types of prompts $\mathcal{P}_r$, e.g., points, boxes, texts, or masks. Specifically, SAM consists of three modules, including an image encoder $\Phi_{\text{Image}}$, a prompt encoder $\Phi_{\text{Prompt}}$, and a mask decoder $\Psi_{\text{Mask}}$, which could be formulated as follows:

$$\mathcal{M}, \mathcal{C} = \Psi_{\text{Mask}}\left(\Phi_{\text{Image}}(\mathcal{I}), \Phi_{\text{Prompt}}(\mathcal{P}_r)\right), \tag{1}$$

where $\mathcal{M}$ and $\mathcal{C}$ denote the predicted proposals and the corresponding confidence scores, respectively.
To realize zero-shot transfer, one can prompt SAM with evenly sampled 2D grid points to yield all possible proposals, which are then filtered based on confidence scores, retaining only those with higher scores, and passed through Non-Maximum Suppression to eliminate redundant detections.
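As a concrete reference, the following is a minimal sketch of such grid-prompted, class-agnostic proposal generation with the publicly released `segment_anything` package; the checkpoint filename and the threshold values are illustrative assumptions rather than the exact settings used in SAM-6D.

```python
# Minimal sketch: prompt SAM with a regular 2D grid of points, filter proposals
# by predicted quality, and suppress duplicates with NMS on their boxes.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # evenly sampled grid of point prompts
    pred_iou_thresh=0.88,         # keep proposals with high predicted IoU (illustrative)
    stability_score_thresh=0.95,  # keep stable masks (illustrative)
    box_nms_thresh=0.7,           # non-maximum suppression over proposal boxes
)

image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)
# Each proposal is a dict with keys such as 'segmentation', 'bbox', 'predicted_iou'.
proposals = mask_generator.generate(image)
```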

3.1.2 Object Matching Score

Given the proposals $\mathcal{M}$, the next step is to identify the ones that are matched with a specified object $\mathcal{O}$ by assigning each proposal $m \in \mathcal{M}$ with an object matching score $s_m$, which comprises three terms, each evaluating the matches in terms of semantics, appearance, and geometry, respectively.
Following [40], we sample $N_{\mathcal{T}}$ object poses in $\operatorname{SE}(3)$ space to render the templates $\{\mathcal{T}_k\}_{k=1}^{N_{\mathcal{T}}}$ of $\mathcal{O}$, which are fed into a pre-trained visual transformer (ViT) backbone [8] of DINOv2 [45], resulting in the class embedding $\boldsymbol{f}_{\mathcal{T}_k}^{cls}$ and $N_{\mathcal{T}_k}^{\text{patch}}$ patch embeddings $\{\boldsymbol{f}_{\mathcal{T}_k,i}^{\text{patch}}\}_{i=1}^{N_{\mathcal{T}_k}^{\text{patch}}}$ of each template $\mathcal{T}_k$. For each proposal $m$, we crop the detected region out from $\mathcal{I}$, and resize it to a fixed resolution. The image crop is denoted as $\mathcal{I}_m$ and also processed through the same ViT to obtain the class embedding $\boldsymbol{f}_{\mathcal{I}_m}^{cls}$ and the patch embeddings $\{\boldsymbol{f}_{\mathcal{I}_m,j}^{\text{patch}}\}_{j=1}^{N_{\mathcal{I}_m}^{\text{patch}}}$, with $N_{\mathcal{I}_m}^{\text{patch}}$ denoting the number of patches within the object mask. Subsequently, we calculate the values of the individual score terms.
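As an illustration of this embedding step, class and patch embeddings of an image crop can be extracted with the released DINOv2 ViT-L/14 model roughly as below; the torch.hub entry point, the crop resolution and the preprocessing are assumptions for the sketch, not the exact ISM pipeline.

```python
# Sketch: extract a class embedding and patch embeddings for one image crop
# using the released DINOv2 ViT-L/14 backbone (torch.hub entry point assumed).
import torch
import torchvision.transforms as T

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval().cuda()
preprocess = T.Compose([
    T.Resize((224, 224)),  # fixed crop resolution (illustrative assumption)
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

def embed(crop_pil):
    x = preprocess(crop_pil).unsqueeze(0).cuda()
    with torch.no_grad():
        out = dinov2.forward_features(x)
    f_cls = out["x_norm_clstoken"][0]       # (1024,) class embedding
    f_patch = out["x_norm_patchtokens"][0]  # (256, 1024) embeddings of the 16x16 patches
    return f_cls, f_patch
```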
Semantic Matching Score. We compute a semantic score $s_{\text{sem}}$ through the class embeddings by averaging the top $K$ values from $\left\{\frac{\langle \boldsymbol{f}_{\mathcal{I}_m}^{cls}, \boldsymbol{f}_{\mathcal{T}_k}^{cls}\rangle}{|\boldsymbol{f}_{\mathcal{I}_m}^{cls}| \cdot |\boldsymbol{f}_{\mathcal{T}_k}^{cls}|}\right\}_{k=1}^{N_{\mathcal{T}}}$ to establish a robust measure of semantic matching, with $\langle \cdot, \cdot \rangle$ denoting an inner product. The template that yields the highest semantic value can be seen as the best-matched template, denoted as $\mathcal{T}_{\text{best}}$, and is used in the computation of the subsequent two scores.
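A minimal sketch of this semantic score, assuming the class embeddings are already available as tensors; the top-$K$ value below is an illustrative choice rather than the paper's setting.

```python
# Sketch: semantic score = mean of the top-K cosine similarities between the
# crop's class embedding and those of the N_T rendered templates.
import torch
import torch.nn.functional as F

def semantic_score(f_cls_crop, f_cls_templates, topk=5):
    # f_cls_crop: (C,), f_cls_templates: (N_T, C)
    sims = F.cosine_similarity(f_cls_crop[None, :], f_cls_templates, dim=1)  # (N_T,)
    s_sem = sims.topk(k=min(topk, sims.numel())).values.mean()
    best_idx = sims.argmax()  # index of the best-matched template T_best
    return s_sem, best_idx
```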

Appearance Matching Score. Given $\mathcal{T}_{\text{best}}$, we compare $\mathcal{I}_m$ and $\mathcal{T}_{\text{best}}$ in terms of appearance using an appearance score $s_{\text{appe}}$, based on the patch embeddings, as follows:

$$s_{\text{appe}} = \frac{1}{N_{\mathcal{I}_m}^{\text{patch}}} \sum_{j=1}^{N_{\mathcal{I}_m}^{\text{patch}}} \max_{i=1,\ldots,N_{\mathcal{T}_{\text{best}}}^{\text{patch}}} \frac{\langle \boldsymbol{f}_{\mathcal{I}_m,j}^{\text{patch}}, \boldsymbol{f}_{\mathcal{T}_{\text{best}},i}^{\text{patch}}\rangle}{|\boldsymbol{f}_{\mathcal{I}_m,j}^{\text{patch}}| \cdot |\boldsymbol{f}_{\mathcal{T}_{\text{best}},i}^{\text{patch}}|}. \tag{2}$$

$s_{\text{appe}}$ is utilized to distinguish objects that are semantically similar but differ in appearance.
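A direct sketch of Eq. (2), assuming the patch embeddings have already been restricted to the patches inside the object masks:

```python
# Sketch: appearance score = average, over the crop's masked patches, of the
# best cosine similarity to any patch of the best-matched template.
import torch
import torch.nn.functional as F

def appearance_score(f_patch_crop, f_patch_best):
    # f_patch_crop: (N_crop_patch, C), f_patch_best: (N_best_patch, C)
    a = F.normalize(f_patch_crop, dim=1)
    b = F.normalize(f_patch_best, dim=1)
    sim = a @ b.T                        # (N_crop_patch, N_best_patch) cosine similarities
    return sim.max(dim=1).values.mean()  # max over template patches, mean over crop patches
```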

Geometric Matching Score. In terms of geometry, we score the proposal $m$ by considering factors like object shapes and sizes. Utilizing the object rotation from $\mathcal{T}_{\text{best}}$ and the mean location of the cropped points of $m$, we have a coarse pose to transform the object $\mathcal{O}$, which is then projected onto the image to obtain a compact bounding box $\mathcal{B}_o$. Afterwards, the Intersection-over-Union (IoU) value between $\mathcal{B}_o$ and the bounding box $\mathcal{B}_m$ of $m$ is used as the geometric score $s_{\text{geo}}$:

$$s_{\text{geo}} = \frac{\mathcal{B}_m \cap \mathcal{B}_o}{\mathcal{B}_m \cup \mathcal{B}_o}. \tag{3}$$

The reliability of $s_{\text{geo}}$ is easily impacted by occlusions. We thus compute a visible ratio $r_{\text{vis}}$ to evaluate the confidence of $s_{\text{geo}}$, which is detailed in the supplementary materials.

By combining the above three score terms, the object matching score $s_m$ could be formulated as follows:

$$s_m = \frac{s_{\text{sem}} + s_{\text{appe}} + r_{\text{vis}} \cdot s_{\text{geo}}}{1 + 1 + r_{\text{vis}}}. \tag{4}$$
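For concreteness, a sketch of the geometric score and the combined score of Eqs. (3) and (4) is given below; the pinhole projection, the (x1, y1, x2, y2) box convention and the helper names are assumptions for illustration, not part of the original implementation.

```python
# Sketch: IoU between the proposal box and the box of the object projected
# under a rough pose (Eq. 3), combined with the other terms as in Eq. (4).
import numpy as np

def bbox_iou(box_m, box_o):
    # boxes as (x1, y1, x2, y2) in pixels
    x1, y1 = max(box_m[0], box_o[0]), max(box_m[1], box_o[1])
    x2, y2 = min(box_m[2], box_o[2]), min(box_m[3], box_o[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_m = (box_m[2] - box_m[0]) * (box_m[3] - box_m[1])
    area_o = (box_o[2] - box_o[0]) * (box_o[3] - box_o[1])
    return inter / (area_m + area_o - inter + 1e-8)

def project_bbox(points_obj, R, t, K):
    # points_obj: (N, 3) model points; (R, t): rough object-to-camera pose; K: 3x3 intrinsics
    cam = points_obj @ R.T + t         # transform into the camera frame
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]        # perspective projection to pixels
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])

def object_matching_score(s_sem, s_appe, s_geo, r_vis):
    return (s_sem + s_appe + r_vis * s_geo) / (2.0 + r_vis)  # Eq. (4)
```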

3.2. Pose Estimation Model

SAM-6D uses a Pose Estimation Model (PEM) to predict the 6D poses of the proposals matched with the object $\mathcal{O}$.

For each object proposal $m$, PEM uses a strategy of point registration to predict the 6D pose w.r.t. $\mathcal{O}$.

Figure 3. An illustration of the Pose Estimation Model (PEM) of SAM-6D.

Denoting the sampled point set of $m$ as $\mathcal{P}_m \in \mathbb{R}^{N_m \times 3}$ with $N_m$ points and that of $\mathcal{O}$ as $\mathcal{P}_o \in \mathbb{R}^{N_o \times 3}$ with $N_o$ points, the goal is to solve an assignment matrix to present the partial-to-partial correspondence between $\mathcal{P}_m$ and $\mathcal{P}_o$. Partial-to-partial correspondence arises as $\mathcal{P}_o$ only partially matches $\mathcal{P}_m$ due to occlusions, and $\mathcal{P}_m$ may partially align with $\mathcal{P}_o$ due to segmentation inaccuracies and sensor noises. We propose to equip their respective point features $\boldsymbol{F}_m \in \mathbb{R}^{N_m \times C}$ and $\boldsymbol{F}_o \in \mathbb{R}^{N_o \times C}$ with learnable Background Tokens, denoted as $\boldsymbol{f}_m^{bg} \in \mathbb{R}^{C}$ and $\boldsymbol{f}_o^{bg} \in \mathbb{R}^{C}$, where $C$ is the number of feature channels. This simple design resolves the assignment problem of non-overlapped points in two point sets, and the partial-to-partial correspondence thus could be effectively built based on feature similarities. Specifically, we can first compute the attention matrix $\mathcal{A}$ as follows:

$$\mathcal{A} = [\boldsymbol{f}_m^{bg}, \boldsymbol{F}_m] \times [\boldsymbol{f}_o^{bg}, \boldsymbol{F}_o]^T \in \mathbb{R}^{(N_m+1) \times (N_o+1)}, \tag{5}$$
and then obtain the soft assignment matrix $\tilde{\mathcal{A}}$:

$$\tilde{\mathcal{A}} = \operatorname{Softmax}_{\text{row}}(\mathcal{A} / \tau) \cdot \operatorname{Softmax}_{\text{col}}(\mathcal{A} / \tau), \tag{6}$$

where $\operatorname{Softmax}_{\text{row}}(\cdot)$ and $\operatorname{Softmax}_{\text{col}}(\cdot)$ denote Softmax operations executed along the row and column of the matrix, respectively, and $\tau$ is a constant temperature. The values in each row of $\tilde{\mathcal{A}}$, excluding the first row associated with the background, indicate the matching probabilities of the point $\boldsymbol{p}_m \in \mathcal{P}_m$ aligning with the background and the points in $\mathcal{P}_o$. Specifically, for $\boldsymbol{p}_m$, its corresponding point $\boldsymbol{p}_o \in \mathcal{P}_o$ can be identified by locating the index of the maximum score $\tilde{a} \in \tilde{\mathcal{A}}$ along the row; if this index equals zero, the embedding of $\boldsymbol{p}_m$ aligns with the background token, indicating it has no valid correspondence in $\mathcal{P}_o$. Once $\tilde{\mathcal{A}}$ is obtained, we can gather all the matched pairs $\{(\boldsymbol{p}_m, \boldsymbol{p}_o)\}$, along with their scores $\{\tilde{a}\}$, to compute the pose using weighted SVD.
Building on the above strategy with background tokens, PEM is designed with two point matching stages. For the proposal $m$ and the target object $\mathcal{O}$, the first stage involves Coarse Point Matching between their sparse point sets $\mathcal{P}_m^c$ and $\mathcal{P}_o^c$, while the second stage involves Fine Point Matching between their dense point sets $\mathcal{P}_m^f$ and $\mathcal{P}_o^f$; we use the superscripts 'c' and 'f' to indicate the respective variables of these two stages. The aim of the first stage is to derive a coarse pose $\boldsymbol{R}_{\text{init}}$ and $\boldsymbol{t}_{\text{init}}$ from sparse correspondence. Then in the second stage, we use the initial pose to transform $\mathcal{P}_m^f$ for learning the positional encodings, and employ stacked Sparse-to-Dense Point Transformers to learn dense correspondence for a final pose $\boldsymbol{R}$ and $\boldsymbol{t}$. Prior to the two point matching modules, we incorporate a Feature Extraction module to learn individual point features of $m$ and $\mathcal{O}$. Fig. 3 gives a detailed illustration of PEM.

3.2.1 Feature Extraction

The Feature Extraction module is employed to extract point-wise features $\boldsymbol{F}_m$ and $\boldsymbol{F}_o$ for the point sets $\mathcal{P}_m$ and $\mathcal{P}_o$ of the proposal $m$ and the given object $\mathcal{O}$, respectively.
Rather than directly extracting features from the discretized points $\mathcal{P}_m$, we apply the visual transformer (ViT) backbone [8] to the masked image crop $\mathcal{I}_m$ to capture patch-wise embeddings, which are then reshaped and interpolated to match the size of $\mathcal{I}_m$. Each point in $\mathcal{P}_m$ is assigned the corresponding pixel embedding, yielding $\boldsymbol{F}_m$.
We represent the object $\mathcal{O}$ with templates rendered from different camera views. All visible object pixels (points) are aggregated across views and sampled to create the point set $\mathcal{P}_o$. Corresponding pixel embeddings, extracted using the ViT backbone, are then used to form $\boldsymbol{F}_o$.
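A minimal sketch of this assignment of pixel embeddings to sampled points, assuming a square patch grid and integer pixel coordinates for the points:

```python
# Sketch: reshape ViT patch embeddings of the masked crop into a 2D grid,
# bilinearly upsample to the crop resolution, and index with point pixel coords.
import torch
import torch.nn.functional as F

def pointwise_features(f_patch, crop_hw, pixel_xy):
    # f_patch: (h*w, C) patch embeddings; crop_hw: (H, W); pixel_xy: (N_m, 2) long tensor
    hw = int(f_patch.shape[0] ** 0.5)                              # assume a square patch grid
    grid = f_patch.reshape(hw, hw, -1).permute(2, 0, 1)[None]      # (1, C, h, w)
    dense = F.interpolate(grid, size=crop_hw, mode="bilinear", align_corners=False)[0]
    return dense[:, pixel_xy[:, 1], pixel_xy[:, 0]].T              # (N_m, C), one feature per point
```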

3.2.2 Coarse Point Matching

The Coarse Point Matching module is used to initialize a coarse object pose $\boldsymbol{R}_{\text{init}}$ and $\boldsymbol{t}_{\text{init}}$ by estimating a soft assignment matrix $\tilde{\mathcal{A}}^c$ between sparse versions of $\mathcal{P}_m$ and $\mathcal{P}_o$.
As shown in Fig. 3, we first sample a sparse point set $\mathcal{P}_m^c \in \mathbb{R}^{N_m^c \times 3}$ with $N_m^c$ points from $\mathcal{P}_m$, and $\mathcal{P}_o^c \in \mathbb{R}^{N_o^c \times 3}$ with $N_o^c$ points from $\mathcal{P}_o$, along with their respective sampled features $\boldsymbol{F}_m^c$ and $\boldsymbol{F}_o^c$. Then we concatenate $\boldsymbol{F}_m^c$ and $\boldsymbol{F}_o^c$ with learnable background tokens, and process them through $T^c$ stacked Geometric Transformers [48], each of which consists of a geometric self-attention for intra-point-set feature learning and a cross-attention for inter-point-set correspondence modeling. The processed features, denoted as $\tilde{\boldsymbol{F}}_m^c$ and $\tilde{\boldsymbol{F}}_o^c$, are subsequently used to compute the soft assignment matrix $\tilde{\mathcal{A}}^c$ based on (5) and (6).
With $\tilde{\mathcal{A}}^c$, we obtain the matching probabilities between the overlapped points of $\mathcal{P}_m^c$ and $\mathcal{P}_o^c$, which can serve as the distribution to sample multiple triplets of point pairs and compute pose hypotheses [14, 25]. We assign each pose hypothesis $\boldsymbol{R}_{\text{hyp}}$ and $\boldsymbol{t}_{\text{hyp}}$ a pose matching score $s_{\text{hyp}}$ as:
$$s_{\text{hyp}} = N_m^c \Big/ \sum_{\boldsymbol{p}_m^c \in \mathcal{P}_m^c} \min_{\boldsymbol{p}_o^c \in \mathcal{P}_o^c} \left\| \boldsymbol{R}_{\text{hyp}}^T \left(\boldsymbol{p}_o^c - \boldsymbol{t}_{\text{hyp}}\right) - \boldsymbol{p}_m^c \right\|_2. \tag{7}$$
Among the pose hypotheses, the one with the highest pose matching score is chosen as the initial pose $\boldsymbol{R}_{\text{init}}$ and $\boldsymbol{t}_{\text{init}}$, which is fed into the subsequent Fine Point Matching module.
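A sketch of the hypothesis scoring of Eq. (7), assuming the sparse point sets are given as tensors:

```python
# Sketch: score a pose hypothesis by the inverse of the summed distance from
# each sparse proposal point to its nearest object point, after transforming
# the object points into the proposal frame (Eq. 7).
import torch

def hypothesis_score(R_hyp, t_hyp, P_m_c, P_o_c):
    # P_m_c: (N_m^c, 3) proposal points, P_o_c: (N_o^c, 3) object points
    P_o_in_m = (P_o_c - t_hyp) @ R_hyp        # rows equal R_hyp^T (p_o - t_hyp)
    dists = torch.cdist(P_m_c, P_o_in_m)      # (N_m^c, N_o^c) pairwise distances
    return P_m_c.shape[0] / dists.min(dim=1).values.sum()
```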

3.2.3 Fine Point Matching

The Fine Point Matching module is utilized to build dense correspondence and estimate a more precise pose $\boldsymbol{R}$ and $\boldsymbol{t}$.
To build finer correspondence, we sample a dense point set $\mathcal{P}_m^f \in \mathbb{R}^{N_m^f \times 3}$ with $N_m^f$ points from $\mathcal{P}_m$, and $\mathcal{P}_o^f \in \mathbb{R}^{N_o^f \times 3}$ with $N_o^f$ points from $\mathcal{P}_o$, along with their respective sampled features $\boldsymbol{F}_m^f$ and $\boldsymbol{F}_o^f$. We then inject the initial correspondence, learned by the coarse point matching, through the inclusion of positional encodings. Specifically, we transform $\mathcal{P}_m^f$ with the coarse pose $\boldsymbol{R}_{\text{init}}$ and $\boldsymbol{t}_{\text{init}}$ and apply it to a multi-scale Set Abstraction Level [47] to learn the positional encodings $\boldsymbol{F}_m^p$; similarly, positional encodings $\boldsymbol{F}_o^p$ are also learned for $\mathcal{P}_o^f$. We then add $\boldsymbol{F}_m^p$ and $\boldsymbol{F}_o^p$ to $\boldsymbol{F}_m^f$ and $\boldsymbol{F}_o^f$, concatenate each with a background token, and process them to yield $\tilde{\boldsymbol{F}}_m^f$ and $\tilde{\boldsymbol{F}}_o^f$, resulting in the soft assignment matrix $\tilde{\mathcal{A}}^f$ based on (5) and (6).
However, the commonly used transformers [48, 57] incur a significant computational cost when learning dense point features. The recent Linear Transformers [12, 24], while being more efficient, exhibit less effective modeling of point interactions, since they implement attention along the feature dimension. To address this, we propose a novel design of Sparse-to-Dense Point Transformer (SDPT), as shown in Fig. 3. Specifically, given two dense point features $\boldsymbol{F}_m^f$ and $\boldsymbol{F}_o^f$, SDPT first samples two sparse features from them and applies a Geometric Transformer [48] to enhance their interactions, resulting in two improved sparse features, denoted as $\boldsymbol{F}_m^{f\prime}$ and $\boldsymbol{F}_o^{f\prime}$. SDPT then employs a Linear Cross-attention [12] to spread the information from $\boldsymbol{F}_m^{f\prime}$ to $\boldsymbol{F}_m^f$, treating the former as the key and value of the transformer, and the latter as the query. The same operations are applied to $\boldsymbol{F}_o^{f\prime}$ and $\boldsymbol{F}_o^f$ to update $\boldsymbol{F}_o^f$.
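The following is a rough sketch of one SDPT step under these design choices; the Geometric Transformer block is left as a placeholder callable, the elu(·)+1 feature map follows common linear-attention practice [12, 24], and all dimensions and sampling indices are illustrative assumptions.

```python
# Sketch: sparse features interact via a (placeholder) Geometric Transformer,
# then a linear cross-attention spreads them back to the dense features, with
# the dense features as queries and the enhanced sparse ones as keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, dense, sparse):
        # dense: (N_dense, C) queries; sparse: (N_sparse, C) keys/values
        q = F.elu(self.q(dense)) + 1
        k = F.elu(self.k(sparse)) + 1
        v = self.v(sparse)
        kv = k.T @ v                                          # (C, C) key-value memory
        z = 1.0 / (q @ k.sum(dim=0, keepdim=True).T + 1e-6)   # (N_dense, 1) normaliser
        return dense + self.out((q @ kv) * z)                 # residual update of dense features

def sdpt_step(F_m_dense, F_o_dense, idx_m, idx_o, geo_block, cross_m, cross_o):
    # idx_m / idx_o: indices of the sampled sparse subsets (background token kept at index 0);
    # geo_block: placeholder for a Geometric Transformer operating on the two sparse sets.
    F_m_sp, F_o_sp = geo_block(F_m_dense[idx_m], F_o_dense[idx_o])   # sparse interaction
    return cross_m(F_m_dense, F_m_sp), cross_o(F_o_dense, F_o_sp)    # spread back to dense
```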
In Fine Point Matching, we stack $T^f$ SDPTs to model the dense correspondence and learn the soft assignment matrix $\tilde{\mathcal{A}}^f$. We note that, in each SDPT, the background tokens are consistently maintained in both the sparse and dense point features. After obtaining $\tilde{\mathcal{A}}^f$, we search within $\mathcal{P}_o^f$ for the points corresponding to all foreground points in $\mathcal{P}_m^f$, along with the probabilities, building dense correspondence, and compute the final object pose $\boldsymbol{R}$ and $\boldsymbol{t}$ via weighted SVD.

4. Experiments

In this section, we conduct experiments to evaluate our proposed SAM-6D, which consists of an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM).

Datasets. We evaluate our proposed SAM-6D on the seven core datasets of the BOP benchmark [54], including LM-O, T-LESS, TUD-L, IC-BIN, ITODD, HB, and YCB-V. PEM is trained on the large-scale synthetic ShapeNet-Objects [4] and Google-Scanned-Objects [9] datasets provided by [28], with a total of 2,000,000 images across $\sim$50,000 objects.

Implementation Details. For ISM, we follow [40] to utilize the default ViT-H SAM [26] or FastSAM [74] for proposal generation, and the default ViT-L model of DINOv2 [44] to extract class and patch embeddings. For PEM, we set $N_m^c = N_o^c = 196$ and $N_m^f = N_o^f = 2048$, and use the InfoNCE loss [43] to supervise the learning of the attention matrices (5) for both matching stages. We use ADAM to train PEM for a total of 600,000 iterations; the learning rate is initialized as 0.0001 with a cosine annealing schedule, and the batch size is set as 28. For each object, we use two rendered templates for training PEM. During evaluation, we follow [40] and use 42 templates for both ISM and PEM.

Evaluation Metrics. For instance segmentation, we report the mean Average Precision (mAP) scores at different Intersection-over-Union (IoU) thresholds ranging from 0.50 to 0.95 with a step size of 0.05. For pose estimation, we report the mean Average Recall (AR) w.r.t. three error functions, i.e., Visible Surface Discrepancy (VSD), Maximum Symmetry-Aware Surface Distance (MSSD) and Maximum Symmetry-Aware Projection Distance (MSPD). For further details about these evaluation metrics, please refer to [54].

| Method | Segmentation Model | $s_{\text{sem}}$ | $s_{\text{appe}}$ | $s_{\text{geo}}$ | LM-O | T-LESS | TUD-L | IC-BIN | ITODD | HB | YCB-V | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZeroPose [5] | SAM [26] | - | - | - | 34.4 | 32.7 | 41.4 | 25.1 | 22.4 | 47.8 | 51.9 | 36.5 |
| CNOS [40] | FastSAM [74] | - | - | - | 39.7 | 37.4 | 48.0 | 27.0 | 25.4 | 51.1 | 59.9 | 41.2 |
| CNOS [40] | SAM [26] | - | - | - | 39.6 | 39.7 | 39.1 | 28.4 | 28.2 | 48.0 | 59.5 | 40.4 |
| SAM-6D (Ours) | FastSAM [74] | ✓ | × | × | 39.5 | 37.6 | 48.7 | 25.7 | 25.3 | 51.2 | 60.2 | 41.2 |
| SAM-6D (Ours) | FastSAM [74] | ✓ | ✓ | × | 40.6 | 39.3 | 50.1 | 27.7 | 29.0 | 52.2 | 60.6 | 42.8 |
| SAM-6D (Ours) | FastSAM [74] | ✓ | × | ✓ | 40.4 | 41.4 | 49.7 | 28.2 | 30.1 | 54.0 | 61.1 | 43.6 |
| SAM-6D (Ours) | FastSAM [74] | ✓ | ✓ | ✓ | 42.2 | 42.0 | 51.7 | 29.3 | 31.9 | 54.8 | 62.1 | 44.9 |
| SAM-6D (Ours) | SAM [26] | ✓ | × | × | 43.4 | 39.1 | 48.2 | 33.3 | 28.8 | 55.1 | 60.3 | 44.0 |
| SAM-6D (Ours) | SAM [26] | ✓ | ✓ | × | 44.4 | 40.8 | 49.8 | 34.5 | 30.0 | 55.7 | 59.5 | 45.0 |
| SAM-6D (Ours) | SAM [26] | ✓ | × | ✓ | 44.0 | 44.7 | 54.8 | 33.8 | 31.5 | 58.3 | 59.9 | 46.7 |
| SAM-6D (Ours) | SAM [26] | ✓ | ✓ | ✓ | 46.0 | 45.1 | 56.9 | 35.7 | 33.2 | 59.3 | 60.5 | 48.1 |

Table 1. Instance segmentation results of different methods on the seven core datasets of the BOP benchmark [54]. We report the mean Average Precision (mAP) scores at different Intersection-over-Union (IoU) values ranging from 0.50 to 0.95 with a step size of 0.05 .
表 1. 不同方法在 BOP 基准[54]七个核心数据集上的实例分割结果。我们报告了在不同交并比(IoU)值(从 0.50 到 0.95,步长为 0.05)下的平均精度均值(mAP)得分。
| Method | Input Type | Detection / Segmentation | LM-O | T-LESS | TUD-L | IC-BIN | ITODD | HB | YCB-V | Mean |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| With Supervised Detection / Segmentation | | | | | | | | | | |
| MegaPose [28] | RGB | MaskRCNN [16] | 18.7 | 19.7 | 20.5 | 15.3 | 8.00 | 18.6 | 13.9 | 16.2 |
| MegaPose† [28] | RGB | MaskRCNN [16] | 53.7 | 62.2 | 58.4 | 43.6 | 30.1 | 72.9 | 60.4 | 54.5 |
| MegaPose† [28] | RGB-D | MaskRCNN [16] | 58.3 | 54.3 | 71.2 | 37.1 | 40.4 | 75.7 | 63.3 | 57.2 |
| ZeroPose [5] | RGB-D | MaskRCNN [16] | 26.1 | 24.3 | 61.1 | 24.7 | 26.4 | 38.2 | 29.5 | 32.6 |
| ZeroPose† [5] | RGB-D | MaskRCNN [16] | 56.2 | 53.3 | 87.2 | 41.8 | 43.6 | 68.2 | 58.4 | 58.4 |
| SAM-6D (Ours) | RGB-D | MaskRCNN [16] | 66.5 | 66.0 | 80.9 | 61.9 | 31.9 | 81.8 | 79.6 | 66.9 |
| With Zero-Shot Detection / Segmentation | | | | | | | | | | |
| ZeroPose [5] | RGB-D | ZeroPose [5] | 26.0 | 17.8 | 41.2 | 17.7 | 38.0 | 43.9 | 25.7 | 25.7 |
| ZeroPose† [5] | RGB-D | ZeroPose [5] | 49.1 | 34.0 | 74.5 | 39.0 | 42.9 | 61.0 | 57.7 | 51.2 |
| SAM-6D (Ours) | RGB-D | ZeroPose [5] | 63.5 | 43.0 | 80.2 | 51.8 | 48.4 | 69.1 | 79.2 | 62.2 |
| MegaPose* [28] | RGB | CNOS (FastSAM) [40] | 22.9 | 17.7 | 25.8 | 15.2 | 10.8 | 25.1 | 28.1 | 20.8 |
| MegaPose†* [28] | RGB | CNOS (FastSAM) [40] | 49.9 | 47.7 | 65.3 | 36.7 | 31.5 | 65.4 | 60.1 | 50.9 |
| MegaPose†* [28] | RGB-D | CNOS (FastSAM) [40] | 62.6 | 48.7 | 85.1 | 46.7 | 46.8 | 73.0 | 76.4 | 62.8 |
| ZeroPose†* [5] | RGB-D | CNOS (FastSAM) [40] | 53.8 | 40.0 | 83.5 | 39.2 | 52.1 | 65.3 | 65.3 | 57.0 |
| GigaPose [41] | RGB | CNOS (FastSAM) [40] | 29.9 | 27.3 | 30.2 | 23.1 | 18.8 | 34.8 | 29.0 | 27.6 |
| GigaPose† [41] | RGB | CNOS (FastSAM) [40] | 59.9 | 57.0 | 63.5 | 46.7 | 39.7 | 72.2 | 66.3 | 57.9 |
| SAM-6D (Ours) | RGB-D | CNOS (FastSAM) [40] | 65.1 | 47.9 | 82.5 | 49.7 | 56.2 | 73.8 | 81.5 | 65.3 |
| SAM-6D (Ours) | RGB-D | SAM-6D (FastSAM) | 66.7 | 48.5 | 82.9 | 51.0 | 57.2 | 73.6 | 83.4 | 66.2 |
| SAM-6D (Ours) | RGB-D | SAM-6D (SAM) | 69.9 | 51.5 | 90.4 | 58.8 | 60.2 | 77.6 | 84.5 | 70.4 |
Table 2. Pose estimation results of different methods on the seven core datasets of the BOP benchmark [54]. We report the mean Average Recall (AR) among VSD, MSSD and MSPD, as introduced in Sec. 4. The symbol '†' denotes the use of the pose refinement proposed in [28]. The symbol '*' denotes results published on the BOP leaderboard. Our used masks of MaskRCNN [16] are provided by CosyPose [27].
表 2. 不同方法在 BOP 基准测试[54]七个核心数据集上的姿态估计结果。我们报告了第 4 节中介绍的 VSD、MSSD 和 MSPD 的平均召回率(AR)。符号'†'表示使用了[28]中提出的姿态优化。符号'*'表示在 BOP 排行榜上发布的结果。我们使用的 MaskRCNN[16]的掩码由 CosyPose[27]提供。

4.1. Instance Segmentation of Novel Objects
4.1. 新颖物体的实例分割

We compare our ISM of SAM-6D with ZeroPose [5] and CNOS [40], both of which score the object proposals solely in terms of semantics, for instance segmentation of novel objects. The quantitative results are presented in Table 1, demonstrating that our ISM, built on the publicly available foundation models of SAM [26] / FastSAM [74] and ViT (pre-trained by DINOv2 [44]), delivers superior results without the need for network re-training or fine-tuning. Note that our baseline with only the semantic matching score $s_{sem}$, whether based on SAM or FastSAM [74], aligns precisely with the method of CNOS; the only difference is that we adjust the hyperparameters of SAM to generate more
我们将 SAM-6D 的 ISM 与 ZeroPose [5]和 CNOS [40]进行了比较,这两者仅基于语义对新颖物体的实例分割中的物体提议进行评分。定量结果如表 1 所示,表明我们的 ISM 基于公开可用的基础模型 SAM [26] / FastSAM [74]和 ViT(由 DINOv2 [44]预训练)构建,无需网络重新训练或微调即可提供更优的结果。注意,我们仅使用语义匹配分数 $s_{sem}$ 的基线,无论是基于 SAM 还是 FastSAM [74],都与 CNOS 方法完全一致;唯一的区别是我们调整了 SAM 的超参数以生成更多的

proposals for scoring. Further enhancements to our baselines are achieved via the inclusion of the appearance and geometry matching scores, i.e., $s_{appe}$ and $s_{geo}$, as verified in Table 1. Qualitative results of ISM are visualized in Fig. 1.
提议进行评分。通过引入外观和几何匹配分数,即 $s_{appe}$ 和 $s_{geo}$,进一步提升了我们的基线性能,这一点在表 1 中得到了验证。ISM 的定性结果可见图 1。
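As a rough illustration of how the three terms can be combined into a single object matching score, the sketch below averages a semantic, an appearance, and a geometric term computed from ViT embeddings. The tensor shapes, the similarity aggregation, and the equal weighting are our assumptions for illustration, not the exact formulation used by ISM.

```python
import torch
import torch.nn.functional as F

def object_matching_score(f_prop_cls, f_tpl_cls, f_prop_patch, f_tpl_patch, s_geo):
    """Illustrative aggregation of the three matching terms used by ISM.
    f_prop_cls:   (D,)  class embedding of the proposal crop
    f_tpl_cls:    (T, D) class embeddings of all object templates
    f_prop_patch: (Np, D) patch embeddings of the proposal crop
    f_tpl_patch:  (Nt, D) patch embeddings of the best-matched template
    s_geo:        scalar geometric matching score (e.g., from bounding boxes)"""
    # semantic term: best cosine similarity against all template class embeddings
    s_sem = F.cosine_similarity(f_tpl_cls, f_prop_cls[None], dim=-1).max()
    # appearance term: average best patch-to-patch similarity with the best template
    sim = F.normalize(f_prop_patch, dim=-1) @ F.normalize(f_tpl_patch, dim=-1).T
    s_appe = sim.max(dim=1).values.mean()
    # simple average of the three terms (assumption; the actual weighting may differ)
    return (s_sem + s_appe + s_geo) / 3.0
```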

4.2. Pose Estimation of Novel Objects
4.2. 新颖物体的姿态估计

4.2.1 Comparisons with Existing Methods
4.2.1 与现有方法的比较

We compare our PEM of SAM-6D with the representative methods, including MegaPose [28], ZeroPose [5], and GigaPose [41], for pose estimation of novel objects. Quantitative comparisons, as presented in Table 2, show that our PEM, without the time-intensive render-based refiner [28], outperforms the existing methods under various mask predictions.
我们将 SAM-6D 的 PEM 与代表性方法进行了比较,包括 MegaPose [28]、ZeroPose [5]和 GigaPose [41],用于新颖物体的姿态估计。表 2 中的定量比较显示,我们的 PEM 在没有耗时的基于渲染的细化器[28]的情况下,在各种掩码预测下均优于现有方法。

Importantly, the mask predictions from our ISM significantly enhance the performance of PEM, compared to other mask predictions, further validating the advantages of ISM. Qualitative results of PEM are visualized in Fig. 1.
重要的是,我们的 ISM 生成的掩码预测相比其他掩码预测显著提升了 PEM 的性能,进一步验证了 ISM 的优势。PEM 的定性结果如图 1 所示。

4.2.2 Ablation Studies and Analyses
4.2.2 消融研究与分析

We conduct ablation studies on the YCB-V dataset to evaluate the efficacy of individual designs in PEM, with the mask predictions generated by ISM based on SAM.
我们在 YCB-V 数据集上进行了消融研究,以评估 PEM 中各个设计的有效性,掩码预测由基于 SAM 的 ISM 生成。

Efficacy of Background Tokens We address the partial-to-partial point matching issue through a simple yet effective design of background tokens. Another existing solution is the use of optimal transport [48] with iterative optimization, which, however, is time-consuming. The two solutions are compared in Table 3, which shows that our PEM with background tokens achieves results comparable to optimal transport, but with a faster inference speed. As the density of points for matching increases, optimal transport requires more time to derive the assignment matrices.
背景标记的有效性 我们通过一种简单而有效的背景标记设计来解决部分到部分的点匹配问题。另一种现有的解决方案是使用带有迭代优化的最优传输[48],但这种方法耗时较长。表 3 中对这两种方案进行了比较,结果显示我们采用背景标记的 PEM 取得了与最优传输相当的效果,但推理速度更快。随着匹配点密度的增加,最优传输需要更多时间来计算分配矩阵。
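To make the contrast concrete, the following sketch shows the basic idea behind matching with background tokens: one extra token is appended to each point set so that non-overlapping points can be absorbed by it, and a soft assignment is obtained with plain softmax normalization instead of iterative Sinkhorn-style optimal transport. The shapes, the temperature, and the normalization details are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def soft_assignment_with_background(feat_p, feat_o, bg_p, bg_o, temperature=0.05):
    """Minimal sketch of partial-to-partial matching with background tokens.
    feat_p: (N, D) features of proposal points; feat_o: (M, D) features of object points.
    bg_p, bg_o: (D,) learnable background tokens prepended to each set (assumed shapes).
    Returns a (N+1, M+1) soft assignment whose first row/column absorbs unmatched points."""
    fp = torch.cat([bg_p[None], feat_p], dim=0)   # (N+1, D)
    fo = torch.cat([bg_o[None], feat_o], dim=0)   # (M+1, D)
    sim = fp @ fo.T / temperature                 # (N+1, M+1) similarity logits
    # softmax along both directions and combine; no iterative optimization is required
    assign = torch.softmax(sim, dim=0) * torch.softmax(sim, dim=1)
    return assign
```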

Efficacy of Two Point Matching Stages With the background tokens, we design PEM with two stages of point matching via a Coarse Point Matching module and a Fine Point Matching module. Firstly, we validate the effectiveness of the Fine Point Matching module, which effectively improves the results of the coarse module, as verified in Table 4. Further, we evaluate the effectiveness of the Coarse Point Matching module by removing it from PEM. In this case, the point sets of object proposals are not transformed and are directly used to learn the positional encodings in the fine module. The results, presented in Table 4, indicate that the removal of Coarse Point Matching significantly degrades the performance, which may be attributed to the large distance between the sampled point sets of the proposals and target objects, as no initial poses are provided.
两阶段点匹配的有效性 基于背景标记,我们设计了包含粗匹配模块和细匹配模块的两阶段点匹配 PEM。首先,我们验证了细匹配模块的有效性,其能够有效提升粗匹配模块的结果,如表 4 所示。进一步地,我们通过移除 PEM 中的粗匹配模块来评估其有效性。在这种情况下,物体提议的点集不进行变换,直接用于细匹配模块中学习位置编码。表 4 中的结果表明,移除粗匹配模块会显著降低性能,这可能是由于提议点集与目标物体之间距离较大,且没有提供初始姿态所致。
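Both stages ultimately turn soft point correspondences into a rigid transform. A minimal sketch of that final step, a weighted SVD (Procrustes) solve over matched 3D points, is given below; it is a generic implementation of the standard closed-form solution under assumed float32 inputs, not the exact code of PEM.

```python
import torch

def pose_from_correspondences(src, dst, weights):
    """Weighted Procrustes/SVD solve: recover R, t such that dst_i ≈ R @ src_i + t.
    src, dst: (N, 3) float32 point sets; weights: (N,) non-negative match confidences."""
    w = (weights / weights.sum().clamp(min=1e-8))[:, None]   # (N, 1) normalized weights
    src_c, dst_c = (w * src).sum(0), (w * dst).sum(0)        # weighted centroids
    H = (src - src_c).T @ (w * (dst - dst_c))                # 3x3 cross-covariance
    U, _, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.det(Vt.T @ U.T))                    # guard against reflections
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, float(d)])) @ U.T
    t = dst_c - R @ src_c
    return R, t
```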

Efficacy of Sparse-to-Dense Point Transformers We design Sparse-to-Dense Point Transformers (SDPT) in the Fine Point Matching module to manage dense point interactions. Within each SDPT, Geometric Transformers [48] is employed to learn the relationships between sparse point sets, which are then spread to the dense ones via Linear Transformers [24]. We conduct experiments on either Geometric Transformers using sparse point sets with 196 points or Linear Transformers using dense point sets with 2048 points. The results, presented in Table 5, indicate inferior performance compared to using our SDPTs. This is because Geometric Transformers struggle to handle dense point sets due to high computational costs, whereas Linear Transformers prove to be ineffective in modeling dense correspondence with attention along the feature dimension.
稀疏到稠密点变换器的有效性 我们在精细点匹配模块中设计了稀疏到稠密点变换器(SDPT)以管理稠密点的交互。在每个 SDPT 中,采用几何变换器[48]来学习稀疏点集之间的关系,然后通过线性变换器[24]将其传播到稠密点集。我们分别对使用 196 个点的稀疏点集的几何变换器和使用 2048 个点的稠密点集的线性变换器进行了实验。结果如表 5 所示,性能均不及使用我们的 SDPT。这是因为几何变换器由于计算成本高昂,难以处理稠密点集,而线性变换器在沿特征维度的注意力机制下对稠密对应关系的建模效果不佳。
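A schematic of this sparse-to-dense pattern is sketched below: full attention runs only on the sparse token set (standing in as a simplified proxy for the Geometric Transformer), and the dense tokens then query the updated sparse tokens through linear cross-attention. The module structure, head count, and feature map are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SparseToDenseBlock(nn.Module):
    """Illustrative sketch of a sparse-to-dense transformer block."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.sparse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, dense, sparse):          # dense: (B, Nd, D), sparse: (B, Ns, D)
        # full self-attention on the small sparse set only (cheap)
        sparse, _ = self.sparse_attn(sparse, sparse, sparse)
        # linear cross-attention: dense queries gather the updated sparse keys/values
        q = torch.relu(self.to_q(dense)) + 1e-6
        k = torch.relu(self.to_k(sparse)) + 1e-6
        v = self.to_v(sparse)
        kv = k.transpose(1, 2) @ v                                   # (B, D, D)
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2))   # (B, Nd, 1)
        return (q @ kv) * z                                          # (B, Nd, D)
```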

4.3. Runtime Analysis  4.3. 运行时间分析

We conduct evaluation on a server with a GeForce RTX 3090 GPU, and report in Table 6 the runtime averaged
我们在配备 GeForce RTX 3090 GPU 的服务器上进行了评测,表 6 中报告了平均运行时间
| Method | AR | Time (s) |
| :--- | :---: | :---: |
| PEM with Optimal Transport | 81.4 | 4.31 |
| PEM with Background Tokens | **84.5** | **1.36** |
Table 3. Quantitative results of Optimal Transport [48] and our design of Background Tokens in the Pose Estimation Model on YCB-V. The reported time is the average per-image processing time of pose estimation across the entire dataset on a server with a GeForce RTX 3090 GPU.
表 3. 在 YCB-V 数据集上,最优传输[48]和我们设计的背景标记在姿态估计模型中的定量结果。报告的时间是使用配备 GeForce RTX 3090 GPU 的服务器对整个数据集进行姿态估计时每张图像的平均处理时间。
| Coarse Point Matching | Fine Point Matching | AR |
| :---: | :---: | :---: |
| ✓ | × | 77.6 |
| × | ✓ | 40.2 |
| ✓ | ✓ | **84.5** |
Table 4. Ablation studies on the strategy of two point matching stages in the Pose Estimation Model on YCB-V.
表 4. 在 YCB-V 数据集上,姿态估计模型中两阶段点匹配策略的消融研究。
| Transformer | #Point | AR |
| :--- | :---: | :---: |
| Geometric Transformer [48] | 196 | 81.7 |
| Linear Transformer [24] | 2048 | 78.4 |
| Sparse-to-Dense Point Transformer | 196 → 2048 | **84.5** |
Table 5. Quantitative comparisons among various types of transformers employed in the Fine Point Matching module of the Pose Estimation Model on YCB-V.
表 5. 在 YCB-V 上姿态估计模型的精细点匹配模块中使用的各种类型变换器的定量比较。
| Segmentation Model | Instance Segmentation (s) | Pose Estimation (s) | All (s) |
| :--- | :---: | :---: | :---: |
| FastSAM [74] | 0.45 | 0.98 | 1.43 |
| SAM [26] | 2.80 | 1.57 | 4.37 |
Table 6. Runtime of SAM-6D with different segmentation models. The reported time is the average per-image processing time across the seven core datasets of BOP benchmark on a server with a GeForce RTX 3090 GPU.
表 6. 使用不同分割模型的 SAM-6D 运行时间。报告的时间是基于配备 GeForce RTX 3090 GPU 的服务器,在 BOP 基准的七个核心数据集上每张图像的平均处理时间。

on the seven core datasets of the BOP benchmark, indicating the efficiency of SAM-6D, which avoids the use of time-intensive render-based refiners. We note that the SAM-based method takes more time on pose estimation than the FastSAM-based one, due to the larger number of object proposals generated by SAM.
在 BOP 基准的七个核心数据集上,表明 SAM-6D 的高效性,因为它避免了使用耗时的基于渲染的细化器。我们注意到,基于 SAM 的方法在姿态估计上比基于 FastSAM 的方法花费更多时间,这是由于 SAM 生成了更多的物体提议。

5. Conclusion  5. 结论

In this paper, we take Segment Anything Model (SAM) as an advanced starting point for zero-shot 6D object pose estimation, and present a novel framework, named SAM-6D, which comprises an Instance Segmentation Model (ISM) and a Pose Estimation Model (PEM) to accomplish the task in two steps. ISM utilizes SAM to segment all potential object proposals and assigns each of them an object matching score in terms of semantics, appearance, and geometry. PEM then predicts the object pose for each proposal by solving a partial-to-partial point matching problem through two stages of Coarse Point Matching and Fine Point Matching. The effectiveness of SAM-6D is validated on the seven core datasets of BOP benchmark, where SAM-6D significantly outperforms existing methods.
本文以 Segment Anything Model(SAM)作为零样本 6D 物体姿态估计的先进起点,提出了一种新颖的框架,命名为 SAM-6D。该框架包含实例分割模型(ISM)和姿态估计模型(PEM),通过两步完成任务。ISM 利用 SAM 分割所有潜在的物体候选区域,并根据语义、外观和几何为每个候选区域分配一个物体匹配分数。随后,PEM 通过两阶段的粗匹配和精匹配,解决部分点到部分点的匹配问题,预测每个候选区域的物体姿态。SAM-6D 在 BOP 基准的七个核心数据集上验证了其有效性,显著优于现有方法。

References  参考文献

[1] Dingding Cai, Janne Heikkilä, and Esa Rahtu. Ove6d: Object viewpoint encoding for depth-based 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 68036813, 2022. 3, 7
[1] Dingding Cai, Janne Heikkilä, and Esa Rahtu. Ove6d: Object viewpoint encoding for depth-based 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6803-6813, 2022. 3, 7

[2] Jun Cen, Yizheng Wu, Kewei Wang, Xingyi Li, Jingkang Yang, Yixuan Pei, Lingdong Kong, Ziwei Liu, and Qifeng Chen. Sad: Segment any rgbd. arXiv preprint arXiv:2305.14207, 2023. 3
[2] Jun Cen, Yizheng Wu, Kewei Wang, Xingyi Li, Jingkang Yang, Yixuan Pei, Lingdong Kong, Ziwei Liu, 和 Qifeng Chen. SAD:分割任意 RGBD。arXiv 预印本 arXiv:2305.14207, 2023. 3

[3] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Xiaopeng Zhang, and Qi Tian. Segment anything in 3d with nerfs. arXiv preprint arXiv:2304.12308, 2023. 3
[3] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Xiaopeng Zhang, 和 Qi Tian. 使用 NeRFs 进行 3D 任意分割。arXiv 预印本 arXiv:2304.12308, 2023. 3

[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 3, 6
[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, 等. ShapeNet:一个信息丰富的 3D 模型库。arXiv 预印本 arXiv:1512.03012, 2015. 3, 6

[5] Jianqiu Chen, Mingshan Sun, Tianpeng Bao, Rui Zhao, Liwei Wu, and Zhenyu He. 3d model-based zero-shot pose estimation pipeline. arXiv preprint arXiv:2305.17934, 2023. 2, 3, 7, 1
[5] Jianqiu Chen, Mingshan Sun, Tianpeng Bao, Rui Zhao, Liwei Wu, 和 Zhenyu He. 基于 3D 模型的零样本姿态估计流程。arXiv 预印本 arXiv:2305.17934, 2023. 2, 3, 7, 1

[6] Jiaqi Chen, Zeyu Yang, and Li Zhang. Semantic segment anything. https://github.com/fudan-zvg/ Semantic-Segment-Anything, 2023. 3
[6] 陈佳琦,杨泽宇,张力。语义分割任意物体。https://github.com/fudan-zvg/Semantic-Segment-Anything,2023。3

[7] Kai Chen and Qi Dou. Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2773-2782, 2021. 2
[7] 陈凯,窦琦。SGPA:用于类别级 6D 物体姿态估计的结构引导先验适应。在 IEEE/CVF 国际计算机视觉会议论文集,页 2773-2782,2021。2

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth $16 \times 16$ words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 4, 5, 2
[8] 阿列克谢·多索维茨基,卢卡斯·拜耶,亚历山大·科列斯尼科夫,迪尔克·魏森博恩,翟晓华,托马斯·安特尔蒂纳,莫斯塔法·德赫加尼,马蒂亚斯·明德勒,乔治·海戈尔德,西尔万·盖利等。一张图片胜过 $16 \times 16$ 个词:大规模图像识别的变换器。arXiv 预印本 arXiv:2010.11929,2020。4,5,2

[9] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A highquality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553-2560. IEEE, 2022. 3, 6
[9] 劳拉·唐斯,安东尼·弗朗西斯,内特·科尼格,布兰登·金曼,瑞安·希克曼,克里斯塔·雷曼,托马斯·B·麦克休,文森特·范霍克。谷歌扫描物体:高质量的 3D 扫描家用物品数据集。在 2022 年国际机器人与自动化会议(ICRA),页 2553-2560。IEEE,2022。3,6

[10] Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference. arXiv preprint arXiv:2305.15727, 2023. 3
[10] 范志文,潘攀旺,王培浩,江一帆,徐德嘉,江汉文,王章扬。Pope:基于单一参考的任意物体、任意场景的 6 自由度可提示姿态估计。arXiv 预印本 arXiv:2305.15727,2023 年。3

[11] Walter Goodwin, Sagar Vaze, Ioannis Havoutis, and Ingmar Posner. Zero-shot category-level object pose estimation. In European Conference on Computer Vision, pages 516-532. Springer, 2022. 3
[11] Walter Goodwin,Sagar Vaze,Ioannis Havoutis,Ingmar Posner。零样本类别级物体姿态估计。载于欧洲计算机视觉会议,页码 516-532。施普林格,2022 年。3

[12] Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 59615971, 2023. 3, 6, 5
[12] 韩东辰,潘旭然,韩一增,宋世吉,黄高。Flatten Transformer:使用聚焦线性注意力的视觉 Transformer。载于 IEEE/CVF 国际计算机视觉会议论文集,页码 5961-5971,2023 年。3,6,5

[13] Dongsheng Han, Chaoning Zhang, Yu Qiao, Maryam Qamar, Yuna Jung, SeungKyu Lee, Sung-Ho Bae, and Choong Seon Hong. Segment anything model (sam) meets
[13] 韩东升,张朝宁,乔宇,Maryam Qamar,郑允娥,SeungKyu Lee,裴成浩,洪忠善。Segment Anything Model (SAM) 结合

glass: Mirror and transparent objects cannot be easily detected. arXiv preprint arXiv:2305.00278, 2023. 3
glass:镜面和透明物体不易被检测。arXiv 预印本 arXiv:2305.00278,2023。3

[14] Rasmus Laurvig Haugaard and Anders Glent Buch. Surfemb: Dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6749-6758, 2022. 6
[14] Rasmus Laurvig Haugaard 和 Anders Glent Buch。Surfemb:用于物体姿态估计的密集且连续的对应分布,基于学习的表面嵌入。在 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 6749-6758,2022。6

[15] Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, and Dacheng Tao. Scalable mask annotation for video text spotting. arXiv preprint arXiv:2305.01443, 2023. 3
[15] Haibin He、Jing Zhang、Mengyang Xu、Juhua Liu、Bo Du 和 Dacheng Tao。可扩展的视频文本检测掩码标注。arXiv 预印本 arXiv:2305.01443,2023。3

[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961-2969, 2017. 7
[16] Kaiming He、Georgia Gkioxari、Piotr Dollár 和 Ross Girshick。Mask R-CNN。在 IEEE 国际计算机视觉会议论文集,页码 2961-2969,2017。7

[17] Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. Onepose++: Keypoint-free oneshot object pose estimation without cad models. Advances in Neural Information Processing Systems, 35:35103-35115, 2022. 3
[17] 何兴义,孙嘉明,王源,黄迪,鲍虎军,周晓伟。Onepose++:无关键点的一次性物体姿态估计,无需 CAD 模型。神经信息处理系统进展,35:35103-35115,2022 年。3

[18] Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11632-11641, 2020. 2
[18] 何一生,孙伟,黄海滨,刘建然,范浩强,孙健。PVN3D:一种用于 6 自由度姿态估计的深度点级 3D 关键点投票网络。IEEE/CVF 计算机视觉与模式识别会议论文集,页 11632-11641,2020 年。2

[19] Yisheng He, Haibin Huang, Haoqiang Fan, Qifeng Chen, and Jian Sun. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3003-3013, 2021. 2
[19] 何一生,黄海滨,范浩强,陈启峰,孙健。FFB6D:一种用于 6D 姿态估计的全流双向融合网络。IEEE/CVF 计算机视觉与模式识别会议论文集,页 3003-3013,2021 年。2

[20] Yisheng He, Yao Wang, Haoqiang Fan, Jian Sun, and Qifeng Chen. Fs6d: Few-shot 6d pose estimation of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6814-6824, 2022. 3
[20] 何一生,王尧,范浩强,孙健,陈启峰。FS6D:新颖物体的少样本 6D 姿态估计。IEEE/CVF 计算机视觉与模式识别会议论文集,页 6814-6824,2022 年。3

[21] Shengyu Huang, Zan Gojcic, Mikhail Usvyatsov, Andreas Wieser, and Konrad Schindler. Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 4267-4276, 2021. 3
[21] 黄胜宇,Zan Gojcic,Mikhail Usvyatsov,Andreas Wieser,Konrad Schindler。Predator:低重叠 3D 点云配准。载于 IEEE/CVF 计算机视觉与模式识别会议论文集,第 4267-4276 页,2021 年。3

[22] Ge-Peng Ji, Deng-Ping Fan, Peng Xu, Ming-Ming Cheng, Bowen Zhou, and Luc Van Gool. Sam struggles in concealed scenes-empirical study on" segment anything". arXiv preprint arXiv:2304.06022, 2023. 3
[22] 纪格鹏,樊登平,徐鹏,程明明,周博文,Luc Van Gool。SAM 在隐蔽场景中的表现不佳——关于“Segment Anything”的实证研究。arXiv 预印本 arXiv:2304.06022,2023 年。3

[23] Wei Ji, Jingjing Li, Qi Bi, Wenbo Li, and Li Cheng. Segment anything is not always perfect: An investigation of sam on different real-world applications. arXiv preprint arXiv:2304.05750, 2023. 3
[23] 纪伟,李晶晶,毕琦,李文博,程力。Segment Anything 并非总是完美的:对 SAM 在不同真实应用中的调查。arXiv 预印本 arXiv:2304.05750,2023 年。3

[24] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156-5165. PMLR, 2020. 3, 6, 8, 5
[24] Angelos Katharopoulos,Apoorv Vyas,Nikolaos Pappas,François Fleuret。Transformers 是 RNN:具有线性注意力的快速自回归 Transformer。载于国际机器学习大会论文集,第 5156-5165 页。PMLR,2020 年。3, 6, 8, 5

[25] Tong Ke and Stergios I Roumeliotis. An efficient algebraic solution to the perspective-three-point problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7225-7233, 2017. 6
[25] Tong Ke 和 Stergios I Roumeliotis. 一种高效的透视三点问题代数解法. 载于 IEEE 计算机视觉与模式识别会议论文集, 页码 7225-7233, 2017. 6

[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment any-
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo 等人。Segment any-

thing. arXiv preprint arXiv:2304.02643, 2023. 2, 3, 4, 6, 7, 8, 1
事物。arXiv 预印本 arXiv:2304.02643,2023 年。 2,3,4,6,7,8,1

[27] Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XVII 16, pages 574-591. Springer, 2020. 7
[27] Yann Labbé、Justin Carpentier、Mathieu Aubry 和 Josef Sivic。Cosypose:一致的多视角多物体 6D 姿态估计。载于《计算机视觉—ECCV 2020:第 16 届欧洲会议,英国格拉斯哥,2020 年 8 月 23-28 日,会议论文集,第十七部分 16》,第 574-591 页。施普林格,2020 年。7

[28] Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022. 2, 3, 6, 7, 5
[28] Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox 和 Josef Sivic. Megapose:通过渲染与比较实现新颖物体的 6D 姿态估计。发表于第六届机器人学习会议(CoRL),2022 年。2, 3, 6, 7, 5

[29] Jiehong Lin, Hongyang Li, Ke Chen, Jiangbo Lu, and Kui Jia. Sparse steerable convolutions: An efficient learning of se (3)-equivariant features for estimation and tracking of object poses in 3d space. Advances in Neural Information Processing Systems, 34:16779-16790, 2021. 2
[29] 林杰宏,李洪阳,陈柯,卢江波,贾奎。稀疏可转向卷积:一种高效学习 SE(3)等变特征的方法,用于 3D 空间中物体姿态的估计与跟踪。神经信息处理系统进展,34 卷:16779-16790,2021 年。2

[30] Jiehong Lin, Zewei Wei, Zhihao Li, Songcen Xu, Kui Jia, and Yuanqing Li. Dualposenet: Category-level 6 d object pose and size estimation using dual pose network with refined learning of pose consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3560-3569, 2021.
[30] 林杰宏,魏泽伟,李志豪,徐松岑,贾奎,李元庆。Dualposenet:利用双重姿态网络及姿态一致性精炼学习实现类别级 6D 物体姿态和尺寸估计。发表于 IEEE/CVF 国际计算机视觉大会,页码 3560-3569,2021 年。

[31] Jiehong Lin, Zewei Wei, Changxing Ding, and Kui Jia. Category-level 6 d object pose and size estimation using selfsupervised deep prior deformation networks. In European Conference on Computer Vision, pages 19-34. Springer, 2022.
[31] 林杰宏,魏泽伟,丁长兴,贾奎。利用自监督深度先验变形网络实现类别级 6D 物体姿态和尺寸估计。发表于欧洲计算机视觉大会,页码 19-34。施普林格,2022 年。

[32] Jiehong Lin, Zewei Wei, Yabin Zhang, and Kui Jia. Vi-net: Boosting category-level 6 d object pose estimation via learning decoupled rotations on the spherical representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14001-14011, 2023. 2
[32] 林杰宏,魏泽伟,张亚斌,贾奎。Vi-net:通过学习球面表示上的解耦旋转提升类别级 6D 物体姿态估计。在 IEEE/CVF 国际计算机视觉会议论文集,页 14001-14011,2023 年。2

[33] Yuan Liu, Yilin Wen, Sida Peng, Cheng Lin, Xiaoxiao Long, Taku Komura, and Wenping Wang. Gen6d: Generalizable model-free 6 -dof object pose estimation from rgb images. In European Conference on Computer Vision, pages 298-315. Springer, 2022. 3
[33] 刘源,温一林,彭思达,林成,龙晓晓,小村拓,王文平。Gen6d:基于 RGB 图像的可泛化无模型 6 自由度物体姿态估计。在欧洲计算机视觉会议,页 298-315。施普林格,2022 年。3

[34] Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, and Chunhua Shen. Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:2305.13310, 2023. 3
[34] 刘洋,朱慕之,李恒涛,陈浩,王新龙,沈春华。Matcher:使用通用特征匹配实现一次性分割所有物体。arXiv 预印本 arXiv:2305.13310,2023 年。3

[35] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li , Jiashuo Yu, et al. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023. 3
[35] 刘朝阳,何一楠,王文海,王伟云,王毅,陈守发,张庆龙,杨洋,李庆云,余嘉硕,等。Internchat:通过与聊天机器人交互解决以视觉为中心的任务,超越语言。arXiv 预印本 arXiv:2305.05662,2023 年。3

[36] Jun Ma and Bo Wang. Segment anything in medical images. arXiv preprint arXiv:2304.12306, 2023. 3
[36] Jun Ma 和 Bo Wang。医学图像中的任意分割。arXiv 预印本 arXiv:2304.12306,2023。3

[37] Maciej A Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, and Yixin Zhang. Segment anything model for medical image analysis: an experimental study. Medical Image Analysis, 89:102918, 2023. 3
[37] Maciej A Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz 和 Yixin Zhang。用于医学图像分析的 Segment Anything 模型:一项实验研究。医学图像分析,89:102918,2023 年。3

[38] Van Nguyen Nguyen, Yinlin Hu, Yang Xiao, Mathieu Salzmann, and Vincent Lepetit. Templates for 3d object pose estimation revisited: Generalization to new objects and robust-
[38] Van Nguyen Nguyen、Yinlin Hu、Yang Xiao、Mathieu Salzmann 和 Vincent Lepetit。3D 物体姿态估计模板再探:对新物体的泛化与鲁棒性—

ness to occlusions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6771-6780, 2022. 3
对遮挡的鲁棒性。载于 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 6771-6780,2022 年。 3

[39] Van Nguyen Nguyen, Thibault Groueix, Yinlin Hu, Mathieu Salzmann, and Vincent Lepetit. Nope: Novel object pose estimation from a single image. arXiv preprint arXiv:2303.13612, 2023. 3
[39] Van Nguyen Nguyen, Thibault Groueix, Yinlin Hu, Mathieu Salzmann 和 Vincent Lepetit. Nope:基于单张图像的新颖物体姿态估计。arXiv 预印本 arXiv:2303.13612,2023 年。3

[40] Van Nguyen Nguyen, Thibault Groueix, Georgy Ponimatkin, Vincent Lepetit, and Tomas Hodan. Cnos: A strong baseline for cad-based novel object segmentation. arXiv preprint arXiv:2307.11067, 2023. 2, 3, 4, 6, 7, 1
[40] Van Nguyen Nguyen, Thibault Groueix, Georgy Ponimatkin, Vincent Lepetit 和 Tomas Hodan. Cnos:基于 CAD 的新颖物体分割的强基线。arXiv 预印本 arXiv:2307.11067,2023 年。2, 3, 4, 6, 7, 1

[41] Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, and Vincent Lepetit. Gigapose: Fast and robust novel object pose estimation via one correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 3, 7
[41] Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann 和 Vincent Lepetit. Gigapose:通过单一对应点实现快速且鲁棒的新颖物体姿态估计。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,2024 年。3, 7

[42] Brian Okorn, Qiao Gu, Martial Hebert, and David Held. Zephyr: Zero-shot pose hypothesis rating. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 14141-14148. IEEE, 2021. 3
[42] Brian Okorn, Qiao Gu, Martial Hebert 和 David Held. Zephyr:零样本姿态假设评分。发表于 2021 年 IEEE 国际机器人与自动化会议(ICRA),第 14141-14148 页。IEEE,2021 年。3

[43] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 6, 5
[43] Aaron van den Oord, Yazhe Li, 和 Oriol Vinyals. 使用对比预测编码的表示学习. arXiv 预印本 arXiv:1807.03748, 2018. 6, 5

[44] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 6, 7, 1
[44] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby 等. Dinov2:无监督学习鲁棒视觉特征. arXiv 预印本 arXiv:2304.07193, 2023. 6, 7, 1

[45] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 4
[45] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby 等. Dinov2:无监督学习鲁棒视觉特征. arXiv 预印本 arXiv:2304.07193, 2023. 4

[46] Panwang Pan, Zhiwen Fan, Brandon Y Feng, Peihao Wang, Chenxin Li, and Zhangyang Wang. Learning to estimate 6dof pose from limited data: A few-shot, generalizable approach using rgb images. arXiv preprint arXiv:2306.07598, 2023. 3
[46] Panwang Pan, Zhiwen Fan, Brandon Y Feng, Peihao Wang, Chenxin Li, 和 Zhangyang Wang. 从有限数据中学习估计 6 自由度姿态:一种使用 RGB 图像的少样本、可泛化方法. arXiv 预印本 arXiv:2306.07598, 2023. 3

[47] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017. 6, 5
[47] Charles Ruizhongtai Qi, Li Yi, Hao Su, 和 Leonidas J Guibas. Pointnet++:在度量空间中对点集进行深度分层特征学习。神经信息处理系统进展,30,2017 年。6,5

[48] Zheng Qin, Hao Yu, Changjian Wang, Yulan Guo, Yuxing Peng, and Kai Xu. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11143-11152, 2022. 3, 6, 8, 5
[48] Zheng Qin, Hao Yu, Changjian Wang, Yulan Guo, Yuxing Peng, 和 Kai Xu. 几何变换器用于快速且鲁棒的点云配准。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 11143-11152,2022 年。3,6,8,5

[49] Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261, 2023. 3
[49] Qiuhong Shen, Xingyi Yang, 和 Xinchao Wang. Anything3d:迈向野外单视图任意物体重建。arXiv 预印本 arXiv:2304.10261,2023 年。3

[50] Ivan Shugurov, Fu Li, Benjamin Busam, and Slobodan Ilic. Osop: A multi-stage one shot object pose estimation framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6835-6844, 2022. 3
[50] Ivan Shugurov, Fu Li, Benjamin Busam, 和 Slobodan Ilic. Osop:一个多阶段一次性物体姿态估计框架。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 6835-6844,2022 年。3

[51] Yongzhi Su, Mahdi Saleh, Torben Fetzer, Jason Rambach, Nassir Navab, Benjamin Busam, Didier Stricker, and Fed-
[51] 苏永志,马赫迪·萨利赫,托本·费策,杰森·兰巴赫,纳西尔·纳瓦布,本杰明·布萨姆,迪迪埃·斯特里克,以及 Fed-

erico Tombari. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6738-6748, 2022. 2
Erico Tombari. Zebrapose:用于 6 自由度物体姿态估计的粗到细表面编码。载于《IEEE/CVF 计算机视觉与模式识别会议论文集》,第 6738-6748 页,2022 年。 2

[52] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8922-8931, 2021. 3
[52] 孙嘉明,沈泽宏,王远,鲍虎军,周晓伟。Loftr:基于变换器的无检测器局部特征匹配。载于《IEEE/CVF 计算机视觉与模式识别会议论文集》,第 8922-8931 页,2021 年。3

[53] Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. Onepose: One-shot object pose estimation without cad models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6825-6834, 2022. 3
[53] 孙嘉明,王子豪,张思宇,何兴义,赵宏成,张国锋,周晓伟。Onepose:无需 CAD 模型的一次性物体姿态估计。载于《IEEE/CVF 计算机视觉与模式识别会议论文集》,第 6825-6834 页,2022 年。3

[54] Martin Sundermeyer, Tomáš Hodaň, Yann Labbe, Gu Wang, Eric Brachmann, Bertram Drost, Carsten Rother, and Jiří Matas. Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2784-2793, 2023. 1, 3, 6, 7, 2, 4, 8
[54] Martin Sundermeyer, Tomáš Hodaň, Yann Labbe, Gu Wang, Eric Brachmann, Bertram Drost, Carsten Rother 和 Jiří Matas. 2022 年 BOP 挑战赛:特定刚性物体的检测、分割与姿态估计。载于《IEEE/CVF 计算机视觉与模式识别会议论文集》,第 2784-2793 页,2023 年。1, 3, 6, 7, 2, 4, 8

[55] Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023. 3
[55] Lv Tang, Haoke Xiao 和 Bo Li. SAM 能分割任何东西吗?当 SAM 遇上伪装物体检测。arXiv 预印本 arXiv:2304.04709,2023 年。3

[56] Meng Tian, Marcelo H Ang, and Gim Hee Lee. Shape prior deformation for categorical 6d object pose and size estimation. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXI 16, pages 530-546. Springer, 2020. 2
[56] Meng Tian, Marcelo H Ang 和 Gim Hee Lee. 类别 6D 物体姿态与尺寸估计的形状先验变形。载于《计算机视觉-ECCV 2020:第 16 届欧洲会议,英国格拉斯哥,2020 年 8 月 23-28 日,论文集,Part XXI 16》,第 530-546 页。施普林格,2020 年。2

[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 6
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser 和 Illia Polosukhin. 注意力机制就是你所需要的。神经信息处理系统进展,30,2017 年。6

[58] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3343-3352, 2019. 2
[58] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, 和 Silvio Savarese. Densefusion:通过迭代密集融合进行 6D 物体姿态估计。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 3343-3352,2019 年。2

[59] Dongqing Wang, Tong Zhang, Alaa Abboud, and Sabine Süsstrunk. Inpaintnerf360: Text-guided 3d inpainting on unbounded neural radiance fields. arXiv preprint arXiv:2305.15094, 2023. 3
[59] Dongqing Wang, Tong Zhang, Alaa Abboud, 和 Sabine Süsstrunk. Inpaintnerf360:基于文本引导的无界神经辐射场三维修复。arXiv 预印本 arXiv:2305.15094,2023 年。3

[60] Gu Wang, Fabian Manhardt, Federico Tombari, and Xiangyang Ji. Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16611-16621, 2021. 2
[60] Gu Wang, Fabian Manhardt, Federico Tombari, 和 Xiangyang Ji. Gdr-net:用于单目 6D 物体姿态估计的几何引导直接回归网络。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 16611-16621,2021 年。2

[61] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26422651, 2019. 2
[61] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, 和 Leonidas J Guibas. 归一化物体坐标空间用于类别级 6D 物体姿态和尺寸估计。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,页码 2642-2651,2019 年。2

[62] Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. Instructedit: Improving automatic masks for diffusionbased image editing with user instructions. arXiv preprint arXiv:2305.18047, 2023. 3
[62] Qian Wang, Biao Zhang, Michael Birsak, 和 Peter Wonka. Instructedit:通过用户指令改进基于扩散的图像编辑的自动掩码。arXiv 预印本 arXiv:2305.18047, 2023. 3

[63] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017. 2
[63] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, 和 Dieter Fox. Posecnn:用于杂乱场景中 6D 物体姿态估计的卷积神经网络。arXiv 预印本 arXiv:1711.00199, 2017. 2

[64] Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, Dong Yang, Fobo Shi, and Xiaodong Lin. Edit everything: A text-guided generative system for images editing. arXiv preprint arXiv:2304.14006, 2023. 3
[64] Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, Dong Yang, Fobo Shi, 和 Xiaodong Lin. Edit everything:一个基于文本指导的图像生成编辑系统。arXiv 预印本 arXiv:2304.14006, 2023. 3

[65] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023. 3
[65] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, 和 Feng Zheng. Track anything:Segment anything 与视频的结合。arXiv 预印本 arXiv:2304.11968, 2023. 3

[66] Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, and Xihui Liu. Sam3d: Segment anything in 3d scenes. arXiv preprint arXiv:2306.03908, 2023. 3
[66] Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, 和 Xihui Liu. Sam3d:三维场景中的任意分割。arXiv 预印本 arXiv:2306.03908, 2023. 3

[67] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023. 3
[67] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, 和 Zhibo Chen. 任意修复:任意分割与图像修复的结合。arXiv 预印本 arXiv:2304.06790, 2023. 3

[68] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289, 2023. 3
[68] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, 和 Choong Seon Hong. 更快的任意分割:面向移动应用的轻量级 SAM。arXiv 预印本 arXiv:2306.14289, 2023. 3

[69] Chaoning Zhang, Sheng Zheng, Chenghao Li, Yu Qiao, Taegoo Kang, Xinru Shan, Chenshuang Zhang, Caiyan Qin, Francois Rameau, Sung-Ho Bae, et al. A survey on segment anything model (sam): Vision foundation model meets prompt engineering. arXiv preprint arXiv:2306.06211, 2023. 3
[69] Chaoning Zhang, Sheng Zheng, Chenghao Li, Yu Qiao, Taegoo Kang, Xinru Shan, Chenshuang Zhang, Caiyan Qin, Francois Rameau, Sung-Ho Bae, 等. 任意分割模型(SAM)综述:视觉基础模型与提示工程的结合。arXiv 预印本 arXiv:2306.06211, 2023. 3

[70] Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, and Xiang Bai. Sam3d: Zero-shot 3d object detection via segment anything model. arXiv preprint arXiv:2306.02245, 2023. 3
[70] 张定远,梁定康,杨宏成,邹志康,叶晓青,刘哲,白翔。Sam3d:通过 Segment Anything 模型实现零样本 3D 目标检测。arXiv 预印本 arXiv:2306.02245,2023 年。3

[71] Haojie Zhang, Yongyi Su, Xun Xu, and Kui Jia. Improving the generalization of segmentation foundation model under distribution shift via weakly supervised adaptation. arXiv preprint arXiv:2312.03502, 2023. 3
[71] 张昊杰,苏永义,徐勋,贾奎。通过弱监督适应提升分割基础模型在分布偏移下的泛化能力。arXiv 预印本 arXiv:2312.03502,2023 年。3

[72] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023. 3
[72] 张仁睿,姜正凯,郭子瑜,严世霖,潘俊廷,董浩,高鹏,李宏升。一次性个性化 Segment Anything 模型。arXiv 预印本 arXiv:2305.03048,2023 年。3

[73] Zhenghao Zhang, Zhichao Wei, Shengfan Zhang, Zuozhuo Dai, and Siyu Zhu. Uvosam: A mask-free paradigm for unsupervised video object segmentation via segment anything model. arXiv preprint arXiv:2305.12659, 2023. 3
[73] 张正浩,魏志超,张胜帆,戴作卓,朱思宇。Uvosam:一种基于 Segment Anything 模型的无掩码无监督视频目标分割范式。arXiv 预印本 arXiv:2305.12659,2023 年。3

[74] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything. arXiv preprint arXiv:2306.12156, 2023. 3, 6, 7, 8, 2
[74] 赵旭,丁文超,安永琪,杜英龙,余涛,李敏,唐明,王金桥。快速分割任何物体。arXiv 预印本 arXiv:2306.12156,2023 年。3,6,7,8,2

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
SAM-6D:分割任何物体模型遇见零样本 6D 物体姿态估计

Supplementary Material  补充材料

CONTENT:  目录:

  • §A. Supplementary Material for Instance Segmentation Model
    §A. 实例分割模型的补充材料
  • §A.1. Visible Ratio for Geometric Matching Score
    §A.1. 几何匹配分数的可见比例
  • §A.2. Template Selection for Object Matching
    §A.2. 物体匹配的模板选择
  • §A.3. Hyperparameter Settings
    §A.3. 超参数设置
  • §A.4. More Quantitative Results
    §A.4. 更多定量结果
  • §A.4.1. Detection Results
    §A.4.1. 检测结果
  • §A.4.2. Effects of Model Sizes
    §A.4.2. 模型规模的影响
  • §A.5. More Qualitative Results
    §A.5. 更多定性结果
  • §A.5.1. Qualitative Comparisons on Appearance Matching Score
    §A.5.1. 外观匹配分数的定性比较
  • §A.5.2. Qualitative Comparisons on Geometric Matching Score
    §A.5.2. 几何匹配分数的定性比较
  • §A.5.3. More Qualitative Comparisons with Existing Methods
    §A.5.3. 与现有方法的更多定性比较
  • §B. Supplementary Material for Pose Estimation Model
    §B. 姿态估计模型的补充材料
  • §B.1. Network Architectures and Specifics
    §B.1. 网络架构与细节
  • §B.1.1. Feature Extraction
    §B.1.1. 特征提取
  • §B.1.2. Coarse Point Matching
    §B.1.2. 粗略点匹配
  • §B.1.3. Fine Point Matching
    §B.1.3. 精细点匹配
  • §B.2. Training Objectives
    §B.2. 训练目标
  • §B.3. More Quantitative Results
    §B.3. 更多定量结果
  • §B.3.1. Effects of The View Number of Templates
    §B.3.1. 模板视角数量的影响
  • §B.3.2. Comparisons with OVE6D
    §B.3.2. 与 OVE6D 的比较
  • §B.4. More Qualitative Comparisons with Existing Methods
    §B.4. 与现有方法的更多定性比较

A. Supplementary Material for Instance Segmentation Model
A. 实例分割模型的补充材料

A.1. Visible Ratio for Geometric Matching Score
A.1. 几何匹配分数的可见比例

In the Instance Segmentation Model (ISM) of our SAM-6D, we introduce a visible ratio $r_{vis}$ to weight the reliability of the geometric matching score $s_{geo}$. Specifically, given an RGB crop $\mathcal{I}_{m}$ of a proposal $m$ and the best-matched template $\mathcal{T}_{\text{best}}$ of the target object $\mathcal{O}$, along with their patch embeddings $\{\boldsymbol{f}_{\mathcal{I}_{m}, j}^{\text{patch}}\}_{j=1}^{N_{\mathcal{I}_{m}}^{\text{patch}}}$ and $\{\boldsymbol{f}_{\mathcal{T}_{\text{best}}, i}^{\text{patch}}\}_{i=1}^{N_{\mathcal{T}_{\text{best}}}^{\text{patch}}}$, $r_{vis}$ is calculated as the ratio of patches in $\mathcal{T}_{\text{best}}$ that can find a corresponding patch in $\mathcal{I}_{m}$, estimating the occlusion degree of $\mathcal{O}$ in $\mathcal{I}_{m}$. We can formulate the calculation of the visible ratio $r_{vis}$ as follows:
在我们 SAM-6D 的实例分割模型(ISM)中,我们引入了一个可见比例 $r_{vis}$ 来加权几何匹配分数 $s_{geo}$ 的可靠性。具体来说,给定一个提议 $m$ 的 RGB 裁剪图像 $\mathcal{I}_{m}$ 和目标物体 $\mathcal{O}$ 的最佳匹配模板 $\mathcal{T}_{\text{best}}$,以及它们的补丁嵌入 $\{\boldsymbol{f}_{\mathcal{I}_{m}, j}^{\text{patch}}\}_{j=1}^{N_{\mathcal{I}_{m}}^{\text{patch}}}$ 和 $\{\boldsymbol{f}_{\mathcal{T}_{\text{best}}, i}^{\text{patch}}\}_{i=1}^{N_{\mathcal{T}_{\text{best}}}^{\text{patch}}}$,可见比例 $r_{vis}$ 被计算为 $\mathcal{T}_{\text{best}}$ 中能够在 $\mathcal{I}_{m}$ 中找到对应补丁的补丁比例,用以估计 $\mathcal{O}$ 在 $\mathcal{I}_{m}$ 中的遮挡程度。可见比例 $r_{vis}$ 的计算公式如下:
$$r_{vis}=\frac{1}{N_{\mathcal{T}_{\text{best}}}^{\text{patch}}}\sum_{i=1}^{N_{\mathcal{T}_{\text{best}}}^{\text{patch}}} r_{vis,i},$$
where 其中
$$r_{vis,i}=\begin{cases}0 & \text{if } s_{vis,i}<\delta_{vis}\\ 1 & \text{if } s_{vis,i}\geq\delta_{vis}\end{cases},$$
and
$$s_{vis,i}=\max_{j=1,\ldots,N_{\mathcal{I}_{m}}^{\text{patch}}}\frac{\left\langle\boldsymbol{f}_{\mathcal{I}_{m}, j}^{\text{patch}},\,\boldsymbol{f}_{\mathcal{T}_{\text{best}}, i}^{\text{patch}}\right\rangle}{\left|\boldsymbol{f}_{\mathcal{I}_{m}, j}^{\text{patch}}\right|\cdot\left|\boldsymbol{f}_{\mathcal{T}_{\text{best}}, i}^{\text{patch}}\right|}.$$
The constant threshold $\delta_{vis}$ is empirically set as 0.5 to determine whether the patches in $\mathcal{T}_{\text{best}}$ are occluded.
常数阈值 $\delta_{vis}$ 经验设定为 0.5,用于判断 $\mathcal{T}_{\text{best}}$ 中的补丁是否被遮挡。
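A direct implementation of the equations above is straightforward; the sketch below computes $r_{vis}$ from the two sets of patch embeddings (the tensor shapes are assumptions for illustration).

```python
import torch
import torch.nn.functional as F

def visible_ratio(f_patch_prop, f_patch_tpl, delta_vis=0.5):
    """Compute r_vis: the fraction of best-template patches whose best cosine similarity
    to any proposal patch reaches delta_vis.
    f_patch_prop: (N_prop, D) patch embeddings of the proposal crop I_m.
    f_patch_tpl:  (N_tpl, D) patch embeddings of the best-matched template T_best."""
    sim = F.normalize(f_patch_tpl, dim=-1) @ F.normalize(f_patch_prop, dim=-1).T  # (N_tpl, N_prop)
    s_vis = sim.max(dim=1).values          # best match in I_m for each template patch
    return (s_vis >= delta_vis).float().mean().item()
```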

A.2. Template Selection for Object Matching
A.2. 用于目标匹配的模板选择

For each given target object, we follow [40] to first sample 42 well-distributed viewpoints defined by the icosphere primitive of Blender. Corresponding to these viewpoints, we select 42 fully visible object templates from the Physically-based Rendering (PBR) training images of the BOP benchmark [54] by cropping regions and masking backgrounds using the ground truth object bounding boxes and masks, respectively. These cropped and masked images then serve as the templates of the target object, which are used to calculate the object matching scores for all generated proposals. It’s noted that these 42 templates can also be directly rendered using the pre-defined viewpoints.
对于每个给定的目标物体,我们遵循[40]的方法,首先采样由 Blender 的 icosphere 原语定义的 42 个分布均匀的视点。对应这些视点,我们从 BOP 基准[54]的基于物理渲染(PBR)的训练图像中选择 42 个完全可见的物体模板,通过使用真实物体的边界框和掩码分别裁剪区域和遮罩背景。这些裁剪和遮罩后的图像随后作为目标物体的模板,用于计算所有生成提议的物体匹配分数。需要注意的是,这 42 个模板也可以直接使用预定义的视点进行渲染。

A.3. Hyperparameter Settings
A.3. 超参数设置

In the paper, we use SAM [26] based on ViT-H or FastSAM based on YOLOv8x as the segmentation model, and ViT-L of DINOv2 [44] as the description model. We utilize the publicly available codes for autonomous segmentation from SAM and FastSAM, with the hyperparameter settings displayed in Table 7.
在本文中,我们使用基于 ViT-H 的 SAM [26]或基于 YOLOv8x 的 FastSAM 作为分割模型,使用 DINOv2 [44]的 ViT-L 作为描述模型。我们利用 SAM 和 FastSAM 的公开代码进行自动分割,超参数设置如表 7 所示。
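For reproducibility, the snippet below sketches how the settings of Table 7 map onto SAM's publicly released automatic mask generator; the checkpoint path and the input image variable are placeholders, and FastSAM's settings (iou, conf, max_det) are passed analogously to its YOLOv8-based predictor.

```python
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # checkpoint path is a placeholder
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,
    pred_iou_thresh=0.88,
    stability_score_thresh=0.85,
    stability_score_offset=1.0,
    box_nms_thresh=0.7,
    crop_n_layers=0,
    point_grids=None,
    min_mask_region_area=0,
)
proposals = mask_generator.generate(rgb_image)  # rgb_image: HxWx3 uint8 array (placeholder)
```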

A.4. More Quantitative Results
A.4. 更多定量结果

A.4.1 Detection Results  A.4.1 检测结果

We compare our Instance Segmentation Model (ISM) with ZeroPose [5] and CNOS [40] in terms of 2D object detection in Table 8, where our ISM outperforms both methods owing to the meticulously crafted design of object matching score.
我们在表 8 中将我们的实例分割模型(ISM)与 ZeroPose [5]和 CNOS [40]在二维目标检测方面进行了比较,得益于精心设计的目标匹配分数,我们的 ISM 优于这两种方法。

A.4.2 Effects of Model Sizes
A.4.2 模型规模的影响

We draw a comparison across different model sizes for both segmentation and description models on YCB-V dataset in Table 9, which indicates a positive correlation between larger model sizes and higher performance for both models.
我们在表 9 中对 YCB-V 数据集上的分割模型和描述模型的不同规模进行了比较,结果表明模型规模越大,两个模型的性能越高,二者呈正相关。
| Hyperparameter | Setting |
| :--- | :---: |
| (a) SAM [26] | |
| points_per_side | 32 |
| pred_iou_thresh | 0.88 |
| stability_score_thresh | 0.85 |
| stability_score_offset | 1.0 |
| box_nms_thresh | 0.7 |
| crop_n_layers | 0 |
| point_grids | None |
| min_mask_region_area | 0 |
| (b) FastSAM [74] | |
| iou | 0.9 |
| conf | 0.05 |
| max_det | 200 |
Table 7. Hyperparameter Settings of (a) SAM [26] and (b) FastSAM [74] in their publicly available codes for autonomous segmentation.
表 7. (a) SAM [26] 和 (b) FastSAM [74] 在其公开代码中用于自动分割的超参数设置。

Figure 4. Qualitative results of our Instance Segmentation Model with or without the appearance matching score $s_{appe}$.
图 4. 我们的实例分割模型在有无外观匹配分数 $s_{appe}$ 情况下的定性结果。

A.5. More Qualitative Results
A.5. 更多定性结果

A.5.1 Qualitative Comparisons on Appearance Matching Score
A.5.1 关于外观匹配分数的定性比较

We visualize the qualitative comparisons of the appearance matching score $s_{appe}$ in Fig. 4 to show its advantages in scoring the proposals w.r.t. a given object in terms of appearance.
我们在图 4 中可视化了外观匹配分数 $s_{appe}$ 的定性比较,以展示其在根据外观对给定物体的提议进行评分方面的优势。

A.5.2 Qualitative Comparisons on Geometric Matching Score
A.5.2 几何匹配分数的定性比较

We visualize the qualitative comparisons of the geometric matching score $s_{geo}$ in Fig. 5 to show its advantages in scoring the proposals w.r.t. a given object in terms of geometry, e.g., object shapes and sizes.
我们在图 5 中可视化了几何匹配分数 $s_{geo}$ 的定性比较,以展示其在根据几何特征(如物体形状和大小)对给定物体的提议进行评分方面的优势。

Figure 5. Qualitative results of our Instance Segmentation Model with or without the geometric matching score $s_{geo}$.
图 5. 我们的实例分割模型在有无几何匹配分数 $s_{geo}$ 情况下的定性结果。

A.5.3 More Qualitative Comparisons with Existing Methods
A.5.3 与现有方法的更多定性比较

To illustrate the advantages of our Instance Segmentation Model (ISM), we visualize in Fig. 6 the qualitative comparisons with CNOS [40] on all the seven core datasets of the BOP benchmark [54] for instance segmentation of novel objects. For reference, we also provide the ground truth masks, except for the ITODD and HB datasets, as their ground truths are not available.
为了展示我们实例分割模型(ISM)的优势,我们在图 6 中对比了 ISM 与 CNOS [40]在 BOP 基准测试[54]的七个核心数据集上对新颖物体实例分割的定性效果。作为参考,我们还提供了真实标签掩码,除了 ITODD 和 HB 数据集,因为它们的真实标签不可用。

B. Supplementary Material for Pose Estimation Model
B. 姿态估计模型的补充材料

B.1. Network Architectures and Specifics
B.1. 网络架构及细节

B.1.1 Feature Extraction
B.1.1 特征提取

In the Pose Estimation Model (PEM) of our SAM-6D, the Feature Extraction module utilizes the base version of the Visual Transformer (ViT) backbone [8], termed as ViTBase, to process masked RGB image crops of observed object proposals or rendered object templates, yielding perpixel feature maps.
在我们 SAM-6D 的姿态估计模型(PEM)中,特征提取模块采用了视觉变换器(ViT)骨干网络的基础版本[8],称为 ViTBase,用于处理观察到的物体提议或渲染的物体模板的掩码 RGB 图像裁剪,生成每像素特征图。
Fig. 7 gives an illustration of the per-pixel feature learning process for an RGB image within the Feature Extraction module. More specifically, given an RGB image of the object, the initial step involves image processing, including masking the background, cropping the region of interest, and resizing it to a fixed resolution of $224 \times 224$. The object mask and bounding box utilized in the process can be sourced from the Instance Segmentation Model (ISM) for the observed scene image or from the renderer for the object template. The processed image is subsequently fed into ViT-Base to extract per-patch features using 12 attention blocks. The patch features from the third, sixth, ninth, and twelfth blocks are subsequently concatenated and passed through a fully-connected layer. They are then reshaped
图 7 展示了特征提取模块中 RGB 图像的每像素特征学习过程。更具体地说,给定物体的 RGB 图像,初始步骤包括图像处理,涵盖背景掩码、感兴趣区域裁剪以及调整为固定分辨率 $224 \times 224$。该过程中使用的物体掩码和边界框可以来自观察场景图像的实例分割模型(ISM)或物体模板的渲染器。处理后的图像随后输入 ViT-Base,利用 12 个注意力块提取每个图像块的特征。来自第 3、第 6、第 9 和第 12 个注意力块的图像块特征随后被拼接,并通过一个全连接层处理,然后被重塑。
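The sketch below mirrors the feature-extraction steps described above (patchify, 12 transformer blocks, concatenate the outputs of blocks 3/6/9/12, project, and reshape into a feature map); it uses freshly initialized layers purely for shape illustration, whereas the actual model relies on a pretrained ViT-Base.

```python
import torch
import torch.nn as nn

class ViTFeatureSketch(nn.Module):
    """Shape-level sketch of the per-pixel feature extraction described above (assumed
    dimensions; not the actual SAM-6D implementation)."""
    def __init__(self, dim=768, out_dim=256, depth=12, patch=16, img=224):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True) for _ in range(depth)
        )
        self.proj = nn.Linear(4 * dim, out_dim)
        self.grid = img // patch                                   # 14 patches per side

    def forward(self, x):                                          # x: (B, 3, 224, 224), masked crop
        tokens = self.patchify(x).flatten(2).transpose(1, 2)       # (B, 196, dim)
        keep = []
        for i, blk in enumerate(self.blocks, start=1):
            tokens = blk(tokens)
            if i in (3, 6, 9, 12):                                 # blocks named in the text
                keep.append(tokens)
        feat = self.proj(torch.cat(keep, dim=-1))                  # (B, 196, out_dim)
        return feat.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
```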
| Method | Segmentation Model | LM-O | T-LESS | TUD-L | IC-BIN | ITODD | HB | YCB-V | Mean |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ZeroPose [5] | SAM [26] | 36.7 | 30.0 | 43.1 | 22.8 | 25.0 | 39.8 | 41.6 | 34.1 |
| CNOS [40] | FastSAM [74] | 43.3 | 39.5 | 53.4 | 22.6 | 32.5 | 51.7 | 56.8 | 42.8 |
| CNOS [40] | SAM [26] | 39.5 | 33.0 | 36.8 | 20.7 | 31.3 | 42.3 | 49.0 | 36.1 |
| SAM-6D | FastSAM [74] | 46.3 | 45.8 | 57.3 | 24.5 | 41.9 | 55.1 | 58.9 | 47.1 |
| SAM-6D | SAM [26] | 46.6 | 43.7 | 53.7 | 26.1 | 39.3 | 53.1 | 51.9 | 44.9 |
Table 8. Object Detection results of different methods on the seven core datasets of the BOP benchmark [54]. We report the mean Average Precision (mAP) scores at different Intersection-over-Union (IoU) values ranging from 0.50 to 0.95 with a step size of 0.05 .
表 8. 不同方法在 BOP 基准测试[54]七个核心数据集上的目标检测结果。我们报告了在不同交并比(IoU)值(从 0.50 到 0.95,步长为 0.05)下的平均精度均值(mAP)分数。
| Segmentation Model | #Param | Description Model | #Param | AP |
| :--- | ---: | :--- | ---: | ---: |
| FastSAM-s | 23 M | ViT-S | 21 M | 43.1 |
| FastSAM-s | 23 M | ViT-L | 300 M | 54.0 |
| FastSAM-x | 138 M | ViT-S | 21 M | 48.9 |
| FastSAM-x | 138 M | ViT-L | 300 M | 62.0 |
| SAM-B | 357 M | ViT-S | 21 M | 44.0 |
| SAM-B | 357 M | ViT-L | 300 M | 55.8 |
| SAM-L | 1,188 M | ViT-S | 21 M | 47.2 |
| SAM-L | 1,188 M | ViT-L | 300 M | 59.8 |
| SAM-H | 2,437 M | ViT-S | 21 M | 47.1 |
| SAM-H | 2,437 M | ViT-L | 300 M | 60.5 |

Table 9. Quantitative comparisons on the model sizes of both segmentation and description models on YCB-V. We report the mean Average Precision (mAP) scores at different Intersection-over-Union (IoU) values ranging from 0.50 to 0.95 with a step size of 0.05.
| Method | Segmentation Model | Server | Time (s) |
| :--- | :--- | :--- | ---: |
| CNOS [40] | FastSAM [74] | Tesla V100 | 0.22 |
| CNOS [40] | FastSAM [74] | GeForce RTX 3090 | 0.23 |
| SAM-6D | FastSAM [74] | GeForce RTX 3090 | 0.45 |
| CNOS [40] | SAM [26] | Tesla V100 | 1.84 |
| CNOS [40] | SAM [26] | GeForce RTX 3090 | 2.35 |
| SAM-6D | SAM [26] | GeForce RTX 3090 | 2.80 |

Table 10. Runtime comparisons of different methods for instance segmentation of novel objects. The reported time is the average per-image processing time across the seven core datasets of the BOP benchmark [54].

They are then reshaped and bilinearly interpolated to match the input resolution of $224 \times 224$ with 256 feature channels. Further specifics about the network can be found in Fig. 7.
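As a concrete illustration of the pipeline above, the following is a minimal sketch of the per-pixel feature extraction, assuming a timm ViT-Base backbone; the class name, the use of timm, and its `get_intermediate_layers` call are illustrative assumptions rather than the released SAM-6D implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm  # hypothetical choice of ViT implementation


class PerPixelFeatureExtractor(nn.Module):
    """Sketch: concatenate patch tokens from blocks 3/6/9/12 of a ViT-Base,
    project them to 256 channels, and upsample back to the 224 x 224 input."""

    def __init__(self, out_dim=256):
        super().__init__()
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True)
        self.proj = nn.Linear(4 * self.vit.embed_dim, out_dim)

    def forward(self, img_masked):                    # (B, 3, 224, 224), background zeroed
        # Patch tokens (without the class token) from the 3rd, 6th, 9th and 12th blocks.
        feats = self.vit.get_intermediate_layers(img_masked, n=[2, 5, 8, 11])
        x = self.proj(torch.cat(feats, dim=-1))       # (B, 196, 256)
        B, N, C = x.shape
        hw = int(N ** 0.5)                            # 14 x 14 patch grid
        x = x.transpose(1, 2).reshape(B, C, hw, hw)
        return F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
```

Concatenating several intermediate blocks rather than only the last one keeps both low-level and high-level cues in the per-pixel descriptors.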
For a cropped observed RGB image, the pixel features within the mask are ultimately chosen to correspond to the point set transformed from the masked depth image. For object templates, the pixels within the masks across views are finally aggregated, with the surface point of each pixel known from the renderer. Both point sets of the proposal and the target object are normalized to fit a unit sphere by dividing by the object scale, effectively addressing the variations in object scales.
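A minimal sketch of this normalization step is given below; the centering by the point-set mean and the fallback scale estimate are assumptions, since the text only specifies division by the object scale.

```python
import torch


def normalize_to_unit_sphere(points, radius=None):
    """Normalize an (N, 3) point set to fit a unit sphere.

    `radius` is the object scale (e.g., half of the model diameter); if it is
    not provided, it is estimated from the points themselves.
    """
    center = points.mean(dim=0, keepdim=True)        # assumed centering step
    centered = points - center
    if radius is None:
        radius = centered.norm(dim=1).max().clamp(min=1e-6)
    return centered / radius
```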
We use two views of object templates for training, and 42 views for evaluation, following CNOS [40]; this is the standard setting for the results reported in this paper.

B.1.2 Coarse Point Matching

In the Coarse Point Matching module, we utilize $T^{c}$ Geometric Transformers [48] to model the relationships between the sparse point set $\mathcal{P}_{m}^{c} \in \mathbb{R}^{N_{m}^{c} \times 3}$ of the observed object proposal $m$ and the set $\mathcal{P}_{o}^{c} \in \mathbb{R}^{N_{o}^{c} \times 3}$ of the target object $\mathcal{O}$. Their respective features $\boldsymbol{F}_{m}^{c}$ and $\boldsymbol{F}_{o}^{c}$ are thus improved to their enhanced versions $\tilde{\boldsymbol{F}}_{m}^{c}$ and $\tilde{\boldsymbol{F}}_{o}^{c}$. Each of these enhanced feature maps also includes the background token. An additional fully-connected layer is applied to the features both before and after the transformers. In this paper, we use the superscript 'c' to indicate variables associated with the Coarse Point Matching module, and the subscripts 'm' and 'o' to distinguish between the proposal and the object.
During inference, we compute the soft assignment matrix $\tilde{\mathcal{A}}^{c} \in \mathbb{R}^{(N_{m}^{c}+1) \times (N_{o}^{c}+1)}$, and obtain two binary-value matrices $\boldsymbol{M}_{m}^{c} \in \mathbb{R}^{N_{m}^{c} \times 1}$ and $\boldsymbol{M}_{o}^{c} \in \mathbb{R}^{N_{o}^{c} \times 1}$, denoting whether the points in $\mathcal{P}_{m}^{c}$ and $\mathcal{P}_{o}^{c}$ correspond to the background, owing to the design of background tokens; '0' indicates correspondence to the background, while '1' indicates otherwise. We then have the probabilities $\boldsymbol{P}^{c} \in \mathbb{R}^{N_{m}^{c} \times N_{o}^{c}}$ to indicate the matching degree of the $N_{m}^{c} \times N_{o}^{c}$ point pairs between $\mathcal{P}_{m}^{c}$ and $\mathcal{P}_{o}^{c}$, formulated as follows:

$$\boldsymbol{P}^{c} = \boldsymbol{M}_{m}^{c} \cdot \left(\tilde{\mathcal{A}}^{c}[1:, 1:]\right)^{\gamma} \cdot {\boldsymbol{M}_{o}^{c}}^{T},$$

where $\gamma$ is used to sharpen the probabilities and is set as 1.5. The probabilities of points that have no correspondence, whether in $\mathcal{P}_{m}^{c}$ or $\mathcal{P}_{o}^{c}$, are all set to 0. Following this, the probabilities $\boldsymbol{P}^{c}$ are normalized to ensure their sum equals 1, and act as weights used to randomly select 6,000 triplets of point pairs from the total pool of $N_{m}^{c} \times N_{o}^{c}$ pairs.
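The sketch below shows one way the probabilities $\boldsymbol{P}^{c}$ could be computed from the soft assignment matrix; treating a point as background whenever the background token is its best match is an assumption, as is the function name.

```python
import torch


def coarse_matching_probabilities(attn_soft, gamma=1.5):
    """Derive P^c from an (N_m + 1) x (N_o + 1) soft assignment matrix that
    includes background tokens in its first row and first column."""
    # Foreground masks: a point is foreground if its best match is not the background token.
    m_fg = (attn_soft[1:, :].argmax(dim=1) != 0).float()   # (N_m,)
    o_fg = (attn_soft[:, 1:].argmax(dim=0) != 0).float()   # (N_o,)
    probs = attn_soft[1:, 1:].clamp(min=0) ** gamma        # sharpen with gamma
    probs = probs * m_fg[:, None] * o_fg[None, :]          # zero out background points
    total = probs.sum()
    return probs / total if total > 0 else probs           # normalize to sum to 1
```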

Figure 6. Qualitative results on the seven core datasets of the BOP benchmark [54] for instance segmentation of novel objects.

Figure 7. An illustration of the per-pixel feature learning process for an RGB image within the Feature Extraction module of the Pose Estimation Model.

Figure 8. An illustration of the positional encoding for a point set with $N$ points within the Fine Point Matching module of the Pose Estimation Model.

Each triplet, which consists of three point pairs, is utilized to calculate a pose using SVD, along with a distance between the point pairs based on the computed pose. Through this procedure, a total of 6,000 pose hypotheses are generated, and to minimize computational cost, only the 300 poses with the smallest point pair distances are selected. Finally, the initial pose for the Fine Point Matching module is determined from these 300 poses, with the pose that has the highest pose matching score being selected.
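The following sketch illustrates the triplet sampling and SVD-based hypothesis generation described above; the exact definition of the point-pair distance and the batched Kabsch formulation are assumptions, and the final selection by pose matching score is not included.

```python
import torch


def pose_hypotheses_from_triplets(p_m, p_o, probs, n_hypo=6000, n_keep=300):
    """p_m: (N_m, 3) proposal points; p_o: (N_o, 3) object points;
    probs: (N_m, N_o) normalized matching probabilities used as sampling weights.
    Returns the n_keep (R, t) hypotheses with the smallest point-pair distances."""
    flat = probs.flatten()
    idx = torch.multinomial(flat, n_hypo * 3, replacement=True)    # 3 pairs per triplet
    im, io = idx // probs.shape[1], idx % probs.shape[1]
    src = p_o[io].view(n_hypo, 3, 3)                               # object-frame points
    tgt = p_m[im].view(n_hypo, 3, 3)                               # camera-frame points
    # Batched Kabsch/SVD fit of R, t such that R @ src + t ~= tgt.
    cs, ct = src.mean(dim=1, keepdim=True), tgt.mean(dim=1, keepdim=True)
    H = (src - cs).transpose(1, 2) @ (tgt - ct)                    # (n_hypo, 3, 3)
    U, S, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.det(Vt.transpose(1, 2) @ U.transpose(1, 2)))
    D = torch.diag_embed(torch.stack([torch.ones_like(d), torch.ones_like(d), d], dim=-1))
    R = Vt.transpose(1, 2) @ D @ U.transpose(1, 2)
    t = ct.squeeze(1) - (R @ cs.transpose(1, 2)).squeeze(-1)
    # Assumed distance: mean residual of each triplet's own point pairs under its pose.
    dist = (tgt - (src @ R.transpose(1, 2) + t[:, None, :])).norm(dim=-1).mean(dim=1)
    keep = dist.topk(n_keep, largest=False).indices
    return R[keep], t[keep]
```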
In the Coarse Point Matching module, we set $T^{c} = 3$ and $N_{m}^{c} = N_{o}^{c} = 196$, with all the feature channels designated as 256. The configurations of the Geometric Transformers adhere to those used in [48].

B.1.3 Fine Point Matching

In the Fine Point Matching module, we utilize $T^{f}$ Sparse-to-Dense Point Transformers to model the relationships between the dense point set $\mathcal{P}_{m}^{f} \in \mathbb{R}^{N_{m}^{f} \times 3}$ of the observed object proposal $m$ and the set $\mathcal{P}_{o}^{f} \in \mathbb{R}^{N_{o}^{f} \times 3}$ of the target object $\mathcal{O}$. Their respective features $\boldsymbol{F}_{m}^{f}$ and $\boldsymbol{F}_{o}^{f}$ are thus improved to their enhanced versions $\tilde{\boldsymbol{F}}_{m}^{f}$ and $\tilde{\boldsymbol{F}}_{o}^{f}$. Each of these enhanced feature maps also includes the background token. An additional fully-connected layer is applied to the features both before and after the transformers. We use the superscript 'f' to indicate variables associated with the Fine Point Matching module, and the subscripts 'm' and 'o' to distinguish between the proposal and the object.
Different from the coarse module, we condition both features $\boldsymbol{F}_{m}^{f}$ and $\boldsymbol{F}_{o}^{f}$ before applying them to the transformers by adding their respective positional encodings, which are learned via a multi-scale Set Abstraction Level [47] from $\mathcal{P}_{m}^{f}$ transformed by the initial pose and from $\mathcal{P}_{o}^{f}$ without transformation, respectively. The architecture used for positional encoding learning is illustrated in Fig. 8. For more details, one can refer to [47].
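For illustration, the sketch below builds a simplified multi-scale, set-abstraction-style positional encoder using k-nearest-neighbor grouping; the actual module follows the multi-scale Set Abstraction levels of [47] (e.g., with ball queries), so the grouping rule, neighborhood sizes, and layer widths here are assumptions.

```python
import torch
import torch.nn as nn


class SimplePointPositionalEncoder(nn.Module):
    """Simplified multi-scale positional encoder for an (B, N, 3) point set."""

    def __init__(self, out_dim=256, ks=(8, 16), hidden=64):
        super().__init__()
        self.ks = ks
        # One shared MLP per scale, applied to relative neighbor coordinates.
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for _ in ks
        )
        self.proj = nn.Linear(hidden * len(ks), out_dim)

    def forward(self, pts):                                          # (B, N, 3)
        feats = []
        dist = torch.cdist(pts, pts)                                 # (B, N, N)
        for k, mlp in zip(self.ks, self.mlps):
            idx = dist.topk(k, dim=-1, largest=False).indices        # (B, N, k)
            nbrs = torch.gather(
                pts.unsqueeze(1).expand(-1, pts.shape[1], -1, -1),   # (B, N, N, 3)
                2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))          # (B, N, k, 3)
            rel = nbrs - pts.unsqueeze(2)                            # relative coordinates
            feats.append(mlp(rel).max(dim=2).values)                 # pool over neighbors
        return self.proj(torch.cat(feats, dim=-1))                   # (B, N, out_dim)
```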
Another difference from the coarse module is the type of transformers used. To handle dense relationships, we design the Sparse-to-Dense Point Transformers, which utilize Geometric Transformers [48] to process sparse point sets and disseminate information to dense point sets via Linear Cross-attention layers [12, 24]. The configurations of the Geometric Transformers adhere to those used in [48]; the point numbers of the sampled sparse point sets are all set as 196. The Linear Cross-attention layer enables attention along the feature dimension, and details of its architecture can be found in Fig. 9; for more details, one can refer to [12, 24].
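The sketch below gives a linear (efficient) cross-attention layer in the spirit of [12, 24], where softmax normalization is applied separately to the queries (over features) and to the keys (over points) so that only a $d \times d$ context matrix is formed; the head count and exact normalization are assumptions.

```python
import torch
import torch.nn as nn


class LinearCrossAttention(nn.Module):
    """Linear cross-attention: cost scales with feature dimension, not N_q x N_kv."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dh = heads, dim // heads
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)

    def forward(self, x_q, x_kv):                         # (B, Nq, dim), (B, Nkv, dim)
        B, Nq, _ = x_q.shape
        q = self.to_q(x_q).view(B, Nq, self.heads, self.dh).transpose(1, 2)
        k = self.to_k(x_kv).view(B, -1, self.heads, self.dh).transpose(1, 2)
        v = self.to_v(x_kv).view(B, -1, self.heads, self.dh).transpose(1, 2)
        q = q.softmax(dim=-1)                             # normalize queries over features
        k = k.softmax(dim=-2)                             # normalize keys over points
        context = k.transpose(-2, -1) @ v                 # (B, H, dh, dh)
        out = (q @ context).transpose(1, 2).reshape(B, Nq, self.heads * self.dh)
        return self.out(out)
```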
During inference, similar to the coarse module, we compute the soft assignment matrix $\tilde{\mathcal{A}}^{f} \in \mathbb{R}^{(N_{m}^{f}+1) \times (N_{o}^{f}+1)}$, and obtain two binary-value matrices $\boldsymbol{M}_{m}^{f} \in \mathbb{R}^{N_{m}^{f} \times 1}$ and $\boldsymbol{M}_{o}^{f} \in \mathbb{R}^{N_{o}^{f} \times 1}$. We then formulate the probabilities $\boldsymbol{P}^{f} \in \mathbb{R}^{N_{m}^{f} \times N_{o}^{f}}$ as follows:

$$\boldsymbol{P}^{f} = \boldsymbol{M}_{m}^{f} \cdot \left(\tilde{\mathcal{A}}^{f}[1:, 1:]\right) \cdot {\boldsymbol{M}_{o}^{f}}^{T}.$$

Based on $\boldsymbol{P}^{f}$, we search for the best-matched point in $\mathcal{P}_{o}^{f}$ for each point in $\mathcal{P}_{m}^{f}$, assigned with the matching probability. The final object pose is then calculated using a weighted SVD, with the matching probabilities of the point pairs serving as the weights.
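A minimal sketch of the weighted SVD step is given below, assuming the matched point pairs and their matching probabilities have already been gathered; the reflection handling follows the standard Kabsch procedure.

```python
import torch


def weighted_svd_pose(p_src, p_tgt, w):
    """Weighted Kabsch/SVD fit of (R, t) such that R @ p_src + t ~= p_tgt.

    p_src, p_tgt: (N, 3) matched points; w: (N,) matching probabilities."""
    w = w / (w.sum() + 1e-8)
    c_src = (w[:, None] * p_src).sum(0)             # weighted centroids
    c_tgt = (w[:, None] * p_tgt).sum(0)
    src_c, tgt_c = p_src - c_src, p_tgt - c_tgt
    H = (w[:, None] * src_c).T @ tgt_c              # 3 x 3 weighted covariance
    U, S, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.det(Vt.T @ U.T))           # handle reflections
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = Vt.T @ D @ U.T
    t = c_tgt - R @ c_src
    return R, t
```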
Besides, we set $T^{f} = 3$ and $N_{m}^{f} = N_{o}^{f} = 2{,}048$, with all the feature channels designated as 256. During training, we follow [28] to obtain the initial object poses by augmenting the ground truth ones with random noises.

B.2. Training Objectives

We use InfoNCE loss [43] to supervise the learning of attention matrices for both coarse and fine modules. Specifically, given two point sets $\mathcal{P}_{m} \in \mathbb{R}^{N_{m} \times 3}$ and $\mathcal{P}_{o} \in \mathbb{R}^{N_{o} \times 3}$, along with their enhanced features $\tilde{\boldsymbol{F}}_{m}$ and $\tilde{\boldsymbol{F}}_{o}$, which are

Figure 9. Left: The structure of Linear Cross-attention layer. Right: The structure of Linear Cross-attention.

learnt via the transformers and equipped with background tokens, we compute the attention matrix $\mathcal{A} = \tilde{\boldsymbol{F}}_{m} \times \tilde{\boldsymbol{F}}_{o}^{T} \in \mathbb{R}^{(N_{m}+1) \times (N_{o}+1)}$. Then $\mathcal{A}$ can be supervised by the following objective:

$$\mathcal{L} = \operatorname{CE}\left(\mathcal{A}[1:, :], \hat{\mathcal{Y}}_{m}\right) + \operatorname{CE}\left(\mathcal{A}[:, 1:]^{T}, \hat{\mathcal{Y}}_{o}\right), \tag{12}$$

where $\operatorname{CE}(\cdot, \cdot)$ denotes the cross-entropy loss function. $\hat{\mathcal{Y}}_{m} \in \mathbb{R}^{N_{m}}$ and $\hat{\mathcal{Y}}_{o} \in \mathbb{R}^{N_{o}}$ denote the ground truths for $\mathcal{P}_{m}$ and $\mathcal{P}_{o}$. Given the ground truth pose $\hat{\boldsymbol{R}}$ and $\hat{\boldsymbol{t}}$, each element $y_{m}$ in $\hat{\mathcal{Y}}_{m}$, corresponding to the point $\boldsymbol{p}_{m}$ in $\mathcal{P}_{m}$, could be obtained as follows:

$$y_{m} = \begin{cases} 0 & \text{if } d_{k^{*}} \geq \delta_{dis} \\ k^{*} & \text{if } d_{k^{*}} < \delta_{dis} \end{cases}$$

where

$$k^{*} = \operatorname{Argmin}_{k=1, \ldots, N_{o}} \left\| \hat{\boldsymbol{R}}\left(\boldsymbol{p}_{m} - \hat{\boldsymbol{t}}\right) - \boldsymbol{p}_{o, k} \right\|_{2}$$

and

$$d_{k^{*}} = \left\| \hat{\boldsymbol{R}}\left(\boldsymbol{p}_{m} - \hat{\boldsymbol{t}}\right) - \boldsymbol{p}_{o, k^{*}} \right\|_{2}.$$

Here, $k^{*}$ is the index of the closest point $\boldsymbol{p}_{o, k^{*}}$ in $\mathcal{P}_{o}$ to $\boldsymbol{p}_{m}$, while $d_{k^{*}}$ denotes the distance between $\boldsymbol{p}_{m}$ and $\boldsymbol{p}_{o, k^{*}}$ in the object coordinate system. $\delta_{dis}$ is a distance threshold determining whether the point $\boldsymbol{p}_{m}$ has a correspondence in $\mathcal{P}_{o}$; we set $\delta_{dis}$ as a constant 0.15, since both $\mathcal{P}_{m}$ and $\mathcal{P}_{o}$ are normalized to a unit sphere. The elements in $\hat{\mathcal{Y}}_{o}$ are also generated in a similar way.
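The sketch below shows how the ground-truth labels $\hat{\mathcal{Y}}_{m}$ could be assembled in code, following the formulas above; shifting to 1-based indices so that label 0 is reserved for points without correspondence (i.e., the background token) is an implementation assumption.

```python
import torch


def correspondence_labels(p_m, p_o, R_gt, t_gt, delta_dis=0.15):
    """p_m: (N_m, 3) proposal points; p_o: (N_o, 3) normalized object points;
    R_gt, t_gt: ground-truth pose. Returns labels in {0, 1, ..., N_o}."""
    # Map proposal points into the object coordinate system, R_gt @ (p - t_gt) per point.
    p_m_obj = (p_m - t_gt) @ R_gt.T
    d = torch.cdist(p_m_obj, p_o)            # (N_m, N_o) pairwise distances
    d_min, k_star = d.min(dim=1)             # closest object point per proposal point
    labels = k_star + 1                      # shift so that 0 can mean "no correspondence"
    labels[d_min >= delta_dis] = 0
    return labels
```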
We employ the objective (12) upon all the transformer blocks of both the coarse and fine point matching modules, and thus optimize the Pose Estimation Model by solving the following problem:

$$\min \sum_{l=1, \ldots, T^{c}} \mathcal{L}_{l}^{c} + \sum_{l=1, \ldots, T^{f}} \mathcal{L}_{l}^{f},$$

where for the loss $\mathcal{L}$ in Eq. (12), we use the superscripts 'c' and 'f' to distinguish between the losses in the coarse and fine point matching modules, respectively, while the subscript 'l' indexes the transformer blocks in each module.
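For illustration, a per-block matching loss consistent with Eq. (12) and the label convention above could look as follows; realizing the InfoNCE objective as a plain cross-entropy over the background-augmented attention rows and columns is an assumption.

```python
import torch
import torch.nn.functional as F


def attention_matching_loss(attn, y_m, y_o):
    """attn: (N_m + 1, N_o + 1) attention matrix including background tokens;
    y_m: (N_m,) labels in {0, ..., N_o}; y_o: (N_o,) labels in {0, ..., N_m};
    label 0 marks points without a correspondence."""
    loss_m = F.cross_entropy(attn[1:, :], y_m)      # rows of observed points
    loss_o = F.cross_entropy(attn[:, 1:].t(), y_o)  # columns of object points
    return loss_m + loss_o
```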

B.3. More Quantitative Results

B.3.1 Effects of The View Number of Templates

We present a comparison of results using different views of object templates in Table 11. As shown in the table, results with only one template perform poorly as a single view cannot fully depict the entire object. With an increase in the number of views, performance improves. For consistency with our Instance Segmentation Model and CNOS [40], we utilize 42 views of templates as the default setting in the main paper.
| # Views | 1 | 2 | 8 | 16 | 42 |
| :--- | ---: | ---: | ---: | ---: | ---: |
| AR | 21.8 | 62.7 | 83.9 | 84.1 | 84.5 |

Table 11. Pose estimation results with different view numbers of object templates on YCB-V. We report the mean Average Recall (AR) among VSD, MSSD and MSPD.

B.3.2 Comparisons with OVE6D

OVE6D [1] is a classical method for zero-shot pose estimation based on image matching, which first constructs a codebook from the object templates for viewpoint rotation retrieval and subsequently regresses the in-plane rotation. When comparing our SAM-6D with OVE6D using their provided segmentation masks (as shown in Table 12), SAM-6D outperforms OVE6D on the LM-O dataset, without the need for the Iterative Closest Point (ICP) algorithm for post-optimization.
| Method | LM-O |
| :--- | :---: |
| OVE6D [1] | 56.1 |
| OVE6D with ICP [1] | 72.8 |
| SAM-6D (Ours) | **74.7** |

Table 12. Quantitative results of OVE6D [1] and our SAM-6D on LM-O dataset. The evaluation metric is the standard ADD(-S) for pose estimation. SAM-6D is evaluated with the same masks provided by [1].

B.4. More Qualitative Comparisons with Existing Methods

To illustrate the advantages of our Pose Estimation Model (PEM), we visualize in Fig. 10 the qualitative comparisons with MegaPose [28] on all the seven core datasets of the BOP benchmark [54] for pose estimation of novel objects. For reference, we also present the corresponding ground truths, barring those for the ITODD and HB datasets, as these are unavailable.

Figure 10. Qualitative results on the seven core datasets of the BOP benchmark [54] for pose estimation of novel objects.

• Equal contribution. † Corresponding author: kuijia@gmail.com.