

1 Nanjing University    2 China Mobile (Suzhou) Software Technology Co., Ltd.    3 Shanghai AI Lab

Multiple Object Tracking as ID Prediction

Ruopeng Gao 1    Yijun Zhang 2    Limin Wang 🖂 1,3
Abstract

In Multiple Object Tracking (MOT), tracking-by-detection methods have stood the test of time, splitting the process into two parts according to the definition: object detection and association. They leverage robust single-frame detectors and treat object association as a post-processing step through hand-crafted heuristic algorithms and surrogate tasks. However, the nature of heuristic techniques prevents end-to-end exploitation of training data, leading to increasingly cumbersome and challenging manual modifications when facing complicated or novel scenarios. In this paper, we regard this object association task as an end-to-end in-context ID prediction problem and propose a streamlined baseline called MOTIP. Specifically, we form the target embeddings into historical trajectory information while considering the corresponding IDs as in-context prompts, then directly predict the ID labels for the objects in the current frame. Thanks to this end-to-end process, MOTIP can learn tracking capabilities straight from training data, freeing itself from burdensome hand-crafted algorithms. Without bells and whistles, our method achieves impressive state-of-the-art performance in complex scenarios like DanceTrack and SportsMOT, and it performs competitively with other transformer-based methods on MOT17. We believe that MOTIP demonstrates remarkable potential and can serve as a starting point for future research. The code is available at https://github.com/MCG-NJU/MOTIP.

Keywords:
Multiple Object Tracking · Tracking · End-to-End
🖂 : Corresponding author (lmwang@nju.edu.cn).

1 Introduction

Multi-Object Tracking (MOT) aims to locate the objects of interest in each frame and assign their corresponding identities during the whole video sequence. As a fundamental vision task, it can be utilized for numerous downstream tasks ranging from action recognition [9] to trajectory prediction [16]. Furthermore, it’s also noteworthy in many real-world applications, such as autonomous driving, surveillance, and so on.

In the historical development of MOT, the tracking-by-detection paradigm [4, 47, 40, 2, 44] has emerged as the shining star. Based on the problem definition, these methods naturally break down the process into two sub-tasks: object detection and association. They employ single-frame detectors to detect all targets in the current frame and subsequently apply post-processing algorithms to associate these detections with historical trajectories, thereby achieving online object tracking, as shown in Fig. 1(a). Precisely, heuristic methods are often employed to select the optimal solution for ID assignment. They [4, 47, 6] primarily rely on motion information, utilizing the Kalman Filter [38] for linear motion estimation and computing the Intersection-over-Union (IoU) cost matrix. Some methods [48, 39, 1] additionally incorporate Re-ID features and employ cosine similarity for similarity calculation. Although these methods have achieved commendable results, their excessive reliance on manual design and human-crafted prior assumptions remains a concern. For instance, the Kalman Filter, relying on linear motion assumptions, struggles to fit target trajectories accurately under intricate motion patterns. Similarly, cosine similarity, based on the assumption of linearly separable features, may be unreliable when facing indistinguishable appearances. Therefore, in the face of complicated or novel situations, tracking-by-detection methods always require upgrading heuristic algorithms through human-driven analysis. This practice may lead to increasingly bloated and intricate codebases, potentially missing optimal tracking strategies for specific scenarios.

Refer to caption
(a) Tracking-by-Detection
Refer to caption
(b) Tracking-by-Query
Refer to caption
(c) Ours
Figure 1: The illustration of various Multi-Object Tracking (MOT) pipelines. The tracking-by-detection paradigm (Fig. 1(a)) utilizes surrogate tasks to calculate the cost matrix for ID matching. Tracking-by-query methods (Fig. 1(b)) use track queries to represent tracked targets and propagate them frame-by-frame. Ours (Fig. 1(c)) directly predicts the ID labels for current detections.

In the new era, numerous studies [46, 26, 35] have extended DETR [7, 52] to multi-object tracking. Tracked objects are represented as queries, which are propagated frame-by-frame and used to regress the object positions, as illustrated in Fig. 1(b). Aided by the end-to-end process, these methods [46, 13] can directly learn the necessary tracking strategies from the training data, allowing them to gain an advantage in complex scenarios (like DanceTrack [34]) through straightforward design principles. In spite of achieving remarkable performance, they still remain imperfect in some aspects. For example, the tracking-by-query paradigm necessitates frame-by-frame processing during the training phase, akin to how recurrent neural networks (RNNs) operate. This nature hinders efficient long-term training, leading to a discrepancy between training and inference, ultimately constraining the model’s performance. Furthermore, simultaneously handling detect and track queries may lead to conflicts and mutual inhibition [42, 45], resulting in a decline in comprehensive performance.

Our paper introduces a fresh perspective for formulating the MOT task by treating it as an end-to-end ID prediction problem. Specifically, given the historical trajectories of all tracked targets, the model is responsible for detecting all objects in the current frame and directly predicting their ID labels. In practice, we opted for DETR [52, 7] as our detector because it directly provides embeddings for each object, allowing us to construct historical trajectories without the need to consider various feature extraction techniques like Region-of-Interest (RoI), hierarchical, pooling, etc. To represent the identities of different tracked targets, we have constructed a learnable ID dictionary. Based on this, we concatenate the object embedding with its corresponding ID embedding to form historical trajectory information. We employ a straightforward transformer decoder architecture for the critical ID predictor module. It takes the embeddings of the current frame’s objects as input and directly predicts their ID labels based on historical trajectory information.

On the one hand, compared to tracking-by-detection methods, our approach bypasses surrogate tasks and heuristic algorithms, streamlining the tracking pipeline and enabling end-to-end exploitation of tracking capabilities from specific scenarios. On the other hand, both our detector and ID predictor can be highly parallelized, eliminating the serial processing as in tracking-by-query models, which enables efficient long-term training and further unlocks its potential. Besides, not having to handle detection and association simultaneously within a single module also alleviates the conflict.

We evaluate our method across distinct scenarios. Our approach demonstrates remarkably superior performance on DanceTrack [34] and SportsMOT [10] while achieving competitive results compared with other transformer-based methods on MOT17 [27]. These experimental findings highlight the advantages and promising potential of our proposed pipeline and model. Moreover, comprehensive ablation experiments have also validated the efficacy of our method.

2 Related Work

Tracking-by-Detection is the most widely used paradigm for multi-object tracking in the community. These methods [4, 47, 48] employ heuristic algorithms to associate the detection results from the current frame with historical trajectories, thereby achieving online multi-object tracking frame by frame. Since traditional multi-object tracking primarily focuses on pedestrian tracking [27, 11], and pedestrians exhibit relatively simple motion patterns, a natural approach is to estimate their motion for target association. For example, SORT [4] and ByteTrack [47] have achieved impressive tracking performance by leveraging the Kalman Filter [38] for linear motion prediction. However, this linear motion assumption fails to capture chaotic trajectories. OC-SORT [6] further upgrades the algorithm, making it adaptable to abrupt stops and non-linear motions, thereby improving the performance in complex scenarios such as DanceTrack [34]. Some methods [39, 48, 37, 10, 8, 24] also attempt to incorporate object appearance similarity for measurement. Deep-SORT [39] and Deep-OC-SORT [23] incorporate additional Re-ID modules to obtain Re-ID features. In contrast, works like FairMOT [48] and JDE [37] utilize the same framework for both detection and feature extraction. Furthermore, there are also methods that utilize other forms of data. For instance, BoT-SORT [1] corrects motion estimation results by estimating camera displacement, while TrackFlow [25] enhances the model through depth information. Due to the imperfect nature of linear motion assumptions in complex scenes, Hybrid-SORT [44] recently introduced state estimations based on object height and detection confidence, which enhances tracking capabilities, particularly on the challenging DanceTrack [34] dataset. Although patching up heuristic algorithms can still yield competitive results, it tends to bloat the codebase. Moreover, manually crafting surrogate tasks and relying on prior assumptions do not allow for learning optimal strategies in specific scenarios, resulting in a lot of effort to tune the parameters and algorithms when facing situations not considered.

Tracking-by-Query is a recently proposed tracking paradigm inspired by the DETR family [7, 52]. They [46, 26, 35] extended the detect queries to the MOT task, using track queries to represent tracked targets and propagate them through the video sequence. TransTrack [35] builds a Siamese transformer decoder network for detection and tracking. TrackFormer [26] and MOTR [46] utilize the same transformer decoder for joint detection and tracking by simultaneously processing the detect and track queries. MQT [17] adopts multiple queries to represent one tracked object and cares more about class-agnostic tracking. MeMOT [5] builds a colossal memory bank to store historical object features. MeMOTR [13] suggests using a long-term memory injection mechanism, a simple yet effective way to improve tracking performance. However, recent studies [13, 45, 42] have highlighted that the conflict between newborns and tracked objects still remains a severe problem for the tracking-by-query paradigm. Both CO-MOT [42] and MOTRv3 [45] attempt to balance the two in supervision, which can alleviate this contradiction.

3 Method

In this paper, we regard Multiple Object Tracking as an ID Prediction problem and thereby propose our method, MOTIP. In this section, we primarily discuss how we formulate this pipeline and design our approach.

3.1 MOT as an ID Prediction Problem

Multiple Object Tracking aims to generate ordered bounding boxes frame-by-frame for each identity $k$. This implies that, for online processing of each frame, it strives to match detected objects with previous trajectories. We denote these historical trajectories as a trajectory set $\mathcal{T}=\{\mathcal{T}^{1},\mathcal{T}^{2},\cdots,\mathcal{T}^{K}\}$. Each $\mathcal{T}^{k}=(\tau_{1}^{k},\tau_{2}^{k},\cdots,\tau_{T}^{k})$ reveals a complete trajectory of the $k$-th identified target, where $\tau_{t}^{k}$ represents its tracklet context information at time step $t$. In practice, the composition of $\tau_{t}^{k}$ depends on the approach. For example, motion-based methods [4, 6, 47] typically utilize position, velocity, and acceleration to characterize each tracklet, while some Re-ID methods [48, 37, 39] introduce object features for appearance matching.

Previously, given the historical trajectories $\mathcal{T}_{1:t-1}$, as shown in Fig. 1, tracking-by-detection methods computed the cost matrix through surrogate tasks, while tracking-by-query approaches inherited ID information by propagating track queries. In this paper, we introduce a new perspective for MOT, which formulates it as an ID prediction problem. Specifically, given the detection results $\mathcal{D}_{t}$ for the current frame $I_{t}$, we straightforwardly predict their ID labels based on historical trajectories. Formally, it can be expressed as follows:

$\text{ID}(O_{t})=\theta(\mathcal{T}_{1:t-1},\mathcal{D}_{t}),$  (1)

where $\mathcal{T}_{1:t-1}$ represents the trajectories before the $t$-th frame, and $\theta$ is an end-to-end learnable module followed by a simple classification head. It predicts ID labels through a classification task without any hand-crafted surrogate tasks.

Refer to caption
Figure 2: Overview of MOTIP. There are three primary components: a learnable ID dictionary that represents different identities, a DETR detector that detects targets of interest, and an ID Decoder that predicts the ID labels of objects in the current frame, as we discussed in Sec. 3.2. We combine object embeddings with their corresponding ID embeddings to form the historical trajectories. Therefore, the ID field is regarded as an in-context prompt, and the ID Decoder predicts the ID of each new detection accordingly.

3.2 MOTIP Architecture

The overall MOTIP architecture is surprisingly simple and illustrated in Fig. 2. It contains three main components, described below:

  • A DETR detector detects objects and extracts their embeddings.

  • A learnable ID dictionary represents different identities as $C$-dimensional embeddings.

  • An ID Decoder predicts the IDs of newly detected targets.

DETR Detector. We use DETR [7, 52], an end-to-end object detection model with a transformer encoder-decoder architecture, as our image detector. Starting from an original input image $I_{t}$, the CNN [15] backbone and transformer encoder extract and enhance the image features. Next, the decoder generates the output embeddings from $N$ learnable detect queries. They are decoded into bounding boxes and classification confidences by the bbox and cls heads, as illustrated in Fig. 2. We then use a confidence threshold $\tau_{\textit{det}}$ to filter out the negative detections and retain $M_{t}$ active targets. The use of DETR [7, 52] further streamlines our approach, as it allows us to utilize the decoded embeddings $O_{t}=\{o_{t}^{1},o_{t}^{2},\cdots,o_{t}^{M_{t}}\}$ to represent the corresponding targets without the need to consider hierarchical or RoI techniques for feature extraction and fusion.
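As a rough illustration of this filtering step, the sketch below keeps only confident DETR outputs and their embeddings; the tensor names, shapes, and the sigmoid-based scoring are assumptions for illustration, not the released implementation.

```python
import torch

def filter_detections(output_embeds, boxes, logits, tau_det=0.5):
    """Keep only confident DETR outputs (hypothetical shapes for illustration).

    output_embeds: (N, C) decoder output embeddings
    boxes:         (N, 4) predicted boxes
    logits:        (N, num_classes) classification logits
    """
    scores = logits.sigmoid().max(dim=-1).values  # per-query confidence
    keep = scores > tau_det                       # threshold tau_det from the paper
    # The surviving M_t embeddings later serve as the object tokens o_t^m.
    return output_embeds[keep], boxes[keep], scores[keep]
```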

Learnable ID Dictionary. One possible naïve approach is to use one-hot labels to represent IDs. However, on the one hand, discrete values are not conducive to neural network learning. On the other hand, the one-hot form becomes impractical when extending the model to scenarios with a large number of targets, resulting in excessive dimensionality. Therefore, we create an ID dictionary $\mathcal{I}$ that consists of $K+1$ learnable words to represent the identities, as follows:

$\mathcal{I}=\{i^{1},i^{2},\cdots,i^{K},i^{\textit{spec}}\}.$  (2)

Each word $i$ is a learnable $C$-dimensional embedding. In detail, the first $K$ words $\{i^{1},i^{2},\cdots,i^{K}\}$ are regular tokens that represent corresponding identities, while the last word $i^{\textit{spec}}$ is a special token that stands for newborn objects without any ID yet. In practice, $K$ is set to a value remarkably greater than the average number of objects in a single frame, depending on the dataset.
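A minimal sketch of such a dictionary, assuming it is realized as a standard embedding table (the class name and default sizes are ours, not the official code):

```python
import torch
import torch.nn as nn

class IDDictionary(nn.Module):
    """Learnable ID dictionary (Eq. 2): K regular ID words plus a special
    'newborn / no ID yet' word i^spec, each a C-dimensional embedding."""

    def __init__(self, num_ids: int = 50, dim: int = 256):
        super().__init__()
        self.num_ids = num_ids                       # K
        self.words = nn.Embedding(num_ids + 1, dim)  # index K stores i^spec

    def id_word(self, id_labels: torch.Tensor) -> torch.Tensor:
        """Look up ID embeddings i^k for integer labels in [0, K-1]."""
        return self.words(id_labels)

    def spec_word(self) -> torch.Tensor:
        """Return the special token i^spec for detections without an identity."""
        return self.words.weight[self.num_ids]
```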

Historical Trajectory. MOT methods often employ various contexts to represent historical trajectories based on their own requirements, such as positions [43], motions [4, 47, 6], Re-ID features [48, 39], etc. Like many transformer-based methods [46, 13], we directly utilize the output embeddings obtained from decoding detect queries to represent tracked targets. In practice, the target embedding $e$ is derived from the output embedding $o$ through a simple FFN structure. As the embedding $e$ does not contain identity information, we introduce its corresponding ID embedding from Eq. 2 to complete the tracklet context:

$\tau_{t}^{k}=\textit{concat}(e_{t}^{k},i^{k}),$  (3)

where $\tau_{t}^{k}$ is the tracklet context of the tracked object with identity $k$ at time step $t$, while $e_{t}^{k}$ is its target embedding. Therefore, each tracklet context is a $2C$-dimensional vector and can be spliced into the historical trajectory $\mathcal{T}^{k}=(\cdots,\tau_{t-1}^{k},\tau_{t}^{k},\tau_{t+1}^{k},\cdots)$. It should be noted that, during experiments, we only keep trajectories from the most recent $T$ frames, as depicted in Fig. 2.
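For concreteness, a sketch of how tracklet tokens could be assembled from target and ID embeddings; the function name and the buffer logic are illustrative assumptions:

```python
import torch

def build_tracklet_tokens(target_embeds, id_labels, id_dict):
    """Form tracklet tokens tau_t^k = concat(e_t^k, i^k) for one frame (Eq. 3).

    target_embeds: (M, C) target embeddings e after the FFN adapter
    id_labels:     (M,)   integer identities assigned to these targets
    id_dict:       an ID dictionary like the sketch above
    """
    id_words = id_dict.id_word(id_labels)                # (M, C)
    return torch.cat([target_embeds, id_words], dim=-1)  # (M, 2C) tokens

# A simple sliding buffer keeps only tokens from the most recent T frames:
# trajectory_buffer = trajectory_buffer[-T:]
```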

ID Decoder. To handle inputs of varying lengths, we leverage a 6-layer transformer decoder with relative temporal position encoding as our ID predictor. It takes the trajectories and detections as input, as shown in Eq. 1. Similar to Eq. 3, we incorporate the identity field into each detection to form a $2C$-dimensional token:

$d_{t}^{m}=\textit{concat}(o_{t}^{m},i^{\textit{spec}}),$  (4)

where $o_{t}^{m}$ is the $m$-th active DETR output embedding in the current frame, and $i^{\textit{spec}}$ is the special token in the ID dictionary signifying that the ID is unknown. Then, we input these detection tokens $\mathcal{D}=\{d_{t}^{1},d_{t}^{2},\cdots,d_{t}^{M_{t}}\}$ into the ID Decoder as $Q$, while the trajectory tokens $\mathcal{T}_{t-T:t-1}$ are regarded as $K$ and $V$. Based on this, the ID tokens in the historical trajectories serve as a kind of in-context prompt, propagating specific identity information to the corresponding detections. Afterward, the decoded $\widehat{d_{t}^{m}}$ is transformed into ID probabilities through a linear projection network, illustrated as the id pred head in Fig. 2. Therefore, we transform the ID assignment step into a $K+1$ classification task, which can be end-to-end supervised by a cross-entropy loss.
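A minimal sketch of such an ID Decoder built from standard PyTorch modules; the relative temporal position encoding is omitted and all hyperparameters are placeholders rather than the official configuration:

```python
import torch
import torch.nn as nn

class IDDecoder(nn.Module):
    """Sketch: detection tokens (queries) attend to trajectory tokens
    (keys/values, i.e. the in-context ID prompts) and are classified
    into K+1 ID labels."""

    def __init__(self, dim: int = 512, num_ids: int = 50,
                 num_layers: int = 6, nhead: int = 8):
        super().__init__()
        # dim corresponds to the 2C token width from Eqs. 3-4
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.id_pred_head = nn.Linear(dim, num_ids + 1)  # K regular IDs + newborn

    def forward(self, det_tokens, traj_tokens):
        """det_tokens:  (B, M_t, 2C) detection tokens d_t^m (Eq. 4)
           traj_tokens: (B, L, 2C)   flattened historical tracklet tokens"""
        decoded = self.decoder(tgt=det_tokens, memory=traj_tokens)
        return self.id_pred_head(decoded)                # (B, M_t, K+1) ID logits
```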

3.3 Training and Inference

Training. We sample a video clip containing $T+1$ frames for each training iteration. The model is then required to predict the IDs at each time step $t$ $(1<t\leq T+1)$ except for the first frame. To follow the online tracking protocol, we use a causal attention mask during training to ensure that only previous frames are visible.
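One way to realize such a mask, following the PyTorch convention that True marks positions a query may not attend to (the data layout is an assumption for illustration):

```python
import torch

def causal_frame_mask(query_frames, memory_frames):
    """Cross-attention mask so a detection at frame t only sees trajectory
    tokens from frames strictly before t (online protocol).

    query_frames:  frame index of each detection token
    memory_frames: frame index of each trajectory token
    Returns a (num_queries, num_memory) boolean mask (True = blocked).
    """
    q = torch.as_tensor(query_frames).unsqueeze(1)   # (Q, 1)
    m = torch.as_tensor(memory_frames).unsqueeze(0)  # (1, L)
    return m >= q  # block same-frame and future trajectory tokens
```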

For our proposed ID predictor, we employ a straightforward cross-entropy loss for supervision. Since the number of targets varies from frame to frame, we compute the final id loss by averaging over all targets, as follows:

$\mathcal{L}_{id}=\frac{-\sum_{t=2}^{T+1}\sum_{m=1}^{M_{t}}\sum_{k=1}^{K+1}y_{m}^{k}\log(p_{m}^{k})}{\sum_{t=2}^{T+1}M_{t}},$  (5)
$y_{m}^{k}=\begin{cases}1 & \text{gt id of the } m^{th} \text{ object is } k,\\ 0 & \text{else,}\end{cases}$  (6)

where $T+1$ denotes the number of frames in the training clip, and there are $M_{t}$ ground-truth objects in the $t$-th frame. $y_{m}^{k}$ is an indicator function based on the ground-truth identity of each object, as shown in Eq. 6. In practice, we adopt an end-to-end strategy to train the object detector and the ID predictor simultaneously. Therefore, we utilize an overall loss $\mathcal{L}$ to supervise both parts:

$\mathcal{L}=\lambda_{\textit{cls}}\mathcal{L}_{\textit{cls}}+\lambda_{\textit{L1}}\mathcal{L}_{\textit{L1}}+\lambda_{\textit{giou}}\mathcal{L}_{\textit{giou}}+\lambda_{\textit{id}}\mathcal{L}_{\textit{id}},$  (7)

where $\mathcal{L}_{\textit{cls}}$ is the focal loss [18], and $\mathcal{L}_{\textit{L1}}$ and $\mathcal{L}_{\textit{giou}}$ denote the L1 loss and the generalized IoU loss [31], respectively. $\lambda_{\textit{cls}}$, $\lambda_{\textit{L1}}$, and $\lambda_{\textit{giou}}$ are their corresponding weight coefficients, and $\lambda_{\textit{id}}$ is the weight coefficient of our proposed id loss $\mathcal{L}_{\textit{id}}$.
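A direct reading of Eqs. 5-6 as code (a sketch; the list-of-frames data layout is our assumption, and the overall loss of Eq. 7 would simply add the weighted detection terms):

```python
import torch
import torch.nn.functional as F

def id_loss(id_logits_per_frame, gt_ids_per_frame):
    """Averaged cross-entropy ID loss over frames t = 2..T+1 (Eqs. 5-6).

    id_logits_per_frame: list of (M_t, K+1) logits, one entry per frame
    gt_ids_per_frame:    list of (M_t,) ground-truth ID labels in [0, K]
    """
    total, num_objects = 0.0, 0
    for logits, gt in zip(id_logits_per_frame, gt_ids_per_frame):
        total = total + F.cross_entropy(logits, gt, reduction="sum")
        num_objects += gt.numel()
    return total / max(num_objects, 1)  # average over all targets in the clip
```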

Inference. In the first frame of a video sequence, detected objects with confidence scores greater than $\tau_{\textit{new}}$ are recorded as newborn objects and assigned unique identities. At each subsequent time step $t$ $(t>1)$, we first filter the detections from DETR with the confidence threshold $\tau_{\textit{det}}$. These active detections, together with the historical trajectories $\mathcal{T}_{t-T:t-1}$ of the most recent $T$ frames, are then fed into the ID Decoder to predict the corresponding ID labels. Among these ID predictions, only those exceeding a probability threshold $\tau_{\textit{id}}$ are adopted. Any detection whose confidence is greater than $\tau_{\textit{new}}$ but that is not assigned any tracked ID is regarded as a newborn target and given a new identity. In addition, once the number of identities that have appeared in a long video sequence exceeds $K$, the tokens in the ID dictionary $\mathcal{I}$ are reused cyclically.
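A schematic single-frame inference step under the three thresholds described above; all names (detr, id_decoder, trajectories) are placeholders, and both the per-frame ID uniqueness step (Sec. 0.A.2) and the cyclic dictionary reuse are omitted:

```python
def track_one_frame(detr, id_decoder, frame, trajectories, next_free_id,
                    tau_det=0.3, tau_new=0.6, tau_id=0.2):
    """Sketch of one online tracking step; see Sec. 0.A.2 for the Hungarian
    step that additionally enforces per-frame ID uniqueness."""
    embeds, boxes, scores = detr(frame)                      # all N query outputs
    keep = scores > tau_det                                  # active detections
    embeds, boxes, scores = embeds[keep], boxes[keep], scores[keep]

    id_probs = id_decoder(embeds, trajectories).softmax(-1)  # (M_t, K+1)
    assigned = []
    for prob, score in zip(id_probs, scores):
        best_p, best_id = prob[:-1].max(dim=-1)              # ignore the newborn column
        if best_p.item() > tau_id:                           # confident match to a tracked ID
            assigned.append(int(best_id))
        elif score.item() > tau_new:                          # unmatched but confident: newborn
            assigned.append(next_free_id)
            next_free_id += 1
        else:
            assigned.append(None)                             # discarded detection
    return boxes, assigned, next_free_id
```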

Trajectory Augmentation in Training. Multi-object tracking always faces challenging situations, such as occlusions, blur, or similar-looking objects. These challenges can cause ID assignment errors during inference, which compromise the reliability of historical trajectories. However, such situations never arise during training, since all IDs are obtained from the ground truth through bipartite matching, similar to DETR [7, 52]. There is therefore a divergence between training and inference, which may degrade tracking performance. To alleviate this issue, we introduce two trajectory augmentation techniques used during training. First, we swap the IDs of two historical targets within the same frame with probability $\lambda_{\textit{sw}}$. Second, we randomly remove $l$ consecutive tokens from a given trajectory with probability $\lambda_{\textit{drop}}$, where $l$ is a random length sampled from a uniform distribution. These two techniques simulate target occlusion and ID failures during the training phase, thereby enhancing the robustness of our model.
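Both augmentations are simple to express; the sketch below operates on plain Python lists, and the probability values are just the defaults reported in Sec. 4.2:

```python
import random

def swap_ids_in_frame(frame_ids, lambda_sw=0.3):
    """With probability lambda_sw, swap the ID labels of two tracked targets
    in the same frame, simulating an association error."""
    if len(frame_ids) >= 2 and random.random() < lambda_sw:
        i, j = random.sample(range(len(frame_ids)), 2)
        frame_ids[i], frame_ids[j] = frame_ids[j], frame_ids[i]
    return frame_ids

def drop_tracklet_tokens(tokens, lambda_drop=0.5):
    """With probability lambda_drop, delete a run of l consecutive tokens from
    one trajectory (l drawn uniformly), simulating occlusion-like gaps."""
    if len(tokens) >= 2 and random.random() < lambda_drop:
        l = random.randint(1, len(tokens) - 1)          # random run length
        start = random.randint(0, len(tokens) - l)      # random start position
        tokens = tokens[:start] + tokens[start + l:]
    return tokens
```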

4 Experiments

4.1 Datasets and Metrics

Datasets. We mainly evaluate and analyze our proposed method on the DanceTrack [34] and SportsMOT [10] datasets. These two recently proposed datasets offer large-scale training data, which benefits network training and avoids overfitting. In addition, both datasets provide official validation sets that facilitate our exploratory experiments. We also present experimental results on the MOT17 [27] dataset.

Metrics. We mainly use the Higher Order Tracking Accuracy (HOTA) metric [22] to evaluate our method, as it provides a balanced way to explicitly measure both detection and association performance. In addition, we also report the MOTA [3] and IDF1 [32] metrics in the experimental results.

4.2 Implementation Details

By default, we build MOTIP upon Deformable DETR [52] with a ResNet-50 [15] backbone and initialize the parameters with the official weights pre-trained on COCO [19]. Our model is implemented in PyTorch and mainly trained on 8 NVIDIA RTX 4090 GPUs. During training, although we process $T+1$ images in parallel, only 4 frames are used for gradient recording. The remaining $T-3$ frames are run in gradient-free mode (no_grad in PyTorch) to reduce the computational overhead. In addition, to speed up convergence, the model is briefly pre-trained for detection on the corresponding datasets; more details are discussed in the appendix.

In practice, the supervision weight coefficients $\lambda_{\textit{cls}}$, $\lambda_{\textit{L1}}$, $\lambda_{\textit{giou}}$, and $\lambda_{\textit{id}}$ are set to 2.0, 5.0, 2.0, and 1.0, respectively. We use the AdamW optimizer with an initial learning rate of $1.0\times 10^{-4}$ and a weight decay of $5.0\times 10^{-4}$. The maximum temporal length $T$ of the historical trajectories is set to 39 for DanceTrack and SportsMOT, and to 19 for MOT17. The ID dictionary size $K$ is 50 for DanceTrack and SportsMOT, and 200 for MOT17 because of its more crowded scenes. Although further careful tuning of the hyperparameters could yield better performance, for simplicity the training augmentation parameters $\lambda_{\textit{sw}}$ and $\lambda_{\textit{drop}}$ are set to 0.3 and 0.5 on all datasets.

Table 1: Performance comparison with state-of-the-art methods on DanceTrack [34].
Methods HOTA DetA AssA MOTA IDF1
FairMOT [48] 39.7 66.7 23.8 82.2 40.8
CenterTrack [50] 41.8 78.1 22.6 86.8 35.7
TraDeS [40] 43.3 74.5 25.4 86.2 41.2
TransTrack [35] 45.5 75.9 27.5 88.4 45.2
ByteTrack [47] 47.7 71.0 32.1 89.6 53.9
GTR [51] 48.0 72.5 31.9 84.7 50.3
QDTrack [28] 54.2 80.1 36.8 87.7 50.4
MOTR [46] 54.2 73.5 40.2 79.7 51.5
OC-SORT [6] 55.1 80.3 38.3 92.0 54.6
C-BIoU [43] 60.6 81.3 45.4 91.6 61.6
MeMOTR [13] 63.4 77.0 52.3 85.4 65.5
MOTIP (ours) 67.5 79.4 57.6 90.3 72.2

4.3 Comparison with State-of-the-art Methods

In this section, we compare our MOTIP with previous methods on DanceTrack [34], SportsMOT [10], and MOT17 [27]. When comparing with other transformer-based methods [46, 13, 26], we only list the results obtained with the standard Deformable DETR [52] and ResNet-50 [15]. Other results, such as those obtained with more improved DETR frameworks [20] or stronger backbones [21], are discussed in the supplementary material. In addition, for DanceTrack and SportsMOT, we mainly compare results obtained without additional datasets. The introduction of extra training data and the corresponding results are also discussed in the appendix.

DanceTrack. In Tab. 1, we compare MOTIP with current state-of-the-art methods on the DanceTrack [34] test set. Without bells and whistles, our method achieves 67.5 HOTA and 57.6 AssA, significantly surpassing other state-of-the-art methods. Compared with tracking-by-detection algorithms (e.g., ByteTrack [47], OC-SORT [6], and C-BIoU [43]), our method obtains remarkable association accuracy using only a streamlined ID Decoder. We believe that, compared with manually designed heuristics, our proposed end-to-end ID prediction can learn tracking capabilities from complicated situations more effectively. Compared with query-based tracking methods [46, 13] that also adopt Deformable DETR [52], our method achieves higher detection and association performance. We attribute this, on the one hand, to the formal decoupling of the two tasks in MOTIP, which reduces their mutual conflict; on the other hand, our highly parallelized training pipeline allows us to effectively learn more robust tracking strategies from long-term sequences.

SportsMOT. We also compare our method on SportsMOT [10] in Tab. 2. Some results of existing methods [47, 50, 51] in the SportsMOT paper use additional training data. For a fair comparison, we choose two representative methods, ByteTrack [47] and OC-SORT [6], and report their results without extra training data using their official codebases. In this comparison, our model exhibits clearly superior performance, achieving 71.9 HOTA. In particular, compared with some strong competitors focused on object association (such as OC-SORT [6] and MeMOTR [13]), our method shows impressive association performance (62.0 AssA and 75.0 IDF1). These experimental results demonstrate that our method generalizes to diverse situations, since SportsMOT [10] represents scenarios quite different from DanceTrack [34], characterized by fast motion and camera movement.

Table 2: Performance comparison with state-of-the-art methods on SportsMOT [10]. We report the results of ByteTrack [47] and OC-SORT [6] without additional training data using their official code.
Methods HOTA DetA AssA MOTA IDF1
FairMOT [48] 49.3 70.2 34.7 86.4 53.5
QDTrack [28] 60.4 77.5 47.2 90.1 62.3
ByteTrack [47] 62.1 76.5 50.5 93.4 69.1
OC-SORT [6] 68.1 84.8 54.8 93.4 68.0
MeMOTR [13] 68.8 82.0 57.8 90.2 69.9
MOTIP (ours) 71.9 83.4 62.0 92.9 75.0

MOT17. As a representative benchmark for pedestrian multi-object tracking, we also report the experimental results on the MOT17 [27] test set in Tab. 3. Similar to previous approaches [46, 13, 47], we introduce CrowdHuman [33] as additional training data to alleviate the issue of overfitting. In Tab. 3, we categorize the results of existing methods into two groups, CNN based and Transformer based, because transformer-based detectors [7, 52] exhibit limitations in detecting small and densely packed targets. Compared to the well-designed MOTR [46], our MOTIP achieves better tracking performance (59.2 HOTA). It is worth noting that, under the same Deformable DETR [52] framework, our method achieves significantly better detection performance (62.0 vs. 58.9 DetA). We attribute this improvement to the fact that, formally, we do not need to simultaneously handle both detection and association tasks, thereby avoiding conflicts between them. However, when compared to recent CNN-based methods [47, 6], there is still a certain gap, which remains a focus for future efforts.

Table 3: Performance comparison with state-of-the-art methods on MOT17 [27]. The best performance among the transformer-based methods is marked in bold.
Methods HOTA DetA AssA MOTA IDF1
CNN based:
CenterTrack [50] 52.2 53.8 51.0 67.8 64.7
QDTrack [28] 53.9 55.6 52.7 68.7 66.3
GTR [51] 59.1 61.6 57.0 75.3 71.5
FairMOT [48] 59.3 60.9 58.0 73.7 72.3
DeepSORT [39] 61.2 63.1 59.7 78.0 74.5
SORT [4] 63.0 64.2 62.2 80.1 78.2
ByteTrack [47] 63.1 64.5 62.0 80.3 77.3
Quo Vadis [12] 63.1 64.6 62.1 80.3 77.7
OC-SORT [6] 63.2 63.2 63.4 78.0 77.5
C-BIoU [43] 64.1 64.8 63.7 81.1 79.7
MotionTrack [29] 65.1 65.4 65.1 81.1 80.1
Transformer based:
TrackFormer [26] / / / 74.1 68.0
TransTrack [35] 54.1 61.6 47.9 74.5 63.9
TransCenter [41] 54.5 60.1 49.7 73.2 62.2
MeMOT [5] 56.9 / 55.2 72.5 69.0
MOTR [46] 57.2 58.9 55.8 71.9 68.4
MOTIP (ours) 59.2 62.0 56.9 75.5 71.2

4.4 Ablation Experiments

We perform our ablation experiments on DanceTrack [34] and SportsMOT [10] because they have large-scale training data and official validation sets. Unless otherwise stated, no trajectory augmentation techniques are used, i.e., $\lambda_{\textit{sw}}=0.0$ and $\lambda_{\textit{drop}}=0.0$. More details are discussed in the appendix.

Different Tracking Pipeline. Section 3.1 introduces a novel formulation that regards object association as an ID classification task, while we only utilize object embeddings from DETR to represent tracked targets. To validate the superiority of our proposed pipeline, we construct two additional distinct strategies for completing ID assignment. The first strategy is similar to ReID-based approaches: it directly supervises the cosine similarity of embeddings between historical trajectories and the current target and, during inference, selects the minimum-cost match. The second strategy replaces this supervision with the widely used contrastive loss function (InfoNCE) [30]. We denote these two strategies and ours as cosine, contra, and id pred, respectively. As shown in Tab. 4 (#4 to #6), with similar detection performance, our proposed ID prediction approach (#6) accomplishes significantly better tracking performance. We believe that, compared to manually designed similarity calculation methods such as cosine similarity, our ID Predictor can learn more appropriate ways to assign IDs directly from the data.

Table 4: Ablation experiments about the training strategy and tracking pipeline. The Two-Stage setting first trains the DETR for object detection and then freezes it to train the additional tracking module, while the One-Stage setting jointly trains these two parts. Besides, cosine represents using cosine similarity for object matching, while contra refers to introducing the contrastive loss (InfoNCE).
# Training Pipeline DanceTrack val SportsMOT val
HOTA DetA AssA IDF1 HOTA DetA AssA IDF1
# 1 Two-Stage cosine 36.2 73.0 18.1 31.8 47.4 83.8 26.9 42.2
# 2 contra 40.0 73.4 22.0 36.1 53.7 84.0 34.4 49.9
# 3 id pred 51.2 73.3 36.1 51.2 68.6 84.2 56.0 70.5
# 4 One-Stage cosine 47.1 73.4 30.8 45.7 59.0 84.1 41.5 58.0
# 5 contra 49.6 74.1 33.5 47.1 62.5 84.7 46.2 60.5
# 6 id pred 59.1 74.0 47.5 61.6 73.5 84.6 64.0 76.7

One-Stage vs. Two-Stage Training. In Tab. 4, we also explore different training strategies: One-Stage trains detection and tracking simultaneously, while Two-Stage first trains the DETR network for object detection and then freezes it to train the additional tracking part. Experimental results show that one-stage training achieves better results regardless of the tracking strategy. We suggest that joint training can help the DETR learn more distinguishable object embeddings. Meanwhile, the frozen DETR network provides a fair playing field (#1 to #3 in Tab. 4) for the three tracking pipelines, as the output embeddings are consistent. In this competition, our proposed ID Decoder still earns the best tracking performance, further substantiating the advantages of our approach.

Visualization of ID Decoder. In Fig. 3, we also visualize the cross-attention weights of the ID Decoder in a complex scenario. For object 5, a dancer standing behind and occluded by other dancers from frame 638 to 641, the attention weights between it and its own historical trajectory are depicted as a heat map. Although the 637-th frame is closer to the current frame (the 642-nd frame) in time, the 630-th frame contributes the most to re-linking after occlusion because it is more visible and reliable. In contrast, objects 1 and 2 choose to trust the most recent tracklet embeddings (frame 641) because they were not occluded in the past. These observations validate that our ID Decoder can dynamically capture reliable tracklet embeddings, especially in complicated situations. This also explains why, compared to other tracking pipelines, our method achieves better tracking performance in Tab. 4.

Refer to caption
Figure 3: Visualization of ID Decoder cross-attention scores between objects and their own historical trajectories. Object 5 is occluded from frame 638 to 641, while the other two objects (1 and 2) are visible during these 20 frames. The red cross means the target disappears.
Table 5: Ablations about the one-hot and learnable ID embedding
ID Embed DanceTrack val SportsMOT val
HOTA DetA AssA IDF1 HOTA DetA AssA IDF1
one-hot 57.7 73.6 45.5 59.5 72.7 84.6 62.5 74.8
learnable 59.1 74.0 47.5 61.6 73.5 84.6 64.0 76.7
Table 6: Ablations about the self-attention in our proposed ID Decoder
self-attn DanceTrack val SportsMOT val
HOTA DetA AssA IDF1 HOTA DetA AssA IDF1
w/o 58.1 74.6 45.4 58.5 72.5 84.8 62.0 74.6
w/ 59.1 74.0 47.5 61.6 73.5 84.6 64.0 76.7

One-Hot vs. Learnable ID Embedding. As for the ID dictionary shown in Eq. 2, we compared the use of one-hot and learnable embeddings, as reported in Tab. 5. Experimental results indicate that learnable ID tokens yield slightly better results. We attribute this improvement to the end-to-end training process. Not only that, but the excellent scalability of learnable embeddings is also a crucial reason for the ultimate choice.

Self-Attention in ID Decoder. Our ID Decoder is a stack of alternating self-attention and cross-attention layers, following the standard architecture of the transformer [36]. Although identity information can be obtained from trajectories using only cross-attention, self-attention helps newly detected targets exchange identity information, thus distinguishing each other during the ID prediction process. The experiments in Tab. 6 prove that the introduction of self-attention indeed improves our tracking performance. Moreover, it also shows that predicting the ID independently for each target still yields satisfactory results, validating the robustness of our pipeline.

Table 7: Ablation experiments about the trajectory augmentation on SportsMOT.
λdrop\lambda_{\textit{drop}} λsw\lambda_{\textit{sw}} SportsMOT val
HOTA DetA AssA MOTA IDF1
0.0 0.0 73.5 84.6 64.0 93.7 76.7
0.5 0.0 74.5 84.6 65.6 93.7 78.1
1.0 0.0 73.9 84.8 64.5 93.9 77.2
0.0 0.1 73.9 84.8 64.5 93.8 76.9
0.0 0.3 74.3 84.9 65.0 94.0 77.9
0.0 0.5 73.9 84.6 64.6 93.6 77.2
0.5 0.3 75.7 84.9 67.6 93.9 79.5
Table 8: Statistics after shuffling the order of ID assignments on SportsMOT.
statistics SportsMOT val
HOTA DetA AssA MOTA IDF1
max 73.94 84.70 64.65 93.79 77.44
min 73.20 84.44 63.34 93.73 76.25
avg 73.61 84.61 64.10 93.75 76.91
std 0.041 0.004 0.118 1.9e-4 0.118

Trajectory Augmentation in Training. We explore the impact of different probability hyperparameters in Tab. 7, on the SportsMOT [10] val set. The tracking performance significantly improves when $\lambda_{\textit{drop}}$ is set to 0.5. However, if too many tokens are discarded, training becomes excessively challenging, which is detrimental to the final performance. Similarly, we also conduct ablation experiments with different values of $\lambda_{\textit{sw}}$. When progressively increasing $\lambda_{\textit{sw}}$ from 0.1 to 0.5, our method achieves the highest HOTA and AssA scores when $\lambda_{\textit{sw}}$ is set to 0.3. We suggest that exchanging partial historical trajectories can enhance the robustness of the model, but it is crucial to choose an appropriate ratio. Therefore, we combine the optimal parameters of these two trajectory augmentation processes, i.e., $\lambda_{\textit{drop}}=0.5$ and $\lambda_{\textit{sw}}=0.3$, to conduct comparative experiments with other state-of-the-art methods.

Study on ID Prompts. As shown in Eq. 3, historical trajectories consist of the object embedding $e$ and the ID embedding $i$. The latter serves as an in-context prompt, prompting the ID Decoder to predict the corresponding identities. To verify its effectiveness and robustness, we randomly shuffled the order of ID assignments during the inference process and conducted 20 experiments. As a consequence, the identification prompt for the same trajectory varied across different experiments. The statistical results in Tab. 8 show that our model can always achieve satisfactory performance under different ID prompts.

5 Conclusions

We have presented MOTIP, a new design for multiple object tracking systems based on a streamlined ID prediction pipeline. Without bells and whistles, the approach has achieved impressive tracking performance on various benchmarks. Despite this, our method still has some limitations. For example, the lack of motion estimation may cause our model to lag behind in crowded scenarios. However, the road ahead is long; compared to well-established paradigms with years of development, ours is still in its infancy. The potential it demonstrates gives us confidence in a promising future.

Table of Contents for Appendix

  A. Experiment Details
    A.1. Training Details
    A.2. Inference Details
    A.3. Ablation Details
  B. More State-of-the-art Comparisons
    B.1. Boosting Performance via DAB-Deformable DETR
    B.2. Boosting Performance with Additional Training Data
    B.3. Inference Speed
  C. Rethinking the Introduction of Static Images
    C.1. Inconsistency of Scenarios and Objects of Interest
    C.2. Too Simple for Tracker: Random Shift Simulation
  D. Parallelized Training
  E. Limitations and Discussions

Appendix 0.A Experiment Details

Due to space limitations in the main text, we were unable to provide a comprehensive account of all experimental details. In this section, we will describe the specific details related to the training, inference, and ablation experiments.

0.A.1 Training Details

In each training iteration, we need to sample $T+1$ frames for training. Similar to previous works [46, 13, 49, 42] that employ multi-frame training, we adopt random sampling intervals to enhance the diversity of training data. However, continuously increasing the sampling interval can lead to overly challenging training examples, deviating from the specific requirements of real-world scenarios and ultimately adversely affecting the model's performance. In our experiments, we set the random sampling interval for both DanceTrack [34] and SportsMOT [10] to range from 1 to 4. On MOT17 [27], due to the presence of video sequences with low frame rates, the maximum sampling interval is set to 2 to balance performance. After obtaining all the clips, similar to the previous approaches [49, 13], we employ various data augmentation techniques, including random cropping, random resizing, and random horizontal flipping, to further enhance their diversity.
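A sketch of such clip sampling with random intervals, assuming the video is long enough for the requested clip (shorter videos fall back to an interval of 1):

```python
import random

def sample_clip_indices(video_length, num_frames, max_interval=4):
    """Sample frame indices for a (T+1)-frame training clip with random
    sampling intervals drawn from [1, max_interval]."""
    intervals = [random.randint(1, max_interval) for _ in range(num_frames - 1)]
    if sum(intervals) > video_length - 1:       # clip would overrun the video
        intervals = [1] * (num_frames - 1)
    start = random.randint(0, video_length - 1 - sum(intervals))
    indices, cur = [start], start
    for step in intervals:
        cur += step
        indices.append(cur)
    return indices
```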

By default, on DanceTrack [34], we train MOTIP for 14 epochs on the train set and drop the learning rate by a factor of 10 at the 8th and 12th epoch. On SportsMOT [10], we train our model for 18 epochs on the train set and drop the learning rate by a factor of 10 at the 10th and 16th epoch. On MOT17 [27], following previous methods [47, 26], we use CrowdHuman [33] as additional simulated training data. We train the model on this joint train set for 45 epochs. The learning rate is dropped at the 30th and 40th epoch. For all experiments, the overall batch size is set to 8 by using 8 NVIDIA RTX 4090 GPUs.

0.A.2 Inference Details

Hyperparameters Setting. As described in the inference process outlined in Section 3.3, the three hyperparameters $\tau_{\textit{det}}$, $\tau_{\textit{new}}$, and $\tau_{\textit{id}}$ play a crucial role in inference. In practice, $\tau_{\textit{det}}$, $\tau_{\textit{new}}$, and $\tau_{\textit{id}}$ are set to 0.3, 0.6, and 0.2 on DanceTrack [34] and SportsMOT [10]. On MOT17 [27], the detection and newborn thresholds ($\tau_{\textit{det}}$ and $\tau_{\textit{new}}$) are set to 0.5 because of the extreme difficulty of detection, while the ID threshold $\tau_{\textit{id}}$ is set to 0.02 as the ID dictionary is larger than the others.

Ensuring the Unique ID Label. As shown in Sec. 3.2 and Fig. 2, our MOTIP is directly followed by a linear layer termed the id pred head to generate the probability for each ID label. However, in general classification tasks, multiple objects can be predicted as belonging to the same category. In contrast, for evaluating MOT, an ID label can be used at most once within the same frame. Therefore, we require additional processing to ensure compliance with this criterion.

An intuitive approach would be to employ a bipartite matching algorithm, such as the Hungarian algorithm, to ensure that each ID label is activated only once. In practice, to accommodate multiple newborn targets that may appear in the same frame, the confidence value corresponding to the special ID $i^{\textit{spec}}$ is duplicated $M_{t}$ times in advance, where $M_{t}$ is the number of detections in the $t$-th frame. Another alternative is to design a custom algorithm that, for each ID label, only selects the highest valid probability for consideration. In this way, we can guarantee that each ID is used at most once. We have experimented with both methods, and the resulting performance difference is negligible. Therefore, we ultimately opted for the Hungarian algorithm because the existing package allows us to accomplish this process with just two lines of code in our implementation.
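With SciPy's Hungarian implementation, the uniqueness constraint (including duplicating the $i^{\textit{spec}}$ column $M_t$ times) reduces to a few lines; this sketch uses assumed array shapes rather than the released code, and the $\tau_{\textit{id}}$ threshold check is omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_unique_ids(id_probs, num_ids):
    """Enforce that each regular ID label is used at most once per frame.

    id_probs: (M, K+1) per-detection ID probabilities; the last column is the
              special newborn ID i^spec, duplicated M times below so that
              several newborn targets in the same frame can all be matched.
    Returns a list of length M containing an ID in [0, K-1] or 'newborn'.
    """
    m = id_probs.shape[0]
    spec = np.repeat(id_probs[:, -1:], m, axis=1)      # (M, M) duplicated i^spec
    cost = -np.concatenate([id_probs[:, :num_ids], spec], axis=1)
    rows, cols = linear_sum_assignment(cost)           # maximize total probability
    order = np.argsort(rows)
    return ['newborn' if cols[i] >= num_ids else int(cols[i]) for i in order]
```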

0.A.3 Ablation Details

Experiment Settings. As discussed in Sec. 4.4, we train models on the train sets of DanceTrack [34] and SportsMOT [10], and subsequently evaluate them on the corresponding official validation sets to complete our ablation experiments. To reduce computational overhead, we configure the maximum historical trajectory length to 19 (i.e., $T=19$). We also slightly decrease the training epochs. Specifically, we train the model on DanceTrack and SportsMOT for 12 and 16 epochs, respectively. Excluding Tab. 7, for the sake of a fair comparison, we refrain from utilizing any trajectory augmentation techniques in the experiments, i.e., $\lambda_{\textit{sw}}=0.0$ and $\lambda_{\textit{drop}}=0.0$. As for inference, we use $\tau_{\textit{det}}=\tau_{\textit{new}}=0.5$ and $\tau_{\textit{id}}=0.1$ for simplicity.

Different Tracking Pipeline. As shown in Tab. 4, we leverage different tracking pipelines for comparison, i.e., cosine and contra. As for cosine, i.e., cosine similarity, we replace the learnable ID Decoder with a fixed cosine similarity computation as follows:

$\textit{cosine}(e_{i},e_{j})=\frac{e_{i}\cdot e_{j}}{\|e_{i}\|\|e_{j}\|},$  (8)

where $e_{i}$ and $e_{j}$ are the two given object embeddings modified from the DETR output embeddings as discussed in Sec. 3.2, and $\textit{cosine}(e_{i},e_{j})$ represents their cosine similarity. During training, if $e_{i}$ and $e_{j}$ belong to the same trajectory, the similarity $\textit{cosine}(e_{i},e_{j})$ is supervised to be $1.0$; otherwise, we supervise it to be $-1.0$. During inference, we aggregate the historical embeddings of each trajectory by averaging them into a single embedding. Subsequently, we compute the cosine similarity between this aggregated embedding and the object embeddings from the current frame. This process yields the final ID-matching results.
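For reference, a sketch of this matching step; it uses a greedy argmax per detection rather than a full minimum-cost assignment, and the data layout is an assumption:

```python
import torch
import torch.nn.functional as F

def cosine_match(traj_embeds_per_id, det_embeds):
    """Cosine baseline (Eq. 8): average each trajectory's historical embeddings,
    then match current detections to the most similar trajectory.

    traj_embeds_per_id: dict {identity: (T_k, C) historical embeddings}
    det_embeds:         (M, C) embeddings of current detections
    """
    ids = list(traj_embeds_per_id.keys())
    centers = torch.stack([traj_embeds_per_id[k].mean(dim=0) for k in ids])  # (K, C)
    sim = F.cosine_similarity(det_embeds.unsqueeze(1), centers.unsqueeze(0), dim=-1)
    best = sim.argmax(dim=1)                     # greedy best match per detection
    return [ids[i] for i in best.tolist()], sim
```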

Furthermore, we also attempt to replace the supervision shown in Eq. 8 with the InfoNCE [30] loss function commonly used in contrastive learning, aiming to achieve better performance and more comprehensive comparisons. Specifically, object embeddings belonging to the same identity are supervised as the same category in the cross-entropy, and vice versa. In Tab. 4, this is denoted as contra, and it is evident that it indeed yields better results than directly using cosine similarity for supervision.

Appendix 0.B More State-of-the-art Comparisons

In this section, we conducted a more comprehensive set of state-of-the-art comparative experiments on DanceTrack [34]. Additionally, we extended the comparisons to include some recently proposed advanced algorithms.

Table 9: Performance comparison with state-of-the-art methods on DanceTrack [34]. The top three results are highlighted in the order of red, blue, and green. MeMOTR+ and MOTIP+ are implemented upon the DAB-Deformable DETR [20] framework. MOTRv2 [49] leverages additional detection proposals from YOLOX-X [14]. CO-MOT [42] utilizes multiple queries to represent the same target. MOTRv3 [45] employs a more powerful backbone ConvNeXT-Base [21]. In the with extra data section, these models are jointly trained using additional dataset CrowdHuman [33] and the DanceTrack train set.
Methods HOTA DetA AssA MOTA IDF1
w/o extra data:
FairMOT [48] 39.7 66.7 23.8 82.2 40.8
CenterTrack [50] 41.8 78.1 22.6 86.8 35.7
TraDeS [40] 43.3 74.5 25.4 86.2 41.2
TransTrack [35] 45.5 75.9 27.5 88.4 45.2
ByteTrack [47] 47.7 71.0 32.1 89.6 53.9
GTR [51] 48.0 72.5 31.9 84.7 50.3
QDTrack [28] 54.2 80.1 36.8 87.7 50.4
MOTR [46] 54.2 73.5 40.2 79.7 51.5
OC-SORT [6] 55.1 80.3 38.3 92.0 54.6
C-BIoU [43] 60.6 81.3 45.4 91.6 61.6
Hybrid-SORT [44] 62.2 / / 91.6 63.0
MeMOTR [13] 63.4 77.0 52.3 85.4 65.5
CO-MOT [42] 65.3 80.1 53.5 89.3 66.5
Hybrid-SORT-ReID [44] 65.7 / / 91.8 67.4
MeMOTR+ [13] 68.5 80.5 58.4 89.9 71.2
MOTIP (ours) 67.5 79.4 57.6 90.3 72.2
MOTIP+ (ours) 70.0 80.8 60.8 91.0 75.1
with extra data:
CO-MOT [42] 69.4 82.1 58.9 91.2 71.9
MOTRv2 [49] 69.9 83.0 59.0 91.9 71.7
MOTRv3 [45] 70.4 83.8 59.3 92.9 72.3
MOTIP (ours) 71.4 81.3 62.8 91.6 76.3

0.B.1 Boosting Performance via DAB-Deformable DETR

Similar to some recent works [13], we explore substituting Deformable DETR [52] with the DAB-Deformable DETR [20] architecture to improve tracking performance. To ensure a fair comparison, we also employ the maximum resolution of $800\times 1536$, while other settings are kept consistent with the default configurations. As illustrated in Tab. 9, we use the symbol + to indicate that the model leverages DAB-Deformable DETR. Our model MOTIP+ achieves 70.0 HOTA, significantly surpassing existing methods. Compared to MeMOTR+ [13], it earns superior association performance (i.e., 60.8 vs. 58.4 AssA). Employing DAB-Deformable DETR enhances the detection capabilities compared with MOTIP, thereby improving overall performance. These experimental results further substantiate the potential of our proposed pipeline.

0.B.2 Boosting Performance with Additional Training Data

In recent works [49, 45, 42], additional training data is utilized to bolster the model's tracking abilities. A common choice is to incorporate the static human detection dataset CrowdHuman [33]. To obtain the continuous video sequences required for tracking, they employ random shifts to resample a single image multiple times. These results are presented in the with extra data section of Tab. 9. We follow the same practice as previous researchers and achieve an exciting result of 71.4 HOTA.

It should be noted that a completely fair comparison with the aforementioned studies [49, 42, 45] is challenging to achieve. For example, MOTRv2 [49] enhances the model’s detection performance by utilizing additional detection proposals derived from a pre-trained YOLOX-X [14] as reference points for detect and track queries. MOTRv3 [45] leverages a more powerful backbone, ConvNeXT-Base [21], to improve its tracking performance. CO-MOT [42] disrupts the conventional one-to-one correspondence between queries and targets in traditional DETR [7, 52] frameworks by employing multiple shadow queries to represent a single target. This approach effectively amplifies the representational capacity of individual targets within the decoder, thereby enhancing both detection and association performance. While our approach may be perceived as a disadvantage due to its reliance on the original Deformable DETR [52] framework without the incorporation of sophisticated techniques, it nonetheless surpasses other methods in terms of tracking performance, which further corroborates its promising potential. We believe that the adoption of additional novel technologies or ingenious designs could potentially improve the performance of our method. However, such explorations fall beyond the scope of this paper and may be handled in future research endeavors.

0.B.3 Inference Speed

As the computational cost of the ID Decoder is mainly caused by several cross-attention layers, the inference speed of our method is primarily determined by the Deformable DETR [52] detector. On an NVIDIA RTX 4090 with FP32 precision, our inference speed is about 16.0 FPS. For comparison, the inference speed of MOTR [46] at the same resolution is about 16.5 FPS.

Appendix 0.C Rethinking the Introduction of Static Images

Although the results in Tab. 9 indicate that incorporating additional static images for training can enhance the model's performance, we believe it is not a flawless solution. We are deeply concerned about the negative implications and the limitations it imposes on the development of MOT methods. Since most methods [49, 45, 42] utilize the human detection dataset CrowdHuman [33] as the additional static dataset, in this section we primarily conduct our analysis on this dataset to substantiate our perspective.

0.C.1 Inconsistency of Scenarios and Objects of Interest

Different Scenarios. As discussed in Section 3.1 of CrowdHuman [33], this dataset aims to be diverse for real-world scenarios. To achieve this, various different keywords were used to collect data from Google Image search. In contrast, existing MOT datasets [34, 10] predominantly focus on specific scenarios. For instance, SportsMOT [10] primarily collects high-quality videos from professional sports events, while DanceTrack [34] crawls network videos, including mostly group dancing. Consequently, in CrowdHuman, some scenes may never appear in specific MOT datasets. As illustrated in Fig. 4, the crowded scenes characteristic of CrowdHuman are virtually absent in both DanceTrack and SportsMOT. Additionally, the CrowdHuman dataset also encompasses some scenes under atypical low-light conditions and wide-angle lens perspectives.

The inconsistencies across these scenarios have not yet adversely impacted the performance on DanceTrack and SportsMOT during joint training, but some concerns remain. The utilization of out-of-domain data is inherently a double-edged sword. While it can enhance performance, it may also cause the model to deviate from its intended application scenarios. This necessitates a careful adjustment of the training data ratio during training to prevent disrupting this delicate balance.

Refer to caption
Figure 4: Illustration of the inconsistency of scenarios between different datasets. (a) CrowdHuman [33] primarily focuses on the detection of humans in high-density scenarios. (b) DanceTrack [34] aims to track dancers from a fixed indoor camera position. (c) SportsMOT [10] is chiefly concerned with the tracking of sports events.
Refer to caption
Figure 5: Visualizing the different protocols of object annotations between SportsMOT [10] and CrowdHuman [33]. (a) In SportsMOT, only athletes are annotated, excluding referees, spectators, or any other people. (b) Since CrowdHuman aims to detect all humans, it also includes annotations for the crowd at the sidelines in sports scenarios, as illustrated by the red masked region.

Different Protocols of Object Annotations. In different datasets, the objects of interest may not be uniform. A salient example is the contrast between the CrowdHuman [33] and SportsMOT [10] datasets. CrowdHuman aims to detect every visible human in the images, whereas SportsMOT only focuses on the athletes in the videos. This results in differences in the annotations between these two datasets within the sports scenario. As illustrated in Fig. 5, compared to SportsMOT, CrowdHuman includes additional annotations for spectators and referees. When training a model on joint training datasets from these two sources, inconsistent annotation practices can lead to significant confusion for the model, hindering its understanding of which targets it should focus on. This can directly degrade the model’s detection performance because it wavers on whether to predict humans other than athletes.

It is possible to follow the approach implemented by MixSort [10], which involves training on the joint dataset and then fine-tuning on SportsMOT to mitigate the impact of inconsistent annotations. However, such multi-stage training would require researchers to invest significant effort in balancing on a razor’s edge. This deviates from the primary contradictions and challenges within the field.

Refer to caption
(a) A real-world video sequence, which is directly sampled from DanceTrack [34].
Refer to caption
(b) A simulated video sequence is generated by sampling regions through a random shift technique [46] from a static image (in CrowdHuman [33]).
Figure 6: Illustrating two distinct approaches to video sequence acquisition: real-world vs. simulated sequences. The latter is tantamount to transforming the objects by mere translation and scaling, which intuitively seems overly simplistic for a tracking model.

0.C.2 Too Simple for Tracker: Random Shift Simulation

Recent studies [46, 13] have demonstrated that multi-frame training is highly beneficial for developing a more robust tracking model. When incorporating static datasets like CrowdHuman [33] for training, random shifting is often employed to sample different regions of the same image in order to generate a video clip. Specifically, for each target, this equates to continuously performing a translation and scaling operation at a constant ratio, as shown in Fig. 6(b). However, the video sequences obtained in this manner are overly simplistic when compared to real-world video sequences (as illustrated in Fig. 6(a)), lacking target occlusion, deformation, and positional exchanges.
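To make the critique concrete, here is one way such a random-shift clip could be simulated from a single image; the crop and scale parameters are purely illustrative and not the settings used by the cited methods:

```python
import random
from PIL import Image

def random_shift_clip(image_path, num_frames=5, drift=0.03, scale_step=0.02):
    """Simulate a short 'video' from a static image: a crop window drifts and
    shrinks at a constant rate per frame, then is resized back to full size."""
    img = Image.open(image_path)
    w, h = img.size
    dx = random.uniform(-drift, drift)   # constant per-frame horizontal drift
    dy = random.uniform(-drift, drift)   # constant per-frame vertical drift
    frames = []
    for t in range(num_frames):
        s = 1.0 - scale_step * t                     # constant scaling per frame
        cw, ch = int(w * s), int(h * s)
        x0 = int((w - cw) / 2 + dx * w * t)
        y0 = int((h - ch) / 2 + dy * h * t)
        x0 = max(0, min(x0, w - cw))
        y0 = max(0, min(y0, h - ch))
        frames.append(img.crop((x0, y0, x0 + cw, y0 + ch)).resize((w, h)))
    return frames
```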

Contemporary methodologies [12, 13] increasingly emphasize temporal information in tracking, yet long-term training hardly benefits from overly simplistic simulated video sequences and may, in fact, contaminate the data distribution, leading to detrimental effects. We would place our hopes on the emergence of more sophisticated and clever video data simulation techniques, which could potentially lead to a turnaround.

Refer to caption
Figure 7: Visualize the parallelized training of MOTIP using a five-frame demo. Thanks to parallelized training techniques, we only need to perform two forward passes on DETR (as shown with numbers 1 and 2), which is GPU-friendly.

Appendix 0.D Parallelized Training

As discussed in Sec. 1, recent tracking-by-query methods [46, 13, 42] process multi-frame video sequences in a manner similar to RNNs. Given a training example of five frames, they have to handle the sequence one frame at a time in a serial fashion, resulting in five DETR forward computations in total. In the training of our MOTIP, however, as illustrated in Fig. 7, there are no inter-frame dependencies between the DETR forward passes, so they can be highly parallelized. As discussed in Sec. 4.2, some of the DETR computations are performed in gradient-free mode, so in practice only two parallel DETR forward passes are required (marked by the numbers in Fig. 7). The contrast is sketched below.
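The following pseudo-PyTorch sketch contrasts the two schemes; `detr`, its call signatures, the output keys, and the split between gradient and gradient-free frames are placeholders for illustration, not the actual MOTIP interfaces.

```python
import torch

def serial_forward(detr, frames, track_queries=None):
    """RNN-like processing (tracking-by-query style): one DETR forward
    per frame, each depending on the previous frame's track queries."""
    outputs = []
    for frame in frames:                       # frames: list of (C, H, W) tensors
        out = detr(frame.unsqueeze(0), track_queries)
        track_queries = out["queries"]         # hypothetical key; next frame needs it
        outputs.append(out)
    return outputs                             # T sequential forward passes

def parallel_forward(detr, frames):
    """Parallelized processing (MOTIP style): frames have no
    inter-dependencies at the DETR stage, so they are stacked along the
    batch dimension and handled in two passes."""
    batch = torch.stack(frames, dim=0)         # (T, C, H, W)
    with torch.no_grad():                      # pass 1: gradient-free for most frames
        frozen_out = detr(batch[:-1])
    grad_out = detr(batch[-1:])                # pass 2: only a subset keeps gradients
    return frozen_out, grad_out
```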

While this approach does not reduce the overall computational workload, parallelized computation is more GPU-friendly. Therefore, even though our model processes 40 frames in each iteration, it takes only about 1.5 days to train MOTIP on DanceTrack [34]. This efficient training pipeline makes long-term training feasible, and we firmly believe it benefits the training of a robust tracking model.

Appendix 0.E Limitations and Discussions

Trajectory Modeling. In our MOTIP system, as discussed in Sec. 3.2, we employ a Feed-Forward Network (FFN) solely as an adapter for historical trajectory information. This design means each tracklet embedding is generated independently. Intuitively, however, modeling each trajectory temporally could yield better feature representations. In our experiments we also attempted to model each individual target trajectory with a straightforward causal attention mechanism (both designs are sketched below), but this did not yield significant improvements. We believe more sophisticated trajectory modeling designs remain to be explored and could further enhance tracking performance.
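For reference, a minimal sketch of the two designs follows; module names, dimensions, and tensor layouts are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class FFNAdapter(nn.Module):
    """Per-target adapter (the design used here): every tracklet
    embedding is produced independently, with no temporal mixing."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, target_embeds: torch.Tensor) -> torch.Tensor:
        # target_embeds: (num_targets, T, dim) -> same shape
        return self.ffn(target_embeds)

class CausalTrajectoryEncoder(nn.Module):
    """Alternative we tried: model each trajectory temporally with
    causal self-attention (it did not bring significant gains)."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, traj_embeds: torch.Tensor) -> torch.Tensor:
        # traj_embeds: (num_targets, T, dim); the mask lets frame t attend
        # only to frames <= t within the same trajectory.
        t = traj_embeds.size(1)
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=traj_embeds.device),
            diagonal=1,
        )
        out, _ = self.attn(traj_embeds, traj_embeds, traj_embeds, attn_mask=mask)
        return out
```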

Table 10: Comparisons with end-to-end transformer-based methods on MOT17 [27].
Methods HOTA DetA AssA MOTA IDF1
TrackFormer [26] / / / 74.1 68.0
TransTrack [35] 54.1 61.6 47.9 74.5 63.9
TransCenter [41] 54.5 60.1 49.7 73.2 62.2
MeMOT [5] 56.9 / 55.2 72.5 69.0
MOTR [46] 57.2 58.9 55.8 71.9 68.4
MeMOTR [13] 58.8 59.6 58.4 72.8 71.5
CO-MOT [42] 60.1 59.5 60.6 72.6 72.7
MOTRv3 [45] 60.2 62.1 58.7 75.9 72.4
MOTIP (ours) 59.2 62.0 56.9 75.5 71.2

Motion Estimation. Unlike many existing methods [6, 47, 29], our approach does not estimate the motion of trajectories. This poses a significant challenge, especially in crowded scenarios like MOT17 [27], where location information can help the model rule out many incorrect associations. Therefore, on MOT17 our approach does not demonstrate overwhelming superiority, as shown in Tab. 10. It is worth noting that, as discussed in Sec. 0.B.2, CO-MOT [42] and MOTRv3 [45] utilize a modified Deformable DETR [52] and a more powerful backbone (ConvNeXt-B [21]), respectively. Using ResNet-50 [15] and the original Deformable DETR, MOTIP lags only slightly behind these state-of-the-art approaches, which further demonstrates the feasibility of the proposed pipeline. In future developments, we believe introducing end-to-end motion modeling and estimation is a promising improvement, and encouragingly, some recent works [29, 12] are already advancing this direction. An illustrative sketch of such a component is given below.
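As a purely illustrative example of what a learnable, end-to-end motion component might look like (it is not part of MOTIP or of the cited works; all names and sizes are assumptions), a small head could regress each track's next box from a short history of past boxes:

```python
import torch
import torch.nn as nn

class MotionHead(nn.Module):
    """Illustrative learnable motion estimator: regress the next
    normalized box (cx, cy, w, h) from the K most recent boxes."""
    def __init__(self, history: int = 4, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(history * 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, past_boxes: torch.Tensor) -> torch.Tensor:
        # past_boxes: (num_tracks, K, 4) -> predicted boxes (num_tracks, 4)
        n, k, _ = past_boxes.shape
        return self.mlp(past_boxes.reshape(n, k * 4))
```

The predicted boxes could then serve as a location prior (e.g., an IoU-based term) when assigning IDs in crowded scenes, complementing the purely identity-driven prediction.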

Discussions. While our method has achieved remarkable results, some issues discussed above still require attention, especially in crowded scenarios. Well-established MOT approaches have spent several years refining their designs and addressing the challenges they encounter. We likewise hope that future research will address the limitations faced by MOTIP, thereby achieving further gains in tracking performance.

Acknowledgements

Ruopeng Gao would like to thank Yunzhe Lv for the kind discussion and Muyan Yang for the social support.

References

  • [1] Aharon, N., Orfaig, R., Bobrovsky, B.: Bot-sort: Robust associations multi-pedestrian tracking. CoRR abs/2206.14651 (2022)
  • [2] Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: ICCV. pp. 941–951. IEEE (2019). https://doi.org/10.1109/iccv.2019.00103
  • [3] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. 2008 (2008). https://doi.org/10.1155/2008/246309
  • [4] Bewley, A., Ge, Z., Ott, L., Ramos, F.T., Upcroft, B.: Simple online and realtime tracking. In: ICIP. pp. 3464–3468. IEEE (2016). https://doi.org/10.1109/icip.2016.7533003
  • [5] Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: Memot: Multi-object tracking with memory. In: CVPR. pp. 8080–8090. IEEE (2022). https://doi.org/10.1109/cvpr52688.2022.00792
  • [6] Cao, J., Weng, X., Khirodkar, R., Pang, J., Kitani, K.: Observation-centric SORT: rethinking SORT for robust multi-object tracking. CoRR abs/2203.14360 (2022). https://doi.org/10.1109/cvpr52729.2023.00934
  • [7] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (1). Lecture Notes in Computer Science, vol. 12346, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13
  • [8] Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: ICME. pp. 1–6. IEEE Computer Society (2018). https://doi.org/10.1109/icme.2018.8486597
  • [9] Choi, W., Savarese, S.: A unified framework for multi-target tracking and collective activity recognition. In: ECCV (4). Lecture Notes in Computer Science, vol. 7575, pp. 215–230. Springer (2012). https://doi.org/10.1007/978-3-642-33765-9_16
  • [10] Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: SportsMOT: A large multi-object tracking dataset in multiple sports scenes. In: ICCV (2023). https://doi.org/10.1109/iccv51070.2023.00910
  • [11] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I.D., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for multi object tracking in crowded scenes. CoRR abs/2003.09003 (2020)
  • [12] Dendorfer, P., Yugay, V., Osep, A., Leal-Taixé, L.: Quo vadis: Is trajectory forecasting the key towards long-term multi-object tracking? In: NeurIPS (2022)
  • [13] Gao, R., Wang, L.: MeMOTR: Long-term memory-augmented transformer for multi-object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9901–9910 (October 2023). https://doi.org/10.1109/iccv51070.2023.00908
  • [14] Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: exceeding YOLO series in 2021. CoRR abs/2107.08430 (2021)
  • [15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778. IEEE Computer Society (2016). https://doi.org/10.1109/cvpr.2016.90
  • [16] Kesa, O., Styles, O., Sanchez, V.: Multiple object tracking and forecasting: Jointly predicting current and future object locations. In: WACV (Workshops). pp. 560–569. IEEE (2022). https://doi.org/10.1109/wacvw54805.2022.00062
  • [17] Korbar, B., Zisserman, A.: End-to-end tracking with a multi-query transformer. CoRR abs/2210.14601 (2022)
  • [18] Lin, T., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV. pp. 2999–3007. IEEE Computer Society (2017)
  • [19] Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: ECCV (5). Lecture Notes in Computer Science, vol. 8693, pp. 740–755. Springer (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  • [20] Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: DAB-DETR: dynamic anchor boxes are better queries for DETR. In: ICLR. OpenReview.net (2022)
  • [21] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR. pp. 11966–11976. IEEE (2022). https://doi.org/10.1109/cvpr52688.2022.01167
  • [22] Luiten, J., Osep, A., Dendorfer, P., Torr, P.H.S., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 129(2), 548–578 (2021). https://doi.org/10.1007/s11263-020-01375-2
  • [23] Maggiolino, G., Ahmad, A., Cao, J., Kitani, K.: Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification. arXiv preprint arXiv:2302.11813 (2023)
  • [24] Mahmoudi, N., Ahadi, S.M., Rahmati, M.: Multi-target tracking using cnn-based features: CNNMTT. Multim. Tools Appl. 78(6), 7077–7096 (2019). https://doi.org/10.1007/s11042-018-6467-6
  • [25] Mancusi, G., Panariello, A., Porrello, A., Fabbri, M., Calderara, S., Cucchiara, R.: Trackflow: Multi-object tracking with normalizing flows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9531–9543 (2023). https://doi.org/10.1109/iccv51070.2023.00874
  • [26] Meinhardt, T., Kirillov, A., Leal-Taixé, L., Feichtenhofer, C.: Trackformer: Multi-object tracking with transformers. In: CVPR. pp. 8834–8844. IEEE (2022). https://doi.org/10.1109/cvpr52688.2022.00864
  • [27] Milan, A., Leal-Taixé, L., Reid, I.D., Roth, S., Schindler, K.: MOT16: A benchmark for multi-object tracking. CoRR abs/1603.00831 (2016)
  • [28] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: CVPR. pp. 164–173. Computer Vision Foundation / IEEE (2021). https://doi.org/10.1109/cvpr46437.2021.00023
  • [29] Qin, Z., Zhou, S., Wang, L., Duan, J., Hua, G., Tang, W.: Motiontrack: Learning robust short-term and long-term motions for multi-object tracking. In: CVPR. pp. 17939–17948. IEEE (2023). https://doi.org/10.1109/cvpr52729.2023.01720
  • [30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
  • [31] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I.D., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: CVPR. pp. 658–666. Computer Vision Foundation / IEEE (2019)
  • [32] Ristani, E., Solera, F., Zou, R.S., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: ECCV Workshops (2). Lecture Notes in Computer Science, vol. 9914, pp. 17–35 (2016). https://doi.org/10.1007/978-3-319-48881-3_2
  • [33] Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: Crowdhuman: A benchmark for detecting human in a crowd. CoRR abs/1805.00123 (2018)
  • [34] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: CVPR. pp. 20961–20970. IEEE (2022). https://doi.org/10.1109/cvpr52688.2022.02032
  • [35] Sun, P., Jiang, Y., Zhang, R., Xie, E., Cao, J., Hu, X., Kong, T., Yuan, Z., Wang, C., Luo, P.: Transtrack: Multiple-object tracking with transformer. CoRR abs/2012.15460 (2020)
  • [36] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
  • [37] Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: ECCV (11). Lecture Notes in Computer Science, vol. 12356, pp. 107–122. Springer (2020)
  • [38] Welch, G., Bishop, G., et al.: An introduction to the kalman filter (1995)
  • [39] Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP. pp. 3645–3649. IEEE (2017). https://doi.org/10.1109/icip.2017.8296962
  • [40] Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: CVPR. pp. 12352–12361. Computer Vision Foundation / IEEE (2021). https://doi.org/10.1109/cvpr46437.2021.01217
  • [41] Xu, Y., Ban, Y., Delorme, G., Gan, C., Rus, D., Alameda-Pineda, X.: Transcenter: Transformers with dense queries for multiple-object tracking. CoRR abs/2103.15145 (2021)
  • [42] Yan, F., Luo, W., Zhong, Y., Gan, Y., Ma, L.: Bridging the gap between end-to-end and non-end-to-end multi-object tracking (2023)
  • [43] Yang, F., Odashima, S., Masui, S., Jiang, S.: Hard to track objects with irregular motions and similar appearances? make it easier by buffering the matching space. In: WACV. pp. 4788–4797. IEEE (2023). https://doi.org/10.1109/wacv56688.2023.00478
  • [44] Yang, M., Han, G., Yan, B., Zhang, W., Qi, J., Lu, H., Wang, D.: Hybrid-sort: Weak cues matter for online multi-object tracking. CoRR abs/2308.00783 (2023)
  • [45] Yu, E., Wang, T., Li, Z., Zhang, Y., Zhang, X., Tao, W.: Motrv3: Release-fetch supervision for end-to-end multi-object tracking. CoRR abs/2305.14298 (2023)
  • [46] Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: MOTR: end-to-end multiple-object tracking with transformer. In: ECCV (27). Lecture Notes in Computer Science, vol. 13687, pp. 659–675. Springer (2022). https://doi.org/10.1007/978-3-031-19812-0_38
  • [47] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: ECCV (22). Lecture Notes in Computer Science, vol. 13682, pp. 1–21. Springer (2022). https://doi.org/10.1007/978-3-031-20047-2_1
  • [48] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 129(11), 3069–3087 (2021). https://doi.org/10.1007/s11263-021-01513-4
  • [49] Zhang, Y., Wang, T., Zhang, X.: Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors (2022). https://doi.org/10.1109/cvpr52729.2023.02112
  • [50] Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: ECCV (4). Lecture Notes in Computer Science, vol. 12349, pp. 474–490. Springer (2020)
  • [51] Zhou, X., Yin, T., Koltun, V., Krähenbühl, P.: Global tracking transformers. In: CVPR. pp. 8761–8770. IEEE (2022). https://doi.org/10.1109/cvpr52688.2022.00857
  • [52] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR. OpenReview.net (2021)