
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection

Boyong He1*, Yuxiang Ji1*, Qianwen Ye2, Zhuoyue Tan1, Liaoni Wu1,2†
1Institute of Artificial Intelligence, Xiamen University
2School of Aerospace Engineering, Xiamen University
{boyonghe, yuxiangji, Zhuoyue Tan}@stu.xmu.edu.cn
qianwenye3101@gmail.com
wuliaoni@xmu.edu.cn
*Equal contribution. †Corresponding author.
Abstract

Domain generalization (DG) for object detection aims to enhance detectors’ performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at Generalized Diffusion Detector

Figure 1: Left: Six DG benchmarks used in our paper. The sample images demonstrate substantial domain shifts between source and target distributions. Right: Compared with previous SOTA methods, our approach achieves superior performance across all 13 datasets from six benchmarks.

1 Introduction

Object detection stands as a fundamental task in computer vision and has achieved remarkable technological breakthroughs in recent years. Most object detection methods, including CNN [18, 51, 41, 61] and transformer-based [5, 82] detectors assume consistent distributions between training and testing data. However, in practice, detectors face significant challenges from domain shifts and environmental variations, and the performance often deteriorates substantially when deployed in unseen scenarios [9, 77].

Domain generalization [34, 35, 64] and adaptation [27, 40, 13, 3] methods have been developed to address these challenges. Mainstream approaches include semi-supervised learning with pseudo-labels [6, 40, 13, 3], feature alignment via adversarial training [9, 56, 67, 40], and style transfer techniques [19, 80]. However, these adaptation methods require target domain data during training, limiting their practical applications. This has led to increased interest in domain generalization through data augmentation [76, 25, 12, 33], adversarial training [36, 81], and meta-learning [1, 15]. While recent advances in foundation models like ClipGap [62] have shown promising results, building robust detectors remains challenging.

Inspired by the remarkable capabilities of diffusion models [24, 58, 53] in handling visual variations, we propose to leverage them for domain-generalized detection. We extract and fuse multi-timestep intermediate features during the diffusion process to construct a diffusion-based detector that learns domain-invariant representations. However, directly applying these models introduces significant computational overhead due to the multi-step feature extraction process, limiting their practical deployment.

To address this limitation while preserving their strong generalization advantages, we develop an efficient knowledge transfer framework that enables lightweight detectors to inherit capabilities from diffusion-based detectors. Specifically, our framework consists of feature-level alignment using correlation-based matching and object-level alignment through shared region proposals, allowing conventional detectors to learn both domain-invariant features and robust detection capabilities. Through feature and object-level alignment, conventional detectors can achieve improved generalization without increasing inference time. Our work pioneers the application of diffusion models in domain-generalized detection, demonstrating their potential in enhancing detector generalization through knowledge distillation.

We conduct comprehensive experiments on six challenging DG benchmarks as shown in Fig. 1: Cross Camera, Adverse Weather, Synthetic to Real, Real to Artistic, Diverse Weather Benchmark, Corruption Benchmark. Experimental results demonstrate that our diffusion-based detector achieves consistent improvements across these benchmarks, with average performance gains of {18.6, 15.0, 27.2, 16.4, 2.3, 4.7}% mAP compared to previous methods, even outperforming most domain adaptation methods that have access to target domain data. Moreover, through our proposed feature-level and object-level learning framework, diffusion-guided detectors obtain significant improvements of {20.8, 21.4, 24.6, 9.9, 5.3, 13.8}% mAP compared to their baselines. These results validate the effectiveness of leveraging diffusion models for domain-generalized detection and provide a promising direction for building robust detectors in real-world scenarios.

The main contributions of this work can be summarized as follows:

  • This work introduces diffusion models into domain-generalized detection. The inherent denoising mechanism and powerful representation capabilities of diffusion models are utilized to extract domain-invariant features for robust detection.

  • To address the computational overhead, we propose a simple yet effective knowledge transfer framework. This framework enables detectors to inherit strong generalization capabilities through feature-level alignment and object-level learning, maintaining efficient inference time.

  • Comprehensive experiments on six DG benchmarks demonstrate significant improvements over previous approaches in various scenarios. The findings provide valuable insights for robust visual recognition tasks.

2 Related Work

2.1 Domain generalization for object detection

Domain adaptation methods for object detection focus on adversarial feature alignment [9, 56, 67, 40] and consistency-based learning with pseudo labels [6, 40, 13, 3]. However, these approaches inherently require target domain data during training. Domain generalization methods have been explored through data augmentation [76, 25, 11, 12], adversarial training [36, 81], and meta-learning [1, 15] to enhance model robustness through style transfer and domain shift simulation. Recent works [78, 66, 33, 12, 43] extend these strategies to domain-generalized detection through multi-view learning, specialized augmentation and causal learning. Additionally, ClipGap [62] demonstrates the potential of foundation models by leveraging CLIP [48]. These advances inspire us to explore diffusion models as foundation models, harnessing their inherent generalization capabilities for domain-generalized detection.

2.2 Diffusion models and applications

Recent studies demonstrate that diffusion models [24, 58, 53, 49, 55, 16] not only excel in image generation but also exhibit unique advantages in representation learning [2, 68, 20]. The noise-adding and denoising mechanism enables effective handling of visual perturbations like noise, blur, and illumination changes [46, 60]. These properties suggest the potential of diffusion models in addressing domain generalization challenges. Specifically, the intermediate features during diffusion contain rich semantic information [46], while the denoising process naturally builds robustness against various perturbations [60]. Recent works like [2, 20] further demonstrate the effectiveness of diffusion-based representations in various vision tasks. These promising properties motivate us to leverage diffusion models for domain-generalized detection, and further inspire us to explore transferring their superior capabilities to other detectors.

3 Method

3.1 Overview

In this section, we introduce our approach for domain generalization detection. The proposed method leverages a diffusion model as feature extractor to learn domain-robust representations through its iterative process. A two-level alignment mechanism is designed to transfer knowledge from the diffusion model to standard detectors: feature-level alignment for domain-invariant patterns and object-level alignment for accurate detection across domains. The overall framework is illustrated in Fig. 3.

In the following subsections, we first present the problem formulation and introduce the diffusion model basics in Sec. 3.2. The feature extraction and fusion strategy from diffusion models is described in Sec. 3.3. Our two-level alignment approach, including feature-level alignment (Sec. 3.4) and object-level alignment (Sec. 3.5), guides standard detectors to learn robust representations while maintaining accurate detection. The overall training objectives integrating detection losses with alignment constraints are detailed in Sec. 3.6.

3.2 Preliminaries

DG for detection: Let $\mathcal{S}=\{\mathbf{x}_{s}^{i},\mathbf{y}_{s}^{i}\}_{i=1}^{N_{s}}$ denote the source domain with $N_{s}$ labeled samples, where $\mathbf{x}_{s}^{i}$ represents an image and $\mathbf{y}_{s}^{i}$ represents the corresponding bounding box annotations. Let $\mathcal{T}=\{\mathbf{x}_{t}^{j}\}_{j=1}^{N_{t}}$ denote the target domain with $N_{t}$ unlabeled images. The source and target samples are drawn from different distributions $P_{\mathcal{S}}$ and $P_{\mathcal{T}}$, with discrepancies in image distributions (e.g., style, scene), label distributions (e.g., shapes, density), and sample sizes. The goal is to learn robust representations from the labeled source domain that generalize well to the unseen target domain.

Diffusion Model: The diffusion process gradually converts source domain images $\mathbf{x}_{s}^{i}$ into pure noise through a fixed Markov chain of $T$ steps. At each time step $t\in[0,T]$, Gaussian noise is progressively added to obtain the noisy samples $\mathbf{x}_{t}$. The forward process is formulated as:

$$\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon} \tag{1}$$

where $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$, and $\bar{\alpha}_{t}$ represents the cumulative product of the noise schedule $\{\alpha_{t}\}_{t=1}^{T}$ controlling the noise magnitude. During training, a neural network $\mathcal{F}_{\theta}$ learns to estimate the added noise from noisy observations $\mathbf{x}_{t}$ conditioned on time step $t$. The network is optimized to minimize the discrepancy between predicted and actual noise. At inference, the model reverses this process by iteratively denoising pure noise through refinement steps until obtaining the final output $\hat{\mathbf{x}}_{0}$.
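
To make Eq. (1) concrete, the following PyTorch sketch adds noise to a batch of images at given timesteps. The linear beta schedule and tensor shapes are illustrative assumptions, not the exact settings of the diffusion model used in the paper.

import torch

def add_noise(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Forward process of Eq. (1): x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)        # broadcast over (B, C, H, W)
    eps = torch.randn_like(x0)                          # eps ~ N(0, I)
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
    return x_t, eps

# Illustrative linear schedule (an assumption; the exact schedule depends on the pretrained model).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.rand(2, 3, 512, 512) * 2 - 1                 # images scaled to [-1, 1]
x_t, eps = add_noise(x0, torch.tensor([100, 100]), alphas_cumprod)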

3.3 Multi-timestep Feature Extraction and Fusion

Figure 2: Overview of multi-timestep feature extraction and fusion. Multi-scale features are extracted from a frozen diffusion model at different timesteps, then processed through trainable bottleneck structures and weighted aggregation module to obtain the final hierarchical features.

Feature Extraction via Diffusion: The inherent characteristics of diffusion models' intermediate representations make them particularly suitable for domain-invariant feature learning. During the denoising process, the noise predictor $\mathcal{F}_{\theta}$ accumulates multi-scale semantic information by modeling data variations at different noise levels.

To leverage these properties, we extract and aggregate intermediate features from a sequence of time steps during the forward diffusion process. Specifically, given a source image $\mathbf{x}_{s}^{i}$, we progressively add noise following $\mathbf{x}_{s}^{i}\rightarrow\mathbf{x}_{1}\rightarrow\cdots\rightarrow\mathbf{x}_{t}$, where $t\in\{1,2,\cdots,T\}$. At each time step $t$, we extract features from the four upsampling stages of $\mathcal{F}_{\theta}$, denoted as $\mathbf{s}_{t}\in\mathbb{R}^{C_{l,k}\times H_{l,k}\times W_{l,k}}$, where for each layer $l\in\{1,2,3,4\}$ we extract three intermediate features ($k=1,2,3$) from the middle of the layer. The feature dimensions $C_{l,k}$, $H_{l,k}$, and $W_{l,k}$ are determined by the corresponding layer architecture. This process captures the representation of the transition $\mathbf{x}_{t-1}\rightarrow\mathbf{x}_{t}$ and generates a multi-timestep feature sequence $\{\mathbf{s}_{t}\}_{t=1}^{T}$ for each input image.

Multi-timestep Feature Fusion: For each timestep, features first go through individual bottleneck structures to align dimensions. A weighted aggregation module then combines features across timesteps with learnable weights as shown in Fig. 2. The fused features form a feature pyramid containing four levels with increasing channel dimensions $C_{l}\in\{C_{1},C_{2},C_{3},C_{4}\}$ where $C_{l}=256\times 2^{l-1}$, and corresponding spatial resolutions $H_{l}\times W_{l}$ where $H_{l}=H/2^{l+1}$ and $W_{l}=W/2^{l+1}$ for $l\in\{1,2,3,4\}$, with $H$ and $W$ being the height and width of the input image respectively.
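
A minimal sketch of the fusion step for one pyramid level: per-timestep bottlenecks align channels, then a learnable softmax-weighted sum aggregates across timesteps. The bottleneck layout (1x1 followed by 3x3 convolution) and the example channel count are assumptions for illustration, not the authors' exact implementation.

import torch
import torch.nn as nn

class MultiTimestepFusion(nn.Module):
    """Fuse features of one pyramid level collected at T diffusion timesteps."""
    def __init__(self, in_channels: int, out_channels: int, num_timesteps: int = 5):
        super().__init__()
        # one trainable bottleneck per timestep to align channel dimensions
        self.bottlenecks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1),
                nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            )
            for _ in range(num_timesteps)
        ])
        # learnable aggregation weights across timesteps
        self.weights = nn.Parameter(torch.zeros(num_timesteps))

    def forward(self, feats_per_timestep):
        # feats_per_timestep: list of T tensors, each of shape (B, C_in, H_l, W_l)
        aligned = [b(f) for b, f in zip(self.bottlenecks, feats_per_timestep)]
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * aligned[i] for i in range(len(aligned)))

# Usage (channel counts are assumptions): fuse T=5 maps into one 256-channel level.
fuse = MultiTimestepFusion(in_channels=1280, out_channels=256, num_timesteps=5)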

Figure 3: Overview of the proposed methods. The framework consists of a frozen diffusion detector (top) and a trainable ResNet-based detector (bottom). Knowledge transfer is achieved through feature-level alignment ($\mathcal{L}_{\text{align}}$, $\mathcal{L}_{\text{cross}}$) and object-level prediction alignment using shared RoIs ($\mathcal{L}_{\text{cls}}$ and $\mathcal{L}_{\text{reg}}$).

3.4 Feature level imitation and alignment

We construct our detector based on the features extracted from the diffusion model as described above. Specifically, we employ Faster R-CNN [18] with all default parameters unchanged and train it exclusively on the source domain to obtain $\mathcal{F}_{\textnormal{diff}}$.

Motivation: Common detectors $\mathcal{F}_{\textnormal{comm}}$ tend to overfit source domain data, limiting their generalization to unseen domains. To address this, we propose leveraging knowledge from the diffusion-based detector $\mathcal{F}_{\textnormal{diff}}$ through a two-level alignment approach. By aligning both feature distributions and object predictions, we expect $\mathcal{F}_{\textnormal{comm}}$ to learn more domain-invariant representations while preserving its detection capability.

Feature alignment: Due to the inherent feature distribution differences between diffusion-based and standard detectors, we adopt PKD [4] for feature alignment, which enables robust cross-architecture knowledge transfer through correlation-based matching.

We extract FPN features $\mathcal{M}_{\textnormal{comm}}^{l}$ and $\mathcal{M}_{\textnormal{diff}}^{l}$ from the ResNet [21] of $\mathcal{F}_{\textnormal{comm}}$ and the multi-timestep diffusion feature extraction network in $\mathcal{F}_{\textnormal{diff}}$, respectively. Both features have dimensions $\mathbb{R}^{B\times C\times H_{l}\times W_{l}}$ at pyramid level $l$, where $B$ is the batch size, $C$ is the number of channels, and $H_{l}\times W_{l}$ represents the spatial dimensions. With $\hat{\mathcal{M}}$ denoting normalized features, the alignment loss is defined as:

$$\mathcal{L}_{\textnormal{align}}=\sum_{l=1}^{L}\frac{1}{N_{l}}\|\hat{\mathcal{M}}_{\textnormal{comm}}^{l}-\hat{\mathcal{M}}_{\textnormal{diff}}^{l}\|_{2}^{2} \tag{2}$$
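
A sketch of Eq. (2): both detectors' FPN features are standardized (zero mean, unit variance per channel over spatial positions, following the PKD [4] formulation) before a mean-squared penalty; the per-channel normalization granularity is our assumption of how $\hat{\mathcal{M}}$ is computed.

import torch

def normalize_feature(feat: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # feat: (B, C, H, W); standardize each channel over the spatial dimensions
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True)
    return (feat - mean) / (std + eps)

def feature_align_loss(student_fpn, teacher_fpn) -> torch.Tensor:
    # student_fpn / teacher_fpn: lists of per-level features with matching shapes
    loss = 0.0
    for f_s, f_t in zip(student_fpn, teacher_fpn):
        # teacher (diffusion) features are frozen, so they are detached
        loss = loss + torch.mean((normalize_feature(f_s) - normalize_feature(f_t.detach())) ** 2)
    return loss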

Cross feature adaptation: To address the potential instability of direct feature alignment between heterogeneous models, we propose to feed $\mathcal{M}_{\textnormal{diff}}$ into $\mathcal{F}_{\textnormal{comm}}$'s detection heads, enabling stable adaptation while preserving the original detection pipeline. The cross feature loss is defined as:

$$\mathcal{L}_{\textnormal{cross}}=\mathcal{L}_{\textnormal{comm}}^{\textnormal{rpn}}(\mathcal{M}_{\textnormal{diff}};\theta_{\textnormal{comm}})+\mathcal{L}_{\textnormal{comm}}^{\textnormal{roi}}(\mathcal{M}_{\textnormal{diff}};\theta_{\textnormal{comm}}) \tag{3}$$

where $\mathcal{L}_{\textnormal{comm}}^{\textnormal{rpn}}$ combines objectness classification and box regression losses for the RPN, and $\mathcal{L}_{\textnormal{comm}}^{\textnormal{roi}}$ combines classification and box regression losses for the RoI head, following the standard Faster R-CNN detection losses.
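
The cross-feature term of Eq. (3) simply routes the diffusion detector's FPN features through the conventional detector's heads under the usual detection losses. A hedged sketch; `rpn_loss_fn` and `roi_loss_fn` are placeholders for whatever RPN/RoI loss interface the detector implementation exposes, not actual library calls.

def cross_feature_loss(diff_fpn_feats, gt_boxes, gt_labels, rpn_loss_fn, roi_loss_fn):
    """Eq. (3): supervise F_comm's RPN and RoI heads on the frozen diffusion features."""
    feats = [f.detach() for f in diff_fpn_feats]         # teacher features carry no gradient
    loss_rpn = rpn_loss_fn(feats, gt_boxes)              # objectness + box regression
    loss_roi = roi_loss_fn(feats, gt_boxes, gt_labels)   # classification + box regression
    return loss_rpn + loss_roi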

3.5 Domain-invariant Object-level Knowledge Transfer

Beyond feature-level alignment, we aim to enhance $\mathcal{F}_{\textnormal{comm}}$'s object detection capability in unseen domains by transferring task-relevant knowledge from $\mathcal{F}_{\textnormal{diff}}$. However, this presents two challenges: (1) target domain data is unavailable during training under the DG setting, and (2) traditional knowledge distillation methods are not directly applicable due to the heterogeneous architectures of $\mathcal{F}_{\textnormal{comm}}$ and $\mathcal{F}_{\textnormal{diff}}$.

Shared RoI Feature Propagation: Inspired by CrossKD [63], we propose to align object-level predictions through shared region proposals. We first generate candidate regions $\mathcal{R}_{\textnormal{roi}}\in\mathbb{R}^{N\times 4}$ using the RPN of $\mathcal{F}_{\textnormal{comm}}$, where $N$ is the number of proposals. These regions are used to pool features from both detectors, yielding fixed-size features $\mathcal{M}^{\textnormal{roi}}_{\textnormal{comm}}$ and $\mathcal{M}^{\textnormal{roi}}_{\textnormal{diff}}\in\mathbb{R}^{N\times d}$, where $d$ is the feature dimension. The spatially aligned features are then fed into the diffusion detector's branches:

$$\begin{aligned}
\mathbf{P}_{\textnormal{cat}}&=\mathcal{F}_{\textnormal{diff}}^{\textnormal{cls}}(\mathcal{M}^{\textnormal{roi}}_{\textnormal{diff}}),\quad\mathbf{P}_{\textnormal{bbox}}=\mathcal{F}_{\textnormal{diff}}^{\textnormal{reg}}(\mathcal{M}^{\textnormal{roi}}_{\textnormal{diff}})\\
\mathbf{Q}_{\textnormal{cat}}&=\mathcal{F}_{\textnormal{diff}}^{\textnormal{cls}}(\mathcal{M}^{\textnormal{roi}}_{\textnormal{comm}}),\quad\mathbf{Q}_{\textnormal{bbox}}=\mathcal{F}_{\textnormal{diff}}^{\textnormal{reg}}(\mathcal{M}^{\textnormal{roi}}_{\textnormal{comm}})
\end{aligned} \tag{4}$$

where $\mathbf{P}_{\textnormal{cat}},\mathbf{Q}_{\textnormal{cat}}\in\mathbb{R}^{N\times(C+1)}$ denote the class logits for $C$ object categories plus background, and $\mathbf{P}_{\textnormal{bbox}},\mathbf{Q}_{\textnormal{bbox}}\in\mathbb{R}^{N\times 4}$ represent the predicted box coordinates.
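
The shared-proposal pooling of Eq. (4) can be sketched with torchvision's roi_align. Pooling from a single feature map at a fixed stride is a simplification (a real FPN assigns proposals to levels), and `diff_cls_head` / `diff_reg_head` stand in for the frozen diffusion detector's RoI branches, assumed to handle the pooled features (including flattening) internally.

import torch
from torchvision.ops import roi_align

def shared_roi_predictions(proposals, comm_feat, diff_feat,
                           diff_cls_head, diff_reg_head,
                           output_size=7, spatial_scale=1.0 / 16):
    # proposals: (N, 5) tensor [batch_idx, x1, y1, x2, y2] produced by F_comm's RPN
    m_comm = roi_align(comm_feat, proposals, output_size, spatial_scale, aligned=True)
    m_diff = roi_align(diff_feat, proposals, output_size, spatial_scale, aligned=True)
    with torch.no_grad():                                 # P: teacher targets, no gradient
        p_cat, p_bbox = diff_cls_head(m_diff), diff_reg_head(m_diff)
    # Q: student features pushed through the same frozen heads (gradients reach m_comm)
    q_cat, q_bbox = diff_cls_head(m_comm), diff_reg_head(m_comm)
    return (p_cat, p_bbox), (q_cat, q_bbox)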

Classification Knowledge Transfer: For classification knowledge transfer, we use KL divergence with temperature scaling:

$$\mathcal{L}_{\textnormal{cls}}=\frac{1}{N}\sum_{i=1}^{N}\tau^{2}\,D_{KL}(\mathbf{Q}_{\textnormal{cat}}^{i}\|\mathbf{P}_{\textnormal{cat}}^{i}) \tag{5}$$

where $\mathbf{Q}_{\textnormal{cat}}^{i},\mathbf{P}_{\textnormal{cat}}^{i}$ are temperature-scaled softmax outputs for the $i$-th proposal, and $\tau$ is the temperature parameter.

Regression Knowledge Transfer: For box regression, we use L1 loss:

$$\mathcal{L}_{\textnormal{reg}}=\frac{1}{N}\sum_{i=1}^{N}|\mathbf{Q}_{\textnormal{bbox}}^{i}-\mathbf{P}_{\textnormal{bbox}}^{i}|_{1} \tag{6}$$

Through this object-level knowledge transfer mechanism, $\mathcal{F}_{\textnormal{comm}}$ learns domain-invariant detection capabilities from $\mathcal{F}_{\textnormal{diff}}$ via RoI-level alignment. By sharing the same detection head while processing features from different sources, we encourage $\mathcal{F}_{\textnormal{comm}}$ to extract domain-agnostic object features that generalize well across domains.
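
Eqs. (5) and (6) written out explicitly for the proposals returned by the pooling sketch above; the temperature value is a placeholder.

import torch
import torch.nn.functional as F

def object_level_kd(q_cat, p_cat, q_bbox, p_bbox, tau: float = 1.0):
    # Eq. (5): tau^2 * mean_i KL(Q_i || P_i) over temperature-scaled class distributions
    log_q = F.log_softmax(q_cat / tau, dim=-1)
    log_p = F.log_softmax(p_cat.detach() / tau, dim=-1)
    loss_cls = (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean() * tau ** 2
    # Eq. (6): per-proposal L1 distance between box predictions, averaged over proposals
    loss_reg = (q_bbox - p_bbox.detach()).abs().sum(dim=-1).mean()
    return loss_cls, loss_reg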

3.6 Joint Optimization Objective

The overall objective combines supervised detection learning with feature-level and object-level alignments:

$$\mathcal{L}_{\text{total}}=\underbrace{\mathcal{L}_{\textnormal{det}}(\mathcal{F}_{\textnormal{comm}}(\mathbf{x}_{s}),\mathbf{y}_{s})}_{\text{supervised learning on source domain}}+\lambda_{\text{feature}}\underbrace{(\mathcal{L}_{\text{align}}+\mathcal{L}_{\text{cross}})}_{\text{feature-level alignment}}+\lambda_{\text{object}}\underbrace{(\mathcal{L}_{\text{cls}}+\mathcal{L}_{\text{reg}})}_{\text{object-level alignment}} \tag{7}$$

where $\mathcal{L}_{\textnormal{det}}$ is the detection loss on source domain data $(\mathbf{x}_{s},\mathbf{y}_{s})$, with $\lambda_{\text{feature}}$ and $\lambda_{\text{object}}$ being the weights for feature-level and object-level alignment respectively.
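
A small sketch of how the terms in Eq. (7) combine during a training step, using the weights reported in Sec. 4.2 ($\lambda_{\text{feature}}=0.5$, $\lambda_{\text{object}}=1$); the individual loss values are assumed to come from routines like those sketched above.

def total_loss(loss_det, loss_align, loss_cross, loss_cls, loss_reg,
               lambda_feature: float = 0.5, lambda_object: float = 1.0):
    """Eq. (7): supervised detection + feature-level + object-level alignment."""
    return (loss_det
            + lambda_feature * (loss_align + loss_cross)
            + lambda_object * (loss_cls + loss_reg))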

4 Experiments

4.1 DG Detection Benchmarks

Cross Camera. Train on Cityscapes [10] (2,975 training images from 50 cities) and test on BDD100K [73] day-clear split with 7 shared categories following SWDA [56], evaluating generalization across diverse urban scenes.

Adverse Weather. Train on Cityscapes [10] and test on FoggyCityscapes [57] and RainyCityscapes [28] (synthesized by adding fog and rain effects), using the challenging 0.02 split setting for FoggyCityscapes to evaluate robustness under degraded visibility conditions.

Synthetic to Real. Train on Sim10K [31] (10K synthetic driving scenes rendered by GTA-V) and test on Cityscapes [10] and BDD100K [73] for the car category, examining synthetic-to-real transfer capability.

Real to Artistic. Train on VOC [17] (16,551 real-world images from 2007 and 2012) and test on Clipart [30] (1K images, 20 categories), Comic [30] (2K images, 6 categories), and Watercolor [30] (2K images, 6 categories) following [40].

Diverse Weather benchmark. Train on Daytime-Sunny (26,518 images) and test on four challenging conditions: Night-Sunny (26,158 images), Night-Rainy (2,494 images), Dusk-Rainy (3,501 images), and Daytime-Foggy (3,775 images) following [65], evaluating robustness across diverse weather and lighting scenarios. We follow settings from OADG [33] for comparison.

Corruption benchmark. A comprehensive test-only benchmark [47] with 15 different corruption types at 5 severity levels for Cityscapes [10], spanning noise, blur, weather, and digital perturbations to evaluate model robustness systematically. We follow settings from OADG [33] for comparison.

4.2 Implementation Details

Training settings: We adopt Faster R-CNN [18] with ResNet101 [21] backbone pretrained on ImageNet [54] as baseline detector. Models are trained for 20K iterations with batch size 16, learning rate 0.02 and SGD optimizer. We use EMA updated model for stable training. Other settings follow MMDetection defaults [7]. Code and trained models are provided in supplementary materials, along with additional experimental analyses, class-wise results, and more qualitative visualizations.

Evaluation metrics: We report $\text{AP}_{50}$ for individual categories and mAP across categories. For the Corruption benchmark [47], we additionally report mPC (average $\text{AP}_{50:95}$ across 15 corruptions with 5 severity levels) and rPC (the ratio between mPC and clean performance).

Domain augmentation: Following [40, 20], we employ Strong Augmentation including both color transformations (color jittering, contrast, equalization, sharpness) and spatial transformations (rotation, shear, translation). Additionally, we design domain-level augmentation strategies by applying FDA [71], Histogram Matching, and Pixel Distribution Matching between source domain images to generate diverse training samples.
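
As one example of the domain-level augmentation, the sketch below applies FDA [71] between two source images by swapping a low-frequency band of the amplitude spectrum; the band-size parameter `beta` and the assumption of images in [0, 1] are illustrative choices, not the paper's exact settings.

import torch

def fda_source_to_source(src: torch.Tensor, ref: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """Replace the low-frequency amplitude of `src` with that of `ref` (FDA-style)."""
    # src, ref: (C, H, W) float images with the same shape, values assumed in [0, 1]
    fft_src, fft_ref = torch.fft.fft2(src), torch.fft.fft2(ref)
    amp_src, pha_src = fft_src.abs(), fft_src.angle()
    amp_ref = fft_ref.abs()
    # swap a centered low-frequency square of the shifted amplitude spectra
    amp_src = torch.fft.fftshift(amp_src, dim=(-2, -1))
    amp_ref = torch.fft.fftshift(amp_ref, dim=(-2, -1))
    _, h, w = src.shape
    b = max(1, int(min(h, w) * beta))
    cy, cx = h // 2, w // 2
    amp_src[:, cy - b:cy + b, cx - b:cx + b] = amp_ref[:, cy - b:cy + b, cx - b:cx + b]
    amp_src = torch.fft.ifftshift(amp_src, dim=(-2, -1))
    mixed = torch.fft.ifft2(amp_src * torch.exp(1j * pha_src)).real
    return mixed.clamp(0.0, 1.0)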

Hyper-parameters: We set diffusion steps $T=5$ and the max timestep as 500 for artistic benchmarks and 100 for other benchmarks, as described in Sec. 3.3. The loss weights are set as $\lambda_{\text{feature}}=0.5$ and $\lambda_{\text{object}}=1$ in Eq. 7.

Table 1: Cross Camera DG and DA Results (%) on BDD100K.
Methods Bike Bus Car Motor Psn. Rider Truck mAP
DG methods (without target data)
CDSD [65] (CVPR’22) 22.9 20.5 33.8 14.7 18.5 23.6 18.2 21.7
SHADE [75] (ECCV’22) 25.1 19.0 36.8 18.4 24.1 24.9 19.8 24.0
SRCD [50] (TNNLS’24) 24.8 21.5 38.7 19.0 25.7 28.4 23.1 25.9
MAD [69] (CVPR’23) - - - - - - - 28.0
DA methods (with unlabeled target data)
TDD [23] (CVPR’22) 28.8 25.5 53.9 24.5 39.6 38.9 24.1 33.6
PT [8] (ICML’22) 28.8 33.8 52.7 23.0 40.5 39.9 25.8 34.9
SIGMA [38] (CVPR’22) 26.3 23.6 64.1 17.9 46.9 29.6 20.2 32.7
SIGMA++ [39] (TPAMI’23) 27.1 26.3 65.6 17.8 47.5 30.4 21.1 33.7
NSA [79] (ICCV’23) - - - - - - - 35.5
HT [13] (CVPR’23) 38.0 30.6 63.5 28.2 53.4 40.4 27.4 40.2
Ours (DG settings)
Diff. Detector (SD-1.5) 38.9 31.0 71.5 37.6 61.5 47.0 38.5 46.6
Diff. Detector (SD-2.1) 38.0 33.6 69.9 36.6 62.1 46.3 34.2 45.8
Diff. Guided (SD-1.5) 38.4 33.4 72.0 38.3 60.3 47.0 35.0 46.3+20.9
Diff. Guided (SD-2.1) 38.5 32.6 71.8 37.5 60.2 46.7 35.3 46.1+20.7
Table 2: Adverse Weather DG and DA Results (%) on FoggyCityscapes.
Methods Bus Bike Car Motor Psn. Rider Train Truck mAP
DG methods
FACT [70] (CVPR’21) 27.7 31.3 35.9 23.3 26.2 41.2 3.0 13.6 25.3
FSDR [29] (CVPR’22) 36.6 34.1 43.3 27.1 31.2 44.4 11.9 19.3 31.0
MAD [69] (CVPR’23) 44.0 40.1 45.0 30.3 34.2 47.4 42.4 25.6 38.6
DA methods
MGA [78] (CVPR’22) 53.2 36.9 61.5 27.9 43.1 47.3 50.3 30.2 43.8
MTTrans [74] (CVPR’22) 45.9 46.5 65.2 32.6 47.7 49.9 33.8 25.8 43.4
OADA [72] (CVPR’22) 48.5 39.8 62.9 34.3 47.8 46.5 50.9 32.1 45.4
MIC [26] (CVPR’23) 52.4 47.5 67.0 40.6 50.9 55.3 33.7 33.9 47.6
SIGMA++ [39] (TPAMI’23) 52.2 39.9 61.0 34.8 46.4 45.1 44.6 32.1 44.5
CIGAR [42] (CVPR’23) 56.6 41.3 62.1 33.7 46.1 47.3 44.3 27.8 44.9
CMT [3] (CVPR’23) 66.0 51.2 63.7 41.4 45.9 55.7 38.8 39.6 50.3
HT [13] (CVPR’23) 55.9 50.3 67.5 40.1 52.1 55.8 49.1 32.7 50.4
Ours (DG settings)
Diff. Detector (SD-1.5) 56.2 50.4 66.7 39.9 50.2 59.5 39.9 38.0 50.1
Diff. Detector (SD-2.1) 55.5 49.6 67.0 40.4 50.4 58.2 29.2 36.4 48.3
Diff. Guided (SD-1.5) 53.8 54.2 67.5 45.6 52.1 60.8 53.9 32.4 52.5+21.8
Diff. Guided (SD-2.1) 55.1 53.9 67.0 43.4 51.9 59.5 42.2 34.8 51.0+20.3
Table 3: Adverse Weather DG and DA Results (%) on RainyCityscapes.
Methods mAP
DG methods
FACT [70] (CVPR’21) 39.9
FSDR [29] (CVPR’22) 42.8
SCG [69] (CVPR’23) 39.1
MAD [69] (CVPR’23) 42.3
DA methods
MGA [78] (CVPR’22) 43.0
TDD [23] (CVPR’23) 50.3
CMT [3] (CVPR’23) 52.1
SIGMA++ [39] (TPAMI’23) 46.9
Ours (DG settings)
Diff. Detector (SD-1.5) 58.2
Diff. Detector (SD-2.1) 56.1
Diff. Guided (SD-1.5) 57.9+21.5
Diff. Guided (SD-2.1) 58.3+21.9
Table 4: Synthetic to Real DG and DA Results (%) of category car on Cityscapes and BDD100K.
Methods Cityscapes BDD100K
DG methods
CDSD [65] (CVPR’22) 35.2 27.4
SHADE [75] (CVPR’22) 40.9 30.3
SRCD [50] (TNNLS’24) 43.0 31.6
DA methods
SWDA [56] (CVPR’19) 40.7 42.9
MTTrans [74] (CVPR’22) 57.9 -
SIGMA [38] (CVPR’22) 53.7 -
TDD [23] (CVPR’22) 53.4 -
MGA [78] (CVPR’22) 54.1 -
SIGMA++ [39] (TPAMI’23) 53.7 -
CIGAR [42] (CVPR’23) 58.5 -
NSA [79] (ICCV’23) 56.3 -
Ours (DG settings)
Diff. Detector (SD-1.5) 62.8 64.4
Diff. Detector (SD-2.1) 64.5 64.1
Diff. Guided (SD-1.5) 59.7+22.3 58.2+30.0
Diff. Guided (SD-2.1) 57.3+19.9 54.5+26.3
Table 5: Generalization detection Results (%) on Diverse Weather benchmark. DF: Daytime-Foggy, DR: Dusk-Rainy, NR: Night-Rainy, NS: Night-Sunny, as described in Sec. 4.1. Class-wise results are provided in the supplementary material.
Methods DF DR NR NS Average
CDSD [65] (CVPR’22) 33.5 28.2 16.6 36.6 28.7
SHADE [75] (CVPR’22) 33.4 29.5 16.8 33.9 28.4
CLIPGap [62] (CVPR’23) 32.0 26.0 12.4 34.4 26.2
SRCD [50] (TNNLS’24) 35.9 28.8 17.0 36.7 29.6
G-NAS [66] (AAAI’24) 36.4 35.1 17.4 45.0 33.5
OA-DG [33] (AAAI’24) 38.3 33.9 16.8 38.0 31.8
DivAlign [12] (CVPR’24) 37.2 38.1 24.1 42.5 35.5
UFR [43] (CVPR’24) 39.6 33.2 19.2 40.8 33.2
Diff. Detector (SD-1.5) 43.3 42.5 27.8 47.0 40.2
Diff. Detector (SD-2.1) 44.6 41.6 23.2 46.4 39.0
Diff. Guided (SD-1.5) 44.7 37.4 21.7 48.7 38.1+13.9
Diff. Guided (SD-2.1) 44.7 37.1 20.0 49.3 37.8+13.6
Table 6: Real to Artistic DG and DA Results (%) on Clipart, Comic, Watercolor. Class-wise results are provided in the supplementary material.
Methods Clipart Comic Watercolor
DG methods
Div. (CVPR’24) 33.7 25.5 52.5
DivAlign (CVPR’24) 38.9 33.2 57.4
DA methods
SWDA (CVPR’19) 29.4 53.3
MCRA (ECCV’20) 33.5 56.0
I3Net (CVPR’21) 30.1 51.5
DBGL (ICCV’21) 29.7 53.8
AT (CVPR’22) 49.3 59.9
D-ADAPT (ICLR’22) 49.0 40.5
TIA (CVPR’22) 46.3
LODS (CVPR’22) 45.2 58.2
CIGAR (CVPR’23) 46.2
CMT (CVPR’23) 47.0
Ours (DG settings)
Diff. Detector (SD-1.5) 58.3 51.9 68.4
Diff. Detector (SD-2.1) 51.7 46.6 62.1
Diff. Guided (SD-1.5) 40.8+13.6 29.7+11.6 54.2+12.7
Diff. Guided (SD-2.1) 32.7+5.5 24.9+6.8 50.6+9.1
Table 7: Generalization detection Results (%) on Cityscapes Corruption benchmark. (mPC and rPC are defined in Sec. 4.2).
Noise Blur Weather Digital
Methods Clean Gauss. Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic JPEG Pixel mPC ↑ rPC ↑
FSCE [59] (CVPR’21) 43.1 7.4 10.2 8.2 23.3 20.3 21.5 4.8 5.6 23.6 37.1 38.0 31.9 40.0 20.4 23.2 21.0 48.7
OA-Mix [33] (AAAI’24) 42.7 7.2 9.6 7.7 22.8 18.8 21.9 5.4 5.2 23.6 37.3 38.7 31.9 40.2 20.2 22.2 20.8 48.7
OA-DG [33] (AAAI’24) 43.4 8.2 10.6 8.4 24.6 20.5 22.3 4.8 6.1 25.0 38.4 39.7 32.8 40.2 22.0 23.8 21.8 50.2
Diff. Detector (SD-1.5) 34.7 20.3 23.2 17.2 26.8 21.7 23.7 3.4 16.6 24.2 32.5 34.4 30.6 33.7 29.1 24.4 24.1 69.5
Diff. Detector (SD-2.1) 34.7 18.4 20.9 15.7 26.2 20.5 21.9 4.0 14.3 22.6 31.4 33.8 29.3 32.5 27.9 21.5 22.7 65.5
Diff. Guided (SD-1.5) 42.1 11.0 13.6 10.8 25.0 14.2 21.4 3.4 5.4 24.0 39.6 40.3 36.3 39.2 18.9 16.0 21.3+5.7 50.5+12.3
Diff. Guided (SD-2.1) 42.2 8.2 10.5 8.2 21.6 12.4 20.1 3.0 3.1 24.5 39.2 39.5 35.8 38.7 23.1 19.6 20.5+4.9 48.6+10.4

4.3 Results and Comparisons

We compare our approach against existing DG methods (target domain unseen) and DA methods (target domain unlabeled). Our results include the Diff. Detector trained solely on the source domain (Sec. 3.3), with SD-1.5 and SD-2.1 denoting different Stable Diffusion [52] versions, and Diff. Guided, which applies our alignment approach to the Faster R-CNN [9] baseline through the Diff. Detector as described in Secs. 3.4 and 3.5.

In all tables, bold and underline denote the best and second-best results, and a yellow background highlights the best average performance. "+xx" indicates the mAP (%) gain over the baseline.

We conduct extensive experiments across six challenging benchmarks, reported in Tabs. 1-7, to evaluate our method. Our comprehensive evaluations demonstrate that the diffusion detector consistently achieves SOTA performance in DG settings and even surpasses most domain adaptation methods that require target domain data. Through effective feature and object alignment, our diffusion guidance mechanism successfully enhances detector generalization under moderate domain gaps. However, its improvement becomes more limited when facing extreme domain shifts, particularly on the Real to Artistic benchmark (Tab. 6).

Table 8: Testing results and inference costs of different diffusion steps $T$.
T   BDD100K   Cityscapes (car)   Clipart   Inference Time (ms)
1 28.6 49.8 37.4 270
2 34.9 54.1 48.6 404
5 46.6 62.8 58.3 789
10 47.1 62.6 58.9 1,424
20 45.6 61.4 57.7 2,820
Figure 4: Testing results of different max timesteps.
Table 9: Comparison of Diffusion backbone and stronger models
Models Clipart Comic Watercolor DF DR NR NS
ConvNeXt-base [45] 43.6 26.6 55.1 39.7 39.2 23.4 45.9
VIT-base [14] 29.5 15.5 43.0 24.8 25.8 11.4 23.0
Swin-base [44] 30.2 18.0 42.6 37.2 38.9 22.6 42.2
MAE (VIT-base) [22] 28.1 16.7 44.4 32.5 32.6 17.0 34.3
Glip (Swin-tiny) [37] 39.2 18.7 50.4 38.5 36.3 20.3 45.5
Diffusion backbone 58.3 51.9 68.4 43.4 42.5 27.8 47.0

5 Ablation Studies

5.1 Studies on Diffusion Detector

We investigate two key parameters for extracting features from diffusion models: the number of diffusion steps $T$ and the max timestep (Sec. 3.3).

Comparison with stronger backbones: We evaluate against recent advanced models [45, 14, 44, 22, 37] under the same settings. Results in Tab. 9 demonstrate our diffusion backbone's superior generalization through effective domain-invariant feature extraction.

Analysis of diffusion steps $T$: Tab. 8 shows that larger $T$ values improve performance but increase inference time. We set $T=5$ as default for balancing accuracy and efficiency.

Analysis of max timesteps: Fig. 4 demonstrates that larger timesteps (e.g., 500) benefit benchmarks with severe domain shifts like Real-to-Artistic, while others perform well at timestep 100.

Insights on generalization: Our empirical results reveal two key findings: (1) Without noise diffusion (max timestep 0), the model inherits strong transfer capability from large-scale pre-training, similar to GLIP [37]; (2) The noise-adding and denoising process enhances generalization, with higher noise levels particularly benefiting larger domain shifts through learning domain-invariant features.

5.2 Studies on Diffusion Guided Detector

Analysis of proposed modules: Tab. 10 validates our components’ effectiveness for domain generalization. Domain augmentation yields limited gains (2.7%) for the diffusion detector but significant improvement (11.0%) for baseline, indicating diffusion detector relies more on inherent generalization while trainable-backbone detectors benefit more from diverse training samples.

Feature-level and object-level alignment improve performance by 3.5% and 6.4% respectively, showing effective representation learning under diffusion guidance. The consistent gains (6.1%) with data augmentation further demonstrate our alignment approach’s compatibility with conventional techniques.

Table 10: Ablation studies of our framework components. Settings: domain augmentation (Aug.), feature-level alignment (Fea.), and object-level alignment (Obj.). Foggy Cityscapes (F-C), Rainy Cityscapes (R-C)
Settings Aug. Fea. Obj. BDD F-C R-C DF DR NR NS
Diff. Detector
44.5 48.0 54.4 41.3 40.4 24.3 43.7
46.6+2.1 50.1+2.1 58.2+3.8 43.3+2.0 42.5+2.1 27.8+3.5 47.0+3.3
FR-R101 baseline
25.4 30.7 36.4 28.8 24.1 12.4 31.4
36.2+10.8 47.9+17.2 50.1+13.7 39.9+11.1 34.8+10.7 16.4+4.0 41.2+9.8
Diff. Guided
29.8+4.4 34.2+3.5 41.2+4.8 34.2+5.4 24.9+0.8 14.4+2.0 35.2+3.8
34.9+9.5 38.3+7.6 43.1+6.7 37.2+8.4 26.3+2.2 16.4+4.0 38.1+6.7
38.1+12.7 40.4+9.7 45.9+9.5 38.6+9.8 28.4+4.3 16.8+4.4 39.5+8.1
46.3+20.9 52.5+21.8 57.9+21.5 44.7+15.9 37.4+13.3 21.7+9.3 48.7+17.3

Analysis of $\lambda_{\text{feature}}$ and $\lambda_{\text{object}}$: As shown in Fig. 5, we investigate the impact of different weighting factors $\lambda_{\text{feature}}$ and $\lambda_{\text{object}}$. We observe that larger weights lead to improved performance on the target domain while slightly degrading the source domain performance. This trade-off suggests an inherent conflict between the model's ability to fit training data and generalize to unseen target domain data.

Figure 5: Testing results of $\lambda_{\text{feature}}$ and $\lambda_{\text{object}}$ on source domain and target domain.
Table 11: Model calibration performance with D-ECE [32] metric.
D-ECE (%) ↓
Detector BDD100K Cityscapes (car) Clipart
FR-R101 Baseline 10.9 20.2 10.8
Diff. Detector 8.5 8.7 5.5
Figure 6: Reliability Diagram on different target domains. Curves closer to the diagonal indicate better performance.

5.3 Model Calibration Performance for DG

Our diffusion detector demonstrates superior confidence calibration compared to FR-R101 baseline (Fig. 6, Tab. 11). The Diff. Detector reduces D-ECE [32] to 8.5%, 8.7%, and 5.5% on BDD100K, Cityscapes, and Clipart datasets respectively. As shown in Fig. 6, our calibration curves more closely follow the diagonal line, indicating better alignment between predicted confidence scores and empirical accuracies across domains.
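
For completeness, a simplified, confidence-only calibration error in the spirit of D-ECE [32] (the full metric additionally bins over box properties): detections are grouped into confidence bins and the gap between mean confidence and empirical precision is accumulated.

import torch

def detection_ece(confidences: torch.Tensor, is_true_positive: torch.Tensor, n_bins: int = 10) -> float:
    """Confidence-only calibration error over detections (simplified, not the full D-ECE)."""
    ece, total = 0.0, confidences.numel()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            avg_conf = confidences[mask].mean()            # mean predicted confidence in the bin
            precision = is_true_positive[mask].float().mean()  # empirical precision in the bin
            ece += (mask.sum().item() / total) * (avg_conf - precision).abs().item()
    return ece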

5.4 Limitations

Despite achieving strong results across various benchmarks, our diffusion detector faces efficiency challenges. The large parameter count and multi-step denoising process incur substantial computational costs (Tab. 8), limiting its applicability to larger-scale scenarios.

Moreover, while leveraging diffusion models’ generalization capability, performance gains remain limited under severe domain shifts (Tab. 6). Future work could explore more efficient architectures and effective learning strategies to better handle extreme domain gaps.

6 Conclusion

This paper addresses DG detection through two key contributions. First, we propose a diffusion detector that extracts domain-invariant representations by fusing multi-step features during diffusion. Second, to enable other detectors to benefit from such generalization capability, we develop a diffusion-guided detector framework that transfers knowledge to conventional detectors through feature and object level alignment. Extensive evaluations on six domain generalization benchmarks demonstrate substantial improvements across different domains and corruption types. Our work not only provides an effective solution for domain-generalized detection but also opens up new possibilities for leveraging diffusion models to enhance visual recognition robustness.

References

  • Balaji et al. [2018] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. Advances in neural information processing systems, 31, 2018.
  • Baranchuk et al. [2022] Dmitry Baranchuk, Andrey Voynov, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. In International Conference on Learning Representations, 2022.
  • Cao et al. [2023] Shengcao Cao, Dhiraj Joshi, Liang-Yan Gui, and Yu-Xiong Wang. Contrastive mean teacher for domain adaptive object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23839–23848, 2023.
  • Cao et al. [2022] Weihan Cao, Yifan Zhang, Jianfei Gao, Anda Cheng, Ke Cheng, and Jian Cheng. Pkd: General distillation framework for object detectors via pearson correlation coefficient. Advances in Neural Information Processing Systems, 35:15394–15406, 2022.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • Chen et al. [2020] Chaoqi Chen, Zebiao Zheng, Xinghao Ding, Yue Huang, and Qi Dou. Harmonizing transferability and discriminability for adapting object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8869–8878, 2020.
  • Chen et al. [2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • Chen et al. [2022] Meilin Chen, Weijie Chen, Shicai Yang, Jie Song, Xinchao Wang, Lei Zhang, Yunfeng Yan, Donglian Qi, Yueting Zhuang, Di Xie, et al. Learning domain adaptive object detection with probabilistic teacher. In International Conference on Machine Learning, pages 3040–3055. PMLR, 2022.
  • Chen et al. [2018] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3339–3348, 2018.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • Cubuk et al. [2020] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
  • Danish et al. [2024] Muhammad Sohail Danish, Muhammad Haris Khan, Muhammad Akhtar Munir, M Saquib Sarfraz, and Mohsen Ali. Improving single domain-generalized object detection: A focus on diversification and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17732–17742, 2024.
  • Deng et al. [2023] Jinhong Deng, Dongli Xu, Wen Li, and Lixin Duan. Harmonious teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23829–23838, 2023.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Du et al. [2020] Yingjun Du, Jun Xu, Huan Xiong, Qiang Qiu, Xiantong Zhen, Cees GM Snoek, and Ling Shao. Learning to learn with variational information bottleneck for domain generalization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 200–216. Springer, 2020.
  • Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  • Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
  • Girshick [2015] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • He et al. [2024] Boyong He, Yuxiang Ji, Zhuoyue Tan, and Liaoni Wu. Diffusion domain teacher: Diffusion guided domain adaptive object detector. In ACM Multimedia 2024, 2024.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2022a] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022a.
  • He et al. [2022b] Mengzhe He, Yali Wang, Jiaxi Wu, Yiru Wang, Hanqing Li, Bo Li, Weihao Gan, Wei Wu, and Yu Qiao. Cross domain object detection by target-perceived dual branch distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9570–9580, 2022b.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Hong et al. [2021] Minui Hong, Jinwoo Choi, and Gunhee Kim. Stylemix: Separating content and style for enhanced data augmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14862–14870, 2021.
  • Hoyer et al. [2023] Lukas Hoyer, Dengxin Dai, Haoran Wang, and Luc Van Gool. Mic: Masked image consistency for context-enhanced domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11721–11732, 2023.
  • Hsu et al. [2020] Cheng-Chun Hsu, Yi-Hsuan Tsai, Yen-Yu Lin, and Ming-Hsuan Yang. Every pixel matters: Center-aware feature alignment for domain adaptive object detector. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 733–748. Springer, 2020.
  • Hu et al. [2019] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, and Pheng-Ann Heng. Depth-attentional features for single-image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Huang et al. [2021] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. Fsdr: Frequency space domain randomization for domain generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6891–6902, 2021.
  • Inoue et al. [2018] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5001–5009, 2018.
  • Johnson-Roberson et al. [2017] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017.
  • Kuppers et al. [2020] Fabian Kuppers, Jan Kronenberger, Amirhossein Shantia, and Anselm Haselhoff. Multivariate confidence calibration for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 326–327, 2020.
  • Lee et al. [2024] Wooju Lee, Dasol Hong, Hyungtae Lim, and Hyun Myung. Object-aware domain generalization for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2947–2955, 2024.
  • Li et al. [2019] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1446–1455, 2019.
  • Li et al. [2018a] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5400–5409, 2018a.
  • Li et al. [2018b] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5400–5409, 2018b.
  • Li et al. [2022a] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022a.
  • Li et al. [2022b] Wuyang Li, Xinyu Liu, and Yixuan Yuan. Sigma: Semantic-complete graph matching for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5291–5300, 2022b.
  • Li et al. [2023] Wuyang Li, Xinyu Liu, and Yixuan Yuan. Sigma++: Improved semantic-complete graph matching for domain adaptive object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Li et al. [2022c] Yu-Jhe Li, Xiaoliang Dai, Chih-Yao Ma, Yen-Cheng Liu, Kan Chen, Bichen Wu, Zijian He, Kris Kitani, and Peter Vajda. Cross-domain adaptive teacher for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7581–7590, 2022c.
  • Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • Liu et al. [2023] Yabo Liu, Jinghua Wang, Chao Huang, Yaowei Wang, and Yong Xu. Cigar: Cross-modality graph reasoning for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23776–23786, 2023.
  • Liu et al. [2024] Yajing Liu, Shijun Zhou, Xiyao Liu, Chunhui Hao, Baojie Fan, and Jiandong Tian. Unbiased faster r-cnn for single-source domain generalized object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28838–28847, 2024.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
  • Luo et al. [2024] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Advances in Neural Information Processing Systems, 36, 2024.
  • Michaelis et al. [2019] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
  • Rao et al. [2024] Zhijie Rao, Jingcai Guo, Luyao Tang, Yue Huang, Xinghao Ding, and Song Guo. Srcd: Semantic reasoning with compound domains for single-domain generalized object detection. IEEE Transactions on Neural Networks and Learning Systems, 2024.
  • Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Saito et al. [2019] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6956–6965, 2019.
  • Sakaridis et al. [2018] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126:973–992, 2018.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Sun et al. [2021] Bo Sun, Banghuai Li, Shengcai Cai, Ye Yuan, and Chi Zhang. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7352–7362, 2021.
  • Tang et al. [2023] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36:1363–1389, 2023.
  • Tian et al. [2020] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: A simple and strong anchor-free object detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1922–1933, 2020.
  • Vidit et al. [2023] Vidit Vidit, Martin Engilberge, and Mathieu Salzmann. Clip the gap: A single domain generalization approach for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3219–3229, 2023.
  • Wang et al. [2024] Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, and Qibin Hou. Crosskd: Cross-head knowledge distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16520–16530, 2024.
  • Wang et al. [2020] Shujun Wang, Lequan Yu, Caizi Li, Chi-Wing Fu, and Pheng-Ann Heng. Learning from extrinsic and intrinsic supervisions for domain generalization. In European Conference on Computer Vision, pages 159–176. Springer, 2020.
  • Wu and Deng [2022] Aming Wu and Cheng Deng. Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 847–856, 2022.
  • Wu et al. [2024] Fan Wu, Jinling Gao, Lanqing Hong, Xinbing Wang, Chenghu Zhou, and Nanyang Ye. G-nas: Generalizable neural architecture search for single domain generalization object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5958–5966, 2024.
  • Xu et al. [2020] Chang-Dong Xu, Xing-Ran Zhao, Xin Jin, and Xiu-Shen Wei. Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11724–11733, 2020.
  • Xu et al. [2023a] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023a.
  • Xu et al. [2023b] Mingjun Xu, Lingyun Qin, Weijie Chen, Shiliang Pu, and Lei Zhang. Multi-view adversarial discriminator: Mine the non-causal factors for object detection in unseen domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8103–8112, 2023b.
  • Xu et al. [2021] Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, and Qi Tian. A fourier-based framework for domain generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14383–14392, 2021.
  • Yang and Soatto [2020] Yanchao Yang and Stefano Soatto. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4085–4095, 2020.
  • Yoo et al. [2022] Jayeon Yoo, Inseop Chung, and Nojun Kwak. Unsupervised domain adaptation for one-stage object detector using offsets to bounding box. In European Conference on Computer Vision, pages 691–708. Springer, 2022.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
  • Yu et al. [2022] Jinze Yu, Jiaming Liu, Xiaobao Wei, Haoyi Zhou, Yohei Nakata, Denis Gudovskiy, Tomoyuki Okuno, Jianxin Li, Kurt Keutzer, and Shanghang Zhang. Mttrans: Cross-domain object detection with mean teacher transformer. In European Conference on Computer Vision, pages 629–645. Springer, 2022.
  • Zhao et al. [2022] Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, and Gim Hee Lee. Style-hallucinated dual consistency learning for domain generalized semantic segmentation. In European conference on computer vision, pages 535–552. Springer, 2022.
  • Zhou et al. [2021] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. arXiv preprint arXiv:2104.02008, 2021.
  • Zhou et al. [2022a] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415, 2022a.
  • Zhou et al. [2022b] Wenzhang Zhou, Dawei Du, Libo Zhang, Tiejian Luo, and Yanjun Wu. Multi-granularity alignment domain adaptation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9581–9590, 2022b.
  • Zhou et al. [2023] Wenzhang Zhou, Heng Fan, Tiejian Luo, and Libo Zhang. Unsupervised domain adaptive detection with network stability analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6986–6995, 2023.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • Zhu et al. [2022] Wei Zhu, Le Lu, Jing Xiao, Mei Han, Jiebo Luo, and Adam P Harrison. Localized adversarial domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7108–7118, 2022.
  • Zhu et al. [2021] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.