
A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS

Published as a journal paper in Machine Learning and Knowledge Extraction. Juan Terven, Instituto Politécnico Nacional, CICATA-Qro. Diana Cordova-Esparza, Universidad Autónoma de Querétaro, Facultad de Informática.

Abstract

YOLO has become a central real-time object detection system for robotics, driverless cars, and video monitoring applications. We present a comprehensive analysis of YOLO's evolution, examining the innovations and contributions in each iteration from the original YOLO up to YOLOv8, YOLO-NAS, and YOLO with Transformers. We start by describing the standard metrics and postprocessing; then, we discuss the major changes in network architecture and training tricks for each model. Finally, we summarize the essential lessons from YOLO's development and provide a perspective on its future, highlighting potential research directions to enhance real-time object detection systems.

Keywords: YOLO · Object detection · Deep learning · Computer vision

1 Introduction

Real-time object detection has emerged as a critical component in numerous applications, spanning fields such as autonomous vehicles, robotics, video surveillance, and augmented reality. Among the different object detection algorithms, the YOLO (You Only Look Once) framework has stood out for its remarkable balance of speed and accuracy, enabling the rapid and reliable identification of objects in images. Since its inception, the YOLO family has evolved through multiple iterations, each building upon the previous versions to address limitations and enhance performance (see Figure 1). This paper aims to provide a comprehensive review of the YOLO framework's development, from the original YOLOv1 to the latest YOLOv8, elucidating the key innovations, differences, and improvements across each version.
In addition to the YOLO framework, the field of object detection and image processing has developed several other notable approaches. R-CNN (Region-based Convolutional Neural Network) [1] and its successors, Fast R-CNN [2] and Faster R-CNN [3], have played a pivotal role in advancing object detection accuracy. These methods rely on a two-stage process, where selective search generates region proposals and convolutional neural networks classify and refine these regions. Another significant approach is the Single Shot MultiBox Detector (SSD) [4], which, similar to YOLO, focuses on speed and efficiency by eliminating the separate region proposal step. In addition, methods like Mask R-CNN [5] extend these capabilities to instance segmentation, enabling precise object localization and pixel-level segmentation. These developments, alongside others such as RetinaNet [6] and EfficientDet [7], have collectively contributed to the diversification of object detection algorithms. Each approach presents unique trade-offs between speed, accuracy, and complexity, catering to different application requirements and computational constraints.
Figure 1: A timeline of YOLO versions.
Other good reviews include [8, 9, 10]. However, the review in [8] covers only up to YOLOv3, and [9] up to YOLOv4, leaving out the most recent developments. Unlike [10], our paper presents the detailed architectures of most of the YOLO models and covers other variants such as YOLOX, PP-YOLO, YOLO with transformers, and YOLO-NAS.
This paper begins by exploring the foundational concepts and architecture of the original YOLO model, which set the stage for the subsequent advances in the YOLO family. Next, we delve into the refinements and enhancements introduced in each version, ranging from YOLOv2 to YOLOv8. These improvements encompass aspects such as network design, loss function modifications, anchor box adaptations, and input resolution scaling. By examining these developments, we aim to offer a comprehensive understanding of the YOLO framework's evolution and its implications for object detection.
In addition to discussing the specific advancements of each YOLO version, the paper highlights the trade-offs between speed and accuracy that have emerged throughout the framework's development. This underscores the importance of considering the context and requirements of specific applications when selecting the most appropriate YOLO model. Finally, we envision the future directions of the YOLO framework, touching upon potential avenues for further research and development that will shape the ongoing progress of real-time object detection systems.

2 YOLO Applications Across Diverse Fields

YOLO's real-time object detection capabilities have been invaluable in autonomous vehicle systems, enabling the quick identification and tracking of objects such as vehicles, pedestrians [11, 12], bicycles, and other obstacles [13, 14, 15, 16]. These capabilities have been applied in numerous fields, including action recognition [17] in video sequences for surveillance [18], sports analysis [19], and human-computer interaction [20].
YOLO models have been used in agriculture to detect and classify crops [21, 22], pests, and diseases [23], assisting precision agriculture techniques and automating farming processes. They have also been adapted for face detection tasks in biometrics, security, and facial recognition systems [24, 25].
In the medical field, YOLO has been employed for cancer detection [26, 27], skin segmentation [28], and pill identification [29], leading to improved diagnostic accuracy and more efficient treatment processes. In remote sensing, it has been used for object detection and classification in satellite and aerial imagery, aiding land use mapping, urban planning, and environmental monitoring [30, 31, 32, 33].
Security systems have integrated YOLO models for real-time monitoring and analysis of video feeds, allowing the rapid detection of suspicious activities [34], social distancing, and face mask detection [35]. The models have also been applied in surface inspection to detect defects and anomalies, enhancing quality control in manufacturing and production processes [36, 37, 38]. In traffic applications, YOLO models have been utilized for tasks such as license plate detection [39] and traffic sign recognition [40], contributing to the development of intelligent transportation systems and traffic management solutions. They have been employed in wildlife detection and monitoring to identify endangered species for biodiversity conservation and ecosystem management [41]. Lastly, YOLO has been widely used in robotic applications [42, 43] and object detection from drones [44, 45].
Figure 2 shows a bibliometric network visualization of all the papers found in Scopus with the word YOLO in the title, filtered by the object detection keyword; we then manually filtered the papers related to applications.

Figure 2: Bibliometric network visualization of the main YOLO applications, created with [?].

3 Object Detection Metrics and Non-Maximum Suppression (NMS)

Average Precision (AP), traditionally called Mean Average Precision (mAP), is the commonly used metric for evaluating the performance of object detection models. It measures the average precision across all categories, providing a single value to compare different models. The COCO dataset makes no distinction between AP and mAP; in the rest of this paper, we will refer to this metric as AP. In YOLOv1 and YOLOv2, the dataset used for training and benchmarking was PASCAL VOC 2007 and VOC 2012 [46]. However, starting with YOLOv3, the dataset used is Microsoft COCO (Common Objects in Context) [47]. The AP is calculated differently for these datasets. The following sections discuss the rationale behind AP and explain how it is computed.

3.1 How Does AP Work?

The AP metric is based on precision-recall metrics, handles multiple object categories, and defines a positive prediction using Intersection over Union (IoU).
Precision and Recall: Precision measures the accuracy of the model's positive predictions, while recall measures the proportion of actual positive cases that the model correctly identifies. There is often a trade-off between precision and recall; for example, increasing the number of detected objects (higher recall) can result in more false positives (lower precision). To account for this trade-off, the AP metric incorporates the precision-recall curve, which plots precision against recall for different confidence thresholds. By considering the area under the precision-recall curve, the metric provides a balanced assessment of precision and recall.
Handling multiple object categories: Object detection models must identify and localize multiple object categories in an image. The AP metric addresses this by calculating each category's average precision (AP) separately and then taking the mean of these APs across all categories (which is why it is also called mean average precision). This approach ensures that the model's performance is evaluated for each category individually, providing a more comprehensive assessment of the model's overall performance.
Intersection over Union: Object detection aims to accurately localize objects in images by predicting bounding boxes. The AP metric incorporates the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes. IoU is the ratio of the intersection area to the union area of the predicted bounding box and the ground truth bounding box (see Figure 3). It measures the overlap between the ground truth and predicted bounding boxes. The COCO benchmark considers multiple IoU thresholds to evaluate the model's performance at different levels of localization accuracy.
Figure 3: Intersection over Union (IoU). a) The IoU is calculated by dividing the intersection of the two boxes by their union; b) examples of three different IoU values for different box locations.
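To make the IoU definition above concrete, here is a minimal Python sketch (the function name and the [x1, y1, x2, y2] corner convention are our own choices, not from the paper):

    def iou(box_a, box_b):
        # Boxes are given as [x1, y1, x2, y2] corner coordinates.
        inter_x1 = max(box_a[0], box_b[0])
        inter_y1 = max(box_a[1], box_b[1])
        inter_x2 = min(box_a[2], box_b[2])
        inter_y2 = min(box_a[3], box_b[3])
        # The intersection is zero when the boxes do not overlap.
        inter = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)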

3.2 Computing AP

The AP is computed differently in the VOC and COCO datasets. In this section, we describe how it is computed on each dataset.

VOC Dataset

This dataset includes 20 object categories. To compute the AP in VOC, we follow these steps:
  1. For each category, calculate the precision-recall curve by varying the confidence threshold of the model's predictions.
  2. Calculate each category's average precision (AP) using an interpolated 11-point sampling of the precision-recall curve (a minimal sketch of this computation is given after this list).
  3. Calculate the final average precision (AP) by taking the mean of the APs across all 20 categories.
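The 11-point interpolated AP for a single class could be computed as in the following sketch, assuming precision and recall arrays ordered by decreasing confidence (the helper name is ours):

    import numpy as np

    def voc_ap_11_point(recall, precision):
        # recall, precision: NumPy arrays for one class, ordered by decreasing confidence.
        ap = 0.0
        for t in np.linspace(0.0, 1.0, 11):                   # recall levels 0.0, 0.1, ..., 1.0
            mask = recall >= t
            p = precision[mask].max() if mask.any() else 0.0  # interpolated precision at level t
            ap += p / 11.0
        return ap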

Microsoft COCO Dataset

This dataset includes 80 object categories and uses a more complex method for calculating AP. Instead of an 11-point interpolation, it uses a 101-point interpolation, i.e., it computes the precision at 101 recall thresholds from 0 to 1 in increments of 0.01. In addition, the AP is obtained by averaging over multiple IoU values instead of just one; an exception is $\mathrm{AP}_{50}$, which is the AP for a single IoU threshold of 0.5. The steps for computing AP in COCO are the following:
  1. For each category, calculate the precision-recall curve by varying the confidence threshold of the model's predictions.
  2. Calculate each category's average precision (AP) using 101 recall thresholds.
  3. Calculate AP at different Intersection over Union (IoU) thresholds, typically from 0.5 to 0.95 with a step size of 0.05. A higher IoU threshold requires a more accurate prediction to be counted as a true positive.
  4. For each IoU threshold, take the mean of the APs across all 80 categories.
  5. Finally, compute the overall AP by averaging the AP values calculated at each IoU threshold (a sketch of this averaging is shown after this list).
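A minimal sketch of the final COCO-style averaging, assuming a hypothetical helper ap_at(iou_thr, class_id) that returns the 101-point AP of one class at one IoU threshold:

    import numpy as np

    iou_thresholds = np.arange(0.5, 1.0, 0.05)   # 0.50, 0.55, ..., 0.95
    num_classes = 80

    # Mean over classes at each IoU threshold, then mean over thresholds.
    # ap_at(t, c) is a placeholder for the per-class, per-threshold AP computation.
    ap_per_threshold = [
        np.mean([ap_at(t, c) for c in range(num_classes)]) for t in iou_thresholds
    ]
    coco_ap = float(np.mean(ap_per_threshold))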
The differences in AP calculation make it hard to directly compare the performance of object detection models across the two datasets. The current standard uses the COCO AP because it provides a more fine-grained evaluation of how a model performs at different IoU thresholds.

3.3 Non-Maximum Suppression (NMS)

Non-Maximum Suppression (NMS) is a post-processing technique used in object detection algorithms to reduce the number of overlapping bounding boxes and improve the overall detection quality. Object detection algorithms typically generate multiple bounding boxes around the same object with different confidence scores. NMS filters out redundant and irrelevant bounding boxes, keeping only the most accurate ones. Algorithm 1 describes the procedure. Figure 4 shows the typical output of an object detection model containing multiple overlapping bounding boxes and the output after NMS.
Algorithm 1 Non-Maximum Suppression Algorithm
Require: Set of predicted bounding boxes \(B\), confidence scores \(S\), IoU threshold \(\tau\), confidence threshold \(T\)
Ensure: Set of filtered bounding boxes \(F\)
    \(F \leftarrow \emptyset\)
    Filter the boxes: \(B \leftarrow\{b \in B \mid S(b) \geq T\}\)
    Sort the boxes \(B\) by their confidence scores in descending order
    while \(B \neq \emptyset\) do
        Select the box \(b\) with the highest confidence score
        Add \(b\) to the set of final boxes \(F: F \leftarrow F \cup\{b\}\)
        Remove \(b\) from the set of boxes \(B: B \leftarrow B-\{b\}\)
        for all remaining boxes \(r\) in \(B\) do
            Calculate the IoU between \(b\) and \(r\): \(iou \leftarrow \operatorname{IoU}(b, r)\)
            if \(iou \geq \tau\) then
                Remove \(r\) from the set of boxes \(B: B \leftarrow B-\{r\}\)
            end if
        end for
    end while
Figure 4: Non-Maximum Suppression (NMS). a) Typical output of an object detection model containing multiple overlapping boxes. b) Output after NMS.
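As a companion to Algorithm 1, here is a minimal Python sketch of the same greedy procedure (it reuses the iou helper sketched in Section 3.1; the function name and data layout are our own):

    def nms(boxes, scores, iou_thr, conf_thr):
        # boxes: list of [x1, y1, x2, y2]; scores: matching confidence scores.
        candidates = [(b, s) for b, s in zip(boxes, scores) if s >= conf_thr]
        candidates.sort(key=lambda bs: bs[1], reverse=True)   # highest confidence first
        kept = []
        while candidates:
            best, best_score = candidates.pop(0)              # box with top confidence
            kept.append((best, best_score))
            # Discard remaining boxes that overlap the selected box too much.
            candidates = [(b, s) for b, s in candidates if iou(best, b) < iou_thr]
        return kept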
We are now ready to start describing the different YOLO models.

4 YOLO: You Only Look Once

YOLO, by Joseph Redmon et al., was published at CVPR 2016 [48]. It presented for the first time a real-time end-to-end approach for object detection. The name YOLO stands for "You Only Look Once", referring to the fact that it accomplishes the detection task with a single pass of the network, as opposed to previous approaches that either used sliding windows followed by a classifier that had to run hundreds or thousands of times per image, or more advanced methods that divided the task into two steps, where the first step detects possible regions with objects (region proposals) and the second step runs a classifier on the proposals. In addition, YOLO used a more straightforward output based only on regression to predict the detection outputs, as opposed to Fast R-CNN [2], which used two separate outputs: a classification for the probabilities and a regression for the box coordinates.

4.1 How Does YOLOv1 Work?

YOLOv1 unified the object detection steps by detecting all the bounding boxes simultaneously. To accomplish this, YOLO divides the input image into an $S \times S$ grid and predicts $B$ bounding boxes of the same class, along with its confidence for $C$ different classes per grid element. Each bounding box prediction consists of five values: $P_c, b_x, b_y, b_h, b_w$, where $P_c$ is the confidence score of the box, reflecting how confident the model is that the box contains an object and how accurate the box is. The $b_x$ and $b_y$ coordinates are the center of the box relative to the grid cell, and $b_h$ and $b_w$ are the height and width of the box relative to the full image. The output of YOLO is a tensor of $S \times S \times (B \times 5 + C)$, optionally followed by non-maximum suppression (NMS) to remove duplicate detections. In the original YOLO paper, the authors used the PASCAL VOC dataset [46], which contains 20 classes ($C = 20$), a $7 \times 7$ grid ($S = 7$), and at most two bounding boxes per grid element ($B = 2$), giving a $7 \times 7 \times 30$ output prediction.
Figure 5 shows a simplified output vector considering a 3 × 3 grid, three classes, and a single class per grid element, giving eight values per grid element. In this simplified case, the output of YOLO would be $3 \times 3 \times 8$.
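These output sizes can be checked with a trivial sketch (not from the paper):

    def yolo_v1_output_shape(S, B, C):
        # Each grid cell predicts B boxes (5 values each) plus C class probabilities.
        return (S, S, B * 5 + C)

    print(yolo_v1_output_shape(7, 2, 20))  # (7, 7, 30), the PASCAL VOC setup
    print(yolo_v1_output_shape(3, 1, 3))   # (3, 3, 8), the simplified case of Figure 5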
YOLOv1 achieved an average precision (AP) of 63.4% on the PASCAL VOC 2007 dataset.
Figure 5: YOLO output prediction. The figure depicts a simplified YOLO model with a three-by-three grid, three classes, and a single class prediction per grid element, producing a vector of eight values.
Table 1: YOLOv1 architecture. The architecture comprises 24 convolutional layers combining $3 \times 3$ convolutions with $1 \times 1$ convolutions for channel reduction. The output is a fully connected layer that generates a $7 \times 7$ grid with 30 values per grid cell to accommodate ten bounding box coordinates (2 boxes) and the 20 classes.
| | Type | Filters | Size/Stride | Output |
| :--- | :--- | :--- | :--- | :--- |
| | Conv | 64 | 7×7/2 | 224×224 |
| | Max Pool | | 2×2/2 | 112×112 |
| | Conv | 192 | 3×3/1 | 112×112 |
| | Max Pool | | 2×2/2 | 56×56 |
| 1× | Conv | 128 | 1×1/1 | 56×56 |
| | Conv | 256 | 3×3/1 | 56×56 |
| | Conv | 256 | 1×1/1 | 56×56 |
| | Conv | 512 | 3×3/1 | 56×56 |
| | Max Pool | | 2×2/2 | 28×28 |
| 4× | Conv | 256 | 1×1/1 | 28×28 |
| | Conv | 512 | 3×3/1 | 28×28 |
| | Conv | 512 | 1×1/1 | 28×28 |
| | Conv | 1024 | 3×3/1 | 28×28 |
| | Max Pool | | 2×2/2 | 14×14 |
| 2× | Conv | 512 | 1×1/1 | 14×14 |
| | Conv | 1024 | 3×3/1 | 14×14 |
| | Conv | 1024 | 3×3/1 | 14×14 |
| | Conv | 1024 | 3×3/2 | 7×7 |
| | Conv | 1024 | 3×3/1 | 7×7 |
| | Conv | 1024 | 3×3/1 | 7×7 |
| | FC | | 4096 | 4096 |
| | Dropout 0.5 | | | 4096 |
| | FC | | 7×7×30 | 7×7×30 |

4.2 YOLOv1 Architecture

The YOLOv1 architecture comprises 24 convolutional layers followed by two fully connected layers that predict the bounding box coordinates and probabilities. All layers use leaky rectified linear unit activations [49], except for the last layer, which uses a linear activation function. Inspired by GoogLeNet [50] and Network in Network [51], YOLO uses $1 \times 1$ convolutional layers to reduce the number of feature maps and keep the number of parameters relatively low. Table 1 describes the YOLOv1 architecture. The authors also introduced a lighter model called Fast YOLO, composed of nine convolutional layers.

4.3 YOLOv1 Training

The authors pre-trained the first layers of YOLO at a resolution of $224 \times 224$ using the ImageNet dataset [52]. Then, they added the last four layers with randomly initialized weights and fine-tuned the model with the PASCAL VOC 2007 and VOC 2012 datasets [46] at a resolution of $448 \times 448$ to increase the detail for more accurate object detection. For augmentation, the authors used random scaling and translations of at most 20% of the input image size, as well as random exposure and saturation with an upper-end factor of 1.5 in the HSV color space. YOLOv1 used a loss function composed of multiple sum-squared errors, as shown in Figure 6. In the loss function, $\lambda_{\text{coord}} = 5$ is a scale factor that gives more importance to the bounding box predictions, and $\lambda_{\text{noobj}} = 0.5$ is a scale factor that decreases the importance of the boxes that do not contain objects. The first two terms of the loss represent the localization loss; they compute the error in the predicted bounding box locations ($x, y$) and sizes ($w, h$). Note that these errors are only computed for the boxes containing objects (indicated by $\mathbb{1}_{ij}^{obj}$), penalizing only when an object is present in that grid cell. The third and fourth loss terms represent the confidence loss; the third term measures the confidence error when the object is detected in the box ($\mathbb{1}_{ij}^{obj}$), and the fourth term measures the confidence error when the object is not detected in the box ($\mathbb{1}_{ij}^{noobj}$). Since most boxes are empty, this loss is weighted down by the $\lambda_{\text{noobj}}$ term. The final loss component is the classification loss, which measures the squared error of the class conditional probabilities for each class only if an object appears in the cell ($\mathbb{1}_{i}^{obj}$).
Figure 6: YOLO cost function, comprising the localization loss for the bounding box coordinates, the confidence loss for object presence or absence, and the classification loss for category prediction accuracy.
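For reference, the cost function in Figure 6 corresponds to the sum-squared error loss of the original YOLO paper [48], which can be written as:

$$
\begin{aligned}
\mathcal{L} = \; & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+ \; & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+ \; & \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
+ \; & \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$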

4.4 Strengths and Limitations of YOLOv1

The simple architecture of YOLO, along with its novel full-image one-shot regression, made it much faster than the existing object detectors, allowing real-time performance.
However, while YOLO performed faster than any object detector, its localization error was larger compared with state-of-the-art methods such as Fast R-CNN [2]. There were three major causes of this limitation:
  1. It could only detect at most two objects of the same class in a grid cell, limiting its ability to predict nearby objects.
  2. It struggled to predict objects with aspect ratios not seen in the training data.
  3. It learned from coarse object features due to the downsampling layers.

5 YOLOv2: Better, Faster, and Stronger

YOLOv2 was published at CVPR 2017 [53] by Joseph Redmon and Ali Farhadi. It included several improvements over the original YOLO, making it better, keeping the same speed, and also stronger (capable of detecting 9000 categories). The improvements are the following:
  1. Batch normalization on all convolutional layers improved convergence and acts as a regularizer to reduce overfitting.
  2. High-resolution classifier. Like YOLOv1, they pre-trained the model with ImageNet at $224 \times 224$. However, this time they fine-tuned the model on ImageNet for ten epochs at a resolution of $448 \times 448$, improving the network performance on higher-resolution input.
  3. Fully convolutional. They removed the dense layers and used a fully convolutional architecture.
  4. Use anchor boxes to predict bounding boxes. They use a set of prior boxes, or anchor boxes, which are boxes with predefined shapes used to match prototypical shapes of objects, as shown in Figure 7. Multiple anchor boxes are defined for each grid cell, and the system predicts the coordinates and the class for every anchor box. The size of the network output is proportional to the number of anchor boxes per grid cell.
  5. Dimension clusters. Picking good prior boxes helps the network learn to predict more accurate bounding boxes. The authors ran k-means clustering on the training bounding boxes to find good priors. They selected five prior boxes, providing a good trade-off between recall and model complexity.
  6. Direct location prediction. Unlike other methods that predicted offsets [3], YOLOv2 followed the same philosophy and predicted location coordinates relative to the grid cell. The network predicts five bounding boxes for each cell, each with five values $t_x, t_y, t_w, t_h$, and $t_o$, where $t_o$ is equivalent to $P_c$ from YOLOv1, and the final bounding box coordinates are obtained as shown in Figure 8.
  7. Finer-grained features. Compared with YOLOv1, YOLOv2 removed one pooling layer to obtain an output feature map or grid of $13 \times 13$ for input images of $416 \times 416$. YOLOv2 also uses a passthrough layer that takes the $26 \times 26 \times 512$ feature map and reorganizes it by stacking adjacent features into different channels instead of losing them via spatial subsampling. This generates $13 \times 13 \times 2048$ feature maps that are concatenated in the channel dimension with the lower-resolution $13 \times 13 \times 1024$ maps to obtain $13 \times 13 \times 3072$ feature maps (a sketch of this reorganization is given after this list). See Table 2 for the architectural details.
  8. Multi-scale training. Since YOLOv2 does not use fully connected layers, the inputs can have different sizes. To make YOLOv2 robust to different input sizes, the authors trained the model randomly, changing the input size, from $320 \times 320$ up to $608 \times 608$, every ten batches.
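A minimal NumPy sketch of the passthrough (space-to-depth) reorganization described in point 7, assuming channel-last feature maps (our own helper, not the Darknet implementation):

    import numpy as np

    def passthrough(features, stride=2):
        # features: (H, W, C) feature map, e.g. (26, 26, 512) in YOLOv2.
        h, w, c = features.shape
        out = features.reshape(h // stride, stride, w // stride, stride, c)
        out = out.transpose(0, 2, 1, 3, 4)                    # group the 2x2 neighborhoods
        return out.reshape(h // stride, w // stride, c * stride * stride)

    fine = np.zeros((26, 26, 512), dtype=np.float32)
    print(passthrough(fine).shape)                            # (13, 13, 2048)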
Figure 7: Anchor boxes. YOLOv2 defines multiple anchor boxes for each grid cell.
With all these improvements, YOLOv2 achieved an average precision (AP) of 78.6% on the PASCAL VOC 2007 dataset, compared to the 63.4% obtained by YOLOv1.

5.1 YOLOv2 Architecture

The backbone architecture used by YOLOv2 is called Darknet-19, containing 19 convolutional layers and 5 max-pooling layers. Similar to the YOLOv1 architecture, it is inspired by the Network in Network [51], using $1 \times 1$ convolutions between the $3 \times 3$ convolutions to reduce the number of parameters. In addition, as mentioned above, they use batch normalization to regularize the model and help convergence.
Figure 8: Bounding box prediction. The center coordinates of the box are obtained by passing the predicted $t_x, t_y$ values through a sigmoid function and offsetting them by the location of the grid cell $c_x, c_y$. The width and height of the final box use the prior width $p_w$ and height $p_h$ scaled by $e^{t_w}$ and $e^{t_h}$, respectively, where $t_w$ and $t_h$ are predicted by YOLOv2.
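Written out, the decoding shown in Figure 8 follows the YOLOv2 formulation:

$$
b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w\, e^{t_w}, \qquad b_h = p_h\, e^{t_h},
$$

with the box confidence given by $\sigma(t_o)$.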
Table 2 shows the entire Darknet-19 backbone with the object detection head. When using the PASCAL VOC dataset, YOLOv2 predicts five bounding boxes, each with five values and 20 classes. The object classification head replaces the last four convolutional layers with a single convolutional layer with 1000 filters, followed by a global average pooling layer and a Softmax.

5.2 YOLO9000 is a Stronger YOLOv2

In the same paper, the authors introduced a method for jointly training classification and detection. It uses the detection-labeled data from COCO [47] to learn bounding box coordinates and the classification data from ImageNet to increase the number of categories it can detect. During training, they combined both datasets such that when a detection training image is used, it backpropagates through the detection network, and when a classification training image is used, it backpropagates through the classification part of the architecture. The result is a YOLO model capable of detecting more than 9000 categories, hence the name YOLO9000.

6 YOLOv3

YOLOv3 [54] was published on ArXiv in 2018 by Joseph Redmon and Ali Farhadi. It included significant changes and a bigger architecture to be on par with the state of the art while keeping real-time performance. In the following, we describe the changes with respect to YOLOv2.
  1. Bounding box prediction. Like YOLOv2, the network predicts four coordinates for each bounding box, $t_x, t_y, t_w$, and $t_h$; however, this time, YOLOv3 predicts an objectness score for each bounding box using logistic regression. This score is 1 for the anchor box with the highest overlap with the ground truth and 0 for the rest of the anchor boxes. In contrast to Faster R-CNN [3], YOLOv3 assigns only one anchor box to each ground truth object. In addition, if no anchor box is assigned to an object, it only incurs a classification loss but no localization or confidence loss.
  2. Class prediction. Instead of using a softmax for the classification, they used binary cross-entropy to train independent logistic classifiers, posing the problem as a multilabel classification. This change allows multiple labels to be assigned to the same box, which may occur on complex datasets [55] with overlapping labels. For example, the same object can be a Person and a Man.
  3. New backbone. YOLOv3 features a larger feature extractor composed of 53 convolutional layers with residual connections. Section 6.1 describes the architecture in more detail.
Table 2: YOLOv2 architecture. Darknet-19 backbone (layers 1 to 23) plus the detection head composed of the last four convolutional layers and the reorganization of the $17^{\text{th}}$ layer output from $26 \times 26 \times 512$ to $13 \times 13 \times 2048$, followed by concatenation with the $25^{\text{th}}$ layer. The final convolution generates a $13 \times 13$ grid with 125 channels to accommodate 25 predictions (5 coordinates + 20 classes) for each of the 5 bounding boxes.
| Num | Type | Filters | Size/Stride | Output |
| :--- | :--- | :--- | :--- | :--- |
| 1 | Conv/BN | 32 | 3×3/1 | 416×416×32 |
| 2 | Max Pool | | 2×2/2 | 208×208×32 |
| 3 | Conv/BN | 64 | 3×3/1 | 208×208×64 |
| 4 | Max Pool | | 2×2/2 | 104×104×64 |
| 5 | Conv/BN | 128 | 3×3/1 | 104×104×128 |
| 6 | Conv/BN | 64 | 1×1/1 | 104×104×64 |
| 7 | Conv/BN | 128 | 3×3/1 | 104×104×128 |
| 8 | Max Pool | | 2×2/2 | 52×52×128 |
| 9 | Conv/BN | 256 | 3×3/1 | 52×52×256 |
| 10 | Conv/BN | 128 | 1×1/1 | 52×52×128 |
| 11 | Conv/BN | 256 | 3×3/1 | 52×52×256 |
| 12 | Max Pool | | 2×2/2 | 26×26×256 |
| 13 | Conv/BN | 512 | 3×3/1 | 26×26×512 |
| 14 | Conv/BN | 256 | 1×1/1 | 26×26×256 |
| 15 | Conv/BN | 512 | 3×3/1 | 26×26×512 |
| 16 | Conv/BN | 256 | 1×1/1 | 26×26×256 |
| 17 | Conv/BN | 512 | 3×3/1 | 26×26×512 |
| 18 | Max Pool | | 2×2/2 | 13×13×512 |
| 19 | Conv/BN | 1024 | 3×3/1 | 13×13×1024 |
| 20 | Conv/BN | 512 | 1×1/1 | 13×13×512 |
| 21 | Conv/BN | 1024 | 3×3/1 | 13×13×1024 |
| 22 | Conv/BN | 512 | 1×1/1 | 13×13×512 |
| 23 | Conv/BN | 1024 | 3×3/1 | 13×13×1024 |
| 24 | Conv/BN | 1024 | 3×3/1 | 13×13×1024 |
| 25 | Conv/BN | 1024 | 3×3/1 | 13×13×1024 |
| 26 | Reorg layer 17 | | | 13×13×2048 |
| 27 | Concat 25 and 26 | | | 13×13×3072 |
| 28 | Conv/BN | 1024 | 3×3/1 | 13×13×1024 |
| 29 | Conv | 125 | 1×1/1 | 13×13×125 |
  4. Spatial pyramid pooling (SPP). Although not mentioned in the paper, the authors also added to the backbone a modified SPP block [56] that concatenates multiple max-pooling outputs without subsampling (stride = 1), each with a different kernel size $k \times k$ where $k = 1, 5, 9, 13$, allowing a larger receptive field. This version, called YOLOv3-spp, was the best-performing one, improving $\mathrm{AP}_{50}$ by 2.7% (a sketch of such a block is given after this list).
  5. Multi-scale predictions. Similar to Feature Pyramid Networks [57], YOLOv3 predicts three boxes at three different scales. Section 6.2 describes the multi-scale prediction mechanism in more detail.
  6. Bounding box priors. Like YOLOv2, the authors also used k-means to determine the bounding box priors of the anchor boxes. The difference is that in YOLOv2 they used a total of five prior boxes per cell, whereas in YOLOv3 they used three prior boxes for each of the three different scales.
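A minimal PyTorch sketch of an SPP block of this kind (stride-1 max pooling with k = 5, 9, 13 plus the identity for k = 1, concatenated along channels); this is our illustration, not the Darknet code:

    import torch
    import torch.nn as nn

    class SPPBlock(nn.Module):
        def __init__(self, kernel_sizes=(5, 9, 13)):
            super().__init__()
            # Stride-1 pooling with "same" padding keeps the spatial resolution.
            self.pools = nn.ModuleList(
                nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
            )

        def forward(self, x):
            # Concatenate the input (the k = 1 case) with the pooled maps along channels.
            return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

    x = torch.randn(1, 512, 13, 13)
    print(SPPBlock()(x).shape)   # torch.Size([1, 2048, 13, 13])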

6.1 YOLOv3 Architecture

The architecture backbone presented in YOLOv3 is called Darknet-53. It replaces all max-pooling layers with strided convolutions and adds residual connections. In total, it contains 53 convolutional layers. Figure 9 shows the architecture details.
The Darknet-53 backbone obtains Top-1 and Top-5 accuracies comparable with ResNet-152 while being almost $2\times$ faster.
| Layer | Filters | Size/Stride | Repeat | Output size |
| :--- | :--- | :--- | :--- | :--- |
| Image | | | | 416×416 |
| Conv | 32 | 3×3/1 | 1 | 416×416 |
| Conv | 64 | 3×3/2 | 1 | 208×208 |
| Conv + Conv + Residual | 32, 64 | 1×1/1, 3×3/1 | ×1 | 208×208 |
| Conv | 128 | 3×3/2 | 1 | 104×104 |
| Conv + Conv + Residual | 64, 128 | 1×1/1, 3×3/1 | ×2 | 104×104 |
| Conv | 256 | 3×3/2 | 1 | 52×52 |
| Conv + Conv + Residual | 128, 256 | 1×1/1, 3×3/1 | ×8 | 52×52 |
| Conv | 512 | 3×3/2 | 1 | 26×26 |
| Conv + Conv + Residual | 256, 512 | 1×1/1, 3×3/1 | ×8 | 26×26 |
| Conv | 1024 | 3×3/2 | 1 | 13×13 |
| Conv + Conv + Residual | 512, 1024 | 1×1/1, 3×3/1 | ×4 | 13×13 |
Figure 9: YOLOv3 Darknet-53 backbone. The architecture of YOLOv3 is composed of 53 convolutional layers, each with batch normalization and Leaky ReLU activation. In addition, residual connections connect the input of the $1 \times 1$ convolutions across the whole network with the output of the $3 \times 3$ convolutions. The architecture shown here consists only of the backbone; it does not include the detection head composed of the multi-scale predictions.

6.2 YOLOv3 Multi-scale Predictions

Besides the larger architecture, an essential feature of YOLOv3 is the multi-scale predictions, i.e., predictions at multiple grid sizes. This helps obtain finer-detailed boxes and significantly improves the prediction of small objects, one of the main weaknesses of the previous versions of YOLO. The multi-scale detection architecture shown in Figure 10 works as follows: the first output, marked as y1, is equivalent to the YOLOv2 output, defined on a $13 \times 13$ grid. The second output, y2, is composed by concatenating the output after the (Res × 4) block of Darknet-53 with the output after the (Res × 8) block. The feature maps have different sizes, i.e., $13 \times 13$ and $26 \times 26$, so there is an upsampling operation before the concatenation. Finally, using another upsampling operation, the third output, y3, concatenates the $26 \times 26$ feature maps with the $52 \times 52$ feature maps. For the COCO dataset with 80 categories, each scale provides an output tensor with a shape of $N \times N \times [3 \times (4 + 1 + 80)]$, where $N \times N$ is the size of the feature map (or grid), 3 indicates the boxes per cell, and $4 + 1$ includes the four coordinates and the objectness score.
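For COCO, these per-scale shapes work out as follows (a trivial check, not from the paper):

    num_classes = 80
    boxes_per_cell = 3

    for n in (13, 26, 52):                      # the three YOLOv3 output grids
        channels = boxes_per_cell * (4 + 1 + num_classes)
        print((n, n, channels))                 # (13, 13, 255), (26, 26, 255), (52, 52, 255)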

6.3 YOLOv3 Results

When YOLOv3 was released, the benchmark for object detection had changed from PASCAL VOC to Microsoft COCO [47]. Therefore, from here on, all the YOLOs are evaluated on the MS COCO dataset. YOLOv3-spp achieved an AP of 36.2% and $\mathrm{AP}_{50}$ of 60.6% at 20 FPS, achieving state of the art at the time and running $2\times$ faster.
Figure 10: YOLOv3 multi-scale detection architecture. The output of the Darknet-53 backbone branches into three different outputs marked as y1, y2, and y3, each with increasing resolution. The final predicted boxes are filtered using non-maximum suppression. The CBL (Convolution-BatchNorm-Leaky ReLU) block comprises one convolution layer with batch normalization and Leaky ReLU. The Res blocks comprise one CBL followed by two CBL structures with a residual connection, as shown in Figure 9.

7 Backbone, Neck, and Head

At this time, the architecture of object detectors started to be described in three parts: the backbone, the neck, and the head. Figure 11 shows a high-level backbone, neck, and head diagram. The backbone is responsible for extracting useful features from the input image. It is typically a convolutional neural network (CNN) trained on a large-scale image classification task, such as ImageNet. The backbone captures hierarchical features at different scales, with lower-level features (e.g., edges and textures) extracted in the earlier layers and higher-level features (e.g., object parts and semantic information) extracted in the deeper layers.
The neck is an intermediate component that connects the backbone to the head. It aggregates and refines the features extracted by the backbone, often focusing on enhancing the spatial and semantic information across different scales. The neck may include additional convolutional layers, feature pyramid networks (FPN) [57], or other mechanisms to improve the representation of the features. The head is the final component of an object detector; it is responsible for making predictions based on the features provided by the backbone and neck. It typically consists of one or more task-specific subnetworks that perform classification, localization, and, more recently, instance segmentation and pose estimation. The head processes the features provided by the neck, generating predictions for each object candidate. Finally, a post-processing step, such as non-maximum suppression (NMS), filters out overlapping predictions and retains only the most confident detections.
In the rest of the YOLO models, we will describe the architectures in terms of the backbone, neck, and head.

8 YOLOv4

Two years passed with no new version of YOLO. It was not until April 2020 that Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao released the paper for YOLOv4 [58] on ArXiv. At first, it felt odd that different authors presented a new "official" version of YOLO; however, YOLOv4 kept the same YOLO philosophy (real-time, open source, single shot, and the Darknet framework), and the improvements were so satisfactory that the community rapidly embraced this version as the official YOLOv4.
Figure 11: The architecture of modern object detectors can be described as a backbone, a neck, and a head. The backbone, usually a convolutional neural network (CNN), extracts essential features from the image at different scales. The neck refines these features, enhancing spatial and semantic information. Lastly, the head uses these refined features to make the object detection predictions.
YOLOv4 tried to find the optimal balance by experimenting with many changes categorized as bag-of-freebies and bag-of-specials. Bag-of-freebies are methods that only change the training strategy and increase training cost but do not increase the inference time, the most common being data augmentation. On the other hand, bag-of-specials are methods that slightly increase the inference cost but significantly improve accuracy. Examples of these methods are those for enlarging the receptive field [56, 59, 60], combining features [61, 57, 62, 63], and post-processing [64, 49, 65, 66] among others.
We summarize the main changes of YOLOv4 in the following points:
  • An Enhanced Architecture with Bag-of-Specials (BoS) Integration. The authors tried multiple architectures for the backbone, such as ResNeXt50 [67], EfficientNet-B3 [68], and Darknet-53. The best-performing architecture was a modification of Darknet-53 with cross-stage partial connections (CSPNet) [69], and Mish activation function [65] as the backbone (see Figure 12). For the neck, they used the modified version of spatial pyramid pooling (SPP) [56] from YOLOv3-spp and multi-scale predictions as in YOLOv3, but with a modified version of path aggregation network (PANet) [70] instead of FPN as well as a modified spatial attention module (SAM) [71]. Finally, for the detection head, they use anchors as in YOLOv3. Therefore, the model was called CSPDarknet53-PANet-SPP. The cross-stage partial connections (CSP) added to the Darknet-53 help reduce the computation of the model while keeping the same accuracy. The SPP block, as in YOLOv3-spp, increases the receptive field without affecting the inference speed. The modified version of PANet concatenates the features instead of adding them as in the original PANet paper.
  • Integrating bag-of-freebies (BoF) for an Advanced Training Approach. Apart from the regular augmentations such as random brightness, contrast, scaling, cropping, flipping, and rotation, the authors implemented mosaic augmentation that combines four images into a single one, allowing the detection of objects outside their usual context and also reducing the need for a large mini-batch size for batch normalization. For regularization, they used DropBlock [72], which works as a replacement for Dropout [73] but for convolutional neural networks, as well as class label smoothing [74, 75]. For the detector, they added CIoU loss [76] and Cross mini-batch normalization (CmBN) for collecting statistics from the entire batch instead of from single mini-batches as in regular batch normalization [77].
  • Self-adversarial Training (SAT). To make the model more robust to perturbations, an adversarial attack is performed on the input image to create a deception that the ground truth object is not in the image but keeps the original label to detect the correct object.
  • Hyperparameter Optimization with Genetic Algorithms. To find the optimal hyperparameters used for training, they use genetic algorithms on the first 10 % 10 % 10%10 \% of periods, and a cosine annealing scheduler [78] to alter the learning rate during training. It starts reducing the learning rate slowly, followed by a quick reduction halfway through the training process ending with a slight reduction.
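As a reference for the scheduler mentioned above, cosine annealing [78] sets the learning rate at step $t$ of $T$ total steps as

$$
\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{t\,\pi}{T}\right),
$$

which produces exactly the slow-fast-slow decay described in the item above.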
Figure 12: YOLOv4 Architecture for object detection. The modules in the diagram are CMB: Convolution + Batch Normalization + Mish activation, CBL: Convolution + Batch Normalization + Leaky ReLU, UP: upsampling, SPP: Spatial Pyramid Pooling, and PANet: Path Aggregation Network. Diagram inspired by [79].
Table 3 lists the final selection of BoFs and BoS for the backbone and the detector.
Evaluated on MS COCO dataset test-dev 2017, YOLOv4 achieved an AP of 43.5% and $\mathrm{AP}_{50}$ of 65.7% at more than 50 FPS on an NVIDIA V100.

9 YOLOv5

YOLOv5 [80] was released a couple of months after YOLOv4 in 2020 by Glen Jocher, founder and CEO of Ultralytics. It uses many improvements described in the YOLOv4 section but was developed in Pytorch instead of Darknet. YOLOv5 incorporates an Ultralytics algorithm called AutoAnchor. This pre-training tool checks and adjusts anchor boxes if they are ill-fitted for the dataset and training settings, such as image size. It first applies a k-means function to dataset labels to generate initial conditions for a Genetic Evolution (GE) algorithm. The GE algorithm then evolves these anchors over 1000 generations by default, using CIoU loss [76] and Best Possible Recall as its fitness function. Figure 13 shows the detailed architecture of YOLOv5.
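A rough sketch of the k-means seeding step for anchor shapes follows; for brevity it uses plain Euclidean k-means on width-height pairs, whereas YOLOv2's dimension clusters and Ultralytics' AutoAnchor use an IoU-based distance plus the genetic evolution described above, so treat this only as an illustration:

    import numpy as np

    def kmeans_anchors(wh, k=9, iters=100, seed=0):
        # wh: (N, 2) array of box widths and heights from the training labels.
        rng = np.random.default_rng(seed)
        anchors = wh[rng.choice(len(wh), k, replace=False)]       # random initial centroids
        for _ in range(iters):
            d = np.linalg.norm(wh[:, None, :] - anchors[None, :, :], axis=-1)
            assign = d.argmin(axis=1)                              # nearest centroid per box
            for j in range(k):
                if np.any(assign == j):
                    anchors[j] = wh[assign == j].mean(axis=0)
        return anchors[np.argsort(anchors.prod(axis=1))]           # sort anchors by area

    wh = np.abs(np.random.randn(1000, 2)) * 50 + 10                # stand-in label sizes
    print(kmeans_anchors(wh, k=9).round(1))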

9.1 YOLOv5 Architecture

The backbone is a modified CSPDarknet53 that starts with a Stem, a strided convolution layer with a large window size to reduce memory and computational costs, followed by convolutional layers that extract relevant features from the input image.
Table 3: YOLOv4 final selection of bag-of-freebies (BoF) and bag-of-specials (BoS). BoF are methods that increase performance with no inference cost but longer training times. On the other hand, BoS are methods that slightly increase the inference cost but significantly improve accuracy.
| Backbone | Detector |
| :--- | :--- |
| Bag-of-Freebies | Bag-of-Freebies |
| Data augmentation: Mosaic, CutMix | Data augmentation: Mosaic, Self-Adversarial Training |
| Regularization: DropBlock | CIoU loss |
| Class label smoothing | Cross mini-Batch Normalization (CmBN) |
| | Eliminate grid sensitivity |
| | Multiple anchors for a single ground truth |
| | Cosine annealing scheduler |
| | Optimal hyper-parameters |
| | Random training shapes |
| Bag-of-Specials | Bag-of-Specials |
| Mish activation | Mish activation |
| Cross-stage partial connections | Spatial pyramid pooling block |
| Multi-input weighted residual connections | Spatial attention module (SAM) |
| | Path aggregation network (PAN) |
| | Distance-IoU Non-Maximum Suppression |
The SPPF (spatial pyramid pooling fast) layer and the following convolution layers process the features at various scales, while the upsample layers increase the resolution of the feature maps. The SPPF layer aims to speed up the computation of the network by pooling features of different scales into a fixed-size feature map. Each convolution is followed by batch normalization (BN) and SiLU activation [81]. The neck uses SPPF and a modified CSP-PAN, while the head resembles YOLOv3.
YOLOv5 uses several augmentations such as Mosaic, copy paste [82], random affine, MixUp [83], HSV augmentation, random horizontal flip, as well as other augmentations from the albumentations package [84]. It also improves the grid sensitivity to make it more stable to runaway gradients.
YOLOv5 provides five scaled versions: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra large), where the width and depth of the convolution modules vary to suit specific applications and hardware requirements. For instance, YOLOv5n and YOLOv5s are lightweight models targeted for low-resource devices, while YOLOv5x is optimized for high performance, albeit at the expense of speed.
The YOLOv5 released version at the time of this writing is v7.0, including YOLOv5 versions capable of classification and instance segmentation.
YOLOv5 is open source and actively maintained by Ultralytics, with more than 250 contributors and new improvements frequently. YOLOv5 is easy to use, train and deploy. Ultralytics provide a mobile version for iOS and Android and many integrations for labeling, training, and deployment.
Evaluated on MS COCO dataset test-dev 2017, YOLOv5x achieved an AP of 50.7% with an image size of 640 pixels. Using a batch size of 32, it can achieve a speed of 200 FPS on an NVIDIA V100. Using a larger input size of 1536 pixels and test-time augmentation (TTA), YOLOv5 achieves an AP of 55.8%.

10 Scaled-YOLOv4

One year after YOLOv4, the same authors presented Scaled-YOLOv4 [87] in CVPR 2021. Differently from YOLOv4, Scaled YOLOv4 was developed in Pytorch instead of Darknet. The main novelty was the introduction of scaling-up and scaling-down techniques. Scaling up means producing a model that increases accuracy at the expense of a lower speed; on the other hand, scaling down entails producing a model that increases speed sacrificing accuracy. In addition, scaled-down models need less computing power and can run on embedded systems.
The scaled-down architecture was called YOLOv4-tiny; it was designed for low-end GPUs and can run at 46 FPS on a Jetson TX2 or 440 FPS on an RTX 2080 Ti, achieving 22% AP on MS COCO.
Figure 13: YOLOv5 Architecture. The architecture uses a modified CSPDarknet53 backbone with a Stem, followed by convolutional layers that extract image features. A spatial pyramid pooling fast (SPPF) layer accelerates computation by pooling features into a fixed-size map. Each convolution has batch normalization and SiLU activation. The network's neck uses SPPF and a modified CSP-PAN, while the head resembles YOLOv3. Diagram based on [85] and [86].
The scaled-up model architecture was called YOLOv4-large, which included three different sizes P5, P6, and P7. This architecture was designed for cloud GPUs and achieved state-of-the-art performance, surpassing all previous models [7, 6, 88] with 56% AP on MS COCO.

11 YOLOR

YOLOR [89] was published in ArXiv in May 2021 by the same research team of YOLOv4. It stands for You Only Learn One Representation. In this paper, the authors followed a different approach; they developed a multi-task learning approach that aims to create a single model for various tasks (e.g., classification, detection, pose estimation) by learning a general representation and using sub-networks to create task-specific representations. With the insight that the traditional joint learning method often leads to suboptimal feature generation, YOLOR aims to overcome this by encoding the implicit knowledge of neural networks to be applied to multiple tasks, similar to how humans use past experiences to approach new problems. The results showed that introducing implicit knowledge into the neural network benefits all the tasks.
Evaluated on MS COCO dataset test-dev 2017, YOLOR achieved an AP of 55.4% and $\mathrm{AP}_{50}$ of 73.3% at 30 FPS on an NVIDIA V100.

12 YOLOX

YOLOX [90] was published in ArXiv in July 2021 by Megvii Technology. Developed in Pytorch and using YOLOv3 from Ultralytics as its starting point, it has five principal changes: an anchor-free architecture, multiple positives, a decoupled head, advanced label assignment, and strong augmentations. It achieved state-of-the-art results in 2021 with an optimal balance between speed and accuracy, reaching 50.1% AP at 68.9 FPS on a Tesla V100. In the following, we describe the five main changes of YOLOX with respect to YOLOv3:
  1. Anchor-free. Since YOLOv2, all subsequent YOLO versions were anchor-based detectors. YOLOX, inspired by anchor-free state-of-the-art object detectors such as CornerNet [91], CenterNet [92], and FCOS [93], returned to an anchor-free architecture, simplifying the training and decoding process. The anchor-free architecture increased the AP by 0.9 points with respect to the YOLOv3 baseline.
  2. Multi positives. To compensate for the large imbalances the lack of anchors produced, the authors use center sampling [93] where they assigned the center $3 \times 3$ area as positives. This approach increased AP by 2.1 points.
  3. Decoupled head. In [94, 95], it was shown that there could be a misalignment between the classification confidence and localization accuracy. Due to this, YOLOX separates these two into two heads (as shown in Figure 14), one for classification tasks and the other for regression tasks, improving the AP by 1.1 points and speeding up the model convergence.
  4. Advanced label assignment. In [96], it was shown that the ground truth label assignment could have ambiguities when the boxes of multiple objects overlap and formulate the assigning procedure as an Optimal Transport (OT) problem. YOLOX, inspired by this work, proposed a simplified version called simOTA. This change increased AP by 2.3 points.
  5. Strong augmentations. YOLOX uses MixUP [83] and Mosaic augmentations. The authors found that ImageNet pretraining was no longer beneficial after using these augmentations. The strong augmentations increased AP by 2.4 points.

13 YOLOv6

YOLOv6 [97] was published in ArXiv in September 2022 by Meituan Vision AI Department. The network design consists of an efficient backbone with RepVGG or CSPStackRep blocks, a PAN topology neck, and an efficient decoupled head with a hybrid-channel strategy. In addition, the paper introduces enhanced quantization techniques using post-training quantization and channel-wise distillation, resulting in faster and more accurate detectors. Overall, YOLOv6 outperforms previous state-of-the-art models on accuracy and speed metrics, such as YOLOv5, YOLOX, and PP-YOLOE.
Figure 15 shows the detailed architecture of YOLOv6.
The main novelties of this model are summarized below:
  1. A new backbone based on RepVGG [98] called EfficientRep that uses higher parallelism than previous YOLO backbones. For the neck, they use PAN [70] enhanced with RepBlocks [98] or CSPStackRep[69] Blocks for the larger models. And following YOLOX, they developed an efficient decoupled head.
  2. Label assignment using the Task alignment learning approach introduced in TOOD [100].
Figure 14: Difference between the YOLOv3 head and the YOLOX decoupled head. For each level of the FPN, they used a $1 \times 1$ convolution layer to reduce the feature channels to 256 and then added two parallel branches with two $3 \times 3$ convolution layers each for the class confidence (classification) and localization (regression) tasks. The IoU branch is added to the regression head.
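A minimal PyTorch sketch of a decoupled head of this kind for one FPN level (the channel counts follow the caption; the activation choice and class/module names are our assumptions, not the YOLOX code):

    import torch
    import torch.nn as nn

    class DecoupledHead(nn.Module):
        def __init__(self, in_channels, num_classes, num_anchors=1):
            super().__init__()
            self.stem = nn.Conv2d(in_channels, 256, kernel_size=1)        # 1x1 channel reduction
            def branch():
                return nn.Sequential(
                    nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
                    nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
                )
            self.cls_branch, self.reg_branch = branch(), branch()
            self.cls_pred = nn.Conv2d(256, num_anchors * num_classes, 1)  # class confidences
            self.reg_pred = nn.Conv2d(256, num_anchors * 4, 1)            # box offsets
            self.iou_pred = nn.Conv2d(256, num_anchors * 1, 1)            # IoU/objectness branch

        def forward(self, x):
            x = self.stem(x)
            cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
            return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.iou_pred(reg_feat)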
  3. New classification and regression losses. They used a classification VariFocal loss [101] and an SIoU [102]/GIoU [103] regression loss.
  4. A self-distillation strategy for the regression and classification tasks.
  5. A quantization scheme for detection using RepOptimizer [104] and channel-wise distillation [105] that helped to achieve a faster detector.
The authors provide eight scaled models, from YOLOv6-N to YOLOv6-L6. Evaluated on MS COCO dataset test-dev 2017, the largest model achieved an AP of 57.2% at around 29 FPS on an NVIDIA Tesla T4.

14 YOLOv7

YOLOv7 [106] was published in ArXiv in July 2022 by the same authors of YOLOv4 and YOLOR. At the time, it surpassed all known object detectors in speed and accuracy in the range of 5 FPS to 160 FPS. Like YOLOv4, it was trained using only the MS COCO dataset without pre-trained backbones. YOLOv7 proposed a couple of architecture changes and a series of bag-of-freebies, which increased the accuracy without affecting the inference speed, only the training time.
Figure 16 shows the detailed architecture of YOLOv7.
The architecture changes of YOLOv7 are:
  • Extended efficient layer aggregation network (E-ELAN). ELAN [108] is a strategy that allows a deep model to learn and converge more efficiently by controlling the shortest longest gradient path. YOLOv7 proposed E-ELAN that works for models with unlimited stacked computational blocks. E-ELAN combines the features of different groups by shuffling and merging cardinality to enhance the network’s learning without destroying the original gradient path.
  • Model scaling for concatenation-based models. Scaling generates models of different sizes by adjusting some model attributes. The architecture of YOLOv7 is a concatenation-based architecture in which standard scaling techniques, such as depth scaling, cause a ratio change between the input channel and the output channel of a transition layer which, in turn, leads to a decrease in the hardware usage of the model. YOLOv7 proposed a new strategy for scaling concatenation-based models in which the depth and width of the block are scaled with the same factor to maintain the optimal structure of the model.
The bag-of-freebies used in YOLOv7 include:
  • Planned re-parameterized convolution. Like YOLOv6, the architecture of YOLOv7 is also inspired by re-parameterized convolutions (RepConv) [98]. However, they found that the identity connection in RepConv destroys the residual in ResNet [61] and the concatenation in DenseNet [109]. For this reason, they removed the identity connection and called it RepConvN.
Figure 15: YOLOv6 Architecture. The architecture uses a new backbone with RepVGG blocks [98]. The spatial pyramid pooling fast (SPPF) and Conv modules are similar to YOLOv5. However, YOLOv6 uses a decoupled head. Diagram based on [99].
  • Coarse label assignment for auxiliary head and fine label assignment for the lead head. The lead head is responsible for the final output, while the auxiliary head assists with the training.
  • Batch normalization in conv-bn-activation. This integrates the mean and variance of batch normalization into the bias and weight of the convolutional layer at the inference stage.
  • Implicit knowledge inspired in YOLOR [89].
  • Exponential moving average as the final inference model.

14.1 Comparison with YOLOv4 and YOLOR

In this section, we highlight the enhancements of YOLOv7 compared to previous YOLO models developed by the same authors.
Compared to YOLOv4, YOLOv7 achieved a 75% reduction in parameters and a 36% reduction in computation while simultaneously improving the average precision (AP) by 1.5%.
In contrast to YOLOv4-tiny, YOLOv7-tiny managed to reduce parameters and computation by 39% and 49%, respectively, while maintaining the same AP.
Lastly, compared to YOLOR, YOLOv7 reduced the number of parameters and computation by 43% and 15%, respectively, along with a slight 0.4% increase in AP.
Evaluated on MS COCO dataset test-dev 2017, YOLOv7-E6 achieved an AP of 55.9% and $\mathrm{AP}_{50}$ of 73.5% with an input size of 1280 pixels and a speed of 50 FPS on an NVIDIA V100.
Figure 16: YOLOv7 Architecture. Changes in this architecture include the ELAN blocks that combine features of different groups by shuffling and merging cardinality to enhance the model learning, and modified RepVGG without identity connection. Diagram based on [107].

15 DAMO-YOLO

DAMO-YOLO [110] was published in ArXiv in November 2022 by Alibaba Group. Inspired by the current technologies, DAMO-YOLO included the following:
  1. A Neural architecture search (NAS). They used a method called MAE-NAS [111] developed by Alibaba to find an efficient architecture automatically.
  2. A large neck. Inspired by GiraffeDet [112], CSPNet [69], and ELAN [108], the authors designed a neck that can work in real-time called Efficient-RepGFPN.
  3. A small head. The authors found that a large neck and a small head yield better performance, and they only left one linear layer for classification and one for regression. They called this approach ZeroHead.
  4. AlignedOTA label assignment. Dynamic label assignment methods, such as OTA[96] and TOOD[100], have gained popularity due to their significant improvements over static methods. However, the misalignment between classification and regression remains a problem, partly because of the imbalance between classification and regression losses. To address this issue, their AlignOTA method introduces focal loss [6] into the classification cost and uses the IoU of prediction and ground truth box as the soft label, enabling the selection of aligned samples for each target and solving the problem from a global perspective.
  5. Knowledge distillation. Their proposed strategy consists of two stages: the teacher guiding the student in the first stage and the student fine-tuning independently in the second stage. Additionally, they incorporate two enhancements in the distillation approach: the Align Module, which adapts student features to the same resolution as the teacher’s, and Channel-wise Dynamic Temperature, which normalizes teacher and student features to reduce the impact of real value differences.
The authors generated scaled models named DAMO-YOLO-Tiny/Small/Medium, with the best model achieving an AP of 50.0 % at 233 FPS on an NVIDIA V100.

16 YOLOv8

YOLOv8 [113] was released in January 2023 by Ultralytics, the company that developed YOLOv5. YOLOv8 provided five scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x
(extra large). YOLOv8 supports multiple vision tasks such as object detection, segmentation, pose estimation, tracking, and classification.

16.1 YOLOv8 Architecture

Figure 17 shows the detailed architecture of YOLOv8. YOLOv8 uses a similar backbone as YOLOv5 with some changes on the CSPLayer, now called the C2f module. The C2f module (cross-stage partial bottleneck with two convolutions) combines high-level features with contextual information to improve detection accuracy.
YOLOv8 uses an anchor-free model with a decoupled head to independently process objectness, classification, and regression tasks. This design allows each branch to focus on its task and improves the model’s overall accuracy. In the output layer of YOLOv8, they used the sigmoid function as the activation function for the objectness score, representing the probability that the bounding box contains an object. It uses the softmax function for the class probabilities, representing the objects’ probabilities belonging to each possible class.
YOLOv8 uses CIoU [76] and DFL [114] loss functions for bounding box loss and binary cross-entropy for classification loss. These losses have improved object detection performance, particularly when dealing with smaller objects.
YOLOv8 also provides a semantic segmentation model called YOLOv8-Seg model. The backbone is a CSPDarknet53 feature extractor, followed by a C2f module instead of the traditional YOLO neck architecture. The C2f module is followed by two segmentation heads, which learn to predict the semantic segmentation masks for the input image. The model has similar detection heads to YOLOv8, consisting of five detection modules and a prediction layer. The YOLOv8-Seg model has achieved state-of-the-art results on various object detection and semantic segmentation benchmarks while maintaining high speed and efficiency.
YOLOv8 can be run from the command line interface (CLI), or it can also be installed as a PIP package. In addition, it comes with multiple integrations for labeling, training, and deploying.
Evaluated on MS COCO dataset test-dev 2017, YOLOv8x achieved an AP of 53.9% with an image size of 640 pixels (compared to 50.7% of YOLOv5 on the same input size) with a speed of 280 FPS on an NVIDIA A100 and TensorRT.

17 PP-YOLO, PP-YOLOv2, and PP-YOLOE

PP-YOLO models have been growing parallel to the YOLO models we described. However, we decided to group them in a single section because they began with YOLOv3 and had been gradually improving upon the previous PP-YOLO version. Nevertheless, these models have been influential in the evolution of YOLO. PP-YOLO [88], similar to YOLOv4 and YOLOv5, was based on YOLOv3. It was published in ArXiv in July 2020 by researchers from Baidu Inc. The authors used the PaddlePaddle [116] deep learning platform, hence its PP name. Following the trend we have seen starting with YOLOv4, PP-YOLO added ten existing tricks to improve the detector's accuracy, keeping the speed unchanged. According to the authors, this paper was not intended to introduce a novel object detector but to show how to build a better detector step by step. Most of the tricks PP-YOLO uses are different from the ones used in YOLOv4, and the ones that overlap use a different implementation. The changes of PP-YOLO with respect to YOLOv3 are:
  1. A ResNet50-vd backbone replacing the DarkNet-53 backbone, with an architecture augmented with deformable convolutions [117] in the last stage and a distilled pre-trained model, which has a higher classification accuracy on ImageNet. This architecture was called ResNet50-vd-dcn.
  2. A larger batch size to improve training stability; they went from 64 to 192, along with an updated training schedule and learning rate.
  3. Maintained moving averages for the trained parameters and used them instead of the final trained values.
  4. DropBlock is applied only to the FPN.
  5. An IoU loss is added in another branch along with the L1-loss for bounding box regression.
  6. An IoU prediction branch is added to measure localization accuracy along with an IoU aware loss. During inference, YOLOv3 multiplies the classification probability and objectness score to compute the final detection; PP-YOLO also multiplies the predicted IoU to consider the localization accuracy.
  7. Grid Sensitive approach similar to YOLOv4 is used to improve the bounding box center prediction at the grid boundary.
  8. Matrix NMS [118] is used, which can be run in parallel making it faster than traditional NMS.
Figure 17: YOLOv8 Architecture. The architecture uses a modified CSPDarknet53 backbone. The C2f module replaces the CSPLayer used in YOLOv5. A spatial pyramid pooling fast (SPPF) layer accelerates computation by pooling features into a fixed-size map. Each convolution has batch normalization and SiLU activation. The head is decoupled to process objectness, classification, and regression tasks independently. Diagram based on [115].
  9. CoordConv [119] is used for the 1×1 convolution of the FPN and for the first convolution layer in the detection head. CoordConv allows the network to learn translational invariance, improving detection localization.
10. Spatial Pyramid Pooling is used only on the top feature map to increase the receptive field of the backbone.
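As a concrete illustration of item 6, the following sketch combines the three per-box scores exactly as described above; the function and variable names are ours, not PP-YOLO's.

```python
import numpy as np

def detection_scores(cls_prob, objectness, iou_pred):
    """Combine per-box scores as described for the IoU-aware branch.

    cls_prob:   (N, C) class probabilities per predicted box
    objectness: (N,)   objectness scores
    iou_pred:   (N,)   predicted IoU between each box and its (unknown) ground truth
    Returns (N, C) final scores used for ranking and NMS.
    """
    return cls_prob * objectness[:, None] * iou_pred[:, None]

# toy usage
cls_prob = np.array([[0.9, 0.1], [0.3, 0.7]])
objectness = np.array([0.8, 0.6])
iou_pred = np.array([0.95, 0.40])
print(detection_scores(cls_prob, objectness, iou_pred))
```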

17.1 PP-YOLO augmentations and preprocessing

PP-YOLO used the following augmentations and preprocessing:
  1. Mixup Training [83] with a weight sampled from a Beta(α, β) distribution, where α = 1.5 and β = 1.5 (see the sketch after this list).
  2. Random Color Distortion.
  3. Random Expand.
  4. Random Crop and Random Flip with a probability of 0.5.
  5. RGB channel z-score normalization with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225].
  6. Multiple image sizes evenly drawn from [320, 352, 384, 416, 448, 480, 512, 544, 576, 608].
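As an illustration of the MixUp augmentation in item 1, here is a minimal sketch that assumes two images of the same size; the per-box loss weighting shown is one common way to handle detection targets and is not necessarily PP-YOLO's exact implementation.

```python
import numpy as np

def mixup(img_a, boxes_a, img_b, boxes_b, alpha=1.5, beta=1.5):
    """Blend two training samples; boxes from both images are kept."""
    lam = np.random.beta(alpha, beta)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    boxes = np.concatenate([boxes_a, boxes_b], axis=0)
    # per-box weights can be used to scale the loss contribution of each source image
    weights = np.concatenate([np.full(len(boxes_a), lam),
                              np.full(len(boxes_b), 1.0 - lam)])
    return mixed, boxes, weights
```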
Evaluated on the MS COCO dataset test-dev 2017, PP-YOLO achieved an AP of 45.9% and an AP50 of 65.2% at 73 FPS on an NVIDIA V100.

17.2 PP-YOLOv2

PP-YOLOv2 [120] was published in ArXiv in April 2021 and added several refinements to PP-YOLO that increased performance from 45.9% AP to 49.5% AP at 69 FPS on an NVIDIA V100. The changes of PP-YOLOv2 with respect to PP-YOLO are the following:
  1. Backbone changed from ResNet50 to ResNet101.
  2. Path aggregation network (PAN) instead of FPN similar to YOLOv4.
  3. Mish Activation Function. Unlike YOLOv4 and YOLOv5, they only applied the mish activation function in the detection neck to keep the backbone unchanged with ReLU.
  4. Larger input sizes help to increase performance on small objects. They expanded the largest input size from 608 to 768 and reduced the batch size from 24 to 12 images per GPU. The input sizes are evenly drawn from [320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768].
  5. A modified IoU-aware branch. The IoU-aware loss is computed with a soft label format instead of a soft weight format (a minimal sketch of a soft-label formulation follows this list).
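To illustrate what a soft-label formulation of the IoU-aware objective can look like, the sketch below uses binary cross-entropy with the measured IoU as the target; this is our interpretation for illustration, not PP-YOLOv2's exact code.

```python
import numpy as np

def iou_aware_soft_label_loss(iou_logits, iou_targets, eps=1e-9):
    """Binary cross-entropy with the measured IoU used as a soft target.

    iou_logits:  (N,) raw predictions of the IoU-aware branch
    iou_targets: (N,) IoU between each predicted box and its matched ground truth
    """
    p = 1.0 / (1.0 + np.exp(-iou_logits))  # sigmoid
    loss = -(iou_targets * np.log(p + eps)
             + (1.0 - iou_targets) * np.log(1.0 - p + eps))
    return loss.mean()
```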

17.3 PP-YOLOE

PP-YOLOE [121] was published in ArXiv in March 2022. It added improvements upon PP-YOLOv2, achieving a performance of 51.4% AP at 78.1 FPS on an NVIDIA V100. Figure 18 shows a detailed architecture diagram. The main changes of PP-YOLOE with respect to PP-YOLOv2 are:
  1. Anchor-free. Following the trend driven by the works of [93, 92, 91, 90], PP-YOLOE uses an anchor-free architecture.
  2. New backbone and neck. Inspired by TreeNet [122], the authors modified the architecture of the backbone and neck with RepResBlocks combining residual and dense connections.
  3. Task Alignment Learning (TAL). YOLOX was the first to bring up the problem of task misalignment, where the classification confidence and the location accuracy do not agree in all cases. To reduce this problem, PP-YOLOE implemented TAL as proposed in TOOD [100], which includes a dynamic label assignment combined with a task-alignment loss.
  4. Efficient Task-aligned Head (ET-head). Different from YOLOX, where the classification and localization heads were decoupled, PP-YOLOE used a single head based on TOOD to improve speed and accuracy.
  5. Varifocal Loss (VFL) and Distribution Focal Loss (DFL). VFL [101] weights the loss of positive samples by their target score, giving higher weight to samples with high IoU and thereby prioritizing high-quality samples during training. It uses the IoU-aware classification score (IACS) as the target, allowing joint learning of classification and localization quality and keeping training and inference consistent. DFL [114] extends Focal Loss from discrete to continuous labels, enabling the optimization of improved representations that combine quality estimation and class prediction and allowing an accurate depiction of flexible distributions in real data. The two losses are written out after this list.
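For reference, the two losses are commonly written as follows, where $p$ is the predicted IoU-aware classification score, $q$ is the IACS target (the IoU with the ground truth for positive samples and 0 for negatives), $\alpha$ and $\gamma$ are the focal scaling parameters for negatives, $y$ is the continuous regression target falling between the discretized bins $y_i$ and $y_{i+1}$, and $S_i$, $S_{i+1}$ are the predicted probabilities of those bins. The notation follows the original papers [101, 114] and is a summary rather than PP-YOLOE's exact implementation.

$$
\mathrm{VFL}(p, q) =
\begin{cases}
-q \left( q \log p + (1 - q) \log (1 - p) \right) & q > 0 \\
-\alpha \, p^{\gamma} \log (1 - p) & q = 0
\end{cases}
$$

$$
\mathrm{DFL}(S_i, S_{i+1}) = -\left( (y_{i+1} - y) \log S_i + (y - y_i) \log S_{i+1} \right)
$$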
Like previous YOLO versions, the authors generated multiple scaled models by varying the width and depth of the backbone and neck. The models are called PP-YOLOE-s (small), PP-YOLOE-m (medium), PP-YOLOE-l (large), and PP-YOLOE-x (extra large).
Figure 18: PP-YOLOE Architecture. The backbone is based on CSPRepResNet, the neck uses a path aggregation network, and the head uses ES layers to form an Efficient Task-aligned Head (ET-head). Diagram based on [123].

18 YOLO-NAS

YOLO-NAS [124] was released in May 2023 by Deci, a company that develops production-grade models and tools to build, optimize, and deploy deep learning models. YOLO-NAS is designed to detect small objects, improve localization accuracy, and enhance the performance-per-compute ratio, making it suitable for real-time edge-device applications. In addition, its open-source architecture is available for research use.
The novelty of YOLO-NAS includes the following:
  • Quantization-aware modules [125], called QSP and QCI, that combine re-parameterization and 8-bit quantization to minimize the accuracy loss during post-training quantization.
  • Automatic architecture design using AutoNAC, Deci’s proprietary NAS technology.
  • A hybrid quantization method that selectively quantizes certain parts of the model to balance latency and accuracy, instead of standard quantization, where all layers are affected (a minimal sketch of 8-bit quantization follows this list).
  • A pre-training regimen with automatically labeled data, self-distillation, and large datasets.
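To give a sense of what 8-bit quantization does to a layer's weights (independently of Deci's proprietary QSP/QCI modules, whose internals are not public), here is a minimal symmetric post-training quantization sketch:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 values back to floating point for comparison."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)   # stand-in for a layer's weights
q, s = quantize_int8(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s)).max())
```

Hybrid quantization amounts to applying such a transform only to the layers where the induced error is tolerable, keeping the most sensitive layers in higher precision.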
The AutoNAC system, which was instrumental in creating YOLO-NAS, is versatile and can accommodate any task, data characteristics, inference environment, and performance goal. It assists users in identifying the most suitable structure that offers the best blend of precision and inference speed for their particular use case. This technology considers the data, the hardware, and other elements involved in the inference process, such as compilers and quantization. In addition, RepVGG blocks were incorporated into the model architecture during the NAS process for compatibility with Post-Training Quantization (PTQ). They generated three architectures by varying the depth and positions of the QSP and QCI blocks: YOLO-NASS, YOLO-NASM, and YOLO-NASL (S, M, and L for small, medium, and large, respectively). Figure 19 shows the model architecture for YOLO-NASL.
Figure 19: YOLO-NAS Architecture. The architecture is found automatically via a Neural Architecture Search (NAS) system called AutoNAC to balance latency vs. throughput. They generated three architectures called YOLO-NASS (small), YOLO-NASM (medium), and YOLO-NASL (large), varying the depth and positions of the QSP and QCI blocks. The figure shows the YOLO-NASL architecture.
The model is pre-trained on Objects365 [126], which contains two million images and 365 categories; the COCO dataset was then used to generate pseudo-labels. Finally, the models are trained with the original 118k training images of the COCO dataset.
At the time of writing, three YOLO-NAS models have been released in FP32, FP16, and INT8 precisions, achieving an AP of 52.2% on MS COCO with 16-bit precision.

19 YOLO with Transformers

With the rise of the Transformer [127], taking over most deep learning tasks from language and audio processing to vision, it was natural to combine Transformers and YOLO. One of the first attempts at using transformers for object detection was You Only Look at One Sequence (YOLOS) [128], which turned a pre-trained Vision Transformer (ViT) [129] from image classification into object detection, achieving 42.0% AP on the MS COCO dataset. Two changes were made to ViT: 1) replace the single [CLS] token used in classification with one hundred [DET] tokens for detection, and 2) replace the image classification loss in ViT with a bipartite matching loss similar to the end-to-end object detection with transformers approach of DETR [130]. A minimal sketch of the matching step appears below.
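To make the bipartite matching step concrete, the sketch below pairs predictions with ground-truth objects using the Hungarian algorithm; the cost function here is a simplified combination of classification probability and an L1 box distance, not YOLOS's exact cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, pred_probs, gt_boxes, gt_labels, box_weight=5.0):
    """Hungarian matching between predictions and ground truths.

    pred_boxes: (N, 4), pred_probs: (N, C) softmax class probabilities
    gt_boxes:   (M, 4), gt_labels:  (M,) integer class ids
    Returns matched (prediction_index, ground_truth_index) pairs.
    """
    cls_cost = -pred_probs[:, gt_labels]                                      # (N, M)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # L1, (N, M)
    cost = cls_cost + box_weight * box_cost
    row, col = linear_sum_assignment(cost)  # minimize total matching cost
    return row, col
```

In DETR-style training, the matched pairs are then used to compute the classification and box regression losses, while unmatched predictions are assigned to a "no object" class.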

Figure 20: ViT-YOLO Architecture. The backbone MHSA-Darknet combines multi-head self-attention blocks (MHSADark Block) with Cross-Stage Partial connection blocks (CSPDark block). The neck uses BiFPN to aggregate features from different backbone levels, and the head comprises five multi-scale detection heads.
Many works have combined transformers with YOLO-related architectures tailored to specific applications. For example, Zhang et al. [131], motivated by the robustness of Vision Transformers to occlusions, perturbations, and domain shifts, proposed ViT-YOLO, a hybrid architecture that combines CSP-Darknet [58] and multi-head self-attention (MHSA-Darknet) in the backbone along with bidirectional feature pyramid networks (BiFPN) [7] for the neck and multi-scale detection heads like YOLOv3. Their specific use case was for object detection in drone images. Figure 20 shows the detailed architecture of ViT-YOLO.
MSFT-YOLO [132] adds transformer-based modules to the backbone and detection heads with the aim of detecting defects on steel surfaces. NRT-YOLO [133] (Nested Residual Transformer) addresses the problem of tiny objects in remote sensing images. By adding an extra prediction head, feature fusion layers, and a residual transformer module, NRT-YOLO improved YOLOv5l by 5.4% on the DOTA dataset [134].
In remote sensing applications, YOLO-SD [135] tried to improve the detection accuracy for small ships in synthetic aperture radar (SAR) images. It starts from YOLOX [90], coupled with multi-scale convolution (MSC) to improve detection at different scales and feature transformer modules to capture global features. The authors showed that these changes improved the accuracy of YOLO-SD compared with YOLOX on the HRSID dataset [136].

Another interesting attempt to combine YOLO with the detection transformer (DETR) [130] is DEYO [137], comprising two stages: a YOLOv5-based model followed by a DETR-like model. The first stage generates high-quality queries and anchors that are fed to the second stage. The results show a faster convergence time and better performance than DETR, achieving 52.1% AP on the COCO detection benchmark.

Table 4: Summary of YOLO architectures. The metrics reported for YOLO and YOLOv2 are on VOC2007, while the rest are reported on COCO2017. The YOLO-NAS model reported has 16-bit precision.

| Version | Date | Anchor | Framework | Backbone | AP (%) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| YOLO | 2015 | No | Darknet | Darknet24 | 63.4 |
| YOLOv2 | 2016 | Yes | Darknet | Darknet24 | 78.6 |
| YOLOv3 | 2018 | Yes | Darknet | Darknet53 | 33.0 |
| YOLOv4 | 2020 | Yes | Darknet | CSPDarknet53 | 43.5 |
| YOLOv5 | 2020 | Yes | PyTorch | YOLOv5CSPDarknet | 55.8 |
| PP-YOLO | 2020 | Yes | PaddlePaddle | ResNet50-vd | 45.9 |
| Scaled-YOLOv4 | 2021 | Yes | PyTorch | CSPDarknet | 56.0 |
| PP-YOLOv2 | 2021 | Yes | PaddlePaddle | ResNet101-vd | 50.3 |
| YOLOR | 2021 | Yes | PyTorch | CSPDarknet | 55.4 |
| YOLOX | 2021 | No | PyTorch | YOLOXCSPDarknet | 51.2 |
| PP-YOLOE | 2022 | No | PaddlePaddle | CSPRepResNet | 54.7 |
| YOLOv6 | 2022 | No | PyTorch | EfficientRep | 52.5 |
| YOLOv7 | 2022 | No | PyTorch | YOLOv7Backbone | 56.8 |
| DAMO-YOLO | 2022 | No | PyTorch | MAE-NAS | 50.0 |
| YOLOv8 | 2023 | No | PyTorch | YOLOv8CSPDarknet | 53.9 |
| YOLO-NAS | 2023 | No | PyTorch | NAS | 52.2 |

20 Discussion

This paper examined 16 YOLO versions, ranging from the original YOLO model to the most recent YOLO-NAS. Table 4 provides an overview of the YOLO versions discussed. From this table, we can identify several key patterns:
  • Anchors: The original YOLO model was relatively simple and did not employ anchors, while the state of the art at the time relied on two-stage detectors with anchors. YOLOv2 incorporated anchors, leading to improvements in bounding-box prediction accuracy. This trend persisted for five years until YOLOX introduced an anchor-less approach that achieved state-of-the-art results. Since then, subsequent YOLO versions have abandoned the use of anchors.
  • Framework: Initially, YOLO was developed using the Darknet framework, with subsequent versions following suit. However, when Ultralytics ported YOLOv3 to PyTorch, the remaining YOLO versions were developed using PyTorch, leading to a surge in enhancements. Another deep learning framework used is PaddlePaddle, an open-source framework initially developed by Baidu.
  • Backbone: The backbone architectures of YOLO models have undergone significant changes over time. Starting with the Darknet architecture, which comprised simple convolutional and max pooling layers, later models incorporated cross-stage partial connections (CSP) in YOLOv4, reparameterization in YOLOv6 and YOLOv7, and neural architecture search in DAMO-YOLO and YOLO-NAS.
  • Performance: While the performance of YOLO models has improved over time, it is worth noting that they often prioritize balancing speed and accuracy rather than solely focusing on accuracy. This tradeoff is essential to the YOLO framework, allowing for real-time object detection across various applications.

20.1 Tradeoff between speed and accuracy

The YOLO family of object detection models has consistently focused on balancing speed and accuracy, aiming to deliver real-time performance without sacrificing the quality of detection results. As the YOLO framework has evolved through its various iterations, this tradeoff has been a recurring theme, with each version seeking to optimize these competing objectives differently. In the original YOLO model, the primary focus was on achieving high-speed object
detection. The model utilized a single convolutional neural network (CNN) to directly predict object locations and classes from the input image, enabling real-time processing. However, this emphasis on speed led to a compromise in accuracy, mainly when dealing with small objects or objects with overlapping bounding boxes.
Subsequent YOLO versions introduced refinements and enhancements to address these limitations while maintaining the framework’s real-time capabilities. For instance, YOLOv2 (YOLO9000) introduced anchor boxes and passthrough layers to improve the localization of objects, resulting in higher accuracy. In addition, YOLOv3 enhanced the model’s performance by employing a multi-scale feature extraction architecture, allowing for better object detection across various scales.
The tradeoff between speed and accuracy became more nuanced as the YOLO framework evolved. Models like YOLOv4 and YOLOv5 introduced innovations, such as new network backbones, improved data augmentation techniques, and optimized training strategies. These developments led to significant gains in accuracy without drastically affecting the models’ real-time performance.
From YOLOv5 onward, all official YOLO models have fine-tuned the tradeoff between speed and accuracy, offering different model scales to suit specific applications and hardware requirements. For instance, these versions often provide lightweight models optimized for edge devices, trading accuracy for reduced computational complexity and faster processing times. Figure 21 [138] compares the different model scales from YOLOv5 to YOLOv8. The figure presents a comparative analysis of different versions of YOLO models in terms of their complexity and performance. The left graph plots the number of parameters (in millions) against the mean average precision (mAP) on the COCO validation set, averaged over IoU thresholds from 0.50 to 0.95. It illustrates a clear trend in which increasing the number of parameters enhances the model's accuracy. Each model family includes several scales, indicated by n (nano), s (small), m (medium), l (large), and x (extra-large).
The right graph contrasts the inference latency on an NVIDIA A100 GPU, utilizing TensorRT FP16, with the same mAP performance metric. Here, the tradeoff between the inference speed and the detection accuracy is evident. Lower latency values, indicating faster model inference, typically result in reduced accuracy. Conversely, models with higher latency tend to achieve better performance on the COCO mAP metric. This relationship is pivotal for applications where real-time processing is crucial, and the choice of model is influenced by the requirement to balance speed and accuracy.
Figure 21: Performance comparison of YOLO object detection models. The left plot illustrates the relationship between model complexity (measured by the number of parameters) and detection accuracy (COCO mAP50-95). The right plot shows the tradeoff between inference speed (latency on A100 TensorRT FP16) and accuracy for the same models. Each model version is represented by a distinct color, with markers indicating size variants from nano to extra-large. Plots taken from [138].

21 The future of YOLO

As the YOLO framework continues to evolve, we anticipate that the following trends and possibilities will shape future developments:
Incorporation of Latest Techniques. Researchers and developers will continue to refine the YOLO architecture by leveraging state-of-the-art methods in deep learning, data augmentation, and training techniques. This ongoing innovation will likely improve the model’s performance, robustness, and efficiency.
Benchmark Evolution. The current benchmark for evaluating object detection models, COCO 2017, may eventually be replaced by a more advanced and challenging benchmark. This mirrors the transition from the VOC 2007 benchmark used in the first two YOLO versions, reflecting the need for more demanding benchmarks as models grow more sophisticated and accurate.
Proliferation of YOLO Models and Applications. As the YOLO framework progresses, we expect to witness an increase in the number of YOLO models released each year, along with a corresponding expansion of applications. As the framework becomes more versatile and powerful, it will likely be employed in more varied domains, from home appliances to autonomous cars.
Expansion into New Domains. YOLO models have the potential to expand beyond object detection and segmentation, exploring domains such as object tracking in videos and 3D keypoint estimation. We anticipate YOLO models to transition into multi-modal frameworks, incorporating both vision and language, video, and sound processing. As these models evolve, they may serve as the foundation for innovative solutions catering to a broader spectrum of computer vision and multimedia tasks.
Adaptability to Diverse Hardware. YOLO models will span an even broader range of hardware platforms, from IoT devices to high-performance computing clusters. This adaptability will enable the deployment of YOLO models in various contexts, depending on the application's requirements and constraints. In addition, by tailoring the models to suit different hardware specifications, YOLO can be made accessible and effective for more users and industries.

22 Acknowledgments

We thank the National Council for Science and Technology (CONACYT) for its support through the National Research System (SNI).

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587, 2014.
[2] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440-1448, 2015.
[3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I 14, pp. 21-37, Springer, 2016.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 2961-2969, 2017.
[6] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, pp. 2980-2988, 2017.
[7] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781-10790, 2020.
[8] B. Bhavya Sree, V. Yashwanth Bharadwaj, and N. Neelima, “An inter-comparative survey on state-of-the-art detectors-r-cnn, yolo, and ssd,” in Intelligent Manufacturing and Energy Sustainability: Proceedings of ICIMES 2020, pp. 475-483, Springer, 2021.
[9] T. Diwan, G. Anirudh, and J. V. Tembhurne, “Object detection using yolo: Challenges, architectural successors, datasets and applications,” multimedia Tools and Applications, vol. 82, no. 6, pp. 9243-9275, 2023.
[10] M. Hussain, “Yolo-v1 to yolo-v8, the rise of yolo and its complementary nature toward digital manufacturing and industrial defect detection,” Machines, vol. 11, no. 7, p. 677, 2023.
[11] W. Lan, J. Dang, Y. Wang, and S. Wang, “Pedestrian detection based on yolo network model,” in 2018 IEEE international conference on mechatronics and automation (ICMA), pp. 1547-1551, IEEE, 2018.
[12] W.-Y. Hsu and W.-Y. Lin, “Adaptive fusion of multi-scale yolo for pedestrian detection,” IEEE Access, vol. 9, pp. 110063-110073, 2021.
[13] A. Benjumea, I. Teeti, F. Cuzzolin, and A. Bradley, “Yolo-z: Improving small object detection in yolov5 for autonomous vehicles,” arXiv preprint arXiv:2112.11798, 2021.
[14] N. M. A. A. Dazlee, S. A. Khalil, S. Abdul-Rahman, and S. Mutalib, “Object detection for autonomous vehicles with sensor-based technology using yolo,” International Journal of Intelligent Systems and Applications in Engineering, vol. 10, no. 1, pp. 129-134, 2022.
[15] S. Liang, H. Wu, L. Zhen, Q. Hua, S. Garg, G. Kaddoum, M. M. Hassan, and K. Yu, “Edge yolo: Real-time intelligent object detection system based on edge-cloud cooperation in autonomous vehicles,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25345-25360, 2022.
[16] Q. Li, X. Ding, X. Wang, L. Chen, J. Son, and J.-Y. Song, “Detection and identification of moving objects at busy traffic road based on yolo v4,” The Journal of the Institute of Internet, Broadcasting and Communication, vol. 21, no. 1, pp. 141-148, 2021.
[17] S. Shinde, A. Kothari, and V. Gupta, “Yolo based human action recognition and localization,” Procedia computer science, vol. 133, pp. 831-838, 2018.
[18] A. H. Ashraf, M. Imran, A. M. Qahtani, A. Alsufyani, O. Almutiry, A. Mahmood, M. Attique, and M. Habib, “Weapons detection for security and video surveillance using cnn and yolo-v5s,” CMC-Comput. Mater. Contin, vol. 70, pp. 2761-2775, 2022.
[19] Y. Zheng and H. Zhang, “Video analysis in sports by lightweight object detection network under the background of sports industry development,” Computational Intelligence and Neuroscience, vol. 2022, 2022.
[20] H. Ma, T. Celik, and H. Li, “Fer-yolo: Detection and classification based on facial expressions,” in Image and Graphics: 11th International Conference, ICIG 2021, Haikou, China, August 6-8, 2021, Proceedings, Part I11, pp. 28-39, Springer, 2021.
[21] Y. Tian, G. Yang, Z. Wang, H. Wang, E. Li, and Z. Liang, “Apple detection during different growth stages in orchards using the improved yolo-v3 model,” Computers and electronics in agriculture, vol. 157, pp. 417-426, 2019.
[22] D. Wu, S. Lv, M. Jiang, and H. Song, “Using channel pruning-based yolo v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments,” Computers and Electronics in Agriculture, vol. 178, p. 105742, 2020.
[23] M. Lippi, N. Bonucci, R. F. Carpio, M. Contarini, S. Speranza, and A. Gasparri, “A yolo-based pest detection system for precision agriculture,” in 2021 29th Mediterranean Conference on Control and Automation (MED), pp. 342-347, IEEE, 2021.
[24] W. Yang and Z. Jiachun, “Real-time face detection based on yolo,” in 2018 1st IEEE international conference on knowledge innovation and invention (ICKII), pp. 221-224, IEEE, 2018.
[25] W. Chen, H. Huang, S. Peng, C. Zhou, and C. Zhang, “Yolo-face: a real-time face detector,” The Visual Computer, vol. 37, pp. 805-813, 2021.
[26] M. A. Al-Masni, M. A. Al-Antari, J.-M. Park, G. Gi, T.-Y. Kim, P. Rivera, E. Valarezo, M.-T. Choi, S.-M. Han, and T.-S. Kim, “Simultaneous detection and classification of breast masses in digital mammograms via a deep learning yolo-based cad system,” Computer methods and programs in biomedicine, vol. 157, pp. 85-94, 2018.
[27] Y. Nie, P. Sommella, M. O’Nils, C. Liguori, and J. Lundgren, “Automatic detection of melanoma with yolo deep convolutional neural networks,” in 2019 E-Health and Bioengineering Conference (EHB), pp. 1-4, IEEE, 2019.
[28] H. M. Ünver and E. Ayan, “Skin lesion segmentation in dermoscopic images with combination of yolo and grabcut algorithm,” Diagnostics, vol. 9, no. 3, p. 72, 2019.
[29] L. Tan, T. Huangfu, L. Wu, and W. Chen, “Comparison of retinanet, ssd, and yolo v3 for real-time pill identification,” BMC medical informatics and decision making, vol. 21, pp. 1-11, 2021.
[30] L. Cheng, J. Li, P. Duan, and M. Wang, “A small attentional yolo model for landslide detection from satellite remote sensing images,” Landslides, vol. 18, no. 8, pp. 2751-2765, 2021.
[31] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, “Yolo-fine: One-stage detector of small objects under various backgrounds in remote sensing images,” Remote Sensing, vol. 12, no. 15, p. 2501, 2020.
[32] Y. Qing, W. Liu, L. Feng, and W. Gao, “Improved yolo network for free-angle remote sensing target detection,” Remote Sensing, vol. 13, no. 11, p. 2171, 2021.
[33] Z. Zakria, J. Deng, R. Kumar, M. S. Khokhar, J. Cai, and J. Kumar, “Multiscale and direction target detecting in remote sensing images via modified yolo-v4,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 1039-1048, 2022.
[34] P. Kumar, S. Narasimha Swamy, P. Kumar, G. Purohit, and K. S. Raju, “Real-time, yolo-based intelligent surveillance and monitoring system using jetson tx2,” in Data Analytics and Management: Proceedings of ICDAM, pp. 461-471, Springer, 2021.
[35] K. Bhambani, T. Jain, and K. A. Sultanpure, “Real-time face mask and social distancing violation detection system using yolo,” in 2020 IEEE Bangalore humanitarian technology conference (B-HTC), pp. 1-6, IEEE, 2020.
[36] J. Li, Z. Su, J. Geng, and Y. Yin, “Real-time detection of steel strip surface defects based on improved yolo detection network,” IFAC-PapersOnLine, vol. 51, no. 21, pp. 76-81, 2018.
[37] E. N. Ukhwah, E. M. Yuniarno, and Y. K. Suprapto, “Asphalt pavement pothole detection using deep learning method based on yolo neural network,” in 2019 International Seminar on Intelligent Technology and Its Applications (ISITIA), pp. 35-40, IEEE, 2019.
[38] Y. Du, N. Pan, Z. Xu, F. Deng, Y. Shen, and H. Kang, “Pavement distress detection and classification based on yolo network,” International Journal of Pavement Engineering, vol. 22, no. 13, pp. 1659-1672, 2021.
[39] R.-C. Chen et al., “Automatic license plate recognition via sliding-window darknet-yolo deep learning,” Image and Vision Computing, vol. 87, pp. 47-56, 2019.
[40] C. Dewi, R.-C. Chen, X. Jiang, and H. Yu, “Deep convolutional neural network for enhancing traffic sign recognition developed on yolo v4,” Multimedia Tools and Applications, vol. 81, no. 26, pp. 37821-37845, 2022.
[41] A. M. Roy, J. Bhaduri, T. Kumar, and K. Raj, “Wildect-yolo: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection,” Ecological Informatics, vol. 75, p. 101919, 2023.
[42] S. Kulik and A. Shtanko, “Experiments with neural net object detection system yolo on small training datasets for intelligent robotics,” in Advanced Technologies in Robotics and Intelligent Systems: Proceedings of ITR 2019, pp. 157-162, Springer, 2020.
[43] D. H. Dos Reis, D. Welfer, M. A. De Souza Leite Cuadros, and D. F. T. Gamarra, “Mobile robot navigation using an object recognition software with rgbd images and the yolo algorithm,” Applied Artificial Intelligence, vol. 33, no. 14, pp. 1290-1305, 2019.
[44] O. Sahin and S. Ozer, “Yolodrone: Improved yolo architecture for object detection in drone images,” in 2021 44th International Conference on Telecommunications and Signal Processing (TSP), pp. 361-365, IEEE, 2021.
[45] C. Chen, Z. Zheng, T. Xu, S. Guo, S. Feng, W. Yao, and Y. Lan, “Yolo-based uav technology: A review of the research and its applications,” Drones, vol. 7, no. 3, p. 190, 2023.
[46] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303-338, 2010.
[47] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740-755, Springer, 2014.
[48] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788, 2016.
[49] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, p. 3, Atlanta, Georgia, USA, 2013.
[50] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9, 2015.
[51] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211-252, 2015.
[53] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263-7271, 2017.
[54] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[55] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, et al., “Openimages: A public dataset for large-scale multi-label and multi-class image classification,” Dataset available from https://github.com/openimages, vol. 2, no. 3, p. 18, 2017.
[56] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904-1916, 2015.
[57] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117-2125, 2017.
[58] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[59] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834-848, 2017.
[60] S. Liu, D. Huang, et al., “Receptive field block net for accurate and fast object detection,” in Proceedings of the European conference on computer vision (ECCV), pp. 385-400, 2018.
[61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.
[62] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447-456, 2015.
[63] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, “M2det: A single-shot object detector based on multi-level feature pyramid network,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, pp. 9259-9266, 2019.
[64] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026-1034, 2015.
[65] D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.
[66] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-nms-improving object detection with one line of code,” in Proceedings of the IEEE international conference on computer vision, pp. 5561-5569, 2017.
[67] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492-1500, 2017.
[68] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, pp. 6105-6114, PMLR, 2019.
[69] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “Cspnet: A new backbone that can enhance learning capability of cnn,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 390-391, 2020.
[70] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8759-8768, 2018.
[71] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), pp. 3-19, 2018.
[72] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,” Advances in neural information processing systems, vol. 31, 2018.
[73] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929-1958, 2014.
[74] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818-2826, 2016.
[75] M. A. Islam, S. Naha, M. Rochan, N. Bruce, and Y. Wang, “Label refinement network for coarse-to-fine semantic segmentation,” arXiv preprint arXiv:1703.00551, 2017.
[76] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou loss: Faster and better learning for bounding box regression,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 12993-13000, 2020.
[77] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, pp. 448-456, PMLR, 2015.
[78] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
[79] S. Wang, J. Zhao, N. Ta, X. Zhao, M. Xiao, and H. Wei, “A real-time deep learning forest fire monitoring algorithm based on an improved pruned+ kd model,” Journal of Real-Time Image Processing, vol. 18, no. 6, pp. 2319-2329, 2021.
[80] G. Jocher, “YOLOv5 by Ultralytics.”
https://github.com/ultralytics/yolov5, 2020. Accessed: February 30, 2023.
[81] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[82] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, “Simple copy-paste is a strong data augmentation method for instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2918-2928, 2021.
[83] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[84] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin, “Albumentations: Fast and flexible image augmentations,” Information, vol. 11, no. 2, 2020.
[85] M. Contributors, “YOLOv5 by MMYOLO.” https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov5, 2023. Accessed: May 13, 2023.
[86] Ultralytics, “Model Structure.” https://docs.ultralytics.com/yolov5/tutorials/architecture_description/#1-model-structure, 2023. Accessed: May 14, 2023.
[87] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Scaled-yolov4: Scaling cross stage partial network,” in Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 13029-13038, 2021.
[88] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren, S. Han, E. Ding, et al., “Pp-yolo: An effective and efficient implementation of object detector,” arXiv preprint arXiv:2007.12099, 2020.
[89] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “You only learn one representation: Unified network for multiple tasks,” arXiv preprint arXiv:2105.04206, 2021.
[90] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
[91] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), pp. 734-750, 2018.
[92] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint triplets for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 6569-6578, 2019.
[93] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 9627-9636, 2019.
[94] G. Song, Y. Liu, and X. Wang, “Revisiting the sibling head in object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11563-11572, 2020.
[95] Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li, and Y. Fu, “Rethinking classification and localization for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10186-10195, 2020.
[96] Z. Ge, S. Liu, Z. Li, O. Yoshie, and J. Sun, “Ota: Optimal transport assignment for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 303-312, 2021.
[97] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, et al., “Yolov6: A single-stage object detection framework for industrial applications,” arXiv preprint arXiv:2209.02976, 2022.
[98] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13733-13742, 2021.
[99] M. Contributors, “YOLOv6 by MMYOLO.” https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov6, 2023. Accessed: May 13, 2023.
[100] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “Tood: Task-aligned one-stage object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3490-3499, IEEE Computer Society, 2021.
[101] H. Zhang, Y. Wang, F. Dayoub, and N. Sunderhauf, “Varifocalnet: An iou-aware dense object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8514-8523, 2021.
[102] Z. Gevorgyan, “Siou loss: More powerful learning for bounding box regression,” arXiv preprint arXiv:2205.12740, 2022.
[103] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658-666, 2019.
[104] X. Ding, H. Chen, X. Zhang, K. Huang, J. Han, and G. Ding, “Re-parameterizing your optimizers rather than architectures,” arXiv preprint arXiv:2205.15242, 2022.
[105] C. Shu, Y. Liu, J. Gao, Z. Yan, and C. Shen, “Channel-wise knowledge distillation for dense prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5311-5320, 2021.
[106] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
[107] M. Contributors, “YOLOv7 by MMYOLO.” https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov7, 2023. Accessed: May 13, 2023.
[108] C.-Y. Wang, H.-Y. M. Liao, and I.-H. Yeh, “Designing network design strategies through gradient path analysis,” arXiv preprint arXiv:2211.04800, 2022.
[109] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700-4708, 2017.
[110] X. Xu, Y. Jiang, W. Chen, Y. Huang, Y. Zhang, and X. Sun, “Damo-yolo: A report on real-time object detection design,” arXiv preprint arXiv:2211.15444, 2022.
[111] Alibaba, “TinyNAS.” https://github.com/alibaba/lightweight-neural-architecture-search, 2023. Accessed: March 18, 2023.
[112] Z. Tan, J. Wang, X. Sun, M. Lin, H. Li, et al., “Giraffedet: A heavy-neck paradigm for object detection,” in International Conference on Learning Representations, 2021.
[113] G. Jocher, A. Chaurasia, and J. Qiu, “YOLO by Ultralytics.” https://github.com/ultralytics/ultralytics, 2023. Accessed: February 30, 2023.
[114] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21002-21012, 2020.
[115] M. Contributors, “YOLOv8 by MMYOLO.” https://github.com/open-mmlab/mmyolo/tree/main/configs/yolov8, 2023. Accessed: May 13, 2023.
[116] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: An open-source deep learning platform from industrial practice,” Frontiers of Data and Computing, vol. 1, no. 1, pp. 105-115, 2019.
[117] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, pp. 764-773, 2017.
[118] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “Solov2: Dynamic, faster and stronger,” in Proc. NeurIPS, 2020.
[119] R. Liu, J. Lehman, P. Molino, F. Petroski Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the coordconv solution,” Advances in neural information processing systems, vol. 31, 2018.
[120] X. Huang, X. Wang, W. Lv, X. Bai, X. Long, K. Deng, Q. Dang, S. Han, Q. Liu, X. Hu, et al., “Pp-yolov2: A practical object detector,” arXiv preprint arXiv:2104.10419, 2021.
[121] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang, S. Wei, Y. Du, et al., “Pp-yoloe: An evolved version of yolo,” arXiv preprint arXiv:2203.16250, 2022.
[122] L. Rao, “Treenet: A lightweight one-shot aggregation convolutional network,” arXiv preprint arXiv:2109.12342, 2021.
[123] M. Contributors, “PP-YOLOE by MMYOLO.” https://github.com/open-mmlab/mmyolo/tree/main/configs/ppyoloe, 2023. Accessed: May 13, 2023.
[124] R. team, “YOLO-NAS by Deci Achieves State-of-the-Art Performance on Object Detection Using Neural Architecture Search.” https://deci.ai/blog/yolo-nas-object-detection-foundation-model/, 2023. Accessed: May 12, 2023.
[125] X. Chu, L. Li, and B. Zhang, “Make repvgg greater again: A quantization-aware approach,” arXiv preprint arXiv:2212.01593, 2022.
[126] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 8430-8439, 2019.
[127] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[128] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu, “You only look at one sequence: Rethinking transformer in vision through object detection,” Advances in Neural Information Processing Systems, vol. 34, pp. 26183-26197, 2021.
[129] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[130] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision, pp. 213-229, Springer, 2020.
[131] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, and F. Liu, “Vit-yolo: Transformer-based yolo for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 2799-2808, 2021.
[132] Z. Guo, C. Wang, G. Yang, Z. Huang, and G. Li, “Msft-yolo: Improved yolov5 based on transformer for detecting defects of steel surface,” Sensors, vol. 22, no. 9, p. 3467, 2022.
[133] Y. Liu, G. He, Z. Wang, W. Li, and H. Huang, “Nrt-yolo: Improved yolov5 based on nested residual transformer for tiny remote sensing object detection,” Sensors, vol. 22, no. 13, p. 4953, 2022.
[134] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3974-3983, 2018.
[135] S. Wang, S. Gao, L. Zhou, R. Liu, H. Zhang, J. Liu, Y. Jia, and J. Qian, “Yolo-sd: Small ship detection in sar images by multi-scale convolution and feature transformer module,” Remote Sensing, vol. 14, no. 20, p. 5268, 2022.
[136] S. Wei, X. Zeng, Q. Qu, M. Wang, H. Su, and J. Shi, “Hrsid: A high-resolution sar images dataset for ship detection and instance segmentation,” Ieee Access, vol. 8, pp. 120234-120254, 2020.
[137] H. Ouyang, “Deyo: Detr with yolo for step-by-step object detection,” arXiv preprint arXiv:2211.06588, 2022.
[138] Ultralytics,"YOLOv8-Ultralytics YOLOv8 Documentation."
https://docs.ultralytics.com/models/ yolov8/, 2023. Accessed: January 7, 2024.