Very Deep Convolutional Networks
for Large-Scale Image Recognition
Abstract
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.
Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters,
which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks
respectively.
We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results.
We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
1 Introduction
Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014)
which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012).
In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014),
which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011)
to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).
With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve
better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional
layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014).
In this paper, we address another important aspect of ConvNet architecture design – its depth.
To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.
As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when
used as a part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning).
We have released our two best-performing models (http://www.robots.ox.ac.uk/~vgg/research/very_deep/) to facilitate further research.
The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations.
The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper.
For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to
other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.
2 ConvNet Configurations
To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012).
In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2).
Our design choices are then discussed and compared to the prior art in Sect. 2.3.
2.1 Architecture
During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel.
The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of
left/right, up/down, center).
In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity).
The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers.
Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling).
Max-pooling is performed over a 2 × 2 pixel window, with stride 2.
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first
two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer.
The configuration of the fully connected layers is the same in all networks.
All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity.
We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation
does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
2.2 Configurations
The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names
(A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth:
from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers).
The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer,
until it reaches 512.
In Table 2 we report the number of parameters for each configuration.
In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields
(144M weights in (Sermanet et al., 2014)).
Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". The ReLU activation function is not shown for brevity.
| A | A-LRN | B | C | D | E |
|---|---|---|---|---|---|
| 11 weight layers | 11 weight layers | 13 weight layers | 16 weight layers | 16 weight layers | 19 weight layers |
| input (224 × 224 RGB image) | | | | | |
| conv3-64 | conv3-64 | conv3-64 | conv3-64 | conv3-64 | conv3-64 |
| | **LRN** | **conv3-64** | conv3-64 | conv3-64 | conv3-64 |
| maxpool | | | | | |
| conv3-128 | conv3-128 | conv3-128 | conv3-128 | conv3-128 | conv3-128 |
| | | **conv3-128** | conv3-128 | conv3-128 | conv3-128 |
| maxpool | | | | | |
| conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 |
| conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 | conv3-256 |
| | | | **conv1-256** | **conv3-256** | conv3-256 |
| | | | | | **conv3-256** |
| maxpool | | | | | |
| conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| | | | **conv1-512** | **conv3-512** | conv3-512 |
| | | | | | **conv3-512** |
| maxpool | | | | | |
| conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
| | | | **conv1-512** | **conv3-512** | conv3-512 |
| | | | | | **conv3-512** |
| maxpool | | | | | |
| FC-4096 | | | | | |
| FC-4096 | | | | | |
| FC-1000 | | | | | |
| soft-max | | | | | |
Table 2: Number of parameters (in millions).
| Network | A, A-LRN | B | C | D | E |
|---|---|---|---|---|---|
| Number of parameters | 133 | 133 | 134 | 138 | 144 |
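For concreteness, the sketch below assembles configuration D (16 weight layers) directly from Table 1 in PyTorch. It is only an illustrative reconstruction: the released models were trained with a modified Caffe toolbox (Sect. 3.3), and the function and variable names here are ours.

```python
import torch
import torch.nn as nn

def make_vgg_d(num_classes: int = 1000) -> nn.Sequential:
    # Table 1, configuration D: numbers are output channels of 3x3 convs,
    # 'M' marks a 2x2 max-pooling layer with stride 2.
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 conv, stride 1, padding 1 preserves the spatial resolution (Sect. 2.1)
            layers.append(nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_ch = v
    classifier = [
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),  # followed by soft-max at training time
    ]
    return nn.Sequential(*layers, *classifier)

model = make_vgg_d()
print(sum(p.numel() for p in model.parameters()) / 1e6)  # roughly 138M parameters, matching Table 2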
2.3 Discussion
Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).
Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)),
we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).
It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5;
three such layers have a 7 × 7 effective receptive field.
So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer?
First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.
Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack have C channels, the stack is parametrised
by 3(3^2 C^2) = 27 C^2 weights; at the same time, a single 7 × 7 conv. layer would require 7^2 C^2 = 49 C^2 parameters, i.e. 81% more.
This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
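The receptive-field and parameter-count argument above can be checked with a few lines of arithmetic; the helper functions below are illustrative only.

```python
def effective_receptive_field(num_layers: int, kernel: int = 3) -> int:
    # Stacking k x k convs with stride 1 grows the receptive field by (k - 1) per layer.
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel: int, channels: int) -> int:
    # Weights of a single conv layer with C input and C output channels (biases ignored).
    return kernel * kernel * channels * channels

C = 256  # example channel count
print(effective_receptive_field(2))                      # 5  -> two 3x3 layers see 5x5
print(effective_receptive_field(3))                      # 7  -> three 3x3 layers see 7x7
print(3 * conv_weights(3, C), conv_weights(7, C))        # 27*C^2 vs 49*C^2
print(conv_weights(7, C) / (3 * conv_weights(3, C)) - 1) # ~0.81, i.e. 81% more parameters
```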
The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting
the receptive fields of the conv. layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality
(the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function.
It should be noted that 1 × 1 conv. layers have recently been utilised in the “Network in Network” architecture of Lin et al. (2014).
Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not
evaluate on the large-scale ILSVRC dataset.
Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance.
GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy.
3 Classification Framework
In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.
3.1 Training
The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later).
Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9.
The training was regularised by weight decay (the L2 penalty multiplier set to 5 · 10^-4) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).
The learning rate was initially set to 10^-2, and then decreased
by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations
(74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs
to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
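A minimal sketch of these optimisation settings, written against the PyTorch SGD API purely for illustration (the paper's implementation is Caffe-based); `model` and `train_loader` are assumed to be defined elsewhere.

```python
import torch

# Hyper-parameters from Sect. 3.1: batch size 256, momentum 0.9,
# L2 weight decay 5e-4, dropout 0.5 (inside the model), initial LR 1e-2.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# The learning rate is divided by 10 each time validation accuracy plateaus;
# in the paper this happened 3 times over roughly 370K iterations (74 epochs).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)
criterion = torch.nn.CrossEntropyLoss()  # multinomial logistic regression objective

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
# after each epoch: scheduler.step(validation_accuracy)
```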
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets.
To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper
architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were
initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning.
For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and 10^-2 variance. The biases were initialised with zero.
It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
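A sketch of the initialisation scheme just described, assuming a deeper model `deep_model` and a trained `net_a` whose first four conv. layers and three FC layers have matching shapes; both names are hypothetical.

```python
import torch.nn as nn

def init_random(module: nn.Module) -> None:
    # Random initialisation used in the paper: weights ~ N(0, variance 1e-2), zero biases.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)  # variance 1e-2 -> std 0.1
        nn.init.zeros_(module.bias)

deep_model.apply(init_random)

# Pre-initialise the first four conv. layers and the three FC layers from net A,
# leaving the intermediate layers at their random initialisation.
conv_a = [m for m in net_a.modules() if isinstance(m, nn.Conv2d)]
conv_d = [m for m in deep_model.modules() if isinstance(m, nn.Conv2d)]
for src, dst in zip(conv_a[:4], conv_d[:4]):
    dst.load_state_dict(src.state_dict())
fc_a = [m for m in net_a.modules() if isinstance(m, nn.Linear)]
fc_d = [m for m in deep_model.modules() if isinstance(m, nn.Linear)]
for src, dst in zip(fc_a, fc_d):
    dst.load_state_dict(src.state_dict())
```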
To obtain the fixed-size 224 × 224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration).
To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
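The cropping and augmentation step might look as follows with torchvision transforms; the scale value, the mean-RGB constants, and the omitted PCA colour shift are assumptions rather than the paper's exact implementation.

```python
from torchvision import transforms

S = 256  # training scale (smallest image side), see "Training image size" below

train_transform = transforms.Compose([
    transforms.Resize(S),               # isotropic rescale: smallest side = S
    transforms.RandomCrop(224),         # one 224x224 crop per image per SGD iteration
    transforms.RandomHorizontalFlip(),  # random horizontal flipping
    # the random RGB colour shift of Krizhevsky et al. (2012) would be added here
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[1.0, 1.0, 1.0]),  # mean-RGB subtraction (approximate ImageNet mean)
])
```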
Training image size.
Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale).
While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image;
for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.
We consider two approaches for setting the training scale S.
The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics).
In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384.
Given a ConvNet configuration, we first trained the network using S = 256.
To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10^-3.
The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [S_min, S_max] (we used S_min = 256 and S_max = 512).
Since objects in images can be of different size, it is beneficial to take this into account during training.
This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales.
For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
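Scale jittering amounts to drawing a fresh S for every training image before cropping; a minimal sketch under the assumption of PIL-style images and the fixed 224 × 224 crop.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as F

S_MIN, S_MAX = 256, 512  # range used for multi-scale training

def jittered_crop(image):
    # Sample a training scale S uniformly from [S_min, S_max],
    # isotropically rescale the image, then take a random 224x224 crop.
    s = random.randint(S_MIN, S_MAX)
    image = F.resize(image, s)              # smallest side becomes s
    return transforms.RandomCrop(224)(image)
```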
3.2 Testing
At test time, given a trained ConvNet and an input image, it is classified in the following way.
First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale).
We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4,
using several values of Q for each S leads to improved performance).
Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image.
The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent
on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled).
We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain
the final scores for the image.
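A sketch of the FC-to-conv conversion behind dense evaluation, assuming the classifier layout of the configuration-D sketch above (Linear layers of sizes 25088→4096, 4096→4096, 4096→1000); `fc1`, `fc2`, `fc3` and the convolutional `features` module are assumed names.

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, kernel: int, in_channels: int) -> nn.Conv2d:
    # Reinterpret a fully-connected layer as a convolution with the same weights.
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=kernel)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_channels, kernel, kernel)
    conv.bias.data = fc.bias.data
    return conv

# first FC layer -> 7x7 conv, last two FC layers -> 1x1 convs
conv6 = fc_to_conv(fc1, kernel=7, in_channels=512)
conv7 = fc_to_conv(fc2, kernel=1, in_channels=4096)
conv8 = fc_to_conv(fc3, kernel=1, in_channels=4096)

def dense_predict(image: torch.Tensor) -> torch.Tensor:
    # Apply the fully-convolutional net to the whole (uncropped) image:
    # the result is a class score map, which is then spatially averaged.
    score_map = conv8(torch.relu(conv7(torch.relu(conv6(features(image))))))
    return score_map.mean(dim=(2, 3))  # fixed-size vector of class scores
```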
Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net.
Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while
in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (a 5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to the 144 crops over 4 scales used by Szegedy et al. (2014).
3.3 Implementation Details
Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications,
allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above).
Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU.
After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch.
Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
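Conceptually, the data-parallel scheme looks like the sketch below: each GPU processes its slice of the batch and the per-GPU gradients are averaged (assuming the batch splits evenly). Modern frameworks provide this directly (e.g. PyTorch's DataParallel), so the explicit loop is purely illustrative.

```python
import torch

def data_parallel_step(model_replicas, optimizer, images, labels, criterion):
    # Split the batch into one sub-batch per GPU and process them in parallel.
    num_gpus = len(model_replicas)
    image_chunks = images.chunk(num_gpus)
    label_chunks = labels.chunk(num_gpus)

    grads = []
    for gpu, (replica, x, y) in enumerate(zip(model_replicas, image_chunks, label_chunks)):
        loss = criterion(replica(x.to(gpu)), y.to(gpu))
        grads.append(torch.autograd.grad(loss, replica.parameters()))

    # Average the per-GPU gradients: identical result to a single-GPU step
    # when the sub-batches have equal size.
    optimizer.zero_grad()
    for param, per_gpu in zip(model_replicas[0].parameters(), zip(*grads)):
        param.grad = torch.stack([g.to(0) for g in per_gpu]).mean(dim=0)
    optimizer.step()  # the optimiser holds the parameters of replica 0;
                      # the other replicas would then be re-synchronised from it
```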
While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers
of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU.
On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
4 Classification Experiments
Dataset.
In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges).
The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels).
The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly
classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.
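Both measures can be written down directly; a small sketch, assuming `logits` of shape (N, 1000) and integer ground-truth `labels`.

```python
import torch

def topk_error(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    # Error = fraction of images whose ground-truth class is outside the top-k predictions.
    topk = logits.topk(k, dim=1).indices               # (N, k) predicted classes
    correct = (topk == labels.unsqueeze(1)).any(dim=1) # ground truth among the top-k?
    return 1.0 - correct.float().mean().item()

# top1 = topk_error(logits, labels, k=1); top5 = topk_error(logits, labels, k=5)
```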
For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server
as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).
4.1 Single Scale Evaluation
We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2.
The test image size was set as follows: Q = S for fixed S, and Q = 0.5(S_min + S_max) for jittered S ∈ [S_min, S_max].
The results are shown in Table 3.
First, we note that using local response normalisation (A-LRN network) does not improve on the model A
without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).
Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E.
Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers), performs worse than
the configuration D, which uses 3 × 3 conv. layers throughout the network.
This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).
The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets.
We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing
each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3).
The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small
filters outperforms a shallow net with larger filters.
Finally, scale jittering at training time (S ∈ [256; 512]) leads to significantly better results than training on images with fixed smallest side (S = 256 or S = 384),
even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
Table 3: ConvNet performance at a single test scale.
| ConvNet config. (Table 1) | smallest image side: train (S) | smallest image side: test (Q) | top-1 val. error (%) | top-5 val. error (%) |
|---|---|---|---|---|
| A | 256 | 256 | 29.6 | 10.4 |
| A-LRN | 256 | 256 | 29.7 | 10.5 |
| B | 256 | 256 | 28.7 | 9.9 |
| C | 256 | 256 | 28.1 | 9.4 |
| C | 384 | 384 | 28.1 | 9.3 |
| C | [256; 512] | 384 | 27.3 | 8.8 |
| D | 256 | 256 | 27.0 | 8.8 |
| D | 384 | 384 | 26.8 | 8.7 |
| D | [256; 512] | 384 | 25.6 | 8.1 |
| E | 256 | 256 | 27.3 | 9.0 |
| E | 384 | 384 | 26.9 | 8.7 |
| E | [256; 512] | 384 | 25.5 | 8.0 |
4.2 Multi-Scale Evaluation
Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time.
It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors.
Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed S were evaluated over
three test image sizes, close to the training one: Q = {S − 32, S, S + 32}.
At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained
with variable S ∈ [S_min; S_max] was evaluated over a larger range of sizes Q = {S_min, 0.5(S_min + S_max), S_max}.
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating
the same model at a single scale, shown in Table 3).
As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S.
Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in Table 4).
On the test set, the configuration E achieves 7.5% top-5 error.
Table 4: ConvNet performance at multiple test scales.
| ConvNet config. (Table 1) | smallest image side: train (S) | smallest image side: test (Q) | top-1 val. error (%) | top-5 val. error (%) |
|---|---|---|---|---|
| B | 256 | 224, 256, 288 | 28.2 | 9.6 |
| C | 256 | 224, 256, 288 | 27.7 | 9.2 |
| C | 384 | 352, 384, 416 | 27.8 | 9.2 |
| C | [256; 512] | 256, 384, 512 | 26.3 | 8.2 |
| D | 256 | 224, 256, 288 | 26.6 | 8.6 |
| D | 384 | 352, 384, 416 | 26.5 | 8.6 |
| D | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
| E | 256 | 224, 256, 288 | 26.9 | 8.7 |
| E | 384 | 352, 384, 416 | 26.7 | 8.6 |
| E | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
4.3 Multi-crop evaluation
In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their soft-max outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.
Table 5: ConvNet evaluation techniques comparison. In all experiments the training scale S was sampled from [256; 512], and three test scales Q were considered: 256, 384, 512.
| ConvNet config. (Table 1) | Evaluation method | top-1 val. error (%) | top-5 val. error (%) |
|---|---|---|---|
| D | dense | 24.8 | 7.5 |
| D | multi-crop | 24.6 | 7.5 |
| D | multi-crop & dense | 24.4 | 7.2 |
| E | dense | 24.8 | 7.5 |
| E | multi-crop | 24.6 | 7.4 |
| E | multi-crop & dense | 24.4 | 7.1 |
4.4 ConvNet Fusion
Up until now, we evaluated the performance of individual ConvNet models.
In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors.
This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
The results are shown in Table 6.
By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error.
After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0%
using dense evaluation and to 6.8% using combined dense and multi-crop evaluation.
For reference, our best-performing single model achieves 7.1% error (model E, Table 5).
Table 6: Multiple ConvNet fusion results.
| Combined ConvNet models | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%) |
|---|---|---|---|
| ILSVRC submission | | | |
| (D/256/224,256,288), (D/384/352,384,416), (D/[256;512]/256,384,512), (C/256/224,256,288), (C/384/352,384,416), (E/256/224,256,288), (E/384/352,384,416) | 24.7 | 7.5 | 7.3 |
| post-submission | | | |
| (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), dense eval. | 24.0 | 7.1 | 7.0 |
| (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop | 23.9 | 7.2 | - |
| (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop & dense eval. | 23.7 | 6.8 | 6.8 |
4.5 Comparison with the State of the Art
Finally, we compare our results with the state of the art in Table 7.
In the classification task of the ILSVRC-2014 challenge (Russakovsky et al., 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble
of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.
As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results
in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially
outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it.
This is remarkable, considering that our best result is achieved by combining just two models – significantly fewer than used in most ILSVRC submissions.
In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%.
Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
Table 7: Comparison with the state of the art in ILSVRC classification. Our method is denoted as "VGG". Only the results obtained without outside training data are reported.
| Method | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%) |
|---|---|---|---|
| VGG (2 nets, multi-crop & dense eval.) | 23.7 | 6.8 | 6.8 |
| VGG (1 net, multi-crop & dense eval.) | 24.4 | 7.1 | 7.0 |
| VGG (ILSVRC submission, 7 nets, dense eval.) | 24.7 | 7.5 | 7.3 |
| GoogLeNet (Szegedy et al., 2014) (1 net) | - | 7.9 | |
| GoogLeNet (Szegedy et al., 2014) (7 nets) | - | 6.7 | |
| MSRA (He et al., 2014) (11 nets) | - | - | 8.1 |
| MSRA (He et al., 2014) (1 net) | 27.9 | 9.1 | 9.1 |
| Clarifai (Russakovsky et al., 2014) (multiple nets) | - | - | 11.7 |
| Clarifai (Russakovsky et al., 2014) (1 net) | - | - | 12.5 |
| Zeiler & Fergus (Zeiler & Fergus, 2013) (6 nets) | 36.0 | 14.7 | 14.8 |
| Zeiler & Fergus (Zeiler & Fergus, 2013) (1 net) | 37.5 | 16.0 | 16.1 |
| OverFeat (Sermanet et al., 2014) (7 nets) | 34.0 | 13.2 | 13.6 |
| OverFeat (Sermanet et al., 2014) (1 net) | 35.7 | 14.2 | - |
| Krizhevsky et al. (Krizhevsky et al., 2012) (5 nets) | 38.1 | 16.4 | 16.4 |
| Krizhevsky et al. (Krizhevsky et al., 2012) (1 net) | 40.7 | 18.2 | - |
5 Conclusion
In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification.
It was demonstrated that the representation depth is beneficial for the classification accuracy, and that
state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012)
with substantially increased depth.
In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations.
Our results yet again confirm the importance of depth in visual representations.
Acknowledgements
This work was supported by ERC grant VisRec no. 228180.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
References
- Bell et al. (2014) Bell, S., Upchurch, P., Snavely, N., and Bala, K. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
- Chatfield et al. (2014) Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC., 2014.
- Cimpoi et al. (2014) Cimpoi, M., Maji, S., and Vedaldi, A. Deep convolutional filter banks for texture recognition and segmentation. CoRR, abs/1411.6836, 2014.
- Ciresan et al. (2011) Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. In IJCAI, pp. 1237–1242, 2011.
- Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. Large scale distributed deep networks. In NIPS, pp. 1232–1240, 2012.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
- Donahue et al. (2013) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
- Everingham et al. (2015) Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C., Winn, J., and Zisserman, A. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
- Fei-Fei et al. (2004) Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE CVPR Workshop of Generative Model Based Vision, 2004.
- Girshick et al. (2014) Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524v5, 2014. Published in Proc. CVPR, 2014.
- Gkioxari et al. (2014) Gkioxari, G., Girshick, R., and Malik, J. Actions and attributes from wholes and parts. CoRR, abs/1412.2604, 2014.
- Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, volume 9, pp. 249–256, 2010.
- Goodfellow et al. (2014) Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Proc. ICLR, 2014.
- Griffin et al. (2007) Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
- He et al. (2014) He, K., Zhang, X., Ren, S., and Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729v2, 2014.
- Hoai (2014) Hoai, M. Regularized max pooling for image categorization. In Proc. BMVC., 2014.
- Howard (2014) Howard, A. G. Some improvements on deep convolutional neural network based image classification. In Proc. ICLR, 2014.
- Jia (2013) Jia, Y. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
- Karpathy & Fei-Fei (2014) Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
- Kiros et al. (2014) Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
- Krizhevsky (2014) Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
- LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
- Lin et al. (2014) Lin, M., Chen, Q., and Yan, S. Network in network. In Proc. ICLR, 2014.
- Long et al. (2014) Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
- Oquab et al. (2014) Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. In Proc. CVPR, 2014.
- Perronnin et al. (2010) Perronnin, F., Sánchez, J., and Mensink, T. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
- Razavian et al. (2014) Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CoRR, abs/1403.6382, 2014.
- Russakovsky et al. (2014) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
- Sermanet et al. (2014) Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In Proc. ICLR, 2014.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014. Published in Proc. NIPS, 2014.
- Szegedy et al. (2014) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- Wei et al. (2014) Wei, Y., Xia, W., Huang, J., Ni, B., Dong, J., Zhao, Y., and Yan, S. CNN: Single-label to multi-label. CoRR, abs/1406.5726, 2014.
- Zeiler & Fergus (2013) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV, 2014.
Appendix A Localisation
In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth.
In this section, we turn to the localisation task of the challenge, which we have won in 2014 with 25.3% error.
It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class.
For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications.
Our method is described in Sect. A.1 and evaluated in Sect. A.2.
A.1 Localisation ConvNet
To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores.
A bounding box is represented by a 4-D vector storing its center coordinates, width, and height. There is a choice of whether the bounding box prediction is shared across all classes
(single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since
there are 1000 classes in the dataset).
Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task
(Sect. 4).
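A sketch of the two regression heads, with illustrative names; the rest of the network is configuration D up to the penultimate 4096-D FC layer.

```python
import torch.nn as nn

NUM_CLASSES = 1000

def make_localisation_head(per_class: bool) -> nn.Linear:
    # Each box is a 4-D vector (centre x, centre y, width, height).
    out_dim = 4 * NUM_CLASSES if per_class else 4  # PCR: 4000-D output, SCR: 4-D output
    return nn.Linear(4096, out_dim)

scr_head = make_localisation_head(per_class=False)
pcr_head = make_localisation_head(per_class=True)
# Training uses a Euclidean (L2) loss between the predicted and ground-truth box
# parameters, e.g. nn.MSELoss() on the 4 coordinates (see "Training" below).
```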
Training.
Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1).
The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.
We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission).
Training was initialised with the corresponding classification models (trained on the same scales), and the initial learning rate was set to 10^-3.
We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014). The last fully-connected layer was
initialised randomly and trained from scratch.
Testing.
We consider two testing protocols.
The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class (to factor out the classification errors).
The bounding box is obtained by applying the network only to the central crop of the image.
The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2).
The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions.
To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them
based on the class scores, obtained from the classification ConvNet.
When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union.
We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions
and can further improve the results.
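A rough sketch of such a greedy merging step: spatially close predictions are merged by averaging their coordinates, and the merged boxes are rated by class scores. The distance threshold and data layout are our assumptions, not values from Sermanet et al. (2014).

```python
import numpy as np

def merge_boxes(boxes: np.ndarray, scores: np.ndarray, dist_thresh: float = 0.5):
    """boxes: (N, 4) as (cx, cy, w, h); scores: (N,) class scores from the classification net."""
    merged = []
    used = np.zeros(len(boxes), dtype=bool)
    for i in np.argsort(-scores):                  # visit predictions by decreasing score
        if used[i]:
            continue
        # group predictions whose centres are close to box i (distance normalised by box size)
        d = np.hypot(boxes[:, 0] - boxes[i, 0], boxes[:, 1] - boxes[i, 1]) / boxes[i, 2:4].mean()
        group = (~used) & (d < dist_thresh)
        used |= group
        merged.append((boxes[group].mean(axis=0),  # merge by averaging coordinates
                       scores[group].max()))       # rate the merged box by the class score
    merged.sort(key=lambda bs: -bs[1])
    return merged
```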
A.2 Localisation Experiments
In this section we first determine the best-performing localisation setting (using the first test protocol),
and then evaluate it in a fully-fledged scenario (the second protocol).
The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth
bounding box is above 0.5.
Settings comparison.
As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014),
where PCR was outperformed by SCR. We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers
(as done in (Sermanet et al., 2014)). In these experiments, the smallest image side was set to 384; the results with 256 exhibit the same behaviour and are not shown for brevity.
Table 8: Localisation error for different modifications with the simplified testing protocol: the bounding box is predicted from a single central image crop, and the ground-truth class is used. All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).
| Fine-tuned layers | regression type | GT class localisation error |
|---|---|---|
| 1st and 2nd FC | SCR | 36.4 |
| 1st and 2nd FC | PCR | 34.3 |
| all | PCR | 33.1 |
Fully-fledged evaluation.
Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario,
where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using
the method of Sermanet et al. (2014).
As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8),
despite using the top-5 predicted class labels instead of the ground truth.
Similarly to the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks
further improves the performance.
Table 9: Localisation error.
| smallest image side: train (S) | smallest image side: test (Q) | top-5 val. localisation error (%) | top-5 test localisation error (%) |
|---|---|---|---|
| 256 | 256 | 29.5 | - |
| 384 | 384 | 28.2 | 26.7 |
| 384 | 352, 384 | 27.5 | - |
| fusion: 256/256 and 384/352,384 | | 26.9 | 25.3 |
Comparison with the state of the art.
We compare our best localisation result with the state of the art in Table 10.
With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014).
Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used less scales and did not employ
their resolution enhancement technique.
We envisage that better localisation performance can be achieved if this technique is incorporated into our method.
This indicates the performance advancement brought by our very deep ConvNets – we got better results with a simpler localisation method, but a more powerful representation.
Table 10: Comparison with the state of the art in ILSVRC localisation. Our method is denoted as "VGG".
| Method | top-5 val. error (%) | top-5 test error (%) |
|---|---|---|
| VGG | 26.9 | 25.3 |
| GoogLeNet (Szegedy et al., 2014) | - | 26.7 |
| OverFeat (Sermanet et al., 2014) | 30.0 | 29.9 |
| Krizhevsky et al. (Krizhevsky et al., 2012) | - | 34.2 |
Appendix B Generalisation of Very Deep Features
In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset. In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting. Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin. Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods.
In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available).
To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification) and use the 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales. The resulting image descriptor is ℓ2-normalised and combined with a linear SVM classifier, trained on the target dataset. For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).
要利用在 ILSVRC 上预训练的 ConvNets 进行其他数据集上的图像分类,我们移除最后一个全连接层(该层执行 1000 类 ILSVRC 分类),并使用倒数第二层的 4096 维激活作为图像特征,这些特征在多个位置和尺度上聚合。得到的图像描述符经过 归一化,并与一个在线 SVM 分类器结合,该分类器在目标数据集上训练。为简化起见,预训练的 ConvNet 权重保持固定(不进行微调)。
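As an illustration of this pipeline, the sketch below uses a torchvision VGG-16 (a publicly available “Net-D”-style model), truncates the classifier before the final 1000-way layer, ℓ2-normalises the 4096-D descriptor, and trains a scikit-learn linear SVM on the target data. This is a minimal sketch under assumed tooling (torchvision ≥ 0.13, scikit-learn), not the authors' original implementation; for brevity it uses fixed 224×224 inputs and dummy data, whereas the dense multi-scale aggregation described next is sketched further below.

```python
# Sketch: frozen VGG-16 penultimate features + linear SVM (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from sklearn.svm import LinearSVC

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
# Drop the final 1000-way ILSVRC layer; keep everything up to the penultimate FC.
feature_head = nn.Sequential(*list(vgg.classifier.children())[:-1])

@torch.no_grad()
def describe(batch):                      # batch: (N, 3, 224, 224), ImageNet-normalised
    x = vgg.features(batch)
    x = vgg.avgpool(x)                    # torchvision's 7x7 adaptive pool
    x = torch.flatten(x, 1)
    x = feature_head(x)                   # (N, 4096) penultimate activations
    return F.normalize(x, dim=1)          # l2-normalise the descriptor

# Dummy stand-ins for a real target-dataset loader (hypothetical data, for illustration).
train_images = torch.randn(8, 3, 224, 224)
train_labels = [0, 1, 0, 1, 0, 1, 0, 1]

# No fine-tuning: the ConvNet weights stay fixed, only the SVM is trained.
train_feats = describe(train_images).cpu().numpy()
svm = LinearSVC(C=1.0).fit(train_feats, train_labels)
```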
Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2). Namely, an image is first rescaled so that its smallest side equals the evaluation scale, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional). We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor. The descriptor is then averaged with the descriptor of a horizontally flipped image.
As was shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features over several scales.
The resulting multi-scale features can be either stacked or pooled across scales.
Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality.
We return to the discussion of this design choice in the experiments below.
We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors.
特征聚合的方式与我们的 ILSVRC 评估程序(第 3.2 节)类似。具体来说,首先将图像重新缩放,使其最小边等于 ,然后网络在图像平面上密集应用(当所有权重层都视为卷积层时这是可能的)。我们对生成的特征图进行全局平均池化,得到一个 4096 维的图像描述符。然后将该描述符与水平翻转图像的描述符进行平均。正如第 4.2 节所示,在多个尺度上进行评估是有益的,因此我们在多个尺度上提取特征 。所得的多尺度特征可以在不同尺度间堆叠或池化。堆叠允许后续分类器学习如何优化地组合不同尺度上的图像统计信息;然而,这以描述符维度的增加为代价。我们将在下面的实验中回到这一设计选择的讨论。我们还评估了使用两个网络计算的特征的后期融合,这是通过堆叠它们的各自图像描述符来完成的。
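A minimal sketch of this aggregation scheme follows, again as an illustration rather than the original implementation: the two 4096-D fully-connected layers are re-cast as convolutions so that the network can be applied densely to an image of arbitrary size, the resulting feature map is globally average-pooled, the descriptor is averaged with that of the horizontally flipped image, and per-scale descriptors are either averaged or stacked. The scale set {256, 384, 512} used here is an arbitrary illustrative choice, not the paper's exact setting.

```python
# Sketch: dense, multi-scale VGG-16 descriptors (illustrative; assumes torchvision).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
fc6, fc7 = vgg.classifier[0], vgg.classifier[3]

# Re-cast the first two FC layers as convolutions ("all weight layers treated as
# convolutional"), so the net accepts inputs larger than 224x224.
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)
conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
conv7.weight.data.copy_(fc7.weight.data.view(4096, 4096, 1, 1))
conv7.bias.data.copy_(fc7.bias.data)
dense_net = nn.Sequential(vgg.features, conv6, nn.ReLU(True), conv7, nn.ReLU(True)).eval()

@torch.no_grad()
def descriptor_at_scale(image, scale):
    """image: (3, H, W), ImageNet-normalised; rescale so the smallest side = scale (>= 224)."""
    h, w = image.shape[1:]
    factor = scale / min(h, w)
    size = (int(round(h * factor)), int(round(w * factor)))
    x = F.interpolate(image[None], size=size, mode="bilinear", align_corners=False)
    feats = []
    for inp in (x, torch.flip(x, dims=[3])):           # image and its horizontal flip
        fmap = dense_net(inp)                          # (1, 4096, H', W') dense map
        feats.append(fmap.mean(dim=(2, 3)))            # global average pooling -> (1, 4096)
    return F.normalize(torch.stack(feats).mean(0), dim=1)  # average flips, l2-normalise

@torch.no_grad()
def multiscale_descriptor(image, scales=(256, 384, 512), combine="average"):
    per_scale = [descriptor_at_scale(image, s) for s in scales]
    if combine == "stack":        # lets a classifier exploit scale-specific statistics
        return torch.cat(per_scale, dim=1)             # (1, 4096 * len(scales))
    return torch.stack(per_scale).mean(0)              # (1, 4096), dimensionality unchanged

# Late fusion of two networks (e.g. Net-D and Net-E) would simply stack (concatenate)
# their respective image descriptors along the feature dimension.
```

Averaging keeps the descriptor at 4096-D, whereas stacking multiplies the dimensionality by the number of scales; which of the two is preferable turns out to be dataset-dependent, as discussed for VOC and Caltech below.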
Table 11: Comparison with the state of the art in image classification on VOC-2007, VOC-2012, Caltech-101, and Caltech-256. Our models are denoted as “VGG”. Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
| Method | VOC-2007 (mean AP) | VOC-2012 (mean AP) | Caltech-101 (mean class recall) | Caltech-256 (mean class recall) |
|---|---|---|---|---|
| Zeiler & Fergus (Zeiler & Fergus, 2013) | - | 79.0 | 86.5 ± 0.5 | 74.2 ± 0.3 |
| Chatfield et al. (Chatfield et al., 2014) | 82.4 | 83.2 | 88.4 ± 0.6 | 77.6 ± 0.1 |
| He et al. (He et al., 2014) | 82.4 | - | 93.4 ± 0.5 | - |
| Wei et al. (Wei et al., 2014) | 81.5 (85.2*) | 81.7 (90.3*) | - | - |
| VGG Net-D (16 layers) | 89.3 | 89.0 | 91.8 ± 1.0 | 85.0 ± 0.2 |
| VGG Net-E (19 layers) | 89.3 | 89.0 | 92.3 ± 0.5 | 85.1 ± 0.3 |
| VGG Net-D & Net-E | 89.7 | 89.3 | 92.7 ± 0.5 | 86.2 ± 0.3 |
Image Classification on VOC-2007 and VOC-2012.
在 VOC-2007 和 VOC-2012 上进行图像分类。
We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels, corresponding to 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the test data for VOC-2012 is not publicly available; instead, an official evaluation server is provided). Recognition performance is measured using mean average precision (mAP) across classes.
我们从 PASCAL VOC-2007 和 VOC-2012 基准数据集(Everingham 等人,2015)上的图像分类任务开始评估。这些数据集分别包含 10K 和 22.5K 张图像,每张图像被标注一个或多个标签,对应于 20 个物体类别。VOC 组织者提供了预定义的训练、验证和测试数据划分(VOC-2012 的测试数据未公开提供;相反,提供了一个官方评估服务器)。识别性能使用跨类别的平均精度均值(mAP)进行衡量。
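For concreteness, mAP over the 20 VOC classes is the mean of per-class average precision. The snippet below is a generic sketch using scikit-learn on hypothetical data; it is not the official VOC evaluation code, which applies its own AP protocol (e.g. 11-point interpolation for VOC-2007, and a held-out evaluation server for VOC-2012).

```python
# Sketch: mean average precision (mAP) across classes for multi-label scores.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (N, C) binary labels; y_score: (N, C) classifier scores."""
    ap_per_class = [average_precision_score(y_true[:, c], y_score[:, c])
                    for c in range(y_true.shape[1])]
    return float(np.mean(ap_per_class))

# Toy usage with 20 VOC-style classes (random, illustrative data):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 20))
y_score = rng.random((100, 20))
print(mean_average_precision(y_true, y_score))
```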
Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking.
We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit.
Since averaging has the benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales. It is worth noting, though, that the improvement over a smaller range of scales was rather marginal.
值得注意的是,通过检查 VOC-2007 和 VOC-2012 验证集上的性能,我们发现将多个尺度上计算出的图像描述符通过平均进行聚合,其性能与通过堆叠进行聚合相似。我们假设这是因为 VOC 数据集中物体出现在各种尺度上,因此没有特定的尺度特定语义可供分类器利用。由于平均不会增加描述符的维度,我们能够对多个尺度范围内的图像描述符进行聚合: 。不过值得注意的是,在较小范围内的改进相当微弱( )。
The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on the VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art among image representations pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6% mean AP. It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes an additional 1000 categories semantically close to those in the VOC datasets. It also benefits from fusion with an object detection-assisted classification pipeline.
测试集性能在表 11 中报告,并与其他方法进行了比较。我们的网络“Net-D”和“Net-E”在 VOC 数据集上表现出相同的性能,而它们的组合略微提高了结果。我们的方法在图像表示方面设定了新的技术前沿,在 ILSVRC 数据集上预训练,比 Chatfield 等人(2014)之前最好的结果高出超过 。需要注意的是,Wei 等人(2014)的方法在 VOC-2012 上实现了 更好的 mAP,它是在一个扩展的 2000 类 ILSVRC 数据集上预训练的,该数据集包括额外的 1000 个类别,这些类别在语义上与 VOC 数据集中的类别相近。它还受益于与一个结合目标检测辅助分类流程的融合。
Image Classification on Caltech-101 and Caltech-256.
在 Caltech-101 和 Caltech-256 上进行图像分类。
In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes.
A standard evaluation protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, which is measured by the mean class recall (which compensates for a different number of test images per class).
Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing). In each split, 20% of training images were used as a validation set for hyper-parameter selection.
在本节中,我们在 Caltech-101(Fei-Fei 等人,2004 年)和 Caltech-256(Griffin 等人,2007 年)图像分类基准数据集上评估了非常深的特征。Caltech-101 包含 9K 张图像,这些图像被标记为 102 个类别(101 个物体类别和一个背景类别),而 Caltech-256 更大,包含 31K 张图像和 257 个类别。在这些数据集上的标准评估协议是生成多个随机分割,将数据分为训练集和测试集,并报告跨分割的平均识别性能,该性能通过平均类别召回率来衡量(这可以弥补每个类别测试图像数量的不同)。遵循 Chatfield 等人(2014 年);Zeiler & Fergus(2013 年);He 等人(2014 年)的方法,在 Caltech-101 上我们生成了 3 个随机分割,每个分割包含每个类别 30 张训练图像,以及最多 50 张测试图像。在 Caltech-256 上,我们也生成了 3 个分割,每个分割包含每个类别 60 张训练图像(其余部分用于测试)。在每个分割中,20%的训练图像被用作超参数选择的验证集。
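The mean class recall used on the Caltech benchmarks is simply the per-class recall averaged over classes, reported as mean ± standard deviation over the random splits. A small helper (hypothetical name, illustrative per-split values) could look as follows:

```python
# Sketch: mean class recall, averaged over random train/test splits (Caltech-style).
import numpy as np

def mean_class_recall(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    # Per-class recall compensates for a different number of test images per class.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Per-split results would then be reported as mean +/- std across the 3 splits:
split_scores = [0.921, 0.925, 0.918]   # hypothetical per-split values
print(f"{np.mean(split_scores):.1%} +/- {np.std(split_scores):.1%}")
```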
We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling.
This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations. We used three scales.
我们发现,与 VOC 不同,在 Caltech 数据集上,计算多个尺度上的描述符堆叠性能优于平均或最大池化。这可以由 Caltech 图像中物体通常占据整个图像这一事实来解释,因此多尺度图像特征在语义上是不同的(捕捉整个物体与物体部分),而堆叠允许分类器利用这种尺度特定的表示。我们使用了三个尺度 。
Our models are compared to each other and the state of the art in Table 11.
As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance.
On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (86.2% vs. 77.6% mean class recall).
我们的模型在表 11 中相互比较,并与当前最佳水平进行了比较。可以看出,更深的 19 层 Net-E 表现优于 16 层 Net-D,而它们的组合进一步提升了性能。在 Caltech-101 上,我们的表示方法与 He 等人(2014 年)的方法具有竞争力,然而该方法在 VOC-2007 上表现明显不如我们的网络。在 Caltech-256 上,我们的特征大幅优于当前最佳水平(Chatfield 等人,2014 年)( )。
Action Classification on VOC-2012.
VOC-2012 上的动作分类。
We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images, labelled into 11 classes. Similarly to the VOC-2012 object classification task, the performance is measured using the mAP. We considered two training settings:
(i) computing the ConvNet features on the whole image and ignoring the provided bounding box;
(ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation.
The results are compared to other approaches in Table 12.
我们还评估了我们表现最佳的图像表示(Net-D 和 Net-E 特征的堆叠)在 PASCAL VOC-2012 动作分类任务(Everingham 等人,2015 年)上的效果,该任务要求从包含动作执行者边框的单张图像中预测动作类别。该数据集包含 4.6K 张训练图像,分为 11 个类别。与 VOC-2012 目标分类任务类似,性能使用 mAP 进行衡量。我们考虑了两种训练设置:(i) 在整张图像上计算 ConvNet 特征并忽略提供的边框;(ii) 在整张图像和提供的边框上计算特征,并将它们堆叠以获得最终表示。结果与其他方法在表 12 中进行比较。
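A rough sketch of the two settings is given below, reusing a `multiscale_descriptor`-style helper such as the one assumed in the earlier sketch; the function name, the pixel-coordinate cropping, and the passed-in `describe` argument are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch: the two action-classification input settings (illustrative).
import torch

def action_descriptor(image, box, describe, use_box=True):
    """image: (3, H, W); box: (x1, y1, x2, y2) person bounding box in pixel coords;
    describe: a descriptor function, e.g. multiscale_descriptor from the sketch above."""
    full = describe(image)                              # setting (i): whole image only
    if not use_box:
        return full
    x1, y1, x2, y2 = box
    crop = image[:, y1:y2, x1:x2]                       # region of the acting person
    return torch.cat([full, describe(crop)], dim=1)     # setting (ii): stack both descriptors
```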
Table 12: Comparison with the state of the art in single-image action classification on VOC-2012. Our models are denoted as “VGG”. Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (1512 classes).
| Method | VOC-2012 (mean AP) |
|---|---|
| (Oquab et al., 2014) | 70.2* |
| (Gkioxari et al., 2014) | 73.6 |
| (Hoai, 2014) | 76.3 |
| VGG Net-D & Net-E, image-only | 79.2 |
| VGG Net-D & Net-E, image and bounding box | 84.0 |
Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes.
Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.
我们的表示在 VOC 动作分类任务上即使不使用提供的边界框也达到了当前最佳水平,并且在使用图像和边界框时结果得到了进一步改善。与其他方法不同,我们没有结合任何特定任务的启发式方法,而是依赖于非常深卷积特征的表达能力。
Other Recognition Tasks.
其他识别任务。
Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming shallower representations. For instance, Girshick et al. (2014) achieve state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over the shallower architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), and texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
自从我们的模型公开发布以来,它们已被研究界广泛用于各种图像识别任务,并持续超越更浅层的表示。例如,Girshick 等人(2014)通过用我们的 16 层模型替换 Krizhevsky 等人(2012)的卷积神经网络,实现了目标检测结果的当前最佳水平。在语义分割(Long 等人,2014)、图像描述生成(Kiros 等人,2014;Karpathy & Fei-Fei,2014)、纹理和材料识别(Cimpoi 等人,2014;Bell 等人,2014)等方面,也观察到了对 Krizhevsky 等人(2012)更浅层架构的类似提升。
Appendix C Paper Revisions
附录 C 论文修订
Here we present the list of major paper revisions, outlining the substantial changes for the convenience of the reader.
这里我们列出了主要论文的修订版本,概述了重要的变更,以便读者方便查阅。
v1 Initial version. Presents the experiments carried out before the ILSVRC submission.
v1 初版。展示了在提交 ILSVRC 之前进行的实验。
v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
v2 增加了提交后使用尺度抖动进行训练集增强的 ILSVRC 实验,这提升了性能。
v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets. The models used for these experiments are publicly available.
v3 在 PASCAL VOC 和 Caltech 图像分类数据集上增加了泛化实验(附录 B)。用于这些实验的模型是公开可用的。
v4 The paper is converted to ICLR-2015 submission format. Also adds experiments with multiple crops for classification.
v4 论文转换为 ICLR-2015 提交格式。还增加了多裁剪分类的实验。
v6 Camera-ready ICLR-2015 conference paper. Adds a comparison of net B with a shallow net and the results on the PASCAL VOC action classification benchmark.
v6 摄像头就绪的 ICLR-2015 会议论文。增加了网络 B 与浅网络的比较以及 PASCAL VOC 动作分类基准的结果。