DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs
DeepFuse：一种针对极端曝光图像对的深度无监督曝光融合方法

K. Ram Prabhakar, V Sai Srikar, and R. Venkatesh Babu
K·拉姆·普拉巴卡尔、V·赛·斯里卡尔与 R·文卡泰什·巴布
Video Analytics Lab, Department of Computational and Data Sciences,
Indian Institute of Science, Bangalore, India
印度科学研究所计算与数据科学系视频分析实验室，印度班加罗尔

Abstract 摘要

We present a novel deep learning architecture for fusing static multi-exposure images. Current multi-exposure fusion (MEF) approaches use hand-crafted features to fuse input sequence. However, the weak hand-crafted representations are not robust to varying input conditions. Moreover, they perform poorly for extreme exposure image pairs. Thus, it is highly desirable to have a method that is robust to varying input conditions and capable of handling extreme exposure without artifacts. Deep representations have known to be robust to input conditions and have shown phenomenal performance in a supervised setting. However, the stumbling block in using deep learning for MEF was the lack of sufficient training data and an oracle to provide the ground-truth for supervision. To address the above issues, we have gathered a large dataset of multi-exposure image stacks for training and to circumvent the need for ground truth images, we propose an unsupervised deep learning framework for MEF utilizing a no-reference quality metric as loss function. The proposed approach uses a novel CNN architecture trained to learn the fusion operation without reference ground truth image. The model fuses a set of common low level features extracted from each image to generate artifact-free perceptually pleasing results. We perform extensive quantitative and qualitative evaluation and show that the proposed technique outperforms existing state-of-the-art approaches for a variety of natural images.
我们提出了一种新颖的深度学习架构，用于融合静态多曝光图像。当前的多曝光融合（MEF）方法采用手工设计特征来融合输入序列，然而这些脆弱的手工表征对变化的输入条件缺乏鲁棒性，且在处理极端曝光图像对时表现欠佳。因此，亟需一种能适应不同输入条件、可处理极端曝光且无伪影的融合方法。众所周知，深度表征对输入条件具有强鲁棒性，并在监督学习环境中展现出卓越性能。但将深度学习应用于 MEF 的障碍在于缺乏足够的训练数据及提供监督真值的基准。为解决上述问题，我们收集了大规模多曝光图像堆栈数据集用于训练，并通过采用无参考质量指标作为损失函数，提出了无需地面真实图像的无监督深度学习框架。该方案采用新型 CNN 架构进行训练，可在无参考真值图像的情况下学习融合操作。该模型通过融合从每幅图像中提取的一组共同低级特征，生成无伪影且感知愉悦的结果。我们进行了广泛的定量与定性评估，结果表明所提出的技术在多种自然图像处理中优于现有最先进方法。

1 Introduction 1 引言

High Dynamic Range Imaging (HDRI) is a photography technique that helps to capture better-looking photos in difficult lighting conditions. It helps to store all range of light (or brightness) that is perceivable by human eyes, instead of using limited range achieved by cameras. Due to this property, all objects in the scene look better and clear in HDRI, without being saturated (too dark or too bright) otherwise.
高动态范围成像（HDRI）是一种摄影技术，能够在复杂光照条件下拍摄出更优质的照片。该技术旨在存储人眼可感知的全部光线（或亮度）范围，而非相机有限的动态范围。凭借这一特性，高动态范围图像中的所有物体都能呈现更佳、更清晰的视觉效果，避免了传统成像中常见的过暗或过亮现象。

The popular approach for HDR image generation is called as Multiple Exposure Fusion (MEF), in which, a set of static LDR images (further referred as exposure stack) with varying exposure is fused into a single HDR image. The proposed method falls under this category. Most of MEF algorithms work better when the exposure bias difference between each LDR images in exposure stack is minimum¹¹1Exposure bias value indicates the amount of exposure offset from the auto exposure setting of an camera. For example, EV 1 is equal to doubling auto exposure time (EV 0).. Thus they require more LDR images (typically more than 2 images) in the exposure stack to capture whole dynamic range of the scene. It leads to more storage requirement, processing time and power. In principle, the long exposure image (image captured with high exposure time) has better colour and structure information in dark regions and short exposure image (image captured with less exposure time) has better colour and structure information in bright regions. Though fusing extreme exposure images is practically more appealing, it is quite challenging (existing approaches fail to maintain uniform luminance across image). Additionally, it should be noted that taking more pictures increases power, capture time and computational time requirements. Thus, we propose to work with exposure bracketed image pairs as input to our algorithm.
高动态范围（HDR）图像生成的常用方法称为多曝光融合（MEF），该方法将一组曝光度不同的静态低动态范围（LDR）图像（后文简称曝光堆栈）融合为单张 HDR 图像。本文提出的方法即属于此类。大多数 MEF 算法在曝光堆栈中各 LDR 图像间曝光偏差差异最小时（趋近于 0）效果最佳，因此需要更多 LDR 图像（通常超过 2 张）来捕捉场景的完整动态范围。这会导致更高的存储需求、处理时间和功耗。从原理上说，长曝光图像（高曝光时间拍摄）在暗部区域具有更优的色彩与结构信息，而短曝光图像（低曝光时间拍摄）在亮部区域表现更佳。虽然融合极端曝光图像在实际应用中更具吸引力，但这一过程极具挑战性（现有方法难以维持图像整体亮度的一致性）。此外需注意，拍摄更多图像会增加功耗、采集时间与计算时间成本。因此，我们提出采用曝光包围的图像对作为算法的输入。

In this work, we present a data-driven learning method for fusing exposure bracketed static image pairs. To our knowledge this is the first work that uses deep CNN architecture for exposure fusion. The initial layers consists of a set of filters to extract common low-level features from each input image pair. These low-level features of input image pairs are fused for reconstructing the final result. The entire network is trained end-to-end using a no-reference image quality loss function.
本文提出了一种数据驱动的学习方法，用于融合曝光包围的静态图像对。据我们所知，这是首个采用深度 CNN 架构进行曝光融合的研究。网络初始层包含一组滤波器，用于从每对输入图像中提取共有的低层特征。这些输入图像对的低层特征经过融合后重建最终结果。整个网络采用无参考图像质量损失函数进行端到端训练。

We train and test our model with a huge set of exposure stacks captured with diverse settings (indoor/outdoor, day/night, side-lighting/back-lighting, and so on). Furthermore, our model does not require parameter fine-tuning for varying input conditions. Through extensive experimental evaluations we demonstrate that the proposed architecture performs better than state-of-the-art approaches for a wide range of input scenarios.
我们使用大量不同拍摄条件下（室内/室外、白天/夜晚、侧光/逆光等）采集的曝光堆栈图像对模型进行训练和测试。此外，我们的模型无需针对不同输入条件进行参数微调。通过大量实验评估，我们证明所提出的架构在多种输入场景下均优于现有最先进方法。

The contributions of this work are as follows:
本工作的贡献如下：

•

A CNN based unsupervised image fusion algorithm for fusing exposure stacked static image pairs.
一种基于 CNN 的无监督图像融合算法，用于融合曝光堆叠的静态图像对。
•

A new benchmark dataset that can be used for comparing various MEF methods.
构建了可用于比较多种多曝光融合方法的新基准数据集。
•

An extensive experimental evaluation and comparison study against 7 state-of-the-art algorithms for variety of natural images.
针对多种自然图像，与 7 种最先进算法进行了全面的实验评估与对比研究。

The paper is organized as follows. Section 2, we briefly review related works from literature. Section 3, we present our CNN based exposure fusion algorithm and discuss the details of experiments. Section 4, we provide the fusion examples and then conclude the paper with an insightful discussion in section 5.
本文组织结构如下：第 2 节简要回顾文献中的相关工作；第 3 节提出基于 CNN 的曝光融合算法并详述实验细节；第 4 节展示融合实例，第 5 节通过深入讨论总结全文。

2 Related Works 2 相关工作

Many algorithms have been proposed over the years for exposure fusion. However, the main idea remains the same in all the algorithms. The algorithms compute the weights for each image either locally or pixel wise. The fused image would then be the weighted sum of the images in the input sequence.
多年来，人们提出了许多曝光融合算法。然而，所有算法的核心思想始终保持一致。这些算法通过局部或逐像素方式计算每幅图像的权重，最终融合图像即为输入序列中各图像的加权求和结果。

Burt et al. [3] performed a Laplacian pyramid decomposition of the image and the weights are computed using local energy and correlation between the pyramids. Use of Laplacian pyramids reduces the chance of unnecessary artifacts. Goshtasby et al. [5] take non-overlapping blocks with highest information from each image to obtain the fused result. This is prone to suffer from block artifacts. Mertens et al. [16] perform exposure fusion using simple quality metrics such as contrast and saturation. However, this suffers from hallucinated edges and mismatched color artifacts.
Burt 等人[3]通过拉普拉斯金字塔分解图像，并利用金字塔间的局部能量和相关性计算权重。采用拉普拉斯金字塔可降低不必要伪影的产生概率。Goshtasby 等人[5]通过选取每幅图像中信息量最高的非重叠区块来获得融合结果，但这种方法容易产生区块伪影。Mertens 等人[16]使用对比度和饱和度等简单质量指标进行曝光融合，然而该方法存在边缘虚化和色彩失配伪影的问题。

Algorithms which make use of edge preserving filters like Bilateral filters are proposed in [19]. As this does not account for the luminance of the images, the fused image has dark region leading to poor results. A gradient based approach to assign the weight was put forward by Zhang et al. [28]. In a series of papers by Li et al. [9], [10] different approaches to exposure fusion have been reported. In their early works they solve a quadratic optimization to extract finer details and fuse them. In one of their later works [10], they propose a Guided Filter based approach.
文献[19]提出了利用双边滤波等边缘保持滤波器的算法。由于该方法未考虑图像亮度因素，融合结果会出现暗区导致效果不佳。Zhang 等人[28]提出了一种基于梯度的权重分配方法。Li 等人在系列研究[9][10]中报道了多种曝光融合方法，早期工作通过求解二次优化提取精细细节进行融合，后续研究[10]则提出了基于导向滤波的融合方案。

Shen et al. [22] proposed a fusion technique using quality metrics such as local contrast and color consistency. The random walk approach they perform gives a global optimum solution to the fusion problem set in a probabilistic fashion.
沈等人[22]提出了一种利用局部对比度和色彩一致性等质量指标的融合技术。他们所采用的随机游走方法以概率方式为融合问题提供了全局最优解。

All of the above works rely on hand-crafted features for image fusion. These methods are not robust in the sense that the parameters need to be varied for different input conditions say, linear and non-linear exposures, filter size depends on image sizes. To circumvent this parameter tuning we propose a feature learning based approach using CNN. In this work we learn suitable features for fusing exposure bracketed images. Recently, Convolutional Neural Network (CNN) have shown impressive performance across various computer vision tasks [8]. While CNNs have produced state-of-the-art results in many high-level computer vision tasks like recognition ([7], [21]), object detection [11], Segmentation [6], semantic labelling [17], visual question answering [2] and much more, their performance on low-level image processing problems such as filtering [4] and fusion [18] is not studied extensively. In this work we explore the effectiveness of CNN for the task of multi-exposure image fusion.
上述研究均依赖于手工设计的特征进行图像融合。这些方法在鲁棒性方面存在不足，因为需要根据不同输入条件（如线性与非线性曝光）调整参数，且滤波器尺寸取决于图像大小。为规避此类参数调优问题，我们提出了一种基于卷积神经网络（CNN）的特征学习方法。本研究通过学习适用于曝光包围图像融合的特征来实现这一目标。近年来，卷积神经网络在各类计算机视觉任务中展现出卓越性能[8]。尽管 CNN 在高级视觉任务（如识别[7][21]、目标检测[11]、分割[6]、语义标注[17]、视觉问答[2]等）中取得了最先进成果，但其在低层次图像处理问题（如滤波[4]与融合[18]）上的性能尚未得到充分研究。本工作重点探索 CNN 在多曝光图像融合任务中的有效性。

To our knowledge, use of CNNs for multi-exposure fusion is not reported in literature. The other machine learning approach is based on a regression method called Extreme Learning Machine (ELM) [25], that feed saturation level, exposedness, and contrast into the regressor to estimate the importance of each pixel. Instead of using hand crafted features, we use the data to learn a representation right from the raw pixels.
据我们所知，文献中尚未报道使用卷积神经网络进行多曝光融合的研究。另一种机器学习方法基于称为极限学习机（ELM）的回归方法[25]，该方法将饱和度、曝光度和对比度输入回归器以估计每个像素的重要性。与使用手工设计特征不同，我们直接从原始像素数据中学习表征。

3 Proposed Method 3 提出的方法

In this work, we propose an image fusion framework using CNNs. Within a span of couple years, Convolutional Neural Networks have shown significant success in high-end computer vision tasks. They are shown to learn complex mappings between input and output with the help of sufficient training data. CNN learns the model parameters by optimizing a loss function in order to predict the result as close as to the ground-truth. For example, let us assume that input x is mapped to output y by some complex transformation f. The CNN can be trained to estimate the function f that minimizes the difference between the expected output y and obtained output $\hat{\textbf{y}}$ . The distance between y and $\hat{\textbf{y}}$ is calculated using a loss function, such as mean squared error function. Minimizing this loss function leads to better estimate of required mapping function.
在本研究中，我们提出了一种基于卷积神经网络（CNN）的图像融合框架。近年来，卷积神经网络在高端计算机视觉任务中取得了显著成功。研究表明，借助充足的训练数据，CNN 能够学习输入与输出之间的复杂映射关系。通过优化损失函数来预测尽可能接近真实值的结果，CNN 得以学习模型参数。例如，假设输入 x 通过某种复杂变换 f 映射为输出 y，我们可以训练 CNN 来估计函数 f，使预期输出 y 与实际输出 $\hat{\textbf{y}}$ 之间的差异最小化。y 与 $\hat{\textbf{y}}$ 之间的距离通过损失函数（如均方误差函数）进行计算。最小化该损失函数可获得更精确的目标映射函数估计。

Let us denote the input exposure sequence and fusion operator as $I$ and $O(I)$ . The input images are assumed to be registered and aligned using existing registration algorithms, thus avoiding camera and object motion. We model $O(I)$ with a feed-forward process $F_{W}(I)$ . Here, $F$ denotes the network architecture and $W$ denotes the weights learned by minimizing the loss function. As the expected output $O(I)$ is absent for MEF problem, the squared error loss or any other full reference error metric cannot be used. Instead, we make use of no-reference image quality metric MEF SSIM proposed by Ma et al. [15] as loss function. MEF SSIM is based on structural similarity index metric (SSIM) framework [27]. It makes use of statistics of a patch around individual pixels from input image sequence to compare with result. It measures the loss of structural integrity as well as luminance consistency in multiple scales (see section 3.1.1 for more details).
设输入曝光序列与融合算子分别为 $I$ 和 $O(I)$ 。假设输入图像已通过现有配准算法完成对齐注册，从而规避相机与物体运动带来的影响。我们采用前馈过程 $F_{W}(I)$ 对 $O(I)$ 进行建模，其中 $F$ 表示网络架构， $W$ 表示通过最小化损失函数学习得到的权重参数。由于多曝光融合(MEF)问题缺乏预期输出 $O(I)$ ，故无法使用平方误差或任何全参考误差度量。为此，我们采用 Ma 等人[15]提出的无参考图像质量评价指标 MEF-SSIM 作为损失函数。该指标基于结构相似性指数(SSIM)框架[27]，通过利用输入图像序列中各像素邻域块的统计特征与融合结果进行对比，在多尺度下衡量结构完整性损失与亮度一致性（详见 3.1.1 节）。

An overall scheme of proposed method is shown in Fig. 1. The input exposure stack is converted into YCbCr color channel data. The CNN is used to fuse the luminance channel of the input images. This is due to the fact that the image structural details are present in luminance channel and the brightness variation is prominent in luminance channel than chrominance channels. The obtained luminance channel is combined with chroma (Cb and Cr) channels generated using method described in section 3.3. The following subsection details the network architecture, loss function and the training procedure.
图 1 展示了所提方法的整体框架。输入曝光序列首先被转换为 YCbCr 色彩通道数据，随后采用卷积神经网络对输入图像的亮度通道进行融合处理。这是由于图像结构细节主要存在于亮度通道，且亮度通道的明暗变化比色度通道更为显著。最终将获得的亮度通道与第 3.3 节所述方法生成的色度通道（Cb 和 Cr）进行组合。以下小节将详细阐述网络架构、损失函数及训练流程。

3.1 DeepFuse CNN 3.1 DeepFuse 卷积神经网络

The learning ability of CNN is heavily influenced by right choice of architecture and loss function. A simple and naive architecture is to have a series of convolutional layers connected in sequential manner. The input to this architecture would be exposure image pairs stacked in third dimension. Since the fusion happens in the pixel domain itself, this type of architecture does not make use of feature learning ability of CNNs to a great extent.
卷积神经网络的学习能力很大程度上取决于架构设计与损失函数的合理选择。一种简单直接的架构设计是采用多个卷积层顺序连接的结构。该架构的输入为沿第三维度堆叠的曝光图像对。由于融合过程直接作用于像素域，此类架构未能充分发挥卷积神经网络的特征学习能力。

The proposed network architecture for image fusion is illustrated in Fig. 2. The proposed architecture has three components: feature extraction layers, a fusion layer and reconstruction layers. As shown in Fig. 2, the under-exposed and the over-exposed images ( $Y_{1}$ and $Y_{2}$ ) are input to separate channels (channel 1 consists of C11 and C21 and channel 2 consists of C12 and C22). The first layer (C11 and C12) contains 5 $\times$ 5 filters to extract low-level features such as edges and corners. The weights of pre-fusion channels are tied, C11 and C12 (C21 and C22) share same weights. The advantage of this architecture is three fold: first, we force the network to learn the same features for the input pair. That is, the F11 and F21 are same feature type. Hence, we can simply combine the respective feature maps via fusion layer. Meaning, the first feature map of image 1 (F11) and the first feature map of image 2 (F21) are added and this process is applied for remaining feature maps as well. Also, adding the features resulted in better performance than other choices of combining features (see Table 1). In feature addition, similar feature types from both images are fused together. Optionally one can choose to concatenate features, by doing so, the network has to figure out the weights to merge them. In our experiments, we observed that feature concatenation can also achieve similar results by increasing the number of training iterations, increasing number of filters and layers after C3. This is understandable as the network needs more number of iterations to figure out appropriate fusion weights. In this tied-weights setting, we are enforcing the network to learn filters that are invariant to brightness changes. This is observed by visualizing the learned filters (see Fig. 8). In case of tied weights, few high activation filters have center surround receptive fields (typically observed in retina). These filters have learned to remove the mean from neighbourhood, thus effectively making the features brightness invariant. Second, the number of learnable filters is reduced by half. Third, as the network has low number of parameters, it converges quickly. The obtained features from C21 and C22 are fused by merge layer. The result of fuse layer is then passed through another set of convolutional layers (C3, C4 and C5) to reconstruct final result ( $Y_{fused}$ ) from fused features.
图 2 展示了本文提出的图像融合网络架构。该架构包含三个核心组件：特征提取层、融合层和重建层。如图 2 所示，欠曝光图像（ $Y_{1}$ ）与过曝光图像（ $Y_{2}$ ）分别输入两个独立通道（通道 1 包含 C11 与 C21，通道 2 包含 C12 与 C22）。第一层（C11 和 C12）采用 5×5 滤波器提取边缘与角点等底层特征。预融合通道采用权重共享机制，即 C11 与 C12（以及 C21 与 C22）具有相同权重。该架构具有三重优势：首先，强制网络对输入图像对学习相同特征类型，确保 F11 与 F21 属于同类特征，从而可直接通过融合层合并对应特征图——即将图像 1 的首个特征图（F11）与图像 2 的首个特征图（F21）相加，该操作同样适用于其余特征图。实验表明，特征相加的融合方式相较其他组合策略能获得更优性能（参见表 1）。在特征相加过程中，两幅图像的同类特征得以有效融合。可选地，可以选择通过特征拼接的方式，此时网络需要自行学习特征融合的权重。在实验中我们发现，通过增加训练迭代次数、在 C3 层后增加滤波器数量与网络层数，特征拼接同样能取得相近效果。这可以理解，因为网络需要更多迭代次数来学习合适的融合权重。在这种权重共享的设置下，我们强制网络学习对亮度变化具有不变性的滤波器。通过可视化已学习的滤波器（见图 8）可以观察到：在权重共享情况下，少数高激活滤波器具有中心-外周感受野（视网膜中常见特性），这些滤波器学会了消除邻域均值，从而有效实现特征亮度不变性。其次，可训练滤波器数量减少了一半。再者，由于网络参数量较少，收敛速度更快。最终，C21 与 C22 层提取的特征通过融合层进行合并。融合层的结果随后通过另一组卷积层（C3、C4 和 C5）进行处理，从融合特征中重建最终结果（ $Y_{fused}$ ）。

3.1.1 MEF SSIM loss function
3.1.1 多曝光融合结构相似性损失函数

In this section, we will discuss on computing loss without using reference image by MEF SSIM image quality measure [15]. Let $\{{\mathbf{y}}_{k}\}$ = $\{{\mathbf{y}}_{k}|k$ =1,2 $\}$ denote the image patches extracted at a pixel location $p$ from input image pairs and ${\mathbf{y}}_{f}$ denote the patch extracted from CNN output fused image at same location $p$ . The objective is to compute a score to define the fusion performance given ${\mathbf{y}}_{k}$ input patches and ${\mathbf{y}}_{f}$ fused image patch.
本节将讨论如何通过 MEF SSIM 图像质量评价指标[15]在不使用参考图像的情况下计算损失。设 $\{{\mathbf{y}}_{k}\}$ = $\{{\mathbf{y}}_{k}|k$ =1,2 $\}$ 表示从输入图像对中在像素位置 $p$ 处提取的图像块， ${\mathbf{y}}_{f}$ 表示在相同位置 $p$ 处从 CNN 输出融合图像中提取的图像块。其目标是计算一个评分，用以在给定 ${\mathbf{y}}_{k}$ 个输入图像块和 ${\mathbf{y}}_{f}$ 个融合图像块的情况下定义融合性能。

In SSIM [27] framework, any patch can be modelled using three components: structure ( ${\mathbf{s}}$ ), luminance ( $l$ ) and contrast ( $c$ ). The given patch is decomposed into these three components as:
在 SSIM[27]框架中，任何图像块均可通过三个分量建模：结构分量( ${\mathbf{s}}$ )、亮度分量( $l$ )和对比度分量( $c$ )。给定图像块可分解为如下三个分量：

$\displaystyle{\mathbf{y}}_{k}=$	$\displaystyle\\|{\mathbf{y}}_{k}-\mu_{{\mathbf{y}}_{k}}\\|\cdot\frac{{\mathbf{y}}_{k}-\mu_{{\mathbf{y}}_{k}}}{\\|{\mathbf{y}}_{k}-\mu_{{\mathbf{y}}_{k}}\\|}+\mu_{{\mathbf{y}}_{k}}$
$\displaystyle=$	$\displaystyle\\|\tilde{\mathbf{y}}_{k}\\|\cdot\frac{\tilde{\mathbf{y}}_{k}}{\\|\tilde{\mathbf{y}}_{k}\\|}+\mu_{{\mathbf{y}}_{k}}$
$\displaystyle=$	$\displaystyle c_{k}\cdot{\mathbf{s}}_{k}+l_{k},$	(1)

where, $\parallel\cdot\parallel$ is the $\ell_{2}$ norm of patch, $\mu_{{\mathbf{y}}_{k}}$ is the mean value of ${\mathbf{y}}_{k}$ and $\tilde{\mathbf{y}}_{k}$ is the mean subtracted patch. As the higher contrast value means better image, the desired contrast value ( $\hat{c}$ ) of the result is taken as the highest contrast value of $\{c_{k}\}$ , (i.e.)
其中， $\parallel\cdot\parallel$ 表示图像块的 $\ell_{2}$ 范数， $\mu_{{\mathbf{y}}_{k}}$ 为 ${\mathbf{y}}_{k}$ 的均值， $\tilde{\mathbf{y}}_{k}$ 表示减去均值后的图像块。由于对比度值越高意味着图像质量越好，因此结果图像的期望对比度值（ $\hat{c}$ ）取 $\{c_{k}\}$ 中的最高对比度值（即）。

\hat{c}=\underset{\{k=1,2\}}{\max}c_{k}

The structure of the desired result ( $\hat{{\mathbf{s}}}$ ) is obtained by weighted sum of structures of input patches as follows,
期望结果的结构（ $\hat{{\mathbf{s}}}$ ）通过以下方式对输入图像块的结构进行加权求和获得，

\bar{\mathbf{s}}=\frac{\sum_{k=1}^{2}w\left({\tilde{\mathbf{y}}_{k}}\right){\mathbf{s}}_{k}}{\sum_{k=1}^{2}w\left({\tilde{\mathbf{y}}_{k}}\right)}\quad{\rm and}\quad\hat{\mathbf{s}}=\frac{\bar{\mathbf{s}}}{\|\bar{\mathbf{s}}\|},

(2)

where the weighting function assigns weight based on structural consistency between input patches. The weighting function assigns equal weights to patches, when they have dissimilar structural components. In the other case, when all input patches have similar structures, the patch with high contrast is given more weight as it is more robust to distortions. The estimated $\hat{s}$ and $\hat{c}$ is combined to produce desired result patch as,
其中权重函数根据输入图像块之间的结构一致性来分配权重。当各输入块的结构成分差异较大时，权重函数会赋予它们相等的权重。反之，当所有输入块具有相似结构时，对比度较高的图像块会被赋予更大权重，因为其对失真具有更强的鲁棒性。最终将估计的 $\hat{s}$ 和 $\hat{c}$ 进行融合，生成目标结果块，具体表达式为：

\hat{{\mathbf{y}}}=\hat{c}\cdot\hat{s}

(3)

As the luminance comparison in the local patches is insignificant, the luminance component is discarded from above equation. Comparing luminance at lower spatial resolution does not reflect the global brightness consistency. Instead, performing this operation at multiple scales would effectively capture global luminance consistency in coarser scale and local structural changes in finer scales. The final image quality score for pixel $p$ is calculated using SSIM framework,
由于局部图像块的亮度对比不显著，因此上述方程中剔除了亮度分量。在较低空间分辨率下比较亮度无法反映全局亮度一致性。相反，在多尺度下执行该操作能有效捕捉粗尺度上的全局亮度一致性和细尺度上的局部结构变化。最终像素 $p$ 的图像质量评分采用结构相似性(SSIM)框架计算得出，

Score(p)=\frac{2\sigma_{\hat{\mathbf{y}}{\mathbf{y}}_{f}}+C}{{\sigma^{2}_{\hat{\mathbf{y}}}}+\sigma^{2}_{{\mathbf{y}}_{f}}+C},

(4)

where, $\sigma^{2}_{\hat{{\mathbf{y}}}}$ is variance and $\sigma_{\hat{\mathbf{y}}{\mathbf{y}}_{f}}$ is covariance between $\hat{{\mathbf{y}}}$ and ${\mathbf{y}}_{f}$ . The total loss is calculated as,
其中， $\sigma^{2}_{\hat{{\mathbf{y}}}}$ 表示方差， $\sigma_{\hat{\mathbf{y}}{\mathbf{y}}_{f}}$ 表示 $\hat{{\mathbf{y}}}$ 与 ${\mathbf{y}}_{f}$ 之间的协方差。总损失计算如下：

Loss=1-\frac{1}{N}\sum_{p\in P}Score(p)

(5)

where $N$ is the total number of pixels in image and $P$ is the set of all pixels in input image. The computed loss is backpropagated to train the network. The better performance of MEF SSIM is attributed to its objective function that maximizes structural consistency between fused image and each of input images.
式中 $N$ 为图像总像素数， $P$ 表示输入图像所有像素的集合。计算得到的损失通过反向传播训练网络。MEF SSIM 的优越性能源于其目标函数能最大化融合图像与各输入图像间的结构一致性。

Table 1: Choice of blending operators: Average MEF SSIM scores of 23 test images generated by CNNs trained with different feature blending operations. The maximum score is highlighted in bold. Results illustrate that adding the feature tensors yield better performance. Results by addition and mean methods are similar, as both operations are very similar, except for a scaling factor. Refer text for more details.
表 1：融合算子选择：采用不同特征融合操作训练的 CNN 对 23 幅测试图像生成的 MEF SSIM 平均得分。最高分以粗体标出。结果表明特征张量相加能获得更好性能。加法与均值法的结果相似，因二者操作除缩放因子外基本一致。详见正文说明。

Product 乘积	Concatenation 拼接	Max 最大值	Mean 平均值	Addition 加法
0.8210	0.9430	0.9638	0.9750	0.9782

3.2 Training 3.2 训练

We have collected 25 exposure stacks that are available publicly [1]. In addition to that, we have curated 50 exposure stacks with different scene characteristics. The images were taken with standard camera setup and tripod. Each scene consists of 2 low dynamic range images with $\pm 2$ EV difference. The input sequences are resized to 1200 $\times$ 800 dimensions. We give priority to cover both indoor and outdoor scenes. From these input sequences, 30000 patches of size 64 $\times$ 64 were cropped for training. We set the learning rate to $10^{-4}$ and train the network for 100 epochs with all the training patches being processed in each epoch.
我们收集了 25 组公开可用的曝光堆栈[1]，此外还整理了 50 组具有不同场景特征的曝光堆栈。这些图像均采用标准相机配置和三脚架拍摄，每个场景包含 2 张动态范围较低且曝光值相差 $\pm 2$ 档的图像。输入序列统一调整为 1200 $\times$ 800 像素尺寸，并优先覆盖室内外多种场景。从这些输入序列中，我们裁剪出 30000 个 64 $\times$ 64 尺寸的图块用于训练，设置学习率为 $10^{-4}$ ，并以所有训练图块每轮完整遍历的方式训练网络 100 个周期。

3.3 Testing 3.3 测试环节

We follow the standard cross-validation procedure to train our model and test the final model on a disjoint test set to avoid over-fitting. While testing, the trained CNN takes the test image sequence and generates the luminance channel ( $Y_{fused}$ ) of fused image. The chrominance components of fused image, $Cb_{fused}$ and $Cr_{fused}$ , are obtained by weighted sum of input chrominance channel values.
我们采用标准交叉验证流程训练模型，并在独立测试集上评估最终模型以防止过拟合。测试时，训练好的 CNN 网络接收测试图像序列并生成融合图像的亮度通道( $Y_{fused}$ )，而融合图像的色度分量 $Cb_{fused}$ 和 $Cr_{fused}$ 则通过输入色度通道值的加权求和获得。

The crucial structural details of the image tend to be present mainly in $Y$ channel. Thus, different fusion strategies are followed in literature for $Y$ and $Cb$ / $Cr$ fusion ([18], [24], [26]). Moreover, MEF SSIM loss is formulated to compute the score between 2 gray-scale ( $Y$ ) images. Thus, measuring MEF SSIM for $Cb$ and $Cr$ channels may not be meaningful. Alternately, one can choose to fuse RGB channels separately using different networks. However, there is typically a large correlation between RGB channels. Fusing RGB independently fails to capture this correlation and introduces noticeable color difference. Also, MEF-SSIM is not designed for RGB channels. Another alternative is to regress RGB values in a single network, then convert them to a $Y$ image and compute MEF SSIM loss. Here, the network can focus more on improving $Y$ channel, giving less importance to color. However, we observed spurious colors in output which were not originally present in input.
图像的关键结构细节往往主要存在于 $Y$ 通道中。因此，文献中对 $Y$ 与 $Cb$ / $Cr$ 融合采用了不同策略（[18]、[24]、[26]）。此外，MEF SSIM 损失函数专为计算两幅灰度（ $Y$ ）图像间的评分而设计，故对 $Cb$ 和 $Cr$ 通道进行 MEF SSIM 评估可能缺乏意义。替代方案之一是采用不同网络分别融合 RGB 通道，但 RGB 通道间通常存在强相关性，独立融合会丢失这种关联并导致明显色差。况且 MEF-SSIM 本非为 RGB 通道设计。另一替代方案是通过单一网络回归 RGB 值，再转换为 $Y$ 图像计算 MEF SSIM 损失——此时网络可侧重优化 $Y$ 通道而弱化色彩处理。但我们发现该方法会在输出中产生输入图像原本不存在的伪色。

We follow the procedure used by Prabhakar et al. [18] for chrominance channel fusion. If $x_{1}$ and $x_{2}$ denote the $Cb$ (or $Cr$ ) channel value at any pixel location for image pairs, then the fused chrominance value $x$ is obtained as follows,
我们采用 Prabhakar 等人[18]提出的色度通道融合方法。若 $x_{1}$ 和 $x_{2}$ 表示图像对中任意像素位置的 $Cb$ （或 $Cr$ ）通道值，则融合后的色度值 $x$ 按以下方式计算：

x=\dfrac{x_{1}(|x_{1}-\tau|)+x_{2}(|x_{2}-\tau|)}{|x_{1}-\tau|+|x_{2}-\tau|}

(6)

The fused chrominance value is obtained by weighing two chrominance values with $\tau$ subtracted value from itself. The value of $\tau$ is chosen as 128. The intuition behind this approach is to give more weight for good color components and less for saturated color values. The final result is obtained by converting { $Y_{fused}$ , $Cb_{fused}$ , $Cr_{fused}$ } channels into RGB image.
融合色度值通过对两个色度值进行加权获得，其中权重为 $\tau$ 与其自身相减的值。 $\tau$ 的取值设定为 128。该方法的核心思想是对优质色彩成分赋予更大权重，而对饱和色彩值降低权重。最终通过将{ $Y_{fused}$ , $Cb_{fused}$ , $Cr_{fused}$ }通道转换至 RGB 空间得到合成图像。

Table 2: MEF SSIM scores of different methods against DeepFuse (DF) for test images. Bolded values indicate the highest score by that corresponding column algorithm than others for that row image sequence.
表 2：不同方法相对于 DeepFuse(DF)在测试图像上的 MEF SSIM 评分。加粗数值表示该列算法在当前行图像序列中取得的最高分。

	Mertens09 梅尔滕斯 09	Raman11 拉曼 11	Li12 李 12	Li13 李 13	Shen11 沈 11	Ma15 马 15	Guo17 郭 17	DF-Baseline 深度融合基线方法	DF-UnSupervised 深度融合无监督方法
AgiaGalini 阿吉亚加利尼	0.9721	0.9343	0.9438	0.9409	0.8932	0.9465	0.9492	0.9477	0.9813
Balloons 气球	0.9601	0.897	0.9464	0.9366	0.9252	0.9608	0.9348	0.9717	0.9766
Belgium house 比利时房屋	0.9655	0.8924	0.9637	0.9673	0.9442	0.9643	0.9706	0.9677	0.9727
Building 建筑	0.9801	0.953	0.9702	0.9685	0.9513	0.9774	0.9666	0.965	0.9826
Cadik lamp 卡迪克灯	0.9658	0.8696	0.9472	0.9434	0.9152	0.9464	0.9484	0.9683	0.9638
Candle 蜡烛	0.9681	0.9391	0.9479	0.9017	0.9441	0.9519	0.9451	0.9704	0.9893
Chinese garden 中式园林	0.990	0.8887	0.9814	0.9887	0.9667	0.990	0.9860	0.9673	0.9838
Corridor 走廊	0.9616	0.898	0.9709	0.9708	0.9452	0.9592	0.9715	0.9740	0.9740
Garden 花园	0.9715	0.9538	0.9431	0.932	0.9136	0.9667	0.9481	0.9385	0.9872
Hostel 旅舍	0.9678	0.9321	0.9745	0.9742	0.9649	0.9712	0.9757	0.9715	0.985
House 房屋	0.9748	0.8319	0.9575	0.9556	0.9356	0.9365	0.9623	0.9601	0.9607
Kluki Bartlomiej 克鲁基·巴托米耶	0.9811	0.9042	0.9659	0.9645	0.9216	0.9622	0.9680	0.9723	0.9742
Landscape 风景	0.9778	0.9902	0.9577	0.943	0.9385	0.9817	0.9467	0.9522	0.9913
Lighthouse 灯塔	0.9783	0.9654	0.9658	0.9545	0.938	0.9702	0.9657	0.9728	0.9875
Madison capitol 麦迪逊州议会大厦	0.9731	0.8702	0.9516	0.9668	0.9414	0.9745	0.9711	0.9459	0.9749
Memorial 纪念碑	0.9676	0.7728	0.9644	0.9771	0.9547	0.9754	0.9739	0.9727	0.9715
Office 办公室	0.9749	0.922	0.9367	0.9495	0.922	0.9746	0.9624	0.9277	0.9749
Room 房间	0.9645	0.8819	0.9708	0.9775	0.9543	0.9641	0.9725	0.9767	0.9724
SwissSunset 瑞士日落	0.9623	0.9168	0.9407	0.9137	0.8155	0.9512	0.9274	0.9736	0.9753
Table 桌子	0.9803	0.9396	0.968	0.9501	0.9641	0.9735	0.9750	0.9468	0.9853
TestChart1 测试图表 1	0.9769	0.9281	0.9649	0.942	0.9462	0.9529	0.9617	0.9802	0.9831
Tower 塔楼	0.9786	0.9128	0.9733	0.9779	0.9458	0.9704	0.9772	0.9734	0.9738
Venice 威尼斯	0.9833	0.9581	0.961	0.9608	0.9307	0.9836	0.9632	0.9562	0.9787

4 Experiments and Results 4 实验与结果

We have conducted extensive evaluation and comparison study against state-of-the-art algorithms for variety of natural images. For evaluation, we have chosen standard image sequences to cover different image characteristics including indoor and outdoor, day and night, natural and artificial lighting, linear and non-linear exposure. The proposed algorithm is compared against seven best performing MEF algorithms, (1) Mertens09 [16], (2) Li13 [10] (3) Li12 [9] (4) Ma15 [14] (5) Raman11 [20] (6) Shen11 [23] and (7) Guo17 [12]. In order to evaluate the performance of algorithms objectively, we adopt MEF SSIM. Although number of other IQA models for general image fusion have also been reported, none of them makes adequate quality predictions of subjective opinions [15].
我们针对各类自然图像，对当前最先进的算法进行了广泛评估与比较研究。为全面评估，选取的标准图像序列涵盖了室内外场景、昼夜环境、自然与人工光源、线性与非线性曝光等不同图像特性。所提算法与七种性能最优的多曝光融合(MEF)算法进行对比：(1)Mertens09[16]、(2)Li13[10]、(3)Li12[9]、(4)Ma15[14]、(5)Raman11[20]、(6)Shen11[23]及(7)Guo17[12]。为客观评估算法性能，采用 MEF-SSIM 评价指标。尽管已有多种通用图像融合的质量评估模型被提出，但均未能充分预测主观视觉质量[15]。

4.1 DeepFuse - Baseline 4.1 DeepFuse 基准模型

So far, we have discussed on training CNN model in unsupervised manner. One interesting variant of that would be to train the CNN model with results of other state-of-art methods as ground truth. This experiment can test the capability of CNN to learn complex fusion rules from data itself without the help of MEF SSIM loss function. The ground truth is selected as best of Mertens [16] and GFF [10] methods based on MEF SSIM score²²2In a user survey conducted by Ma et al. [15], Mertens and GFF results are ranked better than other MEF algorithms. The choice of loss function to calculate error between ground truth and estimated output is very crucial for training a CNN in supervised fashion. The Mean Square Error or $\ell_{2}$ loss function is generally chosen as default cost function for training CNN. The $\ell_{2}$ cost function is desired for its smooth optimization properties. While $\ell_{2}$ loss function is better suited for classification tasks, they may not be a correct choice for image processing tasks [29]. It is also a well known phenomena that MSE does not correlate well with human perception of image quality [27]. In order to obtain visually pleasing result, the loss function should be well correlated with HVS, like Structural Similarity Index (SSIM) [27]. We have experimented with different loss functions such as $\ell_{1}$ , $\ell_{2}$ and SSIM.
迄今为止，我们已探讨了以无监督方式训练 CNN 模型的方法。其中一个有趣的变体是采用其他先进方法的结果作为基准真值来训练 CNN 模型。该实验可验证 CNN 在无需 MEF SSIM 损失函数辅助的情况下，从数据中学习复杂融合规则的能力。基于 MEF SSIM 评分 ²²2In a user survey conducted by Ma et al. [15], Mertens and GFF results are ranked better than other MEF algorithms ，我们选取 Mertens[16]与 GFF[10]方法中的最优结果作为基准真值。在监督式 CNN 训练中，用于计算基准真值与估计输出间误差的损失函数选择至关重要。均方误差（MSE）或 $\ell_{2}$ 损失函数通常作为默认代价函数，其平滑优化特性使 $\ell_{2}$ 代价函数备受青睐。虽然 $\ell_{2}$ 损失函数更适用于分类任务，但对于图像处理任务可能并非最佳选择[29]。众所周知，MSE 与人类视觉感知的图像质量相关性较弱[27]。为获得视觉感知良好的结果，损失函数应与人类视觉系统（HVS）高度相关，例如结构相似性指数（SSIM）[27]。我们尝试了不同的损失函数，如 $\ell_{1}$ 、 $\ell_{2}$ 和结构相似性指数（SSIM）。

The fused image appear blurred when the CNN was trained with $\ell_{2}$ loss function. This effect termed as regression to mean, is due to the fact that $\ell_{2}$ loss function compares the result and ground truth in a pixel by pixel manner. The result by $\ell_{1}$ loss gives sharper result than $\ell_{2}$ loss but it has halo effect along the edges. Unlike $\ell_{1}$ and $\ell_{2}$ , results by CNN trained with SSIM loss function are both sharp and artifact-free. Therefore, SSIM is used as loss function to calculate error between generated output and ground truth in this experiment.
当 CNN 使用 $\ell_{2}$ 损失函数训练时，融合图像会出现模糊现象。这种被称为"均值回归"的效应源于 $\ell_{2}$ 损失函数以逐像素方式比较结果与真实值。使用 $\ell_{1}$ 损失函数得到的结果比 $\ell_{2}$ 损失函数更清晰，但边缘存在光晕效应。与 $\ell_{1}$ 和 $\ell_{2}$ 不同，采用 SSIM 损失函数训练的 CNN 结果既清晰又无伪影。因此本实验采用 SSIM 作为损失函数来计算生成输出与真实值之间的误差。

The quantitative comparison between DeepFuse baseline and unsupervised method is shown in Table 2. The MEF SSIM scores in Table 2 shows the superior performance of DeepFuse unsupervised over baseline method in almost all test sequences. The reason is due to the fact that for baseline method, the amount of learning is upper bound by the other algorithms, as the ground truth for baseline method is from Merterns et al. [16] or Li et al. [10]. We see from Table 2 that the baseline method does not exceed both of them.
表 2 展示了 DeepFuse 基线方法与无监督方法的定量对比结果。表中 MEF SSIM 评分表明，在几乎所有测试序列中，DeepFuse 无监督方法均优于基线方法。这是由于基线方法的学习效果受限于其他算法，因其真值数据来源于 Merterns 等人[16]或 Li 等人[10]的研究。从表 2 可见，基线方法的性能始终未能超越这两个参照方法。

The idea behind this experiment is to combine advantages of all previous methods, at the same time avoid shortcomings of each. From Fig. 3, we can observe that though DF-baseline is trained with results of other methods, it can produce results that do not have any artifacts observed in other results.
本实验的核心思想在于综合先前所有方法的优势，同时规避各自的缺陷。从图 3 可以观察到，尽管 DF-baseline 模型是使用其他方法的结果进行训练的，但其生成的结果并未出现其他方法中常见的伪影现象。

4.2 Comparison with State-of-the-art
4.2 与现有最优方法的比较

Comparison with Mertens et al.: Mertens et al. [16] is a simple and effective weighting based image fusion technique with multi resolution blending to produce smooth results. However, it suffers from following shortcomings: (a) it picks “best” parts of each image for fusion using hand crafted features like saturation and well-exposedness. This approach would work better for image stacks with many exposure images. But for exposure image pairs, it fails to maintain uniform brightness across whole image. Compared to Mertens et al., DeepFuse produces images with consistent and uniform brightness across whole image. (b) Mertens et al. does not preserve complete image details from under exposed image. In Fig. 4(d), the details of the tile area is missing in Mertens et al.’s result. The same is the case in Fig. 4(j), the fine details of the lamp are not present in the Mertens et al. result. Whereas, DeepFuse has learned filters that extract features like edges and textures in C1 and C2, and preserves finer structural details of the scene.
与 Mertens 等人的方法比较：Mertens 等人[16]提出了一种基于权重分配的多分辨率融合技术，该方法简单有效且能生成平滑结果。但其存在以下不足：(a)该方法通过手工设计特征（如饱和度和曝光良好度）从每幅图像中选取"最佳"部分进行融合。这种策略在曝光图像数量较多时效果较好，但对于曝光图像对而言，难以维持整幅图像的亮度均匀性。相比之下，DeepFuse 生成的图像在整个画面中保持亮度一致且均匀。(b)Mertens 等人的方法无法完整保留欠曝光图像中的细节。如图 4(d)所示，瓦片区域的细节在其结果中缺失；同样在图 4(j)中，灯具的精细结构也未能在其结果中呈现。而 DeepFuse 通过学习得到的滤波器，能够在 C1 和 C2 层提取边缘和纹理等特征，从而保留场景更精细的结构细节。

Comparison with Li et al. [9] [10]: It can be noted that, similar to Mertens et al. [16], Li et al. [9] [10] also suffers from non-uniform brightness artifact (Fig. 5). In contrast, our algorithm provides a more pleasing image with clear texture details.
与 Li 等人[9][10]的对比：可以注意到，与 Mertens 等人[16]类似，Li 等人[9][10]的方法同样存在亮度不均匀的伪影问题（图 5）。相比之下，我们的算法生成的图像视觉效果更佳，纹理细节清晰。

Comparison with Shen et al. [23]: The results generated by Shen et al. show contrast loss and non-uniform brightness distortions (Fig. 5). In Fig. 5(e1), the brightness distortion is present in the cloud region. The cloud regions in between balloons appear darker compared to other regions. This distortion can be observed in other test images as well in Fig. 5(e2). However, the DeepFuse (Fig. 5(f1) and (f2) ) have learnt to produce results without any of these artifacts.
与 Shen 等人[23]的对比：Shen 等人方法生成的结果存在对比度损失和亮度不均匀失真（图 5）。在图 5(e1)中，云层区域出现了亮度失真，气球之间的云区比其他区域显得更暗。这种失真现象在图 5(e2)的其他测试图像中同样可见。而 DeepFuse 方法（图 5(f1)和(f2)）则能够生成完全没有这些伪影的结果。

Comparison with Ma et al. [14]: Fig. 6 and 7 shows comparison between results of Ma et al. and DeepFuse for Lighthouse and Table sequences. Ma et al. proposed a patch based fusion algorithm that fuses patches from input images based on their patch strength. The patch strength is calculated using a power weighting function on each patch. This method of weighting would introduce unpleasant halo effect along edges (see Fig. 6 and 7).
与 Ma 等人[14]的对比：图 6 和图 7 展示了 Lighthouse 和 Table 序列在 Ma 等人方法与 DeepFuse 方法下的结果比较。Ma 等人提出了一种基于图像块的融合算法，该算法根据图像块的强度对输入图像的块进行融合。块强度通过每个块的幂加权函数计算得出。这种加权方法会在边缘处产生不自然的光晕效应（见图 6 和图 7）。

Comparison with Raman et al. [20]: Fig. 3(f) shows the fused result by Raman et al. for House sequence. The result exhibit color distortion and contrast loss. In contrast, proposed method produces result with vivid color quality and better contrast.
与 Raman 等人[20]的对比：图 3(f)展示了 Raman 等人对 House 序列的融合结果。该结果存在色彩失真和对比度损失的问题。相比之下，本文提出的方法能生成色彩生动且对比度更优的结果。

After examining the results by both subjective and objective evaluations, we observed that our method is able to faithfully reproduce all the features in the input pair. We also notice that the results obtained by DeepFuse are free of artifacts such as darker regions and mismatched colors. Our approach preserves the finer image details along with higher contrast and vivid colors. The quantitative comparison between proposed method and existing approaches in Table 2 also shows that proposed method outperforms others in most of the test sequences. From the execution times shown in Table 3 we can observe that our method is roughly 3-4 $\times$ faster than Mertens et al. DeepFuse can be easily extended to more input images by adding additional streams before merge layer. We have trained DeepFuse for sequences with 3 and 4 images. For sequences with 3 images, average MEF SSIM score for DF is 0.987 and 0.979 for Mertens et al. For sequences with 4 images, average MEF SSIM score for DF is 0.972 and 0.978 for Mertens et al. For sequences with 4 images, we attribute dip in performance to insufficient training data. With more training data, DF can be trained to perform better in such cases as well.
通过主客观评估检验结果后，我们发现本方法能准确复现输入图像对的所有特征。同时观察到 DeepFuse 生成的结果不存在暗区或色彩失配等伪影问题。该方法在保持更高对比度与鲜艳色彩的同时，还能保留更精细的图像细节。表 2 所示定量对比表明，在多数测试序列中本方法优于现有方案。由表 3 执行时间可见，本方法速度较 Mertens 等人方案快约 3-4 倍。通过合并层前增加额外数据流，DeepFuse 可轻松扩展至多图像输入。我们已针对 3 幅和 4 幅图像序列训练模型：对于 3 幅图像序列，DF 的平均 MEF-SSIM 得分为 0.987，Mertens 方案为 0.979；4 幅图像序列中 DF 得分为 0.972，Mertens 方案为 0.978。我们认为 4 幅序列性能下降源于训练数据不足。随着训练数据的增加，DF 模型也能通过训练在这些情况下表现更优。

4.3 Application to Multi-Focus Fusion
4.3 多焦点融合应用

In this section, we discuss the possibility of applying our DeepFuse model for solving other image fusion problems. Due to the limited depth-of-field in the present day cameras, only object in limited range of depth are focused and the remaining regions appear blurry. In such scenario, Multi-Focus Fusion (MFF) techniques are used to fuse images taken with varying focus to generate a single all-in-focus image. MFF problem is very similar to MEF, except that the input images have varying focus than varying exposure for MEF. To test the generalizability of CNN, we have used the already trained DeepFuse CNN to fuse multi-focus images without any fine-tuning for MFF problem. Fig. 9 shows that the DeepFuse results on publicly available multi-focus dataset show that the filters of CNN have learnt to identify proper regions in each input image and successfully fuse them together. It can also be seen that the learnt CNN filters are generic and could be applied for general image fusion.
本节探讨将 DeepFuse 模型应用于解决其他图像融合问题的可能性。由于现代相机景深有限，只有特定深度范围内的物体能清晰对焦，其余区域则呈现模糊状态。在此情况下，多焦点融合技术通过整合不同对焦位置的图像，生成单张全清晰图像。多焦点融合问题与多曝光融合高度相似，区别仅在于输入图像的变化因素是对焦位置而非曝光程度。为测试卷积神经网络的泛化能力，我们直接使用已训练的 DeepFuse CNN 模型进行多焦点图像融合，未针对该问题做任何微调。图 9 显示，在公开多焦点数据集上的融合结果表明，CNN 滤波器已学会识别各输入图像的恰当区域并成功实现融合。这同时证明所学得的 CNN 滤波器具有通用性，可推广至一般图像融合任务。

Table 3: Computation time: Running time in seconds of different algorithms on a pair of images. The numbers in bold denote the least amount of time taken to fuse.

\ddagger

: tested with NVIDIA Tesla K20c GPU,

\dagger

: tested with Intel^®Xeon @ 3.50 GHz CPU
表 3：计算时间：不同算法处理一对图像的运行时间（单位：秒）。加粗数字表示完成融合所需的最短时间。

\ddagger

：使用 NVIDIA Tesla K20c GPU 测试，

\dagger

：使用 Intel ^® Xeon @ 3.50 GHz CPU 测试

Image size 图像尺寸	Ma $15^{\dagger}$ 马 $15^{\dagger}$	Li $13^{\dagger}$ 李 $13^{\dagger}$	Mertens $07^{\dagger}$ 默滕斯 $07^{\dagger}$	$DF^{\ddagger}$
512*384	2.62	0.58	0.28	0.07
1024*768	9.57	2.30	0.96	0.28
1280*1024	14.72	3.67	1.60	0.46
1920*1200	27.32	6.60	2.76	0.82

5 Conclusion and Future work
5 结论与未来工作

In this paper, we have proposed a method to efficiently fuse a pair of images with varied exposure levels to produce an output which is artifact-free and perceptually pleasing. DeepFuse is the first ever unsupervised deep learning method to perform static MEF. The proposed model extracts set of common low-level features from each input images. Feature pairs of all input images are fused into a single feature by merge layer. Finally, the fused features are input to reconstruction layers to get the final fused image. We train and test our model with a huge set of exposure stacks captured with diverse settings. Furthermore, our model is free of parameter fine-tuning for varying input conditions. Finally, from extensive quantitative and qualitative evaluation, we demonstrate that the proposed architecture performs better than state-of-the-art approaches for a wide range of input scenarios.
本文提出了一种高效融合不同曝光度图像对的方法，能够生成无伪影且视觉感知良好的输出结果。DeepFuse 是首个采用无监督深度学习实现静态多曝光融合的方法。该模型从每幅输入图像中提取共有的底层特征集，通过融合层将所有输入图像的特征对合并为单一特征，最终将融合特征输入重建层以获得合成图像。我们使用多种拍摄环境下获取的大规模曝光序列数据集进行模型训练与测试。此外，该模型无需针对不同输入条件进行参数微调。通过大量定量与定性评估实验证明，所提架构在多种输入场景下的性能均优于当前最先进方法。

In summary, the advantages offered by DF are as follows: 1) Better fusion quality: produces better fusion result even for extreme exposure image pairs, 2) SSIM over $\ell_{1}$ : In [29], the authors report that $\ell_{1}$ loss outperforms SSIM loss function. In their work, the authors have implemented approximate version of SSIM and found it to perform sub-par compared to $\ell_{1}$ . We have implemented the exact SSIM formulation and observed that SSIM loss function perform much better than MSE and $\ell_{1}$ . Further, we have shown that a complex perceptual loss such as MEF SSIM can be successfully incorporated with CNNs in absense of ground truth data. The results encourage the research community to examine other perceptual quality metrics and use them as loss functions to train a neural net. 3) Generalizability to other fusion tasks: The proposed fusion is generic in nature and could be easily adapted to other fusion problems as well. In our current work, DF is trained to fuse static images. For future research, we aim to generalize DeepFuse to fuse images with object motion as well.
综上所述，DF 方法具有以下优势：1）更优的融合质量：即使对于极端曝光图像对也能生成更好的融合结果；2）结构相似性（SSIM）超越基准：文献[29]指出 $\ell_{1}$ 损失函数优于 SSIM 损失函数。该研究团队实现了 SSIM 的近似版本，发现其性能逊于 $\ell_{1}$ 。我们采用精确的 SSIM 公式实现，观察到 SSIM 损失函数表现显著优于均方误差（MSE）和 $\ell_{1}$ 。此外，我们证明了在缺乏真实数据的情况下，诸如 MEF-SSIM 这类复杂感知损失函数能够成功与卷积神经网络结合。这一发现鼓励研究界探索其他感知质量指标并将其作为神经网络训练的损失函数；3）对其他融合任务的泛化能力：所提出的融合方法具有通用性，可轻松适配其他融合问题。在当前工作中，DF 被训练用于静态图像融合。未来研究将致力于将 DeepFuse 推广至含运动物体的图像融合领域。

References

[1] EMPA HDR image database. http://www.empamedia.ethz.ch/hdrdatabase/index.php. Accessed: 2016-07-13.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[3] P. J. Burt and R. J. Kolczynski. Enhanced image capture through fusion. In Proceedings of the International Conference on Computer Vision, 1993.
[4] N. Divakar and R. V. Babu. Image denoising via CNNs: An adversarial approach. In New Trends in Image Restoration and Enhancement, CVPR workshop, 2017.
[5] A. A. Goshtasby. Fusion of multi-exposure images. Image and Vision Computing, 23(6):611–618, 2005.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
[8] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[9] S. Li and X. Kang. Fast multi-exposure image fusion with median filter and recursive filter. IEEE Transaction on Consumer Electronics, 58(2):626–632, May 2012.
[10] S. Li, X. Kang, and J. Hu. Image fusion with guided filtering. IEEE Transactions on Image Processing, 22(7):2864–2875, July 2013.
[11] Y. Li, K. He, J. Sun, et al. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 2016.
[12] Z. Li, Z. Wei, C. Wen, and J. Zheng. Detail-enhanced multi-scale exposure fusion. IEEE Transactions on Image Processing, 26(3):1243–1252, 2017.
[13] Y. Liu, S. Liu, and Z. Wang. Multi-focus image fusion with dense SIFT. Information Fusion, 23:139–155, 2015.
[14] K. Ma and Z. Wang. Multi-exposure image fusion: A patch-wise approach. In IEEE International Conference on Image Processing, 2015.
[15] K. Ma, K. Zeng, and Z. Wang. Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing, 24(11):3345–3356, 2015.
[16] T. Mertens, J. Kautz, and F. Van Reeth. Exposure fusion. In Pacific Conference on Computer Graphics and Applications, 2007.
[17] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. arXiv preprint arXiv:1306.2795, 2013.
[18] K. R. Prabhakar and R. V. Babu. Ghosting-free multi-exposure image fusion in gradient domain. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
[19] S. Raman and S. Chaudhuri. Bilateral filter based compositing for variable exposure photography. In Proceedings of EUROGRAPHICS, 2009.
[20] S. Raman and S. Chaudhuri. Reconstruction of high contrast images for dynamic scenes. The Visual Computer, 27:1099–1114, 2011. 10.1007/s00371-011-0653-0.
[21] R. K. Sarvadevabhatla, J. Kundu, et al. Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition. In Proceedings of the ACM on Multimedia Conference, 2016.
[22] J. Shen, Y. Zhao, S. Yan, X. Li, et al. Exposure fusion using boosting laplacian pyramid. IEEE Trans. Cybernetics, 44(9):1579–1590, 2014.
[23] R. Shen, I. Cheng, J. Shi, and A. Basu. Generalized random walks for fusion of multi-exposure images. IEEE Transactions on Image Processing, 20(12):3634–3646, 2011.
[24] M. Tico and K. Pulli. Image enhancement method via blur and noisy image fusion. In IEEE International Conference on Image Processing, 2009.
[25] J. Wang, B. Shi, and S. Feng. Extreme learning machine based exposure fusion for displaying HDR scenes. In International Conference on Signal Processing, 2012.
[26] J. Wang, D. Xu, and B. Li. Exposure fusion based on steerable pyramid for displaying high dynamic range scenes. Optical Engineering, 48(11):117003–117003, 2009.
[27] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[28] W. Zhang and W.-K. Cham. Reference-guided exposure fusion in dynamic scenes. Journal of Visual Communication and Image Representation, 23(3):467–475, 2012.
[29] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for neural networks for image processing. arXiv preprint arXiv:1511.08861, 2015.

DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image PairsDeepFuse：一种针对极端曝光图像对的深度无监督曝光融合方法