(eccv) Package eccv Warning: Package ‘hyperref’ is loaded with option ‘pagebackref’, which is *not* recommended for camera-ready version
(eccv) 包 eccv 警告: 包 'hyperref' 以选项 'pagebackref' 加载，这*不*推荐用于最终版

¹¹institutetext: University of Florence - Media Integration and Communication Center (MICC)
Florence, Italy
¹¹email: name.surname@unifi.it

Quality-Aware Image-Text Alignment
for Real-World Image Quality Assessment
用于真实世界图像质量评估的质量感知图像文本对齐

Lorenzo Agnolucci 洛伦佐·阿尼诺卢奇 0000-0002-9558-1287 Leonardo Galteri 0000-0002-7247-9407 Marco Bertini 0000-0002-1364-218X

Abstract 摘要

No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on annotated Mean Opinion Scores (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require labeled MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the inherent quality of the images. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts, while guaranteeing consistent representations for images with comparable quality. Our method achieves state-of-the-art performance on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms supervised methods when their training dataset differs from the testing one, thus proving to be more suitable for real-world scenarios. Furthermore, our approach demonstrates greater robustness and improved explainability than competing methods. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP
无参考图像质量评估（NR-IQA）侧重于在没有高质量参考图像的情况下，设计与人类感知一致的图像质量测量方法。大多数最先进的 NR-IQA 方法依赖于带注释的平均意见分数（MOS），这限制了它们在实际场景中的可扩展性和更广泛的适用性。为了解决这一限制，我们提出了 QualiCLIP（质量感知的 CLIP），这是一种基于 CLIP 的自监督无意见方法，不需要标记的 MOS。特别地，我们引入了一种质量感知的图像-文本对齐策略，使 CLIP 生成与图像固有质量相关的表示。从完美图像开始，我们对它们进行合成降级，增强强度。随后，我们训练 CLIP 根据它们与质量相关反义词提示的相似性对这些降级图像进行排名，同时确保质量相当的图像具有一致的表示。我们的方法在多个具有真实失真的数据集上实现了最先进的表现。此外，尽管不需要 MOS，但当训练数据集与测试数据集不同时，QualiCLIP 优于有监督方法，从而证明其更适用于现实世界场景。此外，我们的方法表现出比竞争方法更大的鲁棒性和更好的可解释性。代码和模型在 https://github.com/miccunifi/QualiCLIP 上公开提供。.

Keywords:

Image Quality Assessment CLIP Self-Supervised Learning
关键词：图像质量评估 CLIP 自监督学习

1 Introduction 1 引言

Refer to caption — Figure 1: Comparison between the image quality scores predicted by CLIP-IQA [36] and QualiCLIP for increasing distortion intensities of different types of synthetic degradation. We average the results of 1000 randomly sampled images from the KonIQ [10] dataset. Compared to CLIP-IQA, our method corresponds to a higher correlation between the predicted quality scores and the severity of the degradation. The distortion intensities are scaled between 0 and 1 for clearer visualization.
图 1：CLIP-IQA [36] 和 QualiCLIP 所预测的图像质量分数在不同类型的合成降解强度增加时的比较。我们对来自 KonIQ [10] 数据集的 1000 个随机抽样图像的结果进行了平均。与 CLIP-IQA 相比，我们的方法在预测质量分数与降解严重程度之间具有更高的相关性。为了更清晰地展示，失真强度在 0 和 1 之间进行了缩放。

Image Quality Assessment (IQA) aims to automatically evaluate the quality of images in accordance with human judgments, represented by Mean Opinion Scores (MOS). Specifically, No-Reference IQA (NR-IQA) focuses on developing methods that do not require a high-quality reference image and that are consequently more easily applicable in real-world scenarios. NR-IQA plays a critical role in diverse industries and research domains. For example, given the large number of photos that are captured and shared daily on social media platforms, it is imperative to design approaches that can measure image quality objectively to be able to store and process these images effectively. However, for such approaches to be reliable, they need to exhibit strong generalization capabilities.
图像质量评估 (IQA) 旨在自动评估图像质量，使其与人类判断一致，并以平均意见分数 (MOS) 为表现形式。特别是，无参考 IQA (NR-IQA) 集中于开发无需高质量参考图像的方法，从而更容易在现实场景中应用。NR-IQA 在各个行业和研究领域中发挥着关键作用。例如，考虑到社交媒体平台上每日捕获和分享的大量照片，设计能够客观衡量图像质量的方法对于有效存储和处理这些图像至关重要。然而，为了使这些方法可靠，它们需要表现出强大的泛化能力。

Most NR-IQA methods are opinion-aware, i.e. they require labeled mean opinion scores as supervision during the training process [33, 7, 41, 23, 1, 28]. Some approaches, such as HyperIQA [33] or TReS [7], directly train the model parameters on IQA datasets. Other methods, namely CONTRIQUE [23] or ARNIQA [1], train first an image encoder on unlabeled data via self-supervised learning and then a linear regressor using the MOS. However, annotating IQA datasets is very expensive and resource-intensive, as several human ratings are needed for each image for the MOS to be reliable. For example, the FLIVE dataset [39], which contains 40K real-world images, required about 4M ratings, up to 50 for a single image. The requirement for human annotations significantly hinders the scalability of opinion-aware approaches. In addition, these methods show limited generalization capabilities and thus applicability to real-world scenarios, as their performance significantly deteriorates in cross-dataset settings, i.e. when considering testing datasets different from the training one. To remove the requirement for labeled MOS, several opinion-unaware methods have been proposed [20, 4, 8]. For instance, CL-MI [4] introduces a two-stage self-supervised approach that employs two different training strategies for synthetically and authentically degraded images. However, existing opinion-unaware methods obtain considerably worse performance than supervised approaches in cross-dataset experiments, thus demonstrating limited applicability.
大多数 NR-IQA 方法是意见感知的，即在训练过程中需要标记的平均意见评分作为监督[33, 7, 41, 23, 1, 28]。一些方法，例如 HyperIQA [33]或 TReS [7]，直接在 IQA 数据集上训练模型参数。其他方法，即 CONTRIQUE [23]或 ARNIQA [1]，首先通过自监督学习在未标记的数据上训练图像编码器，然后使用 MOS 训练线性回归。然而，标注 IQA 数据集非常昂贵且资源密集，因为每个图像都需要多个人工评估以确保 MOS 的可靠性。例如，包含 40K 现实世界图像的 FLIVE 数据集[39]需要约 4M 的评分，单个图像最多需要达到 50 次评分。对人工标注的需求显著阻碍了意见感知方法的可扩展性。此外，这些方法显示出有限的泛化能力，因此在交叉数据集设置下，即考虑与训练数据集不同的测试数据集时，其性能显著下降，从而限制了其在现实世界场景中的应用。为了消除对标记 MOS 的需求，已经提出了几种意见未知的方法[20, 4, 8]。例如，CL-MI [4]引入了一种两阶段自监督方法，针对合成和真实劣化的图像采用两种不同的训练策略。然而，现有的意见不敏感方法在跨数据集实验中比监督方法表现差得多，因此显示出有限的适用性。

In this context, we propose to take advantage of recent advancements in vision-language models by presenting a self-supervised opinion-unaware approach based on CLIP [26]. Recently, CLIP-based methods achieved promising performance in the NR-IQA task [36, 42, 32]. For example, CLIP-IQA [36] proposes to compute the quality score by measuring the similarity between the image and the two quality-related antonym prompts without any task-specific training. However, CLIP struggles to generate quality-aware image representations [42, 13], as it focuses more on the high-level semantic information rather than on the low-level characteristics of the images. To support this claim, we randomly sample 1000 images from the KonIQ [10] dataset and synthetically degrade them with several distortions using increasing levels of intensity. Then, we compute the quality score of each image through CLIP-IQA and average the results. We expect the more degraded versions of the images to correspond to lower quality scores. However, Figure 1 shows that CLIP-IQA demonstrates a low correlation between the predicted quality and the degree of the distortion. Therefore, CLIP proves not to be intrinsically quality-aware.
在这个背景下，我们提出了一种基于 CLIP [26]的自监督无意见方法，以利用视觉语言模型的最新进展。最近，基于 CLIP 的方法在 NR-IQA 任务中取得了令人满意的表现 [36, 42, 32]。例如，CLIP-IQA [36]提出可以通过测量图像与两个质量相关反义词提示之间的相似性来计算质量分数，而无需特定任务训练。然而，CLIP 在生成质量感知的图像表示方面存在困难 [42, 13]，因为它更多地关注高层语义信息，而不是图像的低层特征。为了支持这一说法，我们随机从 KonIQ [10]数据集中抽取 1000 张图像，并用几种失真以不断增加的强度进行合成降质处理。然后，通过 CLIP-IQA 计算每张图像的质量分数并对结果进行平均。我们预期图像的降质版本应该对应较低的质量分数。然而，图 1 显示，CLIP-IQA 在预测质量与失真程度之间表现出较低的相关性。因此，CLIP 被证明本质上并没有质量意识。

To address this issue, we propose a quality-aware image-text alignment strategy to fine-tune the CLIP image encoder so that it generates representations that correlate with the inherent quality of the images. We start by synthetically degrading pairs of pristine images using increasing levels of intensity. Then, we measure the similarity between each image and antonym prompts referring to image quality, such as “Good photo” and “Bad photo”. Finally, we employ a training strategy based on a margin ranking loss [13, 16, 7] that allows us to achieve two objectives. First, we want CLIP to generate similar representations for images of comparable quality, i.e. exhibiting the same amount of distortion. Second, the similarity between each of the antonym prompts and the increasingly degraded versions of the images must correlate – in opposite directions – with the intensity of the distortion. Our approach, named QualiCLIP (Quality-aware CLIP), is both self-supervised and opinion-unaware, as we do not rely on any form of supervision – especially MOS – at any step of the training process. Thanks to our training strategy, the image-text alignment in the CLIP embedding space focuses on the low-level image characteristics rather than the semantics. Consequently, QualiCLIP generates quality-aware representations that correlate with the amount of degradation exhibited from the images, as shown in Fig. 1. The experiments show that the proposed approach obtains significant performance improvements – up to a 20% gain – over other state-of-the-art opinion-unaware methods on several datasets with authentic distortions. In addition, QualiCLIP outperforms supervised techniques in the cross-dataset setting, exhibiting more suitability for real-world applications. Moreover, the gMAD [21] competition and visualization with gradCAM [29] show that QualiCLIP demonstrates both greater robustness and enhanced explainability than competing methods. We summarize our contributions as follows:
为了解决这个问题，我们提出了一种质量感知的图像文本对齐策略，以微调 CLIP 图像编码器，使其生成与图像固有质量相关的表示。我们首先通过增加强度级别来合成退化一对无暇图像。然后，我们测量每张图像与指代图像质量的反义词提示之间的相似性，例如“好照片”和“坏照片”。最后，我们采用一种基于边际排名损失[13, 16, 7]的训练策略，使我们能够实现两个目标。首先，我们希望 CLIP 为质量相当的图像生成相似表示，即显示相同量的失真。其次，反义词提示与图像逐渐退化版本之间的相似性必须与失真强度反向相关。我们的方法，名为 QualiCLIP（质量感知 CLIP），既是自我监督的，也是观点无关的，因为我们在训练过程的任何步骤中都不依赖任何形式的监督——特别是 MOS。得益于我们的训练策略，CLIP 嵌入空间中的图文对齐关注的是图像的低级特征，而不是语义。因此，QualiCLIP 生成的质量感知表征与图像所展示的退化程度相关，如图 1 所示。实验表明，与其他最先进的观点无关的方法相比，在多个具有真实失真的数据集上，提出的方法取得了显著的性能提升——高达 20%的增益。此外，QualiCLIP 在跨数据集的设置中表现优于监督技术，显示出更适合实际应用。此外，gMAD 竞争和使用 gradCAM 的可视化显示，QualiCLIP 相比竞争方法展示了更强的稳健性和更好的可解释性。我们总结我们的贡献如下：

•

We propose QualiCLIP, a CLIP-based self-supervised opinion-unaware approach for NR-IQA that does not require any supervision, especially MOS;
我们提出了 QualiCLIP，这是一种基于 CLIP 的自监督意见无关的方法，用于无参考图像质量评估，不需要任何监督，尤其是主观意见分数（MOS）；
•

We introduce a quality-aware image-text alignment strategy based on ranking increasingly degraded pairs of images according to their similarity to quality-related antonym prompts. After training, CLIP generates image representations that correlate with their intrinsic quality;
我们提出了一种质量感知的图像-文本对齐策略，根据图像与质量相关的反义词提示的相似性，对越来越差的图像对进行排序。在训练之后，CLIP 生成的图像表示与其内在质量相关联；
•

Our method obtains significantly better results than other opinion-unaware approaches and even outperforms supervised techniques in cross-dataset experiments, proving to be more suitable for real-world scenarios. Moreover, QualiCLIP exhibits greater robustness and improved explainability than competing methods.
我们的方法在跨数据集实验中取得了显著优于其他无观点的方法的结果，甚至超过了监督技术，证明更适合实际应用场景。此外，QualiCLIP 表现出比竞争方法更强的鲁棒性和更好的可解释性。

2 Related Work 2 相关工作

No-Reference Image Quality Assessment Due to its wide range of applications in real-world scenarios, in recent years research on NR-IQA has gained significant momentum [33, 7, 23, 1, 4]. Several methods achieved promising performance by relying on supervised learning [33, 7, 23, 1]. Some approaches directly employ the labeled MOS during model training [33, 23, 41]. For example, HyperIQA [33] proposes a self-adaptive hypernetwork that separates content understanding from quality prediction. Another research direction involves a self-supervised pre-training of an encoder on unlabeled images, followed by the training of a linear regressor using the annotated MOS [23, 28, 1]. For instance, ARNIQA [1] pre-trains the encoder by maximizing the similarity between different images degraded in the same way. Supervised methods require expensive labeled MOS, either for training the encoder or the regressor. This requirement is removed by opinion-unaware approaches [24, 40, 20, 4, 31]. Some of them, such as NIQE [24], are based on natural scene statistics [24, 40], while others employ self-supervised learning [20, 4, 31, 8]. For example, CL-MI [4] pre-trains an encoder on synthetic data and then fine-tunes it on authentic images via a mutual information-based loss. In this work, we propose a self-supervised opinion-unaware approach that relies solely on CLIP and achieves state-of-the-art results on several datasets with authentic distortions.
无参考图像质量评估由于其在现实场景中广泛的应用，近年来对 NR-IQA 的研究取得了显著进展。有些方法通过依赖监督学习实现了令人满意的性能。有些方法在模型训练期间直接采用标记的 MOS。例如，HyperIQA 提出了一种自适应超网络，该网络将内容理解与质量预测分开。另一个研究方向涉及在未标记图像上对编码器进行自监督预训练，然后使用标注的 MOS 训练线性回归器。例如，ARNIQA 通过最大化以相同方式被降质的不同图像之间的相似性来预训练编码器。监督方法需要昂贵的标记 MOS，无论是用于训练编码器还是回归器。这一要求通过意见无关的方法得以消除。其中一些方法，如 NIQE，基于自然场景统计，而其他方法采用自监督学习。例如，CL-MI [4] 通过在合成数据上预训练编码器，然后在真实图像上通过基于互信息的损失进行微调。在这项工作中，我们提出了一种自监督的、不依赖意见的方法，该方法仅依赖于 CLIP，并在多个具有真实失真的数据集上实现了最先进的结果。

CLIP for NR-IQA CLIP has achieved impressive performance in several low-level vision tasks, such as image and video restoration [13, 18, 2] and quality assessment [36, 42, 32, 38, 37]. CLIP-IQA [36] is the first work that studied the capabilities of CLIP in assessing the quality and abstract perception of images without task-specific training. In addition, the authors train a model named CLIP-IQA+ based on learning two antonym prompts using labeled MOS. On the contrary, LIQE [42] proposes a multi-task learning approach that fine-tunes CLIP on multiple IQA datasets at once in a supervised way. The most similar to our work is the concurrent GRepQ [32], a self-supervised method based on a low-level encoder and a high-level CLIP-based one. CLIP is fine-tuned by separating higher and lower-quality groups of images within the same batch with a contrastive loss depending on their predicted quality, obtained by measuring their similarity to antonym text prompts. GRepQ predicts the final quality score by combining the features of the two encoders and feeding them as input to a linear regressor, which is trained on IQA datasets using the labeled MOS. In contrast, we present a CLIP-only self-supervised approach that removes the need for a low-level encoder. We propose to synthetically degrade pairs of images with increasing levels of intensity and make our model learn to rank them through a ranking loss according to their degree of distortion. The ranking is based directly on the similarity between the text features and each of the antonym prompts, instead of relying on the predicted quality as in GRepQ. Also, differently from GRepQ, we do not require any form of supervision at any step of our approach.
CLIP 在无参考图像质量评估中的表现 CLIP 在多个低级视觉任务中取得了令人印象深刻的表现，例如图像和视频修复 [13, 18, 2] 以及质量评估 [36, 42, 32, 38, 37]。CLIP-IQA [36] 是第一个研究 CLIP 在不进行任务特定训练的情况下评估图像质量和抽象感知能力的工作。此外，作者基于标注的 MOS 训练了一个名为 CLIP-IQA+的模型，学习了两个反义词提示。相反，LIQE [42] 提出了一个多任务学习方法，以监督方式一次微调 CLIP 在多个 IQA 数据集上。与我们的工作最相似的是并行的 GRepQ [32]，这是一种基于低级编码器和高级 CLIP 编码器的自监督方法。通过在同一批次内将高质量和低质量的图像分组进行分离，使用对比损失根据其预测质量进行微调，预测质量是通过测量其与反义文本提示的相似性获得的。GRepQ 通过组合两个编码器的特征，并将其作为线性回归器的输入来预测最终质量得分，该线性回归器在标注的 MOS 上进行训练。相反，我们提出了一种仅使用 CLIP 的自监督方法，去除了对低级编码器的需求。我们建议合成地退化图像对，使其具有不断增加的强度级别，并使我们的模型通过排序损失根据扭曲程度学习排序。排序是直接基于文本特征与每个反义词提示之间的相似性，而不是像 GRepQ 一样依赖于预测质量。此外，与 GRepQ 不同，我们的方法在任何步骤中都不需要任何形式的监督。

Learning to Rank Learning to rank images has proven to be an effective technique for image quality and aesthetics assessment [16, 7, 11, 35, 20, 27]. RankIQA [16] proposes to pre-train a Siamese network in an unsupervised way by directly ranking the quality of increasingly degraded images. Then, the model is fine-tuned on IQA datasets with an MSE loss using the ground-truth MOS. VILA [11] tackles image aesthetics assessment by fine-tuning CLIP using image-comment pairs. Then, the authors train a residual projection by learning to rank the quality – expressed as the similarity between the image and a single prompt – of a single pair of images, according to their labeled MOS. In contrast, we design an approach that does not require any form of supervision during training. Indeed, we fine-tune the CLIP image encoder by learning to rank the similarity between two antonym prompts and multiple increasingly degraded pairs of images, according to the severity of their distortion. At the same time, we force our model to generate consistent representations for images exhibiting the same level of degradation.
学习排序学习对图像进行排序已被证明是一种有效的图像质量和美学评估技术 [16, 7, 11, 35, 20, 27]。RankIQA [16] 提出通过直接排序质量不断退化的图像，并以无监督方式对 Siamese 网络进行预训练。然后，模型使用真实的 MOS 在 IQA 数据集上用 MSE 损失进行微调。VILA [11]通过使用图像评论对微调 CLIP 来处理图像美学评估问题。随后，作者通过学习排序质量——通过图像和单一提示之间的相似性来表达——根据其标注的 MOS，训练一个残差投影以评估单对图像的质量。相反，我们设计了一种在训练过程中不需要任何形式的监督的方法。实际上，我们通过学习根据其失真严重程度排序两个反义词提示和多个不断退化的图像对之间的相似性来微调 CLIP 图像编码器。同时，我们强制模型为展示相同退化水平的图像生成一致的表示。

3 Proposed Approach 3 提出的方法

We propose a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the intrinsic quality of the images. First, we synthetically degrade pairs of crops with increasing levels of intensity. Then, we fine-tune the CLIP image encoder by ranking the similarity between two antonym prompts and the increasingly distorted image pairs, based on their degree of degradation, while guaranteeing consistent representations for images with comparable quality. We keep the CLIP text encoder fixed. We do not employ any supervision – particularly MOS – at any step of the training process.
我们提出了一种质量感知的图像-文本对齐策略，使 CLIP 生成与图像内在质量相关的表示。首先，我们对成对裁剪的图像进行合成退化，逐步增加强度。然后，我们通过根据其退化程度对两个反义词提示语和日益失真图像对之间的相似性进行排序，微调 CLIP 的图像编码器，同时保证具有可比质量的图像的一致表示。我们保持 CLIP 文本编码器不变。在训练过程的任何阶段，我们都不采用任何监督信息，特别是不使用 MOS。

3.1 CLIP preliminaries 3.1 CLIP 初步知识

CLIP (Contrastive Language-Image Pre-training (CLIP) [26]) is a vision-language model trained on a large-scale dataset with a contrastive loss to semantically align images and corresponding text captions in a shared embedding space. CLIP comprises an image encoder $\psi_{I}$ and a text encoder $\psi_{T}$ . Given an image $I$ , the image encoder extracts its feature representation $x=\psi_{I}(I)\in\mathbb{R}^{d}$ , where $d$ is the CLIP embedding space dimension. For a given text caption $T$ , each tokenized word is mapped to the token embedding space $\mathcal{W}$ through a word embedding layer $E_{w}$ . Then, the text encoder $\psi_{T}$ is employed to generate the textual feature representation $y=\psi_{T}(E_{w}(T))\in\mathbb{R}^{d}$ from the token embeddings. Thanks to its training strategy, CLIP generates similar representations within the common embedding space for images and text expressing the same concepts.
CLIP（对比语言-图像预训练（CLIP）[26]）是一种视觉语言模型，通过对比损失在大规模数据集上进行训练，以在共享嵌入空间中语义对齐图像和相应的文本说明。CLIP 由一个图像编码器 $\psi_{I}$ 和一个文本编码器 $\psi_{T}$ 组成。给定图像 $I$ ，图像编码器提取其特征表示 $x=\psi_{I}(I)\in\mathbb{R}^{d}$ ，其中 $d$ 是 CLIP 嵌入空间的维度。对于给定的文本说明 $T$ ，每个标记化的单词通过词嵌入层 $E_{w}$ 被映射到标记嵌入空间 $\mathcal{W}$ 。然后，文本编码器 $\psi_{T}$ 用于从标记嵌入中生成文本特征表示 $y=\psi_{T}(E_{w}(T))\in\mathbb{R}^{d}$ 。由于其训练策略，CLIP 在共同嵌入空间中为表达相同概念的图像和文本生成相似的表示。

3.2 Synthetic Degradation with Increasing Levels of Intensity
3.2 随着强度等级提高的合成退化

To make our approach self-supervised, we propose to synthetically degrade unlabeled pristine images using progressively higher levels of intensity. In this way, we can train our model to rank the different versions of each image according to the severity of their degradation. Following [1], we consider 24 distinct degradation types spanning the 7 distortion groups $\mathcal{G}\!=\!\{\mathcal{G}_{1},\ldots,\mathcal{G}_{7}\}$ defined by the KADID [15] dataset. These groups are: 1) Brightness change; 2) Blur; 3) Spatial distortions; 4) Noise; 5) Color distortions; 6) Compression; 7) Sharpness & contrast. Each distortion has $L\!=\!5$ levels of intensity. Figure 2 shows some examples of degraded images for varying degrees of intensity. See the supplementary material for more details on the specific degradation types. Each distortion group is defined as $\mathcal{G}_{i}\!=\!\{\ldots,F^{ij},\ldots\}$ , where $i\!\in\!\{1,\ldots,7\}$ indicates the index of the distortion group within $\mathcal{G}$ and $j\!\in\!\{1,\ldots,\left|G_{i}\right|\}$ refers to the index of the degradation type within $\mathcal{G}_{i}$ , with $\left|\cdot\right|$ that represents the cardinality.
为了使我们的方法实现自监督，我们建议使用逐渐提高的强度来合成退化未标记的原始图像。通过这种方式，我们可以训练模型根据退化的严重程度对每张图像的不同版本进行排名。根据[1]，我们考虑了 24 种不同的退化类型，涵盖了 KADID[15]数据集中定义的 7 个失真组 $\mathcal{G}\!=\!\{\mathcal{G}_{1},\ldots,\mathcal{G}_{7}\}$ 。这些组包括：1）亮度变化；2）模糊；3）空间畸变；4）噪声；5）颜色失真；6）压缩；7）锐度和对比度。每种失真有 $L\!=\!5$ 强度级别。图 2 展示了某些强度不同的退化图像示例。有关具体退化类型的更多细节，请参阅补充材料。每个失真组定义为 $\mathcal{G}_{i}\!=\!\{\ldots,F^{ij},\ldots\}$ ，其中 $i\!\in\!\{1,\ldots,7\}$ 表示失真组在 $\mathcal{G}$ 中的索引， $j\!\in\!\{1,\ldots,\left|G_{i}\right|\}$ 指退化类型在 $\mathcal{G}_{i}$ 中的索引， $\left|\cdot\right|$ 则代表总数。

Given a training image, we start by extracting a pair of random overlapping crops. Then, we randomly sample $D\!=\!1$ distortion groups and a degradation within each group. We apply the $D$ distortions to both crops using $L\!=\!5$ distinct levels of intensity, resulting in $L$ pairs of equally degraded crops, one for each level. Contrary to [16], we obtain two images for each degree of distortion. When considering two such pairs of crops, we can infer which has a higher quality, based on the corresponding level of degradation. We leverage this information to train our model with a ranking loss.
给定一张训练图像，我们首先提取一对随机重叠的裁剪图。然后，我们随机采样 $D\!=\!1$ 个失真组和每组内的一个退化。我们使用 $D$ 种失真对两个裁剪图应用 $L\!=\!5$ 个不同强度级别，得到 $L$ 对同等退化的裁剪图，每个级别一对。与[16]相反，我们为每个失真程度获得两幅图像。当考虑这两对裁剪图时，我们可以根据相应的退化程度推断哪一个具有更高的质量。我们利用这些信息，通过排名损失来训练我们的模型。

3.3 Quality-Aware Image-Text Alignment
3.3 质量感知的图像-文本对齐

As Fig. 1 shows, CLIP struggles to generate accurate quality-aware image representations that correlate with the severity of the degradation. To address this issue, we propose a quality-aware image-text alignment strategy to fine-tune the CLIP image encoder. The idea of our approach is that given two degraded versions of the same image, a prompt referring to high image quality – such as “Good photo” – should be more similar to the less degraded version. The opposite consideration applies when considering a prompt referring to low image quality, such as “Bad photo”. At the same time, two images with overlapping content and equal degree of degradation should have comparable similarities to such a pair of quality-related prompts, which we refer to as antonym prompts [36]. Note that, given two random images with completely different content, we can not make any assumptions about their relative quality, or, in other words, their similarity to the prompts. Our training strategy leverages multiple pairs of increasingly degraded images to achieve two objectives: O1): we want CLIP to generate consistent representations for images of similar quality, i.e. showing the same amount of distortion; O2): the similarity between each of the antonym prompts and the distinct versions of the images must correlate – in opposite directions – with the corresponding level of degradation.
如图 1 所示，CLIP 在生成与退化程度相关的质量意识图像表示方面存在困难。为了解决这个问题，我们提出了一种质量意识的图像-文本对齐策略来微调 CLIP 图像编码器。我们方法的想法是，给定同一图像的两个退化版本，提到高图像质量的提示——例如“好照片”——应该与退化程度较轻的版本更相似。在考虑与低图像质量相关的提示时，如“坏照片”，则应该考虑相反的情况。同时，具有重叠内容且退化程度相同的两张图片应该与这种一对质量相关的提示具有可比的相似性，我们将这些提示称为反义提示【36】。请注意，考虑到两张内容完全不同的随机图片，我们无法对它们的相对质量做任何假设，或者换句话说，它们与提示的相似性。我们的训练策略利用多组不断退化的图像对来实现两个目标：O1）：我们希望 CLIP 为质量相似的图像生成一致的表示，即显示相同数量的失真；O2）：每个反义词提示与不同版本图像之间的相似性必须与相应的退化程度在相反方向上相关。

Let $I_{i}^{1}$ and $I_{i}^{2}$ be the $i$ -th pair of increasingly degraded crops obtained as detailed in Sec. 3.2, where $i\in\{1,\ldots,L\}$ and $L\!=\!5$ is the number of considered distortion levels. For $i,j\in\{1,\ldots,L\}$ with $j>i$ , the $j$ -th pair of crops is more degraded than the $i$ -th one. Given each pair of crops, we extract the corresponding features through the CLIP image encoder $\psi_{I}$ , resulting in $x_{i}^{1}=\psi_{I}(I_{i}^{1})$ and $x_{i}^{2}=\psi_{I}(I_{i}^{2})$ . Similarly to [36], we remove the positional embedding to relax the CLIP’s requirement of fixed-size inputs. Let $T_{p}$ and $T_{n}$ be a pair of antonym prompts related to image quality, such as “Good photo” and “Bad photo”. We refer to $T_{p}$ and $T_{n}$ as “positive” and “negative” prompts, respectively. We employ the CLIP text encoder $\psi_{T}$ to extract the text features associated with the prompts, obtaining $t_{p}=\psi_{T}(T_{p})$ and $t_{n}=\psi_{T}(T_{n})$ . We normalize both the image and text features to have a unit $L_{2}$ -norm.
设 $I_{i}^{1}$ 和 $I_{i}^{2}$ 为第 $i$ 对按逐渐退化的裁剪图像对，在 Sec. 3.2 中详细介绍，其中 $i\in\{1,\ldots,L\}$ 和 $L\!=\!5$ 是所考虑的畸变级别数。对于 $i,j\in\{1,\ldots,L\}$ 和 $j>i$ ，第 $j$ 对裁剪图像比第 $i$ 对更加退化。给定每对裁剪图像，我们通过 CLIP 图像编码器 $\psi_{I}$ 提取相应特征，得到 $x_{i}^{1}=\psi_{I}(I_{i}^{1})$ 和 $x_{i}^{2}=\psi_{I}(I_{i}^{2})$ 。类似于 [36]，我们移除位置嵌入以放松 CLIP 对固定尺寸输入的要求。设 $T_{p}$ 和 $T_{n}$ 为与图像质量相关的反义提示语对，例如“好照片”和“坏照片”。我们分别将 $T_{p}$ 和 $T_{n}$ 称为“正”提示和“负”提示。我们使用 CLIP 文本编码器 $\psi_{T}$ 提取与提示相关的文本特征，得到 $t_{p}=\psi_{T}(T_{p})$ 和 $t_{n}=\psi_{T}(T_{n})$ 。我们将图像和文本特征都归一化以具有单位 $L_{2}$ 范数。

To achieve objective O1, we propose to employ a consistency loss term to guarantee that the similarity between the features of the prompts and those of each of the two images composing each degraded pair is comparable. We assume that two overlapping crops extracted from the same image have a comparable quality [28, 43]. We rely on a margin ranking loss [13, 16, 7] defined as:
为了实现目标 O1，我们建议使用一个一致性损失项来保证提示的特征与构成每个降级对的两个图像的特征之间的相似性是可比的。我们假设从同一图像中提取的两个重叠裁剪具有可比的品质[28, 43]。我们依靠一个边缘排序损失[13, 16, 7]，定义为：

$\begin{aligned} \mathcal{L}_{cons}&=\sum_{i=1}^{L=5}\left[\max(0,\left|c(x_{i}^{1},t_{p})-c(x_{i}^{2},t_{p})\right|-m_{cons})+\max(0,\left|c(x_{i}^{1},t_{n})-c(x_{i}^{2},t_{n})\right|-m_{cons})\right]\end{aligned}$

(1)

where $c(\cdot)$ stands for the cosine similarity and the margin $m_{cons}$ is a hyperparameter. Intuitively, $m_{cons}$ must be close to 0 to force the similarities between the prompts and each of the two crops to be comparable.
其中 $c(\cdot)$ 表示余弦相似度，边距 $m_{cons}$ 是一个超参数。直观上， $m_{cons}$ 必须接近 0，以迫使提示与两个裁剪之间的相似度具有可比性。

Given the $i$ -th level of synthetic degradation, with $i\in\{1,\ldots,L\}$ , we assume that the quality of the two distorted crops of the $i$ -th pair is higher than that of the two images composing the $(i+1)$ -th one [16, 27]. Thus, we enforce that the similarity between the features of the positive prompt and those of two crops is higher than when considering more degraded versions of the two crops. Specifically, we define a margin ranking loss as:
考虑到第 $i$ 级的合成退化程度，给定 $i\in\{1,\ldots,L\}$ ，我们假设第 $i$ 对中的两个失真裁剪图像的质量高于组成第 $(i+1)$ 对的两个图像的质量 [ 16, 27 ]。因此，我们规定正向提示的特征与两个裁剪图像的特征之间的相似性要高于考虑两个裁剪图像更加退化版本的情况。具体地，我们定义一个边距排序损失如下：

$\begin{aligned} \mathcal{L}_{pos}&=\sum_{i=1}^{L=5}\sum_{j=i+1}^{L=5}\sum_{k=1}^{2}\left[\max(0,c(x_{j}^{k},t_{p})-c(x_{i}^{1},t_{p})+m_{rank})+\max(0,c(x_{j}^{k},t_{p})-c(x_{i}^{2},t_{p})+m_{rank})\right]\end{aligned}$

(2)

where $c(\cdot)$ represents the cosine similarity and the margin $m_{rank}$ is a hyperparameter. The opposite of the consideration made above applies when we take into account the negative prompt. Therefore, we add a loss term to impose that the similarity between the features of the negative prompt and those of two crops is lower than when considering more degraded versions of the two crops:
其中 $c(\cdot)$ 表示余弦相似度， $m_{rank}$ 是一个超参数。当我们考虑负提示时，上述考虑的相反情况适用。因此，我们添加一个损失项，以强制负提示的特征与两个裁剪的特征之间的相似度低于考虑两个裁剪的更多降级版本时的情况：

$\begin{aligned} \mathcal{L}_{neg}&=\sum_{i=1}^{L=5}\sum_{j=i+1}^{L=5}\sum_{k=1}^{2}\left[\max(0,c(x_{i}^{1},t_{n})-c(x_{j}^{k},t_{n})+m_{rank})+\max(0,c(x_{i}^{2},t_{n})-c(x_{j}^{k},t_{n})+m_{rank})\right]\end{aligned}$

(3)

with $c(\cdot)$ and $m_{rank}$ defined as above. Intuitively, we need to set $m_{rank}\!>>\!0$ as we aim for a noticeable difference between the similarities of the prompts and the increasingly degraded versions of the two crops. The combination of $\mathcal{L}_{pos}$ and $\mathcal{L}_{neg}$ allows us to achieve objective O2.
如上所述定义了 $c(\cdot)$ 和 $m_{rank}$ 。直观而言，我们需要设定 $m_{rank}\!>>\!0$ ，因为我们希望在提示的相似性与两个剪裁的逐渐退化版本之间产生显著差异。 $\mathcal{L}_{pos}$ 和 $\mathcal{L}_{neg}$ 的结合使我们能够实现目标 O2。

The final training loss used to fine-tune the CLIP image encoder is given by:
最后用于微调 CLIP 图像编码器的训练损失为：

\mathcal{L}=\lambda_{cons}\mathcal{L}_{cons}+\lambda_{pos}\mathcal{L}_{pos}+\lambda_{neg}\mathcal{L}_{neg}

(4)

where $\lambda_{cons}$ , $\lambda_{pos}$ , and $\lambda_{neg}$ represent the loss weights. Figure 3 shows an overview of our training strategy. Given that we do not employ any labeled MOS, our approach is both self-supervised and opinion-unaware. Thanks to the proposed training strategy, CLIP learns an image-text alignment not based on high-level semantics, but rather on low-level image characteristics, such as noise and blur. As a result, QualiCLIP generates representations that correlate with the intrinsic quality of the images, as shown in Fig. 1.
其中 $\lambda_{cons}$ 、 $\lambda_{pos}$ 和 $\lambda_{neg}$ 代表损失权重。图 3 展示了我们训练策略的概述。由于我们没有使用任何标记的 MOS，我们的方法既是自监督的也是意见无关的。得益于提出的训练策略，CLIP 学习到了不基于高级语义，而是基于低级图像特征（如噪声和模糊）的图像文本对齐。因此，QualiCLIP 生成的表示与图像的内在质量相关，如图 1 所示。

At inference time, given an image $I$ , we extract its features $x$ using the CLIP image encoder. Then, we compute the cosine similarity between $x$ and the features $t_{p}$ and $t_{n}$ of the antonym prompts, resulting in $s_{p}$ and $s_{n}$ . Finally, similar to [36], we obtain the final quality score $q\!\in\![0,1]$ by using the softmax:
在推理时，给定图像 $I$ ，我们使用 CLIP 图像编码器提取其特征 $x$ 。然后，我们计算 $x$ 与反义提示的特征 $t_{p}$ 和 $t_{n}$ 之间的余弦相似度，得到 $s_{p}$ 和 $s_{n}$ 。最后，类似于[36]，我们使用 softmax 获得最终的质量分数 $q\!\in\![0,1]$ 。

q=\frac{e^{s_{p}}}{e^{s_{p}}+e^{s_{n}}}

(5)

Figure 4 provides an overview of the final quality score computation. Note that, since we keep the CLIP text encoder weights frozen, we need to compute the text features of the antonym prompts just once, and we can use them both for training and inference. Therefore, at inference time the computational cost of our method is the same as an image-encoder-only model with the same backbone.
图 4 概述了最终质量评分的计算。请注意，由于我们保持 CLIP 文本编码器的权重不变，因此我们只需计算一次反义词提示的文本特征，并能在训练和推理中使用它们。因此，在推理阶段，我们的方法与使用相同骨干的仅图像编码器模型具有相同的计算成本。

We recall that the authors of GRepQ [32] use a strategy similar to the one shown in Fig. 4 to predict the quality of all the images comprising a training batch. Then, they divide the images into a high-quality and a low-quality group. Finally, they fine-tune the CLIP image encoder with a contrastive loss by maximizing the intra-group similarity and minimizing the inter-group one. In contrast, we consider multiple pairs of increasingly synthetically degraded crops. We rely on $\mathcal{L}_{cons}$ to fine-tune the CLIP image encoder so that it generates consistent representations for images exhibiting the same degree of distortion. At the same time, we use $\mathcal{L}_{pos}$ and $\mathcal{L}_{neg}$ to train our model to rank the similarity between two antonym prompts and the images, according to their level of degradation. Thanks to our training strategy, CLIP yields more accurate quality-aware image representations.
我们回顾一下，GRepQ [32] 的作者使用了一种类似于图 4 所示的策略来预测训练批次中所有图像的质量。然后，他们将图像分为高质量和低质量组。最后，他们通过最大化组内相似性和最小化组间相似性来用对比损失微调 CLIP 图像编码器。相比之下，我们考虑多个对比度逐渐合成退化的图块。我们依赖 $\mathcal{L}_{cons}$ 来微调 CLIP 图像编码器，以便为展现相同程度失真的图像生成一致的表现。同时，我们使用 $\mathcal{L}_{pos}$ 和 $\mathcal{L}_{neg}$ 训练我们的模型，根据图像的退化水平排列两个反义提示之间以及图像之间的相似性。得益于我们的训练策略，CLIP 能够产生更准确的质量感知图像表现。

4 Experimental Results 4 实验结果

4.1 Datasets 4.1 数据集

We train our model using the 140K pristine images of the KADIS dataset [15]. Given a training image, we synthetically degrade it by employing the strategy detailed in Sec. 3.2. We validate and test the proposed approach on IQA datasets with synthetic and authentic distortions, respectively. These datasets contain a set of degraded images annotated with human judgments of picture quality in the form of a Mean Opinion Score (MOS). For validation, we consider two synthetically degraded datasets: LIVE [30] and TID2013 [25]. LIVE stems from 29 reference images, each distorted with 5 types of degradation at 5 levels of intensity, resulting in 779 images. Conversely, TID2013 contains 3000 images degraded with 24 distinct distortions at 5 levels of intensity, with 25 reference images as the base. For testing, we consider four datasets with authentic distortions: KonIQ [10], CLIVE [6], FLIVE [39], and SPAQ [5]. KonIQ contains 10K images sampled from the YFCC100M [34] database. CLIVE consists of 1162 images captured with a wide range of mobile devices. FLIVE is the largest existing dataset for NR-IQA and is composed of about 40K real-world images. SPAQ comprises 11K high-resolution photos taken with several smartphones. Following [5], we resize the SPAQ images so that the shorter side is 512.
我们使用 KADIS 数据集[15]中的 140K 原始图像训练我们的模型。给定一个训练图像，我们通过使用第 3.2 节中详细介绍的策略对其进行合成降级。我们在分别具有合成和真实失真的 IQA 数据集上验证和测试所提议的方法。这些数据集包含一组降级图像，并附有人类对图像质量的判断，以平均意见得分（MOS）的形式进行标注。对于验证，我们考虑两个合成降级数据集：LIVE[30]和 TID2013[25]。LIVE 来自 29 个参考图像，每个图像使用 5 种类型的降级在 5 个强度级别上进行处理，共得到 779 张图像。相反，TID2013 包含 3000 张图像，经过 24 种不同失真在 5 个强度级别下降级，以 25 张参考图像为基础。对于测试，我们考虑四个真实失真数据集：KonIQ[10]、CLIVE[6]、FLIVE[39]和 SPAQ[5]。KonIQ 包含从 YFCC100M[34]数据库中采样的 10K 图像。CLIVE 由 1162 张使用各种移动设备拍摄的图像组成。FLIVE 是现有的最大 NR-IQA 数据集，由约 40K 张真实世界图像组成。 SPAQ 包含 11K 张用多款智能手机拍摄的高分辨率照片。根据[ 5]，我们将 SPAQ 图像的短边调整为 512。

Table 1: Comparison between QualiCLIP and competing opinion-unaware methods on datasets with authentic distortions. Best and second-best scores are highlighted in bold and underlined, respectively. Relative gains over the best-performing baseline are indicated in green. OU indicates Opinion-Unaware version as explained in Sec. 4.3.
表格 1：在具有真实失真数据集上，QualiCLIP 与竞争的意见无关方法的比较。最佳和次最佳分数分别用黑体和下划线标出。相对于表现最佳的基线的相对增益以绿色表示。OU 表示意见无关版本，如第 4.3 节所述。

	KonIQ		CLIVE		FLIVE		SPAQ		Average 平均
Method 方法	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC
NIQE [24] NIQE [24]	0.526	0.534	0.450	0.493	0.158	0.221	0.703	0.712	0.459	0.490
IL-NIQE [40] IL-NIQE [40]	0.493	0.519	0.438	0.503	0.165	0.209	0.710	0.717	0.452	0.487
CONTRIQUE-OU [23] CONTRIQUE-OU [23]	0.637	0.630	0.394	0.422	0.199	0.228	0.676	0.680	0.477	0.490
Re-IQA-OU [28]	0.558	0.550	0.418	0.444	0.218	0.238	0.616	0.618	0.453	0.463
ARNIQA-OU [1] ARNIQA-OU [1]	0.741	0.760	0.484	0.558	0.299	0.362	0.789	0.797	0.578	0.619
CL-MI [4] CL-MI [4]	0.645	0.645	0.507	0.525	0.257	0.293	0.701	0.702	0.528	0.541
CLIP-IQA [36] CLIP-IQA [36]	0.699	0.733	0.611	0.593	0.287	0.349	0.733	0.728	0.583	0.601
GRepQ-OU [32] GRepQ-OU [32]	0.768	0.788	0.740	0.769	0.327	0.438	0.805	0.809	0.660	0.701
QualiCLIP	0.815	0.837	0.753	0.790	0.393	0.496	0.843	0.855	0.701	0.745
	+6.1%	+6.2%	+1.8%	+2.7%	+20.2%	+13.2%	+4.7%	+5.7%	+6.2%	+6.2%

4.2 Implementation Details
4.2 实现细节

We rely on a ResNet50 [9] as the backbone for CLIP. Similar to [36], we remove the positional embedding from the encoder to allow our model to take images of any resolution as input. The dimension $d$ of the CLIP embedding space is 1024. Differently from [32], we do not train a projector head on top of the CLIP image encoder. We keep the CLIP text encoder frozen. Similar to [38, 37], we employ multiple pairs of antonym prompts. We train our model for 3 epochs using an AdamW [17] optimizer with a weight decay and a learning rate of $1e\!-\!2$ and $1e\!-\!9$ , respectively. During training, we employ a patch size of 224 and a batch size of 16. We set the margins $m_{cons}$ in Eq. 1 and $m_{rank}$ in Eqs. 2 and 3 to $2.5e\!-\!3$ and $6.75e\!-\!2$ , respectively. The loss weights $\lambda_{cons}$ , $\lambda_{pos}$ and $\lambda_{neg}$ in Eq. 4 are all equal to 1. At inference time, our model takes the whole image as input and outputs a single quality score.
我们依赖于 ResNet50 [9]作为 CLIP 的主干网络。与[36]类似，我们从编码器中移除位置嵌入，以允许我们的模型接受任何分辨率的图像作为输入。CLIP 嵌入空间的维度 $d$ 为 1024。不同于[32]，我们没有在 CLIP 图像编码器之上训练投射头。我们保持 CLIP 文本编码器冻结。类似于[38, 37]，我们使用多对反义词提示。我们使用 AdamW [17]优化器进行 3 轮训练，权重衰减和学习率分别为 $1e\!-\!2$ 和 $1e\!-\!9$ 。在训练期间，我们使用 224 的补丁尺寸和 16 的批量大小。我们将方程 1 中的边距 $m_{cons}$ 和方程 2 及 3 中的边距设置为 $2.5e\!-\!3$ 和 $6.75e\!-\!2$ 。方程 4 中的损失权重 $\lambda_{cons}$ 、 $\lambda_{pos}$ 和 $\lambda_{neg}$ 均等于 1。在推理时，我们的模型接受整个图像作为输入，并输出单个质量分数。

4.3 Quantitative Results 4.3 定量结果

Evaluation protocol We evaluate the performance using Spearman’s rank order correlation coefficient (SRCC) and Pearson’s linear correlation coefficient (PLCC), which measure prediction monotonicity and accuracy, respectively. Higher values of SRCC and PLCC correspond to better results. Following [3], we pass the quality predictions through a four-parameter logistic non-linearity before computing PLCC.
评估协议我们使用 Spearman 秩序相关系数(SRCC)和 Pearson 线性相关系数(PLCC)来评估性能，它们分别衡量预测的单调性和准确性。SRCC 和 PLCC 值越高，结果越好。根据[3]，在计算 PLCC 之前，我们将质量预测通过一个四参数逻辑非线性处理。

We compare our approach to state-of-the-art methods in two different settings: zero-shot and cross-dataset. Note that our method remains consistent across both settings; the only variation lies in the baselines we compare against. For a fair comparison, we compute the results of each baseline using our evaluation protocol by employing the official pre-trained model if available or training it from scratch using the original hyperparameters. In the zero-shot setting, we only consider opinion-unaware methods [24, 40, 4, 36] and approaches that, with slight modifications, can function without requiring MOS [23, 28, 1, 32]. In particular, we follow the original paper for GRepQ [32], while for methods based on a linear regressor [23, 28, 1], such as CONTRIQUE [23], we use a NIQE-style framework on the extracted image features, similar to [4]. We evaluate the performance using the full datasets for testing. In the cross-dataset setting, we compare with supervised methods [33, 7, 23, 28, 1, 36, 32] using testing datasets different from the training one, simulating real-world scenarios. Here, we use the full datasets both for training and testing.
我们在两种不同的环境中将我们的方法与现有最先进的方法进行比较：零样本和跨数据集。请注意，我们的方法在这两种环境中保持一致；唯一的变化在于我们所比较的基线。为了公平比较，我们在评估过程中使用官方预训练模型（如果可用）或使用原始超参数从头开始训练来计算每个基线的结果。在零样本环境中，我们仅考虑意见无关的方法 [24, 40, 4, 36] 以及经过轻微修改后能够在不需要 MOS 的情况下运行的方法 [23, 28, 1, 32]。特别地，对于 GRepQ [32] 方法，我们遵循原始论文，而对于基于线性回归器的方法 [23, 28, 1]，如 CONTRIQUE [23]，我们在提取的图像特征上使用类似于 [4] 的 NIQE 风格框架。我们使用完整的数据集进行测试来评估性能。在跨数据集环境中，我们与监督方法 [33, 7, 23, 28, 1, 36, 32] 进行比较，使用与训练数据集不同的测试数据集，模拟现实世界情景。在这里，我们同时使用完整的数据集进行训练和测试。

Zero-shot setting We report the results for the zero-shot setting in Tab. 1. We observe that the proposed approach achieves state-of-the-art results on all the testing datasets. QualiCLIP obtains significant improvements compared to the other methods, with about a 20% and a 6% gain over the best-performing baseline on the FLIVE dataset and on average, respectively. In particular, the improvement over CLIP-IQA proves that the proposed training strategy makes CLIP generate image representations that better correlate with their intrinsic quality. In addition, we recall that GRepQ employs a low-level encoder and a high-level fine-tuned CLIP-based encoder. Despite relying solely on CLIP, QualiCLIP outperforms GRepQ, further confirming the effectiveness of our quality-aware image-text alignment strategy.
零样本设置我们在表 1 中报告了零样本设置的结果。我们观察到，所提出的方法在所有测试数据集上都达到了最先进的结果。QualiCLIP 取得了显著的提升，与其他方法相比，在 FLIVE 数据集上和平均水平上分别提升了约 20%和 6%。特别是，与 CLIP-IQA 的提升证明了所提出的训练策略使 CLIP 生成的图像表示与其内在质量更好地相关。此外，我们回忆起 GRepQ 采用了低级编码器和高级微调的基于 CLIP 的编码器。尽管仅依赖于 CLIP，QualiCLIP 仍然优于 GRepQ，进一步确认了我们质量感知图像-文本对齐策略的有效性。

Table 2: Comparison between QualiCLIP and supervised methods trained on FLIVE [39]. We report the performance on several datasets with authentic distortions. Best and second-best scores are highlighted in bold and underlined, respectively. Relative gains (losses) over the best-performing baseline are indicated in green (red).
表 2：QualiCLIP 与在 FLIVE [39]上训练的监督方法的比较。我们报告在多个包含真实失真数据集上的表现。最佳和次佳分数分别以粗体和下划线突出显示。与表现最佳的基线相比的相对增益（损失）分别以绿色（红色）标示。

		KonIQ		CLIVE		SPAQ		Average 平均
Method 方法	Opinion-Unaware 意见无关	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC
HyperIQA [33] HyperIQA [33]	✗	0.738	0.742	0.736	0.743	0.653	0.658	0.709	0.714
TReS [7] TReS [7]	✗	0.748	0.751	0.735	0.751	0.743	0.741	0.742	0.748
CONTRIQUE [23] CONTRIQUE [23]	✗	0.779	0.781	0.734	0.751	0.817	0.825	0.777	0.786
Re-IQA [28] Re-IQA [28]	✗	0.789	0.825	0.719	0.770	0.826	0.830	0.778	0.808
ARNIQA [1] ARNIQA [1]	✗	0.787	0.804	0.734	0.777	0.841	0.849	0.787	0.810
CLIP-IQA+ [36] CLIP-IQA+ [36]	✗	0.784	0.801	0.707	0.732	0.751	0.750	0.747	0.761
GRepQ [32]	✗	0.807	0.812	0.768	0.785	0.828	0.836	0.801	0.811
QualiCLIP	✓	0.815	0.837	0.753	0.790	0.843	0.855	0.804	0.827
		+1.0%	+3.1%	-2.0%	+0.6%	+0.2%	+0.7%	+0.3%	+2.0%

Cross-dataset setting Table 2 shows the results for the cross-dataset setting. This experiment allows us to compare the generalization capabilities of our model with supervised methods. We employ FLIVE as the training dataset for the baselines. We do not compare with LIQE [42] as it requires multiple datasets for training. It is important to highlight that this experimental setting is unfavorable to the proposed approach. Indeed, in contrast with the competing methods, we do not use any labeled MOS during the training process. Nevertheless, we observe that QualiCLIP outperforms all the baselines. In particular, our method obtains a significant improvement over ARNIQA, which has shown impressive generalization capabilities [1]. This experiment shows that most of the supervised methods struggle to generalize to datasets different from the training ones. In contrast, despite not requiring MOS, our method consistently achieves impressive performance on all the testing datasets, proving to be more suitable for applications in real-world scenarios. Moreover, by comparing Tab. 1 and Tab. 2, we observe that QualiCLIP is the only opinion-unaware approach that manages to obtain better results than supervised methods.
跨数据集设置表 2 显示了跨数据集设置的结果。这个实验使我们能够将我们的模型与监督方法的泛化能力进行比较。我们采用 FLIVE 作为基线的训练数据集。我们不与 LIQE[42]进行比较，因为它需要多个数据集进行训练。需要强调的是，这个实验设置对提出的方法是不利的。实际上，与竞争方法相比，我们在训练过程中没有使用任何标记的 MOS。然而，我们观察到 QualiCLIP 在所有基线中表现最佳。特别是，我们的方法在 ARNIQA 上取得了显著的改善，而 ARNIQA 已经展示了令人印象深刻的泛化能力[1]。该实验表明，大多数监督方法难以泛化到与训练数据集不同的数据集。相比之下，尽管不需要 MOS，我们的方法在所有测试数据集上始终表现出色，证明更适合在现实场景中应用。此外，通过比较表 1 和表。我们观察到，QualiCLIP 是唯一一个在没有意见的情况下，能获得比监督方法更好结果的方法。

4.4 Ablation Studies 4.4 消融研究

We conduct ablations studies on the LIVE and TID2013 synthetic datasets to evaluate the individual contribution of each component of our approach.
我们在 LIVE 和 TID2013 合成数据集上进行了消融研究，以评估我们方法中每个组件的独立贡献。

Table 3: Ablation studies on the loss terms of Eq. 4 (left) and on the training strategy (right). We report the performance on the LIVE and TID2013 synthetic datasets. Best and second-best scores are highlighted in bold and underlined, respectively.
表 3：关于方程 4 中的损失项（左）和训练策略（右）的消融研究。我们报告了在 LIVE 和 TID2013 合成数据集上的表现。最佳和次佳分数分别以粗体和下划线标出。

$\mathcal{L}_{cons}$	$\mathcal{L}_{pos}$	$\mathcal{L}_{neg}$	SRCC	PLCC	SRCC	PLCC
			LIVE		TID2013
✓	✗	✗	0.601	0.617	0.504	0.610
✗	✓	✗	0.651	0.627	0.515	0.592
✗	✗	✓	0.871	0.852	0.609	0.657
✓	✓	✗	0.670	0.649	0.523	0.605
✓	✗	✓	0.881	0.859	0.623	0.675
✗	✓	✓	0.861	0.858	0.603	0.636
✓	✓	✓	0.887	0.880	0.626	0.679

	LIVE		TID2013
Ablation	SRCC	PLCC	SRCC	PLCC
D = 2	0.841	0.817	0.618	0.676
L = 3	0.814	0.781	0.587	0.661
quality-based	0.827	0.816	0.552	0.663
w/ pos. emb.	0.879	0.866	0.629	0.674
QualiCLIP	0.887	0.880	0.626	0.679

Loss terms We study the importance of each loss term in Eq. 4 and report the results in Tab. 3 (left). First, we notice that $\mathcal{L}_{cons}$ by itself is insufficient for making CLIP generate quality-aware representations, as it does not exploit the information provided by the intrinsic ranking of the increasingly degraded crops. Nevertheless, $\mathcal{L}_{cons}$ consistently yields a positive impact when combined with any of the other loss terms. Then, we observe that despite being symmetrical to $\mathcal{L}_{pos}$ , $\mathcal{L}_{neg}$ seems to be more crucial for the training process. Given that $\mathcal{L}_{neg}$ involves the alignment between the images and the negative prompt, this outcome suggests that such prompt holds more significance in the quality score computation as in Fig. 4. We provide a detailed discussion of this hypothesis in the supplementary material. Nevertheless, Tab. 3 (left) shows that combining the three loss terms achieves the best results, proving that they are all crucial for training CLIP to generate accurate quality-aware image representations.
损失项我们研究了公式 4 中每个损失项的重要性，并在表 3（左）中报告了结果。首先，我们注意到，仅靠 $\mathcal{L}_{cons}$ 本身不足以让 CLIP 生成质量感知的表示，因为它没有利用逐渐退化裁剪的内在排名所提供的信息。然而，当与任何其他损失项结合时， $\mathcal{L}_{cons}$ 始终产生积极的影响。然后，我们观察到，尽管与 $\mathcal{L}_{pos}$ 是对称的， $\mathcal{L}_{neg}$ 似乎对训练过程更为关键。鉴于 $\mathcal{L}_{neg}$ 涉及到图像与负提示词之间的对齐，这一结果表明，此类提示词在质量分数计算中具有更大的意义，如图 4 所示。我们在补充材料中提供了关于这一假设的详细讨论。尽管如此，表 3（左）显示，结合这三个损失项可达到最佳结果，这证明它们对于训练 CLIP 生成准确的质量感知图像表示都是至关重要的。

Training strategy We evaluate the performance achieved by modified versions of our approach: 1) $D\!=\!2$ : we apply two sequential degradations to each crop in Sec. 3.2 instead of just one; 2) $L\!=\!3$ : we consider only 3 levels of degradation in Secs. 3.2 and 3.3 instead of 5; 3) we directly use the predicted quality scores associated to each degraded crop instead of its similarity to the antonym prompts in the ranking loss computation; 4) w/ pos. emb.: we relax the CLIP’s requirement of fixed-size inputs by interpolating the positional embedding instead of removing it. Table 3 (right) shows the results. First, we note that employing more than one distortion leads to a decrease in performance. This is because the synthetic degradation becomes too severe independently of the level of intensity, making it overly challenging for the model to rank the crops effectively. Moreover, considering only 3 levels of degradation provides less information to the model during training compared to using 5 different levels, and thus corresponds to worse results. Then, we observe that directly employing the predicted quality scores in the ranking loss instead of the similarity to the prompts achieves poor performance. We attribute this outcome to an increased discrepancy between the CLIP training and fine-tuning process. Indeed, while the predicted quality scores originate from two prompts (see Fig. 4), the proposed strategy considers multiple pairs of single images and texts, which we argue is more similar to the technique used for training CLIP [26]. Finally, employing an interpolated positional embedding produces comparable results to its removal. This contrasts with observations in CLIP-IQA, where the authors noted a significant performance decline when the positional embedding was used [36]. We suppose that in our case, the model learns to adjust to the presence or absence of the positional embedding during the fine-tuning process, thus achieving similar outcomes.
训练策略我们评估了我们方法的修改版本所达到的性能：1） $D\!=\!2$ ：我们在第 3.2 节中对每个切块应用两个连续的降级，而不是仅仅一个；2） $L\!=\!3$ ：我们在第 3.2 节和第 3.3 节中仅考虑 3 个降级等级而不是 5 个；3）我们直接使用与每个降级的切块相关的预测质量得分，而不是其与反义提示相似度进行排序损失计算；4）带位置嵌入：我们通过插值位置嵌入而不是移除它来放宽 CLIP 的固定大小输入要求。表 3（右）显示了结果。首先，我们注意到应用多个失真会导致性能下降。这是因为无论强度水平如何，合成降级变得过于严重，使得模型难以有效地对切块进行排序。此外，仅考虑 3 级降级在训练期间为模型提供的信息比使用 5 个不同的等级要少，因此对应更差的结果。然后，我们发现直接在排名损失中使用预测质量分数而不是与提示的相似性会导致较差的表现。我们将这个结果归因于 CLIP 训练和微调过程之间的差异增大。事实上，虽然预测的质量分数来源于两个提示（参见图 4），但提出的策略考虑了多对单图片和文本，我们认为这更接近于 CLIP 训练中使用的技术 [26]。最后，使用插值位置嵌入与不使用它的结果相当。这与在 CLIP-IQA 中的观察相反，作者注意到使用位置嵌入时性能显著下降 [36]。我们认为在我们的情况下，模型在微调过程中学习调整有或没有位置嵌入的情况，从而实现类似的结果。

4.5 Additional Experiments

We evaluate the robustness and the explainability of our model through the gMAD [21] competition and a gradCAM [29] visualization, respectively. We report additional experiments in the supplementary material.
我们通过 gMAD [21]竞赛和 gradCAM [29]可视化分别评估了我们模型的鲁棒性和可解释性。我们在补充材料中报告了额外的实验。

gMAD To assess the robustness of our model we carry out the group maximum differentiation (gMAD) competition [21]. In particular, we compare QualiCLIP against GRepQ [32] using the Waterloo Exploration Database [19] dataset, which comprises 95K synthetically degraded images without MOS annotations. In this evaluation, one model is fixed to function as a defender, and its quality predictions are grouped into distinct levels. The other model assumes the role of the attacker, tasked with identifying image pairs within each level that exhibit the greatest quality difference. For a model to demonstrate robustness, the selected image pairs should show comparable quality when acting as the defender while exhibiting a notable quality disparity when assuming the role of the attacker. We observe that when we fix QualiCLIP at a low-quality level (Fig. 5(a) top), GRepQ fails to find picture pairs with an obvious quality difference. When considering a high-quality level (Fig. 5(a) bottom), the image pair identified by GRepQ shows a slight quality gap. However, when assuming the role of the attacker (Fig. 5(b)), QualiCLIP successfully exposes the failures of GRepQ, as it pinpoints image pairs displaying a significant quality disparity. Therefore, our approach demonstrates superior robustness compared to GRepQ.
为了评估我们模型的鲁棒性，我们进行了群组最大化区分（gMAD）竞赛[21]。特别是，我们使用 Waterloo Exploration Database[19]数据集，将 QualiCLIP 与 GRepQ[32]进行比较，该数据集包含 95K 个无主观意见分数注释的合成退化图像。在此评估中，一个模型被固定作为防御者，其质量预测被分组为不同级别。另一个模型则承担攻击者的角色，负责在每个级别内识别出在质量上差异最大的图片对。为了证明模型的鲁棒性，当担任防御者时，选定的图片对应应体现出相近的质量，而当担任攻击者时，应展现出显著的质量差异。我们观察到，当我们将 QualiCLIP 固定在低质量级别时（图 5(a)上），GRepQ 未能找到质量差异明显的图片对。在考虑高质量级别时（图 5(a)下），由 GRepQ 识别出的图片对表现出轻微的质量差距。然而，当担任攻击者的角色时（图 5）。 QualiCLIP 成功揭示了 GRepQ 的缺陷，因为它指出了显示显著质量差异的图像对。因此，我们的方法比 GRepQ 表现出更优越的鲁棒性。

gradCAM visualization We evaluate the explainability of our model and CLIP-IQA via a gradCAM [29] visualization. gradCAM is a visualization technique aimed at understanding which regions of an input image are most influential for a model’s decision by studying the gradients of a given layer. We employ gradCAM to produce a heatmap of the regions of the image that activate the most for each of the antonym prompts. We employ “Good photo” and “Bad photo” as the positive and negative prompts, respectively. Following [29], we consider the last convolutional layer of the ResNet50 backbone. Figure 6(a) shows the result for the positive prompt. We observe that, compared to CLIP-IQA, our model leads to a better alignment with high-quality areas of the image, such as the head of the horse. Similarly, Fig. 6(b) illustrates that QualiCLIP focuses on the most degraded parts of the images when considering the negative prompt, in contrast with CLIP-IQA. This experiment shows that our training strategy forces CLIP to focus on the low-level characteristics of the images. Moreover, the improved alignment between the antonym prompts and the corresponding regions of the images makes QualiCLIP more easily explainable than CLIP-IQA.
gradCAM 可视化我们通过 gradCAM [29]可视化评估我们模型和 CLIP-IQA 的可解释性。gradCAM 是一种可视化技术，旨在通过研究特定层的梯度来理解输入图像中哪些区域对模型决策影响最大。我们使用 gradCAM 生成图像中对所有反义词提示反应最强烈的区域的热图。我们使用“好照片”和“坏照片”分别作为正面和负面提示。根据[29]，我们考虑 ResNet50 主干的最后一个卷积层。图 6(a)显示了正面提示的结果。我们观察到，相比于 CLIP-IQA，我们的模型在图像中的优质区域，如马头部，具有更好的对齐。同样，图 6(b)中描述了 QualiCLIP 在考虑负面提示时，重点关注图像的最退化部分，与 CLIP-IQA 形成对比。此实验表明我们的训练策略迫使 CLIP 聚焦于图像的低级特征。此外，反义词提示与图像对应区域之间的改进对齐使得 QualiCLIP 比 CLIP-IQA 更容易解释。

5 Conclusion 5 结论

In this work, we observe that CLIP struggles to generate representations that correlate with the inherent quality of the images. To address this issue, we propose QualiCLIP, a self-supervised opinion-unaware approach aimed at enhancing CLIP’s ability to produce accurate quality-aware image representations. In particular, we design a quality-aware image-text alignment strategy that trains CLIP to rank increasingly synthetically degraded images based on their similarity with antonym prompts, while ensuring consistent representations for images with comparable quality. The experiments show that QualiCLIP surpasses other state-of-the-art opinion-unaware methods – with gains of up to a 20% gain – and outperforms supervised approaches in the cross-dataset setting. Moreover, our approach demonstrates stronger robustness and improved explainability than competing methods. In future work, we will investigate how the quality-aware image representations obtained by our model can help improve the performance of CLIP-based methods designed for semantic tasks, such as image retrieval.
在这项工作中，我们观察到 CLIP 在生成与图片内在质量相关的表示时存在困难。为了解决这一问题，我们提出了 QualiCLIP，一种自监督的意见不敏感方法，旨在增强 CLIP 生成准确质量感知图像表示的能力。特别地，我们设计了一种质量感知图像-文本对齐策略，训练 CLIP 基于与反义词提示的相似性对几乎没有质量下降的图像进行排序，同时确保具有相似质量的图像有一致的表示。实验表明，QualiCLIP 比其他最新的意见不敏感方法表现更好，提升幅度可达 20%，并在跨数据集设置中超越监督方法。此外，我们的方法展示出比竞争方法更强的鲁棒性和解释能力。在未来的工作中，我们将研究通过我们的模型获得的质量感知图像表示如何帮助改进设计用于语义任务的 CLIP 方法的性能，例如图像检索。

Acknowledgments 致谢
This work was partially supported by the European Commission under European Horizon 2020 Programme, grant number 951911 - AI4Media.
本项工作部分由欧洲委员会通过欧洲地平线 2020 计划资助，资助编号为 951911 - AI4Media。

References

[1] Agnolucci, L., Galteri, L., Bertini, M., Del Bimbo, A.: ARNIQA: Learning Distortion Manifold for Image Quality Assessment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 189–198 (2024)
[2] Agnolucci, L., Galteri, L., Bertini, M., Del Bimbo, A.: Reference-based restoration of digitized analog videotapes. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1659–1668 (2024)
[3] Antkowiak, J., Baina, T.J., Baroncini, F.V., Chateau, N., FranceTelecom, F., Pessoa, A.C.F., Colonnese, F.S., Contin, I.L., Caviedes, J., Philips, F.: Final report from the video quality experts group on the validation of objective models of video quality assessment march 2000. Final report from the video quality experts group on the validation of objective models of video quality assessment march 2000 (2000)
[4] Babu, N.C., Kannan, V., Soundararajan, R.: No reference opinion unaware quality assessment of authentically distorted images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2459–2468 (2023)
[5] Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3677–3686 (2020)
[6] Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25(1), 372–387 (2015)
[7] Golestaneh, S.A., Dadsetan, S., Kitani, K.M.: No-reference image quality assessment via transformers, relative ranking, and self-consistency. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1220–1230 (2022)
[8] Gu, J., Meng, G., Da, C., Xiang, S., Pan, C.: No-reference image quality assessment with reinforcement recursive list-wise ranking. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8336–8343 (2019)
[9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[10] Hosu, V., Lin, H., Sziranyi, T., Saupe, D.: Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing 29, 4041–4056 (2020)
[11] Ke, J., Ye, K., Yu, J., Wu, Y., Milanfar, P., Yang, F.: Vila: Learning image aesthetics from user comments with vision-language pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10041–10051 (2023)
[12] Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of electronic imaging 19(1), 011006–011006 (2010)
[13] Liang, Z., Li, C., Zhou, S., Feng, R., Loy, C.C.: Iterative prompt learning for unsupervised backlit image enhancement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8094–8103 (2023)
[14] Liao, P.S., Chen, T.S., Chung, P.C., et al.: A fast algorithm for multilevel thresholding. J. Inf. Sci. Eng. 17(5), 713–727 (2001)
[15] Lin, H., Hosu, V., Saupe, D.: Kadid-10k: A large-scale artificially distorted iqa database. In: 2019 Tenth International Conference on Quality of Multimedia Experience (QoMEX). pp. 1–3. IEEE (2019)
[16] Liu, X., Van De Weijer, J., Bagdanov, A.D.: Rankiqa: Learning from rankings for no-reference image quality assessment. In: Proceedings of the IEEE international conference on computer vision. pp. 1040–1049 (2017)
[17] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
[18] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018 (2023)
[19] Ma, K., Duanmu, Z., Wu, Q., Wang, Z., Yong, H., Li, H., Zhang, L.: Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing 26(2), 1004–1016 (2016)
[20] Ma, K., Liu, W., Liu, T., Wang, Z., Tao, D.: dipiq: Blind image quality assessment by learning-to-rank discriminable image pairs. IEEE Transactions on Image Processing 26(8), 3951–3964 (2017)
[21] Ma, K., Wu, Q., Wang, Z., Duanmu, Z., Yong, H., Li, H., Zhang, L.: Group mad competition-a new methodology to compare objective image quality models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1664–1673 (2016)
[22] Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research 9(11) (2008)
[23] Madhusudana, P.C., Birkbeck, N., Wang, Y., Adsumilli, B., Bovik, A.C.: Image quality assessment using contrastive learning. IEEE Transactions on Image Processing 31, 4149–4161 (2022)
[24] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20(3), 209–212 (2012)
[25] Ponomarenko, N., Ieremeiev, O., Lukin, V., Egiazarian, K., Jin, L., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., et al.: Color image database TID2013: Peculiarities and preliminary results. In: European workshop on visual information processing (EUVIP). pp. 106–111. IEEE (2013)
[26] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[27] Roy, S., Mitra, S., Biswas, S., Soundararajan, R.: Test time adaptation for blind image quality assessment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16742–16751 (2023)
[28] Saha, A., Mishra, S., Bovik, A.C.: Re-iqa: Unsupervised learning for image quality assessment in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5846–5855 (2023)
[29] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
[30] Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15(11), 3440–3451 (2006)
[31] Shukla, A., Upadhyay, A., Bhugra, S., Sharma, M.: Opinion unaware image quality assessment via adversarial convolutional variational autoencoder. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2153–2163 (2024)
[32] Srinath, S., Mitra, S., Rao, S., Soundararajan, R.: Learning generalizable perceptual representations for data-efficient no-reference image quality assessment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 22–31 (2024)
[33] Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3667–3676 (2020)
[34] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: Yfcc100m: The new data in multimedia research. Communications of the ACM 59(2), 64–73 (2016)
[35] Thong, W., Pereira, J.C., Parisot, S., Leonardis, A., McDonagh, S.: Content-diverse comparisons improve iqa. arXiv preprint arXiv:2211.05215 (2022)
[36] Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2555–2563 (2023)
[37] Wu, H., Liao, L., Hou, J., Chen, C., Zhang, E., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring opinion-unaware video quality assessment with semantic affinity criterion. arXiv preprint arXiv:2302.13269 (2023)
[38] Wu, H., Liao, L., Wang, A., Chen, C., Hou, J., Sun, W., Yan, Q., Lin, W.: Towards robust text-prompted semantic criterion for in-the-wild video quality assessment. arXiv preprint arXiv:2304.14672 (2023)
[39] Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3575–3585 (2020)
[40] Zhang, L., Zhang, L., Bovik, A.C.: A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing 24(8), 2579–2591 (2015)
[41] Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology 30(1), 36–47 (2018)
[42] Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14071–14081 (2023)
[43] Zhao, K., Yuan, K., Sun, M., Li, M., Wen, X.: Quality-aware pre-trained models for blind image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22302–22313 (2023)

Quality-aware Image-Text Alignment
for Real-World Image Quality Assessment
Supplementary Material

Table S4: Analysis of the individual prompt contributions in the quality score computation.

T_{p}

and

T_{n}

indicate the positive and negative prompts, respectively. Best and second-best scores are highlighted in bold and underlined, respectively.

$T_{p}$	$T_{n}$	SRCC	PLCC	SRCC	PLCC
		LIVE		TID2013
✓	✗	0.382	0.381	0.059	0.206
✗	✓	0.735	0.768	0.604	0.669
✓	✓	0.887	0.880	0.626	0.679

S6 Analysis of Individual Prompt Contributions

The results of the ablation studies on the training loss terms reported in Sec. 4.4 and Tab. 3 (left) show that $\mathcal{L}_{neg}$ is more important than $\mathcal{L}_{pos}$ in the training process. We recall that $\mathcal{L}_{pos}$ (Eq. 2) and $\mathcal{L}_{neg}$ (Eq. 3) involve the alignment between the images and the positive and negative prompts, respectively. Therefore, this finding suggests that the negative prompt contributes more than the positive one in the quality score computation (illustrated in Fig. 4). To support this hypothesis, we study the individual contribution of the positive and negative prompts in obtaining the final quality scores.

Let $T_{p}$ and $T_{n}$ be the positive and negative prompts that compose the pair of antonym prompts. We conduct an experiment where we directly use the similarity between the image and each of the antonym prompts as the quality score. This is possible because both the similarities and the quality scores are comprised between 0 and 1. Table S4 shows the results on the LIVE [30] and TID2013 [25] synthetic datasets. We observe that the similarity between the negative prompt and the image provides significantly more information about its inherent quality than using the positive prompt. This result confirms our hypothesis and is consistent with the greater importance of $\mathcal{L}_{neg}$ in our training strategy. Nevertheless, Tab. S4 also indicates that both the positive and negative prompts are crucial for the quality score computation, as the strategy illustrated in Fig. 4 achieves the best performance.

We carry out an additional experiment to investigate whether the discrepancy in the contribution between the positive and negative prompts is a result of our training strategy or is inherent to CLIP itself. Specifically, we follow the experimental setting described above to evaluate the individual contributions of the prompts in the quality score computation of CLIP-IQA [36]. We recall that CLIP-IQA employs the out-of-the-box CLIP image encoder without task-specific training and computes the quality score as depicted in Fig. 4. Our experiment reveals that $T_{p}$ and $T_{n}$ achieve a SRCC of -0.036 and 0.441 on the TID2013 dataset, respectively. This outcome leads us to conclude that the similarity with the negative prompt inherently provides more meaningful information about image quality compared to using the positive prompt. We plan to investigate more thoroughly on this finding in future work.

Table S5: Comparison between QualiCLIP and competing opinion-unaware methods on datasets with synthetic distortions. Best and second-best scores are highlighted in bold and underlined, respectively. OU indicates Opinion-Unaware version as explained in Sec. 4.3.

	LIVE		CSIQ		TID2013		KADID		Average
Method	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC
NIQE [24]	0.908	0.905	0.628	0.719	0.312	0.398	0.379	0.438	0.557	0.615
IL-NIQE [40]	0.896	0.896	0.824	0.860	0.487	0.582	0.539	0.582	0.687	0.730
CONTRIQUE-OU [23]	0.854	0.848	0.695	0.715	0.323	0.360	0.552	0.563	0.606	0.622
Re-IQA-OU [28]	0.803	0.796	0.719	0.727	0.288	0.326	0.518	0.531	0.582	0.595
ARNIQA-OU [1]	0.871	0.863	0.816	0.805	0.464	0.533	0.630	0.635	0.695	0.709
CL-MI [4]	0.748	0.732	0.588	0.589	0.253	0.316	0.506	0.513	0.524	0.538
CLIP-IQA [36]	0.663	0.663	0.723	0.781	0.504	0.600	0.480	0.485	0.593	0.632
GRepQ-OU [32]	0.727	0.717	0.692	0.706	0.402	0.550	0.423	0.471	0.561	0.611
QualiCLIP	0.887	0.880	0.772	0.812	0.626	0.679	0.655	0.660	0.735	0.758

S7 Additional Experimental Results

S7.1 Quantitative Results

We report additional quantitative results, following the evaluation protocol detailed in Sec. 4.3. In particular, we consider synthetic datasets in the zero-shot setting, while we employ the CLIVE dataset [6] to train the baselines in the cross-dataset setting.

Zero-shot setting We evaluate the performance of our approach on four synthetically degraded datasets: LIVE [30], CSIQ [12], TID2013 [25], and KADID [15]. LIVE comprises 779 images degraded with 5 different distortion types at 5 levels of intensity, with 29 reference images as the base. CSIQ originates from 30 reference images, each distorted with 6 distinct degradations at 5 intensity levels, resulting in 866 images. TID2013 and KADID comprise 3000 and 10125 images degraded using 24 and 25 types of distortion across 5 different degrees of intensity, originating from 25 and 81 reference images, respectively. We recall that we use LIVE and TID2013 for validation, so the results of our method could likely exhibit some bias. Still, we report them for completeness. We provide the results in Tab. S5. Although primarily designed for use with authentically distorted images in real-world scenarios, QualiCLIP also achieves state-of-the-art performance on synthetic datasets. Indeed, similar to what we observed for the authentic datasets in Sec. 4.3, our method obtains significant improvements over all the baselines.

Cross-dataset setting Table S6 shows the results for the cross-dataset setting when using the CLIVE [6] dataset for training the baselines. Despite being the only opinion-unaware approach, QualiCLIP outperforms all competing approaches. In particular, our method achieves superior performance compared with other CLIP-based approaches, namely CLIP-IQA [36] and GRepQ [32]. This outcome aligns with the results reported in Tab. 2 and further confirms the effectiveness of our quality-aware image-text alignment strategy.

Table S6: Comparison between QualiCLIP and supervised methods trained on CLIVE [6]. We report the performance on several datasets with authentic distortions. Best and second-best scores are highlighted in bold and underlined, respectively.

		KonIQ		FLIVE		SPAQ		Average
Method	Opinion-Unaware	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC	SRCC	PLCC
HyperIQA [33]	✗	0.750	0.787	0.335	0.483	0.776	0.796	0.620	0.689
TReS [7]	✗	0.738	0.766	0.356	0.477	0.865	0.870	0.653	0.704
CONTRIQUE [23]	✗	0.734	0.747	0.355	0.465	0.840	0.850	0.643	0.687
Re-IQA [28]	✗	0.732	0.753	0.341	0.449	0.823	0.832	0.632	0.678
ARNIQA [1]	✗	0.751	0.781	0.384	0.480	0.862	0.872	0.666	0.711
CLIP-IQA+ [36]	✗	0.780	0.814	0.369	0.481	0.855	0.861	0.668	0.719
GRepQ [32]	✗	0.779	0.793	0.345	0.449	0.839	0.852	0.654	0.698
QualiCLIP	✓	0.815	0.837	0.393	0.496	0.843	0.855	0.684	0.729

S7.2 gMAD Competition

We compare the robustness of QualiCLIP with CLIP-IQA [36] by conducting the group maximum differentiation (gMAD) competition [21]. We provide more details on gMAD in Sec. 4.5. Figure S7 shows the results. When QualiCLIP is fixed (Fig. 7(a)), CLIP-IQA struggles to identify image pairs with an evident quality gap. In contrast, when QualiCLIP operates as the attacker (Fig. 7(b)), it successfully highlights the failures of CLIP-IQA by finding image pairs displaying significantly different quality. This result shows that our approach demonstrates greater robustness than CLIP-IQA.

S7.3 t-SNE Visualization

We compare the image representations generated by QualiCLIP and CLIP-IQA [36] via a t-SNE [22] visualization. Following [32], we consider images from the CLIVE [6] dataset with very high or very low quality. In particular, we take into account images with a labeled MOS greater than 75 and lower than 25, respectively. Figure S8 shows the results. We observe that the representations of high- and low-quality images obtained by the proposed approach (Fig. S8) correspond to more easily separable clusters compared to those of CLIP-IQA (Fig. S8), which are more intertwined. This result confirms that QualiCLIP generates image representations that better correlate with their intrinsic quality.

S8 Additional Implementation Details

S8.1 Prompts

Following [37, 38], we employ multiple pairs of antonym prompts during training and inference. In particular, we use: 1) “Good/Bad photo”; 2) “Good/Bad picture”; 3) “High-resolution/Low-resolution image”; 4) “High-quality/Low-quality image”; 5) “Sharp/Blurry image”; 6) “Sharp/Blurry edges”; 7) “Noise-free/Noisy image”. We average the similarities between the images and the pairs of prompts. As we keep the CLIP text encoder fixed, computing the text features of the prompts is a one-time requirement. Subsequently, we can employ them both for training and inference.

S8.2 Synthetic Distortions

As detailed in Sec. 3.2, during training we synthetically degrade pristine images with increasing intensity levels to make our approach self-supervised. Specifically, similar to [1] we consider 24 distinct distortion types divided into the 7 degradation groups defined by the KADID [15] dataset. Each degradation has 5 degrees of progressively higher intensity. We report an example for all the intensity levels of each distortion belonging to the degradation groups in Figs. S9, S10, S11, S12, S13, S14 and S15. Each distortion is described as follows:

1.
Brightness change:
- •
  
  Brighten: applies a sequence of color space transformations, curve adjustments, and blending operations to increase the brightness of the image;
- •
  
  Darken: similar to the brighten operation, but reduces the brightness instead of increasing it;
- •
  
  Mean shift: adjusts the average intensity of image pixels by adding a constant value to all pixel values. Then, it constrains the resulting values to stay within the original image range;
2.
Blur:
- •
  
  Gaussian blur: applies a Gaussian kernel filter to each image pixel;
- •
  
  Lens blur: applies a circular kernel filter to each image pixel;
- •
  
  Motion blur: applies a linear motion blur kernel to each image pixel, simulating the effect of either a moving camera or a moving object in the scene. This results in the image appearing blurred in the direction of the motion;
3.
Spatial distortions:
- •
  
  Jitter: randomly displaces image data by applying small offsets to warp each pixel;
- •
  
  Non-eccentricity patch: randomly selects patches from the image and places them in random neighboring positions;
- •
  
  Pixelate: employs a combination of downscaling and upscaling operations using nearest-neighbor interpolation;
- •
  
  Quantization: quantizes the image into $N$ uniform levels. The quantization thresholds are dynamically computed using Multi-Otsu’s method [14];
- •
  
  Color block: randomly superimposes uniformly colored square patches onto the image;
4.
Noise:
- •
  
  White noise: adds Gaussian white noise to the image;
- •
  
  White noise in color component: transforms the image to the YCbCr color space and then adds Gaussian white noise to each channel;
- •
  
  Impulse noise: adds salt and pepper noise to the image;
- •
  
  Multiplicative noise: adds speckle noise to the image;
5.
Color distortions:
- •
  
  Color diffusion: transforms the image to the LAB color space and then applies Gaussian blur to each channel;
- •
  
  Color shift: randomly shifts the green channel and then blends it into the original image, masking it with the normalized gradient magnitude of the original image;
- •
  
  Color saturation 1: transforms the image to the HSV color space and then scales the saturation channel by a factor;
- •
  
  Color saturation 2: transforms the image to the LAB color space and then scales each color channel by a factor;
6.
Compression:
- •
  
  JPEG2000: applies the standard JPEG2000 compression to the image;
- •
  
  JPEG: applies the standard JPEG compression to the image;
7.
Sharpness & contrast:
- •
  
  High sharpen: applies unsharp masking to sharpen the image in the LAB color space;
- •
  
  Nonlinear contrast change: applies a nonlinear tone mapping operation to adjust the contrast of the image;
- •
  
  Linear contrast change: applies a linear tone mapping operation to adjust the contrast of the image;