Learning Generalizable Perceptual Representations for Data-Efficient No-Reference Image Quality Assessment
学习可泛化的感知表示以进行数据高效的无参考图像质量评估

Suhas Srinath^∗ Shankhanil Mitra^∗ Shika Rao Rajiv Soundararajan
Indian Institute of Science
Bengaluru, India 560012
{suhass12, shankhanilm, shikarao, rajivs}@iisc.ac.in

Abstract

No-reference (NR) image quality assessment (IQA) is an important tool in enhancing the user experience in diverse visual applications. A major drawback of state-of-the-art NR-IQA techniques is their reliance on a large number of human annotations to train models for a target IQA application. To mitigate this requirement, there is a need for unsupervised learning of generalizable quality representations that capture diverse distortions. We enable the learning of low-level quality features agnostic to distortion types by introducing a novel quality-aware contrastive loss. Further, we leverage the generalizability of vision-language models by fine-tuning one such model to extract high-level image quality information through relevant text prompts. The two sets of features are combined to effectively predict quality by training a simple regressor with very few samples on a target dataset. Additionally, we design zero-shot quality predictions from both pathways in a completely blind setting. Our experiments on diverse datasets encompassing various distortions show the generalizability of the features and their superior performance in the data-efficient and zero-shot settings.
无参考（NR）图像质量评估（IQA）是在各类视觉应用中增强用户体验的重要工具。目前最先进的 NR-IQA 技术的主要缺陷是它们依赖大量人工标注来训练面向目标 IQA 应用的模型。为了减少这一需求，需要针对捕获多样化失真进行无监督学习的一般化质量表示。我们通过引入一种新颖的质量感知对比损失来实现无失真类型区分的低级质量特征学习。此外，我们利用视觉-语言模型的泛化能力，通过相关文本提示微调其中一个模型以提取高级图像质量信息。通过在目标数据集上使用非常少样本训练一个简单的回归器，两个特征集被结合以有效预测质量。此外，我们设计了完全盲设置下的两种路径的零样本质量预测。我们在涵盖多种失真的各类数据集上的实验显示出特征的泛化能力以及其在数据高效和零样本设置中的优越性能。

^*^*footnotetext: Equal contribution.^$\S$^$\S$footnotetext: https://github.com/suhas-srinath/GRepQ

1 Introduction

The increasing number of imaging devices, including cameras and smartphones, has significantly increased the volume of images captured, edited, and shared on a global scale. As a result, there is a necessity to assess the quality of visual content to enhance user experience. Image Quality Assessment (IQA) is generally divided into two categories: full-reference (FR) and no-reference (NR) IQA. While FR-IQA relies on pristine reference images for quality assessment, NR-IQA is more relevant and challenging due to the absence of clean references for user-captured images.
包括摄像机和智能手机在内的成像设备数量不断增加，显著提高了全球范围内捕获、编辑和分享的图像数量。因此，需要对视觉内容的质量进行评估，以增强用户体验。图像质量评估（IQA）通常分为两类：全参考（FR）和无参考（NR）IQA。尽管 FR-IQA 依赖于原始参考图像进行质量评估，但 NR-IQA 由于用户捕获图像缺乏干净的参考，更具相关性和挑战性。

Most successful NR IQA methods are deep-learning based, and require a large number of images with human opinion scores for training. As imaging systems evolve, the distortions also evolve, making it difficult to keep creating large annotated datasets for training NR IQA models. This motivates the study of limited data or data-efficient NR IQA models which can be trained on a target IQA application (or database) with limited labels. Such an approach works best if the learned quality representations can generalize well across different distortion types for various IQA tasks. These representations can then be mapped to quality using a simple linear model [16, 25] using the limited labels on a target application. Further, it is desirable that we learn these representations without requiring any human annotations of image quality. The goal of our work is to learn generalizable image quality representations through self-supervised learning to design data-efficient NR models for a target IQA application.
大多数成功的无参考图像质量评估（NR IQA）方法都是基于深度学习的，需要大量带有人类意见评分的图像进行训练。随着成像系统的发展，失真也随之演变，这使得不断创建大型标注数据集用于训练 NR IQA 模型变得困难。这激发了对有限数据或数据高效的 NR IQA 模型的研究，这些模型可以在具有有限标签的目标 IQA 应用（或数据库）上进行训练。如果学习到的质量表示可以很好地泛化到不同失真类型以适用于各种 IQA 任务，这种方法效果最佳。然后可以通过一个简单的线性模型[16, 25]利用目标应用上的有限标签将这些表示映射到质量。此外，我们希望在学习这些表示时不需要任何关于图像质量的人类标注。我们工作的目标是通过自监督学习来学习可泛化的图像质量表示，为目标 IQA 应用设计数据高效的 NR 模型。

In this regard, while DEIQT[22] studies the data-efficient IQA problem, they train the entire network with millions of parameters, which still requires a reasonable number of labeled training images. On the other hand, recent methods such as CONTRIQUE [16], Re-IQA [25], and QPT [4] focus on self-supervised contrastive learning to learn quality features, which can potentially yield superior performance with limited labels. However, these methods do not consider that images with different distortions could have the same quality, thereby limiting the generalizability of their image features across varied distortions.
在这方面，虽然 DEIQT[22]研究了数据高效 IQA 问题，但他们使用数百万参数训练整个网络，这仍然需要合理数量的标记训练图像。另一方面，最近的方法如 CONTRIQUE [16]、Re-IQA [25] 和 QPT [4]专注于自监督对比学习以学习质量特征，这可能在有限的标签条件下实现优异的性能。然而，这些方法没有考虑到具有不同失真效果的图像可能具有相同的质量，从而限制了其图像特征在不同失真间的通用性。

Our main contribution is the design of Generalizable Representations for Quality (GRepQ) that can predict quality by training a simple linear model with few annotations. We present two sets of features, one to capture the local low-level quality variations and another to predict quality using the global context. To capture low-level quality features, we propose a quality-aware contrastive learning strategy guided by a perceptual similarity measure between distorted versions of an image. In particular, we bring similar-quality images closer in the latent space irrespective of their distortion types. This is achieved by assigning a weight based on a similarity measure between every pair of distorted versions of an image. Our strategy enables the learning of generalizable quality representations invariant to distortion types.
我们的主要贡献是设计了一种名为质量通用表示（GRepQ）的方法，可以通过训练一个简单的线性模型并使用少量注释来预测质量。我们提出了两组特征，一组用于捕捉局部低级质量变化，另一组用于利用全局上下文预测质量。为了捕捉低级质量特征，我们提出了一种质量感知对比学习策略，该策略由图像的失真版本之间的感知相似性度量指导。特别地，我们在潜在空间中使相似质量的图像更加接近，而不考虑它们的失真类型。这通过根据每对失真图像版本之间的相似性度量分配权重来实现。我们的策略使得学习到的质量表示对失真类型具有通用性。

We also leverage the generalization capabilities of large vision-language models for extracting high-level quality information. Notably, the versatile CLIP[23] model can be applied to zero-shot quality prediction [28], although a lack of task-specific fine-tuning limits it. While LIQE [42] fine-tunes CLIP by integrating scene and distortion information, it requires large-scale training with human labels. We overcome these limitations through a novel unsupervised fine-tuning of CLIP. We achieve this by segregating images of higher and lower quality into groups using antonym text prompts and employing a group-contrastive loss with respect to the prompts. Our group-contrastive learning facilitates the learning of high-level quality representations that can generalize well to diverse content and distortions.
我们还利用大型视觉-语言模型的泛化能力来提取高层次的质量信息。值得注意的是，通用的 CLIP[23]模型可以应用于零样本质量预测[28]，但由于缺乏任务特定的微调，限制了其应用。虽然 LIQE[42]通过整合场景和失真信息来微调 CLIP，但需要大规模的人为标注进行训练。我们通过一种新的无监督微调 CLIP 的方法克服了这些限制。我们通过使用反义词文本提示将较高质量和较低质量的图像分组，并根据提示应用组对比损失来实现这一目标。我们的组对比学习促进了高层次质量表征的学习，可以很好地泛化到各种内容和失真。

The features from both pathways can be combined to learn a simple regressor trained with few samples from any IQA dataset. Additionally, predictions can be made in a zero-shot setting using the learned features, which can then be combined to provide a single objective score. We show through extensive experiments that our framework shows superior performance in both the data-efficient as well as zero-shot settings. We summarize the main contributions of our framework as follows:
来自两个路径的特征可以结合起来，以学习一个简单的回归器，该回归器只需从任何 IQA 数据集中采集少量样本进行训练。此外，利用学习到的特征可以在零样本环境中进行预测，然后结合为一个单一的客观评分。通过大量实验，我们证明了我们的框架在数据效率和零样本环境中都表现出卓越的性能。我们总结了框架的主要贡献如下：

•

A quality-aware contrastive loss that weighs positive and negative training pairs using a “soft” perceptual similarity measure between a pair of samples to enable representation learning invariant to distortion types.
一种质量感知对比损失，通过样本对之间的“软”感知相似性度量对正负训练对进行加权，以实现对失真类型不变的表示学习。
•

An unsupervised task-specific adaptation of a vision-language model to capture semantic quality information. We achieve this by separating higher and lower-quality groups of images based on quality-relevant antonym text prompts.
视觉-语言模型的无监督任务特定适应，用于捕捉语义质量信息。我们通过基于与质量相关的反义词文本提示，将图片分为高质量和低质量组来实现这一目标。
•

Superior performance of our method over other NR-IQA methods trained using few samples (data-efficient) on several IQA datasets to highlight the generalizability of our features. Additionally, we show superior cross-database prediction performance.
通过对多个 IQA 数据集进行训练，我们的方法在少量样本（数据高效）上比其他 NR-IQA 方法表现更优，以强调我们特征的广泛适用性。此外，我们还展示了卓越的跨数据库预测性能。
•

A zero-shot quality prediction method using the learned features and its superior performance compared to other zero-shot (or completely blind) methods.
一种使用学习特征进行零样本质量预测的方法，以及相比其他零样本（或完全盲测）方法的优越性能。

2 Related Work 2 相关工作

2.1 Supervised NR-IQA 2.1 监督的无参考图像质量评价

Many popular supervised NR-IQA methods such as BRISQUE[17], DIIVINE[19], BLIINDS[24], CORNIA[35] predict quality using hand-crafted natural scene statistics based features. Such methods have succeeded when images contain synthetic distortions but often suffer when the distortions are more complex or authentic. To mitigate this, several deep learning-based methods have emerged that are either trained in an end-to-end fashion [41, 3, 11] or use a pre-trained feature encoder that can be fine-tuned for IQA [41]. Further, transformer-based models have shown promise on authentic and synthetically distorted images [6, 37, 29, 27]. Methods such as MetaIQA[44] employ meta-learning to learn from synthetic data and adapt to real-world images efficiently. A recent method, LIQE [42] adapts the CLIP model for IQA via scene and distortion classification along with supervised fine-tuning on several IQA datasets. However, the model requires multiple annotations per image during training making the model infeasible when adapting to newer and more complex datasets in the data-efficient regime.
许多流行的监督式无参考图像质量评估（NR-IQA）方法，如 BRISQUE[17]、DIIVINE[19]、BLIINDS[24]、CORNIA[35]，使用手工设计的自然场景统计特征来预测质量。这些方法在图像包含合成失真时取得了成功，但在失真更复杂或真实时往往效果不佳。为了解决这个问题，一些基于深度学习的方法已经出现，这些方法要么是通过端到端方式训练[41, 3, 11]，要么是使用可以为 IQA 微调的预训练特征编码器[41]。此外，基于变换器的模型在真实和合成失真的图像上表现出色[6, 37, 29, 27]。像 MetaIQA[44]这样的方法采用元学习，从合成数据中学习并高效地适应真实世界的图像。最近的方法 LIQE [42]通过场景和失真分类适配 CLIP 模型用于 IQA，并在多个 IQA 数据集上进行监督微调。然而，该模型在训练时每张图像需要多个标注，这使得在数据高效环境中适应更新和更复杂的数据集时难以实现。

2.2 Self-Supervised Quality Feature Learning
2.2 自监督质量特征学习

Although supervised NR-IQA methods have shown reasonable performance in quality prediction, they still possess the limitation of requiring large amounts of human annotations for training. One of the earliest approaches in this domain was through the design of quality-aware codebooks [35]. Later, different ranking-based methods were used for quality-aware pre-training [14]. Contrastive learning-based training such as CONTRIQUE [16], Re-IQA [25] and QPT [4] learn quality representations by contrasting multiple levels of synthetic distortions. While Re-IQA [25] also uses high and low-level features, our method significantly differs from Re-IQA in how the low-level and high-level features are designed. Further, all the above methods neither consider the generalizability to unseen distortions nor do they consider the data-efficient evaluation setting.
虽然监督式无参考图像质量评估（NR-IQA）方法在质量预测中表现合理，但它们仍然存在需要大量人工标注进行训练的局限。该领域最早的方法之一是通过设计质量感知码本[35]。随后，采用不同的基于排序的方法进行质量感知预训练[14]。基于对比学习的训练方法，如 CONTRIQUE[16]、Re-IQA[25]和 QPT[4]，通过对比多个级别的合成失真来学习质量表示。虽然 Re-IQA[25]也使用了高低级特征，但我们的方法在低级和高级特征的设计上与 Re-IQA 有显著区别。此外，上述所有方法均未考虑对未见失真的泛化能力，也未考虑数据高效的评估设置。

2.3 Zero-Shot or Completely Blind (CB) IQA
2.3 零样本或完全盲（CB）图像质量评估

Another class of IQA methods are zero-shot or completely blind and do not require any human opinions for their design. For example, NIQE [18] neither requires training on a dataset of annotated images nor knowledge about possible degradations. IL-NIQE [38] improves over NIQE by integrating other quality-aware features based on Gabor filter responses, gradients, and color statistics. However, both methods tend to fail on authentic and other complex distortions. A recent method [2] learns deep features using contrastive learning to predict quality without any supervision. However, the performance shown on in-the-wild IQA datasets still provides scope for further improvement. Leveraging the contextual information from CLIP[23], CLIP-IQA [28] shows that a zero-shot application of the CLIP model can yield promising quality predictions. However, zero-shot methods tend to have poorer performance and motivate the use of limited labels on target IQA applications to improve performance.
另一类 IQA 方法是零样本或完全盲测，不需要任何人的意见用于其设计。例如，NIQE[18]既不需要在标注的图像数据集上训练，也不需要了解可能的降解。IL-NIQE[38]通过融合基于 Gabor 滤波响应、梯度和颜色统计的其他质量感知特征对 NIQE 进行了改进。然而，这两种方法在真实和其他复杂失真上往往表现不佳。最近的一种方法[2]使用对比学习来学习深度特征，从而在没有任何监督的情况下预测质量。然而，在自然场景的 IQA 数据集上的性能仍然有进一步改进的空间。利用 CLIP[23]的上下文信息，CLIP-IQA[28]表明 CLIP 模型的零样本应用能够产生有希望的质量预测。然而，零样本方法往往表现较差，促使在目标 IQA 应用上使用有限的标签来提高性能。

2.4 Data-Efficient IQA 2.4 数据高效的图像质量评估

IQA in the low-data setting remains relatively unexplored. Data-efficient image quality assessment (DEIQT) [22] shows that IQA models can be efficiently fine-tuned with very few annotated samples from a target dataset, enabling generalization through data efficiency. Further, with a sufficient number of training samples, data-efficient training can achieve performances of full dataset supervision on multiple IQA datasets. However, DEIQT still requires end-to-end fine-tuning of a transformer model, leading to increased training times.
在低数据环境下的图像质量评估（IQA）仍然相对未被充分探索。数据有效图像质量评估（DEIQT）[22]显示，IQA 模型可以使用目标数据集中极少量的标注样本进行高效微调，使通过数据效率实现泛化。此外，随着训练样本数量充足，数据有效训练可以在多个 IQA 数据集上达到整个数据集监督的性能。然而，DEIQT 仍然需要对变压器模型进行端到端微调，导致训练时间增加。

Refer to caption — Figure 1: Illustration of the GRepQ framework. The low-level model, $f_{\theta}$ , is trained using multiple distorted versions of an image $\mathbf{x}_{i}$ subjected to the fragment sampling operation $T(\cdot)$ . $\mathbf{x}_{i}^{j}$ denotes an anchor image. The perceptual similarity measure $s(\cdot,\cdot)$ is used to weigh the feature similarities of pairs of distorted images $\mathbf{x}_{i}^{k}$ and $\mathbf{x}_{i}^{l}$ in Eq. 1. The high-level model, $g_{\phi}$ , is trained using Eq. 3 after selecting groups of features based on their cosine similarities with embeddings $\mathbf{z}_{g}$ and $\mathbf{z}_{b}$ of antonym text prompts that relate to higher and lower quality respectively. Embeddings are obtained from the text-encoder $h(\cdot).$
图 1：GRepQ 框架示意图。低层模型 $f_{\theta}$ 使用图像 $\mathbf{x}_{i}$ 的多个失真版本进行训练，这些版本经过片段采样操作 $T(\cdot)$ 。 $\mathbf{x}_{i}^{j}$ 表示锚图像。感知相似性度量 $s(\cdot,\cdot)$ 用于在公式 1 中加权失真图像对 $\mathbf{x}_{i}^{k}$ 和 $\mathbf{x}_{i}^{l}$ 的特征相似性。高层模型 $g_{\phi}$ 使用公式 3 进行训练，训练时选择的特征组基于其与分别对应高质量和低质量的反义文本提示嵌入 $\mathbf{z}_{g}$ 和 $\mathbf{z}_{b}$ 的余弦相似性。嵌入从文本编码器 $h(\cdot).$ 中获得。

3 Method 3 方法

We first describe our approach to learning generalizable low-level and high-level quality representations. The overall framework is illustrated in Fig. 1. We discuss how quality is predicted in the data-efficient and zero-shot settings.
我们首先描述我们的方法，以学习可泛化的低层次和高层次质量表示。整体框架如图 1 所示。我们讨论如何在数据高效和零样本设置下预测质量。

3.1 Low-Level Representation Model
3.1 低层次表示模型

Contrastive learning for image quality [16, 25, 4] discriminates images based on varied types and levels of distortions to capture low-level information. While this provides very good pre-trained feature encoders without learning from human labels, images with different distortion types are often treated as a negative pair with respect to an anchor image, and the features of such images are pulled apart. This leads to two main issues. An image with a different distortion type may have a similar perceptual quality as the anchor. Secondly, since image representations with different distortion types are separated, this hurts the model’s generalizability to represent unseen distortions. The goal of our work is to address both these limitations through a quality-aware contrastive learning loss.
对图像质量的对比学习 [16, 25, 4] 基于不同类型和程度的失真区分图像，以捕获低级别信息。虽然这种方法无需学习来自人工标签的数据就能提供非常好的预训练特征编码器，但不同失真类型的图像通常被视为与锚定图像的负对，并将这些图像的特征分开。这导致两个主要问题。具有不同失真类型的图像可能会与锚定图像具有相似的感知质量。其次，由于不同失真类型的图像表示被分开，这损害了模型对表示未见失真的泛化能力。我们的工作目标是通过质量感知对比学习损失解决这两个限制。

We introduce a novel quality-aware contrastive loss, where positive and negative pairs (pairs of images considered similar and dissimilar in quality, respectively) are selected based on their perceptual similarity. This allows a soft weighting such that a similarity weight close to one treats the pair of images as positive and pulls their corresponding representations closer. Similarly, image features are pulled apart when the perceptual similarity is near zero. This differs from the way prior methods use contrastive learning. In particular, our framework allows the selection of pairs regardless of their distortions, allowing for generalization.
我们引入了一种新颖的质量感知对比损失，其中正负样本对（在质量上被认为相似和不相似的图像对）是基于它们的感知相似性来选择的。这允许一种软权重，使相似性权重接近一时，将图像对视为正样本，并拉近它们的对应表示。同样，当感知相似性接近零时，图像特征会被拉开。这与先前方法使用对比学习的方式有所不同。特别是，我们的框架允许选择样本对而不考虑它们的失真，从而实现通用化。

Image Augmentation: In order to train the feature encoder using contrastive learning, we generate multiple synthetically distorted versions of a camera-captured image and sample fragments from each image. Four synthetic distortions are generated: blur, compression, noise, and color saturation, at two levels each. Fragment sampling has proven effective in retaining the global quality information of an image [32]. To obtain fragments, we divide an image into grids, and random mini-patches are extracted from each of the grid locations. The mini-patches are then stitched together to yield a single fragmented image that is used to train the model. An augmentation is generated by randomly sampling another set of mini-patches from the same image to obtain another fragmented image. Note that this augmentation is quality preserving and can be used as a hard-positive pair in contrastive loss.
图像增强：为了使用对比学习训练特征编码器，我们生成了多种合成失真的照相机拍摄图像，并从每个图像中抽取片段。生成了四种合成失真：模糊、压缩、噪声和色彩饱和度，每种失真有两个级别。片段采样已被证明在保留图像的全局质量信息方面是有效的 [32]。为了获取片段，我们将图像划分为网格，并从每个网格位置随机提取小块贴片。这些小块贴片随后被拼接在一起，形成用于训练模型的一个片段化图像。通过从同一图像中随机抽取另一组小块贴片生成一个增强，以获得另一个片段化图像。注意，这种增强是保留质量的，并可以作为对比损失中的一个强正对。

Quality-Aware Contrastive Loss: We contrast multiple distorted versions of the same scene to learn quality representations and mitigate content bias. Consider a batch of $N_{b}$ images $\{\mathbf{x}_{i}\}_{i=1}^{N_{b}}$ , where each image has $D$ distorted versions. Let $\mathbf{x}_{i}^{j}$ and $\mathbf{x}_{i}^{k}$ denote two distorted versions of an image $\mathbf{x}_{i}$ where $j,k\in\{1,2,\ldots,D\}$ . Let $\mathbf{z}_{i}^{j}$ and $\mathbf{z}_{i}^{k}$ be the respective unit-norm feature representations obtained as $\mathbf{z}_{i}^{j}=f_{\theta}(T(\mathbf{x}_{i}^{j}))$ and $\mathbf{z}_{i}^{k}=f_{\theta}(T(\mathbf{x}_{i}^{k}))$ , where $T(\cdot)$ is the fragment sampling operation and $f_{\theta}(\cdot)$ is the feature encoder. Let $s(\cdot,\cdot):\mathbb{R}^{M\times N}\times\mathbb{R}^{M\times N}\rightarrow[0,1]$ denote a perceptual similarity measure between two images with the same content. Further, let $p_{\tau}(\mathbf{z}_{1},\mathbf{z}_{2})=\text{exp}(\mathbf{z}_{1}\cdot\mathbf{z}_{2}/\tau)$ . We overcome the limitation of existing contrastive learning methods which require hard positives and negatives through the above soft similarity measure to label positives and negatives. The similarity measures the closeness of distorted versions in terms of intrinsic quality attributes and provides a confidence weight in contrastive loss.
质量感知对比损失：我们通过对比同一场景的多个失真版本来学习质量表示，并减少内容偏差。考虑一个包含 $N_{b}$ 张图片的批次 $\{\mathbf{x}_{i}\}_{i=1}^{N_{b}}$ ，其中每张图片有 $D$ 个失真版本。设 $\mathbf{x}_{i}^{j}$ 和 $\mathbf{x}_{i}^{k}$ 表示图片 $\mathbf{x}_{i}$ 的两个失真版本，其中 $j,k\in\{1,2,\ldots,D\}$ 。设 $\mathbf{z}_{i}^{j}$ 和 $\mathbf{z}_{i}^{k}$ 为通过 $\mathbf{z}_{i}^{j}=f_{\theta}(T(\mathbf{x}_{i}^{j}))$ 和 $\mathbf{z}_{i}^{k}=f_{\theta}(T(\mathbf{x}_{i}^{k}))$ 获得的单位范数特征表示，其中 $T(\cdot)$ 是片段采样操作， $f_{\theta}(\cdot)$ 是特征编码器。设 $s(\cdot,\cdot):\mathbb{R}^{M\times N}\times\mathbb{R}^{M\times N}\rightarrow[0,1]$ 表示具有相同内容的两张图片之间的感知相似性度量。此外，设 $p_{\tau}(\mathbf{z}_{1},\mathbf{z}_{2})=\text{exp}(\mathbf{z}_{1}\cdot\mathbf{z}_{2}/\tau)$ 。我们通过上述软相似度度量标签正样本和负样本克服了现有对比学习方法需要硬正样本和负样本的局限性。相似性度量失真版本在内在质量属性方面的接近程度，并在对比损失中提供置信权重。

Our quality-aware contrastive loss is given by $\mathcal{L}_{QACL}=\sum_{i=1}^{N_{b}}\sum_{j=1}^{D}\mathcal{L}_{i}^{j}$ , where $\mathcal{L}_{i}^{j}$ is given by
我们的质量感知对比损失由 $\mathcal{L}_{QACL}=\sum_{i=1}^{N_{b}}\sum_{j=1}^{D}\mathcal{L}_{i}^{j}$ 给出，其中 $\mathcal{L}_{i}^{j}$ 由给出

\mathcal{L}_{i}^{j}=-\text{log}\frac{p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{j+})+\sum_{k\neq j}s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k})p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})}{p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{j+})+\sum_{k\neq j}p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})},

(1)

where $\mathbf{z}_{i}^{j+}$ is the representation of an augmentation of the image $\mathbf{x}_{i}^{j}$ , and $\tau_{1}$ is a temperature hyperparameter. Note that since $p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})=s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k})p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})+(1-s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k}))p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})$ , $\mathbf{x}_{i}^{k}$ is treated as similar to $\mathbf{x}_{j}^{k}$ with weight $s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k})$ and dissimilar with weight $(1-s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k}))$ . The similarity function makes the learning distortion type agnostic since it measures relative degradation without knowledge of the distortion type, making the learned features generalizable to different (and unseen) distortions. The InfoNCE [20] loss can be seen as a special case of $\mathcal{L}_{QACL}$ when $s(\cdot,\cdot)=0$ .
其中 $\mathbf{z}_{i}^{j+}$ 是图像 $\mathbf{x}_{i}^{j}$ 的增强表示，而 $\tau_{1}$ 是一个温度超参数。请注意，由于 $p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})=s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k})p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})+(1-s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k}))p_{\tau_{1}}(\mathbf{z}_{i}^{j},\mathbf{z}_{i}^{k})$ ， $\mathbf{x}_{i}^{k}$ 被视为与权重 $s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k})$ 的 $\mathbf{x}_{j}^{k}$ 相似，与权重 $(1-s(\mathbf{x}_{i}^{j},\mathbf{x}_{i}^{k}))$ 的 $\mathbf{x}_{j}^{k}$ 不相似。相似性函数使得学习对失真类型不敏感，因为它在没有失真类型知识的情况下测量相对退化，使得学习的特征可以推广到不同的（甚至是未见过的）失真。InfoNCE[20]损失可以看作是 $\mathcal{L}_{QACL}$ 的一个特例，当 $s(\cdot,\cdot)=0$ 时。

It is desirable that the perceptual similarity measure used satisfies a few properties: ( $1$ ) It captures intrinsic quality-specific attributes, such as structure, sharpness, or contrast, ( $2$ ) It is capable of handling various distortion types used during training and correlates well with human judgments on these distortions, ( $3$ ) It captures local and global quality information relevant to the human visual system, and $(4)$ It predicts similarity with fairly low complexity to enable faster training times. We explore different similarity measures such as FSIM [39], SSIM [30], GMSD [33], MS-SSIM[31] and LPIPS[40] in our work. Degraded reference IQA [1] has also shown that such similarity measures can be used to distinguish between distorted versions of a degraded reference image. Finally, we note that such similarity measures can be used to compare different distorted versions of an anchor image since all of them have the same content. Thus, we do not include variations of different images in this loss.
理想情况下，所使用的感知相似度测量需要满足一些特性：（ $1$ ）它能捕捉结构、清晰度或对比度等内在的质量特定属性，（ $2$ ）能够处理训练中使用的各种失真类型，并且与人类对这些失真的判断具有很好的相关性，（ $3$ ）它捕捉与人类视觉系统相关的局部和全局质量信息，以及 $(4)$ 以较低的复杂度预测相似性，从而加快训练时间。我们在工作中探索了不同的相似性度量，如 FSIM [39]、SSIM [30]、GMSD [33]、MS-SSIM[31]和 LPIPS[40]。退化参考 IQA [1]同样显示，这些相似性测量可以用来区分退化参考图像的畸变版本。最后，我们注意到，这些相似性测量可用于比较同一锚点图像的不同失真版本，因为它们都具有相同的内容。因此，我们并未在此损失中包含不同图像的变化。

3.2 High-Level Representation Model
3.2 高层表示模型

To understand the scene context for IQA, we adapt the CLIP model to the IQA task. Image quality can be obtained using CLIP by measuring the cosine similarity between the image feature and the text embeddings of a pair of antonym prompts such as [‘‘a good photo.’’, ‘‘a bad photo.’’] [28]. Although CLIP has a reasonable zero-shot quality prediction performance in terms of correlation with human opinion, the representations are not specifically crafted for the task of IQA, leading to a performance gap. We bridge this gap by fine-tuning the image encoder of CLIP through an unsupervised loss, as described below.
为了理解场景上下文以进行图像质量评估（IQA），我们将 CLIP 模型应用于 IQA 任务。可以通过测量图像特征与一对反义词提示的文本嵌入之间的余弦相似度来使用 CLIP 获取图像质量，例如[“一张好照片。”，“一张坏照片。”] [28]。虽然 CLIP 在与人类意见相关的零样本质量预测性能上表现合理，但其表示并未专门针对 IQA 任务设计，导致性能差距。我们通过一种无监督损失微调 CLIP 的图像编码器来弥合这一差距，如下所述。

Contrastive Learning Over Groups: Analyses of vision-language models show that text representations are richer than image representations [13, 21]. Thus, we fix the text encoder in the CLIP model and only update the image encoder for the IQA task. The text representations corresponding to the antonym prompts remain the same during training and testing. We propose a loss for updating the image encoder that aims at separating images in a batch into groups based on how close the image representations are to two text-prompt embeddings. We then seek to align the representations of images within each group and separate the representations across groups. Such a loss simultaneously ensures that the intra-group feature entropy (entropy of representations within each group) is minimized and the inter-group entropy (entropy of features between groups) is maximized [43, 10].
跨组对比学习：视觉-语言模型的分析表明，文本表示比图像表示更丰富 [13, 21]。因此，我们在 CLIP 模型中固定文本编码器，并仅更新图像编码器以执行 IQA 任务。与反义词提示对应的文本表示在训练和测试中保持不变。我们提出了一种用于更新图像编码器的损失，旨在根据图像表示与两个文本提示嵌入的接近程度将批次中的图像分成组。然后，我们寻求对齐每组内图像的表示，并分离跨组的表示。这样的损失同时确保组内特征熵（每组内表示的熵）最小化，而组间熵（组间特征的熵）最大化 [43, 10]。

Consider a batch consisting of $N$ images $\{\mathbf{x}_{i}\}_{i=1}^{N}$ with visual representations $\{\mathbf{z}_{i}\}_{i=1}^{N}$ . The representations are obtained as $\mathbf{z}_{i}=g_{\phi}(\mathbf{x}_{i})$ , where $g_{\phi}(\cdot)$ is the CLIP image encoder. Let $\mathbf{z}_{g}$ and $\mathbf{z}_{b}$ correspond to the prompt representations of ‘‘a good photo.’’ and ‘‘ bad photo.’’ respectively. We construct two groups of images, $\mathcal{S}_{g}$ and $\mathcal{S}_{b}$ , that correspond to higher and lower quality respectively based on the quality estimated as
考虑一个由 $N$ 图像 $\{\mathbf{x}_{i}\}_{i=1}^{N}$ 组成的批次，具有视觉表示 $\{\mathbf{z}_{i}\}_{i=1}^{N}$ 。这些表示通过 $\mathbf{z}_{i}=g_{\phi}(\mathbf{x}_{i})$ 获得，其中 $g_{\phi}(\cdot)$ 是 CLIP 图像编码器。设 $\mathbf{z}_{g}$ 和 $\mathbf{z}_{b}$ 分别对应于“好照片”和“坏照片”的提示表示。我们构建了两个图片组， $\mathcal{S}_{g}$ 和 $\mathcal{S}_{b}$ ，它们分别对应于基于质量估计的较高和较低质量。

Q_{H}(\mathbf{x}_{i})=\frac{1}{1+\text{exp}(k_{2}(\mathbf{z}_{i}\cdot\mathbf{z}_{b}-\mathbf{z}_{i}\cdot\mathbf{z}_{g}))},

(2)

where $k_{2}$ is a scaling parameter. Let the features $\{\mathbf{z}_{i}\}_{i=1}^{N}$ sorted in increasing order of $Q_{H}$ be $\{\mathbf{z}_{(1)},\mathbf{z}_{(2)},\cdots,\mathbf{z}_{(N)}\}$ . We obtain the groups as $\mathcal{S}_{b}=\{z_{(i)}\}_{i=1}^{M}$ and $\mathcal{S}_{g}=\{z_{(i)}\}_{i=N-M+1}^{N}$ , where $M=\texttt{round}(N/k)$ , $k$ is a hyperparameter that decides the separability of lower and higher quality groups within a batch of images, and $M$ denotes the group size. Let $\mathcal{S}_{b}(i)=\mathcal{S}_{b}\setminus\{\mathbf{z}_{i}\}$ and $\mathcal{S}_{g}(i)=\mathcal{S}_{g}\setminus\{\mathbf{z}_{i}\}$ . Our group contrastive loss used for fine-tuning is expressed as
其中 $k_{2}$ 是一个缩放参数。让特征 $\{\mathbf{z}_{i}\}_{i=1}^{N}$ 按 $Q_{H}$ 递增顺序排列为 $\{\mathbf{z}_{(1)},\mathbf{z}_{(2)},\cdots,\mathbf{z}_{(N)}\}$ 。我们得到的组为 $\mathcal{S}_{b}=\{z_{(i)}\}_{i=1}^{M}$ 和 $\mathcal{S}_{g}=\{z_{(i)}\}_{i=N-M+1}^{N}$ ，其中 $M=\texttt{round}(N/k)$ , $k$ 是一个超参数，决定图像批次中低质量组和高质量组的可分性， $M$ 表示组的大小。令 $\mathcal{S}_{b}(i)=\mathcal{S}_{b}\setminus\{\mathbf{z}_{i}\}$ 和 $\mathcal{S}_{g}(i)=\mathcal{S}_{g}\setminus\{\mathbf{z}_{i}\}$ 。我们用于微调的组对比损失表达为

	$\displaystyle\mathcal{L}_{GCL}=$	$\displaystyle-\sum_{\mathbf{z}_{i}\in\mathcal{S}_{g}}\text{log}\frac{\sum_{\mathbf{z}_{j}\in\mathcal{S}_{g}(i)}p_{\tau_{2}}(\mathbf{z}_{i},\mathbf{z}_{j})}{\sum_{\mathbf{z}_{j}\in\mathcal{S}_{g}(i)\cup\mathcal{S}_{b}}p_{\tau_{2}}(\mathbf{z}_{i},\mathbf{z}_{j})}$
		$\displaystyle-\sum_{\mathbf{z}_{i}\in\mathcal{S}_{b}}\text{log}\frac{\sum_{\mathbf{z}_{j}\in\mathcal{S}_{b}(i)}p_{\tau_{2}}(\mathbf{z}_{i},\mathbf{z}_{j})}{\sum_{\mathbf{z}_{j}\in\mathcal{S}_{b}(i)\cup\mathcal{S}_{g}}p_{\tau_{2}}(\mathbf{z}_{i},\mathbf{z}_{j})}.$		(3)

While creating groups, a quality separation gap and closeness of quality scores of images in each group are necessary for effective contrastive learning. The parameter $k$ controls this separability and is a hyperparameter that needs to be appropriately chosen.
在创建组时，对于有效的对比学习来说，图像质量分数之间的质量分离差距以及每组中图像的质量分数接近性是必要的。参数 $k$ 控制这种可分离性，是一个需要适当选择的超参数。

3.3 Mapping Representations to Objective Quality
3.3 将表示映射到客观质量

Data-Efficient Quality Prediction: Once the high and low-level features are learned, they are concatenated and regressed with mean opinion scores on the evaluation datasets using a few samples from each dataset. We use a linear SVR $f_{d}(\cdot):\mathbb{R}^{P}\rightarrow\mathbb{R}$ on features of target datasets, where $P$ is the feature dimension. The data-efficient quality of any new image $\mathbf{x}$ can simply be computed using its corresponding feature representation $\mathbf{z}_{\mathbf{x}}\in\mathbb{R}^{P}$ as
数据高效质量预测：一旦学习了高低层特征，它们就会被连接，并使用来自每个数据集的几个样本通过平均意见评分对评估数据集进行回归。我们在目标数据集的特征上使用线性 SVR $f_{d}(\cdot):\mathbb{R}^{P}\rightarrow\mathbb{R}$ ，其中 $P$ 是特征维度。任何新图像 $\mathbf{x}$ 的数据高效质量可以简单地使用其相应的特征表示 $\mathbf{z}_{\mathbf{x}}\in\mathbb{R}^{P}$ 计算为

\mathbf{GRepQ}_{D}(\mathbf{x})=f_{d}(\mathbf{z}_{\mathbf{x}}).

(4)

Our approach offers the advantage of requiring no end-to-end training using the limited labels on a new target database.
我们的方法具有无需在新的目标数据库上使用有限标签进行端到端训练的优势。

Zero-Shot Quality Prediction: We use different approaches for the low-level and high-level representations to predict quality without using any supervision. For the low-level features, we compute a distance between the features of the input image and that of a corpus of pristine images similar to NIQE as
零样本质量预测：我们使用不同的方法，对低层次和高层次表征进行质量预测，而不使用任何监督。对于低层次特征，我们计算输入图像特征与一组完美图像特征之间的距离，类似于 NIQE。

d(\mathbf{x})=\sqrt{(\mu_{p}-\mu_{d})^{T}\bigg{(}\frac{\Sigma_{p}+\Sigma_{d}}{2}\bigg{)}^{-1}(\mu_{p}-\mu_{d})},

(5)

where $\mu_{p}$ and $\Sigma_{p}$ are the mean and covariance of the representations from the low-level encoder corresponding to patches of pristine images. $\mu_{d}$ and $\Sigma_{d}$ are the mean and covariance of the representations of the patches from an image $\mathbf{x}$ . Here, non-overlapping patches of size $R\times R$ are extracted from the image to estimate the relevant statistics of the features. The low-level quality is then predicted as
其中 $\mu_{p}$ 和 $\Sigma_{p}$ 是来自低级编码器的表示的均值和协方差，分别对应于原始图像的图块。 $\mu_{d}$ 和 $\Sigma_{d}$ 是来自图像 $\mathbf{x}$ 的图块表示的均值和协方差。在此，从图像中提取大小为 $R\times R$ 的不重叠图块，以估计特征的相关统计数据。然后，低级质量预测为

Q_{L}(\mathbf{x})=\frac{1}{1+\text{exp}(k_{1}d(\mathbf{x}))},

(6)

where $k_{1}$ is a scaling parameter. The quality from the high-level representations can be predicted using Eq. 2. The overall image quality is then measured as
其中 $k_{1}$ 是一个缩放参数。可以使用方程 2 预测来自高层表示的质量。然后整体图像质量被衡量为

\text{$\mathbf{GRepQ}_{Z}$}(\mathbf{x})=Q_{H}(\mathbf{x})+Q_{L}(\mathbf{x}),

(7)

and is illustrated in Fig. 2.
如图 2 所示。

4 Experiments

4.1 Training and Implementation Details
4.1 训练和实施细节

Training Dataset for Representation Learning: We train the low and high-level feature encoders on the FLIVE dataset [36] using a subset of $5000$ real-world images encompassing a variety of authentic distortions with different resolutions and aspect ratios. The diverse content and distortions make it conducive to learning representations that can be generalized to diverse images. No human annotations were used during this training.
表示学习的训练数据集：我们在 FLIVE 数据集[36]上训练低级和高级特征编码器，使用子集包含 $5000$ 张真实世界图像，其中涵盖了各种真实变形，具有不同的分辨率和纵横比。多样的内容和失真有助于学习能够推广到不同图像的表示。在此训练过程中没有使用人工标注。

Low-Level Encoder: We use a ResNet18 (performance with a Resnet50 was found to be similar) without pre-trained weights for the low-level feature encoder. The contrastive loss in Eq. 1 is trained by projecting the features from the penultimate layer of ResNet18 onto $\mathbb{R}^{128}$ . Images are fragmented into $7\times 7$ grids, and random mini-patches from each grid location are stitched together to form $224\times 224$ sized patches. The temperature $\tau_{1}$ is fixed at $0.5$ . A batch consists of $8$ images with $8$ distorted versions each. The model is trained for $15$ epochs using the AdamW [15] optimizer with a weight decay of $0.05$ and an initial learning rate of $10^{-4}$ . A cosine learning rate scheduler is used. To guide the quality-aware contrastive training, we employ FSIM as the perceptual similarity measure.
低级编码器：我们使用没有预训练权重的 ResNet18（使用 ResNet50 的性能类似）作为低级特征编码器。方程 1 中的对比损失通过将 ResNet18 的倒数第二层特征投影到 $\mathbb{R}^{128}$ 上进行训练。图像被分割成 $7\times 7$ 个网格，并将每个网格位置的随机小补丁拼接在一起，形成 $224\times 224$ 大小的补丁。温度 $\tau_{1}$ 固定为 $0.5$ 。一个批次包含 $8$ 张图像，每张图像有 $8$ 个失真版本。模型使用 AdamW [15]优化器进行训练，历时 $15$ 个周期，权重衰减为 $0.05$ ，初始学习率为 $10^{-4}$ 。使用余弦学习率调度器。为了指导质量感知的对比训练，我们采用 FSIM 作为感知相似度度量。

Method 方法 Method 方法 Type 类型 CLIVE KonIQ CSIQ LIVE PIPAL Labels 标签 50 100 200 50 100 200 50 100 200 50 100 200 50 100 200 TReS[6] End-to- 端到端 end 结束 Fine 精细 Tuning 调优 0.670 0.751 0.799 0.713 0.719 0.791 0.791 0.811 0.878 0.901 0.927 0.957 0.186 0.349 0.501 HyperIQA[27] HyperIQA[27] 0.648 0.725 0.790 0.615 0.710 0.776 0.790 0.824 0.909 0.892 0.912 0.929 0.102 0.302 0.379 DEIQT[22] 0.667 0.718 0.812 0.638 0.682 0.754 0.821 0.891 0.941 0.920 0.942 0.955 0.396 0.410 0.436 MANIQA[34] 0.642 0.769 0.797 0.652 0.755 0.810 0.794 0.847 0.874 0.909 0.928 0.957 0.136 0.361 0.470 LIQE[42] 0.691 0.769 0.810 0.759 0.801 0.832 0.838 0.891 0.924 0.904 0.934 0.948 - - - Resnet50[8] Simple 简单 Feature 特征 Regre- 回归 ssion 会话 0.576 0.611 0.636 0.635 0.670 0.707 0.793 0.890 0.935 0.871 0.906 0.922 0.150 0.220 0.302 CLIP[23] CLIP[23] 0.664 0.721 0.733 0.736 0.770 0.782 0.841 0.892 0.941 0.896 0.923 0.941 0.254 0.303 0.368 CONTRIQUE[16] 0.695 0.729 0.761 0.733 0.794 0.821 0.840 0.926 0.940 0.891 0.922 0.943 0.379 0.437 0.488 Re-IQA[25] Re-IQA[25] 0.591 0.621 0.701 0.685 0.723 0.754 0.893 0.907 0.923 0.884 0.894 0.929 0.280 0.350 0.431 $\text{GRepQ}_{\text{D}}$ (LL) 0.531 0.565 0.613 0.620 0.647 0.679 0.794 0.805 0.832 0.866 0.880 0.886 0.395 0.410 0.431 $\text{GRepQ}_{\text{D}}$ (HL) $\text{GRepQ}_{\text{D}}$ （HL） 0.740 0.770 0.796 0.794 0.813 0.843 0.869 0.905 0.932 0.904 0.927 0.944 0.410 0.415 0.427 $\text{GRepQ}_{\text{D}}$ (HL + LL) $\text{GRepQ}_{\text{D}}$ (高频低频) 0.760 0.791 0.822 0.812 0.836 0.855 0.878 0.914 0.941 0.926 0.937 0.953 0.489 0.518 0.548

Table 1: SRCC performance comparison of

\text{GRepQ}_{\text{D}}

with other NR-IQA methods trained using few labels on various IQA databases. The methods are segregated into end-to-end trained (top five) and feature-learning-based (next four) methods. LL and HL correspond to low and high-level models respectively. The best-performing methods are bolded.
表 1：

\text{GRepQ}_{\text{D}}

与其他使用少量标签训练的无参考图像质量评估（NR-IQA）方法在各个 IQA 数据库上的 SRCC 性能比较。这些方法分为端到端训练（前五个）和基于特征学习（接下来的四个）方法。LL 和 HL 分别对应低级和高级模型。表现最佳的方法加粗显示。

High-Level Encoder: We fine-tune CLIP’s image encoder while keeping the text encoder fixed. The image encoder consists of a Resnet50 backbone with an additional attention-pooling layer. To enable contrastive learning over groups specified in Eq. 3, a projection head is used to contrast features in $\mathbb{R}^{128}$ . The images are center-cropped to a size of $224\times 224$ , and a batch size of $N=128$ is used. Based on the coarse predictions obtained using Eq. 2, we use a separability hyperparameter $k=8$ to divide the batch of images into groups of size $M=16$ . Once the groups are formed, the image encoder is trained using Eq. 3 with a temperature $\tau_{2}=0.1$ . The model is trained for $15$ epochs using an Adam optimizer with an initial learning rate of $5\times 10^{-6}$ . The scaling parameter $k_{2}$ is set to $10$ .
高级编码器：我们微调 CLIP 的图像编码器，同时保持文本编码器固定。图像编码器由一个 Resnet50 骨干网和一个额外的注意力池化层组成。为了实现 Eq.3 中指定的组的对比学习，使用投影头对 $\mathbb{R}^{128}$ 中的特征进行对比。图像经过中心裁剪，大小为 $224\times 224$ ，批量大小为 $N=128$ 。基于使用 Eq.2 获得的粗略预测，我们使用可分性超参数 $k=8$ 将图像批次分成大小为 $M=16$ 的组。一旦形成组，使用 Eq.3 和温度 $\tau_{2}=0.1$ 训练图像编码器。模型使用初始学习率为 $5\times 10^{-6}$ 的 Adam 优化器训练 $15$ 个时代。缩放参数 $k_{2}$ 设置为 $10$ 。

Zero-Shot Quality Prediction using Low-Level Encoder: For the zero-shot quality prediction using Eq. 5, we select $125$ pristine image patches as used in literature [2] (chosen based on sharpness and colorfulness). Patches of size $96\times 96$ are extracted from the pristine images and the test image. The scaling parameter $k_{1}$ is set to $0.01$ .
使用低级编码器进行零样本质量预测：对于使用方程 5 的零样本质量预测，我们选择文献[2]中使用的 $125$ 个原始图像块（基于清晰度和色彩鲜艳度选择）。从原始图像和测试图像中提取尺寸为 $96\times 96$ 的图像块。缩放参数 $k_{1}$ 设置为 $0.01$ 。

All the implementations were done in PyTorch using two 11GB Nvidia GeForce RTX 2080 Ti GPUs.
所有实现均在 PyTorch 中完成，使用两块 11GB Nvidia GeForce RTX 2080 Ti GPU。

4.2 Experimental Setup 4.2 实验设置

We present the details of the two main evaluation settings: data-efficient setting and the zero-shot setting. In the data-efficient setting, we train our data-efficient framework $\text{GRepQ}_{\text{D}}$ , using a few samples from each evaluation dataset. We randomly split each evaluation dataset into 80% and 20% and use the 20% subset for testing. We select a random subset of 50, 100, or 200 samples from the 80% for training a linear support vector regressor (SVR) on the features. We use Spearman’s rank order correlation coefficient (SRCC) between the objective and subjective scores to evaluate the models’ performance. We report the median performance obtained across 10 splits of each evaluation dataset. The results with respect to Pearson’s linear correlation coefficient (PLCC) are given in the supplementary. In the zero-shot setting, no training on any evaluation dataset is required, and we test on the entire evaluation dataset.
我们介绍两个主要评价设置的详细信息：数据高效设置和零样本设置。在数据高效设置中，我们使用每个评估数据集的一些样本来训练我们的数据高效框架 $\text{GRepQ}_{\text{D}}$ 。我们随机将每个评估数据集拆分为 80%和 20%，并使用 20%子集进行测试。我们从 80%中随机选择 50、100 或 200 个样本用于在特征上训练线性支持向量回归器（SVR）。我们使用目标分数和主观分数之间的斯皮尔曼等级相关系数（SRCC）来评估模型的表现。我们报告每个评估数据集的 10 次拆分得到的中值表现。关于皮尔逊线性相关系数（PLCC）的结果在补充材料中给出。在零样本设置中，不需要对任何评估数据集进行训练，我们在整个评估数据集上进行测试。

Evaluation Datasets: We choose a variety of datasets spanning different types of distortions to demonstrate the effectiveness of our framework for the three experimental settings. Since the training images are sampled from the FLIVE dataset, we do not evaluate them on FLIVE. We evaluate two popular in-the-wild datasets: CLIVE [5], KONiQ [9], and three synthetic or processed image datasets: LIVE-IQA[26], CSIQ[12] and PIPAL[7]. CLIVE contains $1,162$ images captured from multiple mobile devices. KonIQ-10K contains $10073$ in-the-wild images. LIVE-IQA[26] contains $29$ scenes along with $779$ distorted images containing JPEG compression, blur, noise, and fast-fading distortions. CSIQ[12] consists of $30$ original images with $866$ distorted images with blur, contrast, and JPEG compression distortions. PIPAL is a large IQA database consisting of $23,200$ images with $40$ different distortions per image, including GAN-generated artifacts, making this dataset very challenging to evaluate.
评估数据集：我们选择了多种数据集，涵盖不同类型的失真，来展示我们框架在三种实验设置中的有效性。由于训练图像是从 FLIVE 数据集中抽样的，因此我们不对 FLIVE 进行评估。我们评估了两个受欢迎的自然场景数据集：CLIVE [5]，KONiQ [9]，以及三个合成或处理后的图像数据集：LIVE-IQA[26]，CSIQ[12]和 PIPAL[7]。CLIVE 包含 $1,162$ 张由多种移动设备拍摄的图像。KonIQ-10K 包含 $10073$ 张自然场景图像。LIVE-IQA[26]包含 $29$ 个场景和 $779$ 张含有 JPEG 压缩、模糊、噪声和快速衰褪失真的图像。CSIQ[12]由 $30$ 幅原始图像和 $866$ 张含有模糊、对比度失真和 JPEG 压缩失真的图像组成。PIPAL 是一个大型 IQA 数据库，由 $23,200$ 张图像组成，每张图像具有 $40$ 种不同的失真，包括 GAN 生成的伪影，使得该数据集的评估非常具有挑战性。

4.3 Data-Efficient Setting
4.3 数据高效设置

We compare $\text{GRepQ}_{\text{D}}$ with other state-of-the-art (SoTA) end-to-end NR-IQA methods: TReS[6], HyperIQA[27], and MANIQA[34], the data-efficient method DEIQT[22], and feature based methods: Resnet50[8], CLIP[23], CONTRIQUE[16] and Re-IQA[25]. We note that LIQE[42] is not trainable on PIPAL and thus its entry is left blank. For the methods requiring feature regression, the SVR parameters are optimized to yield the best performances. To ensure fair comparisons, the median performance of all methods over ten train-test splits are reported.
我们将 $\text{GRepQ}_{\text{D}}$ 与其他最先进的端到端无参考图像质量评价方法进行比较：TReS[6]，HyperIQA[27]，MANIQA[34]，数据高效方法 DEIQT[22]，以及基于特征的方法：Resnet50[8]，CLIP[23]，CONTRIQUE[16] 和 Re-IQA[25]。我们注意到 LIQE[42]无法在 PIPAL 上进行训练，因此其条目留空。对于需要特征回归的方法，SVR 参数经过优化以获得最佳性能。为了确保公平比较，报告了所有方法在十次训练-测试拆分中的中位性能。

Tab. 1 presents comparisons on the data-efficient training of $\text{GRepQ}_{\text{D}}$ against other NR-IQA methods. The results indicate that $\text{GRepQ}_{\text{D}}$ outperforms other methods on all datasets in almost all three data regimes (50, 100, and 200 samples). We notice that $\text{GRepQ}_{\text{D}}$ outperforms even end-to-end trained models despite using a simple SVR. The superior performance over Re-IQA, which may also be considered as an ensemble of two sets of features, demonstrates the superiority of both our low and high level features. While it may appear that the high-level model performs better than the low-level model in most of the scenarios, we provide examples in Sec. 4.6, where the low-level model could also be more accurate. Thus, there is a need for both the high and low-level representations. As an extreme case, we also present results in the fully-supervised setting in the supplement.
表 1 比较了 $\text{GRepQ}_{\text{D}}$ 与其他无参考图像质量评估方法在数据高效训练方面的表现。结果表明，在几乎所有三个数据方案（50、100 和 200 样本）中， $\text{GRepQ}_{\text{D}}$ 在所有数据集上均优于其他方法。我们注意到，尽管使用的是简单的 SVR， $\text{GRepQ}_{\text{D}}$ 甚至超越了端到端训练的模型。表现优于 Re-IQA（也可视作两组特征的集成）显示了我们低层和高层特征的优越性。虽然在大多数情况下，高层模型似乎优于低层模型，但我们在第 4.6 节提供了例子，低层模型在某些情况下也可能更准确。因此，既需要高层也需要低层的表示。作为极端案例，我们也在补充材料中展示了完全监督环境下的结果。

Method 方法 CLIVE KonIQ CSIQ LIVE PIPAL NIQE[18] NIQE[18] 0.463 0.530 0.613 0.836 0.153 IL-NIQE[38] 0.440 0.507 0.814 0.847 0.282 CL-MI[2] 0.507 0.645 0.588 0.663 0.303 CLIP-IQA[28] 0.612 0.700 0.690 0.652 0.261 $\text{GRepQ}_{\text{Z}}$ 0.740 0.768 0.693 0.741 0.436

Table 2: Performance comparison of

\text{GRepQ}_{\text{Z}}

(zero-shot) with other zero-shot methods on various IQA databases.
表 2：

\text{GRepQ}_{\text{Z}}

（零样本）与其他零样本方法在各种 IQA 数据库上的性能比较。

Training 训练 FLIVE KonIQ CLIVE LIVE CSIQ Testing 测试 CLIVE KonIQ CLIVE KonIQ CSIQ LIVE 实时 HyperIQA 0.758 0.735 0.785 0.772 0.744 0.926 TReS 0.713 0.740 0.786 0.733 0.761 - CONTRIQUE 0.710 0.781 0.731 0.676 0.823 0.925 DEIQT 0.733 0.781 0.794 0.744 0.781 0.932 $\text{GRepQ}_{\text{C}}$ 0.774 0.815 0.774 0.792 0.770 0.893

Table 3: Cross-dataset performance of

\text{GRepQ}_{\text{C}}

along with other NR-IQA methods. Results for methods apart from CONTRIQUE are from [22].
表 3：

\text{GRepQ}_{\text{C}}

与其他 NR-IQA 方法的跨数据集性能。除了 CONTRIQUE 以外的方法结果来自[22]。

4.4 Zero-Shot Setting 4.4 零样本设置

Since zero-shot methods are trained without human supervision, we compare $\text{GRepQ}_{\text{Z}}$ with unsupervised or completely blind NR-IQA methods such as NIQE [18], IL-NIQE [38], contrastive learning with mutual information (CL-MI) [2], and CLIP-IQA [28]. We utilize entire evaluation databases for testing all the methods.
由于零样本方法是在没有人工监督的情况下进行训练的，我们将 $\text{GRepQ}_{\text{Z}}$ 与无监督或完全盲目的无参考图像质量评估方法进行比较，如 NIQE [18]、IL-NIQE [38]、互信息对比学习（CL-MI）[2]和 CLIP-IQA [28]。我们利用整个评估数据库来测试所有方法。

Tab. 2 shows that $\text{GRepQ}_{\text{Z}}$ consistently outperforms other methods on three out of five datasets by considerable margins. $\text{GRepQ}_{\text{Z}}$ achieves SoTA performance even on the challenging PIPAL dataset, containing diverse distortions, particularly images restored by various restoration (including GAN-based) methods for super-resolution and denoising. A $44\%$ improvement in SRCC is shown over the second-best-performing algorithm (CL-MI). The lower performance of $\text{GRepQ}_{\text{Z}}$ on LIVE and CSIQ is attributed to content bias of both the low-level and high-level models. Although the low-level model is trained in a content conditional manner, the features perhaps do suffer from some residual content bias. Since LIVE and CSIQ contain very few unique scene content, the residual content bias leads to reduced performance of our zero-shot model. Despite these challenges, $\text{GRepQ}_{\text{Z}}$ still achieves competitive performance, showing its generalization capability in the zero-shot setting.
表 2 显示， $\text{GRepQ}_{\text{Z}}$ 在五个数据集中的三个上稳定地以较大幅度超越其他方法。即使在包含多种畸变的具有挑战性的 PIPAL 数据集上， $\text{GRepQ}_{\text{Z}}$ 也实现了超出技术水平的性能，特别是各种恢复方法（包括基于 GAN 的方法）的超分辨率和去噪。与表现第二好的算法（CL-MI）相比，SRCC 提高了 $44\%$ 。 $\text{GRepQ}_{\text{Z}}$ 在 LIVE 和 CSIQ 上的较低性能归因于低级和高级模型的内容偏差。尽管低级模型是以内容条件的方式训练的，但这些特征可能依然受到一些残余内容偏差的影响。由于 LIVE 和 CSIQ 包含非常少量独特的场景内容，这种残余内容偏差导致我们的零样本模型性能下降。尽管存在这些挑战， $\text{GRepQ}_{\text{Z}}$ 仍然表现出竞争力，显示了其在零样本设置中的泛化能力。

Similarity Measure 相似性度量	$50$	$100$	$200$
None 无	0.381	0.413	0.452
SSIM	0.533	0.558	0.590
MS-SSIM	0.527	0.561	0.575
GMSD	0.544	0.570	0.583
LPIPS	0.578	0.605	0.629
FSIM	0.620	0.647	0.679

Table 4: SRCC performance analysis on the KonIQ dataset of the impact of different perceptual similarity measures on the low-level model under the data-efficient setting.
表 4：在数据高效设置下，不同感知相似性度量对低级模型在 KonIQ 数据集上的 SRCC 性能分析。

4.5 Cross-Database Experiments
4.5 跨数据库实验

We also show the effectiveness of our features through cross-database experiments. Here, a single linear SVR (ridge regressor) is trained on an entire dataset and tested on other intra-domain datasets in the authentic and synthetic image settings. The results in Tab. 3 indicate that $\text{GRepQ}_{\text{C}}$ (defined as the cross-dataset prediction evaluated using Eq. 4) achieves competitive (also best) performances in most of the evaluation settings.
我们还通过跨数据库实验展示了我们特征的有效性。在这里，一个简单的线性 SVR（ridge regressor）在整个数据集上进行训练，并在真实和合成图像设置中的其他域内数据集上进行测试。表 3 中的结果表明， $\text{GRepQ}_{\text{C}}$ （根据公式 4 进行评估的跨数据集预测）在大多数评估设定中实现了具竞争力（也是最佳）的性能。

4.6 A Deeper Understanding of GRepQ Features
4.6 深入理解 GRepQ 特征

Choice of Perceptual Similarity Measures in the Low-Level Feature Encoder: We compare different popular perceptual similarity measures such as SSIM [30], MS-SSIM[31], FSIM[39], LPIPS[40] and GMSD[33] used in the low-level feature encoder in Tab. 4. The low-level encoders are trained using these measures under similar training settings. We also train an encoder without any similarity measure (denoted by None) to show a need for quality-aware contrastive learning. In this case, all the other distorted versions of an image are treated as negatives, while the augmented version is the only positive. In Tab. 4, we show the low-level encoder’s data-efficient performances on KonIQ. We see that FSIM outperforms all other measures. We note that the superior performance of FSIM in this context is consistent with its superior performance as an FR-IQA metric across multiple datasets.
低级特征编码器中的感知相似性度量选择：我们在表 4 中比较了不同的流行感知相似性度量，如 SSIM[30]、MS-SSIM[31]、FSIM[39]、LPIPS[40]和 GMSD[33]，这些度量用于低级特征编码器。这些低级编码器是在相似的训练设置下使用这些度量进行训练的。我们还训练了一个不使用任何相似性度量的编码器（标记为 None），以展示需要质量感知对比学习的必要性。在这种情况下，所有其他的图像失真版本都被视为负样本，而增强版本是唯一的正样本。在表 4 中，我们展示了低级编码器在 KonIQ 上的数据高效性能。我们发现 FSIM 的表现优于其他度量。我们注意到，FSIM 在此情境中的卓越性能与作为多个数据集中的 FR-IQA 度量的卓越性能是一致的。

[Uncaptioned image] — Figure 4: Demonstrating the complementarity of high and low-level model predictions. Images with different MOS from the KonIQ-10K database are listed. Low and high-level model predictions are mentioned below their respective MOS. Predictions that agree with human opinions are marked with green while erroneous predictions are marked with red.

Analyzing High-Level Feature Representations: We analyze the impact of our group-contrastive learning in improving the high-level quality representations. For this analysis, we identify extremely good and extremely bad quality images based on mean opinion score (MOS) greater than 75 or less than 25 respectively on the combined CLIVE and KonIQ datasets. We show the feature representations of the CLIP model in Fig. 3a and those of our model in Fig. 3b. We see that our learned representations are better separable between the higher and lower-quality images. This leads to the superior performance of our high-level model when compared to CLIP-IQA.
分析高层特征表示：我们分析了群组对比学习在提高高层质量表示方面的影响。为此分析，我们基于 CLIVE 和 KonIQ 数据集的综合后的平均意见得分（MOS）来识别质量极佳和极差的图像，分别为大于 75 或小于 25。我们在图 3a 中展示了 CLIP 模型的特征表示，并在图 3b 中展示了我们模型的特征表示。我们发现我们学习到的表示在高质量和低质量图像之间具有更好的可分性。这使得我们的高层模型在与 CLIP-IQA 比较时表现更优。

Complementarity of High and Low-Level Features: We present a qualitative and quantitative analysis of the complementarity of representations from both encoders. We show examples of when the two models outperform each other in Fig. 4. For instance, Fig. 4a shows that the low-level model makes an erroneous prediction since only the object in focus is blurred, but the background is relatively clean. Fig. 4d shows that the image does not contain enough contextual information for the high-level model to make an accurate prediction. We also perform an error-based feature complementarity analysis in Fig. 5. In particular, we compute the absolute error between the MOS predicted by the high and low-level models and the true MOS and show them in four quadrants. We see several examples where one of the models performs much better than the other. This shows that the models have complementary behavior in many examples.
高低层特征的互补性：我们对来自两个编码器的表示互补性进行了定性和定量分析。在图 4 中，我们展示了两个模型各自表现优异的实例。例如，图 4a 显示低层模型做出了错误预测，因为只有聚焦的物体是模糊的，而背景相对清晰。图 4d 显示由于图片中不含足够的上下文信息，高层模型无法做出准确预测。我们还在图 5 中进行了基于错误的特征互补性分析。具体来说，我们计算了高低层模型预测的 MOS 与真实 MOS 之间的绝对误差，并将其展示在四个象限中。我们可以看到，有些例子中一个模型的表现明显优于另一个。这表明，在许多实例中，这些模型表现出互补性。

Limitations: In the low-data setting, the low-level model does not perform as well as the high-level model on in-the-wild datasets. Since the low-level model is more suited to capture varied distortion levels rather than content, synthetic datasets benefit more from this model. Secondly, the high-level model uses fixed prompts and can be further improved through prompt engineering or tuning.
局限性：在低数据环境中，低级模型在野生数据集上的表现不如高级模型。由于低级模型更适合捕捉不同的失真程度而非内容，合成数据集因此从该模型中获益更多。其次，高级模型使用固定提示，通过提示工程或调整可以进一步改善。

5 Concluding Remarks 结论 remarks

We design generalizable low-level and high-level quality representations that enable IQA in a data-efficient setting. Specifically, we learn low-level features using a novel quality-aware contrastive learning strategy that is distortion-agnostic. Secondly, we present a group-contrastive learning framework that learns to elicit semantic-based high-level quality information from images. We show that both sets of representations lead to accurate prediction of quality scores in both the data-efficient and zero-shot settings on diverse datasets. This demonstrates the generalizability of our learned features. Future advances in self-supervised learning and quality-specific prompt engineering could be used to further enhance the generalizability of models for data-efficient NR IQA.
我们设计了具有可推广性的低级和高级质量表示，以实现数据高效的图像质量评估。具体而言，我们使用一种新的质量感知对比学习策略来学习不受失真影响的低级特征。其次，我们提出了一种组对比学习框架，能够从图像中提取基于语义的高级质量信息。我们证明了这两种表示能够在数据高效和零样本条件下对各种数据集进行准确的质量评分预测。这表明我们所学特征的可推广性。未来，自监督学习和特定于质量的提示工程的进步可以进一步增强模型在数据高效无参考图像质量评估中的可推广性。

Acknowledgement: This work was supported in part by Department of Science and Technology, Government of India under grant CRG/2020/003516.
致谢：本研究部分得到了印度政府科技部在资助项目 CRG/2020/003516 下的支持。

References

[1] Shahrukh Athar and Zhou Wang. Degraded reference image quality assessment. IEEE Transactions on Image Processing, 2023.
[2] Nithin C Babu, Vignesh Kannan, and Rajiv Soundararajan. No reference opinion unaware quality assessment of authentically distorted images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2459–2468, 2023.
[3] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing, 27(1):206–219, 2017.
[4] Lei Chen, Le Wu, Zhenzhen Hu, and Meng Wang. Quality-aware unpaired image-to-image translation. IEEE Transactions on Multimedia, 21(10):2664–2674, 2019.
[5] Deepti Ghadiyaram and Alan C Bovik. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing, 25(1):372–387, 2015.
[6] S Alireza Golestaneh, Saba Dadsetan, and Kris M Kitani. No-reference image quality assessment via transformers, relative ranking, and self-consistency. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1220–1230, 2022.
[7] Jinjin Gu, Haoming Cai, Haoyu Chen, Xiaoxing Ye, Jimmy Ren, and Chao Dong. Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In European Conference on Computer Vision (ECCV) 2020, pages 633–651. Springer International Publishing, 2020.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing, 29:4041–4056, 2020.
[10] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
[11] Jongyoo Kim and Sanghoon Lee. Fully deep blind image quality predictor. IEEE Journal of selected topics in signal processing, 11(1):206–220, 2016.
[12] Eric C Larson and Damon M Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006–011006, 2010.
[13] Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. Clip-event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16420–16429, 2022.
[14] Xialei Liu, Joost Van De Weijer, and Andrew D Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. In Proceedings of the IEEE International Conference on Computer Vision, pages 1040–1049, 2017.
[15] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
[16] Pavan C Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C Bovik. Image quality assessment using contrastive learning. IEEE Transactions on Image Processing, 31:4149–4161, 2022.
[17] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
[18] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.
[19] Anush Krishna Moorthy and Alan Conrad Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Transactions on Image Processing, 20(12):3350–3364, 2011.
[20] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[21] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094, 2021.
[22] Guanyi Qin, Runze Hu, Yutao Liu, Xiawu Zheng, Haotian Liu, Xiu Li, and Yan Zhang. Data-efficient image quality assessment with attention-panel decoder. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[24] Michele A Saad, Alan C Bovik, and Christophe Charrier. Blind image quality assessment: A natural scene statistics approach in the dct domain. IEEE Transactions on Image Processing, 21(8):3339–3352, 2012.
[25] Avinab Saha, Sandeep Mishra, and Alan C. Bovik. Re-iqa: Unsupervised learning for image quality assessment in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5846–5855, June 2023.
[26] Hamid R Sheikh, Muhammad F Sabir, and Alan C Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing, 15(11):3440–3451, 2006.
[27] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3667–3676, 2020.
[28] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In AAAI, 2023.
[29] Jing Wang, Haotian Fan, Xiaoxia Hou, Yitian Xu, Tao Li, Xuechao Lu, and Lean Fu. Mstriq: No reference image quality assessment based on swin transformer with multi-stage fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1269–1278, 2022.
[30] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[31] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. Ieee, 2003.
[32] Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. In Computer Vision–ECCV 2022: 17th European Conference, Proceedings, pages 538–554. Springer, 2022.
[33] Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing, 23(2):684–695, 2013.
[34] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.
[35] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE conference on Computer Vision and Pattern Recognition, pages 1098–1105. IEEE, 2012.
[36] Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, and Alan Bovik. From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3575–3585, 2020.
[37] Junyong You and Jari Korhonen. Transformer for image quality assessment. In 2021 IEEE International Conference on Image Processing (ICIP), pages 1389–1393. IEEE, 2021.
[38] Lin Zhang, Lei Zhang, and Alan C Bovik. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing, 24(8):2579–2591, 2015.
[39] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378–2386, 2011.
[40] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[41] Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang. Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology, 30(1):36–47, 2018.
[42] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023.
[43] Yifan Zhang, Bryan Hooi, Dapeng Hu, Jian Liang, and Jiashi Feng. Unleashing the power of contrastive self-supervised visual models via contrast-regularized fine-tuning. Advances in Neural Information Processing Systems, 34:29848–29860, 2021.
[44] Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. Metaiqa: Deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14143–14152, 2020.


(a)	(b)		(c)	(d)
28.97	24.49	Human Opinion Score	75.63	65.15
\hdashline 40.56	25.43	$\text{GRepQ}_{\text{D}}$ (LL) Prediction	62.04	61.29
\hdashline 27.66	39.40	$\text{GRepQ}_{\text{D}}$ (HL) Prediction	72.33	49.85

Learning Generalizable Perceptual Representations for Data-Efficient No-Reference Image Quality Assessment学习可泛化的感知表示以进行数据高效的无参考图像质量评估