License: CC BY 4.0
arXiv:2309.09836v2 [eess.AS] 06 Jun 2024

RECAP: Retrieval-Augmented Audio Captioning

Abstract

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and on captions similar to that audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage the audio-text model CLAP [1] to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition caption generation on the audio. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its ability to exploit a large text-caption-only datastore in a training-free fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho (https://github.com/Sreyan88/RECAP).

Index Terms—  Automated audio captioning, multi-modal learning, retrieval-augmented generation

1 Introduction

Audio captioning is the fundamental task of describing the contents of an audio sample using natural language. Compared to Automatic Speech Recognition (ASR), which transcribes human speech, audio captioning focuses on describing distinct environmental sounds in the input audio [2, 3]. By bridging the gap between text and audio modalities, audio captioning has found various applications in real-world use cases like environment monitoring, gaming, etc. [4].

Fig. 1: We propose RECAP, a retrieval-augmented audio captioning model. RECAP can caption novel concepts never before seen in training and improves the captioning of audio with multiple events.

In the past, most audio captioning models employed an encoder-decoder architecture with an off-the-shelf pre-trained audio encoder and a language decoder [5, 6]. The audio encoder generates an audio embedding sequence that is used to condition the language decoder for caption generation. However, most of these systems do not perform well in cross-domain settings (trained on one domain and tested on another), and every use case might need separate training. We hypothesize that the primary reason behind this phenomenon is the shift in which unique audio events occur when the domain shifts. For example, the AudioCaps benchmark dataset [2] has several audio concepts (e.g., the sound of jazz or an interview) that Clotho, another benchmark dataset, does not. This is also representative of real-world scenarios, where not only do audio concepts change from one domain to another (e.g., environmental sounds in a city versus a forest), but new audio concepts also keep emerging within a domain (e.g., new versions of an online game).

Fig. 2: Illustration of RECAP. RECAP fine-tunes a GPT-2 LM conditioned on audio representations from the last hidden state of CLAP [1] and a text prompt. The text prompt is constructed using captions most similar to the audio, retrieved from a datastore using CLAP.

Main Contributions. We propose RECAP, REtrieval-Augmented Audio CAPtioning, a simple and scalable solution to the aforementioned problem of domain shift. Similar to other audio captioning systems in the literature [5, 6, 7], RECAP is built on an audio encoder and a language decoder (GPT-2 in our setting). However, we introduce three novel changes: (1) Instead of employing an audio encoder pre-trained only on audio, we use CLAP [1] as our audio encoder. CLAP is pre-trained on audio-text pairs to learn the correspondence between audio and text by projecting them into a shared latent space; CLAP hidden-state representations are therefore better suited for captioning due to their enhanced linguistic comprehension. (2) We condition caption generation on the audio by introducing new cross-attention layers between CLAP and GPT-2. (3) Finally, beyond conditioning on the audio, we also condition generation on a custom-constructed prompt during both training and inference. We construct the prompt using the top-k captions most similar to the audio, retrieved from a datastore using CLAP. We provide more details in Section 3.1. RECAP builds on retrieval-augmented generation (RAG) [8], which offers multiple advantages discussed further in Section 3. RECAP is lightweight, fast to train (as we only optimize the cross-attention layers), and can exploit any large text-caption-only datastore in a training-free fashion. We evaluate RECAP on two benchmark datasets, Clotho [3] and AudioCaps [2], and show that while being competitive with the state of the art in in-domain settings, RECAP outperforms all baselines in out-of-domain settings by a large margin. Additionally, RECAP can effectively caption novel audio events never seen during training and can better generate captions for compositional audios with multiple audio events.

2 Related Work

Automated Audio Captioning. Current work in audio captioning primarily employs encoder-decoder models, where a caption is generated by an autoregressive language decoder conditioned on representations obtained from an audio encoder [5, 6, 7]. The language decoder employed is either pre-trained on web-scale data [5, 6, 7] or learned from scratch [9, 10] during fine-tuning. The work closest to ours is [7], where the authors condition a GPT-2 on prompts constructed from retrieved captions. The key difference is that RECAP requires only a text-caption-only datastore, whereas their system requires paired audio and text. We also introduce additional cross-attention layers for audio conditioning. Kim et al. [6], the current state-of-the-art system, proposed prefix tuning for audio captioning, where the authors feed a prefix, i.e., a fixed-size embedding sequence, to GPT-2 for audio captioning. Other works include synthetic data augmentation techniques [11, 12] and training tricks to improve learning on the source training data [13, 14].

Retrieval-augmented Generation. The core idea of retrieval-augmented generation (RAG) is to condition generation on additional data retrieved from an external datastore [8]. RAG has been shown to benefit knowledge-intensive NLP tasks like open-domain question-answering on datasets that require world knowledge and advanced reasoning capabilities [15, 16]. RAG has also proven to be extremely effective in various computer vision tasks, including image captioning [17, 18]. We argue that audio captioning, especially in out-of-domain scenarios, is a knowledge-intensive task as it requires the model to caption novel audio concepts never seen during training, and can benefit from RAG.

Table 1: Evaluation on Clotho. Each method is trained in three different settings and tested on the Clotho test set. For evaluation, we use a datastore with captions from the training set ($\mathcal{DS}$), from AudioCaps ($\mathcal{DS}_{caps}$), or from a large external dataset ($\mathcal{DS}_{large}$).
| Training set | Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE-L | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|---|---|---|---|---|---|
| (1) Clotho | Mei et al. [19] | 0.527 | 0.327 | 0.211 | 0.131 | 0.158 | 0.356 | 0.320 | 0.105 | 0.213 |
| | Gontier et al. [5] | 0.506 | 0.318 | 0.210 | 0.134 | 0.148 | 0.338 | 0.278 | 0.092 | 0.185 |
| | Chen et al. [20] | 0.534 | 0.343 | 0.230 | 0.151 | 0.160 | 0.356 | 0.346 | 0.108 | 0.227 |
| | Xu et al. [10] | 0.556 | 0.363 | 0.242 | 0.159 | 0.169 | 0.368 | 0.377 | 0.115 | 0.246 |
| | Koh et al. [21] | 0.551 | 0.369 | 0.252 | 0.168 | 0.165 | 0.373 | 0.380 | 0.111 | 0.246 |
| | Kim et al. [6] | 0.560 | 0.376 | 0.253 | 0.160 | 0.170 | 0.378 | 0.392 | 0.118 | 0.255 |
| | RECAP (w/ $\mathcal{DS}$) | 0.563 | 0.381 | 0.257 | 0.165 | 0.179 | 0.383 | 0.398 | 0.122 | 0.214 |
| | RECAP (w/ $\mathcal{DS}_{large}$) | 0.582 | 0.384 | 0.257 | 0.166 | 0.177 | 0.395 | 0.411 | 0.125 | 0.224 |
| (2) AudioCaps | Mei et al. [19] | 0.294 | 0.146 | 0.080 | 0.043 | 0.096 | 0.239 | 0.117 | 0.050 | 0.084 |
| | Gontier et al. [5] | 0.309 | 0.146 | 0.071 | 0.034 | 0.098 | 0.233 | 0.112 | 0.046 | 0.079 |
| | Chen et al. [20] | 0.226 | 0.114 | 0.065 | 0.039 | 0.086 | 0.228 | 0.109 | 0.042 | 0.076 |
| | Kim et al. [6] | 0.342 | 0.195 | 0.115 | 0.065 | 0.112 | 0.276 | 0.192 | 0.074 | 0.133 |
| | RECAP (w/ $\mathcal{DS}_{caps}$) | 0.339 | 0.193 | 0.109 | 0.068 | 0.110 | 0.276 | 0.195 | 0.084 | 0.137 |
| | RECAP (w/ $\mathcal{DS}$) | 0.515 | 0.349 | 0.210 | 0.143 | 0.155 | 0.328 | 0.332 | 0.988 | 0.201 |
| | RECAP (w/ $\mathcal{DS}_{large}$) | 0.519 | 0.355 | 0.216 | 0.149 | 0.157 | 0.324 | 0.331 | 1.004 | 0.209 |
| (3) Clotho & AudioCaps | Mei et al. [19] | 0.516 | 0.318 | 0.204 | 0.127 | 0.157 | 0.351 | 0.313 | 0.105 | 0.209 |
| | Gontier et al. [5] | 0.461 | 0.282 | 0.182 | 0.117 | 0.136 | 0.318 | 0.251 | 0.083 | 0.167 |
| | Chen et al. [20] | 0.516 | 0.325 | 0.215 | 0.141 | 0.153 | 0.350 | 0.314 | 0.102 | 0.208 |
| | Kim et al. [6] | 0.539 | 0.346 | 0.227 | 0.142 | 0.159 | 0.366 | 0.319 | 0.111 | 0.215 |
| | RECAP (w/ $\mathcal{DS}$) | 0.547 | 0.361 | 0.238 | 0.149 | 0.167 | 0.379 | 0.322 | 0.116 | 0.222 |
| | RECAP (w/ $\mathcal{DS}_{large}$) | 0.549 | 0.360 | 0.238 | 0.150 | 0.166 | 0.381 | 0.323 | 0.116 | 0.221 |

3 Methodology

Problem Formulation. Given a dataset $\mathcal{D}$ of audio-text pairs $(\mathcal{A}, \mathcal{T})$, where each text caption $t_i \in \mathcal{T}$ corresponding to an audio $a_i \in \mathcal{A}$ describes the content or events of the audio, we aim to train a model $\theta$ to generate $t_i$ from $a_i$. Different from other audio captioning systems, we also assume that the model has access to a datastore $\mathcal{DS}$ of text captions during inference. These captions come from the training set of $\mathcal{D}$ or from external sources, but have no overlap with the validation or test sets of $\mathcal{D}$.
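As a minimal sketch of this setup (the file names and captions below are invented for illustration, not drawn from either dataset), the datastore is simply a collection of text captions kept disjoint from the evaluation splits:

```python
from dataclasses import dataclass

@dataclass
class AudioTextPair:
    audio_path: str   # a_i: the audio clip
    caption: str      # t_i: a natural-language description of a_i

# D: audio-text pairs used to train the captioning model.
train_pairs = [
    AudioTextPair("clotho/train/rain_on_roof.wav", "rain falls steadily on a metal roof"),
    AudioTextPair("clotho/train/dog_park.wav", "several dogs bark while people talk nearby"),
]

# DS: a replaceable, text-only datastore available at inference time. Its captions
# may come from the training split of D or from external sources, but must not
# overlap with the validation/test captions of D.
datastore = [p.caption for p in train_pairs] + ["a jazz band plays while a crowd applauds"]

test_captions = {"birds chirp as a stream flows over rocks"}
assert not test_captions & set(datastore), "datastore must stay disjoint from evaluation captions"
```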

3.1 RECAP

Overall Architecture. The overall architecture of RECAP is quite simple and lightweight. RECAP employs CLAP as the audio encoder and GPT-2 as the auto-regressive language decoder. To generate the caption, the language decoder conditions on the output of the audio encoder and an individually crafted prompt for each audio. We discuss how we construct the prompt in the next subsection.

Table 2: Evaluation on AudioCaps. Each method is trained in three different settings and tested on the AudioCaps test set. For evaluation, we use a datastore with captions from the training set ($\mathcal{DS}$), from Clotho ($\mathcal{DS}_{clotho}$), or from a large external dataset ($\mathcal{DS}_{large}$).
| Training set | Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE-L | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|---|---|---|---|---|---|
| (1) AudioCaps | Mei et al. [19] | 0.647 | 0.488 | 0.356 | 0.252 | 0.222 | 0.468 | 0.679 | 0.160 | 0.420 |
| | Gontier et al. [5] | 0.699 | 0.523 | 0.380 | 0.266 | 0.241 | 0.493 | 0.753 | 0.176 | 0.465 |
| | Chen et al. [20] | 0.550 | 0.385 | 0.264 | 0.178 | 0.173 | 0.390 | 0.443 | 0.117 | 0.280 |
| | Eren et al. [9] | 0.710 | 0.490 | 0.380 | 0.230 | 0.290 | 0.590 | 0.750 | - | - |
| | Kim et al. [6] | 0.713 | 0.552 | 0.421 | 0.309 | 0.240 | 0.503 | 0.733 | 0.177 | 0.455 |
| | RECAP (w/ $\mathcal{DS}$) | 0.721 | 0.559 | 0.428 | 0.316 | 0.252 | 0.521 | 0.750 | 0.183 | 0.472 |
| | RECAP (w/ $\mathcal{DS}_{large}$) | 0.722 | 0.557 | 0.428 | 0.313 | 0.256 | 0.525 | 0.751 | 0.186 | 0.471 |
| (2) Clotho | Mei et al. [19] | 0.415 | 0.219 | 0.121 | 0.063 | 0.134 | 0.303 | 0.149 | 0.066 | 0.107 |
| | Gontier et al. [5] | 0.425 | 0.223 | 0.124 | 0.061 | 0.128 | 0.298 | 0.147 | 0.060 | 0.104 |
| | Chen et al. [20] | 0.365 | 0.170 | 0.091 | 0.048 | 0.110 | 0.273 | 0.083 | 0.049 | 0.066 |
| | Kim et al. [6] | 0.449 | 0.266 | 0.157 | 0.084 | 0.144 | 0.330 | 0.211 | 0.083 | 0.147 |
| | RECAP (w/ $\mathcal{DS}_{clotho}$) | 0.427 | 0.224 | 0.148 | 0.065 | 0.112 | 0.281 | 0.191 | 0.078 | 0.136 |
| | RECAP (w/ $\mathcal{DS}$) | 0.501 | 0.326 | 0.211 | 0.104 | 0.164 | 0.357 | 0.359 | 0.116 | 0.198 |
| | RECAP (w/ $\mathcal{DS}_{large}$) | 0.507 | 0.321 | 0.206 | 0.108 | 0.169 | 0.357 | 0.362 | 0.111 | 0.204 |
| (3) Clotho & AudioCaps | Mei et al. [19] | 0.682 | 0.507 | 0.369 | 0.266 | 0.238 | 0.488 | 0.701 | 0.166 | 0.434 |
| | Gontier et al. [5] | 0.635 | 0.461 | 0.322 | 0.219 | 0.208 | 0.450 | 0.612 | 0.153 | 0.383 |
| | Chen et al. [20] | 0.489 | 0.292 | 0.178 | 0.106 | 0.152 | 0.346 | 0.265 | 0.093 | 0.179 |
| | Kim et al. [6] | 0.708 | 0.547 | 0.402 | 0.283 | 0.238 | 0.499 | 0.710 | 0.167 | 0.438 |
| | RECAP (w/ $\mathcal{DS}$) | 0.728 | 0.563 | 0.425 | 0.317 | 0.252 | 0.529 | 0.764 | 0.187 | 0.469 |
| | RECAP (w/ $\mathcal{DS}_{large}$) | 0.725 | 0.561 | 0.424 | 0.319 | 0.256 | 0.529 | 0.761 | 0.190 | 0.469 |

For audio conditioning, we first pass the audio sample through the CLAP audio encoder and extract the last hidden state $A \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension. This embedding is extracted from the penultimate layer of the CLAP audio encoder, right before the final projection. As the audio embeddings and the decoder operate on different vector spaces, we connect them through randomly initialized cross-attention modules at each decoder layer. To train RECAP, we freeze both GPT-2 and CLAP and only train the cross-attention layers, which reduces the overall compute requirements and training time while retaining the expressivity and generalization capabilities of GPT-2. RECAP performs well even after training only 5.4% of the total parameters because, like other retrieval-augmented models [8, 22, 23], RECAP does not need all information to be stored in its weights, as it has access to external knowledge from a datastore of text. Additionally, CLAP generates an audio embedding that correlates well with its corresponding textual description, thus further lowering training time due to its superior understanding of the audio content.
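As one possible realization of this wiring, the sketch below uses Hugging Face's GPT-2 with `add_cross_attention=True` as a stand-in for a decoder with added, randomly initialized cross-attention layers. The random tensor replacing the CLAP last hidden state, the example prompt, and the choice to compute the loss over all tokens are assumptions made for brevity; this is not the authors' released implementation.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 with cross-attention blocks added at every decoder layer (randomly initialized).
config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=config)

# Freeze the pre-trained weights; only the new cross-attention (and its LayerNorm) trains.
for name, param in decoder.named_parameters():
    param.requires_grad = ("crossattention" in name) or ("ln_cross_attn" in name)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = ("Audios similar to this audio sounds like: rain taps on a window, "
          "thunder rumbles far away. This audio sounds like:")
target = " heavy rain falls while thunder rolls in the distance."
batch = tokenizer(prompt + target, return_tensors="pt")

# Stand-in for the CLAP last hidden state A (n x d), already projected to GPT-2's width.
audio_hidden = torch.randn(1, 32, config.n_embd)

# Cross-entropy over the token sequence; in practice the prompt tokens would
# typically be masked out of the loss (label value -100).
out = decoder(**batch, encoder_hidden_states=audio_hidden, labels=batch["input_ids"])
out.loss.backward()  # gradients reach only the unfrozen cross-attention parameters
```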

Constructing Prompts with Retrieved Captions. Instead of conditioning caption generation on audio features alone, RECAP is also conditioned on a prompt, individually crafted for each audio during training and inference. To construct this prompt, RECAP exploits the CLAP text and audio encoders [1] to retrieve the top-k captions most similar to an audio from a datastore. CLAP encodes audio and text into a shared vector space and has outperformed all prior models on audio-to-text and text-to-audio retrieval, thus making it most suitable for our task. Specifically, for retrieval, we calculate the cosine similarity between the embedding of the current audio $a_i$ and the embeddings of all text captions in the datastore $\mathcal{DS}$, and choose the captions with the highest similarity. Once we have retrieved the top-k similar captions, we construct a prompt in the following manner: "Audios similar to this audio sounds like: caption 1, caption 2, $\cdots$, caption k. This audio sounds like:". For retrieval, we naturally ignore the original caption $t_i$ corresponding to $a_i$. RECAP is then trained using the generic cross-entropy loss between the tokens of the predicted caption $\hat{t_i}$ and the ground-truth caption $t_i$.
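A minimal sketch of this retrieval step, assuming the CLAP audio and text embeddings have already been computed; the `build_prompt` helper, the random vectors, and the toy datastore below are illustrative, not part of the paper's code:

```python
import numpy as np

def build_prompt(audio_emb, caption_embs, captions, k=4, exclude=None):
    """Return the RECAP-style prompt built from the k datastore captions whose
    text embeddings are most cosine-similar to the audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    order = np.argsort(-(c @ a))                      # highest cosine similarity first
    retrieved = [captions[i] for i in order if captions[i] != exclude][:k]
    return ("Audios similar to this audio sounds like: "
            + ", ".join(retrieved) + ". This audio sounds like:")

# Toy usage: random vectors stand in for CLAP audio/text embeddings.
rng = np.random.default_rng(0)
datastore = ["rain hits a tin roof", "a dog barks twice", "thunder rolls in the distance"]
print(build_prompt(rng.normal(size=512), rng.normal(size=(len(datastore), 512)), datastore, k=2))
```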

4 Experiments and Results

Datasets. For training and evaluating RECAP, we use either Clotho [3], AudioCaps [2], or a combination of both. Clotho has 3,839/1,045/1,045 unique audios in its train/dev/test splits, respectively, with five captions for each audio. AudioCaps has 49,838/495/975 audios in its train/dev/test splits, with five captions per audio except in the train set, which has one caption per audio.

Baselines. We compare RECAP with six competitive baselines taken from the literature. Eren et al. [9] and Xu et al. [10] train a Gated Recurrent Unit (GRU) to generate captions, conditioned on audio embeddings extracted from an audio encoder. Chen et al. [20] replace the GRU with a transformer decoder, and Mei et al. [19] train an entire encoder-decoder transformer architecture from scratch. Kim et al. [6] and Gontier et al. [5] use a pre-trained language model: the former employs GPT-2 and the latter employs BART [24].

Experimental Setup. To evaluate RECAP, we conduct experiments in three distinct setups: (1) we train and evaluate the model on the same dataset $\mathcal{D}$; (2) we train the model on a dataset $\mathcal{D}$ and evaluate it on a different dataset $\hat{\mathcal{D}}$; (3) we train the model on a combination of both datasets and evaluate it separately on each individual dataset. For (1), the datastore $\mathcal{DS}$ consists of captions from either the training set of the source dataset $\mathcal{D}$ or a large curated datastore $\mathcal{DS}_{large}$. For (2), we use a datastore with captions from $\mathcal{D}$ ($\mathcal{DS}$), from $\mathcal{DS}_{large}$, or from the other dataset. For (3), we either use a datastore with captions from both datasets or use $\mathcal{DS}_{large}$. We list all the sources of $\mathcal{DS}_{large}$, with over 600,000 text-only captions, on our GitHub. This includes 100,000+ new weakly labeled captions for the AudioSet strong subset and three new captions for each sample in AudioCaps and Clotho. All these captions were generated using GPT-4 and manually corrected by one expert human annotator. For retrieval-based prompt creation, we use k=4 and retrieve only the top 4 captions from the datastore. It is worth noting that RECAP does not use any additional training or data augmentation tricks. For both AudioCaps and Clotho, we train using the Adam optimizer with a learning rate of 5e-5 for 100 epochs and a batch size of 32. We evaluate all our models on the BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and SPIDEr metrics.
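For reference, a minimal sketch of the optimization setup implied by these hyperparameters; `model` and `loader` are assumed to follow the conditioning sketch in Section 3.1 (tokenized prompt plus caption, CLAP audio hidden states, frozen backbones), and the batch field names are placeholders:

```python
import torch

LEARNING_RATE, EPOCHS, BATCH_SIZE, TOP_K = 5e-5, 100, 32, 4  # values reported above

def train(model, loader):
    """Optimize only the parameters left unfrozen (the cross-attention layers);
    the loss is the token-level cross-entropy returned by the decoder."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=LEARNING_RATE)
    for _ in range(EPOCHS):
        for batch in loader:  # tokenized prompt+caption, plus CLAP audio hidden states
            out = model(**batch["text"],
                        encoder_hidden_states=batch["audio_hidden"],
                        labels=batch["labels"])
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```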

Results. Table 1 and Table 2 compare the performance of RECAP against all our baselines on Clotho and AudioCaps, respectively. We train our models in different settings and evaluate them with different datastores. While RECAP shows decent margins of improvement in in-domain settings, it outperforms all baselines by a significant margin in out-of-domain settings when an in-domain datastore is available. Without one, RECAP shows competitive performance with the SOTA [6]. The presence of a larger datastore ($\mathcal{DS}_{large}$) almost always improves performance. This opens up possibilities for improving captioning performance by augmenting the datastore with diverse synthetically generated captions.

Results Analysis. Table 3 compares RECAP with Kim et al. [6] (SOTA) on compositional instances from the Clotho (1) and AudioCaps (4) test sets. While the SOTA model captions only one audio event, RECAP, by conditioning on a prompt constructed from diverse retrieved captions, captures multiple. We also compare a model trained on AudioCaps and evaluated on a Clotho test instance containing an audio event never seen during training (2), and vice versa (3). By being conditioned on in-domain prompts, RECAP can caption these instances effectively.

Table 3: Comparing RECAP in 4 challenging settings.

| # | Ground Truth | SOTA [6] | RECAP |
|---|---|---|---|
| 1 | a engine roars in the background while pieces of metal are being dropped in. | a bell is ringing and a bell rings. | A person is using a chisel to cut wood and a car passes by. |
| 2 | a moving vehicle has some metal container in it clinging against each other. | rain falling on a surface. | Water splashes while a car drives by in the rain. |
| 3 | nature sounds with a frog croaking. | people are talking and laughing with a man speaking in the background. | several vehicles move and a beep goes off. |
| 4 | a vehicle driving as a man and woman are talking and laughing. | a person is talking in the background. | an adult male is speaking, and a motor vehicle engine is running. |

5 Conclusion and Future Work

We present RECAP, a novel audio captioning system based on retrieval-augmented generation. While being competitive with state-of-the-art methods on benchmark datasets, RECAP outperforms the SOTA by a large margin in out-of-domain settings and shows unique capabilities of captioning novel audio events and compositional audios with two or more events. Additionally, RECAP is cheap to train and can exploit a replaceable text-caption-only datastore in a training-free fashion to further push performance. As part of future work, we would like to explore advanced techniques for efficient retrieval and build better audio-text models.

References

  • [1] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE ICASSP 2023.
  • [2] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim, “Audiocaps: Generating captions for audios in the wild,” in ACL 2019, pp. 119–132.
  • [3] Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen, “Clotho: An audio captioning dataset,” in IEEE ICASSP 2020, pp. 736–740.
  • [4] A Sophia Koepke, Andreea-Maria Oncescu, Joao Henriques, Zeynep Akata, and Samuel Albanie, “Audio retrieval with natural language queries: A benchmark study,” IEEE Transactions on Multimedia, 2022.
  • [5] Félix Gontier, Romain Serizel, and Christophe Cerisara, “Automated audio captioning by fine-tuning bart with audioset tags,” in DCASE2021 Challenge, 2021.
  • [6] Minkyu Kim, Kim Sung-Bin, and Tae-Hyun Oh, “Prefix tuning for automated audio captioning,” in IEEE ICASSP 2023, pp. 1–5.
  • [7] Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, and Masahiro Yasuda, “Audio captioning using pre-trained large-scale language model guided by audio-based similar caption retrieval,” arXiv preprint arXiv:2012.07331, 2020.
  • [8] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” NeurIPS 2020, pp. 9459–9474.
  • [9] Ayşegül Özkaya Eren and Mustafa Sert, “Audio captioning based on combined audio and semantic embeddings,” in IEEE International Symposium on Multimedia, 2020.
  • [10] Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, and Kai Yu, “Investigating local and global information for automated audio captioning with transfer learning,” in ICASSP 2021.
  • [11] Wu et al., “Beats-based audio captioning model with instructor embedding supervision and chatgpt mix-up,” Tech. Rep., DCASE2023 Challenge.
  • [12] Marek Kadlčík, Adam Hájek, Jürgen Kieslich, and Radosław Winiecki, “A whisper transformer for audio captioning trained with synthetic captions and transfer learning,” arXiv preprint arXiv:2305.09690, 2023.
  • [13] Haoran Sun, Zhiyong Yan, Yongqing Wang, Heinrich Dinkel, Junbo Zhang, and Yujun Wang, “Leveraging multi-task training and image retrieval with clap for audio captioning,” Tech. Rep., DCASE2023 Challenge.
  • [14] Jaeheon Sim, Eungbeom Kim, and Kyogu Lee, “Label-refined sequential training with noisy data for automated audio captioning,” Tech. Rep., DCASE2023 Challenge.
  • [15] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang, “Semantic parsing on freebase from question-answer pairs,” in EMNLP 2013, pp. 1533–1544.
  • [16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al., “Natural questions: a benchmark for question answering research,” TACL 2019, pp. 453–466.
  • [17] Rita Ramos, Desmond Elliott, and Bruno Martins, “Retrieval-augmented image captioning,” arXiv preprint arXiv:2302.08268, 2023.
  • [18] Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Kementchedjhieva, “Smallcap: Lightweight image captioning prompted with retrieval augmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 2840–2849.
  • [19] Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, and Wenwu Wang, “Audio captioning transformer,” in DCASE2021 Challenge.
  • [20] Kun Chen, Yusong Wu, Ziyue Wang, Xuan Zhang, Fudong Nian, Shengchen Li, and Xi Shao, “Audio captioning based on transformer and pre-trained cnn.,” in DCASE, 2020, pp. 21–25.
  • [21] Andrew Koh, Xue Fuzhao, and Chng Eng Siong, “Automated audio captioning using transfer learning and reconstruction latent space similarity regularization,” in ICASSP 2022.
  • [22] Izacard et al., “Few-shot learning with retrieval augmented language models,” arXiv preprint arXiv:2208.03299, 2022.
  • [23] Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu, “A survey on retrieval-augmented text generation,” arXiv preprint arXiv:2202.01110, 2022.
  • [24] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in ACL 2020, pp. 7871–7880.