Introduction
Spatial acoustic representation learning aims to extract a low-dimensional representation of the spatial propagation characteristics of sound from microphone recordings. It can be used in a variety of audio tasks, such as estimating the acoustic parameters [1] or the geometrical information [2] related to the source, microphone and environment. Related audio tasks have been widely applied in augmented reality [3] and hearing aids [4], where generating perceptually acceptable sound for the target environment is required to guarantee a good immersive experience, and also in intelligent robots [5], where perceiving the surrounding acoustic properties serves as prior knowledge for robot interaction with humans and environments.
Room impulse responses (RIRs) characterize the sound propagation from a sound source to a microphone in a room environment. The physically relevant parameters that determine an RIR include the positions of the sound source and microphone array, the room geometry, and the absorption coefficients of the walls. An RIR is composed of direct-path propagation, early reflections, and late reverberation. The relative position between the source and the microphone array determines the direct-path propagation. All three physical parameters affect the arrival times and the strengths of reflective pulses, including early reflections and late reverberation. Some acoustic parameters of the spatial environment can be directly estimated from the RIR without supervision, like position-dependent parameters including the time difference of arrival (TDOA), the direct-to-reverberant ratio (DRR) and the clarity index ($C_{50}$).
With the development of deep learning techniques, many works directly estimate spatial acoustic parameters from microphone signals in a supervised manner. These supervised works have achieved superior performance to conventional methods, owing to the strong modeling ability of deep neural networks. Since these works are data-driven, the diversity and quantity of training data are crucial to their performance. To this end, DNN models are usually trained with abundant and diverse simulated data, and then transferred to real-world data. Some researchers have shown that the trained model does not perform well when directly transferred to real-world data [11], due to the mismatch between simulated and real-world RIRs. In [11], the mismatch is analyzed mainly in terms of the directivity of the source and microphone, and the wall absorption coefficient. Moreover, there are many other aspects of mismatch, for example: 1) The acoustic response of real-world moving sources [12] and the spatial correlation of real-world multi-channel noise are difficult to simulate. 2) RIR simulators [13] usually generate empty box-shaped rooms, while obstacles and non-regularly-shaped rooms exist in real-world applications. As an alternative solution, real-world data can also be used for training. However, annotating the acoustic environment and the sound propagation paths would be very difficult and expensive. Existing annotated real-world datasets lack diversity and quantity, which limits the development of supervised learning methods. Therefore, it is necessary to study how to mine spatial acoustic information from unlabeled real-world data.
In this work, we investigate how to learn a universal spatial acoustic representation from unlabeled dual-channel microphone signals based on self-supervised learning. Microphone signals can be formulated as the convolution of dry source signals with multi-channel RIRs, plus additive noise signals. Spatial acoustic representation learning focuses on extracting RIR-related but source-independent embeddings. As far as we know, this is the first work to study self-supervised learning of spatial acoustic representation. The proposed method makes the following contributions.
1) Self-Supervised Learning of Spatial Acoustic Representation (SSL-SAR)
The proposed method follows the basic pipeline of self-supervised learning, namely first pre-training using abundant unlabeled data and then fine-tuning using a small labeled dataset for downstream tasks. A new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed for self-supervised learning of a universal spatial acoustic representation. This work is implemented in the short-time Fourier transform (STFT) domain. Given the dual-channel microphone signals as input, we randomly mask a portion of STFT frames of one microphone channel and ask the neural network to reconstruct them. The reconstruction of the masked STFT frames requires both the spatial acoustic information related to RIRs and the spectral pattern information indicating source signal content. Accordingly, the network is forced to learn the (inter-channel) spatial acoustic information from the frames that are not masked for both microphone channels, and meanwhile extract the spectral information from corresponding frames of the unmasked channel. In order to disentangle the two kinds of information, the input STFT coefficients are separately masked and fed to two different encoders. The spatial and spectral representations are concatenated along the embedding dimension and then passed to a decoder to reconstruct the masked STFT frames. The pre-trained spatial encoder can provide useful information to various spatial acoustics-related downstream tasks. Note that this work only considers static acoustic scenarios where RIRs are time-invariant.
2) Multi-Channel Audio Conformer (MC-Conformer)
Since the network is first pre-trained on the pretext task and then adopted by various downstream tasks, we need a powerful network that is suitable for both pretext and downstream tasks. To this end, a novel MC-Conformer is adopted as the encoder model. To fully learn the local and global properties of spatial acoustics exhibited in the time-frequency (TF) domain, it is designed following a local-to-global processing pipeline. The local processing model applies 2D convolutional layers to the raw dual-channel STFT coefficients to learn the relationship between microphone signals and RIRs. It captures the short-term and sub-band spatial acoustic information. The global processing model uses Conformer [14] blocks to mainly learn the full-band and long-term relationships of spatial acoustics. The feed-forward modules of the Conformer can model the full-band correlations of RIRs, namely the wrapped-linear correlation between the frequency-wise inter-channel phase differences (IPDs) and the time-domain TDOA for the direct path and early reflections. Considering that RIRs are time-invariant over the entire signal, the multi-head self-attention module is used to model this long-term temporal dependence. Though the combination of 2D CNN and Conformer has been used in other tasks such as sound event localization and detection [15], in this work, we investigate its use for spatial acoustic representation learning and spatial acoustic parameter estimation.
The rest of this paper is organized as follows. Section II overviews the related works in the literature. Section III formulates the spatial acoustic representation learning problem. Section IV details the proposed self-supervised spatial acoustic representation learning method. Experiments and discussions with simulated and real-world data are presented in Section V, and conclusions are drawn in Section VI.
Related Works
A. Deep-Learning-Based Spatial Acoustic Parameter Estimation
Spatial acoustic parameter estimation can provide important acoustic information about the environment. Many deep-learning-based methods have been developed for related tasks [6], [10], [11], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], which are summarized in Table I. The estimation of spatial-location-related parameters like TDOA and direction of arrival (DOA) requires multi-channel microphone signals as input [16], [17], [18]. The other spatial acoustic parameters, such as DRR, $T_{60}$, $C_{50}$ and absorption coefficients, can be predicted from both single-channel [6], [10], [19], [20], [21], [22], [23], [24], [25], [27], [28] and multi-channel microphone signals [11], [26].
TABLE I: Summary of deep-learning-based spatial acoustic parameter estimation methods
Most existing works train networks with labeled data for a single acoustic parameter or for multiple acoustic parameters jointly, which implicitly learn task-oriented spatial acoustic information in a fully supervised manner. The work [10] extracts a universal representation of the acoustic environment with a contrastive learning method. However, it requires the RIR annotation to obtain the positive and negative sample pairs, so it is still a supervised learning method. In contrast to these works, we aim to design a self-supervised method to learn a universal spatial acoustic representation that can be applied to various spatial-acoustics-related downstream tasks. Self-supervised learning does not require any data annotation, which allows spatial acoustic information to be mined intensively from large-scale unlabeled audio data, especially from real-world recordings.
B. Audio Self-Supervised Representation Learning
Self-supervised representation learning [29] has been successfully applied to audio/speech processing in recent years [30], and has shown effectiveness in a wide range of downstream applications such as automatic speech recognition and sound event classification. According to how the pretext task is built, it can be grouped into two categories, namely contrastive approaches and generative approaches. Contrastive approaches aim to learn a latent representation space that pulls positive samples together and, in some methods, pushes negative samples away from positive samples. Typical methods include contrastive predictive coding [31], wav2vec [32], COLA [33], BYOL-Audio [34], etc. Generative approaches learn representations by generating or reconstructing the input audio data from limited views. Autoregressive predictive coding [35], [36] predicts future inputs from past inputs with an unsupervised autoregressive neural model to learn a generic speech representation. Inspired by the masked language model task of BERT [37], some researchers propose to learn general-purpose audio representations by reconstructing masked patches from unmasked TF regions using Transformers [38], [39].
These methods learn the representation of the sound source from single-channel signals, and remove the influence of channel effects such as the response introduced by propagation paths. This kind of representation can be applied to a number of signal-content-related downstream tasks. In contrast, this work aims to learn the representation of spatial acoustic information and remove the information of the sound source, which will be used in spatial-acoustics-related downstream applications. Though some unsupervised/self-supervised methods have been proposed for multi-channel signal processing [40], [41], their self-supervised pretext tasks are different from ours. We aim to learn a general spatial acoustic representation, while they are designed for specific tasks and their potential to learn spatial acoustic information is unknown. Different from the existing masking-reconstruction pretext tasks [38], [39], which encourage learning inner-channel information, the proposed pretext task, i.e. cross-channel signal reconstruction, intends to learn both inner-channel and inter-channel information.
Problem Formulation
We consider the case that one static sound source is observed by two static microphones in an enclosed room environment with additive ambient noise. The signal captured by the $m$-th microphone ($m \in \{1, 2\}$) can be written as
\begin{equation*}
x_{m}(t) = h_{m}(t) * s(t) + n_{m}(t), \tag{1}
\end{equation*}
where $s(t)$ denotes the dry source signal, $h_{m}(t)$ the RIR from the source to the $m$-th microphone, $n_{m}(t)$ the ambient noise signal, and $*$ linear convolution.
Considering that the sound propagation and the reflections from obstacles are frequency-dependent, we convert the time-domain signal model in (1) into the STFT domain using the convolutive transfer function (CTF) approximation [42]:
\begin{equation*}
X_{m}(k,f) \approx \sum_{l} H_{m}(l,f)\, S(k-l,f) + N_{m}(k,f), \tag{2}
\end{equation*}
where $k$ and $f$ denote the time-frame and frequency indices, $X_{m}$, $S$ and $N_{m}$ are the STFT coefficients of the microphone, source and noise signals, respectively, and $H_{m}(l,f)$ is the CTF of the $m$-th channel.
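As a concrete illustration of this signal model, the sketch below synthesizes dual-channel microphone signals with toy exponentially decaying RIRs and converts them to the STFT domain. The toy RIR generator and all parameter values are illustrative assumptions rather than the simulation setup used in the experiments.

```python
# A minimal sketch of the signal model in (1) and its STFT-domain
# counterpart in (2). The toy RIRs are stand-ins for simulated/measured ones.
import numpy as np
from scipy.signal import fftconvolve, stft

fs = 16000                                   # sampling rate (Hz)
s = np.random.randn(fs * 4)                  # 4-second dry source signal
rir_len = int(0.5 * fs)                      # 0.5-second RIRs

def toy_rir(delay_samples, t60=0.6):
    """Exponentially decaying noise tail after a direct-path delay."""
    h = np.zeros(rir_len)
    decay = np.exp(-6.9 * np.arange(rir_len - delay_samples) / (t60 * fs))
    h[delay_samples:] = np.random.randn(rir_len - delay_samples) * decay
    h[delay_samples] = 1.0                   # direct-path pulse
    return h

x = []
for delay in [40, 43]:                       # 3-sample TDOA between channels
    h_m = toy_rir(delay)
    n_m = 0.01 * np.random.randn(len(s) + rir_len - 1)
    x.append(fftconvolve(s, h_m) + n_m)      # x_m(t) = h_m(t) * s(t) + n_m(t)

# 32 ms window and 16 ms shift at 16 kHz, as in the experimental setup.
f, t, X = stft(np.stack(x), fs=fs, nperseg=512, noverlap=256)
print(X.shape)                               # (2 channels, freqs, frames)
```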
The spatial acoustic information is encoded in the dual-channel RIRs/CTFs, and is independent of the source signal and ambient noise. As illustrated in Fig. 1, this work aims to design a self-supervised method to learn a universal spatial acoustic representation related to the RIRs/CTFs from unlabeled dual-channel microphone signals. The representation can be used to estimate spatial acoustic parameters including TDOA, DRR, $T_{60}$, $C_{50}$ and absorption coefficients.
Fig. 1. Illustration of self-supervised learning of spatial acoustic representation using multi-channel microphone recordings. The direct path, early reflections and late reverberation are illustrated in red, green and blue colors, respectively.
Self-Supervised Learning of Spatial Acoustic Representation
The proposed spatial acoustic representation learning method follows the basic pipeline of most self-supervised learning methods, namely first pre-training the representation model according to the pretext task using a large amount of unlabeled data, and then fine-tuning the pre-trained model for a specific downstream task using a small amount of labeled data. The key points of this work lie in how to build the pretext task to learn spatial acoustic information (see details in Section IV-A), and how to design a unified network architecture suited for both pretext task and downstream tasks (see details in Section IV-C). The block diagram of the proposed method is shown in Fig. 2.
Fig. 2. Block diagram of the proposed self-supervised spatial acoustic representation learning model. The complex-valued STFT coefficients are illustrated by their real-part spectrograms.
A. Pretext Task: Cross-Channel Signal Reconstruction
A cross-channel signal reconstruction (CCSR) pretext task is built to learn the spatial acoustic information. As illustrated in Fig. 2, the basic idea is to mask a portion of STFT frames of one microphone channel to destroy corresponding spectral and spatial information, and then ask the network to reconstruct them. The expected function of the reconstruction network lies in three aspects:
1) Learning the spectral patterns of masked frames from the unmasked channel. Source signals have unique spectral patterns indicating the signal content. Since the signals received at the two microphones have the same spectral information, we only mask one channel to preserve the corresponding signal content, and the network can learn the spectral information of the sound source from the unmasked channel. The information learned from the unmasked channel may also include some RIR/CTF information of that channel.

2) Learning the spatial acoustics from the dual-channel unmasked frames. To reconstruct the masked STFT frames, the network needs to learn the (inter-channel) acoustic information from the dual-channel unmasked frames, and apply it to the information learned from the unmasked channel. The (inter-channel) spatial acoustic information relates to the RIR/CTF of the masked channel, and more likely to the relative RIR/CTF of the masked channel with respect to the unmasked channel. In the representation of the relative RIR/CTF, the relative information can be directly used for inter-channel downstream tasks, such as TDOA estimation. Moreover, it is expected that the temporal structure of the RIR/CTF (or a variant of it) is also preserved, from which the information used for temporal-structure-related downstream tasks, such as $T_{60}$ estimation, can be extracted by DNN mapping. These assumptions will be validated through experiments, in which the learned spatial acoustic representation is shown to be effective for a variety of downstream tasks.

3) Reconstructing the masked frames using the learned spectral and spatial information.
The proposed cross-channel signal reconstruction pretext task can disentangle source spectral information and spatial information, which facilitates the application of spatial acoustic representation in downstream tasks.
1) Reconstruction Framework
A portion of the STFT frames of one microphone channel is randomly masked, while the other channel is left intact.
2) Encoder-Decoder Structure
The model of the pretext task adopts an encoder-decoder structure, as shown in Fig. 2. Considering the characteristics and heterogeneity of spatial acoustics and signal content, the input STFT coefficients are first masked in two different ways (see details in Section IV-A-3), then fed into the spatial and spectral encoders to separately learn the two kinds of information. Both encoders adopt the MC-Conformer (see details in Section IV-C) but with different configurations (see details in Section V-A-4). The complex-valued STFT coefficients have a dimension of time frames × frequencies × channels.
Fig. 3. Model architecture of (a) the spatial/spectral encoder (namely the multi-channel audio Conformer), (b) the decoder and (c) the convolution block in the encoder.
3) Masking Scheme
The masked signal is further masked in two different ways for the two encoders.
The spectral encoder is used to learn the signal content information indicated by the inner-channel information. To make the spectral encoder learn only the signal content, we input a single-channel signal to it. Specifically, the inverse single-channel mask is additionally applied to the signals of the other microphone, so that the spectral encoder sees each time frame in only one of the two channels and thus cannot exploit inter-channel cues.
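The following NumPy sketch gives one possible rendering of the two masking operations, assuming a binary frame mask (1 = keep) that zeroes the selected frames and taking channel 0 as the masked channel; the function name and shapes are illustrative.

```python
# A sketch of the CCSR masking scheme under the assumptions stated above.
# The spatial-encoder input keeps both channels (with channel 0 partially
# masked); the spectral-encoder input sees exactly one channel per frame.
import numpy as np

def ccsr_masks(X, mask_rate=0.5, rng=None):
    """X: complex STFT coefficients, shape (2, F, K)."""
    if rng is None:
        rng = np.random.default_rng()
    _, F, K = X.shape
    mask = (rng.random(K) >= mask_rate).astype(X.real.dtype)  # 1 = keep

    X_spatial = X.copy()
    X_spatial[0] *= mask                     # zero masked frames of channel 0

    X_spectral = np.zeros_like(X)
    X_spectral[0] = X[0] * mask              # channel 0 on unmasked frames
    X_spectral[1] = X[1] * (1.0 - mask)      # channel 1 on masked frames

    target = X[0] * (1.0 - mask)             # frames the decoder must predict
    return X_spatial, X_spectral, target, mask
```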
B. Downstream Tasks: Spatial Acoustic Parameter Estimation
Fig. 2 also shows how the pre-trained model is used in downstream tasks. The dual-channel STFT coefficients, without masking, are fed into the pre-trained spatial encoder, and the resulting representation is averaged over time by a mean pooling layer and mapped to the target acoustic parameter by a linear head.
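A minimal PyTorch sketch of this downstream pipeline is given below; the (batch, time, dim) feature layout, the class name and the dimensions are illustrative assumptions.

```python
# A sketch of the downstream model: pre-trained spatial encoder, mean
# pooling over time frames, and a linear head for parameter regression.
import torch
import torch.nn as nn

class DownstreamModel(nn.Module):
    def __init__(self, spatial_encoder, embed_dim, num_outputs=1):
        super().__init__()
        self.encoder = spatial_encoder       # initialized from pre-training
        self.head = nn.Linear(embed_dim, num_outputs)

    def forward(self, feats):                # feats: (batch, time, embed_dim)
        h = self.encoder(feats)
        h = h.mean(dim=1)                    # mean pooling over time frames
        return self.head(h)                  # e.g. scalar TDOA / DRR / T60

# Fine-tuning uses the MSE loss between predictions and ground truths:
# loss = nn.functional.mse_loss(model(feats), labels)
```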
We consider estimating the following spatial acoustic parameters as downstream tasks.
TDOA is an important feature for sound source localization. It is defined as the relative time delay, namely $\Delta t$ in seconds, when the sound emitted by the source arrives at the two microphones. This work estimates TDOA in samples, namely $\Delta t f_{\mathrm{s}}$, where $f_{\mathrm{s}}$ is the sampling rate of signals.

DRR is defined as the energy ratio of the direct-path part to the rest of the RIR [1], i.e.,
\begin{align*}
\mathrm{DRR}_{m} = 10\log_{10} \frac{\sum_{t=t_\mathrm{d}-\Delta t_\mathrm{d}}^{t_\mathrm{d}+\Delta t_\mathrm{d}} h^{2}_{m}(t)}{\sum_{t=0}^{t_\mathrm{d}-\Delta t_\mathrm{d}} h^{2}_{m}(t) + \sum_{t=t_\mathrm{d}+\Delta t_\mathrm{d}}^{\infty} h^{2}_{m}(t)}, \tag{9}
\end{align*}
where the direct-path signal arrives at the $t_\mathrm{d}$-th sample, and $\Delta t_\mathrm{d}$ is the additional sample spread for the direct-path signal, which typically corresponds to 2.5 ms [1].

$T_{60}$ is defined as the time when the sound energy decays by 60 dB after the source is switched off. It can be computed from the energy decay curve of the RIR.

$C_{50}$ measures the energy ratio between early reflections and late reverberation. It can be obtained from the RIR as [43]
\begin{equation*}
C_{50\,m} = 10\log_{10} \frac{\sum_{t=0}^{t_\mathrm{d}+t_{50}} h^{2}_{m}(t)}{\sum_{t=t_\mathrm{d}+t_{50}}^{\infty} h^{2}_{m}(t)}, \tag{10}
\end{equation*}
where $t_{50}$ is the number of samples for 50 ms.

The surface-area-weighted mean absorption coefficient is computed as [6], [26]
\begin{equation*}
\bar{\alpha} = \frac{\sum_{i=1}^{I} S_{i}\alpha_{i}}{\sum_{i=1}^{I} S_{i}}, \tag{11}
\end{equation*}
where $I$ is the number of room surfaces, and $S_{i}$ and $\alpha_{i}$ represent the surface area and the absorption coefficient of the $i$-th surface, respectively.
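As a concrete illustration, the sketch below computes DRR, $C_{50}$ and a $T_{60}$ estimate from a single-channel RIR following (9), (10) and the energy-decay-curve definition; taking the direct-path index as the RIR's absolute peak, the simplified boundary handling and the T20-based decay fit are our assumptions.

```python
# A sketch computing DRR (9), C50 (10) and T60 from a single-channel RIR h.
import numpy as np

def acoustic_params(h, fs=16000):
    e = h ** 2
    td = int(np.argmax(np.abs(h)))            # assumed direct-path sample
    dtd = int(0.0025 * fs)                    # 2.5 ms spread, as in (9)
    t50 = int(0.050 * fs)                     # 50 ms, as in (10)

    direct = e[max(td - dtd, 0):td + dtd + 1].sum()
    rest = e[:max(td - dtd, 0)].sum() + e[td + dtd + 1:].sum()
    drr = 10 * np.log10(direct / rest)        # boundary samples approximated

    early = e[:td + t50].sum()
    late = e[td + t50:].sum()
    c50 = 10 * np.log10(early / late)

    # T60 from the Schroeder energy decay curve: fit the -5 to -25 dB range
    # and extrapolate to -60 dB (a common T20-based estimate).
    edc = 10 * np.log10(np.cumsum(e[::-1])[::-1] / e.sum())
    i5, i25 = np.argmax(edc <= -5), np.argmax(edc <= -25)
    slope = (edc[i25] - edc[i5]) / ((i25 - i5) / fs)   # dB per second
    t60 = -60.0 / slope
    return drr, c50, t60
```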
These spatial acoustic parameters are all continuous values, so we treat spatial acoustic parameter estimation as regression problems. The MSE loss between the predictions and the ground truths is used to train the downstream model.
C. Encoder Model: Multi-Channel Audio Conformer
The model architecture of the spatial encoder is required to be suitable for both the pretext task and downstream tasks. Existing spatial acoustic parameter estimation works commonly adopt CNN and recurrent neural network (RNN) architectures [6], [10], [11], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], in order to leverage the local information modeling ability of convolutional layers and the long-term temporal context learning ability of recurrent layers. Transformer and Conformer are two widely used architectures for self-supervised learning of audio spectrogram [38], [39], [44], [45]. The Transformer architecture [46], known for its ability to capture longer-term temporal context, has outperformed RNN in various audio signal processing tasks [47]. The Conformer [14] architecture, which incorporates convolutional layers into the Transformer block, has also shown effectiveness in many speech processing tasks [48], [49]. Therefore, we utilize a combination of CNN and Conformer architectures to construct the encoders in our work.
The spatial acoustic information exhibits discriminative properties in local and global TF regions, where local means short-time and sub-band, and global means long-time and full-band. Local characteristics: The CTF model naturally involves local convolution operations along the time axis [42]. In addition, smoothing over time frames and frequencies is necessary for estimating the statistics of random signals, such as the spatial correlation matrix, which is helpful for the estimation of acoustic parameters [50], [51], [52]. These characteristics motivate us to use time-frequency 2D convolutional layers to capture the local information. Global characteristics: The spatial acoustic information exhibits certain long-time and full-band properties. This work considers the case that the sound source and microphone array are static, and hence the RIRs remain time-invariant throughout the entire signal. The directional pulses, involving both the direct path and early reflections, are highly correlated across frequencies. For each propagation path, the frequency-wise IPD increases wrapped-linearly with frequency [17], [53], and the slope corresponds to the TDOA of this path. These observations motivate the use of fully connected layers (across all frequencies) to capture the full-band linear dependence, and a self-attention scheme to learn the (time-invariant) temporal dependence of spatial acoustic information.
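The wrapped-linear IPD property can be checked with a few lines of NumPy; the 3-sample delay below is an arbitrary illustrative choice.

```python
# For a pure delay of tau seconds between two channels, the inter-channel
# phase difference (IPD) is 2*pi*f*tau wrapped to (-pi, pi], i.e.
# wrapped-linear in frequency with a slope given by the TDOA.
import numpy as np

fs, n = 16000, 512
tau = 3 / fs                                  # 3-sample TDOA
freqs = np.fft.rfftfreq(n, d=1 / fs)

s = np.random.randn(n)
x1 = np.fft.rfft(s)
x2 = x1 * np.exp(-2j * np.pi * freqs * tau)   # frequency-domain delay

ipd = np.angle(x1 * np.conj(x2))              # wrapped IPD per frequency
expected = 2 * np.pi * freqs * tau            # unwrapped linear phase
assert np.allclose(np.exp(1j * ipd), np.exp(1j * expected))
```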
Based on the considerations mentioned above, we design an MC-Conformer as our spatial encoder. Since CNN and Conformer architectures are also widely used for spectral pattern learning, we use the MC-Conformer as our spectral encoder as well. As shown in Fig. 3(a), the MC-Conformer follows a local-to-global processing pipeline. The local processing model utilizes time-frequency 2D convolutional layers to extract short-term and sub-band information. The global processing model uses Conformer blocks to mainly capture long-term and full-band information. We adopt the sandwich-structured Conformer block presented in [14], which stacks a half-step feed-forward module, a multi-head self-attention module, a 1D convolution module, a second half-step feed-forward module and a layernorm layer. The feed-forward modules are applied to the time-wise full-band features, which can capture the full-band frequency dependence. The 1D convolution module and the multi-head self-attention module mainly learn the short-term and long-term temporal dependencies, respectively.
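Below is a minimal PyTorch sketch of this local-to-global pipeline. It uses torchaudio's Conformer as a stand-in for the sandwich-structured blocks of [14], stacks the real and imaginary parts of the dual-channel STFT as four input channels, and uses placeholder layer sizes; only the number of attention heads (4), the convolution kernel size (31) and the feed-forward expansion factor (4) follow the configuration in Section V-A-4.

```python
# A sketch of the local-to-global MC-Conformer pipeline: 2D convolutions
# over the raw dual-channel STFT coefficients, followed by Conformer blocks
# over time-wise full-band features.
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class MCConformer(nn.Module):
    def __init__(self, n_freq=256, embed_dim=256, n_blocks=4):
        super().__init__()
        self.local = nn.Sequential(           # short-term, sub-band patterns
            nn.Conv2d(4, 32, kernel_size=3, padding=1),  # 4 = 2 ch x (re, im)
            nn.ReLU(),
            nn.Conv2d(32, 8, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(8 * n_freq, embed_dim)
        self.globl = Conformer(input_dim=embed_dim, num_heads=4,
                               ffn_dim=4 * embed_dim, num_layers=n_blocks,
                               depthwise_conv_kernel_size=31)

    def forward(self, x):                     # x: (batch, 4, n_freq, frames)
        h = self.local(x)                     # (batch, 8, n_freq, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, frames, 8 * n_freq)
        h = self.proj(h)                      # time-wise full-band features
        lengths = torch.full((x.shape[0],), h.shape[1], device=x.device)
        h, _ = self.globl(h, lengths)         # long-term, full-band modeling
        return h                              # (batch, frames, embed_dim)
```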
Experiments and Discussions
In this section, we conduct experiments on both simulated data and real-world data to evaluate the effectiveness of the proposed method. We first describe the experimental datasets and configurations, and then present extensive experimental results and discussions.
A. Experimental Setup
1) Simulated Dataset
A large number of rectangular rooms are simulated using an implementation of the image method [54] provided by the gpuRIR toolbox [13]. The size of the simulated rooms ranges from
Note that the proposed pretext task, i.e. cross-channel signal reconstruction, is an ill-posed problem for random noise, as the random samples of noise are unpredictable. Our preliminary experiments demonstrated that more noise leads to a larger reconstruction error and a decreased capability of spatial acoustic learning. In this experiment, we set the SNR range to 15-30 dB, which does not introduce much adverse effect. Meanwhile, this SNR condition (namely larger than 15 dB) can be satisfied in many real-world data-recording scenes.
We randomly generate 512,000 training signals for the pretext task, in order to imitate real-world unlabeled training data of sufficient quantity and diversity. The validation set and the test set of the pretext task contain 5,120 signals each.
As for downstream tasks, we generate a series of new room conditions in terms of room size, reverberation time, absorption coefficient and source-microphone position, and 50 RIRs are randomly generated for each room condition. These rooms are divided without overlap to obtain the training, validation and test sets. To imitate a small amount of labeled data with limited diversity used for downstream tasks, and to evaluate the influence of data diversity, the number of training (fine-tuning) rooms is set to 2, 4, 8, 16, 32, 64, 128 and 256, respectively. When the number of training rooms is too small, the results vary a lot from trial to trial (different trials use different training rooms). Therefore, we conduct 16, 8, 4, 2, 1, 1, 1 and 1 trials for these settings of the training room number, respectively, and the results averaged over trials are reported. The numbers of validation rooms and test rooms are both 20. Each RIR is convolved with two, one and four different source signals for training, validation and test, respectively. Accordingly, the numbers of signals for each training, validation and test room are 100, 50 and 200, respectively.
2) Real-World Datasets
We collect 13 public real-world multi-channel datasets. Among them, MIR [56], MeshRIR [57], DCASE [58], dEchorate [59], BUTReverb [60] and ACE [1] provide real-measured multi-channel RIRs. The microphone signals are created by convolving the real-measured RIRs with source signals from WSJ0, and then adding noise with an SNR ranging from 15 dB to 30 dB when noise signals are provided by the corresponding dataset. LOCATA [61], MC-WSJ-AV [62], LibriCSS [63], AMIMeeting [64], AISHELL-4 [65], AliMeeting [66] and RealMAN [67] provide real-recorded multi-channel speech signals. From the original multi-channel audio recordings, all two-channel sub-arrays with an aperture in [3 cm, 20 cm] are selected. We only use the data of a single static speaker in our experiments. Table II summarizes the settings of the selected data from the collected datasets. There are a total of 111 rooms and more than 40k RIR settings (in terms of room condition, source position and array position).
TABLE II: Settings of the data we selected from the public real-world multi-channel datasets
For pre-training, we use all the collected real-world datasets. We generate 512,000, 4,000 and 4,000 signals for training, validation and test, respectively. The importance weight of each dataset for data generation is set according to the number of rooms and the duration of speech recordings in the dataset. As for downstream tasks, we use the LOCATA dataset for TDOA estimation, and the ACE dataset for the other tasks, including DRR, $T_{60}$ and $C_{50}$ estimation.
3) Parameter Settings
The sampling rate of signals is 16 kHz. The STFT is performed with a window length of 32 ms and a frame shift of 16 ms. The number of frequencies is 256, and each input signal comprises 256 STFT frames.
4) Model Configurations
In the encoder, the setting of the convolution block is shown in Fig. 3. In Conformer blocks, the number of attention heads is 4, the kernel size of convolutional layers is 31, and the expansion factor of feed-forward layers is 4. For the spectral encoder, we use one Conformer block and the embedding dimension
5) Training Details
For self-supervised pre-training, the model is trained from scratch using simulated data in the simulated-data experiments, while in the real-data experiments, the model is initialized with the pre-trained model on simulated data and then trained using the real-world data. We found that the real-world training data (collected from 41 rooms) are not quite sufficient for pre-training, and initializing the model with the pre-trained model of simulated data is helpful for mitigating this problem. We use the Adam optimizer with an initial learning rate 0.001 and a cosine-decay learning rate scheduler. The batch size is set to 128. The maximum number of training epochs is 30. The best model is the one with the minimum validation loss.
For downstream tasks, the pre-trained spatial encoder is fine-tuned using labeled data. The Adam optimizer is used for fine-tuning. The batch size is set to 8 for experiments on simulated data and 16 for experiments on real-world data. Fine-tuning the model with a small amount of labeled data is difficult and unstable in general [37], so we have carefully designed the fine-tuning scheme. The validation loss is recursively smoothed along the training epochs to reduce its fluctuations. The initial learning rate is divided by 10 when the smoothed validation loss does not descend with a patience of 10 epochs, and then the training is stopped when the smoothed validation loss does not decrease for another 10 epochs. For each task, we search the initial learning rate that achieves the smallest smoothed validation loss. The search range of the learning rate is [5e-5, 1e-4, 5e-4, 1e-3] for experiments on simulated data and [1e-4, 1e-3] for experiments on real-world data. We ensemble the models of the best epoch and its previous four epochs as the final model.
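The following sketch summarizes this fine-tuning schedule; the exponential smoothing factor of 0.9 is an assumption, and train_one_epoch and evaluate are hypothetical user-supplied callables.

```python
# A sketch of the fine-tuning schedule: the validation loss is recursively
# smoothed, the learning rate is divided by 10 when the smoothed loss stalls
# for 10 epochs, and training stops after it stalls for another 10.
def finetune(model, optimizer, train_one_epoch, evaluate, max_epochs=1000):
    smoothed, best, patience, lr_dropped = None, float("inf"), 0, False
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_loss = evaluate(model)
        # recursively smooth the validation loss to reduce its fluctuations
        smoothed = val_loss if smoothed is None else 0.9 * smoothed + 0.1 * val_loss
        if smoothed < best:
            best, patience = smoothed, 0
        else:
            patience += 1
        if patience >= 10:
            if not lr_dropped:                # first stall: divide lr by 10
                for g in optimizer.param_groups:
                    g["lr"] /= 10
                lr_dropped, patience = True, 0
            else:                             # second stall: stop training
                break
```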
6) Evaluation Metrics
All the downstream tasks, i.e. the estimation of TDOA, DRR, $T_{60}$, $C_{50}$ and the mean absorption coefficient, are evaluated with the mean absolute error (MAE) between the estimated and the ground-truth parameters.
B. Comparison With Fully Supervised Learning
As far as we know, this work is the first one to study the self-supervised learning of spatial acoustic information, and there are no self-supervised baseline methods to compare. Therefore, we compare the proposed self-supervised pre-training plus fine-tuning scheme with a fully supervised learning scheme. In the fully supervised learning scheme, we train the same network architecture as our downstream model (namely the spatial encoder followed by a mean pooling and a linear head) from scratch using labeled data specific to the downstream task. Training from scratch with a small amount of data is also challenging and unstable, and we employ the same training scheme as described earlier for fine-tuning. This comparison aims to demonstrate the effectiveness of the proposed self-supervised pre-training method.
1) Evaluation on Simulated Data
Fig. 4 shows the performance of the five spatial acoustic parameter estimation tasks with the two learning schemes when using labeled data from various numbers of training rooms. It can be observed that the self-supervised setting outperforms the supervised setting under most conditions. This confirms that the spatial encoder learns spatial acoustic information in self-supervised pre-training. More specifically, the learned representation of the relative RIR/CTF involves both the inter-channel information (used for TDOA estimation) and the temporal structure of the RIR/CTF (used for DRR, $T_{60}$ and $C_{50}$ estimation).
Fig. 4. Results of TDOA, DRR, $T_{60}$, $C_{50}$ and absorption coefficient estimation on the simulated dataset for the proposed self-supervised pre-training plus fine-tuning method and the fully supervised training method, when using labeled data from different numbers of training rooms.
The training and test curves of three downstream tasks in self-supervised and supervised settings are illustrated in Fig. 5. Fine-tuning pre-trained models converges faster than training from scratch in general. Although the training losses of the two settings reach a similar level at the end, the test loss of the self-supervised setting is notably lower than the one of the supervised setting. This indicates that pre-training helps to reduce the generalization loss from training to test data.
Fig. 5. Learning curves (MAE versus training iteration) for TDOA, DRR and $T_{60}$ estimation on the simulated dataset, for the proposed fine-tuning scheme and the training-from-scratch scheme.
To evaluate how much information the proposed self-supervised pre-training method has learned, the performance of downstream tasks with four different settings is compared in Table III. Non-informative means the acoustic parameter predictions on the test data are simply set to a reasonable non-informative value, namely the mean value of the acoustic parameters of the training data, which does not exploit any information from the microphone signals of the test dataset. Pre-train plus linear evaluation means the pre-trained model is frozen and only a linear head is trained with downstream data. It can be seen that the linear evaluation setting achieves much better performance than the non-informative case, which demonstrates that the pre-trained model/features indeed involve useful information for downstream tasks. By training/fine-tuning the whole network for specific downstream tasks, the scratch and fine-tuning settings perform better on downstream tasks. Although linear evaluation was once a standard way of evaluating self-supervised learning methods, it misses the opportunity to pursue strong but non-linear features, which is indeed a strength of deep learning [68]. Therefore, most self-supervised learning works put more emphasis on the fine-tuning setting than on linear evaluation, and we will also only evaluate the fine-tuning setting in the following.
TABLE III: Performance (MAE) under different training settings on the simulated dataset
To assess the impact of pre-training epochs/iterations on the performance of downstream tasks, we present the pre-training MSE and the performance of downstream tasks with different pre-training epochs/iterations in Table IV. It can be seen that the performance of downstream tasks is consistent with the pretext task to a large extent, namely the performance of downstream tasks can be improved when the pre-training loss is reduced. This property is very important for validating that the proposed pretext task is indeed learning information that can be transferred to downstream tasks.
2) Evaluation on Real-World Data
We evaluate the proposed self-supervised method on real-world data to validate its effectiveness for practical applications. Conducting real-data experiments is complicated mainly for two reasons. One is that we do not have a sufficient amount of real-world data for pre-training, despite the fact that self-supervised pre-training does not require any data annotation. As mentioned in Section V-A-5, we only use the collected real-world data of 41 rooms for pre-training, which is not sufficient for fully pre-training the model. As indicated in Table IV, the performance of pre-training is closely related to the performance of downstream tasks, so we think the capability of pre-training may not be fully reflected in this experiment. The other reason is that for fine-tuning or training from scratch in the downstream tasks, it is not necessary to use only a small amount of real data, as a large amount of labeled simulated data can be easily obtained and used. Most DNN-based acoustic parameter estimation methods [6], [16], [17], [18], [19], [20], [21], [22] train the model (from scratch) using a large amount of labeled simulated data. Therefore, we conduct experiments of fine-tuning or training from scratch using three groups of data: i) a limited number of real-world data; ii) a sufficiently large amount of simulated data generated from 1000 rooms; iii) both real-world and simulated data, with their importance weights set to 0.5:0.5. These three settings are evaluated in Fig. 6 and Table V.
TABLE V: Performance of spatial acoustic parameter estimation on the real-world datasets
Fig. 6. Learning curves (MAE versus training iteration) for TDOA, DRR and $T_{60}$ estimation on the real-world datasets, for the proposed fine-tuning scheme and the training-from-scratch scheme.
The fine-tuning/training processes of three downstream tasks are illustrated in Fig. 6. Note that, different from the training scheme presented in Section V-A-5, to fully plot and analyze the training process in this figure, the learning rate is not reduced and the training is not stopped when it converges. When using only a small amount of real-world data, fine-tuning the pre-trained model converges rapidly for the DRR and $T_{60}$ tasks.
Table V shows the final performance of the five tasks using the training scheme described in Section V-A-5. As for the supervised case, one extra setting is added, namely supervised training with simulated data followed by supervised fine-tuning with real-world data (denoted as Simulated (+ real-FT)). In addition, some conventional methods are also compared, including GCC-PHAT [69] for TDOA estimation, one blind DRR estimation method [70] and one blind $T_{60}$ estimation method.
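For reference, the following is a standard NumPy sketch of GCC-PHAT for TDOA estimation (our rendering of the classical algorithm, not necessarily the exact implementation evaluated here): the cross-power spectrum is magnitude-normalized (PHAT weighting) before the inverse FFT, and the TDOA is the lag of the correlation peak.

```python
# A sketch of the GCC-PHAT baseline for TDOA estimation.
import numpy as np

def gcc_phat(x1, x2, fs=16000, max_tau=None):
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)                  # cross-power spectrum
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs   # TDOA in seconds
    return tau
```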
The best performance is highlighted in bold for each task. Compared with the supervised setting, the proposed self-supervised setting wins on estimating TDOA and most of the other parameters.
Compared to conventional methods, the best-performing learning-based models achieve much better performance, which demonstrates the superiority of deep learning for acoustic parameter estimation when the network can be properly trained.
C. Ablation Study
We conduct ablation experiments to evaluate the effectiveness of each component of the proposed method. Since existing real-world datasets lack diversity in room conditions, we perform the ablation studies on the simulated dataset for better analysis. The number of simulated training rooms is set to 8, and four trials are performed for each experimental setting unless otherwise stated. Three representative downstream tasks are mainly considered, namely the estimation of TDOA, DRR and $T_{60}$.
1) Influence of Masking Rate and Comparison With Patch-Wise Scheme
Table VI shows the results of three different masking rates, i.e. 25%, 50% and 75%. The pre-training MSE becomes larger as the masking rate increases, which is reasonable since reconstructing more frames is more difficult. However, the performance of downstream tasks is comparable for the three masking rates. The masking rate of 75% achieves slightly better TDOA performance, while the masking rate of 50% provides slightly better DRR and $T_{60}$ performance.
TABLE VI: Performance of the proposed method with different masking rates, and with the frame-wise and patch-wise schemes
In many audio spectral pattern learning works [38], [39], the so-called patch-wise scheme outperforms the frame-wise scheme on some downstream tasks, so we also test the patch-wise scheme. Patch-wise means the STFT coefficients are split into patches along the time and frequency axes, and the patches are arranged as a sequence and fed into the Conformer network. In this experiment, the 256 frames × 256 frequencies are split into 16 × 16 patches. Note that the frame-wise scheme can be considered as 256 × 1 patches. The results of the patch-wise scheme with a 50% masking rate are also shown in Table VI. It can be seen that the pretext task with the patch-wise setting is much more challenging, possibly because it is difficult to reconstruct 16 continuous frames. For downstream tasks, the patch-wise scheme shows better performance on DRR estimation but worse performance on the other tasks.
2) Contribution of Spectral Encoder
To evaluate the contribution of the spectral encoder, we conduct experiments with and without the spectral encoder in both pretext and downstream tasks. When using only the spatial encoder for the pretext task, the masking scheme given in (5) is used, and the encoder is actually required to learn both spatial and spectral information for signal reconstruction. The experimental results are shown in Table VII. As a baseline, the performance of training from scratch (namely without the pretext encoder) is also given. Compared with using two encoders for the pretext task and the spatial encoder for downstream tasks, using only one encoder for the pretext task achieves much worse performance on TDOA and $T_{60}$ estimation.
3) Comparison of Encoder Model Architectures
To demonstrate the effectiveness of the proposed MC-Conformer architecture for both pretext and downstream tasks, we compare the performance of five encoder architectures including CRNN, Transformer, Conformer, CNN+Transformer and CNN+Conformer (namely MC-Conformer).
CRNN is chosen since CNN and RNN are commonly adopted in spatial acoustic parameter estimation works [6], [10], [11], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The architecture of the CRNN-based encoder is shown in Fig. 7. It consists of a convolution block to process local TF information and a recurrent block to obtain global TF information. For the spatial encoder, $L=4$, $[c_{0},\ldots,c_{4}]=[16,16,32,64,128]$ and $[b_{0},\ldots,b_{4}]=[1,1,4,4,4]$. For the spectral encoder, $L=2$, $[c_{0},c_{1},c_{2}]=[32,32,64]$ and $[b_{0},b_{1},b_{2}]=[1,4,4]$. These hyper-parameters have been well tuned to improve the performance of downstream tasks.

Transformer [46] and Conformer [14] are widely used in audio/speech processing and self-supervised audio spectrogram learning [38], [39], [44]. We evaluate four types of architectures, i.e., Transformer, Conformer, CNN+Transformer and CNN+Conformer (namely MC-Conformer). The network configurations, including the number of attention heads, the embedding dimension, the number of Transformer/Conformer blocks and the CNN configurations, are set to be the same as the proposed MC-Conformer.
The experimental results are shown in Table VIII. It can be observed that the proposed MC-Conformer outperforms the other model architectures in both the pretext and downstream tasks. Transformer alone performs poorly, and once combined with CNN layers (CNN+Transformer or Conformer), the performance measures improve significantly. This confirms that the CNN is crucial and necessary for capturing local spatial acoustic information. Compared with Conformer, the performance improvement of the proposed CNN+Conformer indicates that using 2D CNNs to pre-process the raw STFT coefficients is very important. Finally, based on these comparisons, we can conclude that CRNN is a strong network architecture for learning spatial acoustic information, while the RNN can be replaced with the Conformer to better learn long-term dependencies.
D. Qualitative Experiments
Fig. 8 provides an example of the reconstructed signal. It can be seen that the main structure of the masked frames is well reconstructed. However, compared with the target signal, the reconstructed signal seems less blurred by reverberation, which is possibly because the late reverberation has not been well reconstructed. Late reverberation is spatially diffuse with a low spatial correlation, which makes it more challenging to reconstruct the late reverberation of one channel from that of the other channel. This may be related to the phenomenon that pre-training does not help the estimation of $T_{60}$.
Fig. 8. An example of the masked input, the reconstructed signal and the target signal. The reverberation time is 1 s, and the SNR is 20 dB.
Fig. 9 visualizes the learned representations (the hidden vectors after mean pooling) of downstream tasks. Compared with training from scratch, fine-tuning the pre-trained model yields fewer outliers and presents a much smoother and more discriminative manifold. For example, when training from scratch, it is hard to discriminate between the red and yellow points for some acoustic parameter values.
Fig. 9. Visualization of the learned representations for three downstream tasks. The number of training rooms is 8. The number of test rooms is 20. The representation extracted after the mean pooling layer from all test data is visualized with the t-SNE technique [72]. The gray histograms show the statistics of the values of acoustic parameters in the test data.
Conclusion
This paper proposes a self-supervised method to learn a universal spatial acoustic representation from dual-channel unlabeled microphone signals. With the designed cross-channel signal reconstruction (CCSR) pretext task, the pretext model is forced to separately learn the spatial acoustic information and the spectral pattern information. The dual-encoder plus decoder structure adopted by the pretext task facilitates the disentanglement of the two types of information. In addition, a novel multi-channel Conformer (MC-Conformer) is utilized to learn the local and global properties of spatial acoustics present in the time-frequency domain, which boosts the performance of both pretext and downstream tasks. Experiments conducted on both simulated and real-world data verify that the proposed self-supervised pre-training model learns useful knowledge that can be transferred to spatial-acoustics-related downstream tasks, including the estimation of TDOA, DRR, $T_{60}$, $C_{50}$ and absorption coefficients.
This work mainly focuses on learning spatial acoustic information from dual-channel microphone signals recorded in high-SNR environments with a single static speaker. This acoustic setting can be satisfied in many real-world indoor scenes. There are several potential directions for future extensions and improvements. For instance, more dynamic and complex acoustic conditions can be considered, and the joint learning of spatial and spectral cues can be further explored. How to extend the proposed method to more than two channels also needs further investigation.