Journals & Magazines >IEEE/ACM Transactions on Audi... >Volume: 32
期刊与杂志 > IEEE/ACM Transactions on Audi... > 卷：32

Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer
具有跨通道信号重建和多通道一致性的空间声学表征自监督学习

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverbe...Show More

Metadata

Abstract: 抽象的：

Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
监督学习方法在估计到达时间差、直达声与混响声之比以及混响时间等空间声学参数方面表现出色。然而，由于模拟声学特性与真实世界声学特性不匹配，以及标注真实世界数据的不足，这些方法仍然存在“模拟到现实”的泛化问题。为此，本研究提出了一种自监督方法，充分利用未标注数据进行空间声学参数估计。首先，设计了一个新的借口任务——跨通道信号重构 (CCSR)，用于从未标注的多通道麦克风信号中学习通用的空间声学表示。我们屏蔽一个通道的部分信号，并要求模型对其进行重构，这使得从未屏蔽的信号中学习空间声学信息，并从另一个麦克风通道中提取源信息成为可能。我们采用编码器-解码器结构来分离这两种信息。通过使用小型标注数据集对预训练的空间编码器进行微调，该编码器可用于估计空间声学参数。其次，我们采用一种新颖的多通道音频 Conformer（MC-Conformer）作为编码器模型架构，该架构同时适用于前置任务和下游任务。MC-Conformer 经过精心设计，能够捕捉时频域中空间声学的局部和全局特性。在模拟数据和真实数据上进行的五个声学参数估计任务的实验结果证明了该方法的有效性。据我们所知，这是空间声学表征学习和多通道音频信号处理领域的第一个自监督学习方法。

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing ( Volume: 32)
发表于： IEEE/ACM 音频、语音和语言处理学报（第 32 卷）

Page(s): 4211 - 4225
页数： 4211 - 4225

Date of Publication: 10 September 2024
出版日期： 2024 年 9 月 10 日

ISSN Information: ISSN 信息：

DOI: 10.1109/TASLP.2024.3458811
DOI： 10.1109/TASLP.2024.3458811

References is not available for this document.

Contents

SECTION I. 第一部分

Introduction 介绍

Spatial acoustic representation learning aims to extract a low-dimension representation of the spatial propagation characteristics of sound from microphone recordings. It can be used in a variety of audio tasks, such as estimating the acoustic parameters [1] or the geometrical information [2] related to source, microphone and environment. Related audio tasks have been widely applied in augmented reality [3] and hearing aids [4] where generating perceptually acceptable sound for the target environment is required to guarantee a good immersive experience, and also in intelligent robots [5] where perceiving surrounding acoustic properties serves as a priori for robot interaction with humans and environments.
空间声学表征学习旨在从麦克风录音中提取声音空间传播特性的低维表征。它可用于各种音频任务，例如估计与声源、麦克风和环境相关的声学参数 [1] 或几何信息 [2] 。相关音频任务已广泛应用于增强现实 [3] 和助听器 [4] 领域，在这些领域中，需要为目标环境生成感知上可接受的声音，以保证良好的沉浸式体验；此外，在智能机器人 [5] 领域中，感知周围的声学特性是机器人与人类和环境交互的先验条件。

Room impulse responses (RIRs) characterize the sound propagation from sound source to microphone constrained in a room environment. The physically relevant parameters that determine RIR include the positions of sound source and microphone array, the room geometry, and the absorption coefficients of walls. RIR is composed of direct-path propagation, early reflections, and late reverberation. The relative position between the source and the microphone array determines the direct-path propagation. All the three physical parameters affect the arrival times and the strengths of reflective pulses including early reflections and late reverberation. Some acoustic parameters of the spatial environment can be directly estimated from RIR without supervision, like position-dependent parameters including time difference of arrival (TDOA), direct-to-reverberant ratio (DRR) and clarity index C50, or position-independent parameters including reverberation time T60. The physical parameters including absorption coefficient, surface area, and room volume are difficult to estimate by conventional signal processing techniques since the RIR can not be formulated as an analytical function of them. Some researchers try to estimate these parameters from RIRs with deep neural networks (DNNs) [2], [6]. However, the RIR is normally unavailable without intrusive measurement in practical applications, in which case some works predict RIR [7], [8] or spatial acoustic embedding [9], [10] blindly from microphone recordings. As environmental acoustic parameters and geometrical information are heavily relevant to RIRs, a good blind pre-predictor of RIR or spatial acoustic representation will definitely benefit further spatial acoustic parameter estimation and geometrical structure analysis.
房间脉冲响应 (RIR) 表征了在室内环境中从声源到麦克风的声音传播。决定 RIR 的物理相关参数包括声源和麦克风阵列的位置、房间几何形状以及墙壁的吸收系数。RIR 由直达路径传播、早期反射和后期混响组成。声源和麦克风阵列之间的相对位置决定了直达路径传播。这三个物理参数都会影响反射脉冲（包括早期反射和后期混响）的到达时间和强度。一些空间环境的声学参数可以直接从 RIR 估计，而无需监督，例如与位置相关的参数，包括到达时间差 (TDOA)、直达混响比 (DRR) 和清晰度指数 C50 ，或与位置无关的参数，包括混响时间 T60 。由于 RIR 无法用它们的解析函数表示，因此传统的信号处理技术难以估计包括吸收系数、表面积和房间体积在内的物理参数。一些研究人员尝试利用深度神经网络 (DNN) [2] , [6] 从 RIR 中估计这些参数。然而，在实际应用中，RIR 通常无法通过侵入式测量获得，因此一些研究会根据麦克风录音盲目预测 RIR [7] , [8] 或空间声学嵌入 [9] , [10] 。由于环境声学参数和几何信息与 RIR 密切相关，因此，一个优秀的 RIR 或空间声学表示盲预预测器无疑将有益于进一步的空间声学参数估计和几何结构分析。

With the development of deep learning techniques, lots of works directly estimate spatial acoustic parameters from microphone signals in a supervised manner. These supervised works have achieved superior performance than conventional methods, owing to the strong modeling ability of deep neural networks. Since these works are data-driven, the diversity and quantity of training data are crucial to their performance. To this end, DNN models are usually trained with abundant diverse simulated data, and then transferred to real-world data. Some researchers show that the trained model does not perform well when directly transferred to real-world data [11] due to the mismatch between simulated and real-world RIRs. In [11], the mismatch is analyzed mainly in terms of the directivity of source and microphone, and the wall absorption coefficient. Moreover, there are many other aspects of mismatch, for example: 1) The acoustic response of real-world moving source [12] and the spatial correlation of real-world multi-channel noise are difficult to simulate. 2) RIR simulators [13] usually generate empty box-shaped rooms, while obstacles and non-regular-shaped rooms exist in real-world applications. As an alternative solution, real-world data can be also used for training. However, annotating the acoustic environment and the sound propagation paths would be very difficult and expensive. Existing annotated real-world datasets lack diversity and quantity, which limits the development of supervised learning methods. Therefore, it is necessary to study how to dig spatial acoustic information from unlabeled real-world data.
随着深度学习技术的发展，大量研究以监督学习的方式直接从麦克风信号中估计空间声学参数。得益于深度神经网络强大的建模能力，这些监督学习取得了优于传统方法的性能。由于这些研究是数据驱动的，训练数据的多样性和数量对其性能至关重要。为此，深度神经网络 (DNN) 模型通常使用丰富多样的模拟数据进行训练，然后迁移到真实数据中。一些研究者表明，由于模拟 RIR 与真实 RIR 不匹配，训练好的模型在直接迁移到真实数据时表现不佳 [11] 。在 [11] 中，不匹配主要体现在声源和麦克风的方向性以及墙体吸收系数方面。此外，不匹配还体现在许多其他方面，例如：1) 真实世界运动声源的声学响应 [12] 和真实世界多通道噪声的空间相关性难以模拟。 2）RIR 模拟器 [13] 通常生成空盒子形状的房间，而在实际应用中，障碍物和非规则形状的房间是存在的。作为一种替代方案，也可以使用真实世界数据进行训练。然而，标注声学环境和声音传播路径非常困难且成本高昂。现有的带标注的真实世界数据集缺乏多样性和数量，这限制了监督学习方法的发展。因此，有必要研究如何从未标注的真实世界数据中挖掘空间声学信息。

In this work, we investigate how to learn a universal spatial acoustic representation from unlabeled dual-channel microphone signals based on self-supervised learning. Microphone signals can be formulated as a convolution between dry source signals with multi-channel RIRs, with the addition of noise signals. Spatial acoustic representation learning focuses on extracting RIR-related but source-independent embeddings. As far as we know, this is the first work studying on self-supervised learning of spatial acoustic representation. The proposed method has the following contributions.
在本研究中，我们研究如何基于自监督学习从未标记的双通道麦克风信号中学习通用的空间声学表征。麦克风信号可以表示为干源信号与多通道 RIR 之间的卷积，并添加噪声信号。空间声学表征学习的重点是提取与 RIR 相关但与源无关的嵌入。据我们所知，这是首篇研究空间声学表征自监督学习的研究成果。所提出的方法具有以下贡献。

1) Self-Supervised Learning of Spatial Acoustic Representation (SSL-SAR)
1）空间声学表征的自监督学习（SSL-SAR）

The proposed method follows the basic pipeline of self-supervised learning, namely first pre-training using abundant unlabeled data and then fine-tuning using a small labeled dataset for downstream tasks. A new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed for self-supervised learning of a universal spatial acoustic representation. This work is implemented in the short-time Fourier transform (STFT) domain. Given the dual-channel microphone signals as input, we randomly mask a portion of STFT frames of one microphone channel and ask the neural network to reconstruct them. The reconstruction of the masked STFT frames requires both the spatial acoustic information related to RIRs and the spectral pattern information indicating source signal content. Accordingly, the network is forced to learn the (inter-channel) spatial acoustic information from the frames that are not masked for both microphone channels, and meanwhile extract the spectral information from corresponding frames of the unmasked channel. In order to disentangle the two kinds of information, the input STFT coefficients are separately masked and fed to two different encoders. The spatial and spectral representations are concatenated along the embedding dimension and then passed to a decoder to reconstruct the masked STFT frames. The pre-trained spatial encoder can provide useful information to various spatial acoustics-related downstream tasks. Note that this work only considers static acoustic scenarios where RIRs are time-invariant.
所提出的方法遵循自监督学习的基本流程，即首先使用大量未标记数据进行预训练，然后使用少量已标记数据集进行微调以完成下游任务。设计了一个新的借口任务，即跨通道信号重构 (CCSR)，用于自监督学习通用的空间声学表征。这项工作在短时傅里叶变换 (STFT) 域中实现。给定双通道麦克风信号作为输入，我们随机屏蔽其中一个麦克风通道的部分 STFT 帧，并要求神经网络重建它们。重建屏蔽的 STFT 帧需要与 RIR 相关的空间声学信息以及指示源信号内容的频谱模式信息。因此，网络被迫从两个麦克风通道未被屏蔽的帧中学习（通道间）空间声学信息，同时从未屏蔽通道的相应帧中提取频谱信息。为了分离这两类信息，输入的 STFT 系数分别被屏蔽并输入到两个不同的编码器中。空间和频谱表示沿嵌入维度连接，然后传递至解码器以重建掩蔽的 STFT 帧。预训练的空间编码器可以为各种与空间声学相关的下游任务提供有用信息。需要注意的是，本研究仅考虑 RIR 具有时不变性的静态声学场景。

2) Multi-Channel Audio Conformer (MC-Conformer)
2）多通道音频适配器（MC-Conformer）

Since the network would be pre-trained according to the pretext task, and then adopted by various downstream tasks, we need a powerful network that is suitable for both pretext and downstream tasks. To this end, a novel MC-Conformer is adopted as the encoder model. To fully learn the local and global properties of spatial acoustics exhibited in the time-frequency (TF) domain, it is designed following a local-to-global processing pipeline. The local processing model applies 2D convolutional layers to the raw dual-channel STFT coefficients to learn the relationship between microphone signals and RIRs. It captures the short-term and sub-band spatial acoustic information. The global processing model uses Conformer [14] blocks to mainly learn the full-band and long-term relationship of spatial acoustics. The feed-forward modules of Conformer can model the full-band correlations of RIRs, namely the wrapped-linear correlation between frequency-wise IPDs and time-domain TDOA for the direct path and early reflections. Considering RIRs are time-invariant for the entire signal, the multi-head self-attention module is used to model such long-term temporal dependence. Though the combination of 2D CNN and Conformer has been used in other tasks such as sound event localization and detection [15], in this work, we investigate its use for spatial acoustic representation learning and spatial acoustic parameter estimation.
由于网络将根据前置任务进行预训练，并被各种下游任务采用，因此我们需要一个强大的网络，既适用于前置任务，也适用于下游任务。为此，我们采用了一种新颖的 MC-Conformer 作为编码器模型。为了充分学习时频 (TF) 域中空间声学的局部和全局特性，该模型遵循从局部到全局的处理流程进行设计。局部处理模型将二维卷积层应用于原始双通道 STFT 系数，以学习麦克风信号与 RIR 之间的关系。它捕捉短期和子带空间声学信息。全局处理模型使用 Conformer [14] 模块，主要学习空间声学的全频带和长期关系。Conformer 的前馈模块可以模拟 RIR 的全频带相关性，即频率方向的 IPD 与直达路径和早期反射的时域 TDOA 之间的包裹线性相关性。考虑到 RIR 对于整个信号而言具有时不变性，我们采用多头自注意力模块来建模这种长期时间依赖性。虽然二维 CNN 与 Conformer 的组合已用于声音事件定位和检测 [15] 等其他任务，但在本研究中，我们研究了其在空间声学表征学习和空间声学参数估计中的应用。

The rest of this paper is organized as follows. Section II overviews the related works in the literature. Section III formulates the spatial acoustic representation learning problem. Section IV details the proposed self-supervised spatial acoustic representation learning method. Experiments and discussions with simulated and real-world data are presented in Section V, and conclusions are drawn in Section VI.
本文的其余部分安排如下。第 II 节概述了文献中的相关工作。第 III 节阐述了空间声学表征学习问题。第 IV 节详细介绍了所提出的自监督空间声学表征学习方法。第 V 节介绍了使用模拟数据和真实数据进行的实验和讨论，第 VI 节得出结论。

SECTION II. 第二部分

Related Works 相关作品

A. Deep-Learning-Based Spatial Acoustic Parameter Estimation
A.基于深度学习的空间声学参数估计

Spatial acoustic parameter estimation can provide important acoustic information of the environment. Lots of deep learning based methods are developed for related tasks [6], [10], [11], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], which are summarised in Table I. The estimation of spatial location-related parameters like TDOA and direction of arrival (DOA) requires multi-channel microphone signals as input [16], [17], [18]. The other spatial acoustic parameters can be predicted with both single-channel [6], [10], [19], [20], [21], [22], [23], [24], [25], [27], [28] and multi-channel microphone signals [11], [26], such as DRR, C50, T60, absorption coefficient, total surface area, room volume, signal-to-noise ratio (SNR) and speech transmission Index (STI). The network input can be spectrograms involving magnitude/energy [10], [11], [16], [17], [19], [20], [21], [22], [24], [26], [27], [28], phase [28], real and imaginary parts [18] of the complex-valued STFT coefficients, or relatively high-level features like inter-channel level difference (ILD) [11], [26], inter-channel phase difference (IPD) [11], [26], generalized cross-correlation function with phase transform (GCC-PHAT) [16], mel-scale frequency cepstral coefficient (MFCC) [25] and energy-based features [22], [28]. The commonly used model architectures include long short-term memory (LSTM) model [16], [18], [19], convolutional neural network (CNN) [6], [10], [11], [20], [22], [23], [24], [26], [27], convolutional recurrent neural network (CRNN) [17], [21], [25], [28], multilayer perceptron (MLP) [6], [23], etc.
空间声学参数估计可以提供重要的环境声学信息。许多基于深度学习的方法已针对相关任务 [6] 、 [10] 、 [11] 、 [16] 、 [17] 、 [18] 、 [19] 、 [20] 、 [21] 、 [22] 、 [23] 、 [24] 、 [25] 、 [26] 、 [27] 、 [28] 进行了开发，这些方法总结于表 I 中。估计与空间位置相关的参数，例如时差 (TDOA) 和到达方向 (DOA)，需要多通道麦克风信号作为输入 [16] 、 [17] 、 [18] 。其他空间声学参数可以通过单通道 [6] 、 [10] 、 [19] 、 [20] 、 [21] 、 [22] 、 [23] 、 [24] 、 [25] 、 [27] 、 [28] 和多通道麦克风信号 [11] 、 [26] 来预测，例如 DRR、 C50 、 T60 、吸收系数、总表面积、房间体积、信噪比 (SNR) 和语音传输指数 (STI)。网络输入可以是涉及幅度/能量 [10] ， [11] ， [16] ， [17] ， [19] ， [20] ， [21] ， [22] ， [24] ， [26] ， [27] ， [28] ，相位 [28] ，复值 STFT 系数的实部和虚部 [18] 的频谱图，或者相对高级的特征，例如通道间电平差异（ILD） [11] ， [26] ，通道间相位差（IPD） [11] ， [26] ，带相位变换的广义互相关函数（GCC-PHAT） [16] ，梅尔尺度频率倒谱系数（MFCC） [25] 和基于能量的特征 [22] ， [28] 。常用的模型架构包括长短期记忆 (LSTM) 模型 [16] 、 [18] 、 [19] 、卷积神经网络 (CNN) [6] 、 [10] 、 [11] 、 [20] 、 [22] 、 [23] 、 [24] 、 [26] 、 [27] 、卷积循环神经网络 (CRNN) [17] 、 [21] 、 [25] 、 [28] 、多层感知器 (MLP) [6] 、 [23] 等。

TABLE I Summary of Deep-Learning-Based Spatial Acoustic Parameter Estimation Methods
表一： 基于深度学习的空间声学参数估计方法总结

Most existing works train neworks for an acoustic parameter solely or multiple acoustic parameters jointly with labeled data, which implicitly learn task-oriented spatial acoustic information in a fully supervised manner. The work [10] extracts a universal representation of the acoustic environment with a contrastive learning method. However, it requires the RIR annotation to obtain the positive and negative sample pairs, which is still a supervised learning method. In contrast to these works, we aim to design a self-supervised method to learn a universal spatial acoustic representation that can be applied to various spatial acoustic-related downstream tasks. Self-supervised learning does not require any data annotation, which allows intensively digging spatial acoustic information from large-scale unlabeled audio data, especially from the real-world recordings.
现有的大多数研究都是针对单个声学参数或者多个声学参数与标注数据联合训练网络，以全监督的方式隐式学习面向任务的空间声学信息。文献 [10] 采用对比学习的方法提取了声学环境的通用表示，然而它需要 RIR 标注来获取正负样本对，这仍然是一种监督学习方法。与这些研究不同，我们的目标是设计一种自监督方法来学习通用的空间声学表示，并将其应用于各种与空间声学相关的下游任务。自监督学习不需要任何数据标注，这使得我们能够从大规模未标注音频数据（尤其是真实世界录音）中深入挖掘空间声学信息。

B. Audio Self-Supervised Representation Learning
B. 音频自监督表征学习

Self-supervised representation learning [29] has been successfully applied to audio/speech processing in recent years [30], which has shown effectiveness in a wide range of downstream applications like automatic speech recognition and sound event classification. According to how to build pretext task, it can be grouped into two categories, namely contrastive approaches and generative approaches. Contrastive approaches aim to learn a latent representation space that pulls close positive samples, and sometimes meanwhile pulls away negative samples from positive samples. Typical methods include contrastive predictive coding [31], wav2vec [32], COLA [33], BYOL-Audio [34], etc. Generative approaches learn representation by generating or reconstructing the input audio data with some limited views. Autoregressive predictive coding [35], [36] predicts future inputs from past inputs with an unsupervised autoregressive neural model to learn generic speech representation. Inspired by the masked language model task from BERT [37], some researchers propose to learn general-purpose audio representation by reconstructing masked patches from unmasked TF regions using Transformer [38], [39].
近年来，自监督表征学习 [29] 已成功应用于音频/语音处理 [30] ，并在自动语音识别和声音事件分类等一系列下游应用中展现出有效性。根据如何构建借口任务，它可以分为两类：对比方法和生成方法。 对比方法旨在学习一个潜在的表征空间，该空间可以拉近正样本，有时同时将负样本从正样本中拉开。典型方法包括对比预测编码 [31] 、wav2vec [32] 、COLA [33] 、BYOL-Audio [34] 等。 生成方法通过在某些有限视角下生成或重建输入音频数据来学习表征。自回归预测编码 [35] 、 [36] 使用无监督自回归神经模型根据过去的输入预测未来的输入，以学习通用的语音表征。受到 BERT [37] 的掩蔽语言模型任务的启发，一些研究人员提出通过使用 Transformer [38] 、 [39] 从未掩蔽的 TF 区域重建掩蔽块来学习通用音频表示。

These methods learn the representation of sound source from single-channel signal, and remove the influence of channel effect such as the response introduced by propagation paths. This kind of representation can be applied to a number of signal-content-related downstream tasks. In contrast, this work aims to learn the representation of spatial acoustic information and remove the information of the sound source, which will be used in spatial-acoustic-related downstream applications. Though there are some unsupervised/self-supervised methods proposed for multi-channel signal processing [40], [41], their self-supervised pretext tasks are different from our method. We aim to learn a general spatial acoustic representation, while they are designed for specific tasks and the potential to learn spatial acoustic information is unknown. Different from the existing masking-reconstruction pretext task [38], [39] which encourages learning inner-channel information, the proposed pretext task, i.e. cross-channel signal reconstruction, intends to learn both inner-channel and inter-channel information.
这些方法从单通道信号中学习声源的表示，并消除通道效应（例如传播路径引入的响应）的影响。这种表示可以应用于许多与信号内容相关的下游任务。相比之下，本研究旨在学习空间声学信息的表示并去除声源信息，这将用于与空间声学相关的下游应用。虽然已经提出了一些用于多通道信号处理的无监督/自监督方法 [40] ， [41] ，但它们的自监督借口任务与我们的方法不同。我们的目标是学习通用的空间声学表示，而它们是为特定任务设计的，并且学习空间声学信息的潜力尚不清楚。与现有的鼓励学习通道内信息的掩蔽重建借口任务 [38] ， [39] 不同，所提出的借口任务，即跨通道信号重建，旨在学习通道内和通道间的信息。

SECTION III. 第三部分

Problem Formulation 问题表述

We consider the case that one static sound source is observed by two static microphones in an enclosed room environment with additive ambient noise. The signal captured by the m-th microphone is denoted as

x m (t) = h m (t) * s (t) + v m (t), (1)

View Source

where

m∈{1,M}

denotes the microphone index,

t∈[1,T]

is the time sample index,

s(t)

is the source signal, and

vm(t)

is the received noise signal at the

-th microphone. Here,

is always set to 2, and

∗

denotes convolution. The RIR

hm(t)

from the sound source to the

-th microphone consists of two successive parts, which is formulated as

h m (t) = h d m (t) + h r m (t), (2)

View Source

where

hdm(t)

and

hrm(t)

are the impulse responses of the direct path and reflected paths, respectively. Theoretically, the sound arrives at the microphones first along the direct path. The impulse response of the direct path is formulated as

h d m (t) = α m δ (t - τ m), (3)

View Source

where

αm

and

τm

are the propagation attenuation factor and the time of arrival from the source to the

-th microphone, respectively. The dual-channel direct-path pulses can reflect the DOA of the source relative to the microphone array. The following pulses relate to the reflections on room boundaries or built-in objects, which can be divided into early reflections and late reverberation on the basis of their arrival times. The reflected paths indicate the acoustic settings of room, e.g., the decaying rate of RIR reflects the

T60

of room.
我们考虑在封闭的房间环境中，用两个静态麦克风观测一个静态声源，并伴有附加环境噪声的情况。第

m 个麦克风捕获的信号表示为

x m (t) = h m (t) * s (t) + v m (t), (1)

View Source

其中，

m∈{1,M} 表示麦克风索引，

t∈[1,T] 表示时间样本索引，

s(t) 表示源信号，

vm(t) 表示在第

m 个麦克风处接收到的噪声信号。其中，

M 始终设置为 2，

∗ 表示卷积。从声源到第

m 个麦克风的 RIR

hm(t) 由两个连续部分组成，公式如下

h m (t) = h d m (t) + h r m (t), (2)

View Source

其中

hdm(t) 和

hrm(t) 分别是直达路径和反射路径的脉冲响应。理论上，声音首先沿直达路径到达麦克风。直达路径的脉冲响应公式如下

h d m (t) = α m δ (t - τ m), (3)

View Source

其中，

αm 和

τm 分别为传播衰减因子和从声源到第

m 个麦克风的到达时间。双通道直达路径脉冲可以反映声源相对于麦克风阵列的到达方向（DOA）。后续脉冲与房间边界或内置物体上的反射有关，根据到达时间可分为早期反射和晚期混响。反射路径指示房间的声学特性，例如，RIR 的衰减速率反映了房间的

T60 。

Considering the sound propagation and the reflection when encountering obstacles are frequency-dependent, we convert the time-domain signal model in (1) into the STFT domain as [42]

X m (n, f) \approx \sum n' H m (n', f) S (n - n', f) + V m (n, f), (4)

View Source

where

n∈[1,N]

and

f∈[1,F]

represent the time frame index and frequency index, respectively, and

and

are the numbers of frames and frequencies, respectively. Here,

Xm(n,f)

S(n,f)

and

Vm(n,f)

represent the microphone, source and noise signals in the STFT domain, respectively. In the STFT-domain signal model, RIR can be well approximated by its sub-band representation, namely the convolutive transfer function (CTF)

Hm(n,f)

.
考虑到声音传播和遇到障碍物时的反射都与频率相关，我们将 (1) 中的时域信号模型转换为 STFT 域，如下所示： [42]

X m (n, f) \approx \sum n' H m (n', f) S (n - n', f) + V m (n, f), (4)

View Source

其中，

n∈[1,N] 和

f∈[1,F] 分别表示时间帧索引和频率索引，

N 和

F 分别表示帧数和频率。其中，

Xm(n,f) 、

S(n,f) 和

Vm(n,f) 分别表示 STFT 域中的麦克风信号、源信号和噪声信号。在 STFT 域信号模型中，RIR 可以通过其子带表示，即卷积传递函数（CTF）

Hm(n,f) 很好地近似。

The spatial acoustic information is encoded in the dual-channel RIRs/CTFs, being independent of the source signal and ambient noise. As illustrated in Fig. 1, this work aims to design a self-supervised method to learn a universal spatial acoustic representation related to the RIRs/CTFs, from unlabeled dual-channel microphone signals. The representation can be used to estimate the spatial acoustic parameters including TDOA, DRR, T60, C50, absorption coefficients, etc. Dual-channel microphone signals are given as the input of the self-supervised model since the estimation of some position-dependent parameters like TDOA and DOA requires signals of at least two microphones, and dual-channel microphone signals are expected to provide more reliable spatial acoustic information than single-channel signal [26]. Moreover, as will be shown later, using two microphones allows us to design a self-supervised pretext task that is able to better disentangle the spatial acoustic information and source information. Considering the simplicity of the signal model in (4), this work intends to design a self-supervised learning model in the STFT domain, and learn the CTF information from Xm(n,f). Though some parameters like DRR, T60, and absorption coefficients are frequency-dependent, this work focuses on estimating the full-band versions of them [1] which are obtained by averaging across the sub-band values.
空间声学信息被编码在双通道 RIR/CTF 中，与源信号和环境噪声无关。如图 1 所示，本研究旨在设计一种自监督方法，从未标记的双通道麦克风信号中学习与 RIR/CTF 相关的通用空间声学表征。该表征可用于估计空间声学参数，包括 TDOA、DRR、 T60 、 C50 、吸收系数等。双通道麦克风信号作为自监督模型的输入，因为估计某些与位置相关的参数（例如 TDOA 和 DOA）需要至少两个麦克风的信号，并且双通道麦克风信号预计比单通道信号 [26] 提供更可靠的空间声学信息。此外，正如稍后将展示的，使用两个麦克风使我们能够设计一个自监督的借口任务，该任务能够更好地分离空间声学信息和源信息。考虑到 (4) 中信号模型的简单性，本文旨在设计一个 STFT 域的自监督学习模型，并从 Xm(n,f) 中学习 CTF 信息。尽管一些参数（例如 DRR、 T60 和吸收系数）与频率相关，但本文专注于估计它们的全带版本 [1] ，该版本是通过对子带值取平均值获得的。

Fig. 1. - Illustration of self-supervised learning of spatial acoustic representation using multi-channel microphone recordings. The direct path, early reflections and late reverberation are illustrated in red, green and blue colors, respectively.

Fig. 1. 图 1.

Illustration of self-supervised learning of spatial acoustic representation using multi-channel microphone recordings. The direct path, early reflections and late reverberation are illustrated in red, green and blue colors, respectively.
使用多通道麦克风录音进行空间声学表征的自监督学习示意图。直达路径、早期反射和后期混响分别以红色、绿色和蓝色表示。

SECTION IV. 第四部分

Self-Supervised Learning of Spatial Acoustic Representation
空间声学表征的自监督学习

The proposed spatial acoustic representation learning method follows the basic pipeline of most self-supervised learning methods, namely first pre-training the representation model according to the pretext task using a large amount of unlabeled data, and then fine-tuning the pre-trained model for a specific downstream task using a small amount of labeled data. The key points of this work lie in how to build the pretext task to learn spatial acoustic information (see details in Section IV-A), and how to design a unified network architecture suited for both pretext task and downstream tasks (see details in Section IV-C). The block diagram of the proposed method is shown in Fig. 2.
所提出的空间声学表征学习方法遵循大多数自监督学习方法的基本流程，即首先使用大量未标注数据根据借口任务预训练表征模型，然后使用少量标注数据针对特定的下游任务对预训练模型进行微调。本研究的关键在于如何构建用于学习空间声学信息的借口任务（详见 IV-A 部分），以及如何设计一个同时适用于借口任务和下游任务的统一网络架构（详见 IV-C 部分）。所提方法的框图如图 2 所示。

Fig. 2. - Block diagram of the proposed self-supervised spatial acoustic representation learning model. The complex-valued STFT coefficients are illustrated by their real-part spectrograms.

Fig. 2. 图 2.

Block diagram of the proposed self-supervised spatial acoustic representation learning model. The complex-valued STFT coefficients are illustrated by their real-part spectrograms.
所提出的自监督空间声学表征学习模型的框图。复值 STFT 系数由其实部频谱图表示。

A. Pretext Task: Cross-Channel Signal Reconstruction
A. 借口任务：跨通道信号重建

A cross-channel signal reconstruction (CCSR) pretext task is built to learn the spatial acoustic information. As illustrated in Fig. 2, the basic idea is to mask a portion of STFT frames of one microphone channel to destroy corresponding spectral and spatial information, and then ask the network to reconstruct them. The expected function of the reconstruction network lies in three aspects:
我们构建了一个跨通道信号重构 (CCSR) 借口任务来学习空间声学信息。如图 2 所示，其基本思想是屏蔽某个麦克风通道的部分 STFT 帧，以破坏相应的频谱和空间信息，然后要求网络对其进行重构。重构网络的预期功能体现在三个方面：

Learning the spectral patterns of masked frames from the unmasked channel. Source signals have unique spectral patterns indicating the signal content. Since signals received at the two microphones have the same spectral information, we only mask one channel to preserve the corresponding signal content, and the network can learn the spectral information of sound source from the unmasked channel. The learned information from the unmasked channel may also include some RIR/CTF information of the unmasked channel.
从未掩蔽通道学习掩蔽帧的频谱模式 。源信号具有独特的频谱模式，指示信号内容。由于两个麦克风接收到的信号具有相同的频谱信息，我们仅屏蔽一个通道以保留相应的信号内容，网络可以从未掩蔽通道学习声源的频谱信息。从未掩蔽通道学习到的信息可能还包含一些未掩蔽通道的 RIR/CTF 信息。
Learning the spatial acoustics from the dual-channel unmasked frames. To reconstruct the masked STFT frames, the network needs to learn the (inter-channel) acoustic information from the dual-channel unmasked frames, and apply it to the information learned from the unmasked channel. The (inter-channel) spatial acoustic information relates to the RIR/CTF of the masked channel and more possibly to the relative RIR/CTF of the masked channel to the unmasked channel. In the representation of relative CTF/RIR, the relative information can be directly used for inter-channel downstream tasks, such as TDOA estimation. Moreover, it is expected that the temporal structure of CTF/RIR (or a variant of the temporal structure of CTF/RIR) is also preserved, from which the information used for temporal-structure-related downstream tasks, such as T60 estimation, can be extracted by DNN mapping. These assumptions will be validated through experiments, in which the learned spatial acoustic representation is shown to be effective for a variety of downstream tasks.
从双通道非掩蔽帧学习空间声学 。为了重建掩蔽的 STFT 帧，网络需要从双通道非掩蔽帧学习（通道间）声学信息，并将其应用于从非掩蔽通道学习到的信息。（通道间）空间声学信息与掩蔽通道的 RIR/CTF 相关，更可能与掩蔽通道相对于非掩蔽通道的 RIR/CTF 相关。在相对 CTF/RIR 的表示中，相对信息可直接用于通道间下游任务，例如 TDOA 估计。此外，预计 CTF/RIR 的时间结构（或 CTF/RIR 时间结构的变体）也会保留，以便通过 DNN 映射从中提取用于时间结构相关下游任务（例如 T60 估计）的信息。这些假设将通过实验进行验证，实验表明学习到的空间声学表示对各种下游任务均有效。
Reconstructing the masked frames using the learned spectral and spatial information.
使用学习到的光谱和空间信息重建掩蔽帧。

The proposed cross-channel signal reconstruction pretext task can disentangle source spectral information and spatial information, which facilitates the application of spatial acoustic representation in downstream tasks.
所提出的跨通道信号重建借口任务可以分离源频谱信息和空间信息，从而有利于空间声学表示在下游任务中的应用。

1) Reconstruction Framework
1）重建框架

A portion of the STFT frames of one microphone channel is randomly masked. The signal masked by the single-channel masking operation is formulated as

X ~ m m a s k (n, f) = X m m a s k (n, f) W (n, f), (5)

View Source

where

mmask

denotes the index of the masked channel, which is randomly selected from the two microphones for each training sample.

W(n,f)

is the binary-valued TF mask, and 0 for masking while 1 for not masking. The pretext task is to reconstruct the STFT coefficients of masked frames given the STFT coefficients of both the masked and the unmasked channels. The mean squared error (MSE) between the original and the reconstructed STFT coefficients of the masked frames is adopted as the reconstruction loss, which is formulated as

L = \sum N , F n = 1 , f = 1 ∣ ∣ X m m a s k ( n , f ) - X ^ m m a s k ( n , f ) ∣ ∣ 2 ( 1 - W ( n , f ) ) \sum N , F n = 1 , f = 1 ( 1 - W ( n , f ) ), (6)

View Source

where

X^mmask(n,f)

is the reconstructed STFT coefficients, and

|⋅|

denotes the magnitude of the complex number. Here,

1−W(n,f)

indicates that the reconstruction loss is only computed on the masked frames.
一个麦克风通道的部分 STFT 帧被随机屏蔽。单通道屏蔽操作屏蔽的信号公式如下：

X ~ m m a s k (n, f) = X m m a s k (n, f) W (n, f), (5)

View Source

其中

mmask 表示掩蔽通道的索引，该通道是从每个训练样本的两个麦克风中随机选择的。

W(n,f) 是二进制 TF 掩码，0 表示掩蔽，1 表示不掩蔽。前置任务是在给定掩蔽通道和非掩蔽通道的 STFT 系数的情况下，重建掩蔽帧的 STFT 系数。重建损失函数采用掩蔽帧的原始 STFT 系数与重建 STFT 系数之间的均方误差 (MSE)，公式如下

L = \sum N , F n = 1 , f = 1 ∣ ∣ X m m a s k ( n , f ) - X ^ m m a s k ( n , f ) ∣ ∣ 2 ( 1 - W ( n , f ) ) \sum N , F n = 1 , f = 1 ( 1 - W ( n , f ) ), (6)

View Source

其中

X^mmask(n,f) 是重建的 STFT 系数，

|⋅| 表示复数的幅值。此处，

1−W(n,f) 表示重建损失仅在被遮罩的帧上计算。

2) Encoder-Decoder Structure
2）编码器-解码器结构

The model of the pretext task adopts an encoder-decoder structure, as shown in Fig. 2. Considering the characteristics and heterogeneity of spatial acoustics and signal content, the input STFT coefficients are first masked in two different ways (see details in Section IV-A-3)), then fed into spatial and spectral encoders to separately learn the two kinds of information. Both encoders adopt the MC-Conformer (see details in Section IV-C) but with different configurations (see details in Section V-A-4). The complex-valued STFT coefficients have a dimension of F×N×M. The concatenation of their real and imaginary parts along the channel dimension is taken as the input of encoders, which has a dimension of F×N×2M. The two encoders convert their own masked inputs to a spatial embedding sequence of N×Dspat and a spectral embedding sequence of N×Dspec, respectively. Dspat and Dspec are the hidden dimensions. Note that this work intends to learn the spatial acoustic information, and the learning of spectral information is just designed for disentangling the spectral information from the learning of spatial acoustic representation and making the spatial acoustic representation learning more concentrated. The embeddings learned by the two encoders are concatenated along the hidden dimension and then fed into the decoder to predict the real and imaginary parts of the dual-channel STFT coefficients. Though the reconstruction loss is only calculated on the masked frames of one channel, we still output the dual-channel STFT coefficients to preserve the frame and channel information of the original input. As shown in Fig. 3(b), two fully-connected (FC) layers are adopted as the decoder, and each frame is separately processed by the same decoder. In order to encourage the spatial and spectral information used for signal reconstruction to be fully learned by the encoders instead of the decoder, the decoder is set to not have any information interaction between time frames.
借口任务模型采用编码器-解码器结构，如图 2 所示。考虑到空间声学和信号内容的特点及异质性，输入的 STFT 系数首先以两种不同的方式进行掩蔽（详见 IV-A-3 节），然后输入到空间和频谱编码器中，分别学习这两类信息。两个编码器都采用 MC-Conformer（详见 IV-C 节），但配置不同（详见 V-A-4 节）。复值 STFT 系数的维度为 F×N×M 。沿通道维度连接它们的实部和虚部作为编码器的输入，其维度为 F×N×2M 。两个编码器分别将各自的掩蔽输入转换为空间嵌入序列 N×Dspat 和频谱嵌入序列 N×Dspec 。 Dspat 和 Dspec 是隐藏维度。需要注意的是，本研究旨在学习空间声学信息，而频谱信息的学习只是为了将频谱信息从空间声学表征学习中分离出来，使空间声学表征学习更加集中。两个编码器学习到的嵌入沿隐藏维度连接，然后输入解码器，用于预测双通道 STFT 系数的实部和虚部。虽然重建损失仅针对一个通道的掩码帧计算，但我们仍然输出双通道 STFT 系数，以保留原始输入的帧和通道信息。如图 3(b) 所示，解码器采用两个全连接（FC）层，每个帧由同一个解码器分别处理。为了促使编码器而不是解码器充分学习用于信号重建的空间和频谱信息，解码器被设置为在时间帧之间不存在任何信息交互。

$Fig. 3. - Model architecture of (a) spatial/spectral encoder (namely multi-channel audio Conformer), (b) decoder and (c) convolution block in the encoder. $D$ is the hidden dimension, and $D=D^\mathrm{{spat}}$ for spatial encoder and $D=D^\mathrm{{spec}}$ for spectral encoder.$

Fig. 3. 图 3.

Model architecture of (a) spatial/spectral encoder (namely multi-channel audio Conformer), (b) decoder and (c) convolution block in the encoder. D is the hidden dimension, and D=Dspat for spatial encoder and D=Dspec for spectral encoder.
(a) 空间/频谱编码器（即多通道音频 Conformer）、(b) 解码器和 (c) 编码器中的卷积块的模型架构。 D 是隐藏维度， D=Dspat 代表空间编码器， D=Dspec 代表频谱编码器。

3) Masking Scheme 3）掩蔽方案

The masked signal X~mmask(n,f) is readjusted for the two encoders to disentangle spatial acoustics information and signal content information. In order to encourage the spatial encoder to focus on learning the spatial acoustic information, we remove the spectral information of masked frames from the input of the spatial encoder. Specifically, we apply the single-channel mask to both microphone channels, which is formulated as

X s p a t m (n, f) = X m (n, f) W (n, f) . (7)

View Source

It guarantees that the spatial encoder will not see any channel of the masked frames, and hopefully will focus on learning the spatial information from unmasked frames.
针对两个编码器重新调整掩蔽信号

X~mmask(n,f) ，以分离空间声学信息和信号内容信息。为了鼓励空间编码器专注于学习空间声学信息，我们从空间编码器的输入中移除了掩蔽帧的频谱信息。具体来说，我们将单通道掩蔽应用于两个麦克风通道，其公式如下：

X s p a t m (n, f) = X m (n, f) W (n, f) . (7)

View Source

它保证空间编码器不会看到任何被遮罩的帧的通道，并且有望专注于从未被遮罩的帧中学习空间信息。

The spectral encoder is used to learn signal content information indicated by the inner-channel information. To make the spectral encoder learn the signal content information, we only input the single-channel signal to the spectral encoder. Specifically, the inverse single-channel mask is additionally applied to the signals of the other microphone, which is formulated as

X s p e c m (n, f) = {X m (n, f) W (n, f), if m = m m a s k X m (n, f) (1 - W (n, f)), o t h e r w i s e . (8)

View Source

The spectral encoder cannot see the same frames of both microphones simultaneously, and thus will not learn the inter-channel spatial acoustic information. To model the spectral information of masked frames, in addition to the corresponding frames of unmasked channel, we also input the unmasked frames of the masked channel to provide more context information for the masked frames. The spectral encoder may also learn some single-channel RIR/CTF information.
谱编码器用于学习由内部通道信息指示的信号内容信息。为了使谱编码器学习信号内容信息，我们仅将单通道信号输入谱编码器。具体而言，将逆单通道掩模额外应用于另一个麦克风的信号，其公式为

X s p e c m (n, f) = {X m (n, f) W (n, f), if m = m m a s k X m (n, f) (1 - W (n, f)), o t h e r w i s e . (8)

View Source

频谱编码器无法同时看到两个麦克风的相同帧，因此无法学习通道间的空间声学信息。为了对掩蔽帧的频谱信息进行建模，除了未掩蔽通道的相应帧外，我们还输入了掩蔽通道的未掩蔽帧，以便为掩蔽帧提供更多上下文信息。频谱编码器还可以学习一些单通道 RIR/CTF 信息。

B. Downstream Tasks: Spatial Acoustic Parameter Estimation
B.下游任务：空间声学参数估计

Fig. 2 also shows the diagram of how to use the pre-trained model in downstream tasks. The dual-channel STFT coefficients Xm(n,f) are directly passed to the spatial encoder without any masking operation. The spatial encoder processes the input and produces a spatial acoustic representation. This representation is then pooled across all time frames and passed to a linear head to estimate the spatial acoustic parameter. The weights of the spatial encoder are initialized from the self-supervised pre-trained model and then fine-tuned using the labeled data of downstream tasks.
图 2 还展示了如何在下游任务中使用预训练模型。双通道 STFT 系数 Xm(n,f) 直接传递到空间编码器，无需任何掩蔽操作。空间编码器处理输入并生成空间声学表示。然后，该表示在所有时间帧上进行池化，并传递到线性头以估计空间声学参数。空间编码器的权重由自监督预训练模型初始化，然后使用下游任务的标记数据进行微调。

We consider estimating the following spatial acoustic parameters as downstream tasks.
我们将估计以下空间声学参数视为下游任务。

TDOA is an important feature for sound source localization. It is defined as the relative time delay, namely Δt in seconds, when the sound emitted by the source arrives at the two microphones. This work estimates TDOA in samples, namely Δtfs, where fs is the sampling rate of signals.
TDOA 是声源定位的一个重要特征。它定义为声源发出的声音到达两个麦克风的相对时间延迟，即 Δt 秒。本文以样本为单位估计 TDOA，即 Δtfs ，其中 fs 是信号的采样率。
DRR is defined as the energy ratio of direct-path part to the rest of RIRs [1], i.e.,
$D R R m = 10 log 10 \sum t d + Δ t d t = t d - Δ t d h 2 m ( t ) \sum t d - Δ t d t = 0 h 2 m ( t ) + \sum \infty t = t d + Δ t d h 2 m ( t ), (9)$ View Sourcewhere the direct-path signal arrives at the td-th sample, and Δtd is the additional sample spread for the direct-path signal, which typically corresponds to 2.5 ms [1].
DRR 定义为直接路径部分与其余 RIR [1] 的能量比，即 $D R R m = 10 log 10 \sum t d + Δ t d t = t d - Δ t d h 2 m ( t ) \sum t d - Δ t d t = 0 h 2 m ( t ) + \sum \infty t = t d + Δ t d h 2 m ( t ), (9)$ View Source 其中，直接路径信号到达第 td 个样本， Δtd 是直接路径信号的额外样本扩展，通常对应于 2.5 毫秒 [1] 。
T60 is defined as the time when the sound energy decays by 60 dB after the source is switched off. It can be computed from the energy decay curve of the RIR.
T60 定义为声源关闭后声能衰减 60 dB 的时间。它可以根据 RIR 的能量衰减曲线计算出来。
C50 measures the energy ratio between early reflections and late reverberation. It can be obtained from RIRs as [43]
$C 50 m = 10 log 10 \sum t d + t 50 t = 0 h 2 m ( t ) \sum \infty t = t d + t 50 h 2 m ( t ), (10)$ View Sourcewhere t50 is the number of samples for 50 ms.
C50 测量早期反射与后期混响之间的能量比。它可以通过 RIR 计算得出，公式为 [43] $C 50 m = 10 log 10 \sum t d + t 50 t = 0 h 2 m ( t ) \sum \infty t = t d + t 50 h 2 m ( t ), (10)$ View Source ，其中 t50 是 50 毫秒内的样本数。
The surface area-weighted mean absorption coefficient is computed as [6], [26]
$α ¯ = \sum I i = 1 S i α i \sum I i = 1 S i, (11)$ View Sourcewhere I is the number of room surfaces, and Si and αi represent the surface area and the absorption coefficients of the i-th surface, respectively.
表面面积加权平均吸收系数计算为 [6] , [26] $α ¯ = \sum I i = 1 S i α i \sum I i = 1 S i, (11)$ View Source ，其中 I 是房间表面的数量， Si 和 αi 分别表示第 i 个表面的表面面积和吸收系数。

These spatial acoustic parameters are all continuous values, so we treat spatial acoustic parameter estimation as regression problems. The MSE loss between the predictions and the ground truths is used to train the downstream model.
这些空间声学参数均为连续值，因此我们将空间声学参数估计视为回归问题。预测值与真实值之间的 MSE 损失用于训练下游模型。

C. Encoder Model: Multi-Channel Audio Conformer
C.编码器模型：多通道音频转换器

The model architecture of the spatial encoder is required to be suitable for both the pretext task and downstream tasks. Existing spatial acoustic parameter estimation works commonly adopt CNN and recurrent neural network (RNN) architectures [6], [10], [11], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], in order to leverage the local information modeling ability of convolutional layers and the long-term temporal context learning ability of recurrent layers. Transformer and Conformer are two widely used architectures for self-supervised learning of audio spectrogram [38], [39], [44], [45]. The Transformer architecture [46], known for its ability to capture longer-term temporal context, has outperformed RNN in various audio signal processing tasks [47]. The Conformer [14] architecture, which incorporates convolutional layers into the Transformer block, has also shown effectiveness in many speech processing tasks [48], [49]. Therefore, we utilize a combination of CNN and Conformer architectures to construct the encoders in our work.
空间编码器的模型架构需要适用于前置任务和下游任务。现有的空间声学参数估计研究通常采用 CNN 和循环神经网络 (RNN) 架构 [6] 、 [10] 、 [11] 、 [16] 、 [17] 、 [18] 、 [19] 、 [20] 、 [21] 、 [22] 、 [23] 、 [24] 、 [25] 、 [26] 、 [27] 、 [28] ，以利用卷积层的局部信息建模能力和循环层的长期时间上下文学习能力。Transformer 和 Conformer 是两种广泛用于音频频谱图自监督学习的架构 [38] 、 [39] 、 [44] 、 [45] 。Transformer 架构 [46] 以其捕捉长期时间上下文的能力而闻名，在各种音频信号处理任务中都表现优于 RNN [47] 。 Conformer [14] 架构将卷积层融入 Transformer 模块，也已在许多语音处理任务中展现出有效性 [48] , [49] 。因此，我们在工作中结合使用 CNN 和 Conformer 架构来构建编码器。

The spatial acoustic information exhibits discriminative properties in local and global TF regions. Local region means short time and sub band, while global region means long time and full band. Local characteristics: The CTF model naturally involves local convolution operations along the time axis [42]. In addition, smoothing over time frames and frequencies is necessary for estimating statistics of random signals, such as the spatial correlation matrix, which will be helpful for the estimation of acoustic parameters [50], [51], [52]. These characteristics motivate us to use time-frequency 2D convolutional layers to capture the local information. Global characteristics: The spatial acoustic information exhibits certain long-time and full-band properties. This work considers the case that the sound source and microphone array are static, and hence the RIRs remain time-invariant throughout the entire signal. The directional pulses involving both direct path and early reflections are highly correlated across frequencies. For each propagation path, the frequency-wise IPD is wrapped-linearly increased with respect to frequency [17], [53], and the slope corresponds to the TDOA of this path. These observations motivate the use of fully connected layers (across all frequencies) to capture the full-band linear dependence, and self-attention scheme to learn the (time-invariant) temporal dependence of spatial acoustic information.
空间声学信息在局部和全局 TF 区域表现出判别特性。局部区域意味着短时间和子带，而全局区域意味着长时间和全频带。 局部特征 ：CTF 模型自然涉及沿时间轴 [42] 的局部卷积运算。此外，对时间帧和频率进行平滑对于估计随机信号的统计数据（例如空间相关矩阵）是必要的，这将有助于估计声学参数 [50] ， [51] ， [52] 。这些特性促使我们使用时频 2D 卷积层来捕获局部信息。 全局特征 ：空间声学信息表现出一定的长时间和全频带特性。这项工作考虑了声源和麦克风阵列是静态的情况，因此 RIR 在整个信号中保持时不变。涉及直接路径和早期反射的方向脉冲在不同频率上高度相关。对于每条传播路径，频率方向的 IPD 相对于频率 [17] 和 [53] 呈包裹线性递增，斜率对应于该路径的 TDOA。这些观察结果促使我们使用全连接层（覆盖所有频率）来捕捉全频带线性相关性，并使用自注意力机制来学习空间声学信息的（时不变的）时间相关性。

Based on the considerations mentioned above, we design a MC-Conformer as our spatial encoder. Since the CNN and Conformer architectures are widely used for spectral pattern learning, we also use the MC-conformer as our spectral encoder. As shown in Fig. 3(a), the MC-Conformer follows a local-to-global processing pipeline. Local processing model utilizes time-frequency 2D convolutional layers to extract short-term and sub-band information. Global processing model uses Conformer blocks to mainly capture long-term and full-band information. We adopt the sandwich-structured Conformer block presented in [14], which stacks a half-step feed-forward module, a multi-head self-attention module, a 1D convolution module, a second half-step feed-forward module and a layernorm layer. The feed-forward module is applied to the time-wise full-band features, which can capture the full-band frequency dependence. The 1D convolution module and multi-head self-attention module mainly learn the short-term and long-term temporal dependencies, respectively.
基于上述考虑，我们设计了一个 MC-Conformer 作为空间编码器。由于 CNN 和 Conformer 架构广泛用于光谱模式学习，我们也使用 MC-Conformer 作为光谱编码器。如图 3(a) 所示，MC-Conformer 遵循从局部到全局的处理流程。局部处理模型利用时频二维卷积层提取短期和子带信息。全局处理模型使用 Conformer 模块主要捕获长期和全频带信息。我们采用 [14] 中提出的三明治结构 Conformer 模块，它堆叠了一个半步前馈模块、一个多头自注意力模块、一个一维卷积模块、一个第二个半步前馈模块和一个 layernorm 层。前馈模块应用于时间方向的全频带特征，可以捕获全频带频率依赖性。 1D 卷积模块和多头自注意力模块分别主要学习短期和长期时间依赖性。

SECTION V. 第五部分

Experiments and Discussions
实验与讨论

In this section, we conduct experiments on both simulated data and real-world data to evaluate the effectiveness of the proposed method. We first describe the experimental datasets and configurations, and then present extensive experimental results and discussions.
在本节中，我们将在模拟数据和真实数据上进行实验，以评估所提方法的有效性。我们首先描述实验数据集和配置，然后展示详细的实验结果和讨论。

A. Experimental Setup A. 实验设置

1) Simulated Dataset 1）模拟数据集

A large number of rectangular rooms are simulated using an implementation of the image method [54] provided by gpuRIR toolbox [13].¹ The size of simulated rooms ranges from 3×3×2.5 m3 to 15×10×6 m3. The reverberation time is in the range of 0.2 s to 1.3 s. The absorption coefficients of six walls are in [0, 1] and can be totally different. The array has two microphones with an aperture in [3 cm, 20 cm], which is placed parallel to the floor. The center of the microphone array is set in the [20%, 20%, 10%] to [80%, 80%, 50%] of room space in order to ensure a minimum boundary distance from the walls. The static omnidirectional sound source is randomly placed in the room, with a minimum distance of 30 cm from the microphone array and from each room surface. Speech recordings from the training, development and evaluation subsets of the WSJ0 dataset² are used as source signals, which are used for training, validation and test of the proposed model, respectively. An arbitrary noise field generator³ is used to generate the spatially-diffuse white noise [55] for different array apertures. The microphone signals are generated by first filtering four-second speech recordings with RIRs, and then scaling and adding diffuse noise with a SNR ranging from 15 dB to 30 dB. Each signal is a random combination of the aforementioned data settings regarding source position, microphone position, source signal, noise signal, room size, reverberation time, relative ratios among six absorption coefficients, etc.
使用 gpuRIR 工具箱 [13] 提供的图像方法 [54] 实现模拟了大量矩形房间。 ¹ 模拟房间的大小范围从 3×3×2.5 m 3 到 15×10×6 m 3 。混响时间在 0.2 s 到 1.3 s 范围内。六面墙壁的吸收系数在 [0, 1] 之间，并且可以完全不同。阵列具有两个麦克风，孔径为 [3 cm, 20 cm]，与地板平行放置。麦克风阵列的中心设置在房间空间的 [20%, 20%, 10%] 到 [80%, 80%, 50%] 之间，以确保与墙壁的最小边界距离。静态全向声源随机放置在房间内，与麦克风阵列和每个房间表面的最小距离为 30 厘米。 WSJ0 数据集 ² 的训练、开发和评估子集的语音记录被用作源信号，分别用于所提模型的训练、验证和测试。任意噪声场发生器 ³ 用于生成不同阵列孔径的空间漫射白噪声 [55] 。麦克风信号首先通过 RIR 滤波四秒语音记录，然后缩放并添加信噪比 (SNR) 在 15 dB 至 30 dB 范围内的漫射噪声来生成。每个信号都是上述数据设置的随机组合，这些数据设置包括源位置、麦克风位置、源信号、噪声信号、房间大小、混响时间、六个吸收系数之间的相对比例等。

Note that, the proposed pretext task, i.e. cross-channel signal reconstruction, is an ill-posed problem for random noise, as the random samples of noise are unpredictable. Our preliminary experiments demonstrated that more noise lead to more reconstruction error and decreased capability of spatial acoustic learning. In this experiment, we set the SNR range as 15-30 dB, which does not introduce much adverse effect. Meanwhile, this SNR condition (namely lager than 15 dB) can be satisfied in many real-world scenes for data recording.
需要注意的是，我们提出的借口任务，即跨通道信号重构，对于随机噪声而言是一个病态问题，因为噪声的随机样本是不可预测的。我们的初步实验表明，噪声越大，重构误差越大，空间声学学习能力也越差。在本实验中，我们将信噪比 (SNR) 范围设置为 15-30 dB，这不会带来太大的负面影响。同时，这一信噪比条件（即大于 15 dB）在许多实际数据记录场景中都能得到满足。

We randomly generate 512,000 training signals for the pretext task, in order to imitate that the real-world unlabeled data used for training can be with enough quantity and diversity. The validation set and the test set of the pretext task contain 5,120 signals each.
我们为借口任务随机生成了 512,000 个训练信号，以模拟用于训练的真实世界未标记数据能够具有足够的数量和多样性。借口任务的验证集和测试集各包含 5,120 个信号。

As for downstream tasks, we generate a series of new room conditions in terms of room size, reverberation time, absorption coefficient and source-microphone position, and 50 RIRs are randomly generated for each room condition. These rooms are divided without overlap to obtain the training, validation and test sets. To imitate a small number of labeled data with limited diversity used for downstream tasks, and to evaluate the influence of data diversity, the number of training (fine-tuning) rooms are set to 2, 4, 8, 16, 32, 64, 128 and 256, respectively. When the number of training rooms is too small, the results will vary a lot from trial to trial (different trials use different training rooms). Therefore, we conduct 16, 8, 4, 2, 1, 1, 1 and 1 trials for these settings of training room number, respectively, and the averaged results over trials are reported. The numbers of validation rooms and test rooms are both 20. Each RIR is convoluted with two, one and four different source signals for training, validation and test, respectively. Accordingly, the numbers of signals for each training, validation and test room are 100, 50 and 200, respectively.
对于下游任务，我们根据房间大小、混响时间、吸收系数和源麦克风位置生成一系列新的房间条件，并为每个房间条件随机生成 50 个 RIR。这些房间被不重叠地划分，以获得训练、验证和测试集。为了模拟用于下游任务的少量且多样性有限的标记数据，并评估数据多样性的影响，训练（微调）房间的数量分别设置为 2、4、8、16、32、64、128 和 256。当训练房间数量太少时，每次试验的结果会有很大差异（不同的试验使用不同的训练房间）。因此，我们分别针对这些训练房间数量的设置进行了 16、8、4、2、1、1、1 和 1 次试验，并报告了试验的平均结果。验证室和测试室的数量均为 20 个。每个 RIR 分别与 2 个、1 个和 4 个不同的源信号进行卷积，用于训练、验证和测试。因此，每个训练、验证和测试室的信号数量分别为 100、50 和 200 个。

2) Real-World Datasets 2）真实世界数据集

We collect 11 public real-world multi-channel datasets. Among them, MIR [56],⁴ MeshRIR [57],⁵ DCASE [58],⁶ dEchorate [59],⁷ BUTReverb [60]⁸ and ACE [1]⁹ provide real-measured multi-channel RIRs. The microphone signals are created by convolving the real-measured RIRs with source signals from WSJ0, and then adding noise with a SNR ranging from 15 dB to 30 dB if noise signals are provided by the corresponding dataset. LOCATA [61],¹⁰ MC-WSJ-AV [62],¹¹ LibriCSS [63],¹² AMIMeeting [64],¹³ AISHELL-4 [65],¹⁴ AliMeeting [66]¹⁵ and RealMAN [67]¹⁶ provide real-recorded multi-channel speech signals. From the original multi-channel audio recordings, all the two-channel sub-arrays with an aperture in [3 cm, 20 cm] are selected. We only use the data of a single static speaker in our experiments. Table II summarizes the settings of selected data from the collected datasets. There are a total of 111 rooms and more than 40k RIR settings (in terms of room condition, source position, and array position).
我们收集了 11 个公开的真实世界多通道数据集。其中，MIR [56] 、 ⁴ MeshRIR [57] 、 ⁵ DCASE [58] 、 ⁶ dEchorate [59] 、 ⁷ BUTReverb [60] ⁸ 和 ACE [1] ⁹ 提供了实测的多通道 RIR。麦克风信号是通过将实测 RIR 与来自 WSJ0 的源信号进行卷积生成的，然后如果相应数据集提供了噪声信号，则添加信噪比 (SNR) 范围为 15 dB 至 30 dB 的噪声。 LOCATA [61] 、 ¹⁰ MC-WSJ-AV [62] 、 ¹¹ LibriCSS [63] 、 ¹² AMIMeeting [64] 、 ¹³ AISHELL-4 [65] 、 ¹⁴ AliMeeting [66] ¹⁵ 和 RealMAN [67] ¹⁶ 提供真实录制的多通道语音信号。从原始多通道音频录音中，选择了所有孔径为 [3 cm, 20 cm] 的双通道子阵列。我们在实验中仅使用单个静态说话人的数据。表 II 总结了从收集到的数据集中选定数据的设置。共有 111 个房间和超过 40k 个 RIR 设置（就房间条件、源位置和阵列位置而言）。

TABLE II Settings of Our Selected Data From Seven Public Real-World Multi-Channel Datasets
表二我们从七个公共真实世界多通道数据集中选择的数据设置

For pre-training, we use all the collected real-world datasets. We distinctively generate 512,000, 4,000 and 4,000 signals for training, validation and test, respectively. The importance weight of each dataset for generating data is set according to the number of rooms and the duration of speech recordings in this dataset. As for downstream tasks, we use the LOCATA dataset for TDOA estimation, and the ACE dataset for the tasks including DRR, T60, C50 and mean absorption coefficient estimation. The ACE dataset provides multi-channel measured RIRs and noise signals. The ACE dataset contains 7 different rooms, which however still lacks room diversity, and thence 7-fold cross-validation is adopted. In each fold, we use one room for test, one room for validation and the other five rooms for training. During training (fine-tuning), each training signal is generated on-the-fly as a random combination of RIRs, source signals (from the WSJ0 dataset) and noise signals. A fixed numbers of signals, i.e. 1,000 and 4,000, are generated for validation and test in each fold, respectively.
在预训练中，我们使用所有收集到的真实世界数据集。我们分别生成 512,000、4,000 和 4,000 个信号用于训练、验证和测试。每个数据集生成数据的重要性权重根据该数据集中的房间数量和语音记录的时长来设置。对于下游任务，我们使用 LOCATA 数据集进行 TDOA 估计，并使用 ACE 数据集进行 DRR、 T60 、 C50 和平均吸收系数估计等任务。ACE 数据集提供多通道实测的 RIR 和噪声信号。ACE 数据集包含 7 个不同的房间，但仍然缺乏房间多样性，因此采用 7 倍交叉验证。在每一倍中，我们使用一个房间进行测试，一个房间进行验证，其余五个房间进行训练。在训练（微调）期间，每个训练信号都是即时生成的，由 RIR、源信号（来自 WSJ0 数据集）和噪声信号随机组合而成。每次验证和测试都会生成固定数量的信号，即 1,000 和 4,000。

3) Parameter Settings 3）参数设置

The sampling rate of signals is 16 kHz. STFT is performed with a window length of 32 ms and a frame shift of 16 ms. The number of frequencies F is 256. For pre-training, the length of microphone signals is set to 4 s, and correspondingly the number of time frames N is 256. The input STFT coefficients are normalized by dividing the mean value of the magnitude of the first microphone channel. The number of masked time frames is 128, namely half of the number of all frames. For downstream tasks, the length of microphone signals is set to 1 s for TDOA estimation and 4 s for the other tasks. In addition to the data of single static speaker in the LOCATA dataset, we also use the data of single moving speaker and assume the speaker is static during each one-second segment.
信号采样率为 16 kHz。STFT 采用 32 ms 的窗口长度和 16 ms 的帧移位进行。频率数 F 为 256。对于预训练，麦克风信号长度设置为 4 s，相应的时间帧数 N 为 256。输入 STFT 系数通过除以第一个麦克风通道幅度的平均值进行归一化。掩蔽的时间帧数为 128，即所有帧数的一半。对于下游任务，对于 TDOA 估计，麦克风信号长度设置为 1 s，对于其他任务，麦克风信号长度设置为 4 s。除了 LOCATA 数据集中的单个静态说话人数据外，我们还使用单个运动说话人的数据，并假设说话人在每个一秒的片段中都是静止的。

4) Model Configurations 4）模型配置

In the encoder, the setting of the convolution block is shown in Fig. 3. In Conformer blocks, the number of attention heads is 4, the kernel size of convolutional layers is 31, and the expansion factor of feed-forward layers is 4. For the spectral encoder, we use one Conformer block and the embedding dimension Dspec is set to 512. For the spatial encoder, we use three Conformer blocks and the embedding dimension Dspat is set to 256. The decoder has two FC layers, where the first layer is with 3072 hidden units and is activated by a rectified linear unit (ReLU), and the second layer outputs the reconstructed signal vector with a dimension of 2FM.
在编码器中，卷积块的设置如图 3 所示。在 Conformer 块中，注意力头的数量为 4，卷积层的核大小为 31，前馈层的扩展因子为 4。对于谱编码器，我们使用一个 Conformer 块，嵌入维度 Dspec 设置为 512。对于空间编码器，我们使用三个 Conformer 块，嵌入维度 Dspat 设置为 256。解码器具有两个 FC 层，其中第一层具有 3072 个隐藏单元，由整流线性单元（ReLU）激活，第二层输出重构信号向量，维度为 2FM 。

5) Training Details 5）培训详情

For self-supervised pre-training, the model is trained from scratch using simulated data in the simulated-data experiments, while in the real-data experiments, the model is initialized with the pre-trained model on simulated data and then trained using the real-world data. We found that the real-world training data (collected from 41 rooms) are not quite sufficient for pre-training, and initializing the model with the pre-trained model of simulated data is helpful for mitigating this problem. We use the Adam optimizer with an initial learning rate 0.001 and a cosine-decay learning rate scheduler. The batch size is set to 128. The maximum number of training epochs is 30. The best model is the one with the minimum validation loss.
对于自监督预训练，在模拟数据实验中，模型使用模拟数据从头开始训练；而在真实数据实验中，模型先使用模拟数据的预训练模型进行初始化，然后再使用真实数据进行训练。我们发现，真实世界的训练数据（收集自 41 个房间）不足以进行预训练，而使用模拟数据的预训练模型初始化模型有助于缓解这个问题。我们使用 Adam 优化器，初始学习率为 0.001，并采用余弦衰减学习率调度器。批量大小设置为 128。最大训练周期数为 30。最佳模型是验证损失最小的模型。

For downstream tasks, the pre-trained spatial encoder is fine-tuned using labeled data. The Adam optimizer is used for fine-tuning. The batch size is set to 8 for experiments on simulated data and 16 for experiments on real-world data. Fine-tuning the model with a small amount of labeled data is difficult and unstable in general [37], so we have carefully designed the fine-tuning scheme. The validation loss is recursively smoothed along the training epochs to reduce its fluctuations. The initial learning rate is divided by 10 when the smoothed validation loss does not descend with a patience of 10 epochs, and then the training is stopped when the smoothed validation loss does not decrease for another 10 epochs. For each task, we search the initial learning rate that achieves the smallest smoothed validation loss. The search range of the learning rate is [5e-5, 1e-4, 5e-4, 1e-3] for experiments on simulated data and [1e-4, 1e-3] for experiments on real-world data. We ensemble the models of the best epoch and its previous four epochs as the final model.
对于下游任务，使用标记数据对预训练的空间编码器进行微调。使用 Adam 优化器进行微调。对于模拟数据实验，批量大小设置为 8，对于真实数据实验，批量大小设置为 16。使用少量标记数据对模型进行微调通常很困难且不稳定 [37] ，因此我们精心设计了微调方案。验证损失会随着训练周期递归平滑，以减少其波动。如果在 10 个周期内平滑后的验证损失没有下降，则将初始学习率除以 10，然后，如果平滑后的验证损失在接下来的 10 个周期内没有下降，则停止训练。对于每个任务，我们搜索实现最小平滑验证损失的初始学习率。学习率的搜索范围：模拟数据实验 [5e-5, 1e-4, 5e-4, 1e-3]；真实数据实验 [1e-4, 1e-3]。我们将最优 epoch 及其前 4 个 epoch 的模型集成，作为最终模型。

6) Evaluation Metrics 6）评估指标

All the downstream tasks, i.e. TDOA, DRR, C50, T60 and mean absorption coefficient estimation, are evaluated with the mean absolute error (MAE) which computes the averaged absolute error between the estimated and ground-truth values over all the test signals.
所有下游任务，即 TDOA、DRR、 C50 、 T60 和平均吸收系数估计，都用平均绝对误差 (MAE) 进行评估，该误差计算所有测试信号的估计值和真实值之间的平均绝对误差。

B. Comparison With Fully Supervised Learning
B. 与完全监督学习的比较

As far as we know, this work is the first one to study the self-supervised learning of spatial acoustic information, and there are no self-supervised baseline methods to compare. Therefore, we compare the proposed self-supervised pre-training plus fine-tuning scheme with a fully supervised learning scheme. In the fully supervised learning scheme, we train the same network architecture as our downstream model (namely the spatial encoder followed by a mean pooling and a linear head) from scratch using labeled data specific to the downstream task. Training from scratch with a small amount of data is also challenging and unstable, and we employ the same training scheme as described earlier for fine-tuning. This comparison aims to demonstrate the effectiveness of the proposed self-supervised pre-training method.
据我们所知，这项工作是第一个研究空间声学信息的自监督学习的工作，并且没有自监督的基线方法可供比较。因此，我们将提出的自监督预训练加微调方案与完全监督学习方案进行比较。在完全监督学习方案中，我们使用针对下游任务的标记数据从头开始训练与下游模型相同的网络架构（即空间编码器，后接均值池化和线性头）。使用少量数据从头开始训练同样具有挑战性且不稳定，我们采用与前面描述的相同的训练方案进行微调。本次比较旨在证明所提出的自监督预训练方法的有效性。

1) Evaluation on Simulated Data
1）模拟数据评估

Fig. 4 shows the performance of five spatial acoustic parameter estimation tasks with the two learning schemes when using labeled data from various amounts of training rooms. It can be observed that the self-supervised setting outperforms the supervised setting under most conditions. This confirms that the spatial encoder learns spatial acoustic information in self-supervised pre-training. More specifically, the learned representation of relative RIR/CTF involves both the inter-channel information (used for TDOA estimation) and the temporal structure of RIR/CTF (used for DRR, C50, T60 and mean absorption coefficient estimation). With the increasing of training rooms (and training data), the MAEs for both self-supervised and supervised settings degrade for all tasks, and the advantage of pre-training becomes less prominent. This highlights the importance of training data diversity, in terms of room conditions, for deep-learning-based acoustic parameter estimation. The significant benefit of pre-training on small labeled datasets confirms that the proposed self-supervised method is promising in real-world applications where data annotation is challenging or resource-consuming. Among the five tasks, one exception is that pre-training is almost not helpful for C50 estimation, which is possibly because the spatial encoder cannot learn late reverberation well with the current pretext task.
图 4 展示了使用来自不同数量训练房间的标记数据，两种学习方案在五项空间声学参数估计任务中的表现。可以观察到，在大多数情况下，自监督设置的表现优于监督设置。这证实了空间编码器在自监督预训练中学习了空间声学信息。更具体地说，学习到的相对 RIR/CTF 表示既涉及通道间信息（用于 TDOA 估计），也涉及 RIR/CTF 的时间结构（用于 DRR、 C50 、 T60 和平均吸收系数估计）。随着训练房间（和训练数据）的增加，所有任务的自监督和监督设置的 MAE 都会下降，预训练的优势变得不那么明显。这凸显了房间条件方面的训练数据多样性对于基于深度学习的声学参数估计的重要性。在小型标注数据集上进行预训练的显著优势，证实了所提出的自监督方法在数据标注具有挑战性或资源耗费的实际应用中前景广阔。在五项任务中，有一个例外：预训练对 C50 估计几乎没有帮助，这可能是因为空间编码器在当前的借口任务中无法很好地学习后期混响。

$Fig. 4. - Results of TDOA, DRR, $T_{60}$, $C_{50}$ and absorption coefficient (ABS) estimation on the simulated dataset, for the proposed self-supervised pre-training plus fine-tuning method and the fully supervised training method, when using labeled data from different amounts of training rooms.$

Fig. 4. 图4.

Results of TDOA, DRR, T60, C50 and absorption coefficient (ABS) estimation on the simulated dataset, for the proposed self-supervised pre-training plus fine-tuning method and the fully supervised training method, when using labeled data from different amounts of training rooms.
对于所提出的自监督预训练加微调方法和完全监督训练方法，在使用来自不同数量训练室的标记数据时，对模拟数据集的 TDOA、DRR、 T60 、 C50 和吸收系数 (ABS) 进行估计的结果。

The training and test curves of three downstream tasks in self-supervised and supervised settings are illustrated in Fig. 5. Fine-tuning pre-trained models converges faster than training from scratch in general. Although the training losses of the two settings reach a similar level at the end, the test loss of the self-supervised setting is notably lower than the one of the supervised setting. This indicates that pre-training helps to reduce the generalization loss from training to test data.
图 5 展示了自监督和监督设置下三个下游任务的训练和测试曲线。通常，微调预训练模型比从头开始训练收敛得更快。虽然两种设置的训练损失最终达到了相似的水平，但自监督设置的测试损失明显低于监督设置。这表明预训练有助于降低从训练集到测试集的泛化损失。

$Fig. 5. - Learning curves (MAE versus training iteration) for TDOA, DRR and $T_{60}$ estimation on the simulated dataset, for the proposed fine-tuning scheme and the scheme of training from scratch. The number of training rooms is 8.$

Fig. 5. 图 5.

Learning curves (MAE versus training iteration) for TDOA, DRR and T60 estimation on the simulated dataset, for the proposed fine-tuning scheme and the scheme of training from scratch. The number of training rooms is 8.
针对模拟数据集、建议的微调方案和从头开始训练方案，绘制了 TDOA、DRR 和 T60 估计的学习曲线（MAE 与训练迭代的关系）。训练室数量为 8 个。

To evaluate how much information the proposed self-supervised pre-training method has learned, the performance of downstream tasks with four different settings are compared in Table III. Non-informative means the acoustic parameter prediction on the test data are simply set as a reasonable non-informative value, namely the mean value of the acoustic parameters of training data, which does not exploit any information from microphone signals of the test dataset. Pre-train plus linear evaluation means the pre-trained model is frozen and only a linear head is trained with downstream data. It can be seen that the linear evaluation setting achieves much better performance measures than the non-informative case, which demonstrates that the pre-trained model/feature indeed involves useful information for downstream tasks. By training/fine-tuning the whole network towards specific downstream tasks, the scratch and fine-tuning settings can better perform on downstream tasks. Although linear evaluation was once a standard way for evaluating the performance of self-supervised learning methods, it misses the opportunity to pursue strong but non-linear features, which is indeed a strength of deep learning [68]. Therefore, more self-supervised learning works put emphasis on the fine-tuning setting than linear evaluation, and we will also only evaluate the fine-tuning setting in the following.
为了评估所提出的自监督预训练方法学习了多少信息，表 III 比较了四种不同设置的下游任务性能。非信息性设置意味着测试数据的声学参数预测被简单地设置为一个合理的非信息性值，即训练数据声学参数的平均值，这不利用测试数据集麦克风信号中的任何信息。预训练加线性评估设置意味着预训练模型被冻结，只使用下游数据训练一个线性头。可以看出，线性评估设置的性能指标远优于非信息性设置，这表明预训练模型/特征确实包含对下游任务有用的信息。通过针对特定的下游任务训练/微调整个网络，初始和微调设置可以在下游任务中表现得更好。虽然线性评估曾经是评估自监督学习方法性能的标准方法，但它错失了追求强大但非线性特征的机会，而这恰恰是深度学习的优势之一 [68] 。因此，更多的自监督学习工作强调微调设置而不是线性评估，并且我们在下面也将只评估微调设置。

TABLE III Performance (MAE) With Different Training Settings on the Simulated Dataset
表三： 模拟数据集上不同训练设置下的性能（MAE）

To assess the impact of pre-training epochs/iterations on the performance of downstream tasks, we present the pre-training MSE and the performance of downstream tasks with different pre-training epochs/iterations in Table IV. It can be seen that the performance of downstream tasks is consistent with the pretext task to a large extent, namely the performance of downstream tasks can be improved when the pre-training loss is reduced. This property is very important for validating that the proposed pretext task is indeed learning information that can be transferred to downstream tasks.
为了评估预训练周期/迭代次数对下游任务性能的影响，我们在表 IV 中展示了预训练 MSE 以及不同预训练周期/迭代次数的下游任务性能。可以看出，下游任务的性能在很大程度上与借口任务一致，即当预训练损失降低时，下游任务的性能可以得到提升。这一特性对于验证所提出的借口任务确实能够学习到可以迁移到下游任务的信息至关重要。

TABLE IV Performance of the Proposed Method With Different Pre-Training epochs/iterations
表四所提方法在不同预训练阶段/迭代中的表现

2) Evaluation on Real-World Data
2）真实世界数据评估

We evaluate the proposed self-supervised method on real-world data to validate its effectiveness for practical applications. It is complicated to conduct real-data experiments mainly for two reasons. One is that we don't have a sufficient amount of real-world data for pre-training, despite the fact that self-supervised pre-training does not require any data annotation. As mentioned in Section V-A-5, we only use collected real-world data of 41 rooms for pre-training, which is not sufficient for fully pre-train the model. As indicated in Table IV, the performance of pre-training is closely related to the performance of downstream tasks, so we think the capability of pre-training may not be fully reflected in this experiment. The other reason is that for fine-tuning or training from scratch in the downstream tasks, it is not necessary to only use a small amount of real data, as a large amount of labeled simulated data can be easily obtained and used. Most DNN-based acoustic parameter estimation methods [6], [16], [17], [18], [19], [20], [21], [22] train the model (from scratch) using a large amount of labeled simulation data. Therefore, we conduct experiments of fine-tuning or training from scratch using three groups of data i) a limited number of real-world data; ii) a sufficiently large amount of simulated data generated from 1000 rooms; iii) both real-world and simulated data, and their importance weights are set to 0.5: 0.5. These three settings are evaluated in Fig. 6 and Table V.
我们在真实数据上评估了所提出的自监督方法，以验证其在实际应用中的有效性。进行真实数据实验很复杂，主要有两个原因。首先，尽管自监督预训练不需要任何数据注释，但我们没有足够数量的真实数据进行预训练。如第 V-A-5 节所述，我们仅使用收集的 41 个房间的真实数据进行预训练，这不足以完全预训练模型。如表 IV 所示，预训练的性能与下游任务的性能密切相关，因此我们认为预训练的能力可能无法在本次实验中得到充分体现。另一个原因是，对于下游任务中的微调或从头开始训练，不必仅使用少量的真实数据，因为可以轻松获取和使用大量带标记的模拟数据。大多数基于 DNN 的声学参数估计方法 [6] 、 [16] 、 [17] 、 [18] 、 [19] 、 [20] 、 [21] 、 [22] 都使用大量带标签的模拟数据（从头开始）训练模型。因此，我们使用三组数据进行微调或从头开始训练的实验：i）有限数量的真实世界数据；ii）由 1000 个房间生成的足够大量的模拟数据；iii）真实世界数据和模拟数据，并将它们的重要性权重设置为 0.5: 0.5。图 6 和表 V 评估了这三种设置。

TABLE V Performance of Spatial Acoustic Parameter Estimation on the Real-World Datasets
表五空间声学参数估计在真实世界数据集上的表现

$Fig. 6. - Learning curves (MAE versus training iteration) for TDOA, DRR and $T_{60}$ estimation on the real-world datasets, for the proposed fine-tuning scheme and the scheme of training from scratch. The downstream training data is set to real-world data, simulated data, and real-simulated mixed data, respectively. For DRR and $T_{60}$ estimation, the MAEs are the averaged results of 7-fold cross-validation.$

Fig. 6. 图6.

Learning curves (MAE versus training iteration) for TDOA, DRR and T60 estimation on the real-world datasets, for the proposed fine-tuning scheme and the scheme of training from scratch. The downstream training data is set to real-world data, simulated data, and real-simulated mixed data, respectively. For DRR and T60 estimation, the MAEs are the averaged results of 7-fold cross-validation.
在实际数据集上，针对所提出的微调方案和从头开始训练的方案，绘制了 TDOA、DRR 和 T60 估计的学习曲线（MAE 与训练迭代次数的关系）。下游训练数据分别设置为实际数据、模拟数据和实模混合数据。对于 DRR 和 T60 估计，MAE 是 7 倍交叉验证的平均结果。

The fine-tuning/training processes of three downstream tasks are illustrated in Fig. 6. Note that, different from the training scheme presented in Section V-A-5, to fully plot and analyze the training process in this figure, the learning rate is not reduced and the training is not stopped when it converges. When using only a small amount of real-world data, fine-tuning the pre-trained model converges rapidly for the DRR and T60 estimation tasks, while training from scratch needs more steps to converge. The performance gap between training and test for both self-supervised and supervised settings is very large, indicating that the models largely overfit to the small training dataset. The pre-training helps to mitigate the overfitting, and hence the self-supervised setting achieves better test performance. When using only simulated data, the performance gaps between training and test vary among the three tasks. TDOA estimation on has the largest gap while T60 estimation has the smallest gap. As the quantity and diversity of training data are very large, these performance gaps reflect the simulation-to-reality generalization loss, and different tasks have different losses. Compared with using real-world data, the performance of TDOA estimation gets worse, the performance of T60 estimation gets better, and the performance of DRR estimation stays similar. When using only simulated data, the advantage of the self-supervised setting becomes less prominent, possibly due to the data inconsistency between real-data pre-training and simulated-data fine-tuning. When using real-simulated mixed data, we hope to combine the advantages of real data and simulated data. Roughly speaking, for each task and for both self-supervised and supervised settings, the real-simulated mixed data can achieve a performance level similar to the better one of pure real data and pure simulated data, but it is hard to surpass the better one.
图 6 展示了三个下游任务的微调/训练过程。需要注意的是，与 V-A-5 节中介绍的训练方案不同，为了完整地绘制和分析此图中的训练过程，学习率不会降低，训练收敛后也不会停止。当仅使用少量真实数据时，对于 DRR 和 T60 估计任务，对预训练模型进行微调可以快速收敛，而从头开始训练则需要更多步骤才能收敛。在自监督和监督设置下，训练和测试之间的性能差距非常大，这表明模型在小型训练数据集上严重过拟合。预训练有助于缓解过拟合，因此自监督设置实现了更好的测试性能。当仅使用模拟数据时，三个任务之间的训练和测试之间的性能差距有所不同。TDOA 估计的差距最大，而 T60 估计的差距最小。由于训练数据的数量和多样性非常大，这些性能差距反映了模拟到现实的泛化损失，并且不同任务的损失不同。与使用真实数据相比，TDOA 估计的性能变差， T60 估计的性能变好，而 DRR 估计的性能保持不变。当仅使用模拟数据时，自监督设置的优势变得不那么突出，这可能是由于真实数据预训练和模拟数据微调之间的数据不一致造成的。当使用真实-模拟混合数据时，我们希望能够结合真实数据和模拟数据的优势。粗略地说，对于每个任务以及对于自监督和监督设置，真实模拟混合数据可以达到与纯真实数据和纯模拟数据中较好的一个相似的性能水平，但很难超越较好的一个。

Table V shows the final performance of the five tasks using the training scheme described in Section V-A-5. As for the supervised case, one extra setting is added, namely supervised training with simulated data and then supervised fine-tuning with real-world data (denoted as Simulated (+ real-FT)). In addition, some conventional methods are also compared, including GCC-PHAT [69] for TDOA estimation, one blind DRR estimation method [70] and one blind T60 estimation method [71].
表 V 展示了使用第 V-A-5 节所述训练方案的五项任务的最终性能。对于监督学习的情况，增加了一个额外设置，即使用模拟数据进行监督训练，然后使用真实数据进行监督微调（记为“模拟（+真实-FT）”）。此外，还比较了一些传统方法，包括用于 TDOA 估计的 GCC-PHAT [69] 、一种盲 DRR 估计方法 [70] 和一种盲 T60 估计方法 [71] 。

The best performance is highlighted in bold for each task. Compared with the supervised setting, the proposed self-supervised setting wins on estimating TDOA, T60 and absorption coefficient (ABS), and loses on estimating DRR and C50. Overall, these results on (limited amount of pre-training) real-world data are still promising for showing the effectiveness of self-supervised learning of spatial acoustic information. We think that it is possible to further improve the capability of self-supervised pre-training when we can record/collect more real data for pre-training, which is not very difficult as it does not require any data annotation.
每项任务的最佳表现已以粗体突出显示。与监督设置相比，所提出的自监督设置在估计 TDOA、 T60 和吸收系数 (ABS) 方面胜出，但在估计 DRR 和 C50 方面则逊色。总体而言，这些基于（有限数量的预训练）真实世界数据的结果仍然有望展现自监督学习空间声学信息的有效性。我们认为，如果我们能够记录/收集更多真实数据用于预训练，则有可能进一步提升自监督预训练的能力，这并不难，因为它不需要任何数据注释。

Compared to conventional methods, the best-performed learning-based models achieve much better performance, which demonstrates the superiority of deep learning for acoustic parameter estimation if the network can be properly trained.
与传统方法相比，表现最佳的基于学习的模型取得了更好的性能，证明了如果网络能够得到适当的训练，深度学习对于声学参数估计的优势。

C. Ablation Study C. 消融研究

We conduct some ablation experiments to evaluate the effectiveness of each component of the proposed method. Since existing real-world datasets lack diversity in room conditions, we perform ablation studies on the simulated dataset for better analysis. The number of simulated training rooms is set to 8 and four trials are performed for each experiment setting unless otherwise stated. Three representative downstream tasks are mainly considered, namely the estimation of TDOA, DRR and T60, which depends on the information of direct path, both direct and reflective paths, and reflective paths, respectively.
我们进行了一些消融实验，以评估所提方法各组成部分的有效性。由于现有的真实数据集在房间条件下缺乏多样性，因此我们对模拟数据集进行了消融研究，以便更好地进行分析。除非另有说明，模拟训练房间的数量设置为 8，并且每个实验设置进行 4 次试验。主要考虑三个具有代表性的下游任务，即 TDOA、DRR 和 T60 的估计，这三个任务分别依赖于直接路径、直接路径和反射路径以及反射路径的信息。

1) Influence of Masking Rate and Comparison With Patch-Wise Scheme
1）掩蔽率的影响及与 Patch-Wise 方案的比较

Table VI shows the results of three different masking rates, i.e. 25%, 50% and 75%. The pre-training MSE becomes larger with the increase of the masking rate, which is reasonable as reconstructing more frames is more difficult. However, the performance of downstream tasks is comparable for the three masking rates. The masking rate 75% achieves slightly better TDOA performance, while the masking rate 50% provides slightly better DRR and T60 performance. Overall, the performance of downstream tasks is not very sensitive to the three values of masking rates, and hence we set the masking rate to the median value 50% in other experiments.
表 VI 展示了三种不同掩蔽率（即 25%、50% 和 75%）的结果。随着掩蔽率的增加，预训练 MSE 会变大，这在情理之中，因为重建更多帧会更加困难。然而，对于这三种掩蔽率，下游任务的性能表现相当。75% 的掩蔽率实现了略优的 TDOA 性能，而 50% 的掩蔽率实现了略优的 DRR 和 T60 性能。总体而言，下游任务的性能对三种掩蔽率值不太敏感，因此我们在其他实验中将掩蔽率设置为中间值 50%。

TABLE VI Performance for the Proposed Method With Different Masking Rates, and With Frame-Wise and Patch-Wise Schemes
表 VI 所提方法在不同掩蔽率、逐帧和逐块方案下的性能

In many audio spectral pattern learning works [38], [39], the so-called patch-wise scheme outperforms the frame-wise scheme on some downstream tasks, so we also test the patch-wise scheme. Patch-wise means the STFT coefficients are split into patches along the time and frequency axes, and the patches are ranked as a sequence and fed into the Conformer network. In this experiment, the 256 frames × 256 frequencies are split into 16 × 16 patches. Note that the frame-wise scheme can be considered as 256 × 1 patches. The results of the patch-wise scheme with 50% masking rate are also shown in Table VI. It can be seen that the pretext task with the patch-wise setting is much more challenging, possibly due to that it is difficult to reconstruct 16 continuous frames. For downstream tasks, the patch-wise scheme shows better performance on DRR estimation while worse performance on T60 estimation. The reason may be that the frame-wise scheme provides a higher temporal resolution and preserves a finer acoustic reflection structure, which is crucial for T60 estimation.
在许多音频频谱模式学习工作 [38] ， [39] 中，所谓的 patch-wise 方案在某些下游任务上优于 frame-wise 方案，因此我们也测试了 patch-wise 方案。patch-wise 意味着将 STFT 系数沿时间和频率轴分成 patch，并将 patch 按序列排列并输入到 Conformer 网络中。在本实验中，256 帧×256 个频率被分成 16×16 个 patch。注意，frame-wise 方案可以被认为是 256×1 个 patch。表 VI 中还显示了具有 50％掩蔽率的 patch-wise 方案的结果。可以看出，采用 patch-wise 设置的借口任务更具挑战性，可能是因为难以重建 16 个连续帧。对于下游任务，patch-wise 方案在 DRR 估计上表现出色，但在 T60 估计上表现较差。原因可能是逐帧方案提供了更高的时间分辨率并保留了更精细的声学反射结构，这对于 T60 估计至关重要。

2) Contribution of Spectral Encoder
2）光谱编码器的贡献

To evaluate the contribution of the spectral encoder, we conduct experiments using the spectral encoder or not in both pretext and downstream tasks. When using only the spatial encoder for the pretext task, the masking scheme given in (5) is used, and the encoder is actually required to learn both spatial and spectral information for signal reconstruction. The experimental results are shown in Table VII. As a baseline, the performance of training from scratch (namely w/o the pretext encoder) is also given. Compared with using two encoders for the pretext task and the spatial encoder for downstream tasks, using only one encoder for the pretext task achieves much worse performance on TDOA and T60 estimation. When using two encoders for the pretext task and only the spectral encoder for downstream tasks, similar performance measures are achieved compared with training from scratch. This suggests that the pre-trained spectral encoder does not learn much spatial acoustic information. Similarly, using both the spectral and spatial encoders for downstream tasks only brings a negligible performance improvement compared to using only the spatial encoder. Overall, incorporating the spectral encoder in the pretext task to disentangle the learning of signal content makes the spatial encoder more focused on the learning of spatial acoustic information, and hence improves the performance of downstream tasks.
为了评估频谱编码器的贡献，我们在借口任务和下游任务中分别进行了使用和不使用频谱编码器的实验。当仅使用空间编码器进行借口任务时，采用 (5) 中给出的掩蔽方案，编码器实际上需要学习空间和频谱信息才能重建信号。实验结果如表 VII 所示。作为基准，我们还给出了从头开始训练（即不使用借口编码器）的性能。与使用两个编码器进行借口任务并使用空间编码器进行下游任务相比，仅使用一个编码器进行借口任务在 TDOA 和 T60 估计方面的表现要差得多。当使用两个编码器进行借口任务并使用频谱编码器进行下游任务时，与从头开始训练相比，获得了相似的性能指标。这表明预训练的频谱编码器没有学习太多的空间声学信息。同样，与仅使用空间编码器相比，同时使用频谱和空间编码器进行下游任务的性能提升微乎其微。总的来说，在借口任务中加入频谱编码器来解开信号内容的学习，使得空间编码器更加专注于空间声学信息的学习，从而提高下游任务的性能。

TABLE VII Performance of the Proposed Method With and Without Using the Spectral Encoder in Pretext and Downstream Tasks
表 VII 在前置任务和下游任务中使用和不使用谱编码器时所提方法的性能

3) Comparison of Encoder Model Architectures
3）编码器模型架构比较

To demonstrate the effectiveness of the proposed MC-Conformer architecture for both pretext and downstream tasks, we compare the performance of five encoder architectures including CRNN, Transformer, Conformer, CNN+Transformer and CNN+Conformer (namely MC-Conformer).
为了证明所提出的 MC-Conformer 架构对于借口任务和下游任务的有效性，我们比较了五种编码器架构的性能，包括 CRNN、Transformer、Conformer、CNN+Transformer 和 CNN+Conformer（即 MC-Conformer）。

Fig. 7. - Model architecture of CRNN-based spectral/spatial encoder.

Fig. 7. 图 7.

Model architecture of CRNN-based spectral/spatial encoder.
基于 CRNN 的光谱/空间编码器的模型架构。

CRNN is chosen as CNN and RNN are commonly adopted by spatial acoustic parameter estimation works [6], [10], [11], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The architecture of the CRNN-based encoder is shown in Fig. 7. It consists of a convolution block to process local TF information and a recurrent block to obtain global TF information. For the spatial encoder, L=4, [c0,…,c4]=[16,16,32,64,128] and [b0,…,b4]=[1,1,4,4,4]. For the spectral encoder, L=2, [c0,c1,c2]=[32,32,64] and [b0,b1,b2]=[1,4,4] . These hyper-parameters have been well tuned to improve the performance of downstream tasks.
选择 CRNN 是因为 CNN 和 RNN 是空间声学参数估计工作中常用的 [6] 、 [10] 、 [11] 、 [16] 、 [17] 、 [18] 、 [19] 、 [20] 、 [21] 、 [22] 、 [23] 、 [24] 、 [25] 、 [26] 、 [27] 、 [28] 。基于 CRNN 的编码器的架构如图 7 所示。它由一个用于处理局部 TF 信息的卷积块和一个用于获取全局 TF 信息的循环块组成。对于空间编码器， L=4 、 [c0,…,c4]=[16,16,32,64,128] 和 [b0,…,b4]=[1,1,4,4,4] 。对于谱编码器， L=2 、 [c0,c1,c2]=[32,32,64] 和 [b0,b1,b2]=[1,4,4] 。这些超参数已经过精心调整，可以提高下游任务的性能。
Transformer [46] and Conformer [14] are widely used in audio/speech processing and self-supervised audio spectrogram learning [38], [39], [44]. We evaluate four types of architectures, i.e., Transformer, Conformer, CNN+Transformer and CNN+Conformer (namely MC-Conformer). The network configurations, including the number of attention heads, the embedding dimension, the number of Transformer/Conformer blocks and the CNN configurations, are set to be the same as the proposed MC-Conformer.
Transformer [46] 和 Conformer [14] 广泛应用于音频/语音处理以及自监督音频频谱图学习 [38] 、 [39] 、 [44] 。我们评估了四种类型的架构，即 Transformer、Conformer、CNN+Transformer 和 CNN+Conformer（简称 MC-Conformer）。网络配置（包括注意力头数量、嵌入维度、Transformer/Conformer 块数量以及 CNN 配置）均与所提出的 MC-Conformer 相同。

The experimental results are shown in Table VIII. It can be observed that the proposed MC-Conformer outperforms other model architectures in both the pretext and downstream tasks. Transformer alone performs poorly, and once combined with CNN layers (CNN+Transfomer or Conformer) the performance measures improve significantly. This confirms that CNN is crucial and necessary for capturing local spatial acoustic information. Compared with Conformer, the performance improvement of the proposed CNN+Conformer indicates that using 2D CNNs to pre-process the raw STFT coefficients is very important. Finally, based on these comparisons, we can conclude that CRNN is a strong network architecture for learning spatial acoustic information, while RNN can be replaced with Conformer to better learn long-term dependencies.
实验结果如表 VIII 所示。可以观察到，所提出的 MC-Conformer 在前置任务和下游任务中均优于其他模型架构。Transformer 单独表现不佳，与 CNN 层（CNN+Transfomer 或 Conformer）结合使用时，性能指标显著提升。这证实了 CNN 对于捕捉局部空间声学信息至关重要且必不可少。与 Conformer 相比，所提出的 CNN+Conformer 的性能提升表明使用 2D CNN 对原始 STFT 系数进行预处理非常重要。最后，基于这些比较，我们可以得出结论：CRNN 是一种强大的学习空间声学信息的网络架构，而可以用 Conformer 替代 RNN 来更好地学习长期依赖关系。

TABLE VIII Performance of the Proposed Method With Different Encoder Model Architectures
表八所提方法在不同编码器模型架构下的性能

D. Qualitative Experiments
D.定性实验

Fig. 8 provides an example of the reconstructed signal. It can be seen that the main structure of masked frames is well reconstructed. However, compared with the target signal, the reconstructed signal seems less blurred by reverberation, which is possibly due to that late reverberation has not been well reconstructed. Late reverberation is spatially diffuse with a low spatial correlation, making it more challenging to reconstruct the late reverberation of one channel from that of the other channel. This may be related to the phenomenon that pre-training does not help the estimation of C50.
图 8 提供了重构信号的示例。可以看出，蒙版帧的主要结构已得到很好的重构。然而，与目标信号相比，重构信号似乎没有受到混响的影响，这可能是因为后期混响未能得到很好的重构。后期混响在空间上具有弥散性，空间相关性较低，因此从某个通道的后期混响中重构出另一个通道的后期混响更加困难。这可能与预训练对 C50 的估计没有帮助这一现象有关。

Fig. 8. - An example of the masked input, the reconstructed signal and the target signal. The reverberation time is 1 s, and the SNR is 20 dB.

Fig. 8. 图8。

An example of the masked input, the reconstructed signal and the target signal. The reverberation time is 1 s, and the SNR is 20 dB.
掩蔽输入、重构信号和目标信号的示例。混响时间为 1 秒，信噪比为 20 dB。

Fig. 9 visualizes the learned representations (the hidden vectors after mean pooling) of downstream tasks. Compared with training from scratch, fine-tuning the pre-trained model obtains fewer outliers and presents a much smoother and discriminative manifold. For example, when training from scratch, it is hard to discriminate between the red and yellow points for T60 estimation, and between the dark blue and light blue points for DRR estimation. In contrast, they are well discriminated in the fine-tuning results.
图 9 可视化了下游任务学习到的表征（均值池化后的隐藏向量）。与从头开始训练相比，对预训练模型进行微调可以减少异常值，并呈现出更平滑、更具判别性的流形。例如，从头开始训练时，很难区分 T60 估计中的红点和黄点，以及 DRR 估计中的深蓝色点和浅蓝色点。相比之下，在微调结果中，它们得到了很好的区分。

Fig. 9. - Visualization of the learned representations for three downstream tasks. The number of training rooms is 8. The number of test rooms is 20. The representation extracted after the mean pooling layer from all test data is visualized with the t-SNE technique [72]. The gray histograms show the statistics of the values of acoustic parameters in test data.

Fig. 9. 图9。

Visualization of the learned representations for three downstream tasks. The number of training rooms is 8. The number of test rooms is 20. The representation extracted after the mean pooling layer from all test data is visualized with the t-SNE technique [72]. The gray histograms show the statistics of the values of acoustic parameters in test data.
三个下游任务的学习表征可视化。训练房间数量为 8，测试房间数量为 20。使用 t-SNE 技术 [72] 对从所有测试数据中经过均值池化层提取的表征进行可视化。灰色直方图显示了测试数据中声学参数值的统计信息。

SECTION VI. 第六部分

Conclusion 结论

This paper proposes a self-supervised method to learn a universal spatial acoustic representation from dual-channel unlabeled microphone signals. With the designed cross-channel signal reconstruction (CCSR) pretext task, the pretext model is forced to separately learn the spatial acoustic and the spectral pattern information. The dual-encoder plus decoder structure adopted by the pretext task facilitates the disentanglement of the two types of information. In addition, a novel multi-channel Conformer (MC-Conformer) is utilized to learn the local and global properties of spatial acoustics present in the time-frequency domain, which can boost the performance of both pretext and downstream tasks. Experiments conducted on both simulated and real-world data verify that the proposed self-supervised pre-training model learns useful knowledge that can be transferred to the spatial acoustics-related tasks including TDOA, DRR, T60, C50 and mean absorption coefficient estimation. Overall, this work demonstrates the feasibility of learning spatial acoustic information in a self-supervised manner for the first time. Hopefully, this will open a new door for the research of spatial acoustic parameter estimation.
本文提出了一种自监督方法，用于从双通道无标记麦克风信号中学习通用的空间声学表征。通过设计的跨通道信号重构（CCSR）借口任务，借口模型被迫分别学习空间声学和频谱模式信息。借口任务采用的双编码器+解码器结构有助于分离这两类信息。此外，本文还利用一种新颖的多通道 Conformer（MC-Conformer）来学习时频域中空间声学的局部和全局特性，从而提升借口任务和后续任务的性能。在模拟数据和真实数据上进行的实验验证了所提出的自监督预训练模型学习到的有用知识，这些知识可以迁移到与空间声学相关的任务中，包括 TDOA、DRR、 T60 、 C50 和平均吸收系数估计。总而言之，本文首次证明了以自监督方式学习空间声学信息的可行性。希望这能为空间声学参数估计的研究打开一扇新的大门。

This work mainly focuses on learning spatial acoustic information from dual-channel microphone signals recorded in high-SNR environments with a single static speaker. This acoustic setting can be satisfied in many real-world indoor scenes. There are several potential directions for future extensions and improvements. For instance, the considered acoustic condition can be more dynamic and complex, and the joint learning of spatial and spectral cues can be further explored.How to extend the proposed method for more than two channels also needs further investigation.
本研究主要致力于从高信噪比环境下单个静态扬声器录制的双通道麦克风信号中学习空间声学信息。这种声学设置在许多现实世界的室内场景中都能得到满足。未来还有几个潜在的扩展和改进方向。例如，考虑的声学条件可以更加动态和复杂，并且可以进一步探索空间和频谱线索的联合学习。如何将所提出的方法扩展到双通道以上也需要进一步研究。

References is not available for this document.

Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer具有跨通道信号重建和多通道一致性的空间声学表征自监督学习

Alerts

Abstract:

Metadata

Abstract: 抽象的：

ISSN Information: ISSN 信息：

Introduction 介绍

1) Self-Supervised Learning of Spatial Acoustic Representation (SSL-SAR)1）空间声学表征的自监督学习（SSL-SAR）

2) Multi-Channel Audio Conformer (MC-Conformer)2）多通道音频适配器（MC-Conformer）

Related Works 相关作品

A. Deep-Learning-Based Spatial Acoustic Parameter EstimationA.基于深度学习的空间声学参数估计

B. Audio Self-Supervised Representation LearningB. 音频自监督表征学习

Problem Formulation 问题表述

Self-Supervised Learning of Spatial Acoustic Representation空间声学表征的自监督学习

A. Pretext Task: Cross-Channel Signal ReconstructionA. 借口任务：跨通道信号重建

1) Reconstruction Framework1）重建框架

2) Encoder-Decoder Structure2）编码器-解码器结构

3) Masking Scheme 3）掩蔽方案

B. Downstream Tasks: Spatial Acoustic Parameter EstimationB.下游任务：空间声学参数估计

C. Encoder Model: Multi-Channel Audio ConformerC.编码器模型：多通道音频转换器

Experiments and Discussions实验与讨论

A. Experimental Setup A. 实验设置

1) Simulated Dataset 1）模拟数据集

2) Real-World Datasets 2）真实世界数据集

3) Parameter Settings 3）参数设置

4) Model Configurations 4）模型配置

5) Training Details 5）培训详情

6) Evaluation Metrics 6）评估指标

B. Comparison With Fully Supervised LearningB. 与完全监督学习的比较

1) Evaluation on Simulated Data1）模拟数据评估

2) Evaluation on Real-World Data2）真实世界数据评估

C. Ablation Study C. 消融研究

1) Influence of Masking Rate and Comparison With Patch-Wise Scheme1）掩蔽率的影响及与 Patch-Wise 方案的比较

2) Contribution of Spectral Encoder2）光谱编码器的贡献

3) Comparison of Encoder Model Architectures3）编码器模型架构比较

D. Qualitative ExperimentsD.定性实验

Conclusion 结论

Authors 作者

Figures 数字

References 参考

Keywords 关键词

Metrics 指标

Footnotes 脚注

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer
具有跨通道信号重建和多通道一致性的空间声学表征自监督学习

1) Self-Supervised Learning of Spatial Acoustic Representation (SSL-SAR)
1）空间声学表征的自监督学习（SSL-SAR）

2) Multi-Channel Audio Conformer (MC-Conformer)
2）多通道音频适配器（MC-Conformer）

A. Deep-Learning-Based Spatial Acoustic Parameter Estimation
A.基于深度学习的空间声学参数估计

B. Audio Self-Supervised Representation Learning
B. 音频自监督表征学习

Self-Supervised Learning of Spatial Acoustic Representation
空间声学表征的自监督学习

A. Pretext Task: Cross-Channel Signal Reconstruction
A. 借口任务：跨通道信号重建

1) Reconstruction Framework
1）重建框架

2) Encoder-Decoder Structure
2）编码器-解码器结构

B. Downstream Tasks: Spatial Acoustic Parameter Estimation
B.下游任务：空间声学参数估计

C. Encoder Model: Multi-Channel Audio Conformer
C.编码器模型：多通道音频转换器

Experiments and Discussions
实验与讨论

B. Comparison With Fully Supervised Learning
B. 与完全监督学习的比较

1) Evaluation on Simulated Data
1）模拟数据评估

2) Evaluation on Real-World Data
2）真实世界数据评估

1) Influence of Masking Rate and Comparison With Patch-Wise Scheme
1）掩蔽率的影响及与 Patch-Wise 方案的比较

2) Contribution of Spectral Encoder
2）光谱编码器的贡献

3) Comparison of Encoder Model Architectures
3）编码器模型架构比较

D. Qualitative Experiments
D.定性实验