End-to-End Speech-to-Text Translation: A Survey
Abstract
Speech-to-Text (ST) translation pertains to the task of converting speech signals in one language to text in another language. It finds application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR) and Machine Translation (MT) models play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such integrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey therefore discusses the works in this direction. We provide a comprehensive review of the models, metrics, and datasets used for ST tasks, and discuss challenges and future research directions with new insights. We believe this review will be helpful to researchers working on various applications of ST models.
Keywords: Speech-to-Text Translation, Automatic Speech Recognition, Machine Translation, Modality Bridging
† Journal: Computer Speech and Language
Indian Institute of Technology Indore, India
1 Introduction
The Speech-to-Text (ST) translation task aims to convert a speech in one language into text in another language. It finds its applications in various areas such as automatic subtitling, dictations, video lecture translations, tourism, telephone conversations, to name a few. There are many facets under which the ST problem can be cast. For example, are we performing ST translation online (aka simultaneous translation) or offline? The former is required in live video streaming, while the latter is helpful for movies where some latency may be allowed. The ST problem is further exacerbated by noisy inputs, low-resource/code-mix languages, and the presence of multiple speakers.
Figure 1: Timeline of E2E ST model development. Models marked in blue correspond to the streaming models discussed in §7.1.2. Note: only representative models are listed.
Historically, the ST problem has been solved by pipelining ASR and MT models together, where ASR models take speech in a source language as input and generate the transcript, whereas MT models translate the transcript into the target language. Such a cascade model suffers from problems like error propagation and higher training and inference latency. Therefore, the current trend in developing ST models is toward the E2E system, which is defined as
Definition 1
A unified E2E ST model is implemented, facilitating combined training and recognition processes aimed at consistently reducing the anticipated error rate, thereby bypassing the need for independently acquired sources of knowledge.
Therefore, the main goal of the E2E ST model is to achieve a reduced error rate, with secondary objectives potentially including decreased training/inference duration and memory usage.
There has been a lot of work building E2E ST models (as shown in fig. 1), datasets, and metrics in recent years. However, a systematic and comprehensive review of E2E ST works is missing. We note that a review paper (Xu et al., 2023b) on ST was published recently. That review categorizes existing works mainly based on modeling, data, and application issues. It does not cover the datasets available for ST tasks, nor does it provide insights into cascade vs. E2E model performance, and the future open problems it identifies are limited. On the other hand, our work comprehensively reviews the existing models for ST tasks, evaluation methods, metrics, and datasets from a completely different perspective and critically analyzes the existing works; after that, we identify several challenges and future research directions. Thus, our work may be deemed complementary to (Xu et al., 2023b).
Figure 2: Organization of the survey.
This review is structured following the taxonomy in fig. 2. In §2, we establish the foundation of the ST task through a formal definition, and we subsequently delve into the various metrics and loss functions adopted by different researchers in §3. A comparative discussion between cascade and end-to-end models is presented in §4. Training of E2E ST models suffers from data issues; how to combat them is elaborated in §5. Speech and text segmentation and representation, an important task in ST model development, is discussed in §6. In §7, we delve into the strategies employed to tackle the ST problem and categorize these approaches based on the frameworks utilized and the characteristics of the data involved. Data and toolkits required for ST modeling are discussed in §9. Finally, in §10, we explore the prospects for future research and open problems within the field.
2 Background
This section describes the ST task formally and presents the loss functions and evaluation metrics commonly employed to optimize ST models.
2.1 Task Definition
The ST task can be defined as translating given input speech $\mathbf{x}$ in one language into translated text $\mathbf{y}$ in another language, optionally together with the transcription text $\mathbf{z}$. Formally, it is defined as follows: given a dataset $\mathcal{D}=\{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$ of pairs of input speech features $\mathbf{x}=(x_1,\dots,x_S)$ in one language and output text tokens $\mathbf{y}=(y_1,\dots,y_T)$ in a different language, the objective of the ST task is to maximize the conditional probability given below:
$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, \mathbf{x}; \theta)$   (1)
In the above equation, $S$, $T$, and $\theta$ are the length of the input feature sequence, the number of output tokens, and the model parameters, respectively. Note that the problem formulation given in (1) is for autoregressive (AR) models (non-autoregressive (NAR) models are an alternative modeling approach proposed in the past few years for the ST task; only a sparse number of works exist in the literature, and we discuss NAR briefly in §7.1). Usually, it is assumed that there are $N$ parallel speech-text pairs in the corpus, and the model is optimized for the negative log-likelihood over these pairs as
$\mathcal{L}(\theta) = -\sum_{n=1}^{N} \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \theta)$   (2)
The above optimization is usually solved using an encoder-decoder with attention. Essentially, an encoder maps the speech input $\mathbf{x}$ to a hidden state representation $\mathbf{h}$, followed by a decoder that consumes the previously generated text tokens $y_{<t}$, the encoder hidden states $\mathbf{h}$, and an attention vector (Vaswani et al., 2017). Offline ST translation can look at the whole speech before producing output text tokens, whereas streaming ST starts translating from a partial speech signal.
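To make the formulation concrete, the following is a minimal PyTorch sketch of one training step for an autoregressive encoder-decoder ST model optimized with the negative log-likelihood in (2). The module layout, tensor shapes, and hyperparameters are illustrative assumptions rather than any specific published architecture.

```python
import torch
import torch.nn as nn

class Seq2SeqST(nn.Module):
    """Illustrative AR encoder-decoder for ST; all sizes are assumptions."""
    def __init__(self, feat_dim=80, vocab_size=8000, d_model=256):
        super().__init__()
        # convolutional subsampling shrinks the long speech sequence
        self.subsample = nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2, padding=1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, prev_tokens):
        # speech_feats: (B, S, feat_dim) filterbank frames; prev_tokens: (B, T) shifted targets
        h = self.subsample(speech_feats.transpose(1, 2)).transpose(1, 2)
        memory = self.encoder(h)                      # encoder hidden states h
        causal = nn.Transformer.generate_square_subsequent_mask(prev_tokens.size(1))
        dec = self.decoder(self.embed(prev_tokens), memory, tgt_mask=causal)
        return self.out(dec)                          # (B, T, vocab) logits

# One NLL training step over a toy batch of parallel speech-text pairs (eq. 2).
model = Seq2SeqST()
criterion = nn.CrossEntropyLoss(ignore_index=0)       # 0 assumed to be the padding id
speech = torch.randn(2, 200, 80)                      # dummy 80-dim filterbank features
targets = torch.randint(1, 8000, (2, 20))             # dummy target token ids
logits = model(speech, targets[:, :-1])               # teacher forcing with y_{<t}
loss = criterion(logits.reshape(-1, logits.size(-1)), targets[:, 1:].reshape(-1))
loss.backward()
```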
3 Evaluation Metrics
This section discusses various metrics used to evaluate the E2E ST models. The metrics to evaluate E2E ST models are categorized into two types: quality and latency. The quality of the E2E ST models is the measure of how close the ST translation is to the target sentence. The latency is the time elapsed between the pronunciation of a word and the generation of its textual translation.
3.1 Quality-based metrics
The quality-based metrics measure how close the translation is to the target sentence. Most of the existing literature evaluates these scores on detokenized output, i.e., the string formed by combining the tokens. Standard metrics for evaluating ST task performance are the commonly used MT evaluation metrics such as Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Translation Error Rate (TER) (Snover et al., 2006) via sacreBLEU, Metric for Evaluation of Translation with Explicit word Ordering (METEOR) (Banerjee and Lavie, 2005), and the CHaRacter-level F-score (chrF) and chrF++ (Popović, 2015). Recently, BERTScore has shown promising agreement with human evaluations. BERTScore (Zhang et al., 2019) is an automatic evaluation metric that scores the similarity between the translated text and the reference text, taking into account recall, precision, and F-score. There are a few other evaluation metrics, such as TRANSTAC (Schlenoff et al., 2009), which are less frequently reported.
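As an illustration, several of these quality metrics are available in the sacreBLEU toolkit and can be computed directly on detokenized output; the hypothesis and reference strings below are placeholders.

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sits on the mat"]            # detokenized system outputs
references = [["the cat is sitting on the mat"]]    # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU={bleu.score:.2f}  chrF={chrf.score:.2f}  TER={ter.score:.2f}")
```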
3.2 Latency-based metrics
For streaming ST tasks, researchers additionally report latency metrics, where latency is defined as the delay incurred in starting to produce the translation. Let $\mathbf{x}$, $\mathbf{y}$, and $\hat{\mathbf{y}}$ denote the input speech sequence, the ground-truth text sequence, and the system-generated hypothesis sequence, respectively. In the streaming ST task, models produce output with partial input. Suppose the prefix $x_1,\dots,x_j$ has been read when generating $\hat{y}_i$; the delay of $\hat{y}_i$ is then defined as (Ma et al., 2020a)
$d_i = \sum_{k=1}^{j} T_k$   (3)
where $T_k$ is the duration of the speech frame $x_k$. The latency metrics below are then computed from the resulting sequence of delays $d_1, d_2, \dots, d_{|\hat{\mathbf{y}}|}$; a small sketch computing AP and AL from such a delay sequence is given after the list.
1. Average Proportion (AP) (Cho and Esipova, 2016a) calculates the mean fraction of the source input that is read during the target prediction generating process. Writing $T_{\mathbf{x}} = \sum_{k} T_k$ for the total source duration,

$\mathrm{AP} = \frac{1}{|\hat{\mathbf{y}}| \, T_{\mathbf{x}}} \sum_{i=1}^{|\hat{\mathbf{y}}|} d_i$   (4)
2. Average Lagging (AL) measures how far, in terms of the amount of source consumed, the system output lags behind an ideal policy that is perfectly in sync with the speaker (Ma et al., 2018):

$\mathrm{AL} = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( d_i - d_i^{*} \right)$   (5)

where $\tau = \min\{ i \mid d_i = T_{\mathbf{x}} \}$ is the index of the first target token emitted after the full source has been read, and $d_i^{*}$ are the delays of an ideal policy defined as (Ma et al., 2020a)

$d_i^{*} = (i-1) \cdot \frac{T_{\mathbf{x}}}{|\mathbf{y}|}$   (6)
3. Differentiable Average Lagging (DAL): one issue with AL is that it is not differentiable because of the cut-off $\tau$. To solve this, (Cherry and Foster, 2019) introduce a minimum delay of $T_{\mathbf{x}}/|\mathbf{y}|$ (one token's worth of source) after each operation and define DAL as

$\mathrm{DAL} = \frac{1}{|\hat{\mathbf{y}}|} \sum_{i=1}^{|\hat{\mathbf{y}}|} \left( d_i' - (i-1) \cdot \frac{T_{\mathbf{x}}}{|\mathbf{y}|} \right)$   (7)

where

$d_i' = \begin{cases} d_i & i = 1 \\ \max\left( d_i,\; d_{i-1}' + \frac{T_{\mathbf{x}}}{|\mathbf{y}|} \right) & i > 1 \end{cases}$   (8)
4. Length-Adaptive Average Lagging (LAAL): one issue with the AL metric for simultaneous translation is that, although it can handle the under-generation problem (the under/over-generation problem refers to the length of the generated text compared to the reference translation text), it is unable to handle over-generation and produces a biased score. To alleviate this issue, (Papi et al., 2022a) propose LAAL, which modifies (6) as

$d_i^{*} = (i-1) \cdot \frac{T_{\mathbf{x}}}{\max(|\mathbf{y}|, |\hat{\mathbf{y}}|)}$   (9)

Essentially, the ideal delay in (6) is normalized by the maximum of the reference and predicted text lengths. As such, LAAL can handle both over- and under-generation problems.
5. Average Token Delay (ATD): the AL metric does not take into account the length of the partial translation output, i.e., it does not consider the latency caused by longer outputs. To remedy this issue, ATD (Kano et al., 2023) has been proposed recently:

$\mathrm{ATD} = \frac{1}{|\hat{\mathbf{y}}|} \sum_{i=1}^{|\hat{\mathbf{y}}|} \left( T(\hat{y}_i) - T(x_{a(i)}) \right)$   (10)

where $T(\cdot)$ represents the ending time of each input or output token; a token is a sub-segment in speech, and a character or a word in text. The index $a(i)$ identifies the input token corresponding to $\hat{y}_i$ in the time-difference calculation, and its definition involves an offset term measuring how much longer the duration of the previous translation prefix is than that of the previous input prefix.
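As referenced above, the sketch below computes AP and AL from a sequence of delays, following the definitions in (4)-(6) under the simplifying assumptions that delays are given in seconds and that the reference length equals the hypothesis length.

```python
def average_proportion(delays, total_duration):
    """AP (eq. 4): mean fraction of the source consumed when emitting each target token."""
    return sum(delays) / (total_duration * len(delays))

def average_lagging(delays, total_duration, ref_len):
    """AL (eqs. 5-6): average lag behind an ideal policy that reads the source uniformly."""
    ideal_step = total_duration / ref_len                     # ideal delay increment per token
    # tau: first target index whose delay already covers the full input
    tau = next((i for i, d in enumerate(delays, start=1) if d >= total_duration), len(delays))
    lags = [d - (i - 1) * ideal_step for i, d in enumerate(delays[:tau], start=1)]
    return sum(lags) / tau

# Toy example: three target tokens generated against 2.0 s of source speech.
delays = [0.6, 1.2, 2.0]
print(average_proportion(delays, total_duration=2.0))   # fraction in [0, 1]
print(average_lagging(delays, total_duration=2.0, ref_len=len(delays)))
```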
3.3 Loss Functions
Let $(\mathbf{x}, \mathbf{z}, \mathbf{y})$ be a tuple where $\mathbf{x}$, $\mathbf{z}$, and $\mathbf{y}$ are the speech, the transcription text, and the translation text, respectively. The following are the various loss functions used to optimize the performance of E2E ST models (a runnable sketch of the CTC loss is given after the list):
1. Distillation Loss (Liu et al., 2019): the student model matches not only the ground truth but also the teacher model's output probabilities, which reduces the variance of the gradients:

$\mathcal{L}_{\mathrm{KD}} = -\sum_{t=1}^{T} \sum_{v \in \mathcal{V}} Q(y_t = v \mid y_{<t}, \mathbf{x}) \log P(y_t = v \mid y_{<t}, \mathbf{x})$   (13)

where $P$ and $Q$ denote the output distributions of the student and teacher models, respectively, and $\mathcal{V}$ is the target vocabulary.
2. CTC Loss (Ren et al., 2020) scores the output text sequence given the input speech sequence by summing over all possible alignment paths between the two:

$\mathcal{L}_{\mathrm{CTC}} = -\log P(\mathbf{z} \mid \mathbf{x}) = -\log \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{z})} P(\pi \mid \mathbf{x})$   (14)

where $\mathcal{B}$ is the mapping that removes blanks and repeated labels from an alignment path $\pi$.
3. Cross-Modal Adaptation Loss (Liu et al., 2020d) is defined as the sum of the mean squared errors between the speech and transcription-text representations:

$\mathcal{L}_{\mathrm{CMA}} = \mathrm{MSE}(\mathbf{h}^{s}, \mathbf{h}^{w}) + \mathrm{MSE}(\bar{\mathbf{h}}^{s}, \bar{\mathbf{h}}^{w})$   (15)

where $\mathbf{h}^{s}$ and $\mathbf{h}^{w}$ are the speech and word embeddings, $\bar{\mathbf{h}}^{s}$ and $\bar{\mathbf{h}}^{w}$ are the average speech and word embeddings, respectively, and MSE measures the difference between the two embeddings.
4. Cross-Entropy Loss (Ye et al., 2021) is the negative log-likelihood of the data combined over all the subtasks, such as ASR, MT, and ST, as well as external MT data:

$\mathcal{L}_{\mathrm{CE}} = -\sum_{\mathcal{D}_k \in \mathcal{D}} \sum_{(\mathbf{a}, \mathbf{b}) \in \mathcal{D}_k} \log P(\mathbf{b} \mid \mathbf{a})$   (16)

where $\mathcal{D}$ is the superset of all the parallel subsets of data and each $\mathcal{D}_k$ is one parallel subset (e.g., speech-transcript, speech-translation, or transcript-translation pairs).
5. Contrastive Loss (Ye et al., 2022a) is computed between the speech and the transcription text, bringing matched pairs closer and pushing unrelated pairs farther apart:

$\mathcal{L}_{\mathrm{CTR}} = -\log \frac{\exp\!\left(\mathrm{sim}(\mathbf{h}^{s}, \mathbf{h}^{z}) / \kappa\right)}{\sum_{\mathbf{z}'} \exp\!\left(\mathrm{sim}(\mathbf{h}^{s}, \mathbf{h}^{z'}) / \kappa\right)}$   (17)

where $\mathrm{sim}(\cdot,\cdot)$ and $\kappa$ denote the cosine similarity and the temperature hyperparameter, respectively, and $\mathbf{z}'$ ranges over the negative transcriptions.
6. ST Loss (Ouyang et al., 2023) is defined as the negative log-likelihood of the translation text given the source speech:

$\mathcal{L}_{\mathrm{ST}} = -\log P(\mathbf{y} \mid \mathbf{x})$   (18)
7. MT Loss (Ouyang et al., 2023) is defined as the negative log-likelihood of the translation text given the source transcript:

$\mathcal{L}_{\mathrm{MT}} = -\log P(\mathbf{y} \mid \mathbf{z})$   (19)
8. ASR Loss (Ouyang et al., 2023) is defined as the negative log-likelihood of the transcription text given the source speech:

$\mathcal{L}_{\mathrm{ASR}} = -\log P(\mathbf{z} \mid \mathbf{x})$   (20)
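As referenced before the list, the sketch below shows the CTC loss from item 2 using PyTorch's built-in implementation; the vocabulary size, blank index, and sequence lengths are arbitrary assumptions.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Log-probabilities from an acoustic encoder head: (T, B, C) = (frames, batch, vocab).
log_probs = torch.randn(120, 2, 32).log_softmax(dim=-1)
targets = torch.randint(1, 32, (2, 25))          # transcript token ids (0 is reserved for blank)
input_lengths = torch.tensor([120, 100])         # valid encoder frames per utterance
target_lengths = torch.tensor([25, 18])          # valid transcript tokens per utterance

loss_ctc = ctc(log_probs, targets, input_lengths, target_lengths)
```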
Figure 3: Generic architectures of (a) cascade and (b) end-to-end ST models.
4 Cascade vs. End-to-End
The traditional ST translation methods involve a cascade approach: first, applying ASR on the given speech, and then performing MT on the transcription produced by ASR (see fig. 3(a)). Such a cascade model is prone to several issues, such as (a) errors in the ASR model can propagate to the MT model, (b) higher training time, (c) inability to capture non-lexical cues such as prosody, and (d) the resources required for training. To mitigate such issues, various researchers propose using E2E models (see fig. 3(b)) for the ST task (Bérard et al., 2016; Anastasopoulos et al., 2016; Bérard et al., 2018; Gangi et al., 2019; Bentivogli et al., 2021). An E2E model offers joint training from scratch, avoids separately trained knowledge sources, and produces the output in a single pass (Prabhavalkar et al., 2024). Because of simpler training, a lower memory footprint, and lower cost, E2E model development has gained significant momentum in the research community.
Despite E2E models demonstrating superiority over cascade ST models based on the aforementioned criteria, they still fall short in comparison to the latter in terms of both automatic and human evaluation metrics (Etchegoyhen et al., 2022; Agrawal et al., 2023). In particular, (Lam et al., 2020; Etchegoyhen et al., 2022) show that the cascade model outperforms E2E in a low-resource setting (Basque-Spanish) when employing in-domain and out-of-domain data for training the ASR and MT components. The gap is more significant when models are trained using unrestricted data. However, as shown by (Bentivogli et al., 2021) on three language directions, the gap between cascade and E2E is closing, though primarily with English on one side. The same conclusion is reached by (Tsiamas et al., 2024) as well. Another study (Zhou et al., 2024) shows that E2E models can capture para-linguistic features of speech and outperform cascade models in disambiguating wh-phrases. Such studies call for further comparative work involving more languages and domains before the claim that the performance gap is indeed closed can be asserted.
5 Data Issues
The lack of adequate parallel speech-text corpora, essential in large quantities for training direct ST models, significantly impedes the performance of such models. The necessity for supervised ST data poses challenges in applying E2E ST systems to low-resource languages, where creating labeled parallel speech-text corpora demands substantial investments of time, money, and expertise. To address data scarcity, various techniques such as data augmentation, pre-training, back-translation, knowledge distillation, etc., are employed. These methods are elaborated as follows.
5.1 Augmentation
Data augmentation is a technique in machine learning to synthetically create more data points by applying class-preserving transformations (Cui et al., 2015). The objective is to increase the variability in the data so that the generalization and robustness of the model may be enhanced. Data augmentation can be applied to both speech and text.
Figure 4: Strategies for combating data scarcity in ST task modeling: (a) data augmentation, (b) self-training, (c) back-translation, (d) knowledge distillation. Dashed arrows indicate that the model is used for inference.
5.1.1 Augmenting speech data
Speech data can be augmented in various ways, for example by adding noise, speed and pitch perturbation, or time and frequency masking, to name a few. The SpecAugment (Park et al., 2019) policy consists of warping the features and masking blocks of frequency channels and time steps. It has been successfully used both for ASR (Vincent et al., 2017) and ST tasks (Bahar et al., 2019b). MixSpeech (Meng et al., 2021), as shown in Fig. 4(a), takes the weighted combination of two different speech features as input and combines the two recognition losses with the same weights. A generalization of MixSpeech (Xie and Hansen, 2023), called MixRep, applies the mixup idea to the acoustic features and the hidden-layer inputs. Combining MixRep with a regularization term along the time axis further improves ASR performance. Both MixSpeech and MixRep have been shown to perform well for low-resource ASR, and their effectiveness is still to be tested on ST tasks. M3ST (Cheng et al., 2022) applies two levels of fine-tuning (FT) using mixup data: word-, sentence-, and frame-level mixed data in the first FT level, and source speech and transcription mixup in the second FT level. M3ST achieves SOTA on MuST-C compared to baselines.
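A minimal SpecAugment-style sketch with only frequency and time masking (the time-warping component is omitted); the mask counts and widths are arbitrary assumptions.

```python
import torch

def spec_augment(feats, n_freq_masks=2, max_freq_width=15, n_time_masks=2, max_time_width=40):
    """feats: (frames, mel_bins) log-mel spectrogram; returns a masked copy."""
    out = feats.clone()
    n_frames, n_bins = out.shape
    for _ in range(n_freq_masks):                     # mask blocks of frequency channels
        width = int(torch.randint(0, max_freq_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_bins - width), (1,)))
        out[:, start:start + width] = 0.0
    for _ in range(n_time_masks):                     # mask blocks of time steps
        width = int(torch.randint(0, max_time_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_frames - width), (1,)))
        out[start:start + width, :] = 0.0
    return out

augmented = spec_augment(torch.randn(300, 80))        # 300 frames of 80-dim features
```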
5.1.2 Augmenting speech and text data
It is possible to augment both speech and text simultaneously and create new paired data. For example, sample, translate, and recombine (Lam et al., 2022b) first samples a suffix replacement from a suffix memory corresponding to a pivot token in the transcription. It then translates the combined new utterance (prefix + pivot + replacement suffix) to generate a new target sentence. The corresponding audio is obtained by concatenating the audio frames of the prefix, pivot, and replacement suffix. The interesting thing about the proposed method is that it generates real-looking sentences rather than pseudo-sentences. Concatenation of original ST data has also been used to augment the entire training data (Lam et al., 2022a). In particular, (Lam et al., 2022a) propose CatSpeaker, which uses single-speaker information, and CatRandom, which randomly generates audio-text pairs spoken by different speakers.
5.2 Pre-training
Pre-training is an approach to handle data scarcity in low-resource problems and is deemed a form of transfer learning (Bozinovski and Fulgosi, 1976). Data used for pre-training may consist of speech, text, or both. Once the models are pre-trained leveraging such data, their robustness on downstream tasks is enhanced. We find that SOTA ST models often pre-train on a large amount of ASR/MT corpora. In ST, pre-training has been used by many researchers (Paulik and Waibel, 2013; Bansal et al., 2017; Anastasopoulos and Chiang, 2018; Wang et al., 2020d; Dong et al., 2021; Zhang et al., 2022a; Tang et al., 2022). Pre-training has been applied in two flavors by different researchers: independently and jointly.
In independent pre-training, individual modules (encoder, decoder, semantic decoder, etc.) are pre-trained using auxiliary data such as ASR and MT data. Such an approach has been followed by (Wang et al., 2020d; Chen et al., 2020; Zheng et al., 2021a). In particular, (Wang et al., 2020d) pre-train the encoder using ASR data for learning semantic concepts, (Chen et al., 2020) propose a self-supervised method called Masked Acoustic Modeling (MAM), which randomly masks part of the speech spectrogram and then recovers it on top of the encoder, whereas (Zheng et al., 2021a) unify speech and text representation through masked language modeling.
Besides pre-training the encoder and the decoder, various researchers also exploit pre-trained feature extractors, such as Wav2vec (Schneider et al., 2019), used by (Zhang et al., 2023b) and (Liu et al., 2020b), and HuBERT (Hsu et al., 2021), used by (Zhang et al., 2023a). Very recently, (Tsiamas et al., 2024) proposed an ST model that pre-trains the speech encoder using optimal transport and CTC. They claim to surpass supervised ST models while requiring no paired speech-text data in a zero-shot setting.
In joint pre-training, the entire model is first pre-trained in an E2E fashion and then fine-tuned on the ST corpus (Fang and Feng, 2023; Bapna et al., 2021). It is often accompanied by multitask pre-training with ASR, MT, and masked language modeling tasks (Chung et al., 2021), using supervised as well as unsupervised speech and text data. (Tang et al., 2022) pre-train on speech/text-to-text/speech, text-to-text, speech self-supervised learning (SSL), and speech-to-phoneme tasks. SpeechT5 (Ao et al., 2021) pre-trains on ASR, ST, text-to-speech, speech conversion, and speech enhancement tasks. Wave2Seq (Wu et al., 2022) pre-trains jointly using pseudo-languages. Multi-modal multi-task pre-training leverages five tasks: self-supervised speech-to-pseudo-codes (S2C), phoneme-to-text (P2T), self-supervised masked speech prediction (MSP), supervised phoneme prediction (PP), and the ST task (Zhou et al., 2022b).
5.3 Self-training and Back-translation
Both self-training and back-translation (BT) are approaches employed to harness monolingual data for training models that require supervised data but face a shortage of sufficiently large supervised parallel corpora, as illustrated in Fig. 4(b) and (c). Self-training makes use of source monolingual data, while back-translation is applied to target monolingual data. In the end, both methods can be employed synergistically to generate augmented data.
More specifically, consider a speech-text parallel corpus $\mathcal{D} = \{(\mathbf{x}, \mathbf{y})\}$, a monolingual source speech corpus $\mathcal{D}_x$, and a monolingual target text corpus $\mathcal{D}_y$, where the monolingual corpora are typically much larger than $\mathcal{D}$. In self-training, a forward translation model $f$ is first trained on $\mathcal{D}$. It is then used to generate "pseudo labels" for $\mathcal{D}_x$ by applying $f$, leading to auxiliary data $\mathcal{D}_a = \{(\mathbf{x}, f(\mathbf{x})) \mid \mathbf{x} \in \mathcal{D}_x\}$. The combined data $\mathcal{D} \cup \mathcal{D}_a$ is then used to re-train the model $f$. In back-translation, $\mathcal{D}_y$ is translated using a backward translation model $g$, creating auxiliary data $\mathcal{D}_b = \{(g(\mathbf{y}), \mathbf{y}) \mid \mathbf{y} \in \mathcal{D}_y\}$ for training the forward translation model $f$ on the combined data $\mathcal{D} \cup \mathcal{D}_b$.
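The following sketch spells out these two loops with hypothetical placeholder interfaces (make_model, fit, and translate are assumptions, not part of any specific toolkit); it is meant only to make the data flow explicit.

```python
def self_training(make_model, parallel, mono_speech):
    """Pseudo-label monolingual source speech with a model trained on the parallel corpus D."""
    f = make_model().fit(parallel)                                        # train f on D
    pseudo = [(speech, f.translate(speech)) for speech in mono_speech]    # D_a = {(x, f(x))}
    return make_model().fit(parallel + pseudo)                            # retrain on D u D_a

def back_translation(make_model, backward_model, parallel, mono_text):
    """Synthesize sources for monolingual target text with a backward model g."""
    synthetic = [(backward_model.translate(text), text) for text in mono_text]  # D_b = {(g(y), y)}
    return make_model().fit(parallel + synthetic)                         # train forward model on D u D_b
```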
Back-translation on discrete units to train a unit-to-text translation model is applied in (Zhang et al., 2023a), which is on par with methods leveraging large-scale external corpora. (Fang and Feng, 2023) propose a back-translation strategy with target-to-unit and unit-to-speech synthesis for low-resource language translation without transcripts. (Wang et al., 2021b) extract speech features using wav2vec 2.0 pre-training and apply a single iteration of self-training together with language-model decoding. Cyclic feedback from the MT output is used as a self-training mechanism for a cascaded ASR-MT model, showing how to exploit direct speech-translation data (Lam et al., 2020).
5.4 Knowledge distillation
Knowledge Distillation (KD) transfers learned knowledge from a large ensemble or teacher model to a smaller single student model, as shown in Fig. 4(d) (Hinton et al., 2015). This process encompasses both model compression (Bucilǎ et al., 2006) and transfer learning. More details of recent works utilizing KD approaches for ST tasks are given in §7 (ST with MT) and §6.2.3.
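A minimal token-level KD sketch in the spirit of (Hinton et al., 2015), mixing a soft KL term against the teacher with the usual hard-label cross-entropy; the temperature, mixing weight, and padding id are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5, pad_id=0):
    """student/teacher logits: (B, L, V); targets: (B, L) gold token ids."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # soft targets from the teacher
    hard = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
    return alpha * soft + (1.0 - alpha) * hard

loss = distillation_loss(torch.randn(2, 20, 8000), torch.randn(2, 20, 8000),
                         torch.randint(1, 8000, (2, 20)))
```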
6 Segmentation and Representation Learning
E2E ST models rely on segmented inputs because handling long inputs is a challenging task (Kim et al., 2017; Tsiamas et al., 2022c).
Segmentation is the problem of splitting the long speech/text sequence into smaller and more manageable segments whose representations can be learned. This section will shed some light on the segmentation and representation issues and offer some advice on how to tackle them.
6.1 Segmentation Learning
As discussed above, segmentation is an important issue while building ST models. Segmentation of text is easy: it can be split at strong punctuation, which is what current MT models rely on. Similarly, ASR models give lower importance to segmentation due to the small local context window required for the task. The cascaded ST model can perform segmentation by applying ASR, followed by monolingual translation to restore the lost punctuation, and then segmenting on it (Matusov et al., 2007, 2018). On the other hand, E2E ST models require sophisticated segmentation of the speech, primarily due to out-of-order word relations between the input and output as well as the absence of linguistic features.
Traditionally, segmentation of speech is done manually. Because this is a cumbersome task, segmentation learning is warranted. Segmentation is done based either on length, which splits the speech into fixed-length chunks, or on pauses, which splits the speech based on Voice Activity Detection (VAD) (Sohn et al., 1999). A third approach to segmenting the speech is a hybrid mode in which both length and linguistic content are taken into account (Potapczyk and Przybysz, 2020; Gaido et al., 2021; Tsiamas et al., 2022c). The hybrid approach surpasses the length- and pause-based approaches to segmentation in terms of performance (Gaido et al., 2021). Concretely, (Tsiamas et al., 2022c) learn the manual segmentation using a binary classifier, and a probabilistic divide-and-conquer algorithm (Gaido et al., 2021) is used at inference time to decide the split points. However, there is still a gap between the hybrid and manual approaches to segmentation, and future work may consider paying attention to this.
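As a toy illustration of pause-based segmentation, the sketch below uses a simple energy threshold as a crude stand-in for a VAD; the frame sizes, threshold, and minimum-pause length are arbitrary assumptions, and the hybrid methods cited above are considerably more elaborate.

```python
import numpy as np

def energy_based_segments(wav, sr=16000, frame_ms=25, hop_ms=10,
                          energy_thresh=1e-3, min_pause_frames=30):
    """Split a waveform at long low-energy regions; returns (start, end) sample indices."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    energies = np.array([np.mean(wav[i:i + frame] ** 2)
                         for i in range(0, max(1, len(wav) - frame), hop)])
    voiced = energies > energy_thresh
    segments, start, silence = [], None, 0
    for idx, v in enumerate(voiced):
        if v:
            if start is None:
                start = idx                       # a new segment begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_pause_frames:       # a long pause closes the segment
                segments.append((start * hop, (idx - silence + 1) * hop))
                start, silence = None, 0
    if start is not None:
        segments.append((start * hop, len(wav)))
    return segments

segments = energy_based_segments(0.01 * np.random.randn(16000 * 10))
```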
Our discussion above focuses on segmentation in the offline E2E models. Segmentation of speech in streaming E2E models is presented in §7.1.2.
6.2 Representation Learning
Representation learning is a type of machine learning in which algorithms automatically discover and extract useful features from the raw data. It has been successfully applied in computer vision (Wu, 2020), natural language processing (Liu et al., 2021b), and speech (Mohamed et al., 2022). Representation learning is an important issue in ST tasks because speech and text are two distinct modalities of data that reside in different embedding spaces. Hence, we not only need better representation learning methods for speech and text but also for their joint representation. Many of the works in ST apply speech/text representation learning methods before actually applying encoder-decoder or transducer-based methods (explained later in §7) for the ST task. Below, we provide details of such representation learning methods used for ST tasks.
6.2.1 Text Representation
ST models often use ASR transcripts and MT translations as auxiliary data, which need to be fed to the encoder and decoder, respectively. To learn representations for such text data, existing works rely on word embeddings (Zhang et al., 2023c; Bérard et al., 2016), LSTMs (Kim et al., 2017; Weiss et al., 2017; Bérard et al., 2018; Jia et al., 2019), and Transformers (Wang et al., 2021b; Liu et al., 2021a; Zeng et al., 2021). Text data is often tokenized and fed either as words or as characters (Bérard et al., 2018). The output of the decoder can be graphemes, characters, or words.
6.2.2 Speech Representation
ST models take speech as input and utilize various speech-based feature representation methods to convert speech into a vector representation. Traditional speech feature extraction methods such as Perceptual Linear Prediction (PLP), Fbank, and Mel-Frequency Cepstral Coefficients (MFCC) (Rabiner and Schafer, 2010) have been used, after normalization, to extract speech features in many works (Duong et al., 2016; Bérard et al., 2016; Kim et al., 2017; Bérard et al., 2018; Anastasopoulos and Chiang, 2018; Bansal et al., 2019; Jia et al., 2019; Inaguma et al., 2019; Liu et al., 2020d; Dong et al., 2021; Le et al., 2023b; Parcollet et al., 2024), sometimes combined with pitch features and the speech augmentation methods described in §5. These feature extraction methods are sometimes being replaced by distributed feature representation methods such as speech word2vec (Chung and Glass, 2018) owing to their dense continuous feature representation capability.
It is difficult to get a large amount of labeled speech data to learn supervised speech feature representations. Therefore, more recent works exploit speech features learned in unsupervised and self-supervised ways, mapping the continuous speech signal to discrete units, akin to words and sub-words in the text domain. Such a representation allows tools developed in NLP to be borrowed in the speech domain. Among them, the most popular is Wav2Vec (Schneider et al., 2019) and its variants such as w2v-BERT (Chung et al., 2021) and Wav2vec 2.0 (Baevski et al., 2020), used in (Tran et al., 2020; Le et al., 2020; Li et al., 2020; Han et al., 2021; Popuri et al., 2022; Zhang et al., 2023c). Interestingly, Wav2Vec and its variants can be used as an encoder in a Seq2Seq framework alone or combined with adapters and CNNs for length shrinking (length shrinking is an important issue in the ST task since speech is a much longer sequence than text; existing works employ various techniques such as length adapters, CNNs, and CTC for it). A few works such as CSTNet (Khurana et al., 2020; Wang et al., 2020d) use CNNs for feature extraction and length shrinking.
More recent works in ST employ HuBERT (Hsu et al., 2021) for speech representation, among other benefits of HuBERT (Zhang et al., 2023a). HuBERT offers stable training and better targets than Wav2Vec 2.0 since it uses hidden-layer representations during the clustering process. For encoding long speech signals, Conformers (Gulati et al., 2020) can be used, as they provide local context through a convolution block and global context through an attention mechanism. SeamlessM4T (Barrault et al., 2023) uses a Conformer for speech encoding.
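As an example of the self-supervised encoders discussed above, a pre-trained wav2vec 2.0 checkpoint can be loaded from the HuggingFace transformers library and used to extract contextual speech features; the checkpoint name below is one publicly available option, and the usage is only a sketch.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-base-960h"                  # one publicly available checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
encoder = Wav2Vec2Model.from_pretrained(name)

waveform = torch.randn(16000 * 3)                     # 3 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state    # (1, frames, hidden) contextual features
```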
Other speech representation techniques such as VQ-VAE (van den Oord et al., 2017), WavLM (Chen et al., 2022), data2vec (Baevski et al., 2022), Robust data2vec (Zhu et al., 2023), SpeechLM (Zhang et al., 2024b), may also be explored while encoding speech for ST tasks.
Figure 5: Modality bridging.
6.2.3 Joint Speech-Text Representation
The speech and text in an ST task are semantically related because both of them refer to the same thing. Therefore, it is imperative to learn a joint speech-text representation in the hope of bridging the modality gap between them. A method for learning a combined representation of text and speech is called modality bridging (see fig.5). Hence, a good ST model should learn a representation such that embeddings of both modalities for similar speech-text pairs lie close to each other. It is believed that low performance on ST tasks is due to models not learning aligned representations of speech and text. Therefore, different authors have devised different ways to fill the gap, which fall into five major approaches: (a) adapters, (b) contrastive learning, (c) knowledge-distillation, (d) optimal transport, and (e) mix-up strategy. Below we discuss the works utilizing these approaches and show the pros and cons.
1. Adapters are small modules integrated with pre-trained networks for specific tasks (Houlsby et al., 2019). They have performed on par with fine-tuning-based approaches while requiring only a fraction of the trainable parameters. For example, in (Gállego et al., 2021; Zhao et al., 2022; Sarkar et al., 2023), the modality gap is filled using adapter layers implemented as multi-headed self-attention with a pooling operation. The authors use Wav2Vec 2.0 (Baevski et al., 2020) for speech-feature extraction, wherein the self-attention layers in the transformer are equipped with a pooling operation for dimensionality reduction to match the text representation.
2. Contrastive learning approximates the "semantic" distance in the input space using a simple distance in the target space after mapping input patterns onto the target space (Chopra et al., 2005). It tries to bring positive instances closer while pushing negative ones apart, and has been used extensively in both supervised and unsupervised settings for learning representations. For example, (Zhang et al., 2023c) perform explicit knowledge transfer through contrastive learning; they learn frame- and sentence-level speech feature representations and use whitening (Su et al., 2021) to alleviate MT representation degeneration. (Liu et al., 2019) decouple the encoder representation into three parts: an acoustic encoder, shrinking (done via CTC) of the acoustic encoder output, and a semantic encoder for modality-gap bridging. Using a contrastive learning architecture, Chimera (Han et al., 2021) trains a shared semantic memory module for overcoming the modality distance. XSTNet (Ye et al., 2021) augmented with a contrastive loss (Ye et al., 2022a) investigates three different methods: span-masked representation, word repetition, and cut-off, and claims that the contrastive loss is better than the CTC and L2 losses. Word-aligned contrastive learning (WACO) (Ouyang et al., 2023) bridges the modality gap by forming the average speech and word embeddings of the same word as a positive pair and those of different words as negative pairs. CSTNet is a self-supervised learning framework based on contrastive learning using a mix of triplet losses (Khurana et al., 2020). On top of the CTC loss, a boundary-based speech length shrinking mechanism is applied in (Zeng et al., 2022). The authors claim that if boundary-based shrinking is combined with other modality-bridging techniques, such as a contrastive loss, it can further improve model performance; the presented approach also achieves lower inference latency and memory footprint. (Yin et al., 2023) propose a novel integration of speech and text, referred to as a third modality. This fusion is achieved through the application of Cross-modal Contrastive Learning (Sohn, 2016) and Cross-Attentive Regularization (Tang et al., 2021a). Additionally, the method incorporates techniques such as Knowledge Distillation and Jensen-Shannon Divergence (Lin, 1991; Liu et al., 2019; Gaido et al., 2020a) to bridge the modality gap, addressing challenges related to input representation, semantics, and hidden states.
| Models/Techniques | Problem Solved | Dataset | Language Pair (Speech hours): BLEU |
| M-Adapter + W2V2 + mBart (Baevski et al., 2020) | Training gap between pre-training and fine-tuning the modality | MuST-C | En-De (408): 25.9; En-Ro (432): 24.62; En-Fr (492): 37.34 |
| Chimera (Han et al., 2021) | Projecting audio and text to a common semantic representation | MuST-C | En-De (408): 27.1; En-Fr (492): 35.6; En-Ru (489): 17.4; En-Es (504): 30.6; En-It (465): 25.0; En-Ro (432): 24.0; En-Pt (385): 30.2; En-Nl (442): 29.2 |
| ConST (XSTNet + Contrastive Loss) (Ye et al., 2021) | Closes modality gap | MuST-C | En-De (408): 28.3; En-Es (504): 32.0; En-Fr (492): 38.3; En-It (465): 27.2; En-Nl (442): 31.7; En-Pt (385): 33.1; En-Ro (432): 25.6; En-Ru (489): 18.9 |
| W2V2 + mBart + Adapter (Gállego et al., 2021; Zhao et al., 2022) | Slow convergence speed | MuST-C | En-De (408): 28.22 |
| WACO (Ouyang et al., 2023) | Limited parallel data (1 hour) | MuST-C | En-De (1): 17.5 |
| AdaTrans (Zeng et al., 2022) | Closing gap between length of speech and text | MuST-C | En-De (408): 28.7; En-Fr (492): 38.7; En-Ru (489): 19.0 |
| STEMM (Fang et al., 2022) | Speech representation | MuST-C | En-De (408): 28.7; En-Fr (492): 37.4; En-Ru (489): 17.8; En-Es (504): 31.0; En-It (465): 25.8; En-Ro (432): 24.5; En-Pt (385): 31.7; En-Nl (442): 30.5 |
| CTC loss + Optimal Transport (Siamese-PT) (Le et al., 2023b) | Without change in architecture | MuST-C | En-De (408): 27.9; En-Es (504): 31.8; En-Fr (492): 39.2; En-It (465): 27.7; En-Nl (442): 31.7; En-Pt (385): 34.2; En-Ro (432): 27.0; En-Ru (489): 18.5 |
| Fine & Coarse Granularity Contrastive Learning (Zhang et al., 2023c) | Limited knowledge transfer ability | MuST-C | En-De (408): 29.0; En-Fr (492): 38.3; En-Ru (489): 19.7; En-Es (504): 31.9; En-It (465): 27.3; En-Ro (432): 26.8; En-Pt (385): 32.7; En-Nl (442): 31.6 |

Table 1: Performance of ST models using modality bridging. The datasets, language pairs, duration of speech, and metric (BLEU) are shown.
3. Knowledge distillation (Hinton et al., 2015) is a mechanism to distill information from a trained and large "teacher" model into a smaller and more efficient "student" model. It has been used in (Huzaifah and Kukanov, 2023) to address the modality gap issue.
4. Optimal transport (OT) (Peyré et al., 2019) is a mechanism for comparing two probability distributions. In the ST task, the speech and text representations may be deemed two probability distributions, and therefore OT can be applied. More formally, suppose $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ denote the discrete probability distributions corresponding to the speech and text representations. The masses at positions $s_i$ and $t_j$ are $\mu_i$ and $\nu_j$, respectively, such that $\sum_{i=1}^{n} \mu_i = 1$ and $\sum_{j=1}^{m} \nu_j = 1$. Suppose further that the cost of transporting a unit of mass from $s_i$ to $t_j$ is $c_{ij} = c(s_i, t_j)$, where $c$ is some cost function such as the Euclidean distance. Let $z_{ij}$ be the quantity of mass to be transported from $s_i$ to $t_j$; then the goal of OT is to move all mass from $\boldsymbol{\mu}$ to $\boldsymbol{\nu}$ such that the following objective function is minimized:

$\min_{\mathbf{Z} \ge 0} \; \langle \mathbf{Z}, \mathbf{C} \rangle \quad \text{s.t.} \quad \mathbf{Z}\mathbf{1}_m = \boldsymbol{\mu}, \;\; \mathbf{Z}^{\top}\mathbf{1}_n = \boldsymbol{\nu}$   (21)

In the above equation, $\mathbf{Z}$ and $\mathbf{C}$ are the matrices whose elements are $z_{ij}$ and $c_{ij}$, respectively, and $\mathbf{1}$ denotes the vector of ones. In the ST task, $c_{ij}$ is a distance between the speech and text representations. The loss corresponding to (21) is called the Wasserstein loss, which is costly to optimize. Hence, an entropy-regularized upper-bound approximation is often optimized instead:

$\min_{\mathbf{Z} \ge 0,\; \mathbf{Z}\mathbf{1}_m = \boldsymbol{\mu},\; \mathbf{Z}^{\top}\mathbf{1}_n = \boldsymbol{\nu}} \; \langle \mathbf{Z}, \mathbf{C} \rangle - \lambda H(\mathbf{Z})$   (22)

where $\lambda$ is a regularization parameter and $H(\mathbf{Z})$ is the (von Neumann) entropy of the matrix $\mathbf{Z}$. Recent works make use of OT as presented above. For example, (Le et al., 2023b) use optimal transport and CTC together to close the modality gap during the pre-training phase. They show significant gains in BLEU score when the ST model is fine-tuned without any external data compared to multitask learning. Similarly, (Tsiamas et al., 2024, 2023) use OT+CTC to align the speech-encoder representation space with the MT embedding space, whereas (Zhou et al., 2023) align the two representations via OT followed by cross-modal mix-up at the token level. A minimal Sinkhorn-style sketch of the entropy-regularized problem in (22) is given after this list.
5. Mix-up strategy: the Speech-Text Manifold Mixup (STEMM) strategy (Fang et al., 2022) mixes embeddings of speech and text in the encoder-decoder of a translation model to bridge the modality gap under a self-supervised learning framework. PromptST (Yu et al., 2023) presents a linguistic probing learning strategy, referred to as Speech-Senteval, inspired by the approach introduced by (Conneau et al., 2018). This strategy is applied to the higher layers of the encoder within pre-trained ST models, specifically targeting the linguistic properties that these models often struggle to learn at the higher layers.

Table 1 presents the performance scores of ST models based on modality-bridging techniques. We can observe that the mixup strategy achieves the highest BLEU score on the En-De pair, whereas the boundary-based speech length shrinking mechanism matches this score when combined with other modality-bridging techniques.
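As referenced in the optimal-transport item above, the following is a minimal Sinkhorn-style sketch of the entropy-regularized problem in (22), assuming uniform marginals over speech frames and text tokens and a squared Euclidean cost; it is an illustration, not any specific published recipe.

```python
import torch

def sinkhorn_ot_loss(speech_emb, text_emb, epsilon=0.1, n_iters=50):
    """speech_emb: (n, d) frame embeddings; text_emb: (m, d) token embeddings."""
    n, m = speech_emb.size(0), text_emb.size(0)
    cost = torch.cdist(speech_emb, text_emb) ** 2        # cost matrix C
    cost = cost / cost.max()                             # rescale for numerical stability
    mu = torch.full((n,), 1.0 / n)                       # uniform mass on speech positions
    nu = torch.full((m,), 1.0 / m)                       # uniform mass on text positions
    K = torch.exp(-cost / epsilon)                       # Gibbs kernel
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):                             # Sinkhorn fixed-point iterations
        u = mu / (K @ v)
        v = nu / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)           # transport plan Z = diag(u) K diag(v)
    return (plan * cost).sum()                           # alignment cost used as a loss

loss_ot = sinkhorn_ot_loss(torch.randn(120, 256), torch.randn(20, 256))
```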
Discussion: The study finds that adapters can shrink the speech length as well as the modality distance between the text and speech representations while requiring a small number of trainable parameters. The contrastive loss is found to be better than the CTC and L2 losses for modality bridging. Boundary-based speech length shrinking combined with a contrastive loss may improve ST task performance. Finally, it is possible to build ST models requiring zero parallel ST data (Tsiamas et al., 2024).
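To make the contrastive option concrete, here is a minimal InfoNCE-style loss between pooled speech and text embeddings of paired utterances in a batch; the pooling, projection, and temperature choices are assumptions rather than any specific published recipe.

```python
import torch
import torch.nn.functional as F

def speech_text_contrastive(speech_emb, text_emb, temperature=0.07):
    """speech_emb, text_emb: (B, d) sentence-level embeddings; row i of each forms a positive pair."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (s @ t.t()) / temperature            # scaled cosine similarities
    labels = torch.arange(s.size(0))              # the matching index is the positive for each row
    # pull matched speech-text pairs together, push mismatched pairs apart, in both directions
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss_ctr = speech_text_contrastive(torch.randn(8, 256), torch.randn(8, 256))
```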
7 End-to-End ST Models
End-to-end models for ST, as discussed previously, are gaining traction compared with cascade models. This section presents an overview of E2E models. We categorize them under two major E2E themes: framework-based and data-based. The first category is further divided according to whether the framework is offline or streaming. The second category is based on the nature of the data. The sub-categorization presented in the data-based section depends upon which component boosts the ST task performance, as claimed in the papers. As such, the demarcation is not strict, and there may be overlaps in the subcategories. In addition, our emphasis in the present review of existing works is on highlighting the core contributions and limitations as claimed by the authors. That means we look for answers to the question: what is the main technical contribution of the authors to solving the ST problem? Thus, wherever possible, we have limited the mathematical description and believe such details can be found in the related papers. We attempt to provide a succinct and clear picture of what works and what does not while addressing the ST problem.
Figure 6: The E2E offline framework. Dashed arrows represent optional components.
7.1 E2E ST Models based on Frameworks
As mentioned in the previous section, E2E ST models based on frameworks are further divided according to whether the framework is offline or streaming. Below, we discuss both of these categories in detail.
7.1.1 Offline Frameworks
Offline frameworks perform ST tasks where output tokens are produced only after having seen the entire speech utterance. These frameworks heavily rely on the Seq2Seq architecture, as shown in Fig. 6. It has an encoder for speech input, a decoder for text output, and an optional shared/semantic decoder connecting the encoder and the decoder. The model is usually optimized for the ST loss, or sometimes in a multitask learning framework where ASR/MT/CTC (Graves et al., 2006) losses are combined with the ST loss. At other times, transfer learning is utilized to leverage pre-trained models for ST tasks. Another approach that has been gaining attention is Non-Autoregressive (NAR) modeling for the E2E ST task, which gives faster inference. The following sections delve deeper into these approaches.
The Seq2Seq-based ST models proposed in the literature either use specialized encoders such as transformers or attention mechanisms which we discuss next.
1. Attention mechanism: attention is used to concentrate on specific sections of the input data instead of the entire data (Larochelle and Hinton, 2010; Mnih et al., 2014; Vaswani et al., 2017). It has been a successful strategy for getting state-of-the-art (SOTA) results in NLP, computer vision, and other areas. There exist various types of attention in the literature, such as soft, hard, local, monotonic, multi-head, self-, and cross-attention, inter alia. For more details, interested readers are encouraged to skim through (Mnih et al., 2014; Vaswani et al., 2017; Brauwers and Frasincar, 2022). Below, we describe efforts made to handle ST tasks using the attention mechanism within the Seq2Seq framework.

Convolutional attention to "remember" and avoid translating the signal twice is used within Seq2Seq by (Bérard et al., 2016), which outperforms a hierarchical encoder with better results on synthetic data without using transcripts. The same authors in (Bérard et al., 2018) use the source transcript and achieve results close to cascade models on LibriSpeech data. In (Duong et al., 2016), the authors propose phone-to-text alignment with a structural bias feature in the attention model. The measurement of alignment has been explored in (Anastasopoulos et al., 2016), which uses IBM's translation model as well as dynamic time warping (dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed). Seq2Seq with attention trained using multitask learning achieves promising results in (Weiss et al., 2017). These models, however, struggle with noisy inputs and long acoustic signals (Kim et al., 2017). The latter work uses a joint CTC-attention model (Graves et al., 2006) trained through multitask learning by incorporating regularizers. The authors use two decoders, where the second decoder seeks a higher-level representation (HLR) from the first decoder, besides the encoder, via the attention mechanism. The Attention-Passing Model (APM) (Sperber et al., 2019), which only passes high-attention vectors from the audio encoder to the translation text for decoding, demands a smaller amount of data for training.
2.
Transformer is an architecture based on multi-headed self-attention (Vaswani et al., 2017) that produces contextualized representations of the input. Because of parallelization and contextual representation, Transformers have outperformed RNNs on several NLP tasks, which motivates applying them to the ST task as well. A Transformer-based Seq2Seq model with attention is proposed in (Cattoni et al., 2021); since the architecture has quadratic memory complexity, it uses (a) a CNN to downsample the input and (b) 2-D attention to address short-range dependencies of spectrograms. In (Alastruey et al., 2022), some attention weights are dropped for speech tasks, decreasing the size of the attention matrix: the Transformer encodes the speech features with local self-attention using a suitable window size in each layer to reduce the computational complexity (see the sketch following this item). Other Transformer variants that reduce the quadratic complexity, such as Perceivers (Jaegle et al., 2021), have been used as encoders (Tsiamas et al., 2022b). Besides quadratic complexity, Transformers require lossy downsampling of speech features, thus potentially throwing away useful linguistic information. To tackle such issues, Speechformer has been proposed (Papi et al., 2021a), which aggregates information at higher layers based on more informed linguistic criteria.
2. Transformer 是一种基于多头自注意力机制的架构(Vaswani 等人,2017),可生成输入的上下文表征。得益于并行化处理和上下文表征能力,Transformer 在多项自然语言处理任务中已超越循环神经网络。这促使我们也将 Transformer 应用于语音翻译任务。基于 Transformer 的注意力序列到序列模型由(Cattoni 等人,2021)提出,该架构具有二次方内存复杂度,其特点包括:(a)使用卷积神经网络对输入进行下采样;(b)采用二维注意力机制处理声谱图的短程依赖关系。(Alastruey 等人,2022)通过避免部分注意力权重来缩减注意力矩阵规模,该 Transformer 通过编码语音特征,在各层引入具有合适窗口大小的局部自注意力以降低计算复杂度。其他降低二次方复杂度的 Transformer 变体,如感知器(Jaegle 等人,2021),已被用作编码器(Tsiamas 等人,2022b)。 除了二次复杂度问题外,变换器模型需要对语音特征进行有损下采样,这可能导致有用语言信息的丢失。为解决这一问题,研究者提出了 Speechformer 模型(Papi 等人,2021a),该模型基于更精确的语言学标准在高层网络进行信息聚合。
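As a rough illustration of the windowed (local) self-attention idea mentioned above, the sketch below builds a band-shaped attention mask so that each speech frame attends only to neighbours within a fixed window. The window size, tensor shapes, and the use of a stock attention module are illustrative assumptions rather than the exact formulation of any cited model.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions a frame is *not* allowed to attend to."""
    idx = torch.arange(seq_len)
    # |i - j| > window  ->  masked out, so attention stays within a local band.
    return (idx[None, :] - idx[:, None]).abs() > window

# Example: restrict self-attention over 8 frames to a +/-2 frame neighbourhood.
mask = local_attention_mask(seq_len=8, window=2)

attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 8, 16)                       # (batch, frames, features)
out, weights = attn(x, x, x, attn_mask=mask)    # masked positions get zero weight
print(weights[0].detach().round(decimals=2))    # band-diagonal attention pattern
```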
As discussed earlier, multitask learning combines the optimization of ST loss with an auxiliary loss such as ASR/MT/CTC loss. Another direction that has been explored by ST researchers is transfer learning in that Seq2Seq encoder/decoders are first pre-trained using ASR/MT data respectively and then the entire model is fine-tuned using ST data. Below, we discuss works based on multitask/transfer learning frameworks.
如前所述,多任务学习将语音翻译损失与辅助损失(如自动语音识别/机器翻译/连接时序分类损失)联合优化。语音翻译研究探索的另一方向是迁移学习——先分别使用语音识别/机器翻译数据对序列到序列的编码器/解码器进行预训练,再用语音翻译数据对整个模型进行微调。下文将讨论基于多任务/迁移学习框架的研究成果。
-
1.
ST with ASR: ST with ASR models make use of transcript data along with speech-text pairs for pre-training. For example, curriculum pre-training (Wang et al., 2020d) refers to using ASR data for pre-training a Seq2Seq model, allowing it to learn transcription. The authors argue that if the model is further pre-trained to learn semantic concepts (via frame-based masked language modeling) and word alignment (via frame-based bilingual lexical translation), it boosts ST task performance. Specifically, existing E2E models either pre-train the encoder or use multi-task learning for ST tasks, so the encoder cannot isolate the learning of the three tasks: transcription, semantic concepts, and alignment. Curriculum pre-training segregates them by dividing the labor, and experiments support this claim. Listen, Understand, and Translate (LUT) (Dong et al., 2021) uses the Seq2Seq model with an external ASR loss. Their primary contribution is a semantic encoder network, whose task is to take the encoder's output for transcription and minimize the mean-squared loss between the semantic representations and the BERT embeddings of the target text (a minimal sketch of this kind of semantic supervision follows this item). Such a strategy implicitly builds and trains an NMT model for translation. Pre-training using ASR and/or MT has also been found useful in low-resource scenarios (Zhang et al., 2022a).
1. 基于 ASR 的语音翻译:此类模型利用转录文本数据及语音-文本对进行预训练。例如课程式预训练(Wang 等人,2020d)提出使用 ASR 数据预训练 Seq2Seq 模型以习得转写能力。作者指出,若进一步通过基于帧的掩码语言建模学习语义概念,以及通过基于帧的双语词汇翻译学习词对齐,可显著提升语音翻译性能。现有端到端模型通常仅预训练编码器或采用多任务学习,导致编码器无法区分转写、语义概念学习与对齐这三个本应分工明确的任务,实验数据验证了这一理论假设。听-理解-翻译模型(LUT)(Dong 等人,2021)在 Seq2Seq 框架中引入外部 ASR 损失函数,其核心创新在于构建语义编码网络——该网络通过转写编码器输出与目标文本 BERT 嵌入之间的均方误差损失,实现语义表征的优化。 这种策略隐式地构建并训练了一个用于翻译的神经机器翻译模型。在低资源场景下,使用自动语音识别和/或机器翻译进行预训练也被证明是有效的(Zhang 等人,2022a)。 -
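As a rough illustration of the semantic-supervision idea described above, the sketch below regresses pooled semantic-encoder states onto embeddings from a frozen text encoder with an MSE loss. The frozen embedding table stands in for BERT only so that the sketch runs on its own; the pooling, sizes, and module names are assumptions, not the exact LUT recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256

# Stand-in for a frozen pre-trained text encoder (BERT in the cited work);
# here it is just a frozen embedding table so the sketch is self-contained.
frozen_text_encoder = nn.Embedding(1000, D)
for p in frozen_text_encoder.parameters():
    p.requires_grad = False

# Trainable semantic encoder sitting on top of the acoustic encoder output.
semantic_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)

def semantic_alignment_loss(acoustic_states, text_ids):
    """MSE between pooled semantic-encoder states and pooled frozen text embeddings."""
    sem = semantic_encoder(acoustic_states).mean(dim=1)        # utterance-level vector
    with torch.no_grad():
        txt = frozen_text_encoder(text_ids).mean(dim=1)        # reference text vector
    return F.mse_loss(sem, txt)

# Toy usage: 2 utterances, 50 encoder frames, 12 reference tokens.
loss = semantic_alignment_loss(torch.randn(2, 50, D), torch.randint(0, 1000, (2, 12)))
loss.backward()
```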
2.
ST using MT: This section discusses approaches that either use MT data for pre-training or directly use a pre-trained MT model in the ST decoder. Some of these approaches rely on the idea of generating pseudo-text and then translating it using MT. For example, Unsupervised Term Discovery (UTD) (Bansal et al., 2017) groups repeated words into pseudo-text, which is subsequently used for training an MT model on the parallel pseudo-text and target translations. The main advantage of such a system is that it can translate some content words under low-resource settings, but the overall results are not very promising on the Spanish-English CallHome dataset. Another limitation of this work is that the approach is not E2E in a true sense, as it involves two models: a UTD model and an MT model. A weakly supervised learning method for ST (Jia et al., 2019) that outperforms multi-task learning takes advantage of pre-trained MT and TTS synthesis modules. A pre-trained MT model is used as a teacher to guide the student ST model in (Liu et al., 2019) (an approach dubbed knowledge distillation, KD; a minimal word-level KD sketch follows this item). They, however, rely on source-language text and do not improve upon the pipeline system. Following along, (Gaido et al., 2020b) explores word-, sentence-, and sequence-interpolation-based KD approaches for transferring knowledge from a pre-trained MT model to the ST model.
2. 基于机器翻译的语音翻译:本节探讨利用机器翻译数据进行预训练或直接在语音翻译解码器中使用预训练机器翻译模型的方法。这些方法的核心思想是先生成伪文本,再通过机器翻译进行转换。例如无监督术语发现(UTD)(Bansal 等人,2017)将重复词汇聚类为伪文本,随后利用伪文本与目标译文的平行语料训练机器翻译模型。该体系的主要优势在于能在低资源环境下翻译部分实义词。但在西班牙语-英语 Call-Home 数据集上的整体表现不尽如人意。该工作的另一局限在于其本质上并非真正的端到端系统,而是包含 UTD 和机器翻译两个独立模型。Jia 等人(2019)提出的弱监督语音翻译学习方法通过利用预训练机器翻译和文本转语音合成模块,其性能超越了多任务学习方案。Liu 等人(2019)采用预训练机器翻译模型作为教师模型指导学生语音翻译模型(该方法被称为知识蒸馏),但该方法依赖源语言文本且未超越传统级联系统的性能。 随后,(Gaido 等人,2020b)探索了基于单词、句子和序列插值的知识蒸馏方法,用于将预训练机器翻译模型的知识迁移至语音翻译模型。 -
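To make the knowledge-distillation idea concrete, here is a minimal sketch of word-level KD, where the ST student's output distribution is pulled towards a pre-trained MT teacher's distribution over the same target positions. The temperature, mixing weight, and toy logits are assumptions rather than the exact recipe of the cited papers.

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, gold_ids,
                       temperature=2.0, alpha=0.5):
    """Mix cross-entropy on gold translations with KL towards the MT teacher."""
    ce = F.cross_entropy(student_logits.transpose(1, 2), gold_ids)
    # Soften both distributions; the teacher is treated as a fixed target.
    t_prob = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

# Toy batch: 2 sentences, 7 target positions, vocabulary of 500 types.
student = torch.randn(2, 7, 500, requires_grad=True)   # from the ST model
teacher = torch.randn(2, 7, 500)                        # from the pre-trained MT model
gold = torch.randint(0, 500, (2, 7))
word_level_kd_loss(student, teacher, gold).backward()
```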
3.
ST using both MT and ASR: This section discusses works employing MT and ASR pre-trained models (Bahar et al., 2020; Tsiamas et al., 2022a) or losses for transfer or multitask learning.
3. 结合机器翻译与自动语音识别的语音翻译:本节讨论利用机器翻译和自动语音识别预训练模型(Bahar 等人,2020;Tsiamas 等人,2022a)或损失函数进行迁移学习或多任务学习的研究工作。Multitask learning proves to be effective when CTC loss is combined with ASR and MT losses in (Bahar et al., 2019a) using various E2E ST architectures such as direct, multitask many-to-one, one-to-many, tied-cascade, and tied-triangle. They show that models pre-trained with ASR and MT losses achieve promising results. Contrary to the claims of (Anastasopoulos and Chiang, 2018), the tied-triangle architecture is no better than a direct model when fine-tuned properly. Since the ST task is similar to the MT task from the output perspective, works such as XSTNet (Ye et al., 2021) utilize external MT data to pre-train the encoder-decoder network extensively and then fine-tune it using parallel MT, ST, and ASR corpora together with the external MT data, optimizing the model using what they call progressive training. They achieve impressive performance on MuST-C and Augmented LibriSpeech data and also demonstrate improved performance on the auxiliary MT and ASR tasks. The STPT model (Tang et al., 2022) proposes four sub-tasks for multitask pre-training: text-to-text (T2T), which is self-supervised; speech-to-phoneme, which is supervised; acoustic learning, which is self-supervised; and ST, which is supervised. Only the T2T and ST tasks are subsequently used for fine-tuning. Despite pre-training on "unlabeled" speech data, they obtain superior results on MuST-C data for the ST task. COSTT (Dong et al., 2020) pre-trains the encoder using ASR data and the decoder using paired MT data, and then fine-tunes for the joint transcription-translation task. ComSL is a composite ST model relying on multitask learning with three task losses (for ASR, MT, and ST) combined with a cross-modality loss to bridge the modality gap (Le et al., 2023a). It is worth mentioning that ComSL does not require force-aligned ST data and learns the cross-modality alignment during training. This, however, requires optimizing four different losses, viz. Masked Token Prediction, Speech-to-Text Mapping, Encoder Representation Matching, and Decoder Distribution Matching (see (Le et al., 2023a) for more details), similar to (Tang et al., 2021b). Fused acoustic and text encoding ST (FAT-ST) (Zheng et al., 2021b) follows a similar pre-training and fine-tuning idea as ComSL, except that it can use any combination of training data drawn from the speech, transcript, and translation triplets. Essentially, they rely on masked language modeling (MLM) and translation language modeling (TLM) for pre-training (Conneau and Lample, 2019).
多任务学习被证明是有效的,当(Bahar 等人,2019a)将 CTC 损失与自动语音识别(ASR)和机器翻译(MT)损失结合使用时,采用了多种端到端语音翻译架构,包括直接模型、多对一多任务、一对多多任务、级联绑定和三角绑定架构。研究表明,采用 ASR 和 MT 损失预训练的模型能取得优异效果。与(Anastasopoulos 和 Chiang,2018)的论断相反,经过适当微调后,三角绑定架构并不优于直接模型。由于从输出角度看语音翻译任务与机器翻译任务相似,诸如 XSTNet(Ye 等人,2021)等研究利用外部 MT 数据对编码器-解码器网络进行大规模预训练,继而采用 MT、ST、ASR 的平行语料库及外部 MT 数据进行渐进式训练优化模型。该模型在 MuST-C 和增强版 Librispeech 数据集上表现卓越,同时在 MT 和 ASR 辅助任务中也展现出性能提升。STPT 模型(Tang 等人,2022)提出了四种多任务预训练子任务:自监督的文本到文本(T2T)、有监督的语音到音素、自监督的声学学习以及有监督的语音翻译。 后续仅使用文本到文本(T2T)和语音转文本(ST)任务进行微调。尽管基于"未标注"语音数据进行预训练,他们在 MuST-C 数据集的 ST 任务中仍取得了优异表现。COSTT(Dong 等,2020)采用 ASR 数据预训练编码器,利用配对机器翻译数据预训练解码器,最后针对联合转录-翻译任务进行微调。ComSL 作为复合型 ST 模型,通过结合三重损失函数( )与跨模态损失来实现模态间隙的桥接(Le 等,2023a)。值得注意的是,ComSL 无需强制对齐的 ST 数据,而是在训练过程中自动学习跨模态对齐。但该方法需要同步优化四种不同损失函数:掩码标记预测、语音到文本映射、编码器表征匹配和解码器分布匹配 5 ,其思路与(Tang 等,2021b)相似。融合声学与文本编码的 ST 模型(FAT-ST)(Zheng 等,2021b)沿用了与 ComSL 相似的预训练-微调框架,但其创新点在于允许任意组合使用来自 6 的训练数据。该模型本质上依赖于掩码语言建模(MLM)和翻译语言建模(TLM)进行预训练(Conneau 和 Lample,2019)。 -
4.
Non-Autoregressive Modeling (we present the discussion of NAR within the multitask learning framework because all NAR E2E ST models are optimized within the multitask framework): As discussed in the background section, an alternative to Autoregressive (AR) modeling is Non-Autoregressive (NAR) modeling. AR assumes that each output token is conditionally dependent on the previously generated tokens; however, this causes significant latency during inference. NAR models solve this problem by outputting all the translated tokens in parallel, thus speeding up the inference. Formally, they are given by
$P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} P(y_t \mid \mathbf{x})$ (23), i.e., every target token is predicted conditioned only on the input speech and not on the previously generated tokens. There has been a surge in applying non-autoregressive (NAR) models in ASR and MT, and it has prompted ST researchers to apply them too. For example, (Inaguma et al., 2020a, 2021) train NAR and autoregressive decoders conditioned on a shared speech encoder. Another line of NAR work (Chuang et al., 2021) explores CTC with ASR as an auxiliary task. A CTC-based encoder-only architecture ((Inaguma et al., 2020a, 2021) use both an encoder and a decoder) for the NAR E2E ST task is shown to perform comparably to or better than strong AR models in (Xu et al., 2023a). A toy illustration of parallel CTC decoding is given after this item.
非自回归模型(NAR)在自动语音识别(ASR)和机器翻译(MT)领域的应用激增,这也促使语音翻译(ST)研究者开始采用该技术。例如,(Inaguma et al., 2020a, 2021) 训练了基于共享语音编码器的非自回归与自回归联合解码方案。另一类非自回归研究 (Chuang et al., 2021) 探索了以 ASR 作为辅助任务的 CTC 方法。研究表明,(Xu et al., 2023a) 提出的纯编码器架构(Inaguma 等人采用编码器-解码器架构)CTC 方案在端到端非自回归语音翻译任务中,其性能可媲美或超越强自回归模型。
4. 非自回归建模 7 如背景部分所述,自回归(AR)建模的替代方案是非自回归(NAR)建模。自回归模型假设输出标记依赖于先前生成的标记,但这会导致推理时产生显著延迟。非自回归模型通过并行输出所有翻译标记来解决该问题,从而加速推理过程。其形式化表达如公式(23)所示
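As a concrete illustration of why NAR inference is fast, the sketch below performs CTC-style parallel (greedy) decoding on top of an encoder-only model: all positions are scored in one forward pass and then collapsed by removing repeats and blanks. The toy encoder and vocabulary are assumptions; real NAR ST systems are considerably more elaborate.

```python
import torch
import torch.nn as nn

BLANK, VOCAB, D = 0, 200, 128

# Encoder-only model with a CTC head: no autoregressive decoder is involved.
encoder = nn.Sequential(nn.Linear(80, D), nn.ReLU(), nn.Linear(D, VOCAB))

def ctc_greedy_decode(speech_feats):
    """Predict every frame in parallel, then collapse repeats and drop blanks."""
    logits = encoder(speech_feats)              # (B, T, V) in a single pass
    frame_ids = logits.argmax(dim=-1)           # non-autoregressive: no token feedback
    outputs = []
    for seq in frame_ids:
        collapsed, prev = [], None
        for tok in seq.tolist():
            if tok != prev and tok != BLANK:    # CTC collapse rule
                collapsed.append(tok)
            prev = tok
        outputs.append(collapsed)
    return outputs

print(ctc_greedy_decode(torch.randn(2, 40, 80)))   # two hypotheses, decoded in parallel
```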
Discussion: Our study of Seq2Seq-based frameworks for ST task reveals that (a) structural bias can be obtained by stacked/pyramidal RNN and alignment smoothing, (b) regularizers such as transitivity and invertibility improves Character Error Rate, (c) HLR helps in transcription as well as translation, and (d) changing the self-attention of the encoder with a logarithmic distance penalty enhances translation, (e) Progressive training needs a huge data and training time to achieve superior results, and (f) multitask pre-training can be used to leverage unlabeled speech data. (Zhang et al., 2022a) shows that ST models trained from scratch using only ST tasks perform on par with or surpass pre-trained models. To achieve such results, proposed best practices include a smaller vocabulary, a wider feedforward layer, a deep speech encoder with the post-layer norm, CTC-based regularization, and parameter-distance penalty. Pre-training is still useful in low-resource data regimes. Transferring knowledge via KD from pre-trained MT to ST causes gender bias, omission of sentences, and generic verbal-tense choice. Use of large vocabulary and models is effective for NAR E2E ST task (Inaguma et al., 2020a). It indicates that leveraging NAR with LLMs may be a future direction to explore.
讨论:我们对基于 Seq2Seq 的语音翻译(ST)任务框架的研究表明:(a) 通过堆叠/金字塔式 RNN 和对齐平滑可获得结构偏置;(b) 传递性和可逆性等正则化器能改善字符错误率;(c) 分层线性回归(HLR)对转写和翻译均有助益;(d) 将编码器的自注意力机制改为带对数距离惩罚的形式可提升翻译质量;(e) 渐进式训练需要大量数据和训练时间才能获得优异结果;(f) 多任务预训练可利用未标注语音数据。Zhang 等人(2022a)的研究显示,仅使用 ST 任务从头训练的模型性能可媲美或超越预训练模型。实现该效果的最佳实践包括:采用更小的词表、更宽的前馈层、带后层归一化的深度语音编码器、基于 CTC 的正则化以及参数距离惩罚。在低资源数据场景下,预训练仍具价值。通过知识蒸馏(KD)从预训练机器翻译模型迁移知识会导致性别偏见、句子遗漏和通用时态选择问题。 使用大规模词汇和模型对非自回归端到端语音翻译任务具有显著效果(Inaguma 等人,2020a)。这表明将非自回归模型与 LLMs 结合可能是未来值得探索的研究方向。
7.1.2 Streaming frameworks
7.1.2 流式处理框架
Streaming frameworks for ST tasks start outputting target tokens after seeing only partial inputs, that is, they translate the input as soon as it arrives without waiting for the entire utterance. They are also known as Simultaneous ST (SimulST or SST) (Goldman-Eisler, 1972; Fügen et al., 2007; Tsiartas et al., 2013; Grissom II et al., 2014); note that in the MT literature, some works such as (Iranzo-Sánchez et al., 2022) differentiate between the streaming and simultaneous settings, where sentences are treated independently from each other, but existing ST works make no such differentiation. SST finds application in online speech translation and video dubbing, to name a few. Traditionally, the streaming ST problem has been solved by feeding the segmented output of a streaming ASR model to a streaming MT model (Oda et al., 2014; Iranzo-Sánchez et al., 2020). However, due to the cascaded nature of the model, it is prone to high latency and error propagation (Arivazhagan et al., 2019b, 2020; Zaidi et al., 2022). The SST problem faces several issues in practical implementation, with reordering, acoustic ambiguity, variable speech rate, and long inputs being prominent among them. Our literature survey reveals that most existing works focus on handling long streaming inputs, and therefore the discussion below revolves around that. The other issues mentioned above should also be considered when designing practical SST models.
面向语音翻译任务的流式处理框架在仅接收部分输入时便开始输出目标语符,即对输入内容进行实时翻译而无需等待完整输入。这类技术亦被称为同步语音翻译(SimulST 或 SST) 8 (Goldman-Eisler,1972;Fügen 等,2007;Tsiartas 等,2013;Grissom II 等,2014),其典型应用场景包括在线语音翻译和视频配音等。传统解决方案通常将流式语音识别模型的片段输出馈送至流式机器翻译模型进行处理(Oda 等,2014;Iranzo-Sánchez 等,2020)。然而由于级联模型的结构特性,该方法存在高延迟和错误传播的固有缺陷(Arivazhagan 等,2019b,2020;Zaidi 等,2022)。同步语音翻译在实际应用中面临若干挑战,其中词序重组、声学歧义、语音速率变化及长序列输入等问题尤为突出。文献调研表明,现有研究多聚焦于处理长流式输入,故下文讨论将围绕该主题展开。其他提及的问题在设计实用化同步翻译模型时亦需纳入考量。
Figure 7: Incremental decoding framework. CP denotes the common prefix. Figure adapted from (Guo et al., 2024). 图 7:增量解码框架。CP 表示共同前缀。图改编自(Guo 等人,2024 年)
Existing streaming frameworks intervene Seq2Seq framework at various places to design SST models. These are (a) encoder-level, (b) decoder-level, and (c) input/latent-level.
现有流式框架通过在 Seq2Seq 架构的不同环节进行干预来设计语音到文本翻译模型。具体包括:(a)编码器层面,(b)解码器层面,以及(c)输入/隐层层面。
-
1.
Encoder-level: SOTA SST models use Transformers as encoders. Because the self-attention operation looks at the entire utterance, it is unsuitable for streaming inputs. Some works therefore design encoders specialized for streaming inputs. For example, the augmented memory transformer (Wu et al., 2020; Ma et al., 2020c) splits the utterance into smaller segments, where each segment consists of a left context, a main context, and a right context. Self-attention is calculated at the segment level only, thereby reducing the time complexity, and an augmented memory propagates information from one segment to the next. The incremental Transformer (Zhang et al., 2020) leverages a unidirectional encoder based on unidirectional attention with the future context masked for handling streaming inputs.
1. 编码器层面:当前最先进的语音到文本翻译模型采用 Transformer 作为编码器。由于自注意力机制需要处理完整话语,这种架构不适合流式输入。已有研究设计了专门针对流式输入的编码器。例如增强记忆 Transformer(Wu 等人,2020 年;Ma 等人,2020c 年)将话语 分割为较小片段 ,每个片段 包含左上下文 、主上下文 和右上下文 。自注意力仅在片段层面计算,从而降低时间复杂度。增强记忆机制实现了片段间的信息传递。增量 Transformer(Zhang 等人,2020 年)则采用基于单向注意力的编码器,通过掩码未来上下文来处理流式输入。 -
2.
Decoder-level: Instead of modifying encoders, some works such as (Dalvi et al., 2018; Liu et al., 2020a; Nguyen et al., 2021; Guo et al., 2024) propose incremental decoding (see Fig. 7). In this framework, input speech is divided into fixed-size chunks and decoded every time a new chunk arrives. To avoid distractions from constantly changing hypotheses, selected chunk-level predictions are committed to and no longer modified, and the decoding of the next chunk is conditioned on the committed predictions. Instead of conditioning on all chunk-level predictions, a prefix function is chosen to select a partial hypothesis, because early chunks contain limited information (Liu et al., 2020a). Several strategies exist for choosing the prefix function, for example Hold-n and LA-n (Liu et al., 2020a), SP-n (Nguyen et al., 2021), and Regularized Batched Inputs (R-BI) (Guo et al., 2024). Of these, Hold-n withholds or deletes the last n tokens in each chunk, LA-n commits the agreeing prefixes of n consecutive chunks (a minimal sketch of this local-agreement rule follows this item), and SP-n commits the shared prefix of the n best-ranked hypotheses. In contrast, R-BI applies various augmentations to input chunks to achieve regularization and SOTA results on the IWSLT SimulST task.
2. 解码器层面:与修改编码器不同,部分研究(Dalvi 等人,2018;Liu 等人,2020a;Nguyen 等人,2021;Guo 等人,2024)提出了增量式解码方案(见图 7)。该框架将输入语音划分为固定大小的片段,每当新片段到达时进行解码。为避免持续变化的假设干扰,已选定的片段级预测结果将被固定且不再修改。后续片段的解码过程以已确认的预测结果为条件。由于早期片段包含信息有限(Liu 等人,2020a),解码过程并非基于所有片段级预测,而是通过前缀函数选择部分假设。现有多种前缀函数选择策略,例如 Hold- 和 LA- (Liu 等人,2020a)、SP- (Nguyen 等人,2021)以及正则化批量输入(R-BI)(Guo 等人,2024)。其中 Hold- 策略对每个片段末尾的 个词元进行保留或删除处理,LA- 策略需显示 个连续片段的一致前缀,SP- 则表示所有最优假设的共享前缀。 与此相反,RB-I 通过对输入块应用多种数据增强技术来实现正则化,并在 IWSLT 同声传译任务上取得了当前最优性能。 -
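The sketch below illustrates the prefix-selection step of incremental decoding with a local-agreement style rule: only the prefix on which the hypotheses of consecutive chunks agree is committed, and everything after it may still change. The toy hypotheses and the agreement of exactly two chunks are illustrative assumptions.

```python
from typing import List

def agreed_prefix(hypotheses: List[List[str]]) -> List[str]:
    """Longest common prefix of the hypotheses produced for consecutive chunks."""
    prefix = []
    for tokens in zip(*hypotheses):
        if all(tok == tokens[0] for tok in tokens):
            prefix.append(tokens[0])
        else:
            break
    return prefix

# Hypotheses after chunk 1 and chunk 2 (LA-2 style agreement of two chunks).
hyp_chunk1 = "we will talk about the new model".split()
hyp_chunk2 = "we will talk about a new translation model".split()

committed = agreed_prefix([hyp_chunk1, hyp_chunk2])
print(committed)   # ['we', 'will', 'talk', 'about'] -> displayed and never revised
# Decoding of the next chunk is then conditioned on the committed prefix only.
```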
3.
Input/latent-level: Since speech input is too fine-grained, deciding when to READ and WRITE is challenging. Existing works introduce a pre-decision module that segments the input speech at fixed chunks (fixed) or word boundaries (flexible). Similarly, the READ/WRITE policy can be fixed or adaptive (Ma et al., 2020b). Most research in SST concentrates on either improving speech encoding or the pre-decision module while relying on fixed policies such as wait-k. In this section, we discuss fixed and adaptive pre-decisions/policies. These techniques are combined with Seq2Seq frameworks to devise streaming ST models.
输入/潜在层面:由于语音输入过于细粒度,决定何时读取与写入具有挑战性。现有研究引入了预决策模块,将输入语音按固定分块(固定式)或词边界(灵活式)进行分割。类似地,读取/写入策略可采用固定或自适应方式(Ma 等,2020b)。当前语音到文本翻译研究主要集中于改进语音编码或预决策模块,同时依赖固定策略如 wait- 。本节将探讨固定式与自适应式预决策/策略,这些技术与序列到序列框架相结合,可构建流式语音翻译模型。
Figure 8: (a) wait-k based streaming ST; (b) RNN-T based streaming ST.
图 8:(a) 基于 wait-k 策略的流式语音翻译 (b) 基于 RNN-T 的流式语音翻译。The wait-k policy (Ma et al., 2018) (shown in Fig. 9) learns the model parameters by optimizing the negative log-likelihood of the translation given the partially read source, where k is the number of speech segments to read before starting translation (see Fig. 9). The probability is calculated as
等待-k 策略(Ma 等人,2018 年)(如图 9 所示)通过优化负对数似然来学习模型参数,其中 k 表示开始翻译前需要观察的语音片段数量(参见图 9)。概率的计算公式为:$P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} P(y_t \mid \mathbf{x}_{\le k+t-1}, \mathbf{y}_{<t})$ (24). The wait-k policy guarantees that the model can look at k+t-1 speech segments while predicting token y_t (Ren et al., 2020). However, one limitation of the wait-k policy is that it fails to do a beam search while decoding, except for the long tail (Ma et al., 2018). To solve this problem, (Zeng et al., 2021) proposes a wait-k stride-n policy: essentially a wait-k policy with the addition of n READ and n WRITE operations until the end of the sentence after reading the first k segments. To determine the k segments, (Chen et al., 2021b) leverages streaming ASR to guide the direct simultaneous ST decoding via beam search. A minimal sketch of the wait-k READ/WRITE schedule is given below.
等待- 策略确保模型在预测标记 时能够观察 个语音片段(Ren 等人,2020 年)。然而该策略的局限性在于,除长尾情况外(Ma 等人,2018 年),解码时无法执行束搜索。为解决此问题,(Zeng 等人,2021 年)提出了等待- 跨步- 策略,其本质是在读取前 个片段后,通过添加 个 READ 和 WRITE 操作直至句子结束的等待- 策略。(Chen 等人,2021b 年)则利用流式自动语音识别技术,通过束搜索确定 个片段来指导直接同步语音翻译解码。
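The following is a minimal sketch of the wait-k READ/WRITE schedule at inference time; `translate_step` is a hypothetical stand-in for one decoder step of an ST model, and the segment granularity and stopping rule are assumptions made only so the example runs on its own.

```python
from typing import Callable, Iterator, List

def wait_k_policy(segments: Iterator[str], k: int,
                  translate_step: Callable[[List[str], List[str]], str],
                  max_extra_tokens: int = 20) -> List[str]:
    """Read k segments first, then alternate one WRITE per additional READ."""
    read, written = [], []
    for seg in segments:
        read.append(seg)                                     # READ one source segment
        if len(read) >= k:
            written.append(translate_step(read, written))    # WRITE one target token
    # Source exhausted: keep writing until the (toy) model signals the end.
    for _ in range(max_extra_tokens):
        tok = translate_step(read, written)
        if tok == "<eos>":
            break
        written.append(tok)
    return written

# Toy "model": emits a counter so the READ/WRITE schedule itself is visible.
def toy_step(src, tgt):
    return f"y{len(tgt) + 1}" if len(tgt) + 1 < len(src) + 3 else "<eos>"

print(wait_k_policy(iter(["s1", "s2", "s3", "s4", "s5"]), k=3, translate_step=toy_step))
```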
Figure 9: wait-k strategy for the streaming ST setting. In this, the decoder waits for k input speech segments before starting to output. Thereafter, it produces one token for every source segment. The figure showcases the scenario for one particular value of k.
图 9:流式语音翻译场景中的等待-k 策略。在此策略中,解码器需等待 k 个输入语音片段后才开始输出。此后,每接收一个源语言片段便生成一个目标语标记。该图展示了某一特定 k 值下的应用场景。As discussed above, determining when to write is crucial for efficient SST. Contrary to the wait-k policy, which is a fixed policy, segmentation can be performed on the embedded speech using CTC (Ren et al., 2020), an attention mechanism (Papi et al., 2022b), or incremental beam search (Yan et al., 2023). Essentially, these works adapt offline ST to SST, showing spectacular performance on benchmark datasets. Note that the models proposed in (Papi et al., 2022b; Yan et al., 2023) are trained in a cascade manner while the inference is E2E. Another issue with a fixed policy is that the model cannot speed up or slow down appropriately with the input types. Other examples of fixed policies are Wait-If* (Cho and Esipova, 2016b) and Monotonic Chunkwise Attention (MoChA) (Chiu and Raffel, 2018), which have been used in simultaneous MT and may be explored for SST.
如上所述,确定何时书写对于高效的语音到文本同传至关重要。与固定策略的 Wait- 策略不同,分段操作可通过连接时序分类(Ren 等,2020)、注意力机制(Papi 等,2022b)或增量束搜索(Yan 等,2023)在嵌入式语音上实现。这些研究本质上将离线语音翻译技术适配为同传场景,在基准数据集上展现出卓越性能。需注意的是,(Papi 等,2022b;Yan 等,2023)提出的模型采用级联方式训练,而推理过程则是端到端的。固定策略的另一缺陷在于模型无法根据输入类型动态调节处理速度。其他固定策略实例包括 Wait-If*(Cho 和 Esipova,2016b)及单调分块注意力机制(MoChA)(Chiu 和 Raffel,2018),这些方法已在同步机器翻译中应用,或可探索其在语音到文本同传中的适用性。The works mentioned above require that encoded speech be segmented so that the decoder can apply the wait- policy. The goal of segmentation is to identify the word, sub-word, or phone boundary which are usually not even (due to silences, longer syllables, etc.). That means the number of acoustic units varies with time in each segment. Monotonic-segmented Streaming ST (MoSST) (Dong et al., 2022) is based on learning when to translate, which has a monotonic segmentation module located between the acoustic encoder and the transformer. It has an Integrate-and-Fire (IF) neuron (Abbott, 1999), which fires above a threshold when the context is developed. If the context is not developed, the neuron receives signals and accumulates the acoustic vectors, thus mimicking adaptive policy for READ-WRITE operation. IF strategy has shown impressive performance in simultaneous ASR (Dong and Xu, 2019) and ST (Chang and yi Lee, 2022). It can be used for monotonic segmentation of the speech input along with adaptive decision strategy (Dong et al., 2022). Another adaptive policy-based technique is Monotonic Infinite Lookback Attention (MILk) (Arivazhagan et al., 2019b) used in simultaneous MT can be explored for SST. It essentially is a Monotonic Attention mechanism (Raffel et al., 2017) that extends to infinite encoder states, theoretically, in the past and trains the MT model along with the MILk. It achieves better quality-latency trade-offs than MoCHA thanks to its soft attention to all the encoder states and hard attention. Monotonic Multihead Attention (MMA) (Ma et al., 2019) that extends MILK to multiple heads has been used for SST by (Ma et al., 2020b). Its variants Efficient MMA (Ma et al., 2023) solve numerical stability and biased monotonic alignment issues present in MMA but have not been explored for SST tasks. Adaptive segmentation based on an adaptive policy that takes into account acoustic features and translation history (called meaningful units) is another effective mechanism for SST (Zhang et al., 2022b).
上述研究要求对编码后的语音进行分段,以便解码器能够应用等待- 策略。分段的目的是识别通常不均匀的词、子词或音素边界(由静音、较长音节等因素导致),这意味着每个片段中的声学单元数量随时间变化。基于学习翻译时机的单调分段流式语音翻译(MoSST)(Dong 等人,2022)在声学编码器与 Transformer 之间设置了单调分段模块,其采用具有积分发放特性的神经元(Abbott,1999),当上下文信息充分时会触发阈值以上的发放。若上下文未充分发展,该神经元将持续接收信号并累积声学向量,从而模拟读写操作的自适应策略。该积分发放策略在同步语音识别(Dong 和 Xu,2019)及语音翻译(Chang 和 yi Lee,2022)中展现出卓越性能,可结合自适应决策策略(Dong 等人,2022)实现语音输入的单调分段处理。 另一种基于自适应策略的技术是单调无限回望注意力机制(MILk)(Arivazhagan 等人,2019b),该技术最初用于同步机器翻译,也可探索应用于语音到文本翻译。其本质是一种单调注意力机制(Raffel 等人,2017)的理论延伸,能够无限回溯编码器历史状态,并与 MILk 联合训练机器翻译模型。由于其对所有编码器状态采用软注意力与硬注意力相结合的方式,相比 MoCHA 实现了更优的质量-延迟权衡。单调多头注意力(MMA)(Ma 等人,2019)将 MILk 扩展至多头机制,已被(Ma 等人,2020b)应用于语音到文本翻译。其改进版本高效 MMA(Ma 等人,2023)解决了原 MMA 中存在的数值稳定性和单调对齐偏差问题,但尚未在语音到文本翻译任务中得到探索。另一种有效机制是基于声学特征和翻译历史(称为意义单元)的自适应策略进行分段处理(Zhang 等人,2022b)。Both fixed and adaptive policy mechanisms employ segmentation modules that are outside the translation module. As such, it breaks the acoustic integrity and potentially may drop the translation performance. Therefore, efforts such as (Zhang and Feng, 2023) propose differentiable segmentation (DiSeg) learned jointly with the translation model using expectation training. DiSeg essentially predicts a Bernoulli random variable , via a feed-forward network (FFN), to decide when to segment. After segmentation, they apply segmented attention which combines unidirectional and bidirectional attention into one while masking future speech frames. Expectation training first constrains the number of segments followed by learning segmentation from the translation model both at semantic and acoustic levels (Zhang and Feng, 2023).
固定策略与自适应策略机制均采用独立于翻译模块的分割组件,这破坏了语音信号的完整性并可能导致翻译性能下降。为此,张与冯(2023)提出了可微分分割(DiSeg)方法,通过期望训练实现与翻译模型的联合学习。DiSeg 本质上通过前馈神经网络预测伯努利随机变量 以决策分割时机,随后采用融合单向与双向注意力的分段注意力机制,同时屏蔽未来语音帧。期望训练首先约束分段数量,继而从语义和声学两个层面实现翻译模型驱动的分割学习(Zhang and Feng, 2023)。
The discussion so far has covered encoder- and decoder-level changes and the fixed and adaptive policies used for segmentation to develop SST models within Seq2Seq frameworks. Another way to design SST models is by transduction, the process of mapping one sequence to another sequence (Jurafsky and Martin, 2008). A transducer is a special type of Seq2Seq model that solves a few inherent problems; for example, online processing of long inputs and monotonic sequence alignment, which are among the biggest problems with Seq2Seq models (Graves, 2012), are handled by transducers. Below we discuss a special type of transducer called RNN-T and its improvements.
前文已涵盖在序列到序列(Seq2Seq)框架下构建语音到文本翻译(SST)模型时,编码器与解码器层级的改进,以及用于分段的固定策略与自适应策略。另一种设计 SST 模型的方法是通过转导(Transduction)实现。转导是将一个序列映射为另一个序列的过程(Jurafsky 和 Martin,2008)。转导器作为 Seq2Seq 模型的特例,能解决若干固有问题。例如,Seq2Seq 模型在处理长输入流的在线运算与单调序列对齐方面存在显著缺陷(Graves,2012),而转导器可有效解决这些问题。下文将重点讨论 RNN-T 这一特殊转导器及其改进方案。
Figure 10: Architectures of (a) RNN-T and (b) CAAT. Figure adapted from (Liu et al., 2021a). 图 10:(a) RNN-T 和 (b) CAAT 的架构示意图。本图改编自(Liu 等人,2021a)。
RNN-T is a transducer that can learn the alignment between two sequences in an online/streaming fashion (Graves, 2012), as shown in Fig. 10(a). Formally, it learns the conditional probability P(y|x) by marginalizing over all possible alignment paths, which may include the blank symbol, as
RNN-T 是一种能够以在线/流式方式学习两个序列间对齐关系的转导器(Graves,2012),如图 10(a)所示。其形式化定义为通过边缘化所有可能对齐路径(包含空白符号)来计算条件概率。
$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x})$ (25), where $\mathcal{B}$ is the function that removes the blank symbols from an alignment $\mathbf{a}$.
RNN-T differs from Seq2Seq in the sense that it divides the decoder into a predictor and a joiner. The predictor takes the previous time step output and yields the representation to be consumed by the joiner along with the hidden representation of the input from the encoder. Since the predictor does not look at the input, it can be pre-trained on the text-only data in a low-data scenario. There have been several SST models proposed based on variants of RNN-T which we discuss next.
RNN-T 与 Seq2Seq 的区别在于它将解码器拆分为预测器和连接器。预测器接收前一时刻输出并生成表征,该表征将与编码器输入的隐藏表征共同输入连接器。由于预测器不直接观察输入数据,在低数据场景下可仅通过文本数据进行预训练。目前已提出多种基于 RNN-T 变体的端到端语音翻译模型,我们将在下文详述。
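The predictor/joiner split can be made concrete with the minimal sketch below: the joiner combines each encoder (acoustic) state with each predictor (text) state and scores the target vocabulary plus a blank symbol. The dimensions and the simple additive combination are illustrative assumptions, not the design of any specific cited system.

```python
import torch
import torch.nn as nn

VOCAB, BLANK, D = 100, 0, 64   # vocabulary includes the blank symbol at index 0

class Predictor(nn.Module):
    """Consumes previously emitted tokens only; never looks at the audio."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.rnn = nn.GRU(D, D, batch_first=True)

    def forward(self, prev_tokens):
        out, _ = self.rnn(self.embed(prev_tokens))
        return out                      # (B, U, D)

class Joiner(nn.Module):
    """Combines each encoder frame with each predictor state and scores tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, VOCAB)

    def forward(self, enc, pred):
        # Broadcast to the (B, T, U, D) lattice used for transducer training/decoding.
        joint = torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1))
        return self.proj(joint)         # (B, T, U, VOCAB) logits incl. blank

enc_states = torch.randn(2, 30, D)                       # from the speech encoder
pred_states = Predictor()(torch.randint(1, VOCAB, (2, 8)))
logits = Joiner()(enc_states, pred_states)
print(logits.shape)                                      # torch.Size([2, 30, 8, 100])
```

Because the predictor never sees the audio, it can be pre-trained on text-only data, as noted above.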
One of the main issues with RNN-T is the strict monotonic alignment between the input and output sequences which makes them unsuitable for tasks requiring reordering such as MT, ST, etc. For example, Cross-Attention Augmented Transducer (CAAT shown in fig. 10(b)) optimizes translation and policy models in tandem (Liu et al., 2021a). It eliminates the RNN-T’s strict monotonic restriction for reordering in the translation. Using transformers as encoders to reduce the multi-step memory footprint causes a significant delay for CAAT. The use of regularization terms and substantial hyperparameter adjustment are some other limitations of CAAT. An extension of it in (Xue et al., 2022) leverages Transformer Transducer (TT) networks with attention pooling for streaming E2E ST tasks. Attention divides the input audio into chunks of specific sizes. At any time, processing any input frame can only see frames within its chunk and a fixed number of left chunks. By sharing the encoder, they also propose a variant to handle E2E ST tasks in multilingual settings. The adaptive READ and WRITE policy choices between encoder output and ground truth contribute to its success. The same authors (Wang et al., 2023) propose to combine the benefits of language-specific and language-agnostic encoders within the TT framework. A shared encoder takes LIDs as gating values and computes weights for each language through the source LID scheduling scheme. The empirical results demonstrate superior performance and a smaller number of trainable parameters than bilingual ST. Adaptive (dynamic) policy for segmenting speech input has recently been explored in a Seq2Seq transduction setting by (Tan et al., 2024). It essentially applies a cross-attention mechanism to decide when to segment the input followed by dynamic compression via anchor representation. Thus, it saves memory and achieves a better latency-quality trade-off.
RNN-T 的主要问题之一在于输入与输出序列间严格的单调对齐特性,这使得其不适用于需要重排序的任务(如机器翻译、语音翻译等)。以交叉注意力增强传感器(CAAT,见图 10(b))为例(Liu 等人,2021a),该模型通过同步优化翻译策略模块与策略模型,解除了 RNN-T 对翻译过程中重排序的严格单调限制。但采用 Transformer 编码器来降低多步内存占用的设计,导致 CAAT 存在显著延迟。此外,正则化项的使用与大量超参数调整也是 CAAT 的局限性所在。Xue 等人(2022)提出的改进方案采用带注意力池化的 Transformer 传感器(TT)网络处理流式端到端语音翻译任务,通过将输入音频分割为特定大小的块,使任意时刻处理的输入帧 仅能感知当前块及固定数量左侧块内的帧。该研究还通过共享编码器提出多语言端到端语音翻译的变体方案,其成功关键在于编码器输出与真实标签间自适应的 READ-WRITE 策略选择机制。 同一批作者(Wang 等,2023)提出在 TT 框架内结合语言特定编码器与语言无关编码器的优势。其共享编码器将语言标识符作为门控值,通过源语言标识符调度方案计算各语言的权重。实证结果表明,该方法不仅性能优越,且可训练参数数量少于双语语音翻译系统。近期(Tan 等,2024)在序列到序列转换框架中探索了语音输入分段的动态自适应策略,该方法通过交叉注意力机制决策分段时机,并采用锚点表示进行动态压缩,从而节省内存并实现更优的延迟-质量平衡。
Besides transducer and Seq2Seq models, re-translation is another approach adapted for the SST task by (Niehues et al., 2016, 2018; Arivazhagan et al., 2019a, 2020), though in a cascade setting. In this approach, the translated output can be re-generated after a fixed amount of time and displayed later for better quality. Though it reduces latency by greedily displaying the partial translation, the output is highly unstable and causes a flickering effect, which may result in a bad user experience. To mitigate instability, (Arivazhagan et al., 2020) propose a metric called erasure, which takes into account the length of the suffix deleted during re-translation. Dynamic masking of the MT output in a cascade of streaming ASR and MT for improving stability has been explored in (Yao and Haddow, 2020). Another approach to reducing instability uses luminance contrast and the Discrete Fourier Transform (Liu et al., 2023).
除 Transducer 和 Seq2Seq 模型外,重翻译是(Niehues 等, 2016, 2018; Arivazhagan 等, 2019a, 2020)在级联框架下采用的另一种语音到文本翻译方法。该方法可在固定时间间隔后重新生成翻译结果并延迟显示以提升质量。虽然通过贪心策略显示部分翻译降低了延迟,但输出极不稳定且会产生闪烁效应,可能导致较差的用户体验。为缓解不稳定性,(Arivazhagan 等, 2020)提出了考虑重翻译过程中删除后缀长度的"擦除"度量指标。(Yao 和 Haddow, 2020)探索了在流式语音识别与机器翻译级联系统中动态掩码机器翻译输出的方法以提升稳定性。(Liu 等, 2023)则采用亮度对比度和离散傅里叶变换作为另一种降低不稳定性的解决方案。
Evaluation of SST models: SST models in the literature have been evaluated using the quality and latency metrics presented in §3. Often showing a trade-off between quality and latency. Most of the existing works attempt to balance the quality and latency ignoring the visualization and cognitive load on the viewer when displayed on a screen. Towards this end, (Papi et al., 2021b) emphasizes considering visualization as a metric to be evaluated along with the latency and quality. However, little effort has been made in this direction by the SST community. Therefore, we wish to draw the researcher’s attention to also consider visualization as an evaluation metric for SST models. Towards this end, (Liu et al., 2023) propose tokenized alignment, word updates with semantic similarity, and smooth animation of live captions. They find that it leads to a reduction in fatigue, and distractions while increasing the viewer’s reading comfort.
端到端语音转文本翻译模型的评估:现有文献中对 SST 模型的评估主要采用第 3 节所述的质量与延迟指标,这两者往往呈现此消彼长的关系。当前大多数研究试图平衡质量与延迟,却忽略了字幕在屏幕上显示时的可视化效果及观看者的认知负荷。为此,(Papi 等人,2021b)强调应将可视化作为与延迟和质量并列的评估指标。然而 SST 领域在此方向的探索仍显不足。因此我们呼吁研究者将可视化纳入 SST 模型的评估体系。(Liu 等人,2023)提出的分词对齐、基于语义相似度的单词更新以及实时字幕平滑动画技术,经证实能有效降低观看疲劳与注意力分散,同时提升阅读舒适度。
Discussion: SST is a challenging problem, and E2E SST poses further impediments. Our findings suggest that using an adaptive policy significantly improves the latency-quality trade-off. Learned policy mechanisms are an ongoing research topic, and adapting them to true long-form SST may open new possibilities. Differentiable segmentation for long sequences remains largely untapped and requires more investigation. Re-translation is found to be on par with or better than SOTA streaming models (Arivazhagan et al., 2020) under a very low revision rate; such a finding suggests considering re-translation in E2E SST system design.
讨论:语音到文本翻译(SST)本身已是一个具有挑战性的问题,而端到端 SST 则带来了更多障碍。我们的研究表明,采用自适应策略能显著改善延迟与质量的权衡关系。学习型策略机制一直是持续研究的课题,将其适配于真正的长文本 SST 可能开辟新的可能性。针对长序列的可微分分割研究仍处于探索阶段,需要更多深入调查。在极低修正率条件下,重翻译表现与当前最优流式模型相当或更优(Arivazhagan 等人,2020),这一发现暗示在端到端 SST 系统设计中应考虑引入重翻译机制。
7.2 ST Models based on the Nature of Available Data
7.2 基于数据特性的语音翻译模型
In the previous section, we provided an overview of the ST models based on the frameworks used. The present section provides readers with another perspective on E2E ST models. In particular, it discusses the E2E ST models categorized based on the nature of the data, such as data is low-resource, streaming, multilingual, etc. Given the specific challenges they pose, we believe such a categorization might be interesting to researchers.
前文从技术框架角度概述了语音翻译模型,本节将为读者提供端到端语音翻译模型的另一种视角。具体而言,将根据数据特性(如低资源数据、流式数据、多语言数据等)对端到端语音翻译模型进行分类讨论。鉴于各类数据带来的特殊挑战,我们认为这种分类方式对研究者具有独特参考价值。
7.2.1 ST in Low-Resource settings
7.2.1 低资源环境下的语音翻译
A low-resource language (LRL) is one for which speech and/or text data are scarcely available, usually not enough to pre-train Seq2Seq models. As such, LRLs present challenges of their own, such as overfitting and poor generalization. This section discusses works where ST models are developed especially for low-resource languages. The proposed models in this category have a generic architecture, shown in Fig. 11(a), which is similar to Seq2Seq ST models. The approaches mainly pre-train the encoder on high-resource ASR data and subsequently fine-tune on ST data. Another approach that has emerged in recent years to tackle LRL issues is SSL. For example, (Bansal et al., 2019) empirically demonstrates a 100% performance improvement on ST tasks; they find that even when the ASR language differs from the source and target languages, pre-training on ASR data enhances ST task performance. Though the BLEU score is improved, the absolute BLEU score is only 7.1. In (Wang et al., 2022), unsupervised ST is implemented for low-resource settings using pseudo-labels from unsupervised cascade models. SSL with discrete speech units (DSUs) has been used to fine-tune the ST model on limited ST data (Lam et al., 2024).
低资源语言(LRL)是指语音和/或文本数据极其匮乏的语言——通常不足以预训练序列到序列模型。因此,低资源语言面临特有的挑战,如过拟合和泛化能力差。本节将重点讨论专门针对低资源语言开发的语音翻译模型。该类模型采用图 11(a)所示的通用架构,与常规序列到序列语音翻译模型类似。研究发现,这些方法主要通过对编码器进行高资源语音识别数据的预训练,再在语音翻译数据上进行微调。近年来兴起的另一解决方案是自监督学习(SSL)。例如(Bansal 等人,2019)通过实验证明该方法可使语音翻译任务性能提升 100%。他们发现,当语音识别训练语言与源语言/目标语言不同时,基于语音识别数据的预训练能显著提升翻译性能。虽然 BLEU 分数有所提高,但绝对值仅为 7.1 分。(Wang 等人,2022)则利用无监督级联模型生成的伪标签,实现了低资源环境下的无监督语音翻译。 采用离散语音单元(DSU)的自监督学习(SSL)方法已被用于在有限语音翻译数据上对 ST 模型进行微调(Lam 等人,2024 年)。
Figure 11: E2E ST models based on the nature of data: (a) low-resource ST, (b) code-mix ST, (c) unsupervised ST, (d) multilingual ST. Dashed arrows denote optional components. 图 11:基于数据性质的端到端语音翻译模型。(a) 低资源语音翻译,(b) 语码混合语音翻译,(c) 无监督语音翻译,(d) 多语言语音翻译。虚线箭头表示可选组件。
7.2.2 Code-mix ST 7.2.2 代码混合语音翻译
Code-mix language refers to speech where one primary language is used, but words or phrases from
other (embedded) languages are also included. This phenomenon arises from a multitude of challenges, encompassing ambiguous vocabulary, fluctuating lexical representations, intermingling of languages at the word level, redundancy, and alterations in word sequencing. Therefore, it is non-trivial to handle code-mixing while building ST models.
语码混合语言指以某种主要语言为基础,同时夹杂其他(嵌入)语言词汇或短语的语音现象。该现象衍生出多重挑战,包括词汇歧义、词形表征波动、词汇层面的语言混杂、冗余信息以及词序变异等问题。因此,在构建语音翻译模型时处理语码混合现象具有显著难度。
We find that there exist only a few works on code-mix ST.
In (Weller et al., 2022), a code-mix dataset is created from the existing publicly available Fisher (Cieri et al., 2004) and Miami corpora (https://github.com/apple/ml-code-switched-speech-translation). As shown in Fig. 11(b), code-mix ST models feed a language ID, in addition to the speech input, to the encoder of the Seq2Seq model (Weller et al., 2022). Wav2Vec 2.0, an acoustic encoder, and mBART, a multilingual decoder, are used for both languages, with an attention layer applied for the embedded language. The use of multilingual encoders and decoders is a common practice when building code-mix ST models (Yang et al., 2023). In particular, self-supervised multilingual pre-training with adapters may be explored further.
我们发现目前关于混合编码语音翻译(code-mix ST)的研究成果较少。Weller 等人(2022)利用现有公开语料库 Fisher(Cieri 等人,2004)和 Miami 9 构建了混合编码数据集。如图 11(b)所示,混合编码语音翻译模型除了语音输入外,还需向序列到序列模型的编码器输入语言标识(Weller 等人,2022)。该研究采用声学编码器 Wav2Vec 2.0 和多语言解码器 mBART 处理双语数据,并通过注意力机制处理嵌入语言特征。使用多语言编码器-解码器架构是构建混合编码语音翻译模型的常见方法(Yang 等人,2023)。特别是基于适配器的自监督多语言预训练方法值得进一步探索。
7.2.3 Unsupervised ST 7.2.3 无监督语音翻译
There is an abundance of unlabeled speech and text data. Since manual annotation and creating a parallel corpus is costly, the natural instinct is to exploit unlabeled data for training ST models. This section reviews works where researchers make use of the unlabeled speech data to advance the ST task performance.
存在大量未标注的语音和文本数据。由于人工标注和创建平行语料库成本高昂,研究者很自然地会尝试利用未标注数据来训练语音翻译模型。本节综述了研究人员利用未标注语音数据提升语音翻译任务性能的相关工作。
For unsupervised ST tasks, it is common to leverage large-scale self-supervised and semi-supervised learning. For example, speech encoders such as Wav2vec 2.0 have been pre-trained in a self-supervised manner on Librilight data (Kahn et al., 2019) and used by (Li et al., 2020; Wang et al., 2021b) whereas the decoder is randomly initialized. The entire model is optimized on CoVoST 2 ST data, and the encoder is frozen. Thereby, self-training is executed to generate pseudo-labels for Libri-light data. The Wav2Vec 2.0 is a “student” model which is fine-tuned with ground truth CoVoST 2 data and pseudo labels. Finally, a language model (LM) is trained on CommonCrawl data and combined with the ST model to generate text via beam-search decoding. Following along, for training the E2E model, (Wang et al., 2021b) produces pseudo-labels by cascading ASR, text de-normalization, and MT in an Unsupervised manner. Wav2Vec 2.0 and mBART are optimized for domain adaption using in-domain data (Li et al., 2020). According to experimental results, the proposed method is effective for E2E models without pre-training. However, between supervised and unsupervised pre-trained models, performance gap is encountered, which may be investigated in future works.
在无监督语音翻译任务中,通常采用大规模自监督与半监督学习方法。例如,Wav2vec 2.0 等语音编码器通过自监督方式在 Librilight 数据集上进行预训练(Kahn 等人,2019),并被 Li 等人(2020)与 Wang 等人(2021b)采用,而解码器则采用随机初始化。整个模型基于 CoVoST 2 语音翻译数据进行优化时,编码器保持冻结状态。随后通过自训练为 Libri-light 数据生成伪标签。Wav2Vec 2.0 作为"学生"模型,使用 CoVoST 2 真实数据与伪标签进行微调。最终基于 CommonCrawl 数据训练语言模型(LM),与语音翻译模型结合通过束搜索解码生成文本。Wang 等人(2021b)进一步通过级联自动语音识别、文本规范化与机器翻译的无监督流程生成伪标签,用于端到端模型训练。Wav2Vec 2.0 与 mBART 通过领域内数据实现领域自适应优化(Li 等人,2020)。实验结果表明,该方法对无需预训练的端到端模型具有显著效果。 然而,在监督式与无监督式预训练模型之间仍存在性能差距,这值得在未来的研究中深入探讨。
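The pseudo-labelling recipe described above can be summarised by the sketch below: a cascade produces target-language pseudo-text for unlabeled audio, which is then mixed with the available gold pairs to fine-tune the E2E student. The function names (`cascade_asr`, `cascade_mt`, `finetune_e2e_st`) are hypothetical placeholders, not a real API, and the toy lambdas exist only so the sketch runs end to end.

```python
from typing import Callable, Iterable, List, Tuple

def build_pseudo_parallel_data(
        unlabeled_audio: Iterable[str],
        cascade_asr: Callable[[str], str],
        cascade_mt: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run the cascade once, offline, to label raw audio with pseudo-translations."""
    pairs = []
    for wav in unlabeled_audio:
        transcript = cascade_asr(wav)           # unsupervised / off-the-shelf ASR
        pseudo_target = cascade_mt(transcript)  # unsupervised / off-the-shelf MT
        pairs.append((wav, pseudo_target))
    return pairs

def self_training_round(gold_pairs, unlabeled_audio,
                        cascade_asr, cascade_mt, finetune_e2e_st):
    """One self-training round: gold + pseudo pairs -> fine-tuned E2E ST student."""
    pseudo_pairs = build_pseudo_parallel_data(unlabeled_audio, cascade_asr, cascade_mt)
    return finetune_e2e_st(list(gold_pairs) + pseudo_pairs)

# Toy stand-ins so the sketch executes.
demo = self_training_round(
    gold_pairs=[("a.wav", "hello world")],
    unlabeled_audio=["b.wav", "c.wav"],
    cascade_asr=lambda wav: f"transcript of {wav}",
    cascade_mt=lambda text: f"translation of {text}",
    finetune_e2e_st=lambda pairs: {"trained_on": len(pairs)})
print(demo)   # {'trained_on': 3}
```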
7.2.4 Multilingual ST 7.2.4 多语言语音翻译
The multilingual ST model aims to translate from/to multiple speech input/output languages. It can be one of many-to-one, one-to-many, or many-to-many. The ST models solve multilinguality issues using mainly three approaches: (a) language ID, (b) dual-decoder, and (c) pre-trained models.
多语言语音翻译模型旨在实现多种语音输入/输出语言之间的互译,其架构可分为多对一、一对多或多对多三种类型。当前语音翻译模型主要通过三种方法解决多语言问题:(a) 语言识别编码技术,(b) 双解码器架构,以及(c) 预训练模型应用。
-
1.
Language ID (LID) is an identification label that tells the model which target language the speech should be translated into. Existing works handle multilinguality using LID either in the encoder or in the decoder (a minimal sketch of decoder-side LID conditioning follows this item). In (Inaguma et al., 2019), the model uses LID in the decoder for one-to-many and many-to-many translation; they demonstrate impressive performance in translation from high-resource to low-resource languages without using any transcript data from the LRL. However, using the LID embedding in the decoder (Gangi et al., 2019) is shown to underperform compared to using it in the encoder. The authors show that the LID can be either concatenated or merged with the inputs and, when pre-trained with ASR data, can result in superior performance over the one-to-one system. The model, however, performs poorly when trained on many unrelated target languages. The one-to-many and many-to-one multilingual ST systems of (Wang et al., 2020c, a) provide a good set of baselines for research purposes.
1. 语言识别标识(LID)是一种用于识别目标语言并实现语音同步显式翻译的标识标签。现有研究通过编码器或解码器结合 LID 处理多语言场景。Inaguma 等人(2019)提出的模型在解码器中采用 LID 实现一对多和多对多翻译,其研究表明:在不使用低资源语言转录数据的情况下,该模型能实现从高资源语言到低资源语言的出色翻译性能。然而 Gangi 等人(2019)证明,在解码器中使用 LID 嵌入的表现逊色于编码器方案。作者指出 LID 既可与输入向量拼接也可融合,当结合语音识别数据预训练时,其性能可超越单语翻译系统。但该模型在训练涉及多个不相关目标语言时表现欠佳。Wang 等人(2020c,a)提出的一对多与多对一多语言语音翻译系统为研究提供了良好的基准模型。 -
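A common way to realise decoder-side LID conditioning is to prepend a target-language tag to the decoder input, as sketched below. The tag inventory, the shared embedding table, and the tiny decoder are illustrative assumptions rather than the setup of any specific cited system.

```python
import torch
import torch.nn as nn

LANG_TAGS = {"<2de>": 0, "<2fr>": 1, "<2hi>": 2}     # assumed target-language tags
VOCAB_OFFSET = len(LANG_TAGS)                         # tags share the embedding table
VOCAB, D = 1000, 128

embed = nn.Embedding(VOCAB + VOCAB_OFFSET, D)
decoder_layer = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

def decoder_inputs_with_lid(target_ids: torch.Tensor, lang: str) -> torch.Tensor:
    """Prepend the language tag so every decoding step is conditioned on it."""
    tag = torch.full((target_ids.size(0), 1), LANG_TAGS[lang], dtype=torch.long)
    return embed(torch.cat([tag, target_ids + VOCAB_OFFSET], dim=1))

encoder_states = torch.randn(2, 50, D)                        # speech encoder output
tgt = decoder_inputs_with_lid(torch.randint(0, VOCAB, (2, 9)), "<2de>")
out = decoder(tgt, memory=encoder_states)                     # (2, 10, D)
print(out.shape)
```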
2.
Dual-decoder model is a Transformer with two decoders, one each for ASR and ST, and a dual-attention mechanism. In (Le et al., 2020), a dual-decoder model is proposed to jointly optimize the ASR and ST tasks. The authors hypothesize that the dual-attention mechanism can benefit each task by transferring knowledge either instantly or in a wait-k fashion. Their model generalizes earlier models proposed for one-to-many and bilingual ST.
2. 双解码器模型是一种配备两个解码器的 Transformer 架构,分别用于语音识别(ASR)和语音翻译(ST)任务,并采用双重注意力机制。Le 等人(2020)提出通过联合优化方式使该模型同时适应 ASR 与 ST 任务。作者假设双重注意力机制能通过即时知识迁移或等待策略机制(wait- policy)使两项任务相互受益。该模型推广了早期针对一对多和双语语音翻译任务所提出的模型架构。 -
3.
Pre-trained Multilingual Models use a pre-trained encoder and decoder for acoustic modeling and language modeling, respectively. In (Li et al., 2020; Tran et al., 2020), the author shows that efficiently fine-tuning mBART, which is a pre-trained multilingual decoder (Liu et al., 2020c) can achieve SOTA results on CoVoST data on zero-shot cross-lingual and multilingual translation tasks. Along similar lines, (Le et al., 2021) shows that inserting adapters in between layers of the encoder-decoder framework and tuning them can improve the ST task performance over bilingual ST models. SeamlessM4T (Barrault et al., 2023), Whisper (Radford et al., 2023), and other foundation models are built using many of these concepts like language ID in the decoder, multilingual, multimodal, and multitask pre-training.
3. 预训练多语言模型分别采用预训练编码器与解码器进行声学建模和语言建模。Li 等人(2020)与 Tran 等人(2020)的研究表明,通过高效微调 mBART(一种预训练多语言解码器,Liu 等人 2020c 提出)可在 CoVoST 数据集上实现零样本跨语言及多语言翻译任务的 SOTA 性能。类似地,Le 等人(2021)证实,在编码器-解码器框架各层间插入适配器并进行调优,可超越双语语音翻译模型的性能表现。SeamlessM4T(Barrault 等人 2023)、Whisper(Radford 等人 2023)等基础模型均融合了这些核心设计理念,包括解码器语言识别、多语言多模态多任务预训练等技术要素。
7.3 Discussion 7.3 讨论
The works presented so far show that E2E ST models have been improved tremendously. ST models’ improved performance is likely due to leveraging pre-trained ASR/MT models or the respective corpus to train ST encoders/decoders. Weakly labelled/pseudo labels are another way to create more data for training ST models. Contrastive learning, mix-up strategy, adapters, and optimal transport are a few ways to bridge the modality gap.
目前的研究成果表明,端到端语音翻译模型的性能已取得显著提升。这种进步主要得益于以下因素:利用预训练的自动语音识别/机器翻译模型或其对应语料库来训练语音翻译的编码器/解码器;采用弱监督标注/伪标签技术扩充训练数据规模;以及通过对比学习、混合增强策略、适配器模块和最优传输等方法有效弥合模态差异。
Applying unsupervised ASR and MT with the Wav2Vec 2.0 encoder and mBART decoder in a low-resource setting yields good results for ST models. When considering online data streaming, using the IF neuron for context building and translation improves results compared to using CAAT, which had latency issues due to the reordering for translation introduced on top of RNN-T. mBART handles multilingual settings well when combined with a dual-attention mechanism that facilitates knowledge transfer, and inserting adapters between the encoder and decoder layers improves performance. In the unsupervised ST setting, SOTA results were achieved by training Wav2Vec 2.0 on data within the same domain as the speech. We see that the wait-k policy is used in streaming settings with segmentation and in multilingual settings with a dual-attention mechanism, and in both cases it yields good results. Adapters are also used for modality bridging and in multilingual settings with pre-trained models, which improves performance. As shown in (Sun et al., 2023), multilingual E2E ST for LRLs can benefit when trained jointly with related HRLs.
在低资源环境下,将 Wav2Vec 2.0 编码器与 mBART 解码器结合的无监督语音识别及机器翻译方法,为语音翻译模型带来了良好效果。针对在线数据流场景,采用 IF 神经元构建上下文并进行翻译的方案相比 CAAT 方法更具优势——后者因 RNN-T 引入的翻译任务重排序导致延迟问题。mBART 通过采用促进知识迁移的双重注意力机制,能有效处理多语言场景。此外,在编码器与解码器层间插入适配器可提升模型性能。在无监督语音翻译场景中,当前最佳成果是通过在语音同源数据上训练 Wav2Vec 2.0 实现的。研究表明,流式处理场景采用分段式 wait- 策略,配合具有双重注意力机制的多语言方案,均能取得良好效果。适配器技术也广泛应用于模态桥接及预训练模型的多语言场景中,显著提升模型表现。如(Sun 等,2023)所示,低资源语言的端到端多语言语音翻译模型与相关高资源语言联合训练时可获得性能增益。
7.4 Overall Performance Trend of E2E ST approaches in Common Benchmarks
7.4 端到端语音翻译方法在常用基准测试中的总体性能趋势
In this section, we analyse the performance evolution of ST models over the MuST-C dataset, as depicted in Figure 12. We selected the MuST-C dataset due to its widespread adoption by researchers since its introduction in 2019.
本节我们分析了语音翻译模型在 MuST-C 数据集上的性能演进情况(如图 12 所示)。选择 MuST-C 数据集是因为自 2019 年发布以来,该数据集已被研究者广泛采用。
Figure 12 reveals that the overall performance of ST models has steadily improved over time across all 8 languages, with a few remarkable gains. The first significant gain was observed in 2021 with the adapter method (Le et al., 2021). This jump in performance is achieved through the use of adapter layers within multilingual models, which shows the transferability of knowledge across related language pairs (note that not all proposed models were tested across all 8 languages). It also shows that Chimera (Han et al., 2021), which is a modality-bridging model, performs poorly compared to adapter-based models. That is, the shared semantic network proposed in (Han et al., 2021) is not as good as adapters with multilingual models, and there is still a gap between the text and speech modalities.
图 12 显示,随着时间的推移,语音翻译模型在所有 8 种语言上的整体性能均稳步提升,并出现了若干显著突破。第一个重大性能跃升出现在 2021 年的适配器方法(Le 等人,2021)。这一显著提升得益于在多语言模型中使用适配器层,证明了相关知识在关联语言对间的可迁移性(需注意并非所有模型都在全部 8 种语言上进行了测试)。结果同时表明,跨模态桥接模型 Chimera(Han 等人,2021)的性能远逊于基于适配器的模型。这意味着(Han 等人,2021)提出的语义共享网络效果不及多语言模型中的适配器,文本与语音模态之间仍存在差距。
The next jump we see is due to ConST (Ye et al., 2022a) (for languages like Es, It, Pt, and Ru). This model achieved superior results by incorporating contrastive learning to bridge the modality gap for the first time: the cross-modal speech-text retrieval accuracy jumps from 4% to 88%, a far better way of bridging the gap than Chimera. Note that STEMM and ConST are from the same authors and were proposed in the same year; in fact, ConST is an improvement over XSTNet and STEMM through the use of a cross-modal contrastive loss, which explains why STEMM lags behind ConST. FCCL (medium model) (Zhang et al., 2023c) further improves the performance by applying contrastive learning at both the sentence and frame level, whereas ConST applies contrastive learning only at the sentence level. Finally, the OT-based model outperforms contrastive-learning-based models on all languages except De and Ru. Looking closely, we find that the OT-based model (Le et al., 2023b) closes the modality gap only partially compared to ConST and FCCL for a few languages. Hence, as a recommendation, coarse- and fine-grained contrastive learning and ASR pre-training with CTC loss via OT may be explored to build better ST models. Note that LLM-based ST models are not compared here, primarily because of their pre-training over massive amounts of data; we want a fair comparison in which pre-training over external ASR and MT corpora leads to higher performance, as we find in the ConST and FCCL models.
我们观察到的下一个性能跃升来自 ConST 模型(Ye 等人,2022a)(针对西班牙语、意大利语、葡萄牙语和俄语等语言)。该模型首次通过引入对比学习来弥合模态差异,取得了卓越成果——跨模态语音文本检索准确率从 4%跃升至 88%!其弥合模态差距的效果优于 Chimera 模型。STEMM 模型性能逊于 ConST 的原因在于二者出自同一组研究者且同年提出,实际上 ConST 通过采用跨模态对比损失函数,对 XSTNet 和 STEMM 进行了改进。FCCL(中型模型)(Zhang 等人,2023c)进一步提升了性能,其在句子级和帧级同时应用对比学习,而 ConST 仅采用句子级对比学习。最终,基于最优传输(OT)的模型在除德语和俄语外的所有语言上都超越了基于对比学习的模型。深入分析发现,对于部分语言,OT 模型(Le 等人,2023b)仅能部分缩小模态差距,其效果不及 ConST 和 FCCL。 因此,作为一项建议,可探索通过最优传输(OT)方法结合粗粒度与细粒度对比学习以及基于 CTC 损失的 ASR 预训练来构建更优的语音翻译模型。需注意的是,本文未比较基于 LLM 的语音翻译模型,主要因其预训练数据规模过于庞大。为确保公平比较(如 ConST 和 FCCL 模型所示),我们重点关注通过外部 ASR 与机器翻译语料库进行预训练能带来性能提升的模型架构。
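The cross-modal contrastive objective behind these gains can be sketched as an InfoNCE-style loss over paired speech and text sentence embeddings, as below. The temperature, pooling, and toy embeddings are assumptions; this is not the exact formulation used by ConST or FCCL.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch: the paired text is the positive, all others negatives."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(s.size(0))               # i-th speech matches i-th text
    # Symmetric loss: speech-to-text and text-to-speech retrieval directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

# Toy sentence-level embeddings, e.g. mean-pooled encoder states of each modality.
speech_emb = torch.randn(8, 256, requires_grad=True)
text_emb = torch.randn(8, 256, requires_grad=True)
cross_modal_contrastive_loss(speech_emb, text_emb).backward()
```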
7.5 SOTA Performance of E2E ST Models on Low-Resource Languages
7.5 低资源语言端到端语音翻译模型的最先进性能表现
In Table 2, we present the SOTA performance of various ST models on low-resource language pairs as of November 2023. The table indicates which models, utilizing specific techniques, achieve SOTA performance, providing a comprehensive overview of the current status of ST models for low-resource languages (LRLs). From Table 2, it is evident that the BLEU scores for many LRLs, such as Mn, Si, Ta, Id, Ja, and Sv, are relatively low. This is most likely due to the small amount of speech data available for these languages (as seen in the Speech (hours) column) compared to other LRLs, where a larger amount of speech data is used for training the LNA + Zero Shot model. This highlights the need to improve the performance of ST models for these languages by increasing the data and designing better models.
表 2 展示了截至 2023 年 11 月,各类语音翻译模型在低资源语言对上的最先进性能表现。该表明确了采用特定技术实现最优性能的模型,为低资源语言(LRLs)语音翻译模型的现状提供了全面概览。从表 2 可以明显看出,许多低资源语言(如蒙古语、僧伽罗语、泰米尔语、印尼语、日语和瑞典语)的 BLEU 分数相对较低。这很可能是因为这些语言的可用语音数据量(参见"语音时长(小时)"列)比其他使用更多语音数据训练 LNA+零样本模型的低资源语言要少。这一现象凸显了需要通过增加数据量和设计更优模型来提升这些语言语音翻译模型性能的必要性。
Figure 12: Performance of ST models over the MuST-C dataset. 图 12:基于 MuST-C 数据集的语音翻译模型性能表现
Table 2: SOTA performance for low-resource language pairs: dataset, model, speech hours, setting, and BLEU score. 表 2:低资源语言对的 SOTA 性能表现:数据集、模型、语音时长、设置及 BLEU 分数
| Language Pair | Model/Technique | Dataset | Speech (hours) | Setting | Metric (BLEU) |
|---|---|---|---|---|---|
| Ainu→En | Tied Multitask Learning with regularizers (Anastasopoulos and Chiang, 2018) | Glossed Audio Corpus | 2.5 | ST with ASR & MT | 20.3 |
| Mboshi→Fr | Tied Multitask Learning with regularizers (Anastasopoulos and Chiang, 2018) | Godard Corpus | 4.4 | ST with ASR & MT | 24.7 |
| Mt→En | WACO (Ouyang et al., 2023) | IWSLT | 1 | Modality Bridging | 13.3 |
| Et→En | Unsupervised + W2V2 + mBart (Wang et al., 2022) | CoVoST-2 | 3 | Low-Resource | 19.0 |
| Lv→En | Unsupervised + W2V2 + mBart (Wang et al., 2022) | CoVoST-2 | 2 | Low-Resource | 25.0 |
| En→Ar | Teacher-Student (W2V2 + self-training + dec w/o LM) (Kahn et al., 2019) | CoVoST-2 | 430 | Unsupervised | 20.8 |
| En→Ca | Teacher-Student (W2V2 + self-training + dec w/o LM) (Kahn et al., 2019) | CoVoST-2 | 430 | Unsupervised | 35.6 |
| En→Tr | Teacher-Student (W2V2 + self-training + dec w/o LM) (Kahn et al., 2019) | CoVoST-2 | 430 | Unsupervised | 18.9 |
| Sl→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 2 | Multi-Lingual | 5.6 |
| Sv→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 2 | Multi-Lingual | 5.9 |
| Fa→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 49 | Multi-Lingual | 11.0 |
| Tr→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 4 | Multi-Lingual | 11.2 |
| Mn→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 3 | Multi-Lingual | 1.2 |
| Ar→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 2 | Multi-Lingual | 6.4 |
| Cy→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 2 | Multi-Lingual | 9.0 |
| Ta→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 2 | Multi-Lingual | 0.9 |
| Ja→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 1 | Multi-Lingual | 2.1 |
| Id→En | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 1 | Multi-Lingual | 3.7 |
| En→Cy | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 30.6 |
| En→Et | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 22.2 |
| En→Fa | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 21.5 |
| En→Id | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 29.9 |
| En→Ja | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 39.3 |
| En→Lv | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 21.5 |
| En→Mn | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 14.8 |
| En→Sl | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 25.1 |
| En→Sv | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 30.4 |
| En→Ta | LNA + Zero Shot Learning (Li et al., 2020) | CoVoST-2 | 430 | Multi-Lingual | 17.8 |
8 Deployment of E2E ST Models
8 端到端语音翻译模型的部署
Deployment of offline E2E ST models incurs several challenges. The first challenge is handling cross-talk, noise, and background-music removal to obtain clean speech; if the speaker stutters or has a different dialect or accent, the same ST model may not work effectively. The second challenge is related to the distance of the speaker from the microphone and the speaker's movements around the microphone, which can hamper the input speech quality. As a solution to these problems, the ST model may be trained over a variety of speakers in various acoustic conditions. The third challenge is related to memory consumption, especially when considering LLM-based ST model deployment. To deploy memory-intensive and LLM-based ST models on edge devices, pruning, quantization, and knowledge distillation techniques may be used (Zhou et al., 2022a), which significantly reduce the memory load.
离线端到端语音翻译模型的部署面临若干挑战。首要挑战在于处理串扰、噪声和背景音乐消除以获取纯净语音。若说话者存在口吃、使用不同方言或口音,同一语音翻译模型可能无法有效工作。第二项挑战涉及说话者与麦克风的距离变化及移动,这会损害输入语音质量。针对这些问题,可在多样化声学条件下使用多说话者数据训练模型。第三项挑战关乎内存消耗,特别是基于 LLM 的语音翻译模型部署时。为在边缘设备上部署高内存占用的 LLM 语音翻译模型,可采用剪枝、量化和知识蒸馏等技术(Zhou 等,2022a),这些方法能显著降低内存负荷。
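For the memory-reduction point above, a simple and very common first step is post-training dynamic quantization of the linear layers, sketched below with PyTorch. The toy model merely stands in for an ST network, and the actual savings depend heavily on the architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in for a (much larger) ST model dominated by linear projections.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Post-training dynamic quantization: weights of nn.Linear go to int8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def param_bytes(m):
    return sum(p.numel() * p.element_size() for p in m.parameters())

print("fp32 parameter size:", param_bytes(model), "bytes")
with torch.no_grad():
    out = quantized(torch.randn(1, 512))     # runs with int8 weight kernels
print("quantized output shape:", out.shape)
```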
Streaming ST models, on the other hand, are used as a submodule within automatic subtitling; hence their deployment inherits the challenges of the subtitling task, which is considered harder. For example, subtitling requires the following challenges to be solved: (a) the translated text should be segmented such that it reduces the cognitive load and maximizes the user experience, e.g., reading speed and synchronization with the speech, and (b) how many characters and lines should be displayed? These constraints are usually decided by the media industry; for example, TEDx uses a maximum of 2 lines of subtitles, at most 42 characters per line, and a maximum reading speed of 21 characters/second (Agrawal et al., 2023).
另一方面,流式语音翻译模型作为自动字幕生成的子模块,其部署面临字幕任务特有的挑战,这些挑战被认为更为复杂。例如,字幕生成需解决以下问题:(a) 译文需合理分段以降低认知负荷,并优化用户体验,包括阅读节奏与语音同步;(b) 显示字符数与行数限制。这些约束通常由媒体行业制定,如 TEDx 采用的标准为:最多显示 2 行字幕,每行不超过 42 个字符,最大阅读速度为 21 字符/秒(Agrawal 等人,2023 年)。
Table 3: Dataset statistics (✓ indicates the dataset has the feature, ✗ indicates it does not). 表 3:数据集统计(✓表示该数据集具备此特征,✗表示不具备)
| Datasets 数据集 | Source Language (Speech) 源语言(语音) | Target Language (Text) 目标语言(文本) | Speech (hours) 语音时长(小时) | Speakers 说话者 | Validation 验证 | Gender 性别 | Age Group 年龄段 |
|---|---|---|---|---|---|---|---|
| MuST-C | En | 14 lang | 0.4K | 1.6K | ✗ | ✗ | ✗ |
| Librispeech | En | Fr | 0.2K | 1.4K | ✓ | ✓ | ✓ |
| CoVost | En | 11 lang | 0.7K | 11K | ✓ | ✓ | ✓ |
| CoVost2 | 21 lang | En | 2.8K | 11K | ✓ | ✓ | ✓ |
| CoVost2 | En | 15 lang | 0.7K | 78K | ✓ | ✓ | ✓ |
| EuroparlST | 4 lang | 4 lang | 0.25K | ✗ | ✗ | ✗ | ✗ |
| VoxPopuli | En | 15 lang | 1.79K | 4.3K | ✗ | ✗ | ✗ |
| Kosp2e | Ko | En | 0.2K | 0.2K | ✗ | ✗ | ✗ |
| GigaST | En | De, Zh | 10K | ✗ | ✗ | ✗ | ✗ |
| Prabhupadavani | en-bn-sn code-mix | 25 lang | 0.09K | 0.13K | ✗ | ✗ | ✗ |
| How2 | En | Pt | 2K | ✗ | ✗ | ✗ | ✗ |
| FLEURS | 102 lang | 102 lang | 1.4K | 0.3K | ✓ | ✓ | ✗ |
| BSTC | Zh | En | 98 | ✗ | ✓ | ✗ | ✗ |
| Indic-TEDST | En | 9 lang | 189 | 1.64K | ✗ | ✗ | ✗ |
9 Resources for ST 9 种语音翻译资源
9.1 Datasets for ST Tasks 9.1 语音翻译任务数据集
There have been several datasets created for the ST task. Some of them are listed below and described briefly. Table 3 provides various dataset statistics, such as hours of speech, the number of speakers, whether the dataset was manually or machine validated, and the gender and age range of the speakers. Additionally, the tools typically required for creating these datasets are (a) Gentle (Ochshorn and Hawkins, 2017) for audio-transcription alignment and (b) BertAlign (https://github.com/bfsujason/bertalign) for transcription-translation alignment.
目前已创建了多个用于语音翻译任务的数据集。以下列举部分数据集并作简要介绍。表 3 提供了各数据集统计信息,包括语音时长、说话者人数、数据验证方式(人工/机器)、说话者性别及年龄范围等。创建这些数据集所需的工具包括:(a) Gentle(Ochshorn 和 Hawkins,2017)用于音频-文本对齐;(b) BertAlign 10 用于文本-译文对齐。
-
1.
How2 (Sanabria et al., 2018) is an ST corpus of English instructional videos having Portuguese translations.
1. How2(Sanabria 等人,2018 年)是一个包含英语教学视频及其葡萄牙语翻译的语音翻译语料库。 -
2.
Augmented LibriSpeech (Kocabiyikoglu et al., 2018) is obtained from the LibriSpeech corpus (Panayotov et al., 2015), a speech recognition repository generated using audiobooks from the Gutenberg Project (https://www.gutenberg.org/). This dataset is designed for translating English speech into written French text.
2. 增强版 Librispeech(Kocabiyikoglu 等人,2018 年)源自 LibriSpeech 语料库(Panayotov 等人,2015 年)。该语音识别资源库通过古腾堡计划 11 的有声读物生成,专为将英语语音转写为法语文本而设计。 -
3.
CoVoST and CoVoST 2 (Wang et al., 2020a, c) are based on the Common Voice project (https://commonvoice.mozilla.org/en). CoVoST is a many-to-one dataset covering 11 languages, while CoVoST 2 offers one-to-many and many-to-one translations for 15 languages.
3. CoVoST 与 CoVoST 2(Wang 等人,2020a,c)数据集基于 Common Voice 项目构建。CoVoST 是一个多对一数据集,涵盖 11 种语言;而 CoVoST 2 则提供 15 种语言的单对多及多对一翻译功能。 -
4.
Europarl-ST (Iranzo-Sánchez et al., 2020) is a collection that contains speech and text data from European Parliament proceedings between 2008 and 2012 in four languages. It includes multiple sources and targets for both speech and text.
4. Europarl-ST(Iranzo-Sánchez 等人,2020 年)是一个包含 2008 年至 2012 年欧洲议会会议中四种语言的语音和文本数据集合。该数据集为语音和文本提供了多种来源与目标语言对。 -
5.
MuST-C (Cattoni et al., 2021) It is a large multilingual ST translation corpus available . It contains translations from English into fourteen additional languages and is compiled from TED Talks. mTEDx (Salesky et al., 2021) is one such multilingual dataset from TED talks.
5. MuST-C(Cattoni 等人,2021 年)是一个可用的大型多语言语音翻译语料库,包含从英语到另外十四种语言的翻译数据,语料源自 TED 演讲。mTEDx(Salesky 等人,2021 年)是另一个基于 TED 演讲的多语言数据集。 -
6.
VoxPopuli (Wang et al., 2021a) dataset is an expansion of Europarl-ST. It includes data from European parliament sessions spanning from 2009 to 2020.
6. VoxPopuli(Wang 等人,2021a 年)数据集是 Europarl-ST 的扩展版本,包含 2009 年至 2020 年欧洲议会会议的语音数据。 -
7.
Kosp2e (Cho et al., 2021) is a Korean (ko) to English(en) ST translation corpus, which contains Korean speech with parallel English texts. The corpus contains data from four different domains: Zeroth from news/newspaper, KSS (Park, 2018) from textbooks, emphStyleKQC (Cho et al., 2022) from AI applications, and Covid-ED (Lee et al., 2021) from covid diaries of people which have emotions.
7. Kosp2e(Cho 等人,2021 年)是一个韩语(ko)到英语(en)的语音翻译语料库,包含韩语语音及对应的英语文本。该语料库涵盖四个不同领域的数据:来自新闻/报纸的 Zeroth、来自教科书的 KSS(Park,2018 年)、来自 AI 应用的 emphStyleKQC(Cho 等人,2022 年)以及来自带有情感的新冠日记的 Covid-ED(Lee 等人,2021 年)。 -
8.
BSTC (Zhang et al., 2021) is a Baidu Speech Translation Corpus, a large-scale Chinese-English speech translation dataset. This dataset is constructed based on a collection of licensed videos of talks or lectures, their manual transcripts, and translations into English, as well as automated transcripts by an automatic speech recognition (ASR) model.
8. BSTC(Zhang 等人,2021 年)是百度语音翻译语料库,一个大规模的中英语音翻译数据集。该数据集基于授权讲座视频、人工转录文本、英语翻译文本以及自动语音识别(ASR)模型生成的自动转录文本构建而成。 -
9.
GigaST (Ye et al., 2022b) corpus is a collection of speech translations from English to German and Chinese. It is created using the English ASR GigaSpeech(Chen et al., 2021a), which features 10,000 hours of transcribed speech from various sources such as audioPortugesebooks, podcasts, and YouTube.
9. GigaST(Ye 等人,2022b 年)语料库是英语到德语和汉语的语音翻译数据集,其构建基于包含 10,000 小时转录语音的英语 ASR 语料库 GigaSpeech(Chen 等人,2021a 年),语音来源包括有声读物、播客和 YouTube 等多种渠道。 -
10.
Prabhupadavani (Sandhan et al., 2022) is an ST dataset where speech is multilingual and Code-Mix with three different languages, English is the primary language, and words and phrases from Sanskrit and Bengali are interjected. The text part has sentences in 25 languages.
10. Prabhupadavani(Sandhan 等人,2022 年)是一个语音转文本数据集,其语音内容为多语言混合形式,主要使用英语,并夹杂梵语和孟加拉语的单词及短语。文本部分包含 25 种语言的句子。 -
11.
FLEURS (Conneau et al., 2022) FLEURS stands as a multilingual speech dataset, offering parallel recordings across 102 languages. Developed as an extension of the FLoRes-101 MT benchmark, it encompasses about 12 hours of annotated speech data for each language.
11. FLEURS(Conneau 等人,2022 年)是一个多语言语音数据集,提供 102 种语言的平行录音。作为 FLoRes-101 机器翻译基准的扩展,该数据集每种语言包含约 12 小时的标注语音数据。 -
12.
Indic-TEDST Sethiya et al. (2024) is a low-resource ST translation dataset across 9 Indic languages: Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa), Tamil (ta), and Telugu (te).
12. Indic-TEDST(Sethiya 等人,2024 年)是一个涵盖 9 种印度语言的低资源语音翻译数据集,包括:孟加拉语(bn)、古吉拉特语(gu)、印地语(hi)、卡纳达语(kn)、马拉雅拉姆语(ml)、马拉地语(mr)、旁遮普语(pa)、泰米尔语(ta)和泰卢固语(te)。
Besides these popular ST datasets, there are several smaller datasets such as Fisher (Cieri et al., 2004), CallHome (https://ca.talkbank.org/access/CallHome/eng.html), the Godard corpus (Godard et al., 2018), the Glossed Audio Corpus (https://ainu.ninjal.ac.jp/folklore/en/), BTEC (http://universal.elra.info/product_info.php?cPath=37_39&products_id=80), WSJ (https://catalog.ldc.upenn.edu/LDC93s6a), IWSLT (https://iwslt.org/), the Miami corpus (Deuchar, 2008), and the MSLT corpus (Federmann and Lewis, 2016).
除了这些主流的语音翻译数据集外,还存在一些规模较小的数据集,如 Fisher(Cieri 等,2004)、CallHome、Godard 语料库(Godard 等,2018)、Glossed 音频语料库、BTEC、WSJ、IWSLT、迈阿密语料库(Deuchar,2008)以及 MSLT 语料库(Federmann 和 Lewis,2016)。
9.2 Toolkits for ST 9.2 语音翻译工具包
To facilitate building and training ST models, researchers have released several toolkits. These toolkits provide an environment in which ST datasets can be pre-processed and models can be trained, fine-tuned, and evaluated. We briefly describe them here so that this survey can serve as a one-stop shop for ST modeling.
为便于构建和训练语音翻译模型,研究者们已开发了多种工具包。这些工具包提供了语音翻译任务的预处理环境,支持模型训练、微调与评估。本节简要介绍这些工具包,使本综述成为语音翻译建模的一站式参考指南。
1. SLT.KIT (https://github.com/isl-mt/SLT.KIT) (Zenkel et al., 2018) offers ASR, MT, and ST models along with specific features such as CTC- and attention-based ASR, ASR with punctuation, and a neural MT system.
1. SLT.KIT(Zenkel 等,2018)提供自动语音识别、机器翻译及语音翻译模型,并具备 CTC 与基于注意力的语音识别、带标点的语音识别系统以及神经机器翻译系统等特色功能。
2. ESPnet-ST (https://github.com/espnet/espnet) (Inaguma et al., 2020b) was developed because no single toolkit was available for performing the sub-tasks of ST. It provides ASR, LM, E2E-ST, cascade ST, MT, and TTS along with examples, and also offers pre-trained transformer-based models on datasets such as MuST-C, Libri-trans, Fisher, CallHome, and How2.
2. ESPnet-ST 工具包(Inaguma 等人,2020b)的开发源于当时缺乏专门用于执行语音翻译子任务的工具。该工具包提供自动语音识别(ASR)、语言模型(LM)、端到端语音翻译(E2E-ST)、级联语音翻译(Cascade-ST)、机器翻译(MT)和文本转语音(TTS)功能,并附有使用示例。此外,它还提供了基于 Transformer 的预训练模型,这些模型在 MUST-C、Libri-trans、Fisher、CALL-HOME 和 How2 等多个数据集上进行了训练。
3. FairSeq S2T (https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_text) (Wang et al., 2020b) is an extension of FairSeq (Ott et al., 2019) in which all the functions of ESPnet-ST are available. Additionally, it provides non-autoregressive MT, online ST, and speech pre-training. The toolkit also ships state-of-the-art ST models based on RNNs, transformers, and conformers, and has built-in data loaders for the MuST-C, Librispeech, and CoVoST datasets (a minimal training-launch sketch is given after this list).
3. FairSeq S2T 工具包(Wang 等人,2020b)是对 FairSeq(Ott 等人,2019)的扩展,包含了 ESPnet-ST 的所有功能。此外,它还提供非自回归机器翻译、在线语音翻译和语音预训练功能。该工具包还提供了基于循环神经网络(RNN)、Transformer 和 Conformer 的先进语音翻译模型,并内置了针对 MuST-C、Librispeech 和 CoVoST 数据集的数据加载器(本列表后附有一个最简训练启动示例)。
4. NeurST (https://github.com/bytedance/neurst) (Zhao et al., 2021) is a lightweight toolkit with no dependency on the Kaldi toolkit (Zheng et al., 2011). It achieves high computational efficiency using mixed precision and accelerated linear algebra, and enables faster training on large-scale datasets using Horovod (Sergeev and Balso, 2018).
4. NeurST 工具包(Zhao 等人,2021)是一个轻量级工具包,不依赖于 Kaldi 工具包(Zheng 等人,2011)。它通过混合精度计算和加速线性代数实现了高效计算,并利用 Horovod(Sergeev 和 Balso,2018)在大规模数据集上实现了更快的训练速度。
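For illustration, the sketch below shows how one might launch ST training with FairSeq S2T from Python by shelling out to `fairseq-train`. The flags follow the publicly documented `speech_to_text` MuST-C recipe, but the data and checkpoint paths are placeholders and exact options may vary across fairseq versions, so treat this as a starting point rather than a definitive recipe.

```python
# A hedged sketch of launching FairSeq S2T training from Python.
# The flags mirror the documented speech_to_text (MuST-C) example recipe;
# MUSTC_ROOT and SAVE_DIR are placeholders, and option names may differ
# slightly across fairseq versions.
import subprocess

MUSTC_ROOT = "/path/to/mustc/en-de"   # placeholder: pre-processed MuST-C data
SAVE_DIR = "/path/to/checkpoints"     # placeholder: where checkpoints are written

cmd = [
    "fairseq-train", MUSTC_ROOT,
    "--config-yaml", "config_st.yaml",
    "--train-subset", "train_st",
    "--valid-subset", "dev_st",
    "--task", "speech_to_text",
    "--arch", "s2t_transformer_s",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    "--optimizer", "adam",
    "--lr", "2e-3",
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "10000",
    "--max-tokens", "40000",
    "--max-update", "100000",
    "--save-dir", SAVE_DIR,
]
subprocess.run(cmd, check=True)  # requires fairseq with the S2T extras installed
```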
10 Future Directions for Research
10. 未来研究方向
This section highlights challenges that need the attention of researchers working on ST problems.
本部分重点指出了需要语音翻译领域研究者关注的关键挑战。
10.1 Cascade vs End-to-End Models
10.1 级联模型与端到端模型对比
As argued by (Bentivogli et al., 2021) through comprehensive experiments, the performance gap between cascade and E2E ST models has largely been bridged. However, as shown by (Agrawal et al., 2023) in the recent IWSLT 2023 subtitling generation task, cascade models still perform far better than E2E models on offline ST across all evaluated metrics. Furthermore, to the best of our knowledge, no thorough assessment of E2E versus cascade models has been carried out for low-resource languages. It would therefore be interesting to compare E2E and cascade ST models on a wider range of ST datasets to verify the claims made in the literature.
正如(Bentivogli 等人,2021 年)通过全面实验论证所示,级联模型与端到端语音翻译模型之间的性能差距正在缩小。然而,(Agrawal 等人,2023 年)在最近的 IWSLT 2023 字幕生成任务中表明,就所有评估指标而言,级联模型在离线语音翻译任务中的表现远优于端到端模型。此外,据我们所知,目前尚未对使用端到端和级联模型的低资源语言进行彻底评估。通过在不同语音翻译数据集上比较这两种模型来验证文献中的主张,这将是一个值得探讨的方向。
10.2 ST on Code-Mix data 10.2 混合编码数据的语音翻译
We find that there are only limited studies on ST models that take code-mixed data as input. Code-mixed data poses problems such as differing lexicons and syntax, as well as scarcity of labeled data. It will therefore be interesting to (a) create code-mixed ST datasets covering more languages, (b) examine how existing ST models perform on code-mixed ST data, and (c) investigate whether pre-training on many languages can help tackle the code-mixing issue.
我们发现目前针对使用混合编码数据作为输入的语音翻译模型研究较为有限。混合编码数据存在诸多挑战,包括不同语言的词汇体系差异、句法结构差异以及标注数据稀缺等问题。因此,值得探索的方向包括:(a) 构建包含更多语言的混合编码语音翻译数据集;(b) 评估现有语音翻译模型在混合编码数据上的表现;(c) 研究多语言预训练是否有助于解决混合编码问题。
10.3 Domain-Invariant Models
10.3 领域无关模型
ST models developed for one domain do not scale well to other domains, as shown in the recent IWSLT 2023 campaign. Here, the domain-invariance setting refers to an ST model that is trained on one language combination (say, En-De) and needs to be adapted to other language combinations (e.g., En-Hi). Transfer learning and continual learning can be explored to develop such generic models.
如 IWSLT 2023 最新研究表明,针对特定领域开发的语音翻译模型难以有效迁移到其他领域。这里的领域无关性特指模型在某种语言对(如英-德)上训练后,需要适配其他语言对(如英-印地语)的场景。可以探索迁移学习/持续学习方法来开发通用模型。
10.4 Discrepancy between Automatic and Human Evaluation
10.4 自动评估与人工评估的差异
There may be discrepancies and disagreements among the various metrics used to report ST task results; they often do not match the mean opinion score (MOS) provided by human evaluators (Agrawal et al., 2023). For example, the unsmoothed BLEU score between the ground-truth sentence "Police shot the culprit with a gun" and the hypothesis "Police use a gun to shot the culprit" is 0! However, both sentences might be deemed semantically appropriate translations of the same utterance by an ST system. Such an argument is supported by dubbing artists, who often change the voice of a sentence to simplify it or make it more pleasing.
Footnote 22: In the movie "Pirates of the Caribbean", Jack Sparrow asks Bloom how far he would go for the girl. Bloom's original answer is "I can die for her!", whereas the Hindi dubbing is "Till the dying breath".
在报告语音翻译任务结果时,不同评估指标之间可能存在差异和矛盾。这些指标与人工评估者提供的平均意见得分(MOS)并不一致(Agrawal 等人,2023 年)。例如,当系统评估参考句子"警察用枪击中了罪犯"与假设句子"警察使用枪支射击罪犯"之间的 BLEU 分数时,得分为 0!然而从语义角度,上述两个句子都可能被语音翻译系统视为合适的译文。这一观点得到了配音演员的支持——他们经常通过调整语句表达来简化内容或增强听觉效果。 22
As highlighted in (Marie et al., 2021), more than 99% of MT papers report BLEU scores without accounting for statistical significance testing or human evaluation, and our survey of ST papers indicates that the same trend is being followed. We therefore call on researchers to develop and use metrics that match human evaluations semantically. One approach could be to subject the ground-truth and hypothesis sentences to a semantic textual similarity task and score them accordingly (see the sketch below).
正如(Marie 等人,2021 年)所强调的,超过 99%的机器翻译论文在未进行统计显著性检验或人工评估的情况下直接报告 BLEU 分数。我们对语音翻译论文的调查显示该趋势仍在延续。因此,我们呼吁研究者关注开发和使用与人工语义评估相匹配的指标。一种可行方案是将参考句与假设句置于语义文本相似度任务中进行评分。
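To make the point concrete, the hedged sketch below scores the example pair above with an unsmoothed sentence-level BLEU (via sacrebleu) and with a cosine similarity over sentence embeddings (via sentence-transformers). The particular embedding checkpoint is just one common choice, not a recommendation of this survey, and the exact similarity value depends on the model used.

```python
# Sketch: contrast n-gram overlap with embedding-based semantic similarity
# for the example discussed above. Assumes `sacrebleu` and
# `sentence-transformers` are installed; the embedding model name below is
# one commonly used checkpoint.
from sacrebleu.metrics import BLEU
from sentence_transformers import SentenceTransformer, util

reference = "Police shot the culprit with a gun"
hypothesis = "Police use a gun to shot the culprit"

# Unsmoothed 4-gram BLEU: no 4-gram of the hypothesis occurs in the reference,
# so the geometric mean of n-gram precisions collapses to 0.
bleu = BLEU(smooth_method="none", effective_order=False)
print("BLEU:", bleu.sentence_score(hypothesis, [reference]).score)  # expected: 0.0

# Embedding-based similarity: both sentences describe the same event,
# so the cosine similarity is high despite the zero BLEU score.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([reference, hypothesis], convert_to_tensor=True)
print("cosine similarity:", util.cos_sim(emb[0], emb[1]).item())
```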
10.5 Handling Ambient Noise
10.5 环境噪声处理
In our literature survey, we find that little has been done to deal with ambient noise. Ambient noise, background music, cross-talk, and non-verbal sounds can make ST model learning difficult: the model must distinguish a meaningful utterance from ambient noise, which is a non-trivial task.
文献调研显示,针对环境噪声处理的研究较为匮乏。环境噪声、背景音乐、串音和非言语声响可能干扰语音翻译模型的学习。模型必须区分有效语音与环境噪声——这并非易事。
10.6 Handling Multiple Speakers
10.6 多说话人处理
It is common in the real world for audio/video to contain multiple speakers, each of whom may have their own accent (cf. an Asian and an American speaking to each other in English), dialect, and pitch. Performing speech separation before feeding the audio to the ST model may improve performance.
现实场景中常见音频/视频包含多个说话人的情况,每位说话者可能带有各自的口音(例如亚洲人与美国人用英语交谈)、方言、音高和腔调。在进行语音翻译前实施语音分离可能有助于提升模型性能。
10.7 Handling Speaker Diarization
10.7 说话人日志处理
Speaker diarization refers to demarcating who speaks when in multi-speaker speech. So far, ST datasets do not provide speaker boundary marks. Creating speaker-diarized ST data in a multilingual setting would be an interesting way to test the robustness of ST models.
说话人日志是指在多人语音中标注不同说话者的时间边界。目前,语音翻译数据集普遍缺乏说话人边界标记。构建多语言环境下具有说话人日志的语音翻译数据,对于测试模型鲁棒性具有重要意义。
10.8 Multilingual and Simultaneous ST
10.8 多语言与同步语音翻译
Multilingual ST has recently gained momentum due to its importance in the real world; for example, a single speech may have to be broadcast to multilingual communities (e.g., a conference attended by a diverse group of people). It can take the form of one-to-many, many-to-one, or many-to-many ST. Our literature survey shows that only a few works exist in this space. Besides, there is an opportunity to explore simultaneous multilingual ST, which is the most practical setting.
多语言语音翻译因其现实应用价值近年来备受关注。例如,单次演讲需要面向多语言群体进行传播(如国际会议中的多元化听众)。其形式可包括一对多、多对一以及多对多语言的翻译模式。文献研究表明该领域现有成果较少。此外,同步多语言语音翻译作为最具实用价值的场景,仍有广阔探索空间。
10.9 Low-resource ST Datasets and Models
10.9 低资源语音翻译数据集与模型
Most existing works have focused on building ST models and datasets for high-resource languages. Since the success of ST models relies on parallel speech-text corpora, building ST datasets for low-resource languages requires more attention. Further, a few works, such as (Bansal et al., 2019), have reported ST results on the Mboshi-French pair, but the BLEU scores are poor. Therefore, building models that transfer knowledge from high-resource to low-resource language pairs is warranted.
现有研究大多集中于高资源语言的语音翻译模型与数据集构建。众所周知,语音翻译模型的成功依赖于平行语音-文本语料库,因此低资源语言的语音翻译数据集建设亟待更多关注。值得注意的是,部分研究(如 Bansal 等人,2019)虽已报告姆博希语-法语语言对的翻译结果,但其 BLEU 评分表现欠佳。这充分表明,建立能够实现从高资源语言对向低资源语言对迁移学习的模型具有重要研究价值。
10.10 LLMs for ST tasks 10.10 用于语音翻译任务的大语言模型
In the last few years, large language models (LLMs) have emerged as a promising solution to many NLP tasks, including ST. LLMs exhibit in-context learning (ICL) when trained on massive amounts of data; this unlocks their emergent abilities (Wei et al., 2022) and enables few-shot and zero-shot learning via prompting. A few works (Zhang et al., 2023b; Wu et al., 2023; Huang et al., 2023) (see (Gaido et al., 2024) for a comparative discussion) explore LLMs for the ST task. Concretely, all of these models use a speech foundation model (SFM) followed by a length adapter and modality adaptation to mix the two modalities, and then an LLM to generate the output (see the schematic sketch below). GenTranslate (Hu et al., 2024) builds upon SeamlessM4T by integrating an LLM on top and performing N-best hypothesis tuning. Initial results are promising. However, it remains to be seen how the various components affect downstream task performance, what the best strategy for prompt design is, and how to pre-train/fine-tune such models in a parameter-efficient way for ST tasks. Further, the use of LLMs for SimulMT has recently been proposed (Agostinelli et al., 2023), and it remains to be seen how to adapt SimulMT to SimulST.
近年来,大语言模型(LLMs)已成为包括语音翻译在内的众多自然语言处理任务的有力解决方案。当在大规模数据上进行训练时,LLMs 展现出上下文学习能力(ICL)。这一过程释放了其潜在的新兴能力(Wei 等人,2022),并通过提示机制使其具备小样本和零样本学习能力。已有若干研究(Zhang 等人,2023b;Wu 等人,2023;Huang 等人,2023)(对比讨论参见 Gaido 等人,2024)探索了 LLMs 在语音翻译任务中的应用。具体而言,这些模型均采用语音基础模型(SFM)作为前端,依次经过长度适配器、模态适配、多模态融合等处理环节,最终由 LLMs 生成输出结果。GenTranslate(Hu 等人,2024)在 SeamlessM4T 框架基础上集成 LLM 顶层模块,并执行 N-best 假设调优,初步结果令人鼓舞。然而,各组件对下游任务性能的影响机制、最优提示设计策略,以及如何针对语音翻译任务进行参数高效的预训练/微调等问题,仍有待深入探究。此外,近期有研究提出将 LLMs 应用于同步机器翻译(SimulMT)(Agostinelli 等人,2023 年),但如何将 SimulMT 适配至同步语音翻译(SimulST)仍有待探索。
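As a schematic illustration of the generic recipe summarized above (speech foundation model, then length adapter, then modality projection, then LLM), the following PyTorch sketch wires the pieces together with stand-in modules. Layer sizes, the adapter stride, and all module names are illustrative assumptions rather than the configuration of any published system.

```python
# Schematic sketch of the SFM + length adapter + projection + LLM recipe
# described above. All module shapes and names are illustrative; real systems
# plug in a pretrained speech encoder and a pretrained decoder-only LLM in
# place of the stand-ins below.
import torch
import torch.nn as nn


class LengthAdapter(nn.Module):
    """Shrink the (long) speech-frame sequence with a strided 1-D convolution."""
    def __init__(self, dim: int, stride: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # (B, T, D)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)     # (B, T', D)


class SpeechToLLMBridge(nn.Module):
    """Map compressed speech states into the LLM embedding space and prepend
    them to the embedded text prompt, following the recipe sketched above."""
    def __init__(self, speech_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        self.adapter = LengthAdapter(speech_dim)
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_states: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        speech_embeds = self.proj(self.adapter(speech_states))   # (B, T', llm_dim)
        return torch.cat([speech_embeds, prompt_embeds], dim=1)  # fed to the LLM


if __name__ == "__main__":
    bridge = SpeechToLLMBridge()
    speech_states = torch.randn(1, 400, 768)   # stand-in for frozen SFM outputs
    prompt_embeds = torch.randn(1, 16, 2048)   # stand-in for embedded text prompt
    print(bridge(speech_states, prompt_embeds).shape)  # torch.Size([1, 116, 2048])
```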
10.11 Really long Context Modelling
10.11 超长上下文建模
As mentioned in the streaming section, SST models need to handle long input sequences. Current speech encoders lack unbounded-context modeling capability due to the quadratic complexity of self-attention (see the back-of-the-envelope comparison below). There have been recent efforts to address this problem; for example, Mamba (Zhang et al., 2024a), Infini-attention (Munkhdalai et al., 2024), and TransformerFAM (Hwang et al., 2024) show promising results in long-context modeling. These models may be explored for the SST task as well.
如流式处理章节所述,语音到文本翻译模型需要处理长输入序列。当前语音编码器由于自注意力机制的二次方复杂度限制,缺乏无限上下文建模能力。近期在无限上下文处理方面已取得若干进展,例如 Mamba(Zhang 等人,2024a)、Infini-attention(Munkhdalai 等人,2024)和 TransformerFAM(Hwang 等人,2024)在长上下文建模中展现出良好效果。这些模型同样值得在语音到文本翻译任务中进行探索。
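As a rough illustration of why quadratic self-attention limits usable context, the back-of-the-envelope comparison below considers a single layer over T speech frames of model dimension d, ignoring constant factors such as state-expansion dimensions; it is a simplification, not a precise cost model of any particular architecture.

```latex
% Single layer, sequence length T, model dimension d (constants ignored):
% full self-attention forms all pairwise scores, while recurrent/state-space
% updates (as in Mamba- or Infini-attention-style models) carry a fixed-size state.
\begin{align*}
\text{Self-attention:} \quad & \text{time } \mathcal{O}(T^{2} d), \quad \text{attention-score memory } \mathcal{O}(T^{2}),\\
\text{Fixed-state recurrence:} \quad & \text{time } \mathcal{O}(T d), \quad \text{state size independent of } T.
\end{align*}
```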
11 Conclusion 11 结论
This survey delves into the most recent advancements in E2E ST translation. Our discussion covers the models, evaluation metrics, and datasets used for ST. We review various frameworks for ST models and highlight previous research in this field, categorizing ST models by the kind of data they handle and the models employed. Additionally, we discuss potential future directions for improving speech-to-text translation. Our findings suggest that the gap between cascade and E2E system performance in both online and offline settings is narrowing; however, for some language pairs the gap is still wide, and additional work is therefore warranted. Our goal in the present survey is to offer valuable insights into this topic and drive advancements in ST research. We believe such reviews will be useful to researchers.
本综述论文深入探讨了端到端语音转文本翻译领域的最新研究进展。我们系统分析了用于训练语音翻译模型的各类架构、评估指标及数据集,梳理了该领域的多种研究框架并重点评述了前人工作。根据处理数据类型和采用模型的不同,我们对现有语音翻译模型进行了分类研究。此外,我们还探讨了改进语音转文本翻译技术的潜在发展方向。研究发现,级联系统与端到端系统在在线和离线场景下的性能差距正在逐步缩小,但某些语言对之间仍存在显著差异,这为后续研究提供了改进空间。本次语音翻译综述旨在为该领域提供有价值的见解,推动相关研究发展。我们相信此类综述研究将对科研人员具有重要参考价值。
References
- Abbott (1999) Abbott, L.F., 1999. Lapicque’s introduction of the integrate-and-fire model neuron (1907). Brain Research Bulletin 50, 303–304.
- Agostinelli et al. (2023) Agostinelli, V., Wild, M., Raffel, M., Fuad, K.A.A., Chen, L., 2023. Simul-llm: A framework for exploring high-quality simultaneous translation with large language models. ArXiv abs/2312.04691.
- Agrawal et al. (2023) Agrawal, S., Anastasopoulos, A., Bentivogli, L., Bojar, O., Borg, C., Carpuat, M., Cattoni, R., Cettolo, M., Chen, M., Chen, W., Choukri, K., Chronopoulou, A., Currey, A., Declerck, T., Dong, Q., Duh, K., Estève, Y., Federico, M., Gahbiche, S., Haddow, B., Hsu, B., Mon Htut, P., Inaguma, H., Javorský, D., Judge, J., Kano, Y., Ko, T., Kumar, R., Li, P., Ma, X., Mathur, P., Matusov, E., McNamee, P., P. McCrae, J., Murray, K., Nadejde, M., Nakamura, S., Negri, M., Nguyen, H., Niehues, J., Niu, X., Kr. Ojha, A., E. Ortega, J., Pal, P., Pino, J., van der Plas, L., Polák, P., Rippeth, E., Salesky, E., Shi, J., Sperber, M., Stüker, S., Sudoh, K., Tang, Y., Thompson, B., Tran, K., Turchi, M., Waibel, A., Wang, M., Watanabe, S., Zevallos, R., 2023. FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN, in: Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Association for Computational Linguistics, Toronto, Canada (in-person and online). pp. 1–61.
- Alastruey et al. (2022) Alastruey, B., Ferrando, J., Gállego, G.I., Costa-jussà, M.R., 2022. On the locality of attention in direct speech translation, in: Louvan, S., Madotto, A., Madureira, B. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, Dublin, Ireland. pp. 402–412. doi:10.18653/v1/2022.acl-srw.32.
- Anastasopoulos and Chiang (2018) Anastasopoulos, A., Chiang, D., 2018. Tied multitask learning for neural speech translation, in: Walker, M., Ji, H., Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana. pp. 82–91. doi:10.18653/v1/N18-1008.
- Anastasopoulos et al. (2016) Anastasopoulos, A., Chiang, D., Duong, L., 2016. An unsupervised probability model for speech-to-translation alignment of low-resource languages, in: Su, J., Duh, K., Carreras, X. (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas. pp. 1255–1263. doi:10.18653/v1/D16-1133.
- Ao et al. (2021) Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., et al., 2021. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205 .
- Arivazhagan et al. (2019a) Arivazhagan, N., Cherry, C., I, T., Macherey, W., Baljekar, P.N., Foster, G.F., 2019a. Re-translation strategies for long form, simultaneous, spoken language translation. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7919–7923.
- Arivazhagan et al. (2019b) Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.C., Yavuz, S., Pang, R., Li, W., Raffel, C., 2019b. Monotonic infinite lookback attention for simultaneous machine translation, in: Annual Meeting of the Association for Computational Linguistics.
- Arivazhagan et al. (2020) Arivazhagan, N., Cherry, C., Macherey, W., Foster, G.F., 2020. Re-translation versus streaming for simultaneous translation, in: International Workshop on Spoken Language Translation.
- Baevski et al. (2022) Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M., 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language, in: International Conference on Machine Learning, PMLR. pp. 1298–1312.
- Baevski et al. (2020) Baevski, A., Zhou, H., Mohamed, A., Auli, M., 2020. wav2vec 2.0: a framework for self-supervised learning of speech representations, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA.
- Bahar et al. (2019a) Bahar, P., Bieschke, T., Ney, H., 2019a. A comparative study on end-to-end speech to text translation, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE. pp. 792–799.
- Bahar et al. (2020) Bahar, P., Wilken, P., Alkhouli, T., Guta, A., Golik, P., Matusov, E., Herold, C., 2020. Start-before-end and end-to-end: Neural speech translation by apptek and rwth aachen university, in: International Workshop on Spoken Language Translation.
- Bahar et al. (2019b) Bahar, P., Zeyer, A., Schlüter, R., Ney, H., 2019b. On using SpecAugment for end-to-end speech translation, in: Niehues, J., Cattoni, R., Stüker, S., Negri, M., Turchi, M., Ha, T.L., Salesky, E., Sanabria, R., Barrault, L., Specia, L., Federico, M. (Eds.), Proceedings of the 16th International Conference on Spoken Language Translation, Association for Computational Linguistics, Hong Kong.
- Banerjee and Lavie (2005) Banerjee, S., Lavie, A., 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72.
- Bansal et al. (2019) Bansal, S., Kamper, H., Livescu, K., Lopez, A., Goldwater, S., 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, in: Burstein, J., Doran, C., Solorio, T. (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp. 58–68. doi:10.18653/v1/N19-1006.
- Bansal et al. (2017) Bansal, S., Kamper, H., Lopez, A., Goldwater, S., 2017. Towards speech-to-text translation without speech recognition, in: Lapata, M., Blunsom, P., Koller, A. (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Association for Computational Linguistics, Valencia, Spain. pp. 474–479.
- Bapna et al. (2021) Bapna, A., Chung, Y.A., Wu, N., Gulati, A., Jia, Y., Clark, J., Johnson, M., Riesa, J., Conneau, A., Zhang, Y., 2021. Slam: A unified encoder for speech and language modeling via speech-text joint pre-training. ArXiv abs/2110.10329.
- Barrault et al. (2023) Barrault, L., Chung, Y.A., Meglioli, M.C., Dale, D., Dong, N., Duquenne, P.A., ElSahar, H., Gong, H., Heffernan, K., Hoffman, J., Klaiber, C., Li, P., Licht, D., Maillard, J., Rakotoarison, A., Sadagopan, K.R., Wenzek, G., Ye, E., Akula, B., Chen, P.J., Hachem, N.E., Ellis, B., Gonzalez, G.M., Haaheim, J., Hansanti, P., Howes, R., Huang, B., Hwang, M.J., Inaguma, H., Jain, S., Kalbassi, E., Kallet, A., Kulikov, I., Lam, J., Li, S.W., Ma, X., Mavlyutov, R., Peloquin, B., Ramadan, M., Ramakrishnan, A., Sun, A., Tran, K.M., Tran, T., Tufanov, I., Vogeti, V., Wood, C., Yang, Y., Yu, B., Andrews, P.Y., Balioglu, C., Costa-jussà, M.R., Çelebi, O., Elbayad, M., Gao, C., Guzm’an, F., Kao, J.T., Lee, A., Mourachko, A., Pino, J.M., Popuri, S., Ropers, C., Saleem, S., Schwenk, H., Tomasello, P., Wang, C., Wang, J., Wang, S., 2023. Seamlessm4t: Massively multilingual&multimodal machine translation.
- Bentivogli et al. (2021) Bentivogli, L., Cettolo, M., Gaido, M., Karakanta, A., Martinelli, A., Negri, M., Turchi, M., 2021. Cascade versus direct speech translation: Do the differences still make a difference?, in: Annual Meeting of the Association for Computational Linguistics.
- Bérard et al. (2018) Bérard, A., Besacier, L., Kocabiyikoglu, A.C., Pietquin, O., 2018. End-to-end automatic speech translation of audiobooks, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 6224–6228.
- Bérard et al. (2016) Bérard, A., Pietquin, O., Besacier, L., Servan, C., 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation, in: NIPS Workshop on end-to-end learning for speech and audio processing, Barcelona, Spain.
- Bozinovski and Fulgosi (1976) Bozinovski, S., Fulgosi, A., 1976. The influence of pattern similarity and transfer learning upon training of a base perceptron b2, in: Proceedings of Symposium Informatica, pp. 121–126.
- Brauwers and Frasincar (2022) Brauwers, G., Frasincar, F., 2022. A general survey on attention mechanisms in deep learning. IEEE Transactions on Knowledge and Data Engineering 35, 3279–3298.
- Bucilǎ et al. (2006) Bucilǎ, C., Caruana, R., Niculescu-Mizil, A., 2006. Model compression. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2006, 535–541. doi:10.1145/1150402.1150464.
- Cattoni et al. (2021) Cattoni, R., Di Gangi, M.A., Bentivogli, L., Negri, M., Turchi, M., 2021. Must-c: A multilingual corpus for end-to-end speech translation. Computer speech & language 66, 101155.
- Chang and yi Lee (2022) Chang, C.C., yi Lee, H., 2022. Exploring continuous integrate-and-fire for adaptive simultaneous speech translation. ArXiv abs/2204.09595.
- Chen et al. (2021a) Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.Q., Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J., et al., 2021a. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909 .
- Chen et al. (2020) Chen, J., Ma, M., Zheng, R., Huang, L., 2020. Mam: Masked acoustic modeling for end-to-end speech-to-text translation. arXiv preprint arXiv:2010.11445 .
- Chen et al. (2021b) Chen, J., Ma, M., Zheng, R., Huang, L., 2021b. Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online. pp. 4618–4624. doi:10.18653/v1/2021.findings-acl.406.
- Chen et al. (2022) Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., Yu, X., Wei, F., 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing 16, 1505–1518. doi:10.1109/JSTSP.2022.3188113.
- Cheng et al. (2022) Cheng, X., Dong, Q., Yue, F., Ko, T., Wang, M., Zou, Y., 2022. M3st: Mix at three levels for speech translation. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1–5.
- Cherry and Foster (2019) Cherry, C., Foster, G.F., 2019. Thinking slow about latency evaluation for simultaneous machine translation. ArXiv abs/1906.00048.
- Chiu and Raffel (2018) Chiu, C.C., Raffel, C., 2018. Monotonic chunkwise attention, in: International Conference on Learning Representations.
- Cho and Esipova (2016a) Cho, K., Esipova, M., 2016a. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012 .
- Cho and Esipova (2016b) Cho, K., Esipova, M., 2016b. Can neural machine translation do simultaneous translation? ArXiv abs/1606.02012.
- Cho et al. (2021) Cho, W.I., Kim, S.M., Cho, H., Kim, N.S., 2021. kosp2e: Korean Speech to English Translation Corpus, in: Proc. Interspeech 2021, pp. 3705–3709. doi:10.21437/Interspeech.2021-1040.
- Cho et al. (2022) Cho, W.I., Moon, S., Kim, J., Kim, S., Kim, N.S., 2022. StyleKQC: A style-variant paraphrase corpus for Korean questions and commands, in: Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., Piperidis, S. (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France. pp. 7122–7128.
- Chopra et al. (2005) Chopra, S., Hadsell, R., LeCun, Y., 2005. Learning a similarity metric discriminatively, with application to face verification, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE. pp. 539–546.
- Chuang et al. (2021) Chuang, S.P., Chuang, Y.S., Chang, C.C., Lee, H.y., 2021. Investigating the reordering capability in CTC-based non-autoregressive end-to-end speech translation, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online. pp. 1068–1077. doi:10.18653/v1/2021.findings-acl.92.
- Chung and Glass (2018) Chung, Y.A., Glass, J., 2018. Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech, in: Proc. Interspeech 2018, pp. 811–815. doi:10.21437/Interspeech.2018-2341.
- Chung et al. (2021) Chung, Y.A., Zhang, Y., Han, W., Chiu, C.C., Qin, J., Pang, R., Wu, Y., 2021. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , 244–250.
- Cieri et al. (2004) Cieri, C., Miller, D., Walker, K., 2004. The fisher corpus: A resource for the next generations of speech-to-text., in: LREC, pp. 69–71.
- Conneau et al. (2018) Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M., 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070 .
- Conneau and Lample (2019) Conneau, A., Lample, G., 2019. Cross-lingual language model pretraining. Curran Associates Inc., Red Hook, NY, USA.
- Conneau et al. (2022) Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., Bapna, A., 2022. Fleurs: Few-shot learning evaluation of universal representations of speech, in: 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings, IEEE. pp. 798–805. doi:10.1109/SLT54892.2023.10023141.
- Cui et al. (2015) Cui, X., Goel, V., Kingsbury, B., 2015. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 1469–1477.
- Dalvi et al. (2018) Dalvi, F., Durrani, N., Sajjad, H., Vogel, S., 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation, in: Walker, M., Ji, H., Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana. pp. 493–499. URL: https://aclanthology.org/N18-2079, doi:10.18653/v1/N18-2079.
- Deuchar (2008) Deuchar, M., 2008. The miami corpus: Documentation file. Bangortalk, bangortalk.org.uk/docs/Miami_doc.pdf.
- Dong and Xu (2019) Dong, L., Xu, B., 2019. Cif: Continuous integrate-and-fire for end-to-end speech recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 6079–6083.
- Dong et al. (2020) Dong, Q., Wang, M., Zhou, H., Xu, S., Xu, B., Li, L., 2020. Consecutive decoding for speech-to-text translation, in: AAAI Conference on Artificial Intelligence.
- Dong et al. (2021) Dong, Q., Ye, R., Wang, M., Zhou, H., Xu, S., Xu, B., Li, L., 2021. Listen, understand and translate: Triple supervision decouples end-to-end speech-to-text translation, in: AAAI Conference on Artificial Intelligence.
- Dong et al. (2022) Dong, Q., Zhu, Y., Wang, M., Li, L., 2022. Learning when to translate for streaming speech, in: Muresan, S., Nakov, P., Villavicencio, A. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 680–694. doi:10.18653/v1/2022.acl-long.50.
- Duong et al. (2016) Duong, L., Anastasopoulos, A., Chiang, D., Bird, S., Cohn, T., 2016. An attentional model for speech translation without transcription, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 949–959.
- Etchegoyhen et al. (2022) Etchegoyhen, T., Arzelus, H., Gete, H., Alvarez, A., Torre, I.G., Martín-Doñas, J.M., González-Docasal, A., Fernandez, E.B., 2022. Cascade or direct speech translation? a case study. Applied Sciences 12, 1097.
- Fang and Feng (2023) Fang, Q., Feng, Y., 2023. Back translation for speech-to-text translation without transcripts, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.
- Fang et al. (2022) Fang, Q., Ye, R., Li, L., Feng, Y., Wang, M., 2022. STEMM: Self-learning with speech-text manifold mixup for speech translation, in: Muresan, S., Nakov, P., Villavicencio, A. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 7050–7062. doi:10.18653/v1/2022.acl-long.486.
- Federmann and Lewis (2016) Federmann, C., Lewis, W., 2016. Microsoft speech language translation (mslt) corpus: The iwslt 2016 release for english, french and german, in: Proceedings of the 13th International Conference on Spoken Language Translation.
- Fügen et al. (2007) Fügen, C., Waibel, A.H., Kolss, M., 2007. Simultaneous translation of lectures and speeches. Machine Translation 21, 209–252.
- Gaido et al. (2020a) Gaido, M., Di Gangi, M.A., Negri, M., Turchi, M., 2020a. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020, in: Federico, M., Waibel, A., Knight, K., Nakamura, S., Ney, H., Niehues, J., Stüker, S., Wu, D., Mariani, J., Yvon, F. (Eds.), Proceedings of the 17th International Conference on Spoken Language Translation, Association for Computational Linguistics, Online. pp. 80–88. doi:10.18653/v1/2020.iwslt-1.8.
- Gaido et al. (2020b) Gaido, M., Gangi, M.A.D., Negri, M., Turchi, M., 2020b. On knowledge distillation for direct speech translation. ArXiv abs/2012.04964.
- Gaido et al. (2021) Gaido, M., Negri, M., Cettolo, M., Turchi, M., 2021. Beyond voice activity detection: Hybrid audio segmentation for direct speech translation, in: International Conference on Natural Language and Speech Processing.
- Gaido et al. (2024) Gaido, M., Papi, S., Negri, M., Bentivogli, L., 2024. Speech translation with speech foundation models and large language models: What is there and what is missing? ArXiv abs/2402.12025.
- Gállego et al. (2021) Gállego, G.I., Tsiamas, I., Escolano, C., Fonollosa, J.A.R., Costa-jussà, M.R., 2021. End-to-end speech translation with pre-trained models and adapters: Upc at iwslt 2021, in: Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), Association for Computational Linguistics, Bangkok, Thailand (online). pp. 110–119. doi:10.18653/v1/2021.iwslt-1.11.
- Gangi et al. (2019) Gangi, M.A.D., Negri, M., Turchi, M., 2019. One-to-many multilingual end-to-end speech translation. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , 585–592.
- Godard et al. (2018) Godard, P., Adda, G., Adda-Decker, M., Benjumea, J., Besacier, L., Cooper-Leavitt, J., Kouarata, G.N., Lamel, L., Maynard, H., Mueller, M., Rialland, A., Stueker, S., Yvon, F., Zanon-Boito, M., 2018. A very low resource language speech corpus for computational language documentation experiments, in: Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., Tokunaga, T. (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
- Goldman-Eisler (1972) Goldman-Eisler, F., 1972. Segmentation of input in simultaneous translation. Journal of Psycholinguistic Research 1, 127–140.
- Graves (2012) Graves, A., 2012. Sequence transduction with recurrent neural networks. ArXiv abs/1211.3711.
- Graves et al. (2006) Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd International Conference on Machine Learning, Association for Computing Machinery, New York, NY, USA. p. 369–376.
- Grissom II et al. (2014) Grissom II, A., He, H., Boyd-Graber, J., Morgan, J., Daumé III, H., 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar. pp. 1342–1352. doi:10.3115/v1/D14-1140.
- Gulati et al. (2020) Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., Pang, R., 2020. Conformer: Convolution-augmented Transformer for Speech Recognition, in: Proc. Interspeech 2020, pp. 5036–5040. doi:10.21437/Interspeech.2020-3015.
- Guo et al. (2024) Guo, J., Wu, Z., Li, Z., Shang, H., Wei, D., Chen, X., Rao, Z., Li, S., Yang, H., 2024. R-bi: Regularized batched inputs enhance incremental decoding framework for low-latency simultaneous speech translation. ArXiv abs/2401.05700.
- Han et al. (2021) Han, C., Wang, M., Ji, H., Li, L., 2021. Learning shared semantic space for speech-to-text translation, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP.
- Hinton et al. (2015) Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural network URL: https://arxiv.org/abs/1503.02531v1.
- Houlsby et al. (2019) Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S., 2019. Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR. pp. 2790–2799.
- Hsu et al. (2021) Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A., 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3451–3460.
- Hu et al. (2024) Hu, Y., Chen, C., Yang, C.H.H., Li, R., Zhang, D., Chen, Z., Chng, E.S., 2024. Gentranslate: Large language models are generative multilingual speech and machine translators. ArXiv abs/2402.06894.
- Huang et al. (2023) Huang, Z., Ye, R., Ko, T., Dong, Q., Cheng, S., Wang, M., Li, H., 2023. Speech translation with large language models: An industrial practice. ArXiv abs/2312.13585.
- Huzaifah and Kukanov (2023) Huzaifah, M., Kukanov, I., 2023. An analysis of semantically-aligned speech-text embeddings, in: 2022 IEEE Spoken Language Technology Workshop (SLT), IEEE. pp. 747–754.
- Hwang et al. (2024) Hwang, D., Wang, W., Huo, Z., Sim, K.C., Mengibar, P.M., 2024. Transformerfam: Feedback attention is working memory. arXiv:2404.09173.
- Inaguma et al. (2019) Inaguma, H., Duh, K., Kawahara, T., Watanabe, S., 2019. Multilingual end-to-end speech translation. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , 570–577.
- Inaguma et al. (2020a) Inaguma, H., Higuchi, Y., Duh, K., Kawahara, T., Watanabe, S., 2020a. Orthros: non-autoregressive end-to-end speech translation with dual-decoder. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7503–7507.
- Inaguma et al. (2021) Inaguma, H., Higuchi, Y., Duh, K., Kawahara, T., Watanabe, S., 2021. Non-autoregressive end-to-end speech translation with parallel autoregressive rescoring. ArXiv abs/2109.04411. URL: https://api.semanticscholar.org/CorpusID:237453587.
- Inaguma et al. (2020b) Inaguma, H., Kiyono, S., Duh, K., Karita, S., Yalta, N., Hayashi, T., Watanabe, S., 2020b. ESPnet-ST: All-in-one speech translation toolkit, in: Celikyilmaz, A., Wen, T.H. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online. pp. 302–311. doi:10.18653/v1/2020.acl-demos.34.
- Iranzo-S’anchez et al. (2022) Iranzo-S’anchez, J., Saiz, J.C., Juan, A., 2022. From simultaneous to streaming machine translation by leveraging streaming history, in: Annual Meeting of the Association for Computational Linguistics.
- Iranzo-Sánchez et al. (2020) Iranzo-Sánchez, J., Silvestre-Cerda, J.A., Jorge, J., Roselló, N., Giménez, A., Sanchis, A., Civera, J., Juan, A., 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 8229–8233.
- Jaegle et al. (2021) Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O., Carreira, J., 2021. Perceiver: General perception with iterative attention. CoRR abs/2103.03206. URL: https://arxiv.org/abs/2103.03206, arXiv:2103.03206.
- Jia et al. (2019) Jia, Y., Johnson, M., Macherey, W., Weiss, R.J., Cao, Y., Chiu, C.C., Ari, N., Laurenzo, S., Wu, Y., 2019. Leveraging weakly supervised data to improve end-to-end speech-to-text translation, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 7180–7184.
- Jurafsky and Martin (2008) Jurafsky, D., Martin, J.H., 2008. Speech and language processing, 2nd edition.
- Kahn et al. (2019) Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazar’e, P.E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., rahman Mohamed, A., Dupoux, E., 2019. Libri-light: A benchmark for asr with limited or no supervision. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7669–7673.
- Kano et al. (2023) Kano, Y., Sudoh, K., Nakamura, S., 2023. Average token delay: A duration-aware latency metric for simultaneous translation. ArXiv abs/2311.14353.
- Khurana et al. (2020) Khurana, S., Laurent, A., Glass, J., 2020. Cstnet: Contrastive speech translation network for self-supervised speech representation learning. arXiv preprint arXiv:2006.02814 .
- Kim et al. (2017) Kim, S., Hori, T., Watanabe, S., 2017. Joint ctc-attention based end-to-end speech recognition using multi-task learning, in: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE. pp. 4835–4839.
- Kocabiyikoglu et al. (2018) Kocabiyikoglu, A.C., Besacier, L., Kraif, O., 2018. Augmenting librispeech with French translations: A multimodal corpus for direct speech translation evaluation, in: Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., Tokunaga, T. (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
- Lam et al. (2024) Lam, T.K., Birch, A., Haddow, B., 2024. Compact speech translation models via discrete speech units pretraining. ArXiv abs/2402.19333.
- Lam et al. (2020) Lam, T.K., Schamoni, S., Riezler, S., 2020. Cascaded models with cyclic feedback for direct speech translation. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7508–7512.
- Lam et al. (2022a) Lam, T.K., Schamoni, S., Riezler, S., 2022a. Make more of your data: Minimal effort data augmentation for automatic speech recognition and translation. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1–5.
- Lam et al. (2022b) Lam, T.K., Schamoni, S., Riezler, S., 2022b. Sample, translate, recombine: Leveraging audio alignments for data augmentation in end-to-end speech translation, in: Annual Meeting of the Association for Computational Linguistics.
- Larochelle and Hinton (2010) Larochelle, H., Hinton, G.E., 2010. Learning to combine foveal glimpses with a third-order boltzmann machine, in: Neural Information Processing Systems.
- Le et al. (2023a) Le, C., Qian, Y., Zhou, L., LIU, S., Qian, Y., Zeng, M., Huang, X., 2023a. ComSL: A composite speech-language model for end-to-end speech-to-text translation, in: Thirty-seventh Conference on Neural Information Processing Systems. URL: https://openreview.net/forum?id=6Qx7G1xrAk.
- Le et al. (2020) Le, H., Pino, J., Wang, C., Gu, J., Schwab, D., Besacier, L., 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation, in: Scott, D., Bel, N., Zong, C. (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online). pp. 3520–3533. doi:10.18653/v1/2020.coling-main.314.
- Le et al. (2021) Le, H., Pino, J., Wang, C., Gu, J., Schwab, D., Besacier, L., 2021. Lightweight adapter tuning for multilingual speech translation, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online.
- Le et al. (2023b) Le, P.H., Gong, H., Wang, C., Pino, J., Lecouteux, B., Schwab, D., 2023b. Pre-training for speech translation: Ctc meets optimal transport, in: Proceedings of the 40th International Conference on Machine Learning, JMLR.org.
- Lee et al. (2021) Lee, Y.K., Jung, Y., Lee, I., Park, J.E., Hahn, S., 2021. Building a psychological ground truth dataset with empathy and theory-of-mind during the covid-19 pandemic, in: Proceedings of the Annual Meeting of the Cognitive Science Society.
- Li et al. (2020) Li, X., Wang, C., Tang, Y., Tran, C., Tang, Y., Pino, J.M., Baevski, A., Conneau, A., Auli, M., 2020. Multilingual speech translation from efficient finetuning of pretrained models, in: Annual Meeting of the Association for Computational Linguistics.
- Lin (1991) Lin, J., 1991. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory 37, 145–151.
- Liu et al. (2021a) Liu, D., Du, M., Li, X., Li, Y., Chen, E., 2021a. Cross attention augmented transducer networks for simultaneous translation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 39–55.
- Liu et al. (2020a) Liu, D., Spanakis, G., Niehues, J., 2020a. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection, in: Interspeech.
- Liu et al. (2023) Liu, X.B., Zhang, J., Ferrer, L., Xu, S., Bahirwani, V., Smus, B., Olwal, A., Du, R., 2023. Modeling and improving text stability in live captions. Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems .
- Liu et al. (2020b) Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L., 2020b. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742.
- Liu et al. (2020c) Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L., 2020c. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742.
- Liu et al. (2019) Liu, Y., Xiong, H., Zhang, J., He, Z., Wu, H., Wang, H., Zong, C., 2019. End-to-End Speech Translation with Knowledge Distillation, in: Proc. Interspeech 2019, pp. 1128–1132. doi:10.21437/Interspeech.2019-2582.
- Liu et al. (2020d) Liu, Y., Zhu, J., Zhang, J., Zong, C., 2020d. Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920 .
- Liu et al. (2021b) Liu, Z., Lin, Y., Sun, M., 2021b. Representation learning for natural language processing. CoRR abs/2102.03732. URL: https://arxiv.org/abs/2102.03732, arXiv:2102.03732.
- Ma et al. (2018) Ma, M., Huang, L., Xiong, H., Zheng, R., Liu, K., Zheng, B., Zhang, C., He, Z., Liu, H., Li, X., Wu, H., Wang, H., 2018. Stacl: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework, in: Annual Meeting of the Association for Computational Linguistics.
- Ma et al. (2020a) Ma, X., Dousti, M.J., Wang, C., Gu, J., Pino, J.M., 2020a. Simuleval: An evaluation toolkit for simultaneous translation, in: Conference on Empirical Methods in Natural Language Processing.
- Ma et al. (2020b) Ma, X., Pino, J., Koehn, P., 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation, in: Wong, K.F., Knight, K., Wu, H. (Eds.), Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China. pp. 582–587.
- Ma et al. (2019) Ma, X., Pino, J.M., Cross, J., Puzon, L., Gu, J., 2019. Monotonic multihead attention. ICLR abs/1909.12406.
- Ma et al. (2023) Ma, X., Sun, A.Y., Ouyang, S., Inaguma, H., Tomasello, P., 2023. Efficient monotonic multihead attention. ArXiv abs/2312.04515.
- Ma et al. (2020c) Ma, X., Wang, Y., Dousti, M.J., Koehn, P., Pino, J.M., 2020c. Streaming simultaneous speech translation with augmented memory transformer. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7523–7527.
- Marie et al. (2021) Marie, B., Fujita, A., Rubino, R., 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online. pp. 7297–7306. doi:10.18653/v1/2021.acl-long.566.
- Matusov et al. (2007) Matusov, E., Hillard, D., Magimai-Doss, M., Hakkani-Tür, D.Z., Ostendorf, M., Ney, H., 2007. Improving speech translation with automatic boundary prediction, in: Interspeech.
- Matusov et al. (2018) Matusov, E., Wilken, P., Bahar, P., Schamper, J., Golik, P., Zeyer, A., Silvestre-Cerdà, J.A., Martinez-Villaronga, A.A., Pesch, H., Peter, J.T., 2018. Neural speech translation at apptek, in: International Workshop on Spoken Language Translation.
- Meng et al. (2021) Meng, L., Xu, J., Tan, X., Wang, J., Qin, T., Xu, B., 2021. Mixspeech: Data augmentation for low-resource automatic speech recognition. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7008–7012.
- Mnih et al. (2014) Mnih, V., Heess, N.M.O., Graves, A., Kavukcuoglu, K., 2014. Recurrent models of visual attention, in: Neural Information Processing Systems.
- Mohamed et al. (2022) Mohamed, A., Lee, H.y., Borgholt, L., Havtorn, J.D., Edin, J., Igel, C., Kirchhoff, K., Li, S.W., Livescu, K., Maaloe, L., Sainath, T.N., Watanabe, S., 2022. Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing 16, 1179–1210. doi:10.1109/jstsp.2022.3207050.
- Munkhdalai et al. (2024) Munkhdalai, T., Faruqui, M., Gopal, S., 2024. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv:2404.07143.
- Nguyen et al. (2021) Nguyen, T.S., Stüker, S., Waibel, A., 2021. Super-Human Performance in Online Low-Latency Recognition of Conversational Speech, in: Proc. Interspeech 2021, pp. 1762–1766. doi:10.21437/Interspeech.2021-1114.
- Niehues et al. (2016) Niehues, J., Nguyen, T.S., Cho, E., Ha, T.L., Kilgour, K., Müller, M., Sperber, M., Stüker, S., Waibel, A.H., 2016. Dynamic transcription for low-latency speech translation, in: Interspeech.
- Niehues et al. (2018) Niehues, J., Pham, N.Q., Ha, T.L., Sperber, M., Waibel, A., 2018. Low-Latency Neural Speech Translation, in: Proc. Interspeech 2018, pp. 1293–1297. doi:10.21437/Interspeech.2018-1055.
- Ochshorn and Hawkins (2017) Ochshorn, R., Hawkins, M., 2017. Gentle forced aligner. github.com/lowerquality/gentle.
- Oda et al. (2014) Oda, Y., Neubig, G., Sakti, S., Toda, T., Nakamura, S., 2014. Optimizing segmentation strategies for simultaneous speech translation, in: Annual Meeting of the Association for Computational Linguistics.
- van den Oord et al. (2017) van den Oord, A., Vinyals, O., Kavukcuoglu, K., 2017. Neural discrete representation learning, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA. p. 6309–6318.
- Ott et al. (2019) Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., Auli, M., 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 .
- Ouyang et al. (2023) Ouyang, S., Ye, R., Li, L., 2023. WACO: Word-aligned contrastive learning for speech translation, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada. pp. 3891–3907. doi:10.18653/v1/2023.acl-long.216.
- Panayotov et al. (2015) Panayotov, V., Chen, G., Povey, D., Khudanpur, S., 2015. Librispeech: an asr corpus based on public domain audio books, in: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE. pp. 5206–5210.
- Papi et al. (2021a) Papi, S., Gaido, M., Negri, M., Turchi, M., 2021a. Speechformer: Reducing information loss in direct speech translation, in: Conference on Empirical Methods in Natural Language Processing.
- Papi et al. (2022a) Papi, S., Gaido, M., Negri, M., Turchi, M., 2022a. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation, in: Ive, J., Zhang, R. (Eds.), Proceedings of the Third Workshop on Automatic Simultaneous Translation, Association for Computational Linguistics, Online. pp. 12–17. doi:10.18653/v1/2022.autosimtrans-1.2.
- Papi et al. (2021b) Papi, S., Negri, M., Turchi, M., 2021b. Visualization: The missing factor in simultaneous speech translation. ArXiv abs/2111.00514.
- Papi et al. (2022b) Papi, S., Negri, M., Turchi, M., 2022b. Attention as a guide for simultaneous speech translation, in: Annual Meeting of the Association for Computational Linguistics.
- Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., Zhu, W.J., 2002. Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318.
- Parcollet et al. (2024) Parcollet, T., Nguyen, H., Evain, S., Boito, M.Z., Pupier, A., Mdhaffar, S., Le, H., Alisamir, S., Tomashenko, N., Dinarelli, M., et al., 2024. Lebenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of french speech. Computer Speech & Language , 101622.
- Park et al. (2019) Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., Le, Q.V., 2019. Specaugment: A simple data augmentation method for automatic speech recognition, in: Interspeech.
- Park (2018) Park, K., 2018. Kss dataset: Korean single speaker speech dataset.
- Paulik and Waibel (2013) Paulik, M., Waibel, A., 2013. Training speech translation from audio recordings of interpreter-mediated communication. Computer Speech & Language 27, 455–474.
- Peyré et al. (2019) Peyré, G., Cuturi, M., et al., 2019. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11, 355–607.
- Popović (2015) Popović, M., 2015. chrf: character n-gram f-score for automatic mt evaluation, in: Proceedings of the tenth workshop on statistical machine translation, pp. 392–395.
- Popuri et al. (2022) Popuri, S., Chen, P.J., Wang, C., Pino, J., Adi, Y., Gu, J., Hsu, W.N., Lee, A., 2022. Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation, in: Proc. Interspeech 2022, pp. 5195–5199. doi:10.21437/Interspeech.2022-11032.
- Potapczyk and Przybysz (2020) Potapczyk, T., Przybysz, P., 2020. Srpol’s system for the iwslt 2020 end-to-end speech translation task, in: International Workshop on Spoken Language Translation.
- Prabhavalkar et al. (2024) Prabhavalkar, R., Hori, T., Sainath, T.N., Schlüter, R., Watanabe, S., 2024. End-to-end speech recognition: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, 325–351. doi:10.1109/TASLP.2023.3328283.
- Rabiner and Schafer (2010) Rabiner, L., Schafer, R., 2010. Theory and applications of digital speech processing. Prentice Hall Press.
- Radford et al. (2023) Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I., 2023. Robust speech recognition via large-scale weak supervision, in: Proceedings of the 40th International Conference on Machine Learning, JMLR.org.
- Raffel et al. (2017) Raffel, C., Luong, M.T., Liu, P.J., Weiss, R.J., Eck, D., 2017. Online and linear-time attention by enforcing monotonic alignments, in: International Conference on Machine Learning.
- Ren et al. (2020) Ren, Y., Liu, J., Tan, X., Zhang, C., Qin, T., Zhao, Z., Liu, T.Y., 2020. Simulspeech: End-to-end simultaneous speech to text translation, in: Annual Meeting of the Association for Computational Linguistics.
- Salesky et al. (2021) Salesky, E., Wiesner, M., Bremerman, J., Cattoni, R., Negri, M., Turchi, M., Oard, D.W., Post, M., 2021. Multilingual tedx corpus for speech recognition and translation, in: Proceedings of Interspeech.
- Sanabria et al. (2018) Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L., Metze, F., 2018. How2: A Large-scale Dataset for Multimodal Language Understanding, in: NeurIPS, Montréal, Canada.
- Sandhan et al. (2022) Sandhan, J., Daksh, A., Paranjay, O.A., Behera, L., Goyal, P., 2022. Prabhupadavani: A code-mixed speech translation data for 25 languages, in: Degaetano, S., Kazantseva, A., Reiter, N., Szpakowicz, S. (Eds.), Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, International Conference on Computational Linguistics, Gyeongju, Republic of Korea. pp. 24–29.
- Sarkar et al. (2023) Sarkar, B., Maurya, C.K., Agrahri, A., 2023. Direct speech to text translation: Bridging the modality gap using simsiam, in: Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), pp. 250–255.
- Schlenoff et al. (2009) Schlenoff, C., Sanders, G., Weiss, B., Proctor, F., Steves, M.P., Virts, A., 2009. Evaluating speech translation systems: Applying score to transtac technologies, in: Proceedings of the 9th Workshop on Performance Metrics for Intelligent Systems, pp. 223–230.
- Schneider et al. (2019) Schneider, S., Baevski, A., Collobert, R., Auli, M., 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
- Sergeev and Balso (2018) Sergeev, A., Balso, M.D., 2018. Horovod: fast and easy distributed deep learning in tensorflow. CoRR abs/1802.05799. URL: http://arxiv.org/abs/1802.05799, arXiv:1802.05799.
- Sethiya et al. (2024) Sethiya, N., Nair, S., Maurya, C., 2024. Indic-TEDST: Datasets and baselines for low-resource speech to text translation, in: Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N. (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia. pp. 9019–9024.
- Snover et al. (2006) Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J., 2006. A study of translation edit rate with targeted human annotation, in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pp. 223–231.
- Sohn et al. (1999) Sohn, J., Kim, N.S., Sung, W., 1999. A statistical model-based voice activity detection. IEEE Signal Processing Letters 6, 1–3.
- Sohn (2016) Sohn, K., 2016. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems 29.
- Sperber et al. (2019) Sperber, M., Neubig, G., Niehues, J., Waibel, A., 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics 7, 313–325.
- Su et al. (2021) Su, J., Cao, J., Liu, W., Ou, Y., 2021. Whitening sentence representations for better semantics and faster retrieval. arXiv:2103.15316.
- Sun et al. (2023) Sun, H., Zhao, X., Lei, Y., Zhu, S., Xiong, D., 2023. Towards a deep understanding of multilingual end-to-end speech translation, in: Bouamor, H., Pino, J., Bali, K. (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore. pp. 14332–14348. doi:10.18653/v1/2023.findings-emnlp.956.
- Tan et al. (2024) Tan, W., Chen, Y., Chen, T., Qin, G., Xu, H., Zhang, H.C., Durme, B.V., Koehn, P., 2024. Streaming sequence transduction through dynamic compression. ArXiv abs/2402.01172.
- Tang et al. (2022) Tang, Y., Gong, H., Dong, N., Wang, C., Hsu, W.N., Gu, J., Baevski, A., Li, X., Mohamed, A., Auli, M., Pino, J., 2022. Unified speech-text pre-training for speech translation and recognition, in: Muresan, S., Nakov, P., Villavicencio, A. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 1488–1499. doi:10.18653/v1/2022.acl-long.105.
- Tang et al. (2021a) Tang, Y., Pino, J., Li, X., Wang, C., Genzel, D., 2021a. Improving speech translation by understanding and learning from the auxiliary text translation task, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online. pp. 4252–4261. doi:10.18653/v1/2021.acl-long.328.
- Tang et al. (2021b) Tang, Y., Pino, J.M., Li, X., Wang, C., Genzel, D., 2021b. Improving speech translation by understanding and learning from the auxiliary text translation task. ArXiv abs/2107.05782.
- Tran et al. (2020) Tran, C., Wang, C., Tang, Y., Tang, Y., Pino, J.M., Li, X., 2020. Cross-modal transfer learning for multilingual speech-to-text translation. ArXiv abs/2010.12829.
- Tsiamas et al. (2022a) Tsiamas, I., Gállego, G.I., Escolano, C., Fonollosa, J., Costa-jussà, M.R., 2022a. Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022, in: Salesky, E., Federico, M., Costa-jussà, M. (Eds.), Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), Association for Computational Linguistics, Dublin, Ireland (in-person and online). pp. 265–276. doi:10.18653/v1/2022.iwslt-1.23.
- Tsiamas et al. (2022b) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2022b. Efficient speech translation with dynamic latent perceivers. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5.
- Tsiamas et al. (2022c) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2022c. Shas: Approaching optimal segmentation for end-to-end speech translation, in: Interspeech.
- Tsiamas et al. (2024) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2024. Pushing the limits of zero-shot end-to-end speech translation. ArXiv abs/2402.10422.
- Tsiamas et al. (2023) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2023. Speech translation with foundation models and optimal transport: UPC at IWSLT23, in: Salesky, E., Federico, M., Carpuat, M. (Eds.), Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Association for Computational Linguistics, Toronto, Canada (in-person and online). pp. 397–410. doi:10.18653/v1/2023.iwslt-1.38.
- Tsiartas et al. (2013) Tsiartas, A., Ghosh, P., Georgiou, P., Narayanan, S., 2013. High-quality bilingual subtitle document alignments with application to spontaneous speech translation. Computer Speech & Language 27, 572–591.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need, in: NIPS.
- Vincent et al. (2017) Vincent, E., Watanabe, S., Nugraha, A.A., Barker, J., Marxer, R., 2017. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language 46, 535–557.
- Wang et al. (2022) Wang, C., Inaguma, H., Chen, P.J., Kulikov, I., Tang, Y., Hsu, W.N., Auli, M., Pino, J., 2022. Simple and effective unsupervised speech translation. arXiv preprint arXiv:2210.10191.
- Wang et al. (2020a) Wang, C., Pino, J.M., Wu, A., Gu, J., 2020a. Covost: A diverse multilingual speech-to-text translation corpus, in: International Conference on Language Resources and Evaluation.
- Wang et al. (2021a) Wang, C., Rivière, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J.M., Dupoux, E., 2021a. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, in: Annual Meeting of the Association for Computational Linguistics.
- Wang et al. (2020b) Wang, C., Tang, Y., Ma, X., Wu, A., Popuri, S., Okhonko, D., Pino, J., 2020b. Fairseq s2t: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171.
- Wang et al. (2020c) Wang, C., Wu, A., Pino, J.M., 2020c. Covost 2 and massively multilingual speech-to-text translation. arXiv: Computation and Language.
- Wang et al. (2021b) Wang, C., Wu, A., Pino, J.M., Baevski, A., Auli, M., Conneau, A., 2021b. Large-scale self- and semi-supervised learning for speech translation, in: Interspeech.
- Wang et al. (2020d) Wang, C., Wu, Y., Liu, S., Zhou, M., Yang, Z., 2020d. Curriculum pre-training for end-to-end speech translation. arXiv preprint arXiv:2004.10093.
- Wang et al. (2023) Wang, P., Sun, E., Xue, J., Wu, Y., Zhou, L., Gaur, Y., Liu, S., Li, J., 2023. LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers, in: Proc. INTERSPEECH 2023, pp. 57–61. doi:10.21437/Interspeech.2023-2004.
- Wei et al. (2022) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E.H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., Fedus, W., 2022. Emergent abilities of large language models. ArXiv abs/2206.07682.
- Weiss et al. (2017) Weiss, R.J., Chorowski, J., Jaitly, N., Wu, Y., Chen, Z., 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech, in: Proc. Interspeech 2017, pp. 2625–2629. doi:10.21437/Interspeech.2017-503.
- Weller et al. (2022) Weller, O., Sperber, M., Pires, T., Setiawan, H., Gollan, C., Telaar, D., Paulik, M., 2022. End-to-end speech translation for code switched speech. arXiv preprint arXiv:2204.05076.
- Wu et al. (2020) Wu, C., Wang, Y., Shi, Y., Yeh, C.F., Zhang, F., 2020. Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory, in: Proc. Interspeech 2020, pp. 2132–2136. doi:10.21437/Interspeech.2020-2079.
- Wu (2020) Wu, F., 2020. Deep representation learning in computer vision and its applications.
- Wu et al. (2022) Wu, F., Kim, K., Watanabe, S., Han, K.J., McDonald, R.T., Weinberger, K.Q., Artzi, Y., 2022. Wav2seq: Pre-training speech-to-text encoder-decoder models using pseudo languages. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5.
- Wu et al. (2023) Wu, H., Chang, K.W., Wu, Y.K., Lee, H.Y., 2023. Speechgen: Unlocking the generative power of speech language models with prompts. ArXiv abs/2306.02207.
- Xie and Hansen (2023) Xie, J., Hansen, J.H.L., 2023. Mixrep: Hidden representation mixup for low-resource speech recognition, in: INTERSPEECH 2023.
- Xu et al. (2023a) Xu, C., Liu, X., Liu, X., Sun, Q., Zhang, Y., Yang, M., Dong, Q., Ko, T., Wang, M., Xiao, T., Ma, A., Zhu, J., 2023a. CTC-based non-autoregressive speech translation, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada. pp. 13321–13339. doi:10.18653/v1/2023.acl-long.744.
- Xu et al. (2023b) Xu, C., Ye, R., Dong, Q., Zhao, C., Ko, T., Wang, M., Xiao, T., Zhu, J., 2023b. Recent advances in direct speech-to-text translation. ArXiv abs/2306.11646.
- Xue et al. (2022) Xue, J., Wang, P., Li, J., Post, M., Gaur, Y., 2022. Large-scale streaming end-to-end speech translation with neural transducers. arXiv preprint arXiv:2204.05352.
- Yan et al. (2023) Yan, B., Shi, J., Maiti, S., Chen, W., Li, X., Peng, Y., Arora, S., Watanabe, S., 2023. Cmu’s iwslt 2023 simultaneous speech translation system, in: International Workshop on Spoken Language Translation.
- Yang et al. (2023) Yang, C.K., Huang, K.P., Lu, K.H., Kuan, C.Y., Hsiao, C.Y., Lee, H.Y., 2023. Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision. ArXiv abs/2401.00273.
- Yao and Haddow (2020) Yao, Y., Haddow, B., 2020. Dynamic masking for improved stability in online spoken language translation, in: Conference of the Association for Machine Translation in the Americas.
- Ye et al. (2021) Ye, R., Wang, M., Li, L., 2021. End-to-end speech translation via cross-modal progressive training, in: Proc. of INTERSPEECH.
- Ye et al. (2022a) Ye, R., Wang, M., Li, L., 2022a. Cross-modal contrastive learning for speech translation, in: Carpuat, M., de Marneffe, M.C., Meza Ruiz, I.V. (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States. pp. 5099–5113. doi:10.18653/v1/2022.naacl-main.376.
- Ye et al. (2022b) Ye, R., Zhao, C., Ko, T., Meng, C., Wang, T., Wang, M., Cao, J., 2022b. Gigast: A 10,000-hour pseudo speech translation corpus. arXiv preprint arXiv:2204.03939.
- Yin et al. (2023) Yin, W., Liu, Z., Zhao, C., Wang, T., Tong, J., Ye, R., 2023. Improving speech translation by fusing speech and text, in: The 2023 Conference on Empirical Methods in Natural Language Processing.
- Yu et al. (2023) Yu, T., Ding, L., Liu, X., Chen, K., Zhang, M., Tao, D., Zhang, M., 2023. Promptst: Abstract prompt learning for end-to-end speech translation, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10140–10154.
- Zaidi et al. (2022) Zaidi, M.A., Lee, B., Kim, S., Kim, C., 2022. Cross-modal decision regularization for simultaneous speech translation, in: Interspeech.
- Zeng et al. (2021) Zeng, X., Li, L., Liu, Q., 2021. RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online. pp. 2461–2474. doi:10.18653/v1/2021.findings-acl.218.
- Zeng et al. (2022) Zeng, X., Li, L., Liu, Q., 2022. Adatrans: Adapting with boundary-based shrinking for end-to-end speech translation. ArXiv abs/2212.08911.
- Zenkel et al. (2018) Zenkel, T., Sperber, M., Niehues, J., Müller, M., Pham, N.Q., Stüker, S., Waibel, A., 2018. Open source toolkit for speech to text translation. Prague Bull. Math. Linguistics 111, 125–135.
- Zhang et al. (2022a) Zhang, B., Haddow, B., Sennrich, R., 2022a. Revisiting end-to-end speech-to-text translation from scratch, in: International Conference on Machine Learning, PMLR. pp. 26193–26205.
- Zhang et al. (2023a) Zhang, D., Ye, R., Ko, T., Wang, M., Zhou, Y., 2023a. Dub: Discrete unit back-translation for speech translation, in: Findings of ACL.
- Zhang et al. (2023b) Zhang, H., Si, N., Chen, Y., Zhang, W., Yang, X., Qu, D., Jiao, X., 2023b. Tuning large language model for end-to-end speech translation. ArXiv abs/2310.02050.
- Zhang et al. (2023c) Zhang, H., Si, N., Chen, Y., Zhang, W., Yang, X., Qu, D., Zhang, W.Q., 2023c. Improving speech translation by cross-modal multi-grained contrastive learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 1075–1086.
- Zhang et al. (2022b) Zhang, R., He, Z., Wu, H., Wang, H., 2022b. Learning adaptive segmentation policy for end-to-end simultaneous translation, in: Annual Meeting of the Association for Computational Linguistics.
- Zhang et al. (2021) Zhang, R., Wang, X., Zhang, C., He, Z., Wu, H., Li, Z., Wang, H., Chen, Y., Li, Q., 2021. BSTC: A large-scale Chinese-English speech translation dataset, in: Wu, H., Cherry, C., Huang, L., He, Z., Liu, Q., Elbayad, M., Liberman, M., Wang, H., Ma, M., Zhang, R. (Eds.), Proceedings of the Second Workshop on Automatic Simultaneous Translation, Association for Computational Linguistics, Online. pp. 28–35. doi:10.18653/v1/2021.autosimtrans-1.5.
- Zhang and Feng (2023) Zhang, S., Feng, Y., 2023. End-to-end simultaneous speech translation with differentiable segmentation, in: Annual Meeting of the Association for Computational Linguistics.
- Zhang et al. (2020) Zhang, S., Feng, Y., Li, L., 2020. Future-guided incremental transformer for simultaneous translation. ArXiv abs/2012.12465.
- Zhang et al. (2019) Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y., 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
- Zhang et al. (2024a) Zhang, X., Zhang, Q., Liu, H., Xiao, T., Qian, X., Ahmed, B., Ambikairajah, E., Li, H., Epps, J., 2024a. Mamba in speech: Towards an alternative to self-attention. arXiv:2405.12609.
- Zhang et al. (2024b) Zhang, Z., Chen, S., Zhou, L., Wu, Y., Ren, S., Liu, S., Yao, Z., Gong, X., Dai, L., Li, J., et al., 2024b. Speechlm: Enhanced speech pre-training with unpaired textual data. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- Zhao et al. (2021) Zhao, C., Wang, M., Dong, Q., Ye, R., Li, L., 2021. NeurST: Neural speech translation toolkit, in: Ji, H., Park, J.C., Xia, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online. pp. 55–62. doi:10.18653/v1/2021.acl-demo.7.
- Zhao et al. (2022) Zhao, J., Yang, H., Haffari, G., Shareghi, E., 2022. M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation, in: Proc. Interspeech 2022, pp. 111–115. doi:10.21437/Interspeech.2022-592.
- Povey et al. (2011) Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society.
- Zheng et al. (2021a) Zheng, R., Chen, J., Ma, M., Huang, L., 2021a. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation, in: International Conference on Machine Learning, PMLR. pp. 12736–12746.
- Zheng et al. (2021b) Zheng, R., Chen, J., Ma, M., Huang, L., 2021b. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation, in: International Conference on Machine Learning.
- Zhou et al. (2024) Zhou, G., Lam, T.K., Birch, A., Haddow, B., 2024. Prosody in cascade and direct speech-to-text translation: a case study on korean wh-phrases, in: Findings of EACL.
- Zhou et al. (2022a) Zhou, X., Liu, H., Shi, C., Liu, J., 2022a. Deep Learning on Edge Computing Devices: Design Challenges of Algorithm and Architecture. Elsevier.
- Zhou et al. (2022b) Zhou, X., Wang, J., Cui, Z., Zhang, S., Yan, Z., Zhou, J., Zhou, C., 2022b. Mmspeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition. ArXiv abs/2212.00500.
- Zhou et al. (2023) Zhou, Y., Fang, Q., Feng, Y., 2023. Cmot: Cross-modal mixup via optimal transport for speech translation, in: Annual Meeting of the Association for Computational Linguistics.
- Zhu et al. (2023) Zhu, Q.S., Zhou, L., Zhang, J., Liu, S.J., Hu, Y.C., Dai, L.R., 2023. Robust data2vec: Noise-robust speech representation learning for asr by combining regression and improved contrastive learning, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 1–5.