License: CC BY-NC-SA 4.0
arXiv:2312.01053v2 [cs.CL]

End-to-End Speech-to-Text Translation: A Survey

Nivedita Sethiya, Chandresh Kumar Maurya
Abstract

Speech-to-Text (ST) translation pertains to the task of converting speech signals in one language to text in another language. It finds application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR) and Machine Translation (MT) models play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such integrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the works in this direction. We attempt to provide a comprehensive review of the models, metrics, and datasets used for ST tasks, and we outline the challenges and future research directions with new insights. We believe this review will be helpful to researchers working on various applications of ST models.

keywords:
Speech-to-Text Translation , Automatic Speech Recognition , Machine Translation , Modality Bridging
journal: Computer Speech & Language
\affiliation

[label1]organization=Indian Institute of Technology Indore, country=India

1 Introduction

The Speech-to-Text (ST) translation task aims to convert speech in one language into text in another language. It finds application in various areas such as automatic subtitling, dictation, video lecture translation, tourism, and telephone conversations, to name a few. The ST problem can be cast under many facets. For example, are we performing ST translation online (aka simultaneous translation) or offline? The former is required in live video streaming, while the latter is helpful for movies, where some latency may be allowed. The ST problem is further exacerbated by noisy inputs, low-resource/code-mixed languages, and the presence of multiple speakers.

Figure 1: History of E2E ST models. Models in blue correspond to the streaming models discussed in §7.1.2. Note that only a few selected representative models are listed.

Historically, the ST problem has been solved by pipelining ASR and MT models together, where the ASR model takes speech in a source language as input and generates a transcript, and the MT model translates the transcript into the target language. Such a cascade model suffers from problems like error propagation and higher training and inference latency. Therefore, the current trend in developing ST models is toward E2E systems, defined as follows:

Definition 1

A unified E2E ST model is implemented, facilitating combined training and recognition processes aimed at consistently reducing the anticipated error rate, thereby bypassing the need for independently acquired sources of knowledge.

Therefore, the main goal of the E2E ST model is to achieve a reduced error rate, with secondary objectives potentially including decreased training/inference duration and memory usage.

There has been a lot of work in recent years on building E2E ST models (as shown in fig. 1), datasets, and metrics. However, a systematic and comprehensive review of E2E ST works is missing. We note that a review paper on ST (Xu et al., 2023b) was published recently. That review categorizes existing works mainly based on modeling, data, and application issues. It does not cover the datasets available for ST tasks, nor does it provide any insights into cascade vs. E2E model performance, and the future open problems it provides are limited. Our work, on the other hand, comprehensively reviews the existing models for ST tasks, evaluation methods, metrics, and datasets from a completely different perspective and critically analyzes the existing works; we then identify several challenges and future research directions. Thus, our work may be deemed complementary to (Xu et al., 2023b).

Figure 2: Organization of the survey paper.

The following review is structured following the taxonomy in fig. 2. In §2, we establish the foundation of the ST task through a formal definition, and we subsequently delve into the various metrics and loss functions adopted by different researchers in §3. A comparative discussion between cascade and end-to-end models is presented in §4. Training of E2E ST models suffers from data issues, and how to combat them is elaborated in §5. Speech and text segmentation and representation, important tasks in ST model development, are discussed in §6. In §7, we delve into the strategies employed to tackle the ST problem, categorizing these approaches based on the frameworks utilized and the characteristics of the data involved. Data and toolkits required for ST modeling are discussed in §9. Finally, in §10, we explore the prospects for future research and open problems within the field.

2 Background

This section describes the ST task formally and presents the loss functions and evaluation metrics commonly employed to optimize ST models.

2.1 Task Definition

The ST task can be defined as translating given input speech U in one language into translated text V in another language, with the transcription text X optionally available. Formally, it is defined as follows: given a dataset D = {(\mathbf{u}^{i}, \mathbf{x}^{i}, \mathbf{v}^{i}) | i = 1, 2, \ldots, n} of pairs of input speech features \mathbf{u} = (u_{1}, u_{2}, \ldots, u_{T_{u}}) in one language and output text tokens \mathbf{v} = (v_{1}, v_{2}, \ldots, v_{T_{v}}) in a different language, the objective of the ST task is to maximize the conditional probability given below:

    p(\mathbf{v}|\mathbf{u};\theta)=\prod_{t=1}^{T_{v}}p(v_{t}|v_{<t},\mathbf{u};\theta)    (1)

In the above equation, T_{u}, T_{v}, and θ are the length of the input features, the number of output tokens, and the model parameters, respectively. Note that the problem formulation given in (1) is for Autoregressive (AR) models (Non-autoregressive (NAR) models are an alternative modeling approach proposed in the past few years for the ST task; only a few such works exist in the literature, and we discuss NAR briefly in §7.1). Usually, it is assumed that there are n parallel speech-text pairs in our corpus, and the model is optimized for the negative log-likelihood over these pairs as

    \ell(\theta|D)=-\sum_{i=1}^{n}\log P(\mathbf{v}^{i}|\mathbf{u}^{i};\theta)    (2)

The above optimization is usually solved using an encoder-decoder architecture with attention. Essentially, an encoder maps the speech input to a hidden state representation h, followed by a decoder that consumes the previously generated text tokens v_{<t}, the encoder hidden state h, and the attention vector α (Vaswani et al., 2017). Offline ST translation can look at the whole speech before producing output text tokens, whereas streaming ST can start translating from a partial speech signal.
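
To make the autoregressive objective in (1)-(2) concrete, the following minimal sketch computes the summed negative log-likelihood for one batch with teacher forcing. It is written in PyTorch-style Python and assumes a generic encoder-decoder `model` that returns per-token logits; the function and tensor names are illustrative, not those of any particular toolkit.

    import torch.nn.functional as F

    def st_nll_loss(model, speech_feats, target_tokens, pad_id=0):
        # speech_feats:  (batch, T_u, feat_dim) input speech features u
        # target_tokens: (batch, T_v) target text tokens v, padded with pad_id
        # model(speech, decoder_input) -> logits of shape (batch, T_v - 1, vocab)
        decoder_input = target_tokens[:, :-1]          # v_<t (shifted right)
        labels = target_tokens[:, 1:]                  # v_t to be predicted
        logits = model(speech_feats, decoder_input)    # scores for p(v_t | v_<t, u; theta)
        log_probs = F.log_softmax(logits, dim=-1)
        # Sum of -log p(v_t | v_<t, u; theta) over tokens and batch, as in eq. (2).
        return F.nll_loss(log_probs.transpose(1, 2), labels,
                          ignore_index=pad_id, reduction="sum")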

3 Evaluation Metrics

This section discusses various metrics used to evaluate the E2E ST models. The metrics to evaluate E2E ST models are categorized into two types: quality and latency. The quality of the E2E ST models is the measure of how close the ST translation is to the target sentence. The latency is the time elapsed between the pronunciation of a word and the generation of its textual translation.

3.1 Quality-based metrics

The quality-based metrics measure how close the translation is to the target sentence. Most of the existing literature evaluates these scores on detokenized output, i.e., the string formed by combining the tokens. Standard metrics for evaluating ST task performance are the commonly used MT evaluation metrics such as Bi-lingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Translation Error Rate (TER) (Snover et al., 2006) via sacreBLEU, Metric for Evaluation of Translation with Explicit word Ordering (METEOR) (Banerjee and Lavie, 2005), and CHaRacter-level F-score (CHRF and CHRF++) (Popović, 2015). Recently, BERTScore has shown promising agreement with human evaluations. BERTScore (Zhang et al., 2019) is an automatic evaluation metric that scores the similarity between the translated text and the reference text, taking recall, precision, and F-score into account. A few other evaluation metrics, such as TRANSTAC (Schlenoff et al., 2009), are less frequently reported.
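
As an illustration, these quality metrics can be computed on detokenized output with off-the-shelf libraries. The sketch below assumes a recent version of the sacrebleu package and the bert-score package; the toy hypothesis and reference strings are made up.

    import sacrebleu
    from bert_score import score as bert_score

    hypotheses = ["the cat sat on the mat"]              # detokenized system outputs
    references = [["the cat is sitting on the mat"]]     # list of reference streams, each aligned with the hypotheses

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    chrf = sacrebleu.corpus_chrf(hypotheses, references)
    ter = sacrebleu.corpus_ter(hypotheses, references)
    print(bleu.score, chrf.score, ter.score)

    # BERTScore returns precision, recall, and F1 tensors over the corpus.
    P, R, F1 = bert_score(hypotheses, [r[0] for r in references], lang="en")
    print(F1.mean().item())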

3.2 Latency-based metrics

For streaming ST tasks, researchers report a metric for measuring latency, which is defined as the delay incurred in starting to produce the translation. Let \mathbf{u}, \mathbf{v}, and \hat{\mathbf{v}} denote the input speech sequence, the ground-truth text sequence, and the system-generated hypothesis sequence, respectively. In the streaming ST task, models produce output from partial input. Suppose \mathbf{u}_{1:t} = {(u_{1}, \ldots, u_{t}),\ t < T_{u}} has been read when generating v_{s}; the delay of v_{s} is defined as (Ma et al., 2020a)

    d_{s}=\sum_{k=1}^{t}T_{k}    (3)

where T_{k} is the duration of the speech frame u_{k}. The latency metrics are calculated from the sequence of time delays [d_{1}, \ldots, d_{T_{v}}]; a small computational sketch of the first two metrics is given after the list below.

  • 1. Average Proportion (AP) (Cho and Esipova, 2016a) calculates the mean fraction of the source input that is read during the target prediction generation process.

    AP=\frac{1}{T_{v}\sum_{k=1}^{T_{u}}T_{k}}\sum_{s=1}^{T_{v}}d_{s}    (4)

  • 2. Average Lagging (AL) (Ma et al., 2018) measures how far the generated translation lags behind the speaker, in terms of the amount of source content that has been consumed.

    AL=\frac{1}{\tau(T_{u})}\sum_{s=1}^{\tau(T_{u})}\left(d_{s}-\hat{d}_{s}\right)    (5)

    where \tau(T_{u})=\min\{s\mid d_{s}=\sum_{k=1}^{T_{u}}T_{k}\} and \hat{d}_{s} are the delays of an ideal policy, defined as (Ma et al., 2020a)

    \hat{d}_{s}=(s-1)\sum_{k=1}^{T_{u}}\frac{T_{k}}{T_{v}}    (6)

  • 3. Differentiable Average Lagging (DAL): one issue with AL is that it is not differentiable because of the \min function. To solve this, (Cherry and Foster, 2019) introduce a minimum delay of 1/\gamma after each operation and define DAL as

    DAL=\frac{1}{T_{v}}\sum_{s=1}^{T_{v}}\left(d_{s}^{\prime}-\frac{s-1}{\gamma}\right)    (7)

    where

    d_{s}^{\prime}=\begin{cases}d_{s}, & s=0\\ \max(d_{s},\,d_{s-1}^{\prime}+\gamma), & s>0\end{cases}    (8)

    and \gamma=T_{v}/\sum_{k=1}^{T_{u}}T_{k}.

  • 4. Length-Adaptive Average Lagging (LAAL): one issue with the AL metric for simultaneous translation is that although it can handle the under-generation problem (the under/over-generation problem refers to the length of the generated text compared to the reference translation), it cannot handle over-generation and produces a biased score. To alleviate this issue, (Papi et al., 2022a) propose LAAL, which modifies (6) as

    \hat{d}_{s}=(s-1)\sum_{k=1}^{T_{u}}\frac{T_{k}}{\max\{T_{v},\hat{T}_{v}\}}    (9)

    Essentially, it divides (6) by the maximum of the reference and predicted text lengths. As such, it can handle both over- and under-generation problems.

  • 5. Average Token Delay (ATD): the AL metric does not take into account the length of the partial translation output, i.e., it does not consider the latency caused by longer outputs. To remedy this issue, ATD (Kano et al., 2023), defined below, has been proposed recently.

    ATD=\frac{1}{T_{v}}\sum_{s=1}^{T_{v}}\bigl(T(v_{s})-T(u_{a(s)})\bigr)    (10)

    where

    a(s)=\min(s-f(s),\,d_{s})    (11)
    f(s)=(s-1)-a(s-1)    (12)

    T(\cdot) in (10) represents the ending time of each input or output token, where a token is a sub-segment of speech, or a character or word of text. a(s) represents the index of the input token corresponding to v_{s} in the time-difference calculation, with a(0)=0. f(s) in (12) represents how much longer the duration of the previous translation prefix is than that of the previous input prefix.

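The sketch below computes AP (eq. 4) and AL (eqs. 5-6) from a list of per-token delays d_s and the source frame durations T_k. It is a simplified illustration under the definitions above, not the official SimulEval implementation.

    def average_proportion(delays, frame_durations):
        # AP = (1 / (T_v * sum_k T_k)) * sum_s d_s, eq. (4)
        total_duration = sum(frame_durations)
        return sum(delays) / (len(delays) * total_duration)

    def average_lagging(delays, frame_durations):
        # AL = (1 / tau) * sum_{s <= tau} (d_s - d_hat_s), eqs. (5)-(6)
        total_duration = sum(frame_durations)          # sum_k T_k
        Tv = len(delays)
        # tau(T_u): first target index whose delay covers the full source duration
        tau = next((s for s, d in enumerate(delays, start=1) if d >= total_duration), Tv)
        ideal = lambda s: (s - 1) * total_duration / Tv      # d_hat_s, eq. (6)
        return sum(delays[s - 1] - ideal(s) for s in range(1, tau + 1)) / tau

    # Toy example: three 0.4-second source frames, three target tokens.
    print(average_proportion([0.4, 0.8, 1.2], [0.4, 0.4, 0.4]))   # ~0.67
    print(average_lagging([0.4, 0.8, 1.2], [0.4, 0.4, 0.4]))      # 0.4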

3.3 Loss Functions

Let D = (u, x, v) be a tuple where u, x, and v are the speech, the transcription text, and the translation text, respectively. The following are the various loss functions that are used to optimize the performance of E2E ST models (a small sketch of one of these losses is given after the list):
D=(u,x,v)D=(u,x,v)italic_D = ( italic_u , italic_x , italic_v ) 为一个元组,其中 u,xu,xitalic_u , italic_xvvitalic_v 分别表示语音、转录文本和翻译文本。以下是用于优化端到端语音翻译模型性能的各种损失函数:

  • 1. Distillation Loss (Liu et al., 2019): the student model matches not only the ground truth but also the teacher model's output probabilities, which reduces the variance of the gradients.

    L_{KD}=-\sum_{(x,v)\in D}\sum_{t=1}^{N}\sum_{k=1}^{|V|}S(v_{t}=k|v_{<t},x)\log T(v_{t}=k|v_{<t},x)    (13)

    where S and T denote the output distributions of the student and teacher models, respectively.

  • 2. CTC Loss (Ren et al., 2020) computes the likelihood of the output text sequence given the input speech sequence by summing over all possible alignment paths \phi(x):

    L_{CTC}=-\sum_{(u,x)\in D}\log\sum_{z\in\phi(x)}p(z|u)    (14)

  • 3. Cross-Modal Adaptation Loss (Liu et al., 2020d) is defined as the sum of the mean squared errors (MSE) between the speech and transcription-text representations:

    L_{AD}=\begin{cases}\sum_{(u,x)\in D}MSE(\bar{h}_{u},\bar{h}_{x}), & \text{seq-level}\\ \sum_{(u,x)\in D}MSE(h_{u},h_{x}), & \text{word-level}\end{cases}    (15)

    where h_{u} and h_{x} are the speech and word embeddings, and \bar{h}_{u} and \bar{h}_{x} are the average speech and word embeddings, respectively. The MSE measures the difference between the two embeddings.

  • 4. Cross-Entropy Loss (Ye et al., 2021) is the negative log-likelihood of the data combined over all the subtasks, such as ASR, MT, and ST, as well as external MT data:

    L_{\theta}=-\sum_{(x,v)\in D^{\prime}\cup D_{MT\text{-}ext}}\log p(x|v;\theta)    (16)

    where D^{\prime}=D_{ASR}\cup D_{MT}\cup D_{ST} is the union of all the parallel subsets of data.

  • 5. Contrastive Loss (Ye et al., 2022a) is computed between the speech and the transcription text, pulling paired examples closer and pushing unrelated pairs farther apart:

    L_{CON}=-\sum_{(u,x)\in D}\log\frac{\exp(\cos(\bar{h}_{u},\bar{h}_{x})/\kappa)}{\sum_{x_{j}\neq x}\exp(\cos(\bar{h}_{u},\bar{h}_{x}(x_{j}))/\kappa)}    (17)

    where \cos and \kappa denote the cosine similarity and the temperature hyperparameter, respectively.

  • 6. ST Loss (Ouyang et al., 2023) is defined as the negative log-likelihood of the translation text given the source speech:

    L_{ST}=-\sum_{(u,v)\in D}\log p(v|u)    (18)

  • 7. MT Loss (Ouyang et al., 2023) is defined as the negative log-likelihood of the translation text given the source transcript:

    L_{MT}=-\sum_{(x,v)\in D}\log p(v|x)    (19)

  • 8. ASR Loss (Ouyang et al., 2023) is defined as the negative log-likelihood of the transcription text given the source speech:

    L_{ASR}=-\sum_{(u,x)\in D}\log p(x|u)    (20)

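As one concrete example of the losses above, the following sketch implements the contrastive loss of eq. (17) for a batch of sentence-level speech and text embeddings. It assumes that the negatives x_j are the other transcriptions in the same batch and that the denominator includes the positive pair, as in standard InfoNCE; both are assumptions of the sketch, not details taken from (Ye et al., 2022a).

    import torch
    import torch.nn.functional as F

    def contrastive_loss(speech_emb, text_emb, kappa=0.05):
        # speech_emb, text_emb: (batch, dim) averaged embeddings h_u_bar and h_x_bar,
        # where row i of text_emb is the transcription paired with row i of speech_emb.
        speech_emb = F.normalize(speech_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        sim = speech_emb @ text_emb.t() / kappa        # cosine similarities / temperature
        targets = torch.arange(sim.size(0))            # positives lie on the diagonal
        # Row-wise cross-entropy reproduces -log(exp(pos) / sum_j exp(sim_j)).
        return F.cross_entropy(sim, targets)
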
Figure 3: Generic architecture of (a) cascade and (b) E2E ST models.

4 Cascade vs. End-to-End

Traditional ST translation methods involve a cascade approach: first, ASR is applied to the given speech, and then MT is performed on the transcription produced by ASR (see fig. 3(a)). Such a cascade model is prone to several issues: (a) errors in the ASR model can propagate to the MT model, (b) higher training time, (c) inability to capture non-lexical cues such as prosody, and (d) the resources required for training. To mitigate such issues, various researchers propose using E2E models (see fig. 3(b)) for the ST task (Bérard et al., 2016; Anastasopoulos et al., 2016; Bérard et al., 2018; Gangi et al., 2019; Bentivogli et al., 2021). An E2E model offers joint training from scratch, avoids separately trained knowledge sources, and produces the output in a single pass (Prabhavalkar et al., 2024). Because of simpler training, a lower memory footprint, and lower cost, E2E model development has gained significant momentum in the research community.

Despite E2E models demonstrating superiority over cascade ST models on the aforementioned criteria, they still fall short of the latter in terms of both automatic and human evaluation metrics (Etchegoyhen et al., 2022; Agrawal et al., 2023). In particular, (Lam et al., 2020; Etchegoyhen et al., 2022) show that the cascade model outperforms E2E in a low-resource setting (Basque → Spanish) when in-domain and out-of-domain data are employed for training the ASR and MT components. The gap is more significant when models are trained using unrestricted data. However, as shown by (Bentivogli et al., 2021) on three language directions, the gap between cascade and E2E has closed, though primarily for language pairs with English on one side. The same conclusion is reached by (Tsiamas et al., 2024). Another study (Zhou et al., 2024) shows that E2E models can capture para-linguistic features of speech and outperform cascade models in disambiguating wh-phrases. Such studies call for further comparative work involving more languages and domains to assert that the performance gap is indeed closed.

5 Data Issues

The lack of adequate parallel speech-text corpora, essential in large quantities for training direct ST models, significantly impedes the performance of such models. The necessity for supervised ST data poses challenges in applying E2E ST systems to low-resource languages, where creating labeled parallel speech-text corpora demands substantial investments of time, money, and expertise. To address data scarcity, various techniques such as data augmentation, pre-training, back-translation, knowledge distillation, etc., are employed. These methods are elaborated as follows.

5.1 Augmentation

Data augmentation is a technique in machine learning to synthetically create more data points by applying the class-preserving transformations (Cui et al., 2015). The objective is to increase the variability in the data so that the generalization and robustness of the model may be enhanced. Data augmentation can be applied to both speech and text.

Figure 4: Strategies for addressing data paucity in ST task modelling: (a) data augmentation, (b) self-training, (c) back-translation, and (d) knowledge distillation. The dashed arrow indicates that the model is used for inference.

5.1.1 Augmenting speech data

Speech data can be augmented in various ways, for example by adding noise, speed and pitch perturbation, or time and frequency masking, to name a few. The SpecAugment (Park et al., 2019) policy consists of warping the features and masking blocks of frequency channels and time steps. It has been used successfully for both ASR (Vincent et al., 2017) and ST tasks (Bahar et al., 2019b). MixSpeech (Meng et al., 2021), as shown in Fig. 4(a), takes the weighted combination of two different speech features as input and combines the two recognition losses with the same weights. A generalization of MixSpeech called MixRep (Xie and Hansen, 2023) applies the mixup idea to the acoustic features and hidden-layer inputs; combining MixRep with a regularization term along the time axis further improves ASR performance. Both MixSpeech and MixRep have been shown to perform well for low-resource ASR, and their effectiveness is still to be tested for ST tasks. M3ST (Cheng et al., 2022) applies two levels of fine-tuning (FT) using mixup data: word-, sentence-, and frame-level mixed data in the first FT stage, and source speech and transcription mixup in the second FT stage. M3ST achieves SOTA on MuST-C compared to the baselines.
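
The following is a rough sketch of MixSpeech-style input mixing under stated assumptions: speech features are (frames x mel-bins) matrices, the shorter one is zero-padded to match lengths (a simplification of this sketch, not necessarily how Meng et al. handle length mismatch), and the returned weight is reused to mix the two recognition losses.

    import numpy as np

    def mixspeech(feat_a, feat_b, lam=None):
        # feat_a, feat_b: (frames, mel_bins) arrays of speech features.
        lam = np.random.beta(0.5, 0.5) if lam is None else lam
        T = max(len(feat_a), len(feat_b))
        pad = lambda f: np.pad(f, ((0, T - len(f)), (0, 0)))   # zero-pad to the same length
        mixed = lam * pad(feat_a) + (1.0 - lam) * pad(feat_b)
        # Training then uses lam * loss(transcript_a) + (1 - lam) * loss(transcript_b).
        return mixed, lam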

5.1.2 Augmenting speech and text data

It is also possible to augment speech and text simultaneously and create new paired data. For example, sample, translate, and recombine (Lam et al., 2022b) first samples a suffix replacement from a suffix memory corresponding to a pivot token in the transcription. It then translates the combined new utterance (prefix + pivot + replacement suffix) to generate a new target sentence. The corresponding audio is obtained by concatenating the audio frames of the prefix, pivot, and replacement suffix. An interesting property of the proposed method is that it generates real-looking sentences rather than pseudo-sentences. Concatenation of original ST data has also been used to augment the entire training set (Lam et al., 2022a). In particular, (Lam et al., 2022a) propose CatSpeaker, which uses single-speaker information, and CatRandom, which randomly combines audio-text pairs spoken by different speakers.
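
A toy sketch of concatenation-based augmentation in the spirit of CatRandom is given below; the representation of an example as a (list-of-audio-frames, target-text) pair and the helper name are illustrative assumptions, not details from (Lam et al., 2022a).

    import random

    def cat_random(dataset, n_new):
        # dataset: list of (audio_frames, target_text) pairs, audio_frames being a list.
        augmented = []
        for _ in range(n_new):
            (audio_a, text_a), (audio_b, text_b) = random.sample(dataset, 2)
            # Concatenate the audio frames and the target sentences of two random examples.
            augmented.append((audio_a + audio_b, text_a + " " + text_b))
        return augmented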

5.2 Pre-training

Pre-training is an approach to handling data scarcity in low-resource problems and is deemed a form of transfer learning (Bozinovski and Fulgosi, 1976). Data used for pre-training may consist of speech, text, or both. Once models are pre-trained on such auxiliary data, their robustness on downstream tasks is enhanced. We find that SOTA ST models often pre-train on large amounts of ASR/MT corpora. In ST, pre-training has been used by many researchers (Paulik and Waibel, 2013; Bansal et al., 2017; Anastasopoulos and Chiang, 2018; Wang et al., 2020d; Dong et al., 2021; Zhang et al., 2022a; Tang et al., 2022). It has been applied in two flavors: independently and jointly.

In independent pre-training, individual modules (encoder, decoder, semantic decoder, etc.) are pre-trained using auxiliary data such as ASR and MT data. Such an approach has been followed by (Wang et al., 2020d; Chen et al., 2020; Zheng et al., 2021a). In particular, (Wang et al., 2020d) pre-train the encoder using ASR data to learn semantic concepts. (Chen et al., 2020) propose a self-supervised method called Masked Acoustic Modeling (MAM), which randomly masks part of the speech spectrogram and then recovers it on top of the encoder. (Zheng et al., 2021a) instead unify speech and text representations through masked language modeling. Besides pre-training the encoder and the decoder, various researchers also exploit pre-trained feature extractors such as wav2vec (Schneider et al., 2019), used by (Zhang et al., 2023b) and (Liu et al., 2020b), and HuBERT (Hsu et al., 2021), used by (Zhang et al., 2023a). Very recently, (Tsiamas et al., 2024) proposed an ST model that pre-trains the speech encoder using optimal transport and CTC; they claim to surpass supervised ST models while requiring no paired speech-text data in a zero-shot setting.
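
To illustrate the masking step used in MAM-style pre-training, the simplified sketch below zeroes out random time spans of a spectrogram and returns the mask over which a reconstruction loss (e.g., MSE) would be applied; the span length and ratio are arbitrary choices, not values taken from (Chen et al., 2020).

    import numpy as np

    def mask_spectrogram(spec, mask_ratio=0.15, span=10):
        # spec: (frames, mel_bins) array; returns a masked copy and a boolean frame mask.
        masked = spec.copy()
        mask = np.zeros(len(spec), dtype=bool)
        n_spans = max(1, int(len(spec) * mask_ratio / span))
        for _ in range(n_spans):
            start = np.random.randint(0, max(1, len(spec) - span))
            masked[start:start + span] = 0.0
            mask[start:start + span] = True
        # The encoder is trained to reconstruct spec[mask] from the masked input.
        return masked, mask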

In joint pre-training, the entire model is first pre-trained in an E2E fashion, followed by fine-tuning on the ST corpus (Fang and Feng, 2023; Bapna et al., 2021). It is often accompanied by multitask pre-training with ASR, MT, and masked language modeling tasks (Chung et al., 2021), using supervised as well as unsupervised speech and text data. (Tang et al., 2022) pre-train on speech/text-to-text/speech, text-to-text, speech self-supervised learning (SSL), and speech-to-phoneme tasks. SpeechT5 (Ao et al., 2021) pre-trains on ASR, ST, text-to-speech, speech conversion, and speech enhancement tasks. Wav2Seq (Wu et al., 2022) pre-trains jointly using pseudo-languages. Multi-modal multi-task pre-training leverages five tasks: self-supervised speech-to-pseudo-codes (S2C), phoneme-to-text (P2T), self-supervised masked speech prediction (MSP), supervised phoneme prediction (PP), and the ST task (Zhou et al., 2022b).

5.3 Self-training and Back-translation

Self-training and back-translation (BT) are both approaches employed to harness monolingual data for training models that require supervised data but face a shortage of sufficient supervised parallel corpora, as illustrated in Fig. 4(b) and (c). Self-training makes use of source-side monolingual data, while back-translation is applied to target-side monolingual data. In the end, both methods can be employed synergistically to generate augmented data.

More specifically, we are given a speech-text parallel corpus D_{p} = {(\mathbf{u}^{i}, \mathbf{v}^{i}) | i = 1, 2, \ldots, n}, a monolingual source speech corpus D_{s} = {\mathbf{u}^{i}_{s} | i = 1, 2, \ldots, m}, and a monolingual target text corpus D_{t} = {\mathbf{v}^{i}_{t} | i = 1, 2, \ldots, p}, where m, p \gg n. In self-training, a translation model f_{u \rightarrow v} is first trained on D_{p}. It is then used to generate "pseudo labels" \mathbf{v}^{i}_{s} for D_{s}, leading to auxiliary data A_{s} = {(\mathbf{u}^{i}_{s}, \mathbf{v}^{i}_{s}) | i = 1, 2, \ldots, m}. The combined data D_{p} \cup A_{s} is then used to re-train the model f_{u \rightarrow v}. In back-translation, on the other hand, D_{t} is translated using a backward translation model f_{v \rightarrow u}, creating auxiliary data A_{t} = {(\mathbf{u}^{i}_{t}, \mathbf{v}^{i}_{t}) | i = 1, 2, \ldots, p} for training the forward translation model f_{u \rightarrow v} on the combined data D_{p} \cup A_{t}.
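
The procedure above can be summarized by the high-level sketch below, in which `train`, `translate` (the forward model f_{u→v}) and `back_translate` (the backward model f_{v→u}) are left abstract; they are placeholders, not functions of any specific toolkit.

    def augment_with_self_training_and_bt(D_p, D_s, D_t, train, translate, back_translate):
        # D_p: list of (u, v) speech-text pairs; D_s: source speech; D_t: target text.
        f_uv = train(D_p)                                      # forward model f_{u->v}
        A_s = [(u, translate(f_uv, u)) for u in D_s]           # self-training pseudo labels

        f_vu = train([(v, u) for (u, v) in D_p])               # backward model f_{v->u}
        A_t = [(back_translate(f_vu, v), v) for v in D_t]      # back-translated sources

        # Re-train the forward model on the real plus auxiliary data.
        return train(D_p + A_s + A_t)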

Back-translation on discrete units to train a unit-to-text translation model is applied in (Zhang et al., 2023a), which is on par with methods leveraging large-scale external corpora. (Fang and Feng, 2023) propose a back-translation strategy with target-to-unit and unit-to-speech synthesis for low-resource language translation without transcripts. (Wang et al., 2021b) extract speech features using wav2vec 2.0 pre-training, a single iteration of self-training, and decoding with a language model. Cyclic feedback from the MT output is used as a self-training mechanism for a cascade of ASR-MT models in (Lam et al., 2020), which shows how to exploit direct speech-translation data.

5.4 Knowledge distillation

Knowledge Distillation (KD) transfers learned knowledge from a large ensemble model (called the teacher) to a smaller single model (called the student), as shown in Fig. 4(d) (Hinton et al., 2015). This process encompasses both model compression (Bucilǎ et al., 2006) and transfer learning. More details of recent works utilizing KD approaches for ST tasks are given in §7 (ST with MT) and §6.2.3.

6 Segmentation and Representation Learning

E2E ST models rely on segmented inputs because handling long inputs is a challenging task (Kim et al., 2017; Tsiamas et al., 2022c). Segmentation is the problem of splitting the long speech/text sequence into smaller and more manageable segments whose representations can be learned. This section will shed some light on the segmentation and representation issues and offer some advice on how to tackle them.

6.1 Segmentation Learning

As discussed above, segmentation is an important issue while building ST models. Segmentation of text is easy: text can be split based on strong punctuation, which is what current MT models rely on. Similarly, ASR models give lower importance to segmentation because of the small local context window required for the task. A cascaded ST model can perform segmentation by applying ASR, followed by monolingual translation to restore the lost punctuation, and then segmenting on it (Matusov et al., 2007, 2018). E2E ST models, on the other hand, require sophisticated segmentation of the speech, primarily because of the out-of-order word relations that exist between the input and output and the absence of linguistic features.

Traditionally, segmentation of speech is done manually. Because this is cumbersome, learning to segment is warranted. Segmentation is performed based on either length, which splits the speech at fixed lengths, or pause, which splits the speech using Voice Activity Detection (VAD) (Sohn et al., 1999). A third approach is a hybrid mode that takes both length and linguistic content into account (Potapczyk and Przybysz, 2020; Gaido et al., 2021; Tsiamas et al., 2022c). The hybrid approach surpasses the length- and pause-based approaches in terms of performance (Gaido et al., 2021). Concretely, (Tsiamas et al., 2022c) learns manual segmentation with a binary classifier, and a probabilistic divide-and-conquer algorithm (Gaido et al., 2021) is used at inference time to decide the split points. However, a gap remains between the hybrid and manual approaches to segmentation, which future work may address.
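To illustrate the hybrid idea (pause-based splitting with a length cap) in its simplest form, the sketch below segments a frame-level voice-activity mask; the thresholds and the plain-Python interface are illustrative assumptions and do not reproduce the classifier or divide-and-conquer algorithm of the cited works.

```python
# Minimal sketch of hybrid (pause + length) segmentation over a frame-level VAD mask.
# vad_mask: list of booleans, True where speech is detected, one value per 10 ms frame.

def hybrid_segments(vad_mask, frame_shift=0.01, max_len=20.0, min_pause=0.3):
    segments, start, last_speech = [], None, None
    for i, is_speech in enumerate(vad_mask):
        t = i * frame_shift
        if is_speech:
            start = t if start is None else start      # open a segment at the first speech frame
            last_speech = t
        long_pause = (not is_speech and last_speech is not None
                      and (t - last_speech) >= min_pause)
        too_long = start is not None and (t - start) >= max_len
        if start is not None and (long_pause or too_long):
            segments.append((start, last_speech))      # close the segment at the last speech frame
            start, last_speech = None, None
    if start is not None:
        segments.append((start, last_speech))
    return segments
```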

Our discussion above focuses on segmentation in the offline E2E models. Segmentation of speech in streaming E2E models is presented in §7.1.2.

6.2 Representation Learning

Representation learning is a type of machine learning in which algorithms automatically discover and extract useful features from raw data. It has been successfully applied in computer vision (Wu, 2020), natural language processing (Liu et al., 2021b), and speech (Mohamed et al., 2022). Representation learning is an important issue in ST because speech and text are two distinct data modalities that reside in different embedding spaces. Hence, we need not only better representation learning methods for speech and text individually but also methods for learning their joint representation. Many works in ST apply speech/text representation learning before applying encoder-decoder or transducer-based methods (explained later in §7) to the ST task. Below, we provide details of the representation learning methods used for ST tasks.

6.2.1 Text Representation

ST models often use ASR transcripts and MT translations as auxiliary data that need to be fed to the encoder and decoder, respectively. To learn representations for such text data, existing works rely on word embeddings (Zhang et al., 2023c; Bérard et al., 2016), LSTMs (Kim et al., 2017; Weiss et al., 2017; Bérard et al., 2018; Jia et al., 2019), and Transformers (Wang et al., 2021b; Liu et al., 2021a; Zeng et al., 2021). Text data is typically tokenized and fed either as words or as characters (Bérard et al., 2018), and the output of the decoder can be graphemes, characters, or words.

6.2.2 Speech Representation

ST models take speech as input and utilize various speech-based feature representation methods to convert it into a vector representation. Traditional speech feature extraction methods such as Perceptual Linear Prediction (PLP), Fbank, and Mel-Frequency Cepstral Coefficients (MFCC) (Rabiner and Schafer, 2010) have been used after normalization to extract speech features in many works (Duong et al., 2016; Bérard et al., 2016; Kim et al., 2017; Bérard et al., 2018; Anastasopoulos and Chiang, 2018; Bansal et al., 2019; Jia et al., 2019; Inaguma et al., 2019; Liu et al., 2020d; Dong et al., 2021; Le et al., 2023b; Parcollet et al., 2024), sometimes combined with pitch features and the speech augmentation methods described in §5. These hand-crafted features are sometimes replaced by distributed feature representations such as speech word2vec (Chung and Glass, 2018), owing to their dense, continuous representation capability.
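For reference, the following is a minimal sketch of extracting the classic normalized filterbank features mentioned above with torchaudio; the file name and the 80-bin configuration are illustrative assumptions.

```python
# Minimal sketch: 80-dimensional log-Mel filterbank features with per-utterance normalization.
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")      # hypothetical 16 kHz mono file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate   # (frames, 80) feature matrix
)
# utterance-level mean-variance normalization, as commonly applied before the encoder
fbank = (fbank - fbank.mean(dim=0)) / (fbank.std(dim=0) + 1e-8)
```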

It is difficult to obtain large amounts of labeled speech data for learning supervised speech feature representations. Therefore, more recent works exploit speech features learned in unsupervised and self-supervised ways, mapping the continuous speech signal to discrete units, akin to words and sub-words in the text domain. Such a representation makes it possible to borrow tools developed in NLP for the speech domain. Among them, the most popular is Wav2Vec (Schneider et al., 2019) and its variants such as w2v-BERT (Chung et al., 2021) and Wav2Vec 2.0 (Baevski et al., 2020), used in (Tran et al., 2020; Le et al., 2020; Li et al., 2020; Han et al., 2021; Popuri et al., 2022; Zhang et al., 2023c). Interestingly, Wav2Vec and its variants can be used as an encoder in a Seq2Seq framework alone or combined with adapters and CNNs for length shrinking (length shrinking is an important issue in the ST task since speech is a much longer sequence than text; existing works employ techniques such as length adapters, CNNs, and CTC for this purpose). A few works, such as CSTNet (Khurana et al., 2020; Wang et al., 2020d), use CNNs for feature extraction and length shrinking.
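As an illustration of using a self-supervised encoder as the feature extractor, the sketch below pulls frame-level representations from a pre-trained Wav2Vec 2.0 model via Hugging Face transformers; the checkpoint name and file path are assumptions.

```python
# Minimal sketch: frame-level speech representations from a pre-trained Wav2Vec 2.0 encoder.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-base-960h"                     # assumed checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
encoder = Wav2Vec2Model.from_pretrained(checkpoint)

waveform, sr = torchaudio.load("utterance.wav")                # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state             # (1, frames, hidden_dim)
```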

More recent works in ST employ HuBERT (Hsu et al., 2021) for speech representation, among its other benefits (Zhang et al., 2023a). HuBERT offers more stable training and better targets than Wav2Vec 2.0 since it uses hidden-layer representations during the clustering process. For encoding long speech signals, Conformers (Gulati et al., 2020) can be used, as they provide local context through a convolution block and global context through an attention mechanism. SeamlessM4T (Barrault et al., 2023) uses a Conformer for speech encoding.

Other speech representation techniques, such as VQ-VAE (van den Oord et al., 2017), WavLM (Chen et al., 2022), data2vec (Baevski et al., 2022), Robust data2vec (Zhu et al., 2023), and SpeechLM (Zhang et al., 2024b), may also be explored when encoding speech for ST tasks.

Figure 5: Modality Bridging

6.2.3 Joint Speech-Text Representation

The speech and text in an ST task are semantically related because both refer to the same content. Therefore, it is imperative to learn a joint speech-text representation in the hope of bridging the modality gap between them; a method for learning such a combined representation of text and speech is called modality bridging (see Fig. 5). A good ST model should learn a representation in which the embeddings of both modalities for matching speech-text pairs lie close to each other. It is believed that low performance on ST tasks is partly due to models not learning aligned representations of speech and text. Different authors have therefore devised different ways to fill this gap, which fall into five major approaches: (a) adapters, (b) contrastive learning, (c) knowledge distillation, (d) optimal transport, and (e) mix-up strategies. Below we discuss the works utilizing these approaches and show their pros and cons.

1. Adapters are small modules integrated with pre-trained networks for specific tasks (Houlsby et al., 2019). They perform on par with fine-tuning-based approaches while requiring only a fraction of the trainable parameters. For example, in (Gállego et al., 2021; Zhao et al., 2022; Sarkar et al., 2023), the modality gap is filled using adapter layers, namely multi-headed self-attention with a pooling operation. These works use Wav2Vec 2.0 (Baevski et al., 2020) for speech feature extraction, wherein the self-attention layers of the transformer are equipped with pooling for dimensionality reduction to match the text representation.
2. Contrastive learning approximates the “semantic” distance in the input space using a simple distance in the target space after mapping input patterns onto the target space (Chopra et al., 2005). It tries to bring positive instances closer while pushing negative ones apart, and has been used extensively in both supervised and unsupervised settings for learning representations. For example, (Zhang et al., 2023c) performs explicit knowledge transfer through contrastive learning; it learns frame- and sentence-level speech feature representations and uses whitening (Su et al., 2021) to alleviate MT representation degeneration. (Liu et al., 2019) decouples the encoder representation into three parts: an acoustic encoder, shrinking of the acoustic encoder output (done via CTC), and a semantic encoder for bridging the modality gap. Using a contrastive learning architecture, Chimera (Han et al., 2021) trains a shared semantic memory module to overcome the modality distance. XSTNet (Ye et al., 2021) augmented with a contrastive loss (Ye et al., 2022a) investigates three different methods, namely span-masked representation, word repetition, and cut-off, and claims that the contrastive loss is better than CTC and $L_2$ loss. Word-aligned contrastive learning (WACO) (Ouyang et al., 2023) bridges the modality gap by treating the average speech and word embeddings of the same word as a positive pair and those of different words as negative pairs. CSTNet is a self-supervised learning framework based on contrastive learning, using a mix of triplet losses (Khurana et al., 2020). On top of the CTC loss, a boundary-based speech length shrinking mechanism is applied in (Zeng et al., 2022). The authors claim that if boundary-based shrinking is combined with other modality-bridging techniques, such as a contrastive loss, it can further improve model performance; the approach also achieves lower inference time and memory footprint. (Yin et al., 2023) proposes a novel integration of speech and text, referred to as a third modality. This fusion is achieved through Cross-modal Contrastive Learning (Sohn, 2016) and Cross-Attentive Regularization (Tang et al., 2021a). Additionally, the method incorporates techniques such as Knowledge Distillation and the Jensen-Shannon Divergence (Lin, 1991; Liu et al., 2019; Gaido et al., 2020a) to bridge the modality gap, addressing challenges related to input representation, semantics, and hidden states.

Models/Techniques | Problem Solved | Dataset | Language Pair | Speech (hours) | Metric (BLEU)
M-Adapter + W2V2 + mBart (Baevski et al., 2020) | training gap between pre-training & fine-tuning the modality | MuST-C | En→De | 408 | 25.9
 | | | En→Ro | 432 | 24.62
 | | | En→Fr | 492 | 37.34
Chimera (Han et al., 2021) | projecting audio & text to a common semantic representation | MuST-C | En→De | 408 | 27.1
 | | | En→Fr | 492 | 35.6
 | | | En→Ru | 489 | 17.4
 | | | En→Es | 504 | 30.6
 | | | En→It | 465 | 25.0
 | | | En→Ro | 432 | 24.0
 | | | En→Pt | 385 | 30.2
 | | | En→Nl | 442 | 29.2
ConST (XSTNet + Contrastive Loss) (Ye et al., 2021) | closes modality gap | MuST-C | En→De | 408 | 28.3
 | | | En→Es | 504 | 32.0
 | | | En→Fr | 492 | 38.3
 | | | En→It | 465 | 27.2
 | | | En→Nl | 442 | 31.7
 | | | En→Pt | 385 | 33.1
 | | | En→Ro | 432 | 25.6
 | | | En→Ru | 489 | 18.9
W2V2 + mBart + Adapter (Gállego et al., 2021; Zhao et al., 2022) | slow convergence speed | MuST-C | En→De | 408 | 28.22
WACO (Ouyang et al., 2023) | limited parallel data (1 hour) | MuST-C | En→De | 1 | 17.5
AdaTrans (Zeng et al., 2022) | closing gap between length of speech & text | MuST-C | En→De | 408 | 28.7
 | | | En→Fr | 492 | 38.7
 | | | En→Ru | 489 | 19.0
STEMM (Fang et al., 2022) | speech representation | MuST-C | En→De | 408 | 28.7
 | | | En→Fr | 492 | 37.4
 | | | En→Ru | 489 | 17.8
 | | | En→Es | 504 | 31.0
 | | | En→It | 465 | 25.8
 | | | En→Ro | 432 | 24.5
 | | | En→Pt | 385 | 31.7
 | | | En→Nl | 442 | 30.5
CTC loss + Optimal Transport (Siamese-PT) (Le et al., 2023b) | without change in architecture | MuST-C | En→De | 408 | 27.9
 | | | En→Es | 504 | 31.8
 | | | En→Fr | 492 | 39.2
 | | | En→It | 465 | 27.7
 | | | En→Nl | 442 | 31.7
 | | | En→Pt | 385 | 34.2
 | | | En→Ro | 432 | 27.0
 | | | En→Ru | 489 | 18.5
Fine & Coarse Granularity Contrastive Learning (Zhang et al., 2023c) | limited knowledge transfer ability | MuST-C | En→De | 408 | 29.0
 | | | En→Fr | 492 | 38.3
 | | | En→Ru | 489 | 19.7
 | | | En→Es | 504 | 31.9
 | | | En→It | 465 | 27.3
 | | | En→Ro | 432 | 26.8
 | | | En→Pt | 385 | 32.7
 | | | En→Nl | 442 | 31.6

Table 1: Performance of the ST models using modality bridging. The datasets, language pairs, duration of speech, and metric (BLEU) are shown.
3. Knowledge-distillation (Hinton et al., 2015) is a mechanism to distill information from a trained and large “teacher” model to a smaller and efficient “student” model. It has been used with an $L_2$ loss in (Huzaifah and Kukanov, 2023) to address the modality gap issue.

4. Optimal transport (OT) (Peyré et al., 2019) is a mechanism for comparing two probability distributions. In the ST task, the speech and text representations may be viewed as two probability distributions, and OT can therefore be applied. More formally, let $\alpha$ and $\beta$ denote the discrete probability distributions corresponding to the speech and text representations. The masses at positions $u_i$ and $v_j$ are $a_i$ and $b_j$ respectively, such that $\sum_{i=1}^{m} a_i = 1$ and $\sum_{j=1}^{n} b_j = 1$. Suppose further that the cost of transporting a unit of mass from $u_i$ to $v_j$ is $c(u_i, v_j)$, where $c$ is some cost function such as the Euclidean distance. Let $Z_{ij} \geq 0$ be the quantity of mass transported from $u_i$ to $v_j$; the goal of OT is then to move all mass from $\alpha$ to $\beta$ so that the following objective is minimized:

\min_{Z}\ \langle C, Z\rangle \quad \text{s.t.} \quad Z\mathbf{1}_{n}=a,\; Z^{T}\mathbf{1}_{m}=b,\; Z\geq 0    (21)

In the above equation, $C$ and $Z$ are the matrices whose elements are $c_{ij}=c(u_i,v_j)$ and $Z_{ij}$, respectively, and $\mathbf{1}$ denotes the vector of ones. In the ST task, $c(u_i,v_j)=\|u_i-v_j\|_p$ for some $p\geq 1$. The loss corresponding to (21) is called the Wasserstein loss, which is costly to optimize. Hence, an entropy-regularized upper-bound approximation is often optimized instead:

\min_{Z}\ \{\langle C, Z\rangle - \lambda H(Z)\}    (22)

where $\lambda$ is a regularization parameter and $H(\cdot)$ is the von Neumann entropy.

Recent works make use of OT as presented above (see the sketch after this list for an illustrative Sinkhorn implementation). For example, (Le et al., 2023b) uses optimal transport and CTC together to close the modality gap during the pre-training phase. They show significant gains in BLEU when the ST model is fine-tuned without any external data, compared to multitask learning. Similarly, (Tsiamas et al., 2024, 2023) use OT+CTC to align the speech-encoder representation space with the MT embedding space, whereas (Zhou et al., 2023) aligns the two representations via OT followed by cross-modal mix-up at the token level.

5. Mix-up strategy: the Speech-Text Manifold Mixup (STEMM) strategy (Fang et al., 2022) mixes speech and text embeddings in the encoder-decoder of the translation model to bridge the modality gap under a self-supervised learning framework. PromptST (Yu et al., 2023) presents a linguistic probing strategy, referred to as Speech-Senteval, inspired by (Conneau et al., 2018). It is applied to the higher layers of the encoder in pre-trained ST models, specifically targeting the linguistic properties that these models often struggle to learn at the higher layers.

Table 1 presents the performance scores of ST models based on modality-bridging techniques. We can observe that the mix-up strategy achieves the highest BLEU score on the En-De pair, whereas the boundary-based speech length shrinking mechanism matches this score when combined with other modality-bridging techniques.
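To make the optimal-transport formulation of Eqs. (21)-(22) concrete, the following is a minimal Sinkhorn-style sketch computing an entropy-regularized transport plan between a speech representation and a text representation; the uniform masses, regularization weight, and iteration count are illustrative assumptions.

```python
# Minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between
# speech states u (m x d) and text states v (n x d), assuming uniform masses a and b.
import torch

def sinkhorn_ot_loss(u, v, lam=0.1, n_iters=50, p=2):
    m, n = u.size(0), v.size(0)
    a = torch.full((m,), 1.0 / m)              # masses a_i, summing to 1
    b = torch.full((n,), 1.0 / n)              # masses b_j, summing to 1
    C = torch.cdist(u, v, p=p)                 # cost matrix c(u_i, v_j) = ||u_i - v_j||_p
    K = torch.exp(-C / lam)                    # Gibbs kernel of the regularized problem
    f = torch.ones(m)
    for _ in range(n_iters):                   # alternate projections onto the two marginals
        g = b / (K.t() @ f)
        f = a / (K @ g)
    Z = f.unsqueeze(1) * K * g.unsqueeze(0)    # transport plan satisfying Z1 = a, Z^T 1 = b
    return (Z * C).sum()                       # <C, Z>, used as the alignment loss
```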

Discussion: The study finds that adapters can shrink the speech length as well as the modality distance between the text and speech representations while requiring a small number of trainable parameters. The contrastive loss is found to be better than CTC and $L_2$ loss for modality bridging. Boundary-based speech length shrinking combined with a contrastive loss may improve ST task performance. Finally, it is possible to build ST models requiring zero parallel ST data (Tsiamas et al., 2024).

7 End-to-End ST Models

End-to-end models for ST, as discussed previously, are gaining traction compared to cascade models. This section presents an overview of E2E models. We categorize them under two major themes: framework-based and data-based. The first category is further divided according to whether the framework is offline or streaming; the second is based on the nature of the data. The sub-categorization in the data-based section depends on which component boosts ST task performance, as claimed in the papers; the demarcation is therefore not strict, and there may be overlaps between subcategories. In addition, our emphasis in this review is on highlighting the core contributions and limitations claimed by the authors. That is, we look for answers to the question: what is the main technical contribution of the authors towards solving the ST problem? Wherever possible, we have therefore limited the mathematical description and believe such details can be found in the cited papers. We attempt to provide a succinct and clear picture of what works and what does not while addressing the ST problem.

Figure 6: E2E offline framework. The dashed arrow denotes optional components.

7.1 E2E ST Models based on Frameworks

As mentioned in the previous section, framework-based E2E ST models are further divided according to whether the framework is offline or streaming. Below, we discuss both of these categories in detail.

7.1.1 Offline Frameworks

Offline frameworks perform ST tasks where output tokens are produced only after the entire speech utterance has been seen. These frameworks rely heavily on the Seq2Seq architecture shown in Fig. 6: an encoder for the speech input, a decoder for the text output, and an optional shared/semantic decoder connecting the two. The model is usually optimized for the ST loss alone or in a multitask learning framework where ASR/MT/CTC (Graves et al., 2006) losses are combined with the ST loss. At other times, transfer learning is used to leverage pre-trained models for ST tasks. Another approach that has been gaining attention is Non-Autoregressive (NAR) modeling for the E2E ST task, which gives faster inference. The following sections delve deeper into these approaches.
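To illustrate how these objectives are typically combined, the sketch below assembles a weighted multitask loss; the model methods, batch fields, and weights are hypothetical placeholders rather than any specific toolkit's API.

```python
# Minimal sketch of a multitask objective: ST loss plus auxiliary ASR, MT, and CTC terms.
# `model` and `batch` are hypothetical objects exposing the needed sub-losses and fields.

def multitask_loss(model, batch, w_asr=0.3, w_mt=0.3, w_ctc=0.3):
    loss_st = model.st_loss(batch.speech, batch.translation)      # primary: speech -> target text
    loss_asr = model.asr_loss(batch.speech, batch.transcript)     # auxiliary: speech -> source text
    loss_mt = model.mt_loss(batch.transcript, batch.translation)  # auxiliary: source -> target text
    loss_ctc = model.ctc_loss(batch.speech, batch.transcript)     # alignment-free regularizer
    return loss_st + w_asr * loss_asr + w_mt * loss_mt + w_ctc * loss_ctc
```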

The Seq2Seq-based ST models proposed in the literature use either specialized encoders such as transformers or attention mechanisms, which we discuss next.

1. Attention mechanism is used to concentrate on specific sections of the input data instead of the entire input (Larochelle and Hinton, 2010; Mnih et al., 2014; Vaswani et al., 2017). It has been a successful strategy for obtaining state-of-the-art (SOTA) results in NLP, computer vision, and other areas. Various types of attention exist in the literature, such as soft, hard, local, monotonic, multihead, self-, and cross-attention, inter alia; for more details, interested readers are encouraged to skim (Mnih et al., 2014; Vaswani et al., 2017; Brauwers and Frasincar, 2022). Below we describe efforts to handle ST tasks using the attention mechanism within the Seq2Seq framework.

Convolutional attention to “remember” and avoid translating the signal twice is used within Seq2Seq by (Bérard et al., 2016), which outperforms a hierarchical encoder with better results on synthetic data without using transcripts. The same authors in (Bérard et al., 2018) use the source transcript and achieve results close to cascade models on LibriSpeech data. In (Duong et al., 2016), the authors propose phone-to-text alignment with a structural-bias feature in the attention model. The measurement of alignment has been explored in (Anastasopoulos et al., 2016), which uses IBM's translation model as well as dynamic time warping (DTW, an algorithm for measuring the similarity between two temporal sequences that may vary in speed). Seq2Seq with attention trained using multitask learning achieves promising results in (Weiss et al., 2017). These models, however, struggle with noisy inputs and long acoustic signals (Kim et al., 2017). The latter work uses a joint CTC-attention model (Graves et al., 2006) trained through multitask learning with additional regularizers; it uses two decoders, where the second decoder seeks a higher-level representation (HLR) from the first decoder, in addition to the encoder, via the attention mechanism. The Attention-Passing Model (APM) (Sperber et al., 2019), which passes only high-attention vectors from the audio encoder to the translation text for decoding, demands a smaller amount of training data.

2. Transformer is the architecture based on multi-headed self-attention (Vaswani et al., 2017), which produces a contextualized representation of the input. Because of parallelization and contextual representation, transformers have outperformed RNNs on several NLP tasks, which motivates applying them to the ST task as well. A transformer-based Seq2Seq model with attention is proposed in (Cattoni et al., 2021). Since the architecture has quadratic memory complexity, it employs: (a) a CNN to downsample the input, and (b) 2-D attention to address short-range dependencies of spectrograms. In (Alastruey et al., 2022), the computation of some attention weights is avoided for speech tasks, hence decreasing the size of the attention matrix: the transformer encodes the speech features with local self-attention and a suitable window size for each layer to reduce the computational complexity. Other transformer variants that reduce the quadratic complexity, such as perceivers (Jaegle et al., 2021), have been used as encoders (Tsiamas et al., 2022b). Besides quadratic complexity, transformers require lossy downsampling of speech features, thus potentially throwing away useful linguistic information. To tackle such issues, Speechformer has been proposed (Papi et al., 2021a), which aggregates information at higher layers based on more informed linguistic criteria.

As discussed earlier, multitask learning combines the optimization of the ST loss with auxiliary losses such as ASR/MT/CTC losses. Another direction explored by ST researchers is transfer learning, in which the Seq2Seq encoder and decoder are first pre-trained on ASR and MT data, respectively, and the entire model is then fine-tuned on ST data. Below, we discuss works based on multitask/transfer learning frameworks.

1. ST with ASR: ST with ASR models make use of transcript data along with speech-text pairs for pre-training. For example, curriculum pre-training (Wang et al., 2020d) refers to using ASR data to pre-train a Seq2Seq model, allowing it to learn transcription. The authors argue that further pre-training the model on learning semantic concepts (via frame-based masked language modeling) and word alignment (via frame-based bilingual lexical translation) boosts ST task performance. Specifically, existing E2E models either pre-train the encoder or use multi-task learning for ST tasks; as such, the encoder cannot isolate the learning of the three tasks of transcription, semantic concepts, and alignment. Curriculum pre-training segregates them by dividing the labor, and experiments support the theoretical claims. Listen, Understand, and Translate (LUT) (Dong et al., 2021) uses the Seq2Seq model with an external ASR loss. Its primary contribution is a semantic encoder network, whose task is to use the encoder's output from transcription to minimize the mean-squared loss between the semantic representations and the BERT embeddings of the target text. Such a strategy implicitly builds and trains an NMT model for translation. Pre-training using ASR and/or MT has also been found useful in low-resource scenarios (Zhang et al., 2022a).

2. ST using MT: This section discusses approaches that either use MT data for pre-training or directly use a pre-trained MT model in the ST decoder. These approaches rely on the idea of generating pseudo-text and then translating it using MT. For example, Unsupervised Term Discovery (UTD) (Bansal et al., 2017) groups repeated words into pseudo-text, which is subsequently used to train an MT model on the parallel pseudo-text and target translations. The main advantage of such a system is that it can translate some content words under low-resource settings, although the overall results are not very promising on the Spanish-English CallHome dataset. Another limitation of this work is that the approach is not E2E in a true sense, as it involves two models: a UTD model and an MT model. A weakly supervised learning method for ST (Jia et al., 2019) that outperforms multi-task learning takes advantage of pre-trained MT and TTS synthesis modules. A pre-trained MT model is used as a teacher to guide the student ST model in (Liu et al., 2019) (an approach dubbed knowledge distillation (KD)); they, however, rely on source-language text and do not improve upon the pipeline system. Following along, (Gaido et al., 2020b) explores word-, sentence-, and sequence-interpolation-based KD approaches for transferring knowledge from a pre-trained MT model to the ST model.

3. ST using both MT and ASR: This section discusses works employing MT and ASR pre-trained models (Bahar et al., 2020; Tsiamas et al., 2022a) or their losses for transfer or multitask learning.

Multitask learning proves to be effective when the CTC loss is combined with ASR and MT losses in (Bahar et al., 2019a) using various E2E ST architectures such as direct, multitask many-to-one, one-to-many, tied-cascade, and tied-triangle. They show that models pre-trained with ASR and MT losses achieve promising results; contrary to the claims of (Anastasopoulos and Chiang, 2018), the tied-triangle architecture is no better than a direct model when fine-tuned properly. Since the ST task is similar to the MT task from the output perspective, works such as XSTNet (Ye et al., 2021) utilize external MT data to pre-train the encoder-decoder network extensively and then fine-tune it on parallel MT, ST, ASR, and external MT data, optimizing the model with what they call progressive training. They achieve impressive performance on MuST-C and augmented LibriSpeech data and also demonstrate improved performance on the auxiliary MT and ASR tasks. The STPT model (Tang et al., 2022) proposes four sub-tasks for multitask pre-training: text-to-text (T2T), which is self-supervised; speech-to-phoneme, which is supervised; acoustic learning, which is self-supervised; and ST, which is supervised. Only the T2T and ST tasks are subsequently used for fine-tuning. Despite pre-training on “unlabeled” speech data, they obtain superior results on MuST-C data for the ST task. COSTT (Dong et al., 2020) pre-trains the encoder using ASR data and the decoder using paired MT data, and then fine-tunes for the joint transcription-translation task. ComSL is a composite ST model relying on multitask learning with three losses ($L_{ASR}$, $L_{MT}$, $L_{ST}$) combined with a cross-modality loss to bridge the gap (Le et al., 2023a). It is worth mentioning that ComSL does not require force-aligned ST data and learns the cross-modality alignment during training. This, however, requires optimizing four different losses, viz. Masked Token Prediction, Speech-to-Text Mapping, Encoder Representation Matching, and Decoder Distribution Matching (see (Le et al., 2023a) for more details), similar to (Tang et al., 2021b). Fused Acoustic and Text encoding-ST (FAT-ST) (Zheng et al., 2021b) follows a similar pre-training and fine-tuning idea as ComSL, except that it proposes to use any combination of training data from $D_{2^{\{u,x,v\}}}$, where $2^{\{u,x,v\}}$ denotes the power set of the triplet. Essentially, they rely on masked language modeling (MLM) and translation language modeling (TLM) for pre-training (Conneau and Lample, 2019).

4. Non-Autoregressive Modeling (we present NAR within the multitask learning framework because all NAR E2E ST models are optimized within it): as discussed in the background section, an alternative to Autoregressive (AR) modeling is Non-Autoregressive (NAR) modeling. AR assumes that each output token is conditionally dependent on the previously generated tokens, which causes significant latency during inference. NAR models address this by emitting all translated tokens in parallel, thus speeding up inference. Formally, they are given by

p(\mathbf{v} \mid \mathbf{u};\theta) = \prod_{t=1}^{T_{v}} p(v_{t} \mid \mathbf{u};\theta)    (23)

There has been a surge in applying non-autoregressive models in ASR and MT, which has prompted ST researchers to apply them too. For example, (Inaguma et al., 2020a, 2021) train NAR and autoregressive decoders conditioned on a shared speech encoder. Another line of NAR work (Chuang et al., 2021) explores CTC with ASR as an auxiliary task. A CTC-based encoder-only architecture ((Inaguma et al., 2020a, 2021) use both encoder and decoder) for the NAR E2E ST task is shown to perform comparably to or better than strong AR models in (Xu et al., 2023a); a greedy CTC decoding sketch is given after this list.
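As referenced in the item above, the sketch below shows the greedy CTC decoding step used by encoder-only NAR models: all positions are predicted in one parallel argmax, then repeats and blanks are collapsed. The blank index is an assumption.

```python
# Minimal sketch of non-autoregressive greedy CTC decoding for one utterance.
import torch

BLANK_ID = 0   # assumed index of the CTC blank symbol

def ctc_greedy_decode(log_probs):
    """log_probs: tensor of shape (frames, vocab) produced by a CTC head."""
    best = log_probs.argmax(dim=-1).tolist()    # one parallel argmax, no left-to-right loop
    tokens, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK_ID:     # collapse repeated labels, drop blanks
            tokens.append(idx)
        prev = idx
    return tokens
```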

Discussion: Our study of Seq2Seq-based frameworks for the ST task reveals that (a) structural bias can be obtained via stacked/pyramidal RNNs and alignment smoothing, (b) regularizers such as transitivity and invertibility improve the Character Error Rate, (c) HLR helps in transcription as well as translation, (d) replacing the encoder's self-attention with a logarithmic distance penalty enhances translation, (e) progressive training needs huge amounts of data and training time to achieve superior results, and (f) multitask pre-training can be used to leverage unlabeled speech data. (Zhang et al., 2022a) shows that ST models trained from scratch using only ST data perform on par with or surpass pre-trained models; to achieve such results, the proposed best practices include a smaller vocabulary, a wider feedforward layer, a deep speech encoder with post-layer norm, CTC-based regularization, and a parameter-distance penalty. Pre-training is still useful in low-resource data regimes. Transferring knowledge via KD from pre-trained MT to ST causes gender bias, omission of sentences, and generic verbal-tense choices. The use of a large vocabulary and large models is effective for the NAR E2E ST task (Inaguma et al., 2020a), which indicates that combining NAR with LLMs may be a future direction to explore.

7.1.2 Streaming Frameworks

Streaming frameworks for ST start outputting target tokens after seeing only partial input; that is, they translate the input as soon as it arrives without waiting for the entire utterance. They are also known as Simultaneous ST (SimulST or SST) (Goldman-Eisler, 1972; Fügen et al., 2007; Tsiartas et al., 2013; Grissom II et al., 2014) (note that in the MT literature, some works such as (Iranzo-Sánchez et al., 2022) differentiate between the streaming and simultaneous settings, where sentences are treated independently of each other; in ST, existing works make no such distinction). It finds application in online speech translation and video dubbing, among others. Traditionally, the streaming ST problem has been solved by feeding the segmented output of a streaming ASR model to a streaming MT model (Oda et al., 2014; Iranzo-Sánchez et al., 2020). However, due to the cascaded nature of the model, it is prone to high latency and error propagation (Arivazhagan et al., 2019b, 2020; Zaidi et al., 2022). The SST problem faces several practical issues: reordering, acoustic ambiguity, variable speech rate, and long inputs are prominent among them. Our literature survey reveals that most existing works focus on handling long streaming inputs, and the discussion below therefore revolves around that. The other issues mentioned above may also be considered when designing practical SST models.

Figure 7: Incremental Decoding Framework. CP stands for the common prefix. Figure adapted from (Guo et al., 2024).

Existing streaming frameworks intervene in the Seq2Seq framework at various places to design SST models: (a) at the encoder level, (b) at the decoder level, and (c) at the input/latent level.

1. Encoder-level: SOTA SST models use transformers as encoders. Because the self-attention operation looks at the entire utterance, it is unsuitable for streaming inputs, and some works therefore design encoders specialized for streaming. For example, the augmented-memory transformer (Wu et al., 2020; Ma et al., 2020c) splits the utterance $U$ into smaller segments $S=[s_1,\ldots]$, where each segment $s_n$ consists of a left context $I_n$, a main context $c_n$, and a right context $r_n$. Self-attention is calculated only at the segment level, thereby reducing the time complexity, and the augmented memory propagates information from one segment to the next. The incremental transformer (Zhang et al., 2020) leverages a unidirectional encoder based on unidirectional attention with the future context masked for handling streaming inputs.

2. Decoder-level: Instead of modifying encoders, some works such as (Dalvi et al., 2018; Liu et al., 2020a; Nguyen et al., 2021; Guo et al., 2024) propose incremental decoding (see Fig. 7). In this framework, the input speech is divided into fixed-size chunks and decoded every time a new chunk arrives. To avoid distractions from constantly changing hypotheses, selected chunk-level predictions are committed to and no longer modified, and the decoding of the next chunk is conditioned on the committed predictions. Instead of conditioning on all chunk-level predictions, a prefix function is chosen to select a partial hypothesis, because early chunks contain limited information (Liu et al., 2020a). Several strategies exist for choosing the prefix function, for example Hold-$n$ and LA-$n$ (Liu et al., 2020a), SP-$n$ (Nguyen et al., 2021), and Regularized Batched Inputs (R-BI) (Guo et al., 2024). Of these, Hold-$n$ either withholds or deletes the last $n$ tokens of each chunk, LA-$n$ displays the agreeing prefixes of $n$ consecutive chunks, and SP-$n$ takes the shared prefix of all best-ranked hypotheses. In contrast, R-BI applies various augmentations to the input chunks to achieve regularization and SOTA results on the IWSLT SimulST task.

3. Input/latent-level: Since speech input is very fine-grained, deciding when to READ and when to WRITE is challenging. Existing works introduce a pre-decision module that segments the input speech at fixed chunks (fixed) or at word boundaries (flexible). Similarly, the READ/WRITE policy can be fixed or adaptive (Ma et al., 2020b). Most research in SST concentrates on either improving speech encoding or the pre-decision module while relying on fixed policies such as wait-$k$. In this section, we discuss fixed and adaptive pre-decisions/policies; these techniques are combined with Seq2Seq frameworks to devise streaming ST models.

Figure 8: (a) wait-$k$ based streaming ST, (b) RNN-T based streaming ST.

The wait-$k$ policy (Ma et al., 2018) (shown in Fig. 9) learns the parameters $\theta$ of the model by optimizing the negative log-likelihood $-\sum_{(\mathbf{u},\mathbf{v})\in D}\log p(\mathbf{v} \mid \mathbf{u};k;\theta)$, where $k$ is the number of segments to look at before starting translation. The probability $p(\cdot)$ is calculated as

p(\mathbf{v} \mid \mathbf{u};k;\theta) = \prod_{t=1}^{T_{v}} p(v_{t} \mid v_{<t}, u_{\leq t+k-1};\theta)    (24)

The wait-$k$ policy guarantees that the model can look at $t+k-1$ speech segments while predicting token $v_t$ (Ren et al., 2020); a minimal decoding loop illustrating this policy is sketched after this list. One limitation of the wait-$k$ policy, however, is that it fails to do a beam search while decoding, except for the long tail (Ma et al., 2018). To solve this problem, (Zeng et al., 2021) proposes a wait-$k$ stride-$N$ policy, which is essentially a wait-$k$ policy with the addition of $N$ READ and WRITE operations until the end of the sentence after reading the first $k$ segments. To determine the $k$ segments, (Chen et al., 2021b) leverages streaming ASR to guide direct simultaneous ST decoding via beam search.

Figure 9: wait-$k$ strategy for the streaming ST setting. The decoder waits for $k$ input speech segments before starting to output; thereafter, it produces one token for every source segment. The figure showcases the scenario with $k=2$.

As discussed above, determining when to write is crucial for efficient SST. In contrast to the fixed wait-$k$ policy, segmentation can be performed on the embedded speech using CTC (Ren et al., 2020), an attention mechanism (Papi et al., 2022b), or incremental beam search (Yan et al., 2023). Essentially, these works adapt offline ST to SST, showing spectacular performance on benchmark datasets. Note that the models proposed in (Papi et al., 2022b; Yan et al., 2023) are trained in a cascade manner while the inference is E2E. Another issue with a fixed policy is that the model cannot speed up or slow down appropriately for different input types. Other examples of fixed policies are Wait-If* (Cho and Esipova, 2016b) and Monotonic Chunkwise Attention (MoChA) (Chiu and Raffel, 2018), which have been used in simultaneous MT and may be explored for SST.

The works mentioned above require the encoded speech to be segmented so that the decoder can apply the wait-$k$ policy. The goal of segmentation is to identify word, sub-word, or phone boundaries, which are usually uneven (due to silences, longer syllables, etc.); that is, the number of acoustic units in each segment varies over time. Monotonic-segmented Streaming ST (MoSST) (Dong et al., 2022) is based on learning when to translate and has a monotonic segmentation module located between the acoustic encoder and the transformer. It uses an Integrate-and-Fire (IF) neuron (Abbott, 1999), which fires above a threshold once the context is developed; if the context is not yet developed, the neuron keeps receiving signals and accumulating acoustic vectors, thus mimicking an adaptive policy for READ-WRITE operations. The IF strategy has shown impressive performance in simultaneous ASR (Dong and Xu, 2019) and ST (Chang and yi Lee, 2022), and can be used for monotonic segmentation of the speech input along with an adaptive decision strategy (Dong et al., 2022). Another adaptive-policy technique, Monotonic Infinite Lookback Attention (MILk) (Arivazhagan et al., 2019b), used in simultaneous MT, can be explored for SST. It is essentially a Monotonic Attention mechanism (Raffel et al., 2017) that extends, theoretically, to infinitely many past encoder states and trains the MT model along with MILk. It achieves better quality-latency trade-offs than MoChA thanks to its soft attention over all the encoder states on top of hard attention. Monotonic Multihead Attention (MMA) (Ma et al., 2019), which extends MILk to multiple heads, has been used for SST by (Ma et al., 2020b). Its variant, Efficient MMA (Ma et al., 2023), solves the numerical-stability and biased monotonic-alignment issues present in MMA but has not been explored for SST tasks. Adaptive segmentation based on an adaptive policy that takes into account acoustic features and translation history (called meaningful units) is another effective mechanism for SST (Zhang et al., 2022b).

Both fixed and adaptive policy mechanisms employ segmentation modules that sit outside the translation module. As such, they break the acoustic integrity and can degrade translation performance. Therefore, efforts such as (Zhang and Feng, 2023) propose differentiable segmentation (DiSeg), learned jointly with the translation model using expectation training. DiSeg predicts a Bernoulli random variable $\sigma(\mathrm{FFN}(u_i))$ via a feed-forward network (FFN) to decide when to segment. After segmentation, they apply segmented attention, which combines unidirectional and bidirectional attention into one while masking future speech frames. Expectation training first constrains the number of segments and then learns segmentation from the translation model at both the semantic and acoustic levels (Zhang and Feng, 2023).
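As referenced in the wait-$k$ discussion above, the following is a minimal sketch of the fixed wait-$k$ READ/WRITE loop; `encode_prefix` and `decode_step` are hypothetical stand-ins for an incremental encoder and a single decoder step.

```python
# Minimal sketch of the fixed wait-k policy: READ k segments, then alternate WRITE/READ.

def wait_k_translate(speech_segments, k, encode_prefix, decode_step, eos="</s>"):
    outputs = []
    read = min(k, len(speech_segments))                  # READ the first k source segments
    while True:
        states = encode_prefix(speech_segments[:read])   # encode only the prefix seen so far
        token = decode_step(states, outputs)             # WRITE one target token
        if token == eos:
            break
        outputs.append(token)
        if read < len(speech_segments):                  # READ one more segment after every WRITE
            read += 1
    return outputs
```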

The discussion so far has covered encoder- and decoder-level changes and the fixed and adaptive segmentation policies used to develop SST models within Seq2Seq frameworks. Another way to design SST models is via transduction, the process of mapping one sequence to another sequence (Jurafsky and Martin, 2008). A transducer is a special type of Seq2Seq model that solves a few inherent problems: online processing of long inputs and monotonic sequence alignment, the biggest problems with Seq2Seq models (Graves, 2012), are handled by transducers. Below we discuss a special type of transducer called RNN-T and its improvements.

Figure 10: Architecture of (a) RNN-T, (b) CAAT. Figure adapted from (Liu et al., 2021a).

RNN-T is a transducer that can learn the alignment between two sequences in an online/streaming fashion (Graves, 2012), as shown in Fig. 10(a). Formally, it learns the conditional probability $p(\mathbf{v} \mid \mathbf{u})$ by marginalizing over all possible alignment paths $A(\mathbf{u},\mathbf{v})$, including the blank symbol $\phi$, as

p(\mathbf{v} \mid \mathbf{u}) = \sum_{\hat{\mathbf{v}} \in A(\mathbf{u},\mathbf{v})} p(\hat{\mathbf{v}} \mid \mathbf{u})    (25)

RNN-T differs from Seq2Seq in that it divides the decoder into a predictor and a joiner. The predictor takes the output of the previous time step and yields a representation to be consumed by the joiner, along with the hidden representation of the input from the encoder. Since the predictor does not look at the input, it can be pre-trained on text-only data in a low-data scenario. Several SST models have been proposed based on variants of RNN-T, which we discuss next.
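A minimal sketch of the predictor/joiner split is given below: the joiner combines each encoder frame with each predictor state to produce the output lattice consumed by the transducer loss. Dimensions and layer choices are illustrative, not those of a specific paper.

```python
# Minimal sketch of an RNN-T style joiner over encoder frames and predictor (label) states.
import torch
import torch.nn as nn

class Joiner(nn.Module):
    def __init__(self, enc_dim, pred_dim, hidden_dim, vocab_size_with_blank):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)
        self.pred_proj = nn.Linear(pred_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size_with_blank)

    def forward(self, enc_states, pred_states):
        # enc_states: (B, T, enc_dim) acoustic states; pred_states: (B, U, pred_dim) label states
        joint = torch.tanh(self.enc_proj(enc_states).unsqueeze(2)
                           + self.pred_proj(pred_states).unsqueeze(1))
        return self.out(joint)   # (B, T, U, vocab+blank) lattice fed to the transducer loss
```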

One of the main issues with RNN-T is the strict monotonic alignment between the input and output sequences, which makes it unsuitable for tasks requiring reordering, such as MT and ST. For example, the Cross-Attention Augmented Transducer (CAAT, shown in Fig. 10(b)) optimizes the translation and policy models in tandem (Liu et al., 2021a) and eliminates RNN-T's strict monotonic restriction to allow reordering in the translation. Using transformers as encoders to reduce the multi-step memory footprint causes a significant delay for CAAT; the use of regularization terms and substantial hyperparameter adjustment are some of its other limitations. An extension in (Xue et al., 2022) leverages Transformer Transducer (TT) networks with attention pooling for streaming E2E ST tasks. Attention divides the input audio into chunks of specific sizes: at any time, processing an input frame $\mathbf{u}_t$ can only see frames within its chunk and a fixed number of left chunks. By sharing the encoder, they also propose a variant to handle E2E ST tasks in multilingual settings; the adaptive READ and WRITE policy choices between encoder output and ground truth contribute to its success. The same authors (Wang et al., 2023) propose to combine the benefits of language-specific and language-agnostic encoders within the TT framework: a shared encoder takes LIDs as gating values and computes weights for each language through a source LID scheduling scheme. The empirical results demonstrate superior performance and fewer trainable parameters than bilingual ST. An adaptive (dynamic) policy for segmenting the speech input has recently been explored in a Seq2Seq transduction setting by (Tan et al., 2024); it applies a cross-attention mechanism to decide when to segment the input, followed by dynamic compression via anchor representations, thereby saving memory and achieving a better latency-quality trade-off.

Besides transducer and Seq2Seq models, re-translation is another approach adapted for the SST task by (Niehues et al., 2016, 2018; Arivazhagan et al., 2019a, 2020), though in a cascade setting. In this approach, the translated output can be re-generated after a fixed amount of time and re-displayed for better quality. Though it reduces latency by greedily displaying partial translations, the output is highly unstable and causes a flickering effect, which may result in a bad user experience. To quantify this instability, (Arivazhagan et al., 2020) propose a metric called erasure, which takes into account the length of the suffix deleted during re-translation. Dynamic masking of the MT output in a cascade of streaming ASR and MT for improving stability has been explored in (Yao and Haddow, 2020). Another approach to reducing instability uses luminance contrast and the Discrete Fourier Transform (Liu et al., 2023).
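As an illustrative reading of the erasure idea (not the exact definition of (Arivazhagan et al., 2020)), the sketch below counts how many previously displayed tokens must be erased when a re-translation replaces the shown hypothesis.

```python
# Minimal sketch of an erasure-style instability measure for re-translation.

def erasure(displayed_tokens, new_tokens):
    common = 0
    for old, new in zip(displayed_tokens, new_tokens):
        if old != new:
            break
        common += 1                               # length of the agreeing prefix
    return len(displayed_tokens) - common         # suffix that had to be deleted and redrawn

# e.g. erasure(["the", "cat", "sat"], ["the", "cat", "is", "sitting"]) == 1
```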

Evaluation of SST models: SST models in the literature have been evaluated using the quality and latency metrics presented in §3, often showing a trade-off between quality and latency. Most existing works attempt to balance quality and latency while ignoring visualization and the cognitive load on the viewer when the output is displayed on a screen. Towards this end, (Papi et al., 2021b) emphasizes considering visualization as a metric to be evaluated along with latency and quality; however, little effort has been made in this direction by the SST community, and we therefore wish to draw researchers' attention to visualization as an evaluation metric for SST models. In this direction, (Liu et al., 2023) propose tokenized alignment, word updates with semantic similarity, and smooth animation of live captions, and find that this reduces fatigue and distraction while increasing the viewer's reading comfort.

Discussion: SST is a challenging problem, and E2E SST poses further impediments. Our findings suggest that using an adaptive policy significantly improves the latency-quality trade-off. Learned policy mechanisms are an ongoing research direction, and adapting them for true long-form SST may open new possibilities. Differentiable segmentation for long sequences remains largely untapped and requires more investigation. Re-translation is found to be on par with or better than SOTA streaming models (Arivazhagan et al., 2020) under a very low revision rate; such a finding suggests considering re-translation in E2E SST system design.

7.2 ST Models based on the Nature of Available Data

In the previous section, we provided an overview of ST models based on the frameworks used. The present section offers another perspective on E2E ST models: it discusses E2E ST models categorized by the nature of the data, such as low-resource, code-mixed, unlabeled, or multilingual data. Given the specific challenges each poses, we believe such a categorization may be interesting to researchers.

7.2.1 ST in Low-Resource Settings

A low-resource language (LRL) is one for which speech and/or text data are scarcely available, usually not enough to pre-train Seq2Seq models. As such, LRLs present challenges of their own, such as overfitting and poor generalization. This section discusses works where ST models are developed especially for low-resource languages. The proposed models under this category have a generic architecture, as shown in Fig. 11(a), similar to Seq2Seq ST models. We find that these approaches mainly pre-train the encoder on high-resource ASR data and subsequently fine-tune on ST data. Another approach that has emerged in recent years to tackle LRL issues is SSL. For example, (Bansal et al., 2019) empirically demonstrates a 100% performance improvement on ST tasks. They find that pre-training on ASR data enhances ST task performance even if the ASR language differs from the source and target languages. Though the BLEU score is improved, the absolute BLEU score is only 7.1. In (Wang et al., 2022), unsupervised ST is implemented for low-resource settings using pseudo-labels from unsupervised cascade models. SSL with discrete speech units (DSU) has been used to fine-tune the ST model on limited ST data (Lam et al., 2024).

Figure 11: E2E ST models based on the nature of the data. (a) Low-resource ST, (b) Code-Mix ST, (c) Unsupervised ST, and (d) Multilingual ST. The dashed arrow denotes optional components.

7.2.2 Code-mix ST

Code-mix language refers to speech where one primary language is used, but words or phrases from other (embedded) languages are also included. This phenomenon poses a multitude of challenges, encompassing ambiguous vocabulary, fluctuating lexical representations, intermingling of languages at the word level, redundancy, and alterations in word sequencing. Therefore, it is non-trivial to handle code-mixing while building ST models.

We find that there exist only a few works on code-mix ST. In (Weller et al., 2022), a code-mix dataset is created from the existing publicly available Fisher (Cieri et al., 2004) and Miami (https://github.com/apple/ml-code-switched-speech-translation) corpora. As shown in Fig. 11(b), code-mix ST models feed a language ID, in addition to the speech input, to the encoder of the Seq2Seq model (Weller et al., 2022). Wav2Vec 2.0 (an acoustic encoder) and mBART (a multilingual decoder) are used for both languages, with an attention layer applied to the embedded language. The use of multilingual encoders and decoders is a common practice when building code-mix ST models (Yang et al., 2023). In particular, self-supervised multilingual pre-training with adapters may be explored further.
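As a rough illustration of feeding a language ID alongside the speech input (Fig. 11(b)), the sketch below prepends a learned LID embedding to the acoustic encoder states before they reach the decoder; the dimensions and the prepend-versus-merge choice are assumptions made for the sketch.

import torch
import torch.nn as nn

NUM_LANGS, HID = 8, 256                      # toy sizes
lid_embedding = nn.Embedding(NUM_LANGS, HID)

def add_language_id(encoder_states, lang_id):
    # encoder_states: (B, T, HID), e.g. outputs of a Wav2Vec 2.0-style encoder.
    B = encoder_states.size(0)
    lid = lid_embedding(torch.full((B, 1), lang_id, dtype=torch.long))  # (B, 1, HID)
    return torch.cat([lid, encoder_states], dim=1)                      # (B, T+1, HID)

states = torch.randn(2, 100, HID)                  # stand-in for acoustic encoder output
print(add_language_id(states, lang_id=3).shape)    # torch.Size([2, 101, 256])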

7.2.3 Unsupervised ST

There is an abundance of unlabeled speech and text data. Since manual annotation and creating a parallel corpus is costly, the natural instinct is to exploit unlabeled data for training ST models. This section reviews works where researchers make use of the unlabeled speech data to advance the ST task performance.

For unsupervised ST tasks, it is common to leverage large-scale self-supervised and semi-supervised learning. For example, speech encoders such as Wav2Vec 2.0 have been pre-trained in a self-supervised manner on Libri-light data (Kahn et al., 2019) and used by (Li et al., 2020; Wang et al., 2021b), whereas the decoder is randomly initialized. The entire model is optimized on CoVoST 2 ST data with the encoder frozen; self-training is then executed to generate pseudo-labels for the Libri-light data, and the Wav2Vec 2.0 "student" model is fine-tuned with ground-truth CoVoST 2 data and the pseudo-labels. Finally, a language model (LM) is trained on CommonCrawl data and combined with the ST model to generate text via beam-search decoding. Along similar lines, for training the E2E model, (Wang et al., 2021b) produce pseudo-labels by cascading ASR, text de-normalization, and MT in an unsupervised manner. Wav2Vec 2.0 and mBART are optimized for domain adaptation using in-domain data (Li et al., 2020). According to the experimental results, the proposed method is effective for E2E models without pre-training. However, a performance gap remains between supervised and unsupervised pre-trained models, which may be investigated in future work.
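A schematic sketch of the self-training recipe described above is given below. The teacher and student interfaces are placeholders rather than APIs of any specific toolkit; in practice, the teacher would be the unsupervised ASR → de-normalization → MT cascade and the student the Wav2Vec 2.0 + mBART model.

def pseudo_label(unlabeled_speech, teacher_translate):
    # Run the cascade teacher over unlabeled audio and collect
    # (speech, pseudo-translation) pairs for student training.
    return [(utt, teacher_translate(utt)) for utt in unlabeled_speech]

def self_training_round(unlabeled_speech, teacher_translate, finetune_student):
    pairs = pseudo_label(unlabeled_speech, teacher_translate)
    finetune_student(pairs)        # fine-tune the E2E student on the pseudo-pairs
    return pairs

# Toy stand-ins so the sketch runs end to end.
speech = ["utt_001.wav", "utt_002.wav"]
teacher = lambda utt: f"pseudo translation of {utt}"
student = lambda pairs: print(f"fine-tuning student on {len(pairs)} pseudo-pairs")
self_training_round(speech, teacher, student)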

7.2.4 Multilingual ST

The multilingual ST model aims to translate from/to multiple speech input/output languages. It can be one of many-to-one, one-to-many, or many-to-many. The ST models solve multilinguality issues using mainly three approaches: (a) language ID, (b) dual-decoder, and (c) pre-trained models.

  1.

    Language ID (LID) is an identification label that allows the model to identify the target language and explicitly translate the speech into it. Existing works handle multilinguality using LID either in the encoder or in the decoder. In (Inaguma et al., 2019), the model uses LID in the decoder for one-to-many and many-to-many translation; they demonstrate impressive performance when translating from high-resource to low-resource languages without using any transcript data from the LRL. However, using the LID embedding in the decoder (Gangi et al., 2019) is shown to underperform compared to using it in the encoder. The authors show that the LID can be either concatenated or merged with the inputs and, when pre-trained with ASR data, can yield superior performance to a one-to-one system. The model, however, performs poorly when trained on many unrelated target languages. The one-to-many and many-to-one multilingual ST systems of (Wang et al., 2020c, a) provide a good set of baselines for research purposes.

  2.

    The dual-decoder model is a Transformer with two decoders, one each for ASR and ST, and a dual-attention mechanism. In (Le et al., 2020), a dual-decoder model is proposed and jointly optimized for the ASR and ST tasks. The authors hypothesize that the dual-attention mechanism can benefit each task by transferring knowledge either instantly or under a wait-k policy. Their model generalizes earlier one-to-many and bilingual ST models.

  3.

    Pre-trained Multilingual Models use a pre-trained encoder and decoder for acoustic modeling and language modeling, respectively. In (Li et al., 2020; Tran et al., 2020), the authors show that efficiently fine-tuning mBART, a pre-trained multilingual decoder (Liu et al., 2020c), can achieve SOTA results on CoVoST data for zero-shot cross-lingual and multilingual translation tasks. Along similar lines, (Le et al., 2021) show that inserting adapters between the layers of the encoder-decoder framework and tuning them can improve ST task performance over bilingual ST models (a minimal adapter sketch follows this list). SeamlessM4T (Barrault et al., 2023), Whisper (Radford et al., 2023), and other foundation models are built using many of these concepts, such as language ID in the decoder and multilingual, multimodal, and multitask pre-training.
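The sketch below shows what such an adapter layer could look like: a small bottleneck with a residual connection inserted between frozen pre-trained layers, so that only the adapter parameters are updated during multilingual fine-tuning. The hidden and bottleneck sizes are illustrative assumptions.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project, non-linearity, up-project, residual.
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.ReLU()
    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual keeps the layer near-identity

layer_output = torch.randn(2, 100, 768)   # stand-in for a frozen Transformer layer output
adapter = Adapter()                       # only these parameters would be trained
print(adapter(layer_output).shape)        # torch.Size([2, 100, 768])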

7.3 Discussion

The works presented so far show that E2E ST models have improved tremendously. This improved performance is likely due to leveraging pre-trained ASR/MT models, or the respective corpora, to train ST encoders/decoders. Weakly labelled data and pseudo-labels are another way to create more training data for ST models. Contrastive learning, mix-up strategies, adapters, and optimal transport are a few ways to bridge the modality gap.

Applying unsupervised ASR and MT with a Wav2Vec 2.0 encoder and an mBART decoder in a low-resource setting yields good results for ST models. For online streaming, using the IF neuron for context building and translation improves results compared to CAAT, which suffered latency issues due to the reordering for translation introduced by RNN-T. mBART handles multilingual settings well when combined with a dual-attention mechanism that facilitates knowledge transfer, and inserting adapters between the encoder and decoder layers further improves performance. In the unsupervised ST setting, SOTA results were achieved by training Wav2Vec 2.0 on data within the same domain as the target speech. We see that the wait-k policy is used both in streaming settings with segmentation and in multilingual settings with a dual-attention mechanism; in both cases it yields good results. Adapters are likewise used for modality bridging and in multilingual settings with pre-trained models, which improves performance. As shown in (Sun et al., 2023), multilingual E2E ST for LRLs can benefit from joint training with related HRLs.

7.4 Overall Performance Trend of E2E ST approaches in Common Benchmarks

In this section, we analyse the performance evolution of ST models over the MuST-C dataset, as depicted in Figure 12. We selected the MuST-C dataset due to its widespread adoption by researchers since its introduction in 2019.

Figure 12 reveals that the overall performance of ST models has steadily improved over time across all 8 languages, with a few remarkable gains. The first significant gain was observed in 2021 with the adapter method (Le et al., 2021). This jump in performance is achieved through the use of adapter layers within multilingual models, which shows the transferability of knowledge across related language pairs (note that not all proposed models were tested on all 8 languages). The figure also shows that Chimera (Han et al., 2021), a modality-bridging model, performs poorly compared to adapter-based models. That is, the shared semantic network proposed in (Han et al., 2021) is not as effective as adapters with multilingual models, and a gap between the text and speech modalities remains.

The next jump is due to ConST (Ye et al., 2022a) (for languages such as Es, It, Pt, and Ru). This model achieved superior results by incorporating contrastive learning to bridge the modality gap for the first time: cross-modal speech-text retrieval accuracy jumps from 4% to 88%, a far better way to bridge the gap than Chimera. STEMM performs worse than ConST; both are from the same authors and were proposed in the same year, and ConST is in fact an improvement over XSTNet and STEMM through the use of a cross-modal contrastive loss. FCCL (medium model) (Zhang et al., 2023c) further improves performance by applying contrastive learning at both the sentence and frame level, whereas ConST applies it only at the sentence level. Finally, the OT-based model outperforms contrastive-learning-based models on all languages except De and Ru. Looking closely, we find that the OT-based model (Le et al., 2023b) closes the modality gap only partially compared to ConST and FCCL for a few languages. Hence, as a recommendation, coarse- and fine-grained contrastive learning and ASR pre-training with CTC loss via OT may be explored to build better ST models. Note that LLM-based ST models are not compared here, primarily because of their pre-training over massive amounts of data; we want a fair comparison in which pre-training over external ASR and MT corpora leads to higher performance, as we find in the ConST and FCCL models.
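To make the contrastive-learning idea concrete, the following sketch computes a sentence-level cross-modal InfoNCE loss in the spirit of ConST: for each speech embedding, the paired text embedding is the positive and the other in-batch texts are negatives. The pooling, temperature, and dimensions are illustrative assumptions, not the exact configuration of the cited models.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.05):
    # speech_emb, text_emb: (B, D) pooled sentence-level embeddings of paired
    # speech and text produced by the two encoders.
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature   # (B, B) scaled cosine similarities
    targets = torch.arange(speech_emb.size(0))         # i-th speech pairs with i-th text
    return F.cross_entropy(logits, targets)

speech, text = torch.randn(8, 512), torch.randn(8, 512)   # stand-ins for encoder outputs
print(cross_modal_contrastive_loss(speech, text).item())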

7.5 SOTA Performance of E2E ST Models on Low-Resource Languages

In Table 2, we present the SOTA performance of various ST models on low-resource language pairs as of November 2023. The table indicates which models, utilizing specific techniques, achieve SOTA performance, providing a comprehensive overview of the current status of ST models for low-resource languages (LRLs). From Table 2, it is evident that the BLEU scores for many LRLs, such as Mn, Si, Ta, Id, Ja, and Sv, are relatively low. This is most likely due to the small amount of speech data available for these languages (as seen in the Speech (hours) column), compared to other LRLs where a larger amount of speech data is used for training the LNA + Zero-shot model. This highlights the need to improve the performance of ST models for these languages by increasing the data and designing better models.

Figure 12: Performance of ST models on the MuST-C dataset across the eight target languages
Table 2: SOTA performance in Low-Resource Language Pairs: Dataset, Models, Speech Duration, Settings, and BLEU Score
Language Pair Model/Technique Dataset Speech (hours) Setting Metric (BLEU)
Ainu→En Tied Multitask Learning with regularizers (Anastasopoulos and Chiang, 2018) Glossed Audio Corpus 2.5 ST with ASR & MT 20.3
Mboshi→Fr Godard Corpus 4.4 24.7
Mt→En WACO (Ouyang et al., 2023) IWSLT 1 Modality Bridging 13.3
Et→En Unsupervised + W2V2 + mBart (Wang et al., 2022) CoVoST-2 3 Low-Resource 19.0
Lv→En 2 25.0
En→Ar Teacher-Student (W2V2 + self-training + dec w/o LM) (Kahn et al., 2019) CoVoST-2 430 Unsupervised 20.8
En→Ca 35.6
En→Tr 18.9
Sl→En LNA + Zero Shot Learning (Li et al., 2020) CoVoST-2 2 Multi-Lingual 5.6
Sv→En 2 5.9
Fa→En 49 11.0
Tr→En 4 11.2
Mn→En 3 1.2
Ar→En 2 6.4
Cy→En 2 9.0
Ta→En 2 0.9
Ja→En 1 2.1
Id→En 1 3.7
En→Cy 430 30.6
En→Et 430 22.2
En→Fa 430 21.5
En→Id 430 29.9
En→Ja 430 39.3
En→Lv 430 21.5
En→Mn 430 14.8
En→Sl 430 25.1
En→Sv 430 30.4
En→Ta 430 17.8

8 Deployment of E2E ST Models

Deployment of offline E2E ST models incurs several challenges. The first is handling cross-talk, noise, and background music to obtain clean speech; if the speaker stutters or has a different dialect or accent, the same ST model may not work effectively. The second challenge relates to the distance of the speaker from the microphone and the speaker's movements around it, which can hamper the input speech quality. As a remedy for these problems, the ST model may be trained on a variety of speakers under various acoustic conditions. The third challenge concerns memory consumption, especially when considering LLM-based ST model deployment. To deploy memory-intensive and LLM-based ST models on edge devices, pruning, quantization, and knowledge distillation techniques may be used (Zhou et al., 2022a), which significantly reduce the memory load.
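As an example of the memory-reduction options above, the sketch below applies post-training dynamic quantization to a toy stand-in for an ST model. The API shape follows PyTorch's dynamic quantization utilities, and the exact module names may vary across versions; pruning and knowledge distillation would be applied separately.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512))  # toy stand-in

# Store Linear weights in int8 and dequantize on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # torch.Size([1, 512]); Linear weights now stored in int8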

Streaming ST models, on the other hand, are used as a submodule within automatic subtitling. Hence their deployment inherits the challenges of the subtitling task, which is considered harder. For example, subtitling requires the following to be addressed: (a) the translated text should be segmented so as to reduce the cognitive load and maximize the user experience, e.g., in terms of reading speed and synchronization with the speech; and (b) how many characters and lines should be displayed? These constraints are usually decided by the media industry; for example, a maximum of 2 lines of subtitles, at most 42 characters per line, and a maximum reading speed of 21 characters/second is used by TEDx (Agrawal et al., 2023).
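A small sketch of how such display constraints could be checked for a subtitle block is given below; the default limits mirror the TEDx-style constraints quoted above, and the function itself is only an illustration on our part.

def check_subtitle_block(lines, duration_sec,
                         max_lines=2, max_chars_per_line=42, max_cps=21):
    # Return a list of violated constraints (empty means the block is compliant).
    problems = []
    if len(lines) > max_lines:
        problems.append(f"too many lines: {len(lines)} > {max_lines}")
    for i, line in enumerate(lines, 1):
        if len(line) > max_chars_per_line:
            problems.append(f"line {i} has {len(line)} chars (> {max_chars_per_line})")
    cps = sum(len(line) for line in lines) / max(duration_sec, 1e-6)
    if cps > max_cps:
        problems.append(f"reading speed {cps:.1f} chars/sec (> {max_cps})")
    return problems

block = ["End-to-end models translate speech", "directly, with no transcript step."]
print(check_subtitle_block(block, duration_sec=2.5))  # flags a reading speed of 27.2 chars/sec
print(check_subtitle_block(block, duration_sec=4.0))  # [] – compliant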

Table 3: Dataset statistics (✓ means that the feature is available for the dataset and ✗ means that the feature is unavailable for the dataset)
Datasets Source Language (Speech) Target Language (Text) Speech (hours) Speakers Validation Gender Age Group
MuST-C En 14 lang 0.4K 1.6K
Librispeech En Fr 0.2K 1.4K
CoVost En 11 lang 0.7K 11K
CoVost2 21 lang En 2.8K 11K
En 15 lang 0.7K 78K
EuroparlST 4 lang 4 lang 0.25K
VoxPopuli En 15 lang 1.79K 4.3K
Kosp2e Ko En 0.2K 0.2K
GigaST En De, Zh 10K
Prabhupadavani en-bn-sn code-mix 25 lang 0.09K 0.13K
How2 En Pt 2K
FLEURS 102 lang 102 lang 1.4K 0.3K
BSTC Zh En 98
Indic-TEDST En 9 lang 189 1.64K

9 Resources for ST

9.1 Datasets for ST Tasks

There have been several datasets created for the ST task. Some of them are listed below, and we describe them here briefly. Table 3 provides information on various dataset statistics, such as hours of speech, the number of speakers, whether the dataset was manually or machine validated, the gender, and the age group to which the speakers belong. Additionally, the tools used for creating these datasets include (a) Gentle (Ochshorn and Hawkins, 2017) for audio-transcription alignment and (b) BertAlign (https://github.com/bfsujason/bertalign) for transcription-translation alignment.

  • 1.

    How2 (Sanabria et al., 2018) is an ST corpus of English instructional videos having Portuguese translations.

  • 2.

    Augmented LibriSpeech (Kocabiyikoglu et al., 2018) is obtained from the LibriSpeech corpus (Panayotov et al., 2015), a speech recognition repository generated using audiobooks of the Gutenberg Project (https://www.gutenberg.org/). This dataset is designed to translate English speech into written French text.

  • 3.

    CoVoST and CoVoST 2 (Wang et al., 2020a, c) are based on the Common Voice project (https://commonvoice.mozilla.org/en). CoVoST is a many-to-one dataset covering 11 languages, while CoVoST 2 offers many-to-one translation from 21 languages into English and one-to-many translation from English into 15 languages.

  • 4.

    Europarl-ST (Iranzo-Sánchez et al., 2020) is a collection that contains speech and text data from European Parliament proceedings between 2008 and 2012 in four languages. It includes multiple sources and targets for both speech and text.

  • 5.

    MuST-C (Cattoni et al., 2021) is a large multilingual ST corpus. It contains translations from English into fourteen additional languages and is compiled from TED Talks. mTEDx (Salesky et al., 2021) is another multilingual dataset built from TED talks.

  • 6.

    VoxPopuli (Wang et al., 2021a) dataset is an expansion of Europarl-ST. It includes data from European parliament sessions spanning from 2009 to 2020.

  • 7.

    Kosp2e (Cho et al., 2021) is a Korean (ko) to English (en) ST corpus, which contains Korean speech with parallel English texts. The corpus contains data from four different domains: Zeroth from news/newspapers, KSS (Park, 2018) from textbooks, StyleKQC (Cho et al., 2022) from AI applications, and Covid-ED (Lee et al., 2021) from people's COVID-19 diaries, which contain emotional content.

  • 8.

    BSTC (Zhang et al., 2021) is a Baidu Speech Translation Corpus, a large-scale Chinese-English speech translation dataset. This dataset is constructed based on a collection of licensed videos of talks or lectures, their manual transcripts, and translations into English, as well as automated transcripts by an automatic speech recognition (ASR) model.

  • 9.

    GigaST (Ye et al., 2022b) is a corpus of speech translations from English to German and Chinese. It is created from the English ASR corpus GigaSpeech (Chen et al., 2021a), which features 10,000 hours of transcribed speech from various sources such as audiobooks, podcasts, and YouTube.

  • 10.

    Prabhupadavani (Sandhan et al., 2022) is an ST dataset whose speech is multilingual and code-mixed across three languages: English is the primary language, with words and phrases from Sanskrit and Bengali interjected. The text side contains sentences in 25 languages.

  • 11.

    FLEURS (Conneau et al., 2022) is a multilingual speech dataset offering parallel recordings across 102 languages. Developed as an extension of the FLoRes-101 MT benchmark, it encompasses about 12 hours of annotated speech data for each language.

  • 12.

    Indic-TEDST Sethiya et al. (2024) is a low-resource ST translation dataset across 9 Indic languages: Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa), Tamil (ta), and Telugu (te).

Besides these popular ST datasets, there are some other smaller datasets such as Fisher (Cieri et al., 2004), CallHome (https://ca.talkbank.org/access/CallHome/eng.html), the Godard Corpus (Godard et al., 2018), the Glossed Audio Corpus (https://ainu.ninjal.ac.jp/folklore/en/), BTEC (http://universal.elra.info/product_info.php?cPath=37_39&products_id=80), WSJ (https://catalog.ldc.upenn.edu/LDC93s6a), IWSLT (https://iwslt.org/), the Miami Corpus (Deuchar, 2008), and the MSLT Corpus (Federmann and Lewis, 2016).

9.2 Toolkits for ST

To facilitate building and training ST models, researchers have proposed a few toolkits. These toolkits provide an environment in which datasets for ST tasks can be pre-processed and models can be trained, fine-tuned, and evaluated. We provide a short description of these toolkits to make this survey a one-stop shop for ST modeling.

  • 1.

    SLT.KIT (https://github.com/isl-mt/SLT.KIT) (Zenkel et al., 2018) offers ASR, MT, and ST models along with specific features such as CTC- and attention-based ASR, ASR with punctuation, and a neural MT system.

  • 2.

    ESPnet-ST (https://github.com/espnet/espnet) (Inaguma et al., 2020b) was developed because no single toolkit was available for performing all the sub-tasks of ST. ESPnet-ST provides ASR, LM, E2E-ST, cascade-ST, MT, and TTS recipes along with examples. It also provides pre-trained Transformer-based models on various datasets such as MuST-C, Libri-trans, Fisher, CallHome, and How2.

  • 3.

    FairSeq S2T (https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_text) (Wang et al., 2020b) is an extension of FairSeq (Ott et al., 2019) in which all the functions of ESPnet-ST are available. Additionally, it provides non-autoregressive MT, online ST, and speech pre-training. The toolkit also provides state-of-the-art ST models based on RNNs, Transformers, and Conformers, and has built-in data loaders for the MuST-C, LibriSpeech, and CoVoST datasets.

  • 4.

    NeurST (https://github.com/bytedance/neurst) (Zhao et al., 2021) is a lightweight toolkit with no dependency on the Kaldi toolkit (Zheng et al., 2011). It achieves high computational efficiency using mixed precision and accelerated linear algebra, and faster training on large-scale datasets using Horovod (Sergeev and Balso, 2018).

10 Future Directions for Research

This section highlights challenges that need the attention of researchers working on ST problems.

10.1 Cascade vs End-to-End Models

As argued and presented through comprehensive experiments by (Bentivogli et al., 2021), the performance gap between cascade and E2E ST models has largely been bridged. However, as shown by (Agrawal et al., 2023) in the recent IWSLT 2023 subtitling generation task, the performance of cascade models is still far superior to that of E2E models for offline ST tasks evaluated on all metrics. Furthermore, to the best of our understanding, no thorough assessment of E2E versus cascade models has been done for low-resource languages. It may be interesting to compare E2E and cascade ST models on various ST datasets to verify the claims in the literature.

10.2 ST on Code-Mix data

We find that there are only limited studies on ST models that use code-mix data as input. Code-mix data poses problems such as differing lexicons and syntax and the scarcity of labeled data. Therefore, it will be interesting to (a) create code-mix ST datasets incorporating more languages, (b) see how existing ST models perform on code-mix ST data, and (c) investigate whether pre-training on many languages can help tackle the code-mixing issue.

10.3 Domain-Invariant Models

ST models developed for one domain do not scale well to other domains, as shown in the recent IWSLT 2023 campaign. Here, the domain-invariance setting refers to an ST model that is trained on one language combination (say, Eng-De) and needs to be adapted to other language combinations (e.g., Eng-Hi). Transfer learning and continual learning can be explored to develop such generic models.

10.4 Discrepancy between Automatic and Human Evaluation

There may be discrepancies and disagreements among the various metrics used to report ST task results, and they do not always match the mean opinion score (MOS) provided by human evaluators (Agrawal et al., 2023). For example, if a system evaluates the BLEU score between the ground-truth sentence “Police shot the culprit with a gun” and the hypothesis sentence “Police use a gun to shot the culprit”, it is 0! However, both sentences might be deemed semantically appropriate translations of the same utterance. Such an argument is supported by dubbing artists, who often change the voice of a sentence to simplify it or make it more pleasing. (In the movie “Pirates of the Caribbean”, Jack Sparrow asks Bloom how far he is willing to go for the girl. The original answer from Bloom is “I can die for her!”, whereas the Hindi dubbing is “Till the dying breath”.)

As highlighted in (Marie et al., 2021), the BLEU score is reported by more than 99% of MT papers without accounting for statistical significance testing or human evaluation, and our survey of ST papers indicates that the same trend is being followed. Therefore, we call for the attention of researchers to develop and use metrics that match human evaluations semantically. One approach could be to subject the ground-truth and hypothesis sentences to a semantic textual similarity model and score them accordingly.
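As a hedged sketch of this suggestion, the snippet below scores the example above with an embedding-based semantic similarity using the sentence-transformers library; the specific model checkpoint is an illustrative choice on our part, not one used in the surveyed papers.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative off-the-shelf STS model

reference = "Police shot the culprit with a gun"
hypothesis = "Police use a gun to shot the culprit"

ref_emb, hyp_emb = model.encode([reference, hypothesis], convert_to_tensor=True)
print(float(util.cos_sim(ref_emb, hyp_emb)))  # high similarity despite zero 4-gram overlap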

10.5 Handling Ambient Noise

In our literature survey, we find that little has been done to deal with ambient noise. Ambient noise, background music, cross-talk, and non-verbal sounds may create difficulty in ST model learning. The model must distinguish between a meaningful utterance and ambient noise, a non-trivial task.

10.6 Handling Multiple Speakers

It is common in the real world for audio/video to have multiple speakers, each of whom may have their own accent (cf. an Asian and an American talking to each other in English), dialect, and pitch. Performing speech separation before feeding the audio to the ST model may improve performance.

10.7 Handling Speaker Diarization

Speaker diarization refers to demarcating the time boundaries of speakers in multi-speaker speech. So far, the datasets for ST do not have speaker boundary marks. Creating speaker-diarized ST data in a multilingual setting will be interesting for testing the robustness of ST models.

10.8 Multilingual and Simultaneous ST

Multilingual ST has gained momentum recently due to its importance in the real world; for example, a single speech may need to be broadcast to multilingual communities (e.g., a conference attended by a diverse group of people). It can be one-to-many, many-to-one, or many-to-many ST. Our literature survey shows that only a few works exist in this space. Besides, there is an opportunity to explore simultaneous multilingual ST, which is the most practical setting.

10.9 Low-resource ST Datasets and Models

Most existing works have focused on building ST models and datasets for high-resource languages. As we know, the success of ST models relies on parallel speech-text corpora; building ST datasets for low-resource languages therefore requires more attention. Further, a few works, such as (Bansal et al., 2019), have reported ST results on the Mboshi-French pair; however, the BLEU score is poor. Therefore, building models that transfer knowledge from high-resource to low-resource language pairs is warranted.

10.10 LLMs for ST tasks

In the last few years, large language models (LLMs) have emerged as a promising solution to many NLP tasks, including ST. LLMs show in-context learning (ICL) when trained over a massive amount of data; this unlocks their hidden emergent abilities (Wei et al., 2022) and enables few-shot and zero-shot learning via prompting. There exist a few works (Zhang et al., 2023b; Wu et al., 2023; Huang et al., 2023) (see (Gaido et al., 2024) for a comparative discussion) that explore LLMs for the ST task. Concretely, all of these models leverage a speech foundation model, followed by a length adapter, modality adaptation and mixing of the two modalities, and then an LLM for generating the output. GenTranslate (Hu et al., 2024) builds upon SeamlessM4T by integrating an LLM on top and performing N-best hypothesis tuning. Initial results are promising. However, it remains to be seen how the various components affect downstream task performance, what the best strategy for prompt design is, and how to pre-train/fine-tune these models in a parameter-efficient way for ST tasks. Further, the use of LLMs for SimulMT has recently been proposed (Agostinelli et al., 2023), and it remains to be seen how to adapt SimulMT to SimulST.
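As a rough sketch of the length-adapter and modality-projection step described above, the module below shrinks a long sequence of speech-encoder features with a strided 1-D convolution and projects it into an LLM embedding space; all dimensions and the stride are assumptions for illustration, not the configuration of any cited system.

import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    def __init__(self, speech_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        # Strided convolution reduces the frame rate; the linear layer maps into
        # the LLM's embedding space so speech "tokens" can be mixed with text tokens.
        self.conv = nn.Conv1d(speech_dim, speech_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(speech_dim, llm_dim)
    def forward(self, speech_feats):                 # (B, T, speech_dim)
        x = self.conv(speech_feats.transpose(1, 2)).transpose(1, 2)   # (B, T//stride, speech_dim)
        return self.proj(x)                          # to be prepended/interleaved with prompt embeddings

feats = torch.randn(1, 400, 1024)                    # stand-in for speech foundation model output
print(LengthAdapter()(feats).shape)                  # torch.Size([1, 100, 4096])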

10.11 Really long Context Modelling

As mentioned in the streaming section, SST models need to handle long input sequences. Current speech encoders lack infinite-context modeling capability due to the quadratic complexity of self-attention. There have been recent efforts to handle the problem of infinite context; for example, Mamba (Zhang et al., 2024a), Infini-attention (Munkhdalai et al., 2024), and TransformerFAM (Hwang et al., 2024) show promising results in long-context modeling. These models may be explored for the SST task as well.

11 Conclusion

This survey delves into the most recent advancements in E2E ST translation. Our discussion covers the models, evaluation metrics, and datasets used for ST. We review various frameworks for ST models and highlight previous research in this field, categorizing ST models by the kind of data they handle and the models employed. Additionally, we discuss potential future directions for improving speech-to-text translation. Our findings suggest that the gap between cascade and E2E system performance in both online and offline settings is narrowing; however, for some language pairs the gap is still wide, and additional work is therefore warranted. Our goal in the present survey is to offer valuable insight into this topic and drive advancements in ST research. We believe that such reviews will be interesting to researchers.

References

  • Abbott (1999) Abbott, L.F., 1999. Lapicque’s introduction of the integrate-and-fire model neuron (1907). Brain Research Bulletin 50, 303–304.
  • Agostinelli et al. (2023) Agostinelli, V., Wild, M., Raffel, M., Fuad, K.A.A., Chen, L., 2023. Simul-llm: A framework for exploring high-quality simultaneous translation with large language models. ArXiv abs/2312.04691.
  • Agrawal et al. (2023) Agrawal, S., Anastasopoulos, A., Bentivogli, L., Bojar, O., Borg, C., Carpuat, M., Cattoni, R., Cettolo, M., Chen, M., Chen, W., Choukri, K., Chronopoulou, A., Currey, A., Declerck, T., Dong, Q., Duh, K., Estève, Y., Federico, M., Gahbiche, S., Haddow, B., Hsu, B., Mon Htut, P., Inaguma, H., Javorský, D., Judge, J., Kano, Y., Ko, T., Kumar, R., Li, P., Ma, X., Mathur, P., Matusov, E., McNamee, P., P. McCrae, J., Murray, K., Nadejde, M., Nakamura, S., Negri, M., Nguyen, H., Niehues, J., Niu, X., Kr. Ojha, A., E. Ortega, J., Pal, P., Pino, J., van der Plas, L., Polák, P., Rippeth, E., Salesky, E., Shi, J., Sperber, M., Stüker, S., Sudoh, K., Tang, Y., Thompson, B., Tran, K., Turchi, M., Waibel, A., Wang, M., Watanabe, S., Zevallos, R., 2023. FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN, in: Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Association for Computational Linguistics, Toronto, Canada (in-person and online). pp. 1–61.
  • Alastruey et al. (2022) Alastruey, B., Ferrando, J., Gállego, G.I., Costa-jussà, M.R., 2022. On the locality of attention in direct speech translation, in: Louvan, S., Madotto, A., Madureira, B. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, Dublin, Ireland. pp. 402–412. doi:10.18653/v1/2022.acl-srw.32.
  • Anastasopoulos and Chiang (2018) Anastasopoulos, A., Chiang, D., 2018. Tied multitask learning for neural speech translation, in: Walker, M., Ji, H., Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana. pp. 82–91. doi:10.18653/v1/N18-1008.
  • Anastasopoulos et al. (2016) Anastasopoulos, A., Chiang, D., Duong, L., 2016. An unsupervised probability model for speech-to-translation alignment of low-resource languages, in: Su, J., Duh, K., Carreras, X. (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas. pp. 1255–1263. doi:10.18653/v1/D16-1133.
  • Ao et al. (2021) Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., et al., 2021. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205 .
  • Arivazhagan et al. (2019a) Arivazhagan, N., Cherry, C., I, T., Macherey, W., Baljekar, P.N., Foster, G.F., 2019a. Re-translation strategies for long form, simultaneous, spoken language translation. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7919–7923.
  • Arivazhagan et al. (2019b) Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.C., Yavuz, S., Pang, R., Li, W., Raffel, C., 2019b. Monotonic infinite lookback attention for simultaneous machine translation, in: Annual Meeting of the Association for Computational Linguistics.
  • Arivazhagan et al. (2020) Arivazhagan, N., Cherry, C., Macherey, W., Foster, G.F., 2020. Re-translation versus streaming for simultaneous translation, in: International Workshop on Spoken Language Translation.
  • Baevski et al. (2022) Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M., 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language, in: International Conference on Machine Learning, PMLR. pp. 1298–1312.
  • Baevski et al. (2020) Baevski, A., Zhou, H., Mohamed, A., Auli, M., 2020. wav2vec 2.0: a framework for self-supervised learning of speech representations, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA.
  • Bahar et al. (2019a) Bahar, P., Bieschke, T., Ney, H., 2019a. A comparative study on end-to-end speech to text translation, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE. pp. 792–799.
  • Bahar et al. (2020) Bahar, P., Wilken, P., Alkhouli, T., Guta, A., Golik, P., Matusov, E., Herold, C., 2020. Start-before-end and end-to-end: Neural speech translation by apptek and rwth aachen university, in: International Workshop on Spoken Language Translation.
  • Bahar et al. (2019b) Bahar, P., Zeyer, A., Schlüter, R., Ney, H., 2019b. On using SpecAugment for end-to-end speech translation, in: Niehues, J., Cattoni, R., Stüker, S., Negri, M., Turchi, M., Ha, T.L., Salesky, E., Sanabria, R., Barrault, L., Specia, L., Federico, M. (Eds.), Proceedings of the 16th International Conference on Spoken Language Translation, Association for Computational Linguistics, Hong Kong.
  • Banerjee and Lavie (2005) Banerjee, S., Lavie, A., 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72.
  • Bansal et al. (2019) Bansal, S., Kamper, H., Livescu, K., Lopez, A., Goldwater, S., 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, in: Burstein, J., Doran, C., Solorio, T. (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp. 58–68. doi:10.18653/v1/N19-1006.
  • Bansal et al. (2017) Bansal, S., Kamper, H., Lopez, A., Goldwater, S., 2017. Towards speech-to-text translation without speech recognition, in: Lapata, M., Blunsom, P., Koller, A. (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Association for Computational Linguistics, Valencia, Spain. pp. 474–479.
  • Bapna et al. (2021) Bapna, A., Chung, Y.A., Wu, N., Gulati, A., Jia, Y., Clark, J., Johnson, M., Riesa, J., Conneau, A., Zhang, Y., 2021. Slam: A unified encoder for speech and language modeling via speech-text joint pre-training. ArXiv abs/2110.10329.
  • Barrault et al. (2023) Barrault, L., Chung, Y.A., Meglioli, M.C., Dale, D., Dong, N., Duquenne, P.A., ElSahar, H., Gong, H., Heffernan, K., Hoffman, J., Klaiber, C., Li, P., Licht, D., Maillard, J., Rakotoarison, A., Sadagopan, K.R., Wenzek, G., Ye, E., Akula, B., Chen, P.J., Hachem, N.E., Ellis, B., Gonzalez, G.M., Haaheim, J., Hansanti, P., Howes, R., Huang, B., Hwang, M.J., Inaguma, H., Jain, S., Kalbassi, E., Kallet, A., Kulikov, I., Lam, J., Li, S.W., Ma, X., Mavlyutov, R., Peloquin, B., Ramadan, M., Ramakrishnan, A., Sun, A., Tran, K.M., Tran, T., Tufanov, I., Vogeti, V., Wood, C., Yang, Y., Yu, B., Andrews, P.Y., Balioglu, C., Costa-jussà, M.R., Çelebi, O., Elbayad, M., Gao, C., Guzm’an, F., Kao, J.T., Lee, A., Mourachko, A., Pino, J.M., Popuri, S., Ropers, C., Saleem, S., Schwenk, H., Tomasello, P., Wang, C., Wang, J., Wang, S., 2023. Seamlessm4t: Massively multilingual&multimodal machine translation.
  • Bentivogli et al. (2021) Bentivogli, L., Cettolo, M., Gaido, M., Karakanta, A., Martinelli, A., Negri, M., Turchi, M., 2021. Cascade versus direct speech translation: Do the differences still make a difference?, in: Annual Meeting of the Association for Computational Linguistics.
  • Bérard et al. (2018) Bérard, A., Besacier, L., Kocabiyikoglu, A.C., Pietquin, O., 2018. End-to-end automatic speech translation of audiobooks, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 6224–6228.
  • Bérard et al. (2016) Bérard, A., Pietquin, O., Besacier, L., Servan, C., 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation, in: NIPS Workshop on end-to-end learning for speech and audio processing, Barcelona, Spain.
  • Bozinovski and Fulgosi (1976) Bozinovski, S., Fulgosi, A., 1976. The influence of pattern similarity and transfer learning upon training of a base perceptron b2, in: Proceedings of Symposium Informatica, pp. 121–126.
  • Brauwers and Frasincar (2022) Brauwers, G., Frasincar, F., 2022. A general survey on attention mechanisms in deep learning. IEEE Transactions on Knowledge and Data Engineering 35, 3279–3298.
  • Bucilǎ et al. (2006) Bucilǎ, C., Caruana, R., Niculescu-Mizil, A., 2006. Model compression. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2006, 535–541. doi:10.1145/1150402.1150464.
  • Cattoni et al. (2021) Cattoni, R., Di Gangi, M.A., Bentivogli, L., Negri, M., Turchi, M., 2021. Must-c: A multilingual corpus for end-to-end speech translation. Computer speech & language 66, 101155.
  • Chang and yi Lee (2022) Chang, C.C., yi Lee, H., 2022. Exploring continuous integrate-and-fire for adaptive simultaneous speech translation. ArXiv abs/2204.09595.
  • Chen et al. (2021a) Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.Q., Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J., et al., 2021a. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909 .
  • Chen et al. (2020) Chen, J., Ma, M., Zheng, R., Huang, L., 2020. Mam: Masked acoustic modeling for end-to-end speech-to-text translation. arXiv preprint arXiv:2010.11445 .
  • Chen et al. (2021b) Chen, J., Ma, M., Zheng, R., Huang, L., 2021b. Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online. pp. 4618–4624. doi:10.18653/v1/2021.findings-acl.406.
  • Chen et al. (2022) Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., Yu, X., Wei, F., 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing 16, 1505–1518. doi:10.1109/JSTSP.2022.3188113.
  • Cheng et al. (2022) Cheng, X., Dong, Q., Yue, F., Ko, T., Wang, M., Zou, Y., 2022. M3st: Mix at three levels for speech translation. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1–5.
  • Cherry and Foster (2019) Cherry, C., Foster, G.F., 2019. Thinking slow about latency evaluation for simultaneous machine translation. ArXiv abs/1906.00048.
  • Chiu and Raffel (2018) Chiu, C.C., Raffel, C., 2018. Monotonic chunkwise attention, in: International Conference on Learning Representations.
  • Cho and Esipova (2016a) Cho, K., Esipova, M., 2016a. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012 .
  • Cho and Esipova (2016b) Cho, K., Esipova, M., 2016b. Can neural machine translation do simultaneous translation? ArXiv abs/1606.02012.
  • Cho et al. (2021) Cho, W.I., Kim, S.M., Cho, H., Kim, N.S., 2021. kosp2e: Korean Speech to English Translation Corpus, in: Proc. Interspeech 2021, pp. 3705–3709. doi:10.21437/Interspeech.2021-1040.
  • Cho et al. (2022) Cho, W.I., Moon, S., Kim, J., Kim, S., Kim, N.S., 2022. StyleKQC: A style-variant paraphrase corpus for Korean questions and commands, in: Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., Piperidis, S. (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France. pp. 7122–7128.
  • Chopra et al. (2005) Chopra, S., Hadsell, R., LeCun, Y., 2005. Learning a similarity metric discriminatively, with application to face verification, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE. pp. 539–546.
  • Chuang et al. (2021) Chuang, S.P., Chuang, Y.S., Chang, C.C., Lee, H.y., 2021. Investigating the reordering capability in CTC-based non-autoregressive end-to-end speech translation, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online. pp. 1068–1077. doi:10.18653/v1/2021.findings-acl.92.
  • Chung and Glass (2018) Chung, Y.A., Glass, J., 2018. Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech, in: Proc. Interspeech 2018, pp. 811–815. doi:10.21437/Interspeech.2018-2341.
  • Chung et al. (2021) Chung, Y.A., Zhang, Y., Han, W., Chiu, C.C., Qin, J., Pang, R., Wu, Y., 2021. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , 244–250.
  • Cieri et al. (2004) Cieri, C., Miller, D., Walker, K., 2004. The fisher corpus: A resource for the next generations of speech-to-text., in: LREC, pp. 69–71.
  • Conneau et al. (2018) Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M., 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070 .
  • Conneau and Lample (2019) Conneau, A., Lample, G., 2019. Cross-lingual language model pretraining. Curran Associates Inc., Red Hook, NY, USA.
  • Conneau et al. (2022) Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., Bapna, A., 2022. Fleurs: Few-shot learning evaluation of universal representations of speech, in: 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings, IEEE. pp. 798–805. doi:10.1109/SLT54892.2023.10023141.
  • Cui et al. (2015) Cui, X., Goel, V., Kingsbury, B., 2015. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 1469–1477.
  • Dalvi et al. (2018) Dalvi, F., Durrani, N., Sajjad, H., Vogel, S., 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation, in: Walker, M., Ji, H., Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana. pp. 493–499. URL: https://aclanthology.org/N18-2079, doi:10.18653/v1/N18-2079.
  • Deuchar (2008) Deuchar, M., 2008. The Miami corpus: Documentation file. Bangortalk, bangortalk.org.uk/docs/Miami_doc.pdf.
  • Dong and Xu (2019) Dong, L., Xu, B., 2019. Cif: Continuous integrate-and-fire for end-to-end speech recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 6079–6083.
  • Dong et al. (2020) Dong, Q., Wang, M., Zhou, H., Xu, S., Xu, B., Li, L., 2020. Consecutive decoding for speech-to-text translation, in: AAAI Conference on Artificial Intelligence.
  • Dong et al. (2021) Dong, Q., Ye, R., Wang, M., Zhou, H., Xu, S., Xu, B., Li, L., 2021. Listen, understand and translate: Triple supervision decouples end-to-end speech-to-text translation, in: AAAI Conference on Artificial Intelligence.
  • Dong et al. (2022) Dong, Q., Zhu, Y., Wang, M., Li, L., 2022. Learning when to translate for streaming speech, in: Muresan, S., Nakov, P., Villavicencio, A. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 680–694. doi:10.18653/v1/2022.acl-long.50.
  • Duong et al. (2016) Duong, L., Anastasopoulos, A., Chiang, D., Bird, S., Cohn, T., 2016. An attentional model for speech translation without transcription, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 949–959.
  • Etchegoyhen et al. (2022) Etchegoyhen, T., Arzelus, H., Gete, H., Alvarez, A., Torre, I.G., Martín-Doñas, J.M., González-Docasal, A., Fernandez, E.B., 2022. Cascade or direct speech translation? a case study. Applied Sciences 12, 1097.
  • Fang and Feng (2023) Fang, Q., Feng, Y., 2023. Back translation for speech-to-text translation without transcripts, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.
  • Fang et al. (2022) Fang, Q., Ye, R., Li, L., Feng, Y., Wang, M., 2022. STEMM: Self-learning with speech-text manifold mixup for speech translation, in: Muresan, S., Nakov, P., Villavicencio, A. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 7050–7062. doi:10.18653/v1/2022.acl-long.486.
  • Federmann and Lewis (2016) Federmann, C., Lewis, W., 2016. Microsoft speech language translation (mslt) corpus: The iwslt 2016 release for english, french and german, in: Proceedings of the 13th International Conference on Spoken Language Translation.
  • Fügen et al. (2007) Fügen, C., Waibel, A.H., Kolss, M., 2007. Simultaneous translation of lectures and speeches. Machine Translation 21, 209–252.
  • Gaido et al. (2020a) Gaido, M., Di Gangi, M.A., Negri, M., Turchi, M., 2020a. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020, in: Federico, M., Waibel, A., Knight, K., Nakamura, S., Ney, H., Niehues, J., Stüker, S., Wu, D., Mariani, J., Yvon, F. (Eds.), Proceedings of the 17th International Conference on Spoken Language Translation, Association for Computational Linguistics, Online. pp. 80–88. doi:10.18653/v1/2020.iwslt-1.8.
  • Gaido et al. (2020b) Gaido, M., Gangi, M.A.D., Negri, M., Turchi, M., 2020b. On knowledge distillation for direct speech translation. ArXiv abs/2012.04964.
  • Gaido et al. (2021) Gaido, M., Negri, M., Cettolo, M., Turchi, M., 2021. Beyond voice activity detection: Hybrid audio segmentation for direct speech translation, in: International Conference on Natural Language and Speech Processing.
  • Gaido et al. (2024) Gaido, M., Papi, S., Negri, M., Bentivogli, L., 2024. Speech translation with speech foundation models and large language models: What is there and what is missing? ArXiv abs/2402.12025.
  • Gállego et al. (2021) Gállego, G.I., Tsiamas, I., Escolano, C., Fonollosa, J.A.R., Costa-jussà, M.R., 2021. End-to-end speech translation with pre-trained models and adapters: Upc at iwslt 2021, in: Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), Association for Computational Linguistics, Bangkok, Thailand (online). pp. 110–119. doi:10.18653/v1/2021.iwslt-1.11.
  • Gangi et al. (2019) Gangi, M.A.D., Negri, M., Turchi, M., 2019. One-to-many multilingual end-to-end speech translation. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , 585–592.
  • Godard et al. (2018) Godard, P., Adda, G., Adda-Decker, M., Benjumea, J., Besacier, L., Cooper-Leavitt, J., Kouarata, G.N., Lamel, L., Maynard, H., Mueller, M., Rialland, A., Stueker, S., Yvon, F., Zanon-Boito, M., 2018. A very low resource language speech corpus for computational language documentation experiments, in: Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., Tokunaga, T. (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
  • Goldman-Eisler (1972) Goldman-Eisler, F., 1972. Segmentation of input in simultaneous translation. Journal of Psycholinguistic Research 1, 127–140.
  • Graves (2012) Graves, A., 2012. Sequence transduction with recurrent neural networks. ArXiv abs/1211.3711.
  • Graves et al. (2006) Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd International Conference on Machine Learning, Association for Computing Machinery, New York, NY, USA. p. 369–376.
  • Grissom II et al. (2014) Grissom II, A., He, H., Boyd-Graber, J., Morgan, J., Daumé III, H., 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar. pp. 1342–1352. doi:10.3115/v1/D14-1140.
  • Gulati et al. (2020) Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., Pang, R., 2020. Conformer: Convolution-augmented Transformer for Speech Recognition, in: Proc. Interspeech 2020, pp. 5036–5040. doi:10.21437/Interspeech.2020-3015.
  • Guo et al. (2024) Guo, J., Wu, Z., Li, Z., Shang, H., Wei, D., Chen, X., Rao, Z., Li, S., Yang, H., 2024. R-bi: Regularized batched inputs enhance incremental decoding framework for low-latency simultaneous speech translation. ArXiv abs/2401.05700.
  • Han et al. (2021) Han, C., Wang, M., Ji, H., Li, L., 2021. Learning shared semantic space for speech-to-text translation, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP.
  • Hinton et al. (2015) Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural network URL: https://arxiv.org/abs/1503.02531v1.
  • Houlsby et al. (2019) Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S., 2019. Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR. pp. 2790–2799.
  • Hsu et al. (2021) Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A., 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3451–3460.
  • Hu et al. (2024) Hu, Y., Chen, C., Yang, C.H.H., Li, R., Zhang, D., Chen, Z., Chng, E.S., 2024. Gentranslate: Large language models are generative multilingual speech and machine translators. ArXiv abs/2402.06894.
  • Huang et al. (2023) Huang, Z., Ye, R., Ko, T., Dong, Q., Cheng, S., Wang, M., Li, H., 2023. Speech translation with large language models: An industrial practice. ArXiv abs/2312.13585.
  • Huzaifah and Kukanov (2023) Huzaifah, M., Kukanov, I., 2023. An analysis of semantically-aligned speech-text embeddings, in: 2022 IEEE Spoken Language Technology Workshop (SLT), IEEE. pp. 747–754.
  • Hwang et al. (2024) Hwang, D., Wang, W., Huo, Z., Sim, K.C., Mengibar, P.M., 2024. Transformerfam: Feedback attention is working memory. arXiv:2404.09173.
  • Inaguma et al. (2019) Inaguma, H., Duh, K., Kawahara, T., Watanabe, S., 2019. Multilingual end-to-end speech translation. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , 570–577.
  • Inaguma et al. (2020a) Inaguma, H., Higuchi, Y., Duh, K., Kawahara, T., Watanabe, S., 2020a. Orthros: non-autoregressive end-to-end speech translation with dual-decoder. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7503–7507.
  • Inaguma et al. (2021) Inaguma, H., Higuchi, Y., Duh, K., Kawahara, T., Watanabe, S., 2021. Non-autoregressive end-to-end speech translation with parallel autoregressive rescoring. ArXiv abs/2109.04411. URL: https://api.semanticscholar.org/CorpusID:237453587.
  • Inaguma et al. (2020b) Inaguma, H., Kiyono, S., Duh, K., Karita, S., Yalta, N., Hayashi, T., Watanabe, S., 2020b. ESPnet-ST: All-in-one speech translation toolkit, in: Celikyilmaz, A., Wen, T.H. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online. pp. 302–311. doi:10.18653/v1/2020.acl-demos.34.
  • Iranzo-Sánchez et al. (2022) Iranzo-Sánchez, J., Saiz, J.C., Juan, A., 2022. From simultaneous to streaming machine translation by leveraging streaming history, in: Annual Meeting of the Association for Computational Linguistics.
  • Iranzo-Sánchez et al. (2020) Iranzo-Sánchez, J., Silvestre-Cerda, J.A., Jorge, J., Roselló, N., Giménez, A., Sanchis, A., Civera, J., Juan, A., 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 8229–8233.
  • Jaegle et al. (2021) Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O., Carreira, J., 2021. Perceiver: General perception with iterative attention. CoRR abs/2103.03206. URL: https://arxiv.org/abs/2103.03206, arXiv:2103.03206.
  • Jia et al. (2019) Jia, Y., Johnson, M., Macherey, W., Weiss, R.J., Cao, Y., Chiu, C.C., Ari, N., Laurenzo, S., Wu, Y., 2019. Leveraging weakly supervised data to improve end-to-end speech-to-text translation, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 7180–7184.
  • Jurafsky and Martin (2008) Jurafsky, D., Martin, J.H., 2008. Speech and language processing, 2nd edition.
  • Kahn et al. (2019) Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., rahman Mohamed, A., Dupoux, E., 2019. Libri-light: A benchmark for ASR with limited or no supervision. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669–7673.
  • Kano et al. (2023) Kano, Y., Sudoh, K., Nakamura, S., 2023. Average token delay: A duration-aware latency metric for simultaneous translation. ArXiv abs/2311.14353.
  • Khurana et al. (2020) Khurana, S., Laurent, A., Glass, J., 2020. Cstnet: Contrastive speech translation network for self-supervised speech representation learning. arXiv preprint arXiv:2006.02814 .
  • Kim et al. (2017) Kim, S., Hori, T., Watanabe, S., 2017. Joint ctc-attention based end-to-end speech recognition using multi-task learning, in: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE. pp. 4835–4839.
  • Kocabiyikoglu et al. (2018) Kocabiyikoglu, A.C., Besacier, L., Kraif, O., 2018. Augmenting librispeech with French translations: A multimodal corpus for direct speech translation evaluation, in: Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., Tokunaga, T. (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
  • Lam et al. (2024) Lam, T.K., Birch, A., Haddow, B., 2024. Compact speech translation models via discrete speech units pretraining. ArXiv abs/2402.19333.
  • Lam et al. (2020) Lam, T.K., Schamoni, S., Riezler, S., 2020. Cascaded models with cyclic feedback for direct speech translation. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7508–7512.
  • Lam et al. (2022a) Lam, T.K., Schamoni, S., Riezler, S., 2022a. Make more of your data: Minimal effort data augmentation for automatic speech recognition and translation. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1–5.
  • Lam et al. (2022b) Lam, T.K., Schamoni, S., Riezler, S., 2022b. Sample, translate, recombine: Leveraging audio alignments for data augmentation in end-to-end speech translation, in: Annual Meeting of the Association for Computational Linguistics.
  • Larochelle and Hinton (2010) Larochelle, H., Hinton, G.E., 2010. Learning to combine foveal glimpses with a third-order boltzmann machine, in: Neural Information Processing Systems.
  • Le et al. (2023a) Le, C., Qian, Y., Zhou, L., LIU, S., Qian, Y., Zeng, M., Huang, X., 2023a. ComSL: A composite speech-language model for end-to-end speech-to-text translation, in: Thirty-seventh Conference on Neural Information Processing Systems. URL: https://openreview.net/forum?id=6Qx7G1xrAk.
  • Le et al. (2020) Le, H., Pino, J., Wang, C., Gu, J., Schwab, D., Besacier, L., 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation, in: Scott, D., Bel, N., Zong, C. (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online). pp. 3520–3533. doi:10.18653/v1/2020.coling-main.314.
  • Le et al. (2021) Le, H., Pino, J., Wang, C., Gu, J., Schwab, D., Besacier, L., 2021. Lightweight adapter tuning for multilingual speech translation, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online.
  • Le et al. (2023b) Le, P.H., Gong, H., Wang, C., Pino, J., Lecouteux, B., Schwab, D., 2023b. Pre-training for speech translation: Ctc meets optimal transport, in: Proceedings of the 40th International Conference on Machine Learning, JMLR.org.
  • Lee et al. (2021) Lee, Y.K., Jung, Y., Lee, I., Park, J.E., Hahn, S., 2021. Building a psychological ground truth dataset with empathy and theory-of-mind during the covid-19 pandemic, in: Proceedings of the Annual Meeting of the Cognitive Science Society.
  • Li et al. (2020) Li, X., Wang, C., Tang, Y., Tran, C., Tang, Y., Pino, J.M., Baevski, A., Conneau, A., Auli, M., 2020. Multilingual speech translation from efficient finetuning of pretrained models, in: Annual Meeting of the Association for Computational Linguistics.
  • Lin (1991) Lin, J., 1991. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory 37, 145–151.
  • Liu et al. (2021a) Liu, D., Du, M., Li, X., Li, Y., Chen, E., 2021a. Cross attention augmented transducer networks for simultaneous translation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 39–55.
  • Liu et al. (2020a) Liu, D., Spanakis, G., Niehues, J., 2020a. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection, in: Interspeech.
  • Liu et al. (2023) Liu, X.B., Zhang, J., Ferrer, L., Xu, S., Bahirwani, V., Smus, B., Olwal, A., Du, R., 2023. Modeling and improving text stability in live captions. Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems .
  • Liu et al. (2020b) Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L., 2020b. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742.
  • Liu et al. (2020c) Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L., 2020c. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742.
  • Liu et al. (2019) Liu, Y., Xiong, H., Zhang, J., He, Z., Wu, H., Wang, H., Zong, C., 2019. End-to-End Speech Translation with Knowledge Distillation, in: Proc. Interspeech 2019, pp. 1128–1132. doi:10.21437/Interspeech.2019-2582.
  • Liu et al. (2020d) Liu, Y., Zhu, J., Zhang, J., Zong, C., 2020d. Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920 .
  • Liu et al. (2021b) Liu, Z., Lin, Y., Sun, M., 2021b. Representation learning for natural language processing. CoRR abs/2102.03732. URL: https://arxiv.org/abs/2102.03732, arXiv:2102.03732.
  • Ma et al. (2018) Ma, M., Huang, L., Xiong, H., Zheng, R., Liu, K., Zheng, B., Zhang, C., He, Z., Liu, H., Li, X., Wu, H., Wang, H., 2018. Stacl: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework, in: Annual Meeting of the Association for Computational Linguistics.
  • Ma et al. (2020a) Ma, X., Dousti, M.J., Wang, C., Gu, J., Pino, J.M., 2020a. Simuleval: An evaluation toolkit for simultaneous translation, in: Conference on Empirical Methods in Natural Language Processing.
  • Ma et al. (2020b) Ma, X., Pino, J., Koehn, P., 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation, in: Wong, K.F., Knight, K., Wu, H. (Eds.), Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China. pp. 582–587.
  • Ma et al. (2019) Ma, X., Pino, J.M., Cross, J., Puzon, L., Gu, J., 2019. Monotonic multihead attention. ArXiv abs/1909.12406.
  • Ma et al. (2023) Ma, X., Sun, A.Y., Ouyang, S., Inaguma, H., Tomasello, P., 2023. Efficient monotonic multihead attention. ArXiv abs/2312.04515.
  • Ma et al. (2020c) Ma, X., Wang, Y., Dousti, M.J., Koehn, P., Pino, J.M., 2020c. Streaming simultaneous speech translation with augmented memory transformer. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7523–7527.
  • Marie et al. (2021) Marie, B., Fujita, A., Rubino, R., 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online. pp. 7297–7306. doi:10.18653/v1/2021.acl-long.566.
  • Matusov et al. (2007) Matusov, E., Hillard, D., Magimai-Doss, M., Hakkani-Tür, D.Z., Ostendorf, M., Ney, H., 2007. Improving speech translation with automatic boundary prediction, in: Interspeech.
  • Matusov et al. (2018) Matusov, E., Wilken, P., Bahar, P., Schamper, J., Golik, P., Zeyer, A., Silvestre-Cerdà, J.A., Martinez-Villaronga, A.A., Pesch, H., Peter, J.T., 2018. Neural speech translation at apptek, in: International Workshop on Spoken Language Translation.
  • Meng et al. (2021) Meng, L., Xu, J., Tan, X., Wang, J., Qin, T., Xu, B., 2021. Mixspeech: Data augmentation for low-resource automatic speech recognition. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 7008–7012.
  • Mnih et al. (2014) Mnih, V., Heess, N.M.O., Graves, A., Kavukcuoglu, K., 2014. Recurrent models of visual attention, in: Neural Information Processing Systems.
  • Mohamed et al. (2022) Mohamed, A., Lee, H.y., Borgholt, L., Havtorn, J.D., Edin, J., Igel, C., Kirchhoff, K., Li, S.W., Livescu, K., Maaloe, L., Sainath, T.N., Watanabe, S., 2022. Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing 16, 1179–1210. doi:10.1109/jstsp.2022.3207050.
  • Munkhdalai et al. (2024) Munkhdalai, T., Faruqui, M., Gopal, S., 2024. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv:2404.07143.
  • Nguyen et al. (2021) Nguyen, T.S., Stüker, S., Waibel, A., 2021. Super-Human Performance in Online Low-Latency Recognition of Conversational Speech, in: Proc. Interspeech 2021, pp. 1762–1766. doi:10.21437/Interspeech.2021-1114.
  • Niehues et al. (2016) Niehues, J., Nguyen, T.S., Cho, E., Ha, T.L., Kilgour, K., Müller, M., Sperber, M., Stüker, S., Waibel, A.H., 2016. Dynamic transcription for low-latency speech translation, in: Interspeech.
  • Niehues et al. (2018) Niehues, J., Pham, N.Q., Ha, T.L., Sperber, M., Waibel, A., 2018. Low-Latency Neural Speech Translation, in: Proc. Interspeech 2018, pp. 1293–1297. doi:10.21437/Interspeech.2018-1055.
  • Ochshorn and Hawkins (2017) Ochshorn, R., Hawkins, M., 2017. Gentle forced aligner. github.com/lowerquality/gentle.
  • Oda et al. (2014) Oda, Y., Neubig, G., Sakti, S., Toda, T., Nakamura, S., 2014. Optimizing segmentation strategies for simultaneous speech translation, in: Annual Meeting of the Association for Computational Linguistics.
  • van den Oord et al. (2017) van den Oord, A., Vinyals, O., Kavukcuoglu, K., 2017. Neural discrete representation learning, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA. p. 6309–6318.
  • Ott et al. (2019) Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., Auli, M., 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 .
  • Ouyang et al. (2023) Ouyang, S., Ye, R., Li, L., 2023. WACO: Word-aligned contrastive learning for speech translation, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada. pp. 3891–3907. doi:10.18653/v1/2023.acl-long.216.
  • Panayotov et al. (2015) Panayotov, V., Chen, G., Povey, D., Khudanpur, S., 2015. Librispeech: an asr corpus based on public domain audio books, in: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE. pp. 5206–5210.
  • Papi et al. (2021a) Papi, S., Gaido, M., Negri, M., Turchi, M., 2021a. Speechformer: Reducing information loss in direct speech translation, in: Conference on Empirical Methods in Natural Language Processing.
  • Papi et al. (2022a) Papi, S., Gaido, M., Negri, M., Turchi, M., 2022a. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation, in: Ive, J., Zhang, R. (Eds.), Proceedings of the Third Workshop on Automatic Simultaneous Translation, Association for Computational Linguistics, Online. pp. 12–17. doi:10.18653/v1/2022.autosimtrans-1.2.
  • Papi et al. (2021b) Papi, S., Negri, M., Turchi, M., 2021b. Visualization: The missing factor in simultaneous speech translation. ArXiv abs/2111.00514.
  • Papi et al. (2022b) Papi, S., Negri, M., Turchi, M., 2022b. Attention as a guide for simultaneous speech translation, in: Annual Meeting of the Association for Computational Linguistics.
  • Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., Zhu, W.J., 2002. Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318.
  • Parcollet et al. (2024) Parcollet, T., Nguyen, H., Evain, S., Boito, M.Z., Pupier, A., Mdhaffar, S., Le, H., Alisamir, S., Tomashenko, N., Dinarelli, M., et al., 2024. Lebenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of french speech. Computer Speech & Language , 101622.
  • Park et al. (2019) Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., Le, Q.V., 2019. Specaugment: A simple data augmentation method for automatic speech recognition, in: Interspeech.
  • Park (2018) Park, K., 2018. Kss dataset: Korean single speaker speech dataset.
  • Paulik and Waibel (2013) Paulik, M., Waibel, A., 2013. Training speech translation from audio recordings of interpreter-mediated communication. Computer Speech & Language 27, 455–474.
  • Peyré et al. (2019) Peyré, G., Cuturi, M., et al., 2019. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11, 355–607.
  • Popović (2015) Popović, M., 2015. chrF: character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395.
  • Popuri et al. (2022) Popuri, S., Chen, P.J., Wang, C., Pino, J., Adi, Y., Gu, J., Hsu, W.N., Lee, A., 2022. Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation, in: Proc. Interspeech 2022, pp. 5195–5199. doi:10.21437/Interspeech.2022-11032.
  • Potapczyk and Przybysz (2020) Potapczyk, T., Przybysz, P., 2020. Srpol’s system for the iwslt 2020 end-to-end speech translation task, in: International Workshop on Spoken Language Translation.
  • Prabhavalkar et al. (2024) Prabhavalkar, R., Hori, T., Sainath, T.N., Schlüter, R., Watanabe, S., 2024. End-to-end speech recognition: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, 325–351. doi:10.1109/TASLP.2023.3328283.
  • Rabiner and Schafer (2010) Rabiner, L., Schafer, R., 2010. Theory and applications of digital speech processing. Prentice Hall Press.
  • Radford et al. (2023) Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I., 2023. Robust speech recognition via large-scale weak supervision, in: Proceedings of the 40th International Conference on Machine Learning, JMLR.org.
  • Raffel et al. (2017) Raffel, C., Luong, M.T., Liu, P.J., Weiss, R.J., Eck, D., 2017. Online and linear-time attention by enforcing monotonic alignments, in: International Conference on Machine Learning.
  • Ren et al. (2020) Ren, Y., Liu, J., Tan, X., Zhang, C., Qin, T., Zhao, Z., Liu, T.Y., 2020. Simulspeech: End-to-end simultaneous speech to text translation, in: Annual Meeting of the Association for Computational Linguistics.
  • Salesky et al. (2021) Salesky, E., Wiesner, M., Bremerman, J., Cattoni, R., Negri, M., Turchi, M., Oard, D.W., Post, M., 2021. Multilingual tedx corpus for speech recognition and translation, in: Proceedings of Interspeech.
  • Sanabria et al. (2018) Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L., Metze, F., 2018. How2: A Large-scale Dataset for Multimodal Language Understanding, in: NeurIPS, Montréal, Canada.
  • Sandhan et al. (2022) Sandhan, J., Daksh, A., Paranjay, O.A., Behera, L., Goyal, P., 2022. Prabhupadavani: A code-mixed speech translation data for 25 languages, in: Degaetano, S., Kazantseva, A., Reiter, N., Szpakowicz, S. (Eds.), Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, International Conference on Computational Linguistics, Gyeongju, Republic of Korea. pp. 24–29.
  • Sarkar et al. (2023) Sarkar, B., Maurya, C.K., Agrahri, A., 2023. Direct speech to text translation: Bridging the modality gap using simsiam, in: Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), pp. 250–255.
  • Schlenoff et al. (2009) Schlenoff, C., Sanders, G., Weiss, B., Proctor, F., Steves, M.P., Virts, A., 2009. Evaluating speech translation systems: Applying score to transtac technologies, in: Proceedings of the 9th Workshop on Performance Metrics for Intelligent Systems, pp. 223–230.
  • Schneider et al. (2019) Schneider, S., Baevski, A., Collobert, R., Auli, M., 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862 .
  • Sergeev and Balso (2018) Sergeev, A., Balso, M.D., 2018. Horovod: fast and easy distributed deep learning in tensorflow. CoRR abs/1802.05799. URL: http://arxiv.org/abs/1802.05799, arXiv:1802.05799.
  • Sethiya et al. (2024) Sethiya, N., Nair, S., Maurya, C., 2024. Indic-TEDST: Datasets and baselines for low-resource speech to text translation, in: Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N. (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia. pp. 9019–9024.
  • Snover et al. (2006) Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J., 2006. A study of translation edit rate with targeted human annotation, in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pp. 223–231.
  • Sohn et al. (1999) Sohn, J., Kim, N.S., Sung, W., 1999. A statistical model-based voice activity detection. IEEE Signal Processing Letters 6, 1–3.
  • Sohn (2016) Sohn, K., 2016. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems 29.
  • Sperber et al. (2019) Sperber, M., Neubig, G., Niehues, J., Waibel, A., 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics 7, 313–325.
  • Su et al. (2021) Su, J., Cao, J., Liu, W., Ou, Y., 2021. Whitening sentence representations for better semantics and faster retrieval. arXiv:2103.15316.
  • Sun et al. (2023) Sun, H., Zhao, X., Lei, Y., Zhu, S., Xiong, D., 2023. Towards a deep understanding of multilingual end-to-end speech translation, in: Bouamor, H., Pino, J., Bali, K. (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore. pp. 14332–14348. doi:10.18653/v1/2023.findings-emnlp.956.
  • Tan et al. (2024) Tan, W., Chen, Y., Chen, T., Qin, G., Xu, H., Zhang, H.C., Durme, B.V., Koehn, P., 2024. Streaming sequence transduction through dynamic compression. ArXiv abs/2402.01172.
  • Tang et al. (2022) Tang, Y., Gong, H., Dong, N., Wang, C., Hsu, W.N., Gu, J., Baevski, A., Li, X., Mohamed, A., Auli, M., Pino, J., 2022. Unified speech-text pre-training for speech translation and recognition, in: Muresan, S., Nakov, P., Villavicencio, A. (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 1488–1499. doi:10.18653/v1/2022.acl-long.105.
  • Tang et al. (2021a) Tang, Y., Pino, J., Li, X., Wang, C., Genzel, D., 2021a. Improving speech translation by understanding and learning from the auxiliary text translation task, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online. pp. 4252–4261. doi:10.18653/v1/2021.acl-long.328.
  • Tang et al. (2021b) Tang, Y., Pino, J.M., Li, X., Wang, C., Genzel, D., 2021b. Improving speech translation by understanding and learning from the auxiliary text translation task. ArXiv abs/2107.05782.
  • Tran et al. (2020) Tran, C., Wang, C., Tang, Y., Tang, Y., Pino, J.M., Li, X., 2020. Cross-modal transfer learning for multilingual speech-to-text translation. ArXiv abs/2010.12829.
  • Tsiamas et al. (2022a) Tsiamas, I., Gállego, G.I., Escolano, C., Fonollosa, J., Costa-jussà, M.R., 2022a. Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022, in: Salesky, E., Federico, M., Costa-jussà, M. (Eds.), Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), Association for Computational Linguistics, Dublin, Ireland (in-person and online). pp. 265–276. doi:10.18653/v1/2022.iwslt-1.23.
  • Tsiamas et al. (2022b) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2022b. Efficient speech translation with dynamic latent perceivers. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1–5.
  • Tsiamas et al. (2022c) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2022c. Shas: Approaching optimal segmentation for end-to-end speech translation, in: Interspeech.
  • Tsiamas et al. (2024) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2024. Pushing the limits of zero-shot end-to-end speech translation. ArXiv abs/2402.10422.
  • Tsiamas et al. (2023) Tsiamas, I., Gállego, G.I., Fonollosa, J.A.R., Costa-jussà, M.R., 2023. Speech translation with foundation models and optimal transport: UPC at IWSLT23, in: Salesky, E., Federico, M., Carpuat, M. (Eds.), Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Association for Computational Linguistics, Toronto, Canada (in-person and online). pp. 397–410. doi:10.18653/v1/2023.iwslt-1.38.
  • Tsiartas et al. (2013) Tsiartas, A., Ghosh, P., Georgiou, P., Narayanan, S., 2013. High-quality bilingual subtitle document alignments with application to spontaneous speech translation. Computer Speech & Language 27, 572–591.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need, in: NIPS.
  • Vincent et al. (2017) Vincent, E., Watanabe, S., Nugraha, A.A., Barker, J., Marxer, R., 2017. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language 46, 535–557.
  • Wang et al. (2022) Wang, C., Inaguma, H., Chen, P.J., Kulikov, I., Tang, Y., Hsu, W.N., Auli, M., Pino, J., 2022. Simple and effective unsupervised speech translation. arXiv preprint arXiv:2210.10191 .
  • Wang et al. (2020a) Wang, C., Pino, J.M., Wu, A., Gu, J., 2020a. Covost: A diverse multilingual speech-to-text translation corpus, in: International Conference on Language Resources and Evaluation.
  • Wang et al. (2021a) Wang, C., Rivière, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J.M., Dupoux, E., 2021a. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, in: Annual Meeting of the Association for Computational Linguistics.
  • Wang et al. (2020b) Wang, C., Tang, Y., Ma, X., Wu, A., Popuri, S., Okhonko, D., Pino, J., 2020b. Fairseq s2t: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171 .
  • Wang et al. (2020c) Wang, C., Wu, A., Pino, J.M., 2020c. Covost 2 and massively multilingual speech-to-text translation. arXiv preprint.
  • Wang et al. (2021b) Wang, C., Wu, A., Pino, J.M., Baevski, A., Auli, M., Conneau, A., 2021b. Large-scale self- and semi-supervised learning for speech translation, in: Interspeech.
  • Wang et al. (2020d) Wang, C., Wu, Y., Liu, S., Zhou, M., Yang, Z., 2020d. Curriculum pre-training for end-to-end speech translation. arXiv preprint arXiv:2004.10093 .
  • Wang et al. (2023) Wang, P., Sun, E., Xue, J., Wu, Y., Zhou, L., Gaur, Y., Liu, S., Li, J., 2023. LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers, in: Proc. INTERSPEECH 2023, pp. 57–61. doi:10.21437/Interspeech.2023-2004.
  • Wei et al. (2022) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., hsin Chi, E.H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., Fedus, W., 2022. Emergent abilities of large language models. ArXiv abs/2206.07682.
  • Weiss et al. (2017) Weiss, R.J., Chorowski, J., Jaitly, N., Wu, Y., Chen, Z., 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech, in: Proc. Interspeech 2017, pp. 2625–2629. doi:10.21437/Interspeech.2017-503.
  • Weller et al. (2022) Weller, O., Sperber, M., Pires, T., Setiawan, H., Gollan, C., Telaar, D., Paulik, M., 2022. End-to-end speech translation for code switched speech. arXiv preprint arXiv:2204.05076 .
  • Wu et al. (2020) Wu, C., Wang, Y., Shi, Y., Yeh, C.F., Zhang, F., 2020. Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory, in: Proc. Interspeech 2020, pp. 2132–2136. doi:10.21437/Interspeech.2020-2079.
  • Wu (2020) Wu, F., 2020. Deep representation learning in computer vision and its applications.
  • Wu et al. (2022) Wu, F., Kim, K., Watanabe, S., Han, K.J., McDonald, R.T., Weinberger, K.Q., Artzi, Y., 2022. Wav2seq: Pre-training speech-to-text encoder-decoder models using pseudo languages. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 1–5.
  • Wu et al. (2023) Wu, H., Chang, K.W., Wu, Y.K., Lee, H.y., 2023. Speechgen: Unlocking the generative power of speech language models with prompts. ArXiv abs/2306.02207.
  • Xie and Hansen (2023) Xie, J., Hansen, J.H.L., 2023. Mixrep: Hidden representation mixup for low-resource speech recognition. INTERSPEECH 2023 .
  • Xu et al. (2023a) Xu, C., Liu, X., Liu, X., Sun, Q., Zhang, Y., Yang, M., Dong, Q., Ko, T., Wang, M., Xiao, T., Ma, A., Zhu, J., 2023a. CTC-based non-autoregressive speech translation, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada. pp. 13321–13339. doi:10.18653/v1/2023.acl-long.744.
  • Xu et al. (2023b) Xu, C., Ye, R., Dong, Q., Zhao, C., Ko, T., Wang, M., Xiao, T., Zhu, J., 2023b. Recent advances in direct speech-to-text translation. ArXiv abs/2306.11646.
  • Xue et al. (2022) Xue, J., Wang, P., Li, J., Post, M., Gaur, Y., 2022. Large-scale streaming end-to-end speech translation with neural transducers. arXiv preprint arXiv:2204.05352 .
  • Yan et al. (2023) Yan, B., Shi, J., Maiti, S., Chen, W., Li, X., Peng, Y., Arora, S., Watanabe, S., 2023. Cmu’s iwslt 2023 simultaneous speech translation system, in: International Workshop on Spoken Language Translation.
  • Yang et al. (2023) Yang, C.K., Huang, K.P., Lu, K.H., Kuan, C.Y., Hsiao, C.Y., Lee, H.y., 2023. Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision. ArXiv abs/2401.00273.
  • Yao and Haddow (2020) Yao, Y., Haddow, B., 2020. Dynamic masking for improved stability in online spoken language translation, in: Conference of the Association for Machine Translation in the Americas.
  • Ye et al. (2021) Ye, R., Wang, M., Li, L., 2021. End-to-end speech translation via cross-modal progressive training, in: Proc. of INTERSPEECH.
  • Ye et al. (2022a) Ye, R., Wang, M., Li, L., 2022a. Cross-modal contrastive learning for speech translation, in: Carpuat, M., de Marneffe, M.C., Meza Ruiz, I.V. (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States. pp. 5099–5113. doi:10.18653/v1/2022.naacl-main.376.
  • Ye et al. (2022b) Ye, R., Zhao, C., Ko, T., Meng, C., Wang, T., Wang, M., Cao, J., 2022b. Gigast: A 10,000-hour pseudo speech translation corpus. arXiv preprint arXiv:2204.03939 .
  • Yin et al. (2023) Yin, W., Liu, Z., Zhao, C., Wang, T., Tong, J., Ye, R., 2023. Improving speech translation by fusing speech and text, in: The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Yu et al. (2023) Yu, T., Ding, L., Liu, X., Chen, K., Zhang, M., Tao, D., Zhang, M., 2023. Promptst: Abstract prompt learning for end-to-end speech translation, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10140–10154.
  • Zaidi et al. (2022) Zaidi, M.A., Lee, B., Kim, S., Kim, C., 2022. Cross-modal decision regularization for simultaneous speech translation, in: Interspeech.
  • Zeng et al. (2021) Zeng, X., Li, L., Liu, Q., 2021. RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer, in: Zong, C., Xia, F., Li, W., Navigli, R. (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online. pp. 2461–2474. doi:10.18653/v1/2021.findings-acl.218.
  • Zeng et al. (2022) Zeng, X., Li, L., Liu, Q., 2022. Adatrans: Adapting with boundary-based shrinking for end-to-end speech translation. ArXiv abs/2212.08911.
  • Zenkel et al. (2018) Zenkel, T., Sperber, M., Niehues, J., Müller, M., Pham, N.Q., Stüker, S., Waibel, A., 2018. Open source toolkit for speech to text translation. Prague Bull. Math. Linguistics 111, 125–135.
  • Zhang et al. (2022a) Zhang, B., Haddow, B., Sennrich, R., 2022a. Revisiting end-to-end speech-to-text translation from scratch, in: International Conference on Machine Learning, PMLR. pp. 26193–26205.
  • Zhang et al. (2023a) Zhang, D., Ye, R., Ko, T., Wang, M., Zhou, Y., 2023a. Dub: Discrete unit back-translation for speech translation, in: Findings of ACL.
  • Zhang et al. (2023b) Zhang, H., Si, N., Chen, Y., Zhang, W., Yang, X., Qu, D., Jiao, X., 2023b. Tuning large language model for end-to-end speech translation. ArXiv abs/2310.02050.
  • Zhang et al. (2023c) Zhang, H., Si, N., Chen, Y., Zhang, W., Yang, X., Qu, D., Zhang, W.Q., 2023c. Improving speech translation by cross-modal multi-grained contrastive learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 1075–1086.
  • Zhang et al. (2022b) Zhang, R., He, Z., Wu, H., Wang, H., 2022b. Learning adaptive segmentation policy for end-to-end simultaneous translation, in: Annual Meeting of the Association for Computational Linguistics.
  • Zhang et al. (2021) Zhang, R., Wang, X., Zhang, C., He, Z., Wu, H., Li, Z., Wang, H., Chen, Y., Li, Q., 2021. BSTC: A large-scale Chinese-English speech translation dataset, in: Wu, H., Cherry, C., Huang, L., He, Z., Liu, Q., Elbayad, M., Liberman, M., Wang, H., Ma, M., Zhang, R. (Eds.), Proceedings of the Second Workshop on Automatic Simultaneous Translation, Association for Computational Linguistics, Online. pp. 28–35. doi:10.18653/v1/2021.autosimtrans-1.5.
  • Zhang and Feng (2023) Zhang, S., Feng, Y., 2023. End-to-end simultaneous speech translation with differentiable segmentation, in: Annual Meeting of the Association for Computational Linguistics.
  • Zhang et al. (2020) Zhang, S., Feng, Y., Li, L., 2020. Future-guided incremental transformer for simultaneous translation. ArXiv abs/2012.12465.
  • Zhang et al. (2019) Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y., 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 .
  • Zhang et al. (2024a) Zhang, X., Zhang, Q., Liu, H., Xiao, T., Qian, X., Ahmed, B., Ambikairajah, E., Li, H., Epps, J., 2024a. Mamba in speech: Towards an alternative to self-attention. arXiv:2405.12609.
  • Zhang et al. (2024b) Zhang, Z., Chen, S., Zhou, L., Wu, Y., Ren, S., Liu, S., Yao, Z., Gong, X., Dai, L., Li, J., et al., 2024b. Speechlm: Enhanced speech pre-training with unpaired textual data. IEEE/ACM Transactions on Audio, Speech, and Language Processing .
  • Zhao et al. (2021) Zhao, C., Wang, M., Dong, Q., Ye, R., Li, L., 2021. NeurST: Neural speech translation toolkit, in: Ji, H., Park, J.C., Xia, R. (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online. pp. 55–62. doi:10.18653/v1/2021.acl-demo.7.
  • Zhao et al. (2022) Zhao, J., Yang, H., Haffari, G., Shareghi, E., 2022. M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation, in: Proc. Interspeech 2022, pp. 111–115. doi:10.21437/Interspeech.2022-592.
  • Povey et al. (2011) Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society.
  • Zheng et al. (2021a) Zheng, R., Chen, J., Ma, M., Huang, L., 2021a. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation, in: International Conference on Machine Learning, PMLR. pp. 12736–12746.
  • Zheng et al. (2021b) Zheng, R., Chen, J., Ma, M., Huang, L., 2021b. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation, in: International Conference on Machine Learning.
  • Zhou et al. (2024) Zhou, G., Lam, T.K., Birch, A., Haddow, B., 2024. Prosody in cascade and direct speech-to-text translation: a case study on korean wh-phrases, in: Findings of EACL.
  • Zhou et al. (2022a) Zhou, X., Liu, H., Shi, C., Liu, J., 2022a. Deep Learning on Edge Computing Devices: Design Challenges of Algorithm and Architecture. Elsevier.
  • Zhou et al. (2022b) Zhou, X., Wang, J., Cui, Z., Zhang, S., Yan, Z., Zhou, J., Zhou, C., 2022b. Mmspeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition. ArXiv abs/2212.00500.
  • Zhou et al. (2023) Zhou, Y., Fang, Q., Feng, Y., 2023. Cmot: Cross-modal mixup via optimal transport for speech translation, in: Annual Meeting of the Association for Computational Linguistics.
  • Zhu et al. (2023) Zhu, Q.S., Zhou, L., Zhang, J., Liu, S.J., Hu, Y.C., Dai, L.R., 2023. Robust data2vec: Noise-robust speech representation learning for asr by combining regression and improved contrastive learning, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 1–5.