15.8. Bidirectional Encoder Representations from Transformers (BERT)

We have introduced several word embedding models for natural language understanding. After pretraining, the output can be thought of as a matrix where each row is a vector that represents a word of a predefined vocabulary. In fact, these word embedding models are all context-independent. Let’s begin by illustrating this property.

15.8.1. From Context-Independent to Context-Sensitive

Recall the experiments in Section 15.4 and Section 15.7. For instance, word2vec and GloVe both assign the same pretrained vector to the same word regardless of the context of the word (if any). Formally, a context-independent representation of any token x is a function f(x) that only takes x as its input. Given the abundance of polysemy and complex semantics in natural languages, context-independent representations have obvious limitations. For instance, the word “crane” in contexts “a crane is flying” and “a crane driver came” has completely different meanings; thus, the same word may be assigned different representations depending on contexts.

This motivates the development of context-sensitive word representations, where representations of words depend on their contexts. Hence, a context-sensitive representation of token x is a function f(x,c(x)) depending on both x and its context c(x). Popular context-sensitive representations include TagLM (language-model-augmented sequence tagger) (Peters et al., 2017), CoVe (Context Vectors) (McCann et al., 2017), and ELMo (Embeddings from Language Models) (Peters et al., 2018).
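To make the distinction concrete, the following sketch contrasts a context-independent lookup f(x) with a context-sensitive function f(x, c(x)). The toy vocabulary, the 4-dimensional embeddings, and the LSTM standing in for a contextual encoder are illustrative assumptions, not part of any model discussed here.

import torch
from torch import nn

# Context-independent: f(x) is a pure table lookup, so "crane" is always
# mapped to the same vector, whatever sentence it appears in
vocab = {'a': 0, 'crane': 1, 'is': 2, 'flying': 3, 'driver': 4, 'came': 5}
static_embedding = nn.Embedding(len(vocab), 4)

def f_static(token):
    return static_embedding(torch.tensor(vocab[token]))

# Context-sensitive: f(x, c(x)) sees the whole sequence, so the vector for
# "crane" can differ between the two sentences
contextual_encoder = nn.LSTM(4, 4, batch_first=True)  # stand-in for ELMo/BERT

def f_contextual(tokens, position):
    ids = torch.tensor([[vocab[t] for t in tokens]])
    hidden, _ = contextual_encoder(static_embedding(ids))
    return hidden[0, position]

torch.allclose(f_static('crane'), f_static('crane'))  # True: same vector everywhere
crane_1 = f_contextual(['a', 'crane', 'is', 'flying'], 1)
crane_2 = f_contextual(['a', 'crane', 'driver', 'came'], 1)
torch.allclose(crane_1, crane_2)  # False in general: the representation depends on context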

For example, by taking the entire sequence as input, ELMo is a function that assigns a representation to each word from the input sequence. Specifically, ELMo combines all the intermediate layer representations from the pretrained bidirectional LSTM as the output representation. Then the ELMo representation is added to a downstream task’s existing supervised model as additional features, such as by concatenating the ELMo representation and the original representation (e.g., GloVe) of tokens in the existing model. On the one hand, all the weights in the pretrained bidirectional LSTM model are frozen after ELMo representations are added. On the other hand, the existing supervised model is specifically customized for a given task. Leveraging the best models for different tasks at that time, adding ELMo improved the state of the art across six natural language processing tasks: sentiment analysis, natural language inference, semantic role labeling, coreference resolution, named entity recognition, and question answering.
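A minimal sketch of this recipe, assuming a hypothetical frozen bidirectional LSTM standing in for the pretrained ELMo model and a GloVe embedding table; the dimensions and the linear task head are placeholders rather than ELMo’s actual configuration.

import torch
from torch import nn

glove_dim, elmo_dim, num_classes = 100, 256, 2
glove_embedding = nn.Embedding(10000, glove_dim)       # static token embeddings
pretrained_bilstm = nn.LSTM(glove_dim, elmo_dim, batch_first=True,
                            bidirectional=True)        # stand-in for the pretrained model
for param in pretrained_bilstm.parameters():
    param.requires_grad = False                         # the pretrained weights stay frozen

task_head = nn.Linear(glove_dim + 2 * elmo_dim, num_classes)  # task-specific, trained as usual

tokens = torch.randint(0, 10000, (2, 8))                # a toy batch of token indices
static = glove_embedding(tokens)                        # (2, 8, 100)
contextual, _ = pretrained_bilstm(static)               # (2, 8, 512), frozen contextual features
features = torch.cat([static, contextual], dim=-1)      # the concatenation described above
logits = task_head(features)                            # per-token predictions, e.g., tagging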

15.8.2. From Task-Specific to Task-Agnostic

Although ELMo has significantly improved solutions to a diverse set of natural language processing tasks, each solution still hinges on a task-specific architecture. However, it is practically non-trivial to craft a specific architecture for every natural language processing task. The GPT (Generative Pre-Training) model represents an effort in designing a general task-agnostic model for context-sensitive representations (Radford et al., 2018). Built on a Transformer decoder, GPT pretrains a language model that will be used to represent text sequences. When applying GPT to a downstream task, the output of the language model will be fed into an added linear output layer to predict the label of the task. In sharp contrast to ELMo that freezes parameters of the pretrained model, GPT fine-tunes all the parameters in the pretrained Transformer decoder during supervised learning of the downstream task. GPT was evaluated on twelve tasks of natural language inference, question answering, sentence similarity, and classification, and improved the state of the art in nine of them with minimal changes to the model architecture.
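In contrast, a sketch of the GPT-style fine-tuning setup might look as follows, using a generic Transformer layer as a stand-in for the pretrained decoder; the hyperparameters and the sequence-level classification head are assumptions for illustration only.

import torch
from torch import nn

# Stand-in for a pretrained Transformer-decoder language model (not GPT itself)
pretrained_lm = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
output_layer = nn.Linear(256, 2)   # added linear output layer for the downstream task

# Unlike the ELMo recipe above, all pretrained parameters remain trainable and
# are updated jointly with the new output layer during fine-tuning
optimizer = torch.optim.Adam(
    list(pretrained_lm.parameters()) + list(output_layer.parameters()), lr=1e-4)

X = torch.randn(2, 8, 256)                    # toy batch of sequence features
hidden = pretrained_lm(X)                     # (2, 8, 256)
logits = output_layer(hidden[:, -1, :])       # predict the label from the last position
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))
loss.backward()
optimizer.step()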

However, due to the autoregressive nature of language models, GPT only looks forward (left-to-right). In the contexts “i went to the bank to deposit cash” and “i went to the bank to sit down”, the context to the left of “bank” is identical, so GPT will return the same representation for “bank” even though the word has different meanings in the two sentences.

15.8.3. BERT: Combining the Best of Both Worlds

As we have seen, ELMo encodes context bidirectionally but uses task-specific architectures, while GPT is task-agnostic but encodes context left-to-right. Combining the best of both worlds, BERT (Bidirectional Encoder Representations from Transformers) encodes context bidirectionally and requires minimal architecture changes for a wide range of natural language processing tasks (Devlin et al., 2018). Using a pretrained Transformer encoder, BERT is able to represent any token based on its bidirectional context. During supervised learning of downstream tasks, BERT is similar to GPT in two aspects. First, BERT representations will be fed into an added output layer, with minimal changes to the model architecture depending on the nature of the task, such as predicting for every token vs. predicting for the entire sequence. Second, all the parameters of the pretrained Transformer encoder are fine-tuned, while the additional output layer will be trained from scratch. Fig. 15.8.1 depicts the differences among ELMo, GPT, and BERT.


Fig. 15.8.1 A comparison of ELMo, GPT, and BERT.

BERT further improved the state of the art on eleven natural language processing tasks under broad categories of (i) single text classification (e.g., sentiment analysis), (ii) text pair classification (e.g., natural language inference), (iii) question answering, and (iv) text tagging (e.g., named entity recognition). From context-sensitive ELMo to task-agnostic GPT and BERT, all proposed in 2018, conceptually simple yet empirically powerful pretraining of deep representations for natural languages has revolutionized solutions to various natural language processing tasks.

In the rest of this chapter, we will dive into the pretraining of BERT. When natural language processing applications are explained in Section 16, we will illustrate fine-tuning of BERT for downstream applications.

import torch
from torch import nn
from d2l import torch as d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()

15.8.4. Input Representation

In natural language processing, some tasks (e.g., sentiment analysis) take single text as input, while in some other tasks (e.g., natural language inference), the input is a pair of text sequences. The BERT input sequence unambiguously represents both single text and text pairs. In the former, the BERT input sequence is the concatenation of the special classification token “<cls>”, tokens of a text sequence, and the special separation token “<sep>”. In the latter, the BERT input sequence is the concatenation of “<cls>”, tokens of the first text sequence, “<sep>”, tokens of the second text sequence, and “<sep>”. We will consistently distinguish the terminology “BERT input sequence” from other types of “sequences”. For instance, one BERT input sequence may include either one text sequence or two text sequences.

To distinguish text pairs, the learned segment embeddings eA and eB are added to the token embeddings of the first sequence and the second sequence, respectively. For single text inputs, only eA is used.

The following get_tokens_and_segments takes either one sentence or two sentences as input, then returns tokens of the BERT input sequence and their corresponding segment IDs.

#@save
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """Get tokens of the BERT input sequence and their segment IDs."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 are marking segment A and B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments
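For instance, calling the function on a toy sentence pair shows how the special tokens and the segment IDs line up:

get_tokens_and_segments(['this', 'movie', 'is', 'great'], ['i', 'like', 'it'])
# (['<cls>', 'this', 'movie', 'is', 'great', '<sep>', 'i', 'like', 'it', '<sep>'],
#  [0, 0, 0, 0, 0, 0, 1, 1, 1, 1])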

BERT chooses the Transformer encoder as its bidirectional architecture. Common in the Transformer encoder, positional embeddings are added at every position of the BERT input sequence. However, different from the original Transformer encoder, BERT uses learnable positional embeddings. To sum up, Fig. 15.8.2 shows that the embeddings of the BERT input sequence are the sum of the token embeddings, segment embeddings, and positional embeddings.


Fig. 15.8.2 The embeddings of the BERT input sequence are the sum of the token embeddings, segment embeddings, and positional embeddings.

The following BERTEncoder class is similar to the TransformerEncoder class as implemented in Section 11.7. Different from TransformerEncoder, BERTEncoder uses segment embeddings and learnable positional embeddings.

#@save
class BERTEncoder(nn.Module):
    """BERT encoder."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout, max_len=1000, **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module(f"{i}", d2l.TransformerEncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, True))
        # In BERT, positional embeddings are learnable, thus we create a
        # parameter of positional embeddings that are long enough
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len,
                                                      num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # Shape of `X` remains unchanged in the following code snippet:
        # (batch size, max sequence length, `num_hiddens`)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X
#@save
class BERTEncoder(nn.Block):
    """BERT encoder."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout, max_len=1000, **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        for _ in range(num_blks):
            self.blks.add(d2l.TransformerEncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, True))
        # In BERT, positional embeddings are learnable, thus we create a
        # parameter of positional embeddings that are long enough
        self.pos_embedding = self.params.get('pos_embedding',
                                             shape=(1, max_len, num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # Shape of `X` remains unchanged in the following code snippet:
        # (batch size, max sequence length, `num_hiddens`)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding.data(ctx=X.ctx)[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X

Suppose that the vocabulary size is 10000. To demonstrate forward inference of BERTEncoder, let’s create an instance of it and initialize its parameters.

vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4
num_blks, dropout = 2, 0.2
encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                      num_blks, dropout)
vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4
num_blks, dropout = 2, 0.2
encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                      num_blks, dropout)
encoder.initialize()

We define tokens to be 2 BERT input sequences of length 8, where each token is an index of the vocabulary. The forward inference of BERTEncoder with the input tokens returns the encoded result where each token is represented by a vector whose length is predefined by the hyperparameter num_hiddens. This hyperparameter is usually referred to as the hidden size (number of hidden units) of the Transformer encoder.

tokens = torch.randint(0, vocab_size, (2, 8))
segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape
torch.Size([2, 8, 768])
tokens = np.random.randint(0, vocab_size, (2, 8))
segments = np.array([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape
(2, 8, 768)

15.8.5. Pretraining Tasks

The forward inference of BERTEncoder gives the BERT representation of each token of the input text and the inserted special tokens “<cls>” and “<sep>”. Next, we will use these representations to compute the loss function for pretraining BERT. The pretraining is composed of the following two tasks: masked language modeling and next sentence prediction.

15.8.5.1. Masked Language Modeling

As illustrated in Section 9.3, a language model predicts a token using the context on its left. To encode context bidirectionally for representing each token, BERT randomly masks tokens and uses tokens from the bidirectional context to predict the masked tokens in a self-supervised fashion. This task is referred to as a masked language model.

In this pretraining task, 15% of tokens will be selected at random as the masked tokens for prediction. To predict a masked token without cheating by using the label, one straightforward approach is to always replace it with a special “<mask>” token in the BERT input sequence. However, the artificial special token “<mask>” will never appear in fine-tuning. To avoid such a mismatch between pretraining and fine-tuning, if a token is masked for prediction (e.g., “great” is selected to be masked and predicted in “this movie is great”), in the input it will be replaced with:

  • a special “<mask>” token for 80% of the time (e.g., “this movie is great” becomes “this movie is <mask>”);

  • a random token for 10% of the time (e.g., “this movie is great” becomes “this movie is drink”);

  • the unchanged label token for 10% of the time (e.g., “this movie is great” becomes “this movie is great”).

Note that a random token is substituted only 10% of the time for the 15% of tokens selected for prediction, i.e., for 1.5% of all tokens. This occasional noise encourages BERT to be less biased towards the masked token (especially when the label token remains unchanged) in its bidirectional context encoding.
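The replacement rule can be sketched as follows. This is a simplified illustration on token strings; the helper name, the sampling of positions, and the use of Python's random module are our assumptions, not the exact pretraining-data code.

import random

def mask_tokens_for_mlm(tokens, vocab_tokens, mask_prob=0.15):
    """Illustrative 80%/10%/10% masking of about `mask_prob` of the positions."""
    inputs, labels = list(tokens), []
    positions = list(range(len(tokens)))
    random.shuffle(positions)
    num_preds = max(1, round(mask_prob * len(tokens)))
    for pos in sorted(positions[:num_preds]):
        labels.append((pos, tokens[pos]))        # ground-truth token for the loss
        r = random.random()
        if r < 0.8:                              # 80%: replace with '<mask>'
            inputs[pos] = '<mask>'
        elif r < 0.9:                            # 10%: replace with a random token
            inputs[pos] = random.choice(vocab_tokens)
        # remaining 10%: keep the original token unchanged
    return inputs, labels

Only the positions recorded in labels contribute to the masked language modeling loss computed later in this section.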

We implement the following MaskLM class to predict masked tokens in the masked language model task of BERT pretraining. The prediction uses a one-hidden-layer MLP (self.mlp). In forward inference, it takes two inputs: the encoded result of BERTEncoder and the token positions for prediction. The output is the prediction results at these positions.

#@save
class MaskLM(nn.Module):
    """The masked language model task of BERT."""
    def __init__(self, vocab_size, num_hiddens, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential(nn.LazyLinear(num_hiddens),
                                 nn.ReLU(),
                                 nn.LayerNorm(num_hiddens),
                                 nn.LazyLinear(vocab_size))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = torch.arange(0, batch_size)
        # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then
        # `batch_idx` is `torch.tensor([0, 0, 0, 1, 1, 1])`
        batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat
#@save
class MaskLM(nn.Block):
    """The masked language model task of BERT."""
    def __init__(self, vocab_size, num_hiddens, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential()
        self.mlp.add(
            nn.Dense(num_hiddens, flatten=False, activation='relu'))
        self.mlp.add(nn.LayerNorm())
        self.mlp.add(nn.Dense(vocab_size, flatten=False))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = np.arange(0, batch_size)
        # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then
        # `batch_idx` is `np.array([0, 0, 0, 1, 1, 1])`
        batch_idx = np.repeat(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat

To demonstrate the forward inference of MaskLM, we create its instance mlm and initialize it. Recall that encoded_X from the forward inference of BERTEncoder represents 2 BERT input sequences. We define mlm_positions as the 3 indices to predict in either BERT input sequence of encoded_X. The forward inference of mlm returns prediction results mlm_Y_hat at all the masked positions mlm_positions of encoded_X. For each prediction, the size of the result is equal to the vocabulary size.

mlm = MaskLM(vocab_size, num_hiddens)
mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]])
mlm_Y_hat = mlm(encoded_X, mlm_positions)
mlm_Y_hat.shape
torch.Size([2, 3, 10000])
mlm = MaskLM(vocab_size, num_hiddens)
mlm.initialize()
mlm_positions = np.array([[1, 5, 2], [6, 1, 5]])
mlm_Y_hat = mlm(encoded_X, mlm_positions)
mlm_Y_hat.shape
(2, 3, 10000)

With the ground truth labels mlm_Y of the predicted tokens mlm_Y_hat under masks, we can calculate the cross-entropy loss of the masked language model task in BERT pretraining.

mlm_Y = torch.tensor([[7, 8, 9], [10, 20, 30]])
loss = nn.CrossEntropyLoss(reduction='none')
mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1))
mlm_l.shape
torch.Size([6])
mlm_Y = np.array([[7, 8, 9], [10, 20, 30]])
loss = gluon.loss.SoftmaxCrossEntropyLoss()
mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1))
mlm_l.shape
(6,)

15.8.5.2. Next Sentence Prediction

Although masked language modeling is able to encode bidirectional context for representing words, it does not explicitly model the logical relationship between text pairs. To help understand the relationship between two text sequences, BERT considers a binary classification task, next sentence prediction, in its pretraining. When generating sentence pairs for pretraining, for half of the time they are indeed consecutive sentences with the label “True”; while for the other half of the time the second sentence is randomly sampled from the corpus with the label “False”.
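How such pairs might be drawn from a corpus can be sketched as follows; the helper below is illustrative only (the corpus is assumed to be a list of paragraphs, each a list of tokenized sentences), not the book's data pipeline.

import random

def get_nsp_example(paragraphs, para_idx, sent_idx):
    """Return (sentence_a, sentence_b, is_next) with a roughly 50/50 label split."""
    sentence_a = paragraphs[para_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(paragraphs[para_idx]):
        # Keep the truly consecutive sentence and label the pair "True"
        sentence_b, is_next = paragraphs[para_idx][sent_idx + 1], True
    else:
        # Sample the second sentence from a random paragraph and label "False"
        sentence_b, is_next = random.choice(random.choice(paragraphs)), False
    return sentence_a, sentence_b, is_next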

The following NextSentencePred class uses a one-hidden-layer MLP to predict whether the second sentence is the next sentence of the first in the BERT input sequence. Due to self-attention in the Transformer encoder, the BERT representation of the special token “<cls>” encodes both sentences of the input. Hence, the output layer (self.output) of the MLP classifier takes X as input, where X is the output of the MLP hidden layer whose input is the encoded “<cls>” token.

#@save
class NextSentencePred(nn.Module):
    """The next sentence prediction task of BERT."""
    def __init__(self, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.LazyLinear(2)

    def forward(self, X):
        # `X` shape: (batch size, `num_hiddens`)
        return self.output(X)
#@save
class NextSentencePred(nn.Block):
    """The next sentence prediction task of BERT."""
    def __init__(self, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.Dense(2)

    def forward(self, X):
        # `X` shape: (batch size, `num_hiddens`)
        return self.output(X)

We can see that the forward inference of a NextSentencePred instance returns binary predictions for each BERT input sequence.

# PyTorch by default will not flatten the tensor as seen in mxnet where, if
# flatten=True, all but the first axis of input data are collapsed together
encoded_X = torch.flatten(encoded_X, start_dim=1)
# input_shape for NSP: (batch size, `num_hiddens`)
nsp = NextSentencePred()
nsp_Y_hat = nsp(encoded_X)
nsp_Y_hat.shape
torch.Size([2, 2])
nsp = NextSentencePred()
nsp.initialize()
nsp_Y_hat = nsp(encoded_X)
nsp_Y_hat.shape
(2, 2)

The cross-entropy loss of the 2 binary classifications can also be computed.

nsp_y = torch.tensor([0, 1])
nsp_l = loss(nsp_Y_hat, nsp_y)
nsp_l.shape
torch.Size([2])
nsp_y = np.array([0, 1])
nsp_l = loss(nsp_Y_hat, nsp_y)
nsp_l.shape
(2,)

It is noteworthy that all the labels in both the aforementioned pretraining tasks can be trivially obtained from the pretraining corpus without manual labeling effort. The original BERT has been pretrained on the concatenation of BookCorpus (Zhu et al., 2015) and English Wikipedia. These two text corpora are huge: they have 800 million words and 2.5 billion words, respectively.

15.8.6. Putting It All Together

When pretraining BERT, the final loss function is a linear combination of both the loss functions for masked language modeling and next sentence prediction. Now we can define the BERTModel class by instantiating the three classes BERTEncoder, MaskLM, and NextSentencePred. The forward inference returns the encoded BERT representations encoded_X, predictions of masked language modeling mlm_Y_hat, and next sentence predictions nsp_Y_hat.

#@save
class BERTModel(nn.Module):
    """The BERT model."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_blks, dropout, max_len=1000):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
                                   num_heads, num_blks, dropout,
                                   max_len=max_len)
        self.hidden = nn.Sequential(nn.LazyLinear(num_hiddens),
                                    nn.Tanh())
        self.mlm = MaskLM(vocab_size, num_hiddens)
        self.nsp = NextSentencePred()

    def forward(self, tokens, segments, valid_lens=None, pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # The hidden layer of the MLP classifier for next sentence prediction.
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
#@save
class BERTModel(nn.Block):
    """The BERT model."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout, max_len=1000):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
                                   num_heads, num_blks, dropout, max_len)
        self.hidden = nn.Dense(num_hiddens, activation='tanh')
        self.mlm = MaskLM(vocab_size, num_hiddens)
        self.nsp = NextSentencePred()

    def forward(self, tokens, segments, valid_lens=None, pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # The hidden layer of the MLP classifier for next sentence prediction.
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
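To make the linear combination explicit, here is a sketch of a single pretraining loss computation with the PyTorch BERTModel above, reusing the toy tensors (tokens, segments, mlm_positions, mlm_Y, nsp_y, in their PyTorch versions) defined earlier in this section. Weighting the two losses equally is an assumption that follows common practice; in a full pretraining loop, losses at padded prediction positions would additionally be masked out.

net = BERTModel(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                num_blks, dropout)
loss = nn.CrossEntropyLoss(reduction='none')

_, mlm_Y_hat, nsp_Y_hat = net(tokens, segments, None, mlm_positions)
# Masked language modeling loss, averaged over the predicted positions
mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)).mean()
# Next sentence prediction loss, averaged over the batch
nsp_l = loss(nsp_Y_hat, nsp_y).mean()
total_l = mlm_l + nsp_l   # the final pretraining loss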

15.8.7. Summary

  • Word embedding models such as word2vec and GloVe are context-independent. They assign the same pretrained vector to the same word regardless of the context of the word (if any). It is hard for them to handle well polysemy or complex semantics in natural languages.

  • For context-sensitive word representations such as ELMo and GPT, representations of words depend on their contexts.

  • ELMo encodes context bidirectionally but uses task-specific architectures (however, it is practically non-trivial to craft a specific architecture for every natural language processing task); while GPT is task-agnostic but encodes context left-to-right.

  • BERT combines the best of both worlds: it encodes context bidirectionally and requires minimal architecture changes for a wide range of natural language processing tasks.

  • The embeddings of the BERT input sequence are the sum of the token embeddings, segment embeddings, and positional embeddings.

  • Pretraining BERT is composed of two tasks: masked language modeling and next sentence prediction. The former is able to encode bidirectional context for representing words, while the latter explicitly models the logical relationship between text pairs.

15.8.8. Exercises

  1. All other things being equal, will a masked language model require more or fewer pretraining steps to converge than a left-to-right language model? Why?

  2. In the original implementation of BERT, the positionwise feed-forward network in BERTEncoder (via d2l.TransformerEncoderBlock) and the fully connected layer in MaskLM both use the Gaussian error linear unit (GELU) (Hendrycks and Gimpel, 2016) as the activation function. Research the difference between GELU and ReLU.