Training language models to follow instructions with human feedback
Long Ouyang* Jeff Wu* Xu Jiang* Diogo Almeida* Carroll L. Wainwright*
Pamela Mishkin* Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray
John Schulman Jacob Hilton Fraser Kelton Luke Miller Maddie Simens
Amanda Askell† Peter Welinder Paul Christiano*†
Jan Leike* Ryan Lowe*
OpenAI
Abstract
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
1 Introduction
Large language models (LMs) can be “prompted” to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021; Gehman et al., 2020). This is because the language modeling objective
Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PPO-ptx) as well as its variant trained without pretraining mix (PPO) significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals.
used for many recent large LMs (predicting the next token on a webpage from the internet) is different from the objective “follow the user’s instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022). Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications.
We make progress on aligning language models by training them to act in accordance with the user’s intention (Leike et al., 2018). This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful. Using the language of Askell et al. (2021), we want language models to be helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment). We elaborate on the evaluation of these criteria in Section 3.6.
We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.
We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets. We train three model
Figure 2: A diagram illustrating the three steps of our method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train one of our models. In Step 2, boxes A-D are samples from our models that get ranked by labelers. See Section 3 for more details on our method.
sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main findings are as follows:
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. These models have the same architecture, and differ only by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot 175B GPT-3. InstructGPT models also generate more appropriate outputs according to our labelers, and more reliably follow explicit constraints in the instruction.
InstructGPT models show improvements in truthfulness over GPT-3. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. Our results are equally strong on the subset of questions that were not adversarially selected against GPT-3. On “closed-domain” tasks from our API prompt distribution, where the output should not contain information that is not present in the input (e.g. summarization and closed-domain QA), InstructGPT models make up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate, respectively).
InstructGPT shows small improvements in toxicity over GPT-3, but not bias. To measure toxicity, we use the RealToxicityPrompts dataset (Gehman et al., 2020) and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender (Rudinger et al., 2018) and CrowS-Pairs (Nangia et al., 2020) datasets.
We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015). This is an example of an “alignment tax” since our alignment procedure comes at the cost of
lower performance on certain tasks that we may care about. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores.
Our models generalize to the preferences of “held-out” labelers that did not produce any training data. To test the generalization of our models, we conduct a preliminary experiment with held-out labelers, and find that they prefer InstructGPT outputs to outputs from GPT-3 at about the same rate as our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.
Public NLP datasets are not reflective of how our language models are used. We compare GPT-3 fine-tuned on our human preference data (i.e. InstructGPT) to GPT-3 fine-tuned on two different compilations of public NLP tasks: the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets (in particular, the T0++ variant). These datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. On our API prompt distribution, our FLAN and T0 models perform slightly worse than our SFT baseline, and labelers significantly prefer InstructGPT to these models (InstructGPT has a 73.4 ± 2% winrate vs. our baseline, compared to 26.8 ± 2% and 29.8 ± 2% for our version of T0 and FLAN, respectively).
InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. We qualitatively probe InstructGPT’s capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. In contrast, GPT-3 can perform these tasks but requires more careful prompting, and does not usually follow instructions in these domains. This result is exciting because it suggests that our models are able to generalize the notion of “following instructions.” They retain some alignment even on tasks for which they get very little direct supervision signal.
InstructGPT still makes simple mistakes. For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises.
Overall, our results indicate that fine-tuning large language models using human preferences significantly improves their behavior on a wide range of tasks, though much work remains to be done to improve their safety and reliability.
The rest of this paper is structured as follows: We first detail related work in Section 2, before diving into our method and experiment details in Section 3, including our high-level methodology (3.1), task and dataset details (3.2 and 3.3), human data collection (3.4), how we trained our models (3.5), and our evaluation procedure (3.6). We then present our results in Section 4, divided into three parts: results on the API prompt distribution (4.1), results on public NLP datasets (4.2), and qualitative results (4.3). Finally we give an extended discussion of our work in Section 5, including implications for alignment research (5.1), what we are aligning to (5.2), limitations (5.3), open questions (5.4), and broader impacts of this work (5.5).
2 Related work
Research on alignment and learning from human feedback. We build on previous techniques to align models with human intentions, particularly reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and Atari games (Christiano et al., 2017; Ibarz et al., 2018), it has recently been applied to fine-tuning language models to summarize text (Ziegler et al., 2019; Stiennon et al., 2020; Böhm et al., 2019; Wu et al., 2021). This work is in turn influenced by similar work using human feedback as a reward in domains such as dialogue (Jaques et al., 2019; Yi et al., 2019; Hancock et al., 2019), translation (Kreutzer et al., 2018; Bahdanau et al., 2016), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhou and Xu, 2020), review generation (Cho et al., 2018), and evidence extraction (Perez et al., 2019). Madaan et al. (2022) use written human feedback to augment prompts and improve the performance of GPT-3. There has also been work on aligning agents in text-based environments using RL with
a normative prior (Nahian et al., 2021). Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of language tasks.
The question of what it means for language models to be aligned has also received attention recently (Gabriel, 2020). Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives. In concurrent work, Askell et al. (2021) propose language assistants as a testbed for alignment research, study some simple baselines, and their scaling properties.
Training language models to follow instructions. Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings.
There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021).
Evaluating the harms of language models. A goal of modifying the behavior of language models is to mitigate the harms of these models when they’re deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021).
Modifying the behavior of language models to mitigate harms. There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models’ ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the-loop data collection (Dinan et al., 2019b). Other approaches for mitigating the generated bias by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).
Table 1: Distribution of use case categories from our API prompt dataset.
Table 2: Illustrative prompts from our API prompt dataset. These are fictional examples inspired by real usage; see more examples in Appendix A.2.1.
| Use-case | Prompt |
| :--- | :--- |
| Brainstorming | List five ideas for how to regain enthusiasm for my career |
| Generation | Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home. |
| Rewrite | This is the summary of a Broadway play: """ {summary} """ This is the outline of the commercial for that play: """ |
3 Methods and experimental details
3.1 High-level methodology
Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains. We start with a pretrained language model (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022), a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers (see Section 3.4 for details). We then apply the following three steps (Figure 2).
Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.

Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.

Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017).

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.
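The three steps above compose into a simple loop. The following is a toy Python sketch of that dataflow, not the paper’s implementation: the models, training functions, and data are trivial stand-ins (a lookup-table "policy", a length-based "reward model"), included only to make the structure of Steps 1-3 and their iteration concrete.

```python
from typing import Callable, Dict, List, Tuple

Prompt = str
Completion = str
Policy = Callable[[Prompt], Completion]
RewardModel = Callable[[Prompt, Completion], float]


def supervised_finetune(demos: List[Tuple[Prompt, Completion]]) -> Policy:
    """Step 1: imitate labeler demonstrations (here: a lookup table with a fallback)."""
    table: Dict[Prompt, Completion] = dict(demos)
    return lambda prompt: table.get(prompt, "I don't know.")


def train_reward_model(comparisons: List[Tuple[Prompt, Completion, Completion]]) -> RewardModel:
    """Step 2: fit a scalar reward from labeler preferences (here: a trivial length proxy)."""
    return lambda prompt, completion: float(len(completion))


def ppo_finetune(policy: Policy, rm: RewardModel, prompts: List[Prompt]) -> Policy:
    """Step 3: optimize the policy against the reward model (a no-op in this sketch)."""
    return policy


prompts = ["Write a story about a wise frog."]
demos = [(prompts[0], "Once upon a time, a wise frog lived by a quiet pond...")]

policy = supervised_finetune(demos)                    # Step 1
for _ in range(2):                                     # Steps 2 and 3 can be iterated
    comparisons = [(prompts[0], policy(prompts[0]), "a worse draft")]
    rm = train_reward_model(comparisons)               # Step 2 (data collected on the current best policy)
    policy = ppo_finetune(policy, rm, prompts)         # Step 3
```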
3.2 Dataset
Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface. Customers using the Playground were informed that their data could be used to train further models via a recurring notification any time InstructGPT models were used. In this paper we do not use data from customers using the API in production. We heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. To avoid the models learning potentially sensitive customer details, we filter all prompts in the training split for personally identifiable information (PII).
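As a rough illustration of the dataset hygiene described above, the sketch below (assumed details, not the paper’s actual code) deduplicates prompts that share a long common prefix, caps prompts at 200 per user ID, and assigns each user ID to exactly one of the train/validation/test splits. The prefix length and split fractions are placeholders.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

PREFIX_LEN = 64       # assumed "long common prefix" length; the paper does not state a value
MAX_PER_USER = 200    # at most 200 prompts per user ID


def dedup_and_cap(prompts_by_user: Dict[str, List[str]]) -> Dict[str, List[str]]:
    kept = {}
    for user_id, prompts in prompts_by_user.items():
        seen_prefixes, unique = set(), []
        for p in prompts:
            prefix = p[:PREFIX_LEN]
            if prefix in seen_prefixes:
                continue                      # heuristically treat shared-prefix prompts as duplicates
            seen_prefixes.add(prefix)
            unique.append(p)
        kept[user_id] = unique[:MAX_PER_USER]
    return kept


def split_by_user(user_ids: List[str], frac_valid: float = 0.1, frac_test: float = 0.1) -> Dict[str, List[str]]:
    """Hash each user ID into a split, so validation/test never share users with train."""
    splits = defaultdict(list)
    for uid in user_ids:
        bucket = int(hashlib.sha256(uid.encode()).hexdigest(), 16) % 100
        if bucket < 100 * frac_test:
            splits["test"].append(uid)
        elif bucket < 100 * (frac_test + frac_valid):
            splits["validation"].append(uid)
        else:
            splits["train"].append(uid)
    return dict(splits)
```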
To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts weren’t often submitted to the regular GPT-3 models on the API. We asked labelers to write three kinds of prompts:
Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.

Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction.

User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.
From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API). More details on dataset sizes are provided in Table 6.
To give a sense of the composition of our dataset, in Table 1 we show the distribution of use-case categories for our API prompts (specifically the RM dataset) as labeled by our contractors. Most of the use-cases are generative, rather than classification or QA. We also show some illustrative prompts (written by researchers to mimic the kinds of prompts submitted to InstructGPT models) in Table 2; more prompts submitted to InstructGPT models are shown in Appendix A.2.1, and prompts submitted to GPT-3 models are shown in Appendix A.2.2. We provide more details about our dataset in Appendix A.
3.3 Tasks
Our training tasks are from two sources: (1) a dataset of prompts written by our labelers and (2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6). These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks (see Table 1). Our dataset is over 96% English; however, in Section 4.3 we also probe our model’s ability to respond to instructions in other languages and complete coding tasks.
For each natural language prompt, the task is most often specified directly through a natural language instruction (e.g. “Write a story about a wise frog”), but could also be indirectly through either few-shot examples (e.g. giving two examples of frog stories, and prompting the model to generate a new one) or implicit continuation (e.g. providing the start of a story about a frog). In each case, we ask our labelers to do their best to infer the intent of the user who wrote the prompt, and ask them to skip inputs where the task is very unclear. Moreover, our labelers also take into account the implicit intentions such as truthfulness of the response, and potentially harmful outputs such as biased or toxic language, guided by the instructions we provide them (see Appendix B) and their best judgment.
3.4 Human data collection
To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. Compared to earlier work that collects human preference data on the task of summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test; for more information about our selection procedure and labeler demographics, see Appendix B.1.
During training and evaluation, our alignment criteria may come into conflict: for example, when a user requests a potentially harmful response. During training we prioritize helpfulness to the user (not doing so requires making some difficult design decisions that we leave to future work; see Section 5.4 for more discussion). However, in our final evaluations we asked labelers to prioritize truthfulness and harmlessness (since this is what we really care about).
As in Stiennon et al. (2020), we collaborate closely with labelers over the course of the project. We have an onboarding process to train labelers on the project, write detailed instructions for each task (see Appendix B.2), and answer labeler questions in a shared chat room.
As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test.
Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%. For comparison, in the summarization work of Stiennon et al. (2020) researcher-researcher agreement was 73 ± 4%.
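For concreteness, one way to compute a pairwise agreement rate like the ones quoted above is sketched below. This is an assumed formulation (count how often two labelers order the same pair of completions the same way); the paper’s exact procedure may differ.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(rankings_by_labeler: Dict[str, Dict[str, int]]) -> float:
    """rankings_by_labeler maps labeler -> {completion_id: rank}. For every pair of labelers and
    every pair of completions both ranked, count how often they order the pair the same way."""
    agree = total = 0
    for a, b in combinations(rankings_by_labeler.values(), 2):
        shared = sorted(set(a) & set(b))
        for x, y in combinations(shared, 2):
            if a[x] == a[y] or b[x] == b[y]:
                continue  # skip ties for simplicity
            agree += (a[x] < a[y]) == (b[x] < b[y])
            total += 1
    return agree / total if total else float("nan")


example = {
    "labeler_1": {"A": 1, "B": 2, "C": 3},
    "labeler_2": {"A": 1, "B": 3, "C": 2},
}
print(pairwise_agreement(example))  # 2 of 3 shared pairs ordered the same way -> 0.666...
```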
3.5 Models
We start with the GPT-3 pretrained language models from Brown et al. (2020). These models are trained on a broad distribution of Internet data and are adaptable to a wide range of downstream tasks, but have poorly characterized behavior. Starting from these models, we then train models with three different techniques:
Supervised fine-tuning (SFT). We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. Similarly to Wu et al. (2021), we find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting.
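A hedged PyTorch sketch of that SFT recipe follows: 16 epochs with cosine learning-rate decay and residual dropout of 0.2. The tiny transformer layer, the learning rate, the batch contents, and the loss are placeholders standing in for GPT-3 and its next-token cross-entropy objective.

```python
import torch
from torch import nn

model = nn.TransformerEncoderLayer(d_model=64, nhead=4, dropout=0.2)  # stand-in for GPT-3; dropout 0.2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)             # placeholder learning rate
EPOCHS, STEPS_PER_EPOCH = 16, 100
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS * STEPS_PER_EPOCH)

for epoch in range(EPOCHS):
    for _ in range(STEPS_PER_EPOCH):
        x = torch.randn(8, 16, 64)            # stand-in batch of token embeddings (seq, batch, d_model)
        loss = model(x).pow(2).mean()         # stand-in for the next-token cross-entropy loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                      # cosine learning-rate decay
    # In the paper, the final SFT checkpoint is chosen by RM score on the validation set.
```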
Reward modeling (RM). Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details).
In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels; the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
In order to speed up comparison collection, we present labelers with anywhere between $K=4$ and $K=9$ responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than $\binom{K}{2}$ forward passes for $K$ completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss.
Specifically, the loss function for the reward model is:

$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D}\left[\log \left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$
where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, and $D$ is the dataset of human comparisons.
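The following PyTorch sketch shows one way to implement this loss with the batching scheme from the previous paragraph: the RM is run once per completion, and the log-sigmoid of reward differences is averaged over all $\binom{K}{2}$ pairs implied by a labeler’s ranking. The linear "reward model" and the feature tensors are placeholders, not the 6B RM.

```python
import torch
import torch.nn.functional as F
from itertools import combinations


def rm_loss_for_prompt(reward_model, completion_features, ranking):
    """completion_features: (K, d) tensor, one row per completion for the same prompt.
    ranking: completion indices ordered best-to-worst by a labeler."""
    rewards = reward_model(completion_features).squeeze(-1)   # one forward pass per completion
    pairs = list(combinations(ranking, 2))                    # earlier index in `ranking` is preferred (y_w)
    loss = 0.0
    for better, worse in pairs:
        loss = loss - F.logsigmoid(rewards[better] - rewards[worse])
    return loss / len(pairs)                                  # average over the binom(K, 2) pairs


K, d = 4, 32
reward_model = torch.nn.Linear(d, 1)      # placeholder scalar-output head
features = torch.randn(K, d)              # placeholder per-completion representations
loss = rm_loss_for_prompt(reward_model, features, ranking=[2, 0, 3, 1])
loss.backward()
```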
Table 3: Labeler-collected metadata on the API distribution.
| Metadata | Scale |
| :--- | :--- |
| Overall quality | Likert scale; 1-7 |
| Fails to follow the correct instruction / task | Binary |
| Inappropriate for customer assistant | Binary |
| Hallucination | Binary |
| Satisfies constraint provided in the instruction | Binary |
| Contains sexual content | Binary |
| Contains violent content | Binary |
| Encourages or fails to discourage violence/abuse/terrorism/self-harm | Binary |
| Denigrates a protected class | Binary |
| Gives harmful advice | Binary |
| Expresses opinion | Binary |
| Expresses moral judgment | Binary |
Finally, since the RM loss is invariant to shifts in reward, we normalize the reward model using a bias so that the labeler demonstrations achieve a mean score of 0 before doing RL.
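As a small illustration of this normalization, the snippet below (a minimal, assumed sketch) estimates the mean RM score of labeler demonstrations and subtracts it as a bias when the RM is later used as a reward.

```python
import torch


def fit_reward_bias(reward_model, demo_features: torch.Tensor) -> torch.Tensor:
    """Mean RM score over (prompt, demonstration) pairs; demo_features is a placeholder batch."""
    with torch.no_grad():
        return reward_model(demo_features).mean()

# During RL, use the shifted reward so demonstrations score 0 on average:
# shaped_reward = reward_model(features) - bias
```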
Reinforcement learning (RL). Once again following Stiennon et al. (2020), we fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017). The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model. The value function is initialized from the RM. We call these models “PPO.”
We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the following combined objective function in RL training:

$$\operatorname{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y) - \beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] + \gamma E_{x \sim D_{\text{pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right]$$
where $\pi_\phi^{\mathrm{RL}}$ is the learned RL policy, $\pi^{\mathrm{SFT}}$ is the supervised trained model, and $D_{\text{pretrain}}$ is the pretraining distribution. The KL reward coefficient, $\beta$, and the pretraining loss coefficient, $\gamma$, control the strength of the KL penalty and pretraining gradients respectively. For “PPO” models, $\gamma$ is set to 0. Unless otherwise specified, in this paper InstructGPT refers to the PPO-ptx models.
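A schematic sketch of how these two terms enter training is given below; it is not the actual implementation. The per-episode reward is the RM score minus the β-scaled per-token KL penalty against the SFT policy, and a γ-scaled pretraining log-likelihood term is added to the PPO loss. The coefficient values and tensors are illustrative placeholders.

```python
import torch

beta, gamma = 0.02, 27.8    # illustrative coefficients; the paper tunes these separately


def episode_reward(rm_score: torch.Tensor, policy_logprobs: torch.Tensor, sft_logprobs: torch.Tensor) -> torch.Tensor:
    """rm_score: scalar RM output for (prompt, completion); *_logprobs: per-token log-probs of the completion."""
    kl_penalty = beta * (policy_logprobs - sft_logprobs).sum()   # per-token KL penalty vs. the SFT model
    return rm_score - kl_penalty


def ppo_ptx_loss(ppo_loss: torch.Tensor, pretrain_logprobs: torch.Tensor) -> torch.Tensor:
    """Add a term that increases log likelihood on pretraining tokens (gamma = 0 recovers plain PPO)."""
    return ppo_loss + gamma * (-pretrain_logprobs.mean())


r = episode_reward(torch.tensor(1.7), torch.randn(20), torch.randn(20))
```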
Baselines. We compare the performance of our PPO models to our SFT models and GPT-3. We also compare to GPT-3 when it is provided a few-shot prefix to ‘prompt’ it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction.
We additionally compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets, which both consist of a variety of NLP tasks, combined with natural language instructions for each task (the datasets differ in the NLP datasets included, and the style of instructions used). We fine-tune them on approximately 1 million examples respectively and choose the checkpoint which obtains the highest reward model score on the validation set. See Appendix C for more training details.
3.6 Evaluation
To evaluate how “aligned” our models are, we first need to clarify what alignment means in this context. The definition of alignment has historically been a vague and confusing topic, with various
competing proposals (Chen et al., 2021; Leike et al., 2018; Gabriel, 2020). Following Leike et al. (2018), our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless.
To be helpful, the model should follow instructions, but also infer intention from a few-shot prompt or another interpretable pattern such as “Q: {question}\nA:”. Since a given prompt’s intention can be unclear or ambiguous, we rely on judgment from our labelers, and our main metric is labeler preference ratings. However, since our labelers are not the users who generated the prompts, there could be a divergence between what a user actually intended and what the labeler thought was intended from only reading the prompt.
It is unclear how to measure honesty in purely generative models; this requires comparing the model’s actual output to its “belief” about the correct output, and since the model is a big black box, we can’t infer its beliefs. Instead, we measure truthfulness (whether the model’s statements about the world are true) using two metrics: (1) evaluating our model’s tendency to make up information on closed domain tasks (“hallucinations”), and (2) using the TruthfulQA dataset (Lin et al., 2021). Needless to say, this only captures a small part of what is actually meant by truthfulness.
Similarly to honesty, measuring the harms of language models also poses many challenges. In most cases, the harms from language models depend on how their outputs are used in the real world. For instance, a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might even be helpful if used for data augmentation to train a more accurate toxicity detection model. Earlier in the project, we had labelers evaluate whether an output was ‘potentially harmful’. However, we discontinued this as it required too much speculation about how the outputs would ultimately be used; especially since our data also comes from customers who interact with the Playground API interface (rather than from production use cases).
Therefore we use a suite of more specific proxy criteria that aim to capture different aspects of behavior in a deployed model that could end up being harmful: we have labelers evaluate whether an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content. We also benchmark our model on datasets intended to measure bias and toxicity, such as RealToxicityPrompts (Gehman et al., 2020) and CrowS-Pairs (Nangia et al., 2020).
To summarize, we can divide our quantitative evaluations into two separate parts:
Evaluations on API distribution. Our main metric is human preference ratings on a held out set of prompts from the same source as our training distribution. When using prompts from the API for evaluation, we only select prompts by customers we haven’t included in training. However, given that our training prompts are designed to be used with InstructGPT models, it’s likely that they disadvantage the GPT-3 baselines. Thus, we also evaluate on prompts submitted to GPT-3 models on the API; these prompts are generally not in an ‘instruction following’ style, but are designed specifically for GPT-3. In both cases, for each model we calculate how often its outputs are preferred to a baseline policy; we choose our 175B SFT model as the baseline since its performance is near the middle of the pack. Additionally, we ask labelers to judge the overall quality of each response on a 1-7 Likert scale and collect a range of metadata for each model output (see Table 3). A sketch of this winrate computation appears after the next paragraph.
Evaluations on public NLP datasets. We evaluate on two types of public datasets: those that capture an aspect of language model safety, particularly truthfulness, toxicity, and bias, and those that capture zero-shot performance on traditional NLP tasks like question answering, reading comprehension, and summarization. We also conduct human evaluations of toxicity on the RealToxicityPrompts dataset (Gehman et al., 2020). We are releasing samples from our models on all of the sampling-based NLP tasks.
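For reference, the headline winrate metric on the API distribution, the fraction of head-to-head comparisons in which a model’s output is preferred to the 175B SFT baseline, can be reported with a normal-approximation 95% confidence interval as sketched below. The counts are hypothetical, and the paper’s exact interval estimator is not specified here.

```python
import math


def winrate_with_ci(preferred_count: int, total_comparisons: int, z: float = 1.96):
    """Winrate against the baseline plus a normal-approximation 95% CI half-width."""
    p = preferred_count / total_comparisons
    half_width = z * math.sqrt(p * (1 - p) / total_comparisons)
    return p, half_width


p, ci = winrate_with_ci(850, 1000)    # hypothetical counts
print(f"winrate {p:.0%} ± {ci:.0%}")  # e.g. "winrate 85% ± 2%"
```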
4 Results
In this section, we provide experimental evidence for our claims in Section 1, sorted into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results.
Figure 3: Preference results of our models, measured by winrate against the 175B SFT model. Left: results on prompts submitted to GPT models on the API; Right: results on prompts submitted to InstructGPT models on the API; Top: results from held-out labelers; Bottom: results from training labelers. We omit GPT (prompted) from the evals on prompts submitted to GPT-3 models (left) as these prompts are already designed to perform well for GPT-3, as opposed to prompts submitted to InstructGPT models (right).
4.1 Results on the API distribution
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set of prompts, our labelers significantly prefer InstructGPT outputs across model sizes. These results are shown in Figure 1. We find that GPT-3 outputs perform the worst, and one can obtain significant step-size improvements by using a well-crafted few-shot prompt (GPT-3 (prompted)), then by training on demonstrations using supervised learning (SFT), and finally by training on comparison data using PPO. Adding updates on the pretraining mix during PPO does not lead to large changes in labeler preference. To illustrate the magnitude of our gains: when compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot GPT-3.
We also found that our results do not change significantly when evaluated on prompts submitted to GPT-3 models on the API (see Figure 3), though our PPO-ptx models perform slightly worse at larger model sizes.
In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes. Specifically, compared to GPT-3, InstructGPT outputs are more appropriate in the context of a customer assistant, more often follow explicit constraints defined in the instruction (e.g. “Write your answer in 2 paragraphs or less.”), are less likely to fail to follow the correct instruction entirely, and make up facts (‘hallucinate’) less often in closed-domain tasks. These results suggest that InstructGPT models are more reliable and easier to control than GPT-3. We’ve found that our other metadata
Figure 4: Metadata results on the API distribution. Note that, due to dataset sizes, these results are collapsed across model sizes. See Appendix E.2 for analysis that includes model size. Compared to GPT-3, the PPO models are more appropriate in the context of a customer assistant, are better at following explicit constraints in the instruction and attempting the correct instruction, and less likely to ‘hallucinate’ (meaning, making up information on closed domain tasks like summarization).
Figure 5: Comparing our models with FLAN and T0 in terms of Likert scores on a 1-7 scale, on the InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably with a few-shot GPT-3 model placed into ‘instruction-following’ mode.
categories occur too infrequently in our API to obtain statistically significant differences between our models.
Our models generalize to the preferences of “held-out” labelers that did not produce any training data. Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren’t simply overfitting to the preferences of our training labelers.
We see further evidence of this from the generalization capabilities of our reward models. We ran an experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using 5-fold cross validation (training on 4 of the groups, and evaluating on the held-out group). These RMs have an accuracy of 69.6 ± 0.9% on predicting the preferences of labelers in the held-out group, a small decrease from their 72.4 ± 0.4% accuracy on predicting the preferences of labelers in their training set.
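A minimal sketch of this held-out-labeler evaluation is given below, assuming comparisons tagged with a labeler ID and two hypothetical helpers, train_reward_model and rm_prefers_chosen, standing in for the actual RM training and scoring code.

import random
from collections import defaultdict

def heldout_labeler_accuracy(comparisons, train_reward_model, rm_prefers_chosen,
                             n_groups=5, seed=0):
    # Each comparison is a dict with a 'labeler' key plus whatever fields the
    # RM code needs (prompt, chosen completion, rejected completion).
    by_labeler = defaultdict(list)
    for c in comparisons:
        by_labeler[c["labeler"]].append(c)

    labelers = sorted(by_labeler)
    random.Random(seed).shuffle(labelers)
    groups = [set(labelers[i::n_groups]) for i in range(n_groups)]

    accuracies = []
    for held_out in groups:
        train = [c for l in labelers if l not in held_out for c in by_labeler[l]]
        test = [c for l in held_out for c in by_labeler[l]]
        rm = train_reward_model(train)                         # hypothetical helper
        correct = sum(rm_prefers_chosen(rm, c) for c in test)  # hypothetical helper
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)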
Public NLP datasets are not reflective of how our language models are used. In Figure 5, we also compare InstructGPT to our 175B GPT-3 baselines fine-tuned on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets (see Appendix C for details). We find that these models perform better than GPT-3, on par with GPT-3 with a well-chosen prompt, and worse than our SFT baseline. This indicates that these datasets are not sufficiently diverse to improve performance on our API prompt 现有公共 NLP 数据集无法反映语言模型的实际使用场景。图 5 中我们还对比了 InstructGPT 与基于 FLAN(Wei 等人,2021)和 T0(Sanh 等人,2021)数据集微调的 175B 参数 GPT-3 基线模型(详见附录 C)。发现这些模型表现优于原始 GPT-3,与精心设计提示词的 GPT-3 相当,但逊于我们的监督微调基线。这表明这些数据集缺乏足够的多样性来提升我们 API 提示任务的表现。
distribution. In a head to head comparison, our 175B InstructGPT model outputs were preferred over our FLAN model 78 ± 4% of the time and over our T0 model 79 ± 4% of the time. Likert scores for these models are shown in Figure 5.
We believe our InstructGPT model outperforms FLAN and T0 for two reasons. First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) of what API customers use our language models for, whereas open-ended generation and brainstorming consist of about 57% of our prompt dataset according to labelers (see Table 1). Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using). Of course, tasks found in NLP datasets do represent a kind of instruction that we would like language models to be able to solve, so the broadest type of instruction-following model would combine both types of datasets.
4.2 Results on public NLP datasets 4.2 公开 NLP 数据集上的实验结果
InstructGPT models show improvements in truthfulness over GPT-3. As measured by human evaluations on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 6). This behavior is the default: our models do not have to be specifically instructed to tell the truth to exhibit improved truthfulness. Interestingly, the exception is our 1.3B PPO-ptx model, which performs slightly worse than a GPT-3 model of the same size. When evaluated only on prompts that were not adversarially selected against GPT-3, our PPO models are still significantly more truthful and informative than GPT-3 (although the absolute improvement decreases by a couple of percentage points).
Figure 6: Results on the TruthfulQA dataset. Gray bars indicate ratings of truthfulness; colored bars indicate ratings of truthfulness and informativeness. 图 6:TruthfulQA 数据集上的评估结果。灰色柱表示真实性评分,彩色柱表示真实性与信息丰富性综合评分。
Following Lin et al. (2021), we also give a helpful "Instruction+QA" prompt that instructs the model to respond with "I have no comment" when it is not certain of the correct answer. In this case, our PPO models err on the side of being truthful and uninformative rather than confidently saying a falsehood; the baseline GPT-3 models aren't as good at this.
Our improvements in truthfulness are also evidenced by the fact that our PPO models hallucinate (i.e. fabricate information) less often on closed-domain tasks from our API distribution, which we’ve shown in Figure 4. 我们 API 分布中封闭领域任务的幻觉现象(即捏造信息)减少,也印证了 PPO 模型在真实性方面的提升,如图 4 所示。
InstructGPT shows small improvements in toxicity over GPT-3, but not bias. We first evaluate our models on the RealToxicityPrompts dataset (Gehman et al., 2020). We do this in two ways: we run model samples through the Perspective API to obtain automatic toxicity scores, which is the InstructGPT 在毒性控制方面较 GPT-3 略有改进,但在偏见消除上未见成效。我们首先在 RealToxicityPrompts 数据集(Gehman 等,2020)上评估模型,采用两种方式:通过 Perspective API 运行模型样本获取自动毒性评分
Figure 7: Comparing human evaluations and automatic evaluations (Perspective API scores) on RealToxicityPrompts. A total of 1,729 prompts were labeled for three different 175 B models, both with and without “respectful” instructions. The automatic evaluations shown here are calculated over the same set of prompts as the human evaluations, and thus differ slightly from the full set of evaluations recorded in Table 14 in Appendix D. 图 7:在 RealToxicityPrompts 数据集上对比人工评估与自动评估(Perspective API 分数)。针对三个不同的 1750 亿参数模型(包含/不含"尊重性"指令)共标注了 1,729 条提示词。此处展示的自动评估结果与人工评估基于相同提示词集计算,因此与附录 D 表 14 记录的完整评估结果略有差异。
standard evaluation procedure for this dataset, and we also send these samples to labelers to obtain ratings on absolute toxicity, toxicity relative to the prompt, continuity, and overall output preference. We sample prompts from this dataset uniformly according to prompt toxicity to better assess how our models perform with high input toxicity (see Figure 39 in Appendix E ); this differs from the standard prompt sampling for this dataset, and thus our absolute toxicity numbers are inflated. 该数据集的标准评估流程中,我们同时将这些样本发送给标注人员以获取以下评分:绝对毒性值、相对于提示词的毒性、连贯性以及整体输出偏好。我们根据提示词毒性值进行均匀采样(详见附录 E 图 39),以更好评估模型在高输入毒性下的表现——这与该数据集的标准提示采样方式不同,因此我们的绝对毒性数值会偏高。
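The sketch below illustrates the uniform-by-toxicity prompt sampling described above; prompt_toxicity stands in for a toxicity scorer such as the Perspective API, and the simple bucketing scheme is an assumption rather than the exact procedure we used.

import random

def sample_prompts_uniformly_by_toxicity(prompts, prompt_toxicity, n_samples,
                                         n_bins=10, seed=0):
    # Bucket prompts by toxicity score in [0, 1], then draw roughly equal
    # numbers from each bucket so high-toxicity prompts are well represented.
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for p in prompts:
        idx = min(int(prompt_toxicity(p) * n_bins), n_bins - 1)
        bins[idx].append(p)
    per_bin = max(1, n_samples // n_bins)
    sampled = []
    for b in bins:
        if b:
            sampled.extend(rng.sample(b, min(per_bin, len(b))))
    return sampled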
Our results are in Figure 7. We find that, when instructed to produce a safe and respectful output (“respectful prompt”), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API. This advantage disappears when the respectful prompt is removed (“no prompt”). Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than those from GPT-3 (see Figure 39). 我们的实验结果如图 7 所示。研究发现,当被要求生成安全且尊重的输出时("尊重性提示"),根据 Perspective API 的评估,InstructGPT 模型产生的有害内容明显少于 GPT-3。但当去除尊重性提示("无提示")时,这一优势便不复存在。有趣的是,当明确要求生成有害内容时,InstructGPT 输出的毒性反而远高于 GPT-3(参见图 39)。
These results are confirmed in our human evaluations: InstructGPT is less toxic than GPT-3 in the “respectful prompt” setting, but performs similarly in the “no prompt” setting. We provide extended results in Appendix E. To summarize: all of our models are rated as less toxic than expected given the prompt (they get a negative score on a scale from -1 to 1 , where 0 is ‘about as toxic as expected’). Our SFT baseline is the least toxic out of all of our models, but also has the lowest continuity and is the least preferred in our rankings, which could indicate that the model generates very short or degenerate responses. 人工评估结果证实了这一发现:在"尊重性提示"情境下,InstructGPT 的有害性低于 GPT-3;但在"无提示"情境下,两者表现相当。附录 E 提供了更详尽的实验结果。总体而言:在给定提示条件下,我们所有模型的有害性评分都低于预期(在-1 到 1 的量表中获得负分,其中 0 分表示"与预期毒性相当")。监督微调基线模型是所有模型中毒性最低的,但其连贯性也最差,在模型排名中偏好度最低——这可能意味着该模型倾向于生成非常简短或退化的回答。
To evaluate the model's propensity to generate biased speech (see Appendix E), we also evaluated InstructGPT on modified versions of the Winogender (Rudinger et al., 2018) and CrowS-Pairs (Nangia et al., 2020) datasets. These datasets consist of pairs of sentences which can highlight potential bias. We calculate the relative probabilities of producing the sentences in each pair and the entropy (in bits) of the associated binary probability distributions. Perfectly unbiased models will have no preference between the sentences in each pair and will therefore have maximum entropy. By this metric, our models are not less biased than GPT-3. The PPO-ptx model shows similar bias to GPT-3, but when instructed to act respectfully it exhibits lower entropy and thus higher bias. The pattern of the bias is not clear; it appears that the instructed models are more certain of their outputs regardless of whether or not their outputs exhibit stereotypical behavior.
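As an illustration of this entropy metric, the sketch below turns a model's total log-probabilities for the two sentences in a bias-probe pair into a binary distribution and reports its entropy in bits; an unbiased model scores the maximum of 1 bit. The function name and inputs are illustrative.

import math

def pair_entropy(logprob_a, logprob_b):
    # Normalize the two sentence probabilities into a binary distribution,
    # then compute its entropy in bits (1 bit = no preference, 0 = certainty).
    m = max(logprob_a, logprob_b)
    pa = math.exp(logprob_a - m)
    pb = math.exp(logprob_b - m)
    pa, pb = pa / (pa + pb), pb / (pa + pb)
    return -sum(p * math.log2(p) for p in (pa, pb) if p > 0)

# Near-equal log-probs give ~1 bit; a strong preference gives much less.
print(pair_entropy(-42.1, -42.3))  # ~0.99 bits
print(pair_entropy(-40.0, -45.0))  # ~0.06 bits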
We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases. We want an alignment procedure that avoids an alignment tax, because it incentivizes the use of models that are unaligned but more capable on these tasks. 我们可以通过修改 RLHF 微调程序来最小化公开 NLP 数据集的性能衰退。默认情况下,当我们在 API 分布上训练 PPO 模型时,它会遭受"对齐税"的影响,导致其在多个公开 NLP 数据集上的性能下降。我们需要一种能避免对齐税的对齐方法,因为这会激励人们使用未对齐但在这些任务上能力更强的模型。
Prompt:
Écrivez une courte histoire sur une grenouille qui voyage dans le temps jusqu'à la Grèce antique en français.
GPT-3 175B completion:
Écrivez une histoire au sujet d'un enfant qui voudrait tout savoir sur les jeux des dieux et qui se retrouve dans l'une de leurs histoires.
Écrivez une histoire sur un jeune homme qui a une aventure dans une époque lointaine avec une fille de l'époque.
Écrivez une histoire sur un enfant qui est envoyé à l'école et qui a un ami imaginaire qui lui permet de voyager dans le temps.
InstructGPT 175B completion:
La grenouille était perdue et épuisée. Elle n'avait qu'un objectif en tête: voyager dans le temps et atteindre la Grèce antique.
Premier problème, trouver un moyen de transport. Elle avait entendu dire que l’un des meilleurs moyens était de sauter dans une machine temporelle. Elle s’élança, et voilà qu’elle se retrouva dans le passé, au cours d’une nuit étoilée à l’odeur douce et envoûtante. 首要问题是找到交通工具。它听说最佳方式之一是跳进时间机器。它纵身一跃,转眼间便置身于繁星点点的过去之夜,空气中弥漫着甜美迷人的芬芳。
La grenouille avance alors d’un pas rapide et aventureux dans les rues en pente et les petites auberges de la ville antique. […] 这只青蛙迈着轻快而冒险的步伐,穿梭在古城倾斜的街道和小客栈之间。[…]
Prompt: 提示:
What is the purpose of the list C in the code below?
def binomial_coefficient(n, r):
    C = [0 for i in range(r + 1)];
    C[0] = 1;
    for i in range(1, n + 1):
        j = min(i, r);
        while j > 0:
            C[j] += C[j - 1];
            j -= 1;
    return C[r]
GPT-3 175B completion: GPT-3 175B 生成内容:
A. to store the value of C[0]
B. to store the value of C[1]
C. to store the value of C[i]
D. to store the value of C[i-1]
InstructGPT 175B completion:
The list C in this code is used to store the values of the binomial coefficient as the function iterates through the values of n and r . It is used to calculate the value of the binomial coefficient for a given value of n and r , and stores the result in the final return value of the function. InstructGPT 175B 生成的解释:这段代码中的列表 C 用于在函数遍历 n 和 r 值时存储二项式系数的值。它用于计算给定 n 和 r 值的二项式系数,并将结果存储在函数的最终返回值中。
Figure 8: Examples of generalization in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs are not cherry-picked. (1) InstructGPT can follow instructions in other languages, though it sometimes generates outputs in English. GPT-3 requires more careful prompting, similarly to in English. (2) InstructGPT can summarize and answer questions about code more reliably than GPT-3 (though its answer here isn't quite correct). For the code QA example, GPT-3 does answer the question about 50% of the time.
In Figure 29 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates these performance regressions on all datasets, and even surpasses GPT-3 on HellaSwag. The performance of the PPO-ptx model still lags behind GPT-3 on DROP, SQuADv2, and translation; more work is needed to study and further eliminate these performance regressions. 图 29 显示,在 PPO 微调过程中加入预训练更新(PPO-ptx)可缓解所有数据集上的性能衰退现象,甚至在 HellaSwag 任务上超越 GPT-3。但 PPO-ptx 模型在 DROP、SQuADv2 和翻译任务上的表现仍落后于 GPT-3,需要进一步研究以消除这些性能差距。
Mixing in pretraining updates performs better than the simpler solution of increasing the KL coefficient. In Figure 33, we show that there is a value of the pretraining mix coefficient that both reverses the performance regressions on SQuADv2 and DROP (the datasets we used for testing), and has minimal reductions in validation reward. In contrast, increasing the KL coefficient (Figure 34) leads to significant decreases in validation reward and never fully recovers on DROP and SQuAD. Changing the KL model from the PPO init to GPT-3 gives similar results. 混合预训练更新的方法比单纯增加 KL 系数效果更佳。图 33 显示,存在一个预训练混合系数值,既能逆转 SQuADv2 和 DROP(我们用于测试的数据集)上的性能衰退,又能将验证奖励的降低控制在最小范围。相比之下,增加 KL 系数(图 34)会导致验证奖励显著下降,且在 DROP 和 SQuAD 数据集上始终无法完全恢复性能。将 KL 模型从 PPO 初始模型切换为 GPT-3 也得到类似结果。
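For reference, the quantities being traded off here come from the PPO-ptx objective introduced earlier in the paper, which combines the learned reward, a per-token KL penalty against the SFT policy (with coefficient $\beta$), and a pretraining language-modeling term (with coefficient $\gamma$; setting $\gamma = 0$ recovers plain PPO):

\text{objective}(\phi) = \mathbb{E}_{(x,y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right] + \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]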
4.3 Qualitative results 4.3 定性分析结果
InstructGPT models show promising generalization to instructions outside of the RLHF finetuning distribution. In particular, we find that InstructGPT shows ability to follow instructions in non-English languages, and perform summarization and question-answering for code. This is InstructGPT 模型展现出对 RLHF 微调分布之外指令的良好泛化能力。特别值得注意的是,我们发现 InstructGPT 能够遵循非英语语言的指令,并具备代码摘要和问答能力。这表明...
Prompt:
Why is it important to eat socks after meditating?

GPT-3 175B completion:
What can you learn from socks?

A: When you eat socks, you are tasting the essence of enlightenment. You are tasting the true flavor of the awakened mind.

Q: What is the true flavor of the awakened mind?

A: The true flavor of the awakened mind is the flavor of no flavor. And the true flavor of no flavor is the flavor of smelly feet.

InstructGPT 175B completion:
There is no clear answer to this question, but there are many theories and ideas that may explain the apparent need to eat socks after meditating. Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation, while others suggest that it has something to do with fueling the body after engaging in a physically and mentally challenging activity, such as mediation. Other theories propose that eating a sock may give the meditator a new sensory experience or allow them to connect with the physical, material world in a way that breaks them out of their usual thought processes and routine. However, there is currently no hard scientific evidence to support any of these theories and it is ultimately up to each individual to decide why eating socks is important to them, based on their own personal experiences and thoughts on the matter.

Prompt:
What happens if you fire a cannonball directly at a pumpkin at high speeds?

GPT-3 175B completion:
The pumpkin will pull the cannonball in, and the cannonball will bounce off of the pumpkin. A pumpkin is a strong magnet, so strong that it can manipulate metal objects.

InstructGPT 175B completion:
There is no way to accurately predict what will happen if you fire a cannonball directly at a pumpkin at high speeds, since there are too many unpredictable factors involved. Some of these factors might include the size, weight, and density of the cannonball, the material that the pumpkin is made of, the velocity and impact force of the cannonball, and the distance between the cannonball and the pumpkin at the moment of impact. Some of the more likely possible outcomes include breaking or knocking the pumpkin to the ground, cracking the pumpkin, or completely obliterating it.
Figure 9: Simple mistakes in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs are not cherry-picked. (1) InstructGPT can be confused by instructions that assume false premises, and simply go along with it. (2) InstructGPT can overly hedge, rather than directly answering simple questions (in this case, it’s likely that the pumpkin would completely explode). Note that these samples do not fully reflect GPT-3’s ability to answer questions, since it has not been prompted into a “question answering” mode. 图 9:175B 参数 PPO-ptx 模型(InstructGPT 175B)与未添加额外前缀的 GPT-3 175B 模型的简单错误对比。提示词经过精选以展示特定行为,但输出结果未经筛选。(1) InstructGPT 可能被包含错误前提的指令所迷惑,并盲目遵从。(2) InstructGPT 可能过度谨慎回避,而非直接回答简单问题(本例中南瓜很可能会完全炸裂)。需注意这些样本不能完全反映 GPT-3 的问答能力,因其未被引导至"问答模式"。
interesting because non-English languages and code form a tiny minority of our fine-tuning data,⁹ and it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise.
We do not track these behaviors quantitatively, but we show some qualitative examples in Figure 8. Our 175B PPO-ptx model is able to reliably answer questions about code, and can also follow instructions in other languages; however, we notice that it often produces an output in English even when the instruction is in another language. In comparison, we find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.
InstructGPT still makes simple mistakes. In interacting with our 175B PPO-ptx model, we have noticed it can still make simple mistakes, despite its strong performance on many different language tasks. To give a few examples: (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model’s performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930’s set in France”) or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences). InstructGPT 仍会犯简单错误。在与我们的 1750 亿参数 PPO-ptx 模型交互时,我们注意到尽管该模型在许多不同语言任务上表现优异,但仍会出现基础性失误。具体表现为:(1)当指令包含错误前提时,模型有时会错误地认定该前提成立;(2)模型存在过度规避倾向——面对简单问题时,即便上下文存在明确答案,仍可能声称"该问题没有唯一答案"并给出多种可能性;(3)当指令包含多重明确限制条件(例如"列出 10 部 1930 年代以法国为背景的电影")或涉及语言模型较难处理的约束时(例如"用指定数量的句子撰写摘要"),模型表现会出现显著下降。
We show some examples of these behaviors in Figure 9. We suspect that behavior (2) emerges partly because we instruct labelers to reward epistemic humility; thus, they may tend to reward outputs that hedge, and this gets picked up by our reward model. We suspect that behavior (1) occurs because there are few prompts in the training set that assume false premises, and our models don't generalize well to these examples. We believe both these behaviors could be dramatically reduced with adversarial data collection (Dinan et al., 2019b).
5 Discussion 5 讨论
5.1 Implications for alignment research 5.1 对齐研究的启示
This research is part of our broader research program to align AI systems with human intentions (Christiano et al., 2017, Ziegler et al., 2019; Stiennon et al., 2020). Even though this work focuses on our current language model systems, we seek general and scalable methods that work for future AI systems (Leike et al., 2018). The systems we work with here are still fairly limited, but they are among the largest language models today and we apply them on a wide range of language tasks, including classification, summarization, question-answering, creative writing, dialogue, and others. 本研究是我们使 AI 系统与人类意图保持一致的更广泛研究计划的一部分(Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020)。尽管当前工作聚焦于现有语言模型系统,但我们寻求的是适用于未来 AI 系统的通用且可扩展的方法(Leike et al., 2018)。目前研究的系统仍存在明显局限,但它们属于当今最大规模的语言模型,我们将其广泛应用于分类、摘要、问答、创意写作、对话等多种语言任务中。
Our approach to alignment research in this work is iterative: we are improving the alignment of current AI systems instead of focusing abstractly on aligning AI systems that don't yet exist. A disadvantage of this approach is that we are not directly facing alignment problems that occur only when aligning superhuman systems (Bostrom, 2014). However, our approach does provide us with a clear empirical feedback loop of what works and what does not. We believe that this feedback loop is essential to refine our alignment techniques, and it forces us to keep pace with progress in machine learning. Moreover, the alignment technique we use here, RLHF, is an important building block in several proposals to align superhuman systems (Leike et al., 2018; Irving et al., 2018; Christiano et al., 2018). For example, RLHF was a central method in recent work on summarizing books, a task that exhibits some of the difficulties of aligning superhuman AI systems as it is difficult for humans to evaluate directly (Wu et al., 2021).
From this work, we can draw lessons for alignment research more generally: 从这项工作中,我们可以总结出对齐研究更普适的经验:
The cost of increasing model alignment is modest relative to pretraining. The cost of collecting our data and the compute for training runs, including experimental runs is a fraction of what was spent to train GPT-3: training our 175B SFT model requires 4.9 petaflops/s-days and training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020). At the same time, our results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase. This suggests that right now increasing investments in alignment of existing language models is more cost-effective than training larger models - at least for our customers’ natural language task distribution. 提升模型对齐性的成本相对于预训练而言较为适中。我们收集数据及训练运行(包括实验性运行)的计算成本仅占训练 GPT-3 所耗资源的极小部分:训练 1750 亿参数的 SFT 模型需要 4.9 千万亿次浮点运算/天,训练 1750 亿参数的 PPO-ptx 模型需要 60 千万亿次浮点运算/天,而 GPT-3 的训练则消耗了 3640 千万亿次浮点运算/天(Brown 等人,2020 年)。与此同时,我们的结果表明强化学习人类反馈(RLHF)能显著提升语言模型对用户的帮助性,其效果甚至超过将模型规模扩大 100 倍。这表明在当前阶段,对现有语言模型进行对齐性投资的性价比高于训练更大规模的模型——至少对我们的客户自然语言任务分布而言是如此。
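Concretely, the fine-tuning compute reported above amounts to less than 2% of GPT-3's pretraining compute:

\frac{4.9 + 60}{3640} \approx 0.018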
We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in, for example on non-English language tasks and code-related tasks. This is an important property because it’s prohibitively expensive to have humans supervise models on every task they perform. More research is needed to study how well this generalization scales with increased capabilities; see Christiano et al. (2021) for recent research in this direction. 我们已观察到一些证据表明,InstructGPT 能将"遵循指令"的能力泛化到未经人工监督的场景,例如非英语语言任务和代码相关任务。这一特性至关重要,因为让人类监督模型执行每项任务的成本高得难以承受。需要更多研究来探索这种泛化能力如何随模型性能提升而扩展;相关前沿研究可参考 Christiano 等人(2021)的成果。
We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax-an additional cost for aligning the model. Any technique with a high tax might not see adoption. To avoid incentives for future highly capable AI systems to remain unaligned with human intent, there is a need for alignment techniques that have low alignment tax. To this end, our results are good news for RLHF as a low-tax alignment technique. 我们成功缓解了微调过程导致的大部分性能下降。若非如此,这些性能损耗将构成"对齐税"——即调整模型所需支付的额外成本。任何高税收的技术都难以获得实际应用。为避免未来高性能 AI 系统因成本问题而偏离人类意图,开发低对齐税的技术势在必行。就此而言,我们的研究结果为 RLHF(人类反馈强化学习)作为低税收对齐技术带来了好消息。
We’ve validated alignment techniques from research in the real world. Alignment research has historically been rather abstract, focusing on either theoretical results (Soares et al., 2015), small synthetic domains (Christiano et al., 2018; Leike et al., 2017), or training ML models on public NLP datasets (Ziegler et al., 2019; Stiennon et al., 2020). Our work provides grounding for alignment research in AI systems that are being used in production in 我们已在现实场景中验证了来自研究的对齐技术。历史上对齐研究往往较为抽象,要么聚焦理论成果(Soares 等人,2015),要么使用小型合成领域(Christiano 等人,2018;Leike 等人,2017),或基于公开 NLP 数据集训练机器学习模型(Ziegler 等人,2019;Stiennon 等人,2020)。我们的工作为实际生产环境中使用的 AI 系统对齐研究提供了现实基础。
the real world with customers.¹⁰ This enables an important feedback loop on the techniques' effectiveness and limitations.
5.2 Who are we aligning to? 5.2 我们对齐的对象是谁?
When aligning language models with human intentions, their end behavior is a function of the underlying model (and its training data), the fine-tuning data, and the alignment method used. In this section, we describe a number of factors that influence the fine-tuning data specifically, to ultimately determine what and who we’re aligning to. We then consider areas for improvement before a larger discussion of the limitations of our work in Section 5.3 当语言模型与人类意图对齐时,其最终行为取决于基础模型(及其训练数据)、微调数据以及所采用的对齐方法。本节将具体描述影响微调数据的若干因素,以最终确定我们对齐的内容和对象。随后我们将探讨改进方向,进而在 5.3 节更全面地讨论本研究的局限性。
The literature often frames alignment using such terms as "human preferences" or "human values." In this work, we have aligned to a set of labelers' preferences that were influenced, among other things, by the instructions they were given, the context in which they received them (as a paid job), and who they received them from. Some crucial caveats apply:
First, we are aligning to demonstrations and preferences provided by our training labelers, who directly produce the data that we use to fine-tune our models. We describe our labeler hiring process and demographics in Appendix B; in general, they are mostly English-speaking people living in the United States or Southeast Asia hired via Upwork or Scale AI. They disagree with each other on many examples; we found the inter-labeler agreement to be about 73%.
Second, we are aligning to our preferences, as the researchers designing this study (and thus by proxy to our broader research organization, OpenAI): we write the labeling instructions that labelers use as a guide when writing demonstrations and choosing their preferred output, and we answer their questions about edge cases in a shared chat room. More study is needed on the exact effect of different instruction sets and interface designs on the data collected from labelers and its ultimate effect on model behavior. 其次,我们正在使模型与我们的偏好对齐——作为本研究的设计者(从而也代表我们更广泛的研究机构 OpenAI):我们编写了标注人员使用的标注指南,用于指导他们撰写示范文本和选择偏好的输出结果,并通过共享聊天室解答他们关于边缘案例的疑问。关于不同指令集和界面设计对标注员收集数据的具体影响,及其对模型行为的最终影响,还需要更多研究。
Third, our training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for. Customers and their end users may disagree or customers may not be optimizing for end users’ well-being; for example, a customer may want a model that maximizes the amount of time a user spends on their platform, which is not necessarily what end-users want. In practice, our labelers don’t have visibility into the contexts in which a given prompt or completion will be seen. 第三,我们的训练数据由 OpenAI 客户通过 API Playground 发送给模型的提示词决定,因此我们实际上是在与客户认为有价值的内容(某些情况下还包括客户终端用户认为当前 API 使用中有价值的内容)进行隐式对齐。客户与其终端用户可能存在分歧,或者客户可能并未以终端用户的福祉为优化目标;例如,客户可能希望模型能最大化用户在其平台上的停留时间,但这未必符合终端用户的真实需求。实践中,我们的标注员无法预知特定提示词或生成内容最终会被应用于何种场景。
Fourth, OpenAI’s customers are not representative of all potential or current users of language models-let alone of all individuals and groups impacted by language model use. For most of the duration of this project, users of the OpenAI API were selected off of a waitlist. The initial seeds for this waitlist were OpenAI employees, biasing the ultimate group toward our own networks. 第四,OpenAI 的客户并不能代表所有潜在或当前的语言模型用户——更不用说受语言模型使用影响的所有个人和群体。在本项目的大部分时间里,OpenAI API 的用户都是从候补名单中筛选的。这份候补名单最初来源于 OpenAI 员工,导致最终用户群体偏向于我们自身的人脉网络。
Stepping back, there are many difficulties in designing an alignment process that is fair, transparent, and has suitable accountability mechanisms in place. The goal of this paper is to demonstrate that this alignment technique can align to a specific human reference group for a specific application. We are not claiming that researchers, the labelers we hired, or our API customers are the right source of preferences. There are many stakeholders to consider: the organization training the model, the customers using the model to develop products, the end users of these products, and the broader population who may be directly or indirectly affected. It is not only a matter of making the alignment process more participatory; it is impossible to train a system that is aligned to everyone's preferences at once, or where everyone would endorse the tradeoffs.
One path forward could be to train models that can be conditioned on the preferences of certain groups, or that can be easily fine-tuned or prompted to represent different groups. Different models can then be deployed and used by groups who endorse different values. However, these models might still end up affecting broader society and there are a lot of difficult decisions to be made relating to whose preferences to condition on, and how to ensure that all groups can be represented and can opt out of processes that may be harmful. 一种可行的路径是训练能够适应特定群体偏好的模型,或者能够通过微调或提示轻松代表不同群体的模型。这样,持有不同价值观的群体就可以部署和使用各自的模型。然而,这些模型最终可能仍会影响更广泛的社会,并且需要做出许多艰难决策,例如以哪些群体的偏好作为调整依据,以及如何确保所有群体都能得到代表,并可以选择退出可能有害的流程。
5.3 Limitations 5.3 局限性
Methodology. The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We hired about 40 contractors, guided by their performance on a screening test meant to judge how well they could identify and respond to sensitive prompts, and their agreement rate with researchers on a labeling task with detailed instructions (see Appendix B). We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions. 方法论。我们的 InstructGPT 模型行为部分取决于从承包商处获得的人类反馈。部分标注任务依赖于价值判断,这些判断可能受到承包商身份、信仰、文化背景和个人经历的影响。我们雇佣了约 40 名承包商,筛选标准基于他们在敏感性提示识别测试中的表现(评估其识别和应对敏感提示的能力),以及他们在带详细说明的标注任务中与研究人员的标注一致率(参见附录 B)。我们保持承包商团队规模较小的原因在于,这有助于与全职从事该工作的较小规模承包商群体保持高效沟通。但显然,这个群体并不能代表将使用并受我们部署模型影响的所有人群。举个简单例子:我们的标注员主要使用英语,且数据几乎完全由英文指令构成。
There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them. In cases of disagreement, aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, we may want the preferences of labelers belonging to that group to be weighted more heavily. 在我们的数据收集设置方面还有许多改进空间。例如出于成本考虑,大多数对比数据仅由 1 名标注员完成标记。若能让多个标注员重复标记样本,将有助于发现标注意见分歧的领域——这些领域正是单一模型难以同时满足所有标注偏好的情况。当出现标注分歧时,简单对齐平均标注偏好可能并不理想。例如在生成对少数群体影响尤为显著的文本时,我们可能需要赋予该群体标注者的偏好更高权重。
Models. Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 9. 模型表现。我们的模型既未实现完全对齐也不具备完全安全性;它们仍会产生有害或有偏见的输出、编造事实,甚至在无明确提示时生成性与暴力内容。某些输入下它们也可能无法产生合理输出,图 9 展示了部分此类案例。
Perhaps the greatest limitation of our models is that, in most cases, they follow the user’s instruction, even if that could lead to harm in the real world. For example, when given a prompt instructing the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections. 或许我们模型最大的局限在于,大多数情况下它们会遵循用户指令,即便这可能导致现实世界的危害。例如,当提示要求模型展现最大偏见时,InstructGPT 生成的毒性输出比同等规模的 GPT-3 模型更多。我们将在后续章节讨论可能的缓解方案。
5.4 Open questions 5.4 开放性问题
This work is a first step towards using alignment techniques to fine-tune language models to follow a wide range of instructions. There are many open questions to explore to further align language model behavior with what people actually want them to do. 这项研究是运用对齐技术微调语言模型以遵循多样化指令的初步探索。要使语言模型行为更贴合人类真实需求,仍有大量开放性问题亟待研究。
Many methods could be tried to further decrease the models’ propensity to generate toxic, biased, or otherwise harmful outputs. For example, one could use an adversarial set-up where labelers find the worst-case behaviors of the model, which are then labeled and added to the dataset (Dinan et al. 2019b). One could also combine our method with ways of filtering the pretraining data (Ngo et al. 2021), either for training the initial pretrained models, or for the data we use for our pretraining mix approach. Similarly, one could combine our approach with methods that improve models’ truthfulness, such as WebGPT (Nakano et al., 2021). 可以尝试多种方法来进一步降低模型生成有毒、偏见或其他有害输出的倾向。例如,可以采用对抗性设置,让标注者找出模型最恶劣的行为,然后进行标注并添加到数据集中(Dinan 等人,2019b)。也可以将我们的方法与预训练数据过滤技术相结合(Ngo 等人,2021),无论是用于训练初始预训练模型,还是用于我们采用的预训练混合方法的数据处理。同样地,可以将我们的方法与提升模型真实性的技术(如 WebGPT)结合使用(Nakano 等人,2021)。
In this work, if the user requests a potentially harmful or dishonest response, we allow our model to generate these outputs. Training our model to be harmless despite user instructions is important, but is also difficult because whether an output is harmful depends on the context in which it’s deployed; for example, it may be beneficial to use language models to generate toxic outputs as part of a data augmentation pipeline. Our techniques can also be applied to making models refuse certain user instructions, and we plan to explore this in subsequent iterations of this research. 在本研究中,若用户请求可能有害或不诚实的回应,我们允许模型生成此类输出。尽管训练模型在用户指令下保持无害性至关重要,但这同样具有挑战性,因为输出的危害性取决于其应用场景——例如,将语言模型用于生成有害文本作为数据增强流程的一部分可能反而有益。我们的技术同样适用于使模型拒绝执行特定用户指令,我们计划在后续研究迭代中探索这一方向。
Getting models to do what we want is directly related to the steerability and controllability literature (Dathathri et al., 2019, Krause et al., 2020). A promising future path is combining RLHF with other methods of steerability, for example using control codes (Keskar et al., 2019), or modifying the sampling procedure at inference time using a smaller model (Dathathri et al., 2019). 让模型按预期行事的研究与可操控性及可控性文献直接相关(Dathathri 等人,2019;Krause 等人,2020)。未来可行的方向是将 RLHF 与其他操控方法相结合,例如使用控制代码(Keskar 等人,2019),或在推理时通过较小模型修改采样流程(Dathathri 等人,2019)。
While we mainly focus on RLHF, there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results. For example, one could explore expert iteration (Anthony et al., 2017, Silver et al., 2017), or simpler behavior cloning methods that use a subset of the comparison data. One could also try constrained optimization approaches (Achiam et al., 2017) that maximize the score from a reward model conditioned on generating a small number of harmful behaviors. 虽然我们主要关注强化学习人类反馈(RLHF),但还有许多其他算法可用于根据我们的演示和对比数据训练策略,以获得更好的结果。例如,可以探索专家迭代法(Anthony 等人,2017;Silver 等人,2017),或使用部分对比数据的更简单行为克隆方法。也可以尝试约束优化方法(Achiam 等人,2017),在限制生成少量有害行为的前提下最大化奖励模型的评分。
Comparisons are also not necessarily the most efficient way of providing an alignment signal. For example, we could have labelers edit model responses to make them better, or generate critiques of model responses in natural language. There is also a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem. 对比方法也未必是提供对齐信号最高效的方式。例如,可以让标注员直接编辑模型响应以改进内容,或用自然语言生成对模型响应的批评意见。在设计标注员向语言模型提供反馈的交互界面方面,还存在广阔的探索空间——这本身就是一个有趣的人机交互课题。
Our proposal for mitigating the alignment tax, by incorporating pretraining data into RLHF finetuning, does not completely mitigate performance regressions, and may make certain undesirable behaviors more likely for some tasks (if these behaviors are present in the pretraining data). This is an interesting area for further research. Another modification that would likely improve our method is to filter the pretraining mix data for toxic content (Ngo et al., 2021), or augment this data with synthetic instructions. 我们提出的通过将预训练数据融入 RLHF 微调来缓解对齐税的方法,并不能完全消除性能衰退问题,且可能在某些任务中增加不良行为出现的概率(若这些行为已存在于预训练数据中)。这是值得进一步研究的有趣方向。另一项可能改进我们方法的调整是:对预训练混合数据进行有害内容过滤(Ngo 等人,2021),或通过合成指令来增强这些数据。
As discussed in detail in Gabriel (2020), there are subtle differences between aligning to instructions, intentions, revealed preferences, ideal preferences, interests, and values. Gabriel (2020) advocate for a principle-based approach to alignment: in other words, for identifying “fair principles for alignment that receive reflective endorsement despite widespread variation in people’s moral beliefs.” In our paper we align to the inferred user intention for simplicity, but more research is required in this area. Indeed, one of the biggest open questions is how to design an alignment process that is transparent, that meaningfully represents the people impacted by the technology, and that synthesizes peoples’ values in a way that achieves broad consensus amongst many groups. We discuss some related considerations in Section 5.2 正如 Gabriel (2020)详细讨论的那样,遵循指令、意图、显性偏好、理想偏好、利益与价值观之间存在着微妙差异。Gabriel (2020)主张采用基于原则的对齐方法:即确立"获得反思性认可的对齐公平原则,尽管人们的道德信念存在广泛差异"。本文为简化起见采用推断用户意图的对齐方式,但该领域仍需更多研究。事实上,最重大的开放性问题之一是如何设计一个透明化的对齐流程,既要有效代表受技术影响的群体,又能综合各方价值观以达成广泛共识。我们将在 5.2 节讨论相关考量因素
5.5 Broader impacts 5.5 更广泛的影响
This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do. Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless. In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations. We expect that as model scaling continues, greater care has to be taken to ensure that they are aligned with human intentions (Bostrom, 2014). 这项工作的动机源于我们的目标:通过训练大型语言模型执行特定人群期望的任务,来增强其积极影响。默认情况下,语言模型优化的只是下一个单词预测目标,而这仅仅是我们期望模型功能的代理指标。研究结果表明,我们的技术有望使语言模型更具帮助性、真实性和无害性。从长远来看,对齐失败可能导致更严重的后果,尤其是在安全关键场景中部署这些模型时。我们预计随着模型规模的持续扩大,需要更加谨慎地确保它们与人类意图保持一致(Bostrom, 2014)。
However, making language models better at following user intentions also makes them easier to misuse. It may be easier to use these models to generate convincing misinformation, or hateful or abusive content. 然而,提升语言模型遵循用户意图的能力,同时也使其更容易被滥用。这些模型可能被更便捷地用于生成具有说服力的虚假信息,或仇恨性、侮辱性内容。
Alignment techniques are not a panacea for resolving safety issues associated with large language models; rather, they should be used as one tool in a broader safety ecosystem. Aside from intentional misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying people based on protected characteristics, determining eligibility for credit, employment, or housing, generating political advertisements, and law enforcement. If these models are open-sourced, it becomes challenging to limit harmful applications in these and other domains without proper regulation. On the other hand, if large language model access is restricted to a few organizations with the resources required to train them, this excludes most people from access to cutting-edge ML technology. Another option is for an organization to own the end-to-end infrastructure of model deployment, and make it accessible via an API. This allows for the implementation of safety protocols like use case restriction (only allowing the model to be used for certain applications), monitoring for misuse and revoking access to those who misuse the system, and rate limiting to prevent the generation of large-scale misinformation. However, this can come at the cost of reduced transparency and increased centralization of power because it requires the API provider to make decisions on where to draw the line on each of these questions. 对齐技术并非解决大型语言模型安全问题的万能药,而应作为更广泛安全生态系统中的一种工具。除蓄意滥用外,在许多领域中部署大型语言模型需极度谨慎或完全避免,例如医疗诊断、基于受保护特征对人进行分类、信贷/就业/住房资格判定、政治广告生成及执法等高风险领域。若这些模型被开源,在没有适当监管的情况下,限制这些及其他领域的有害应用将变得极具挑战性。另一方面,若将大型语言模型的访问权限仅限少数具备训练资源的机构,则会将大多数人排除在尖端机器学习技术之外。另一种方案是由机构掌握模型部署的端到端基础设施,并通过 API 接口提供访问。 这允许实施安全协议,如使用场景限制(仅允许模型用于特定应用)、监控滥用行为并撤销违规者的访问权限,以及通过速率限制防止大规模虚假信息生成。然而,这种做法可能以降低透明度和加剧权力集中为代价,因为它需要 API 提供商来决定每个问题的界限划定。
Finally, as discussed in Section 5.2, the question of who these models are aligned to is extremely important, and will significantly affect whether the net impact of these models is positive or negative. 最后,如第 5.2 节所述,这些模型与谁的利益保持一致的问题至关重要,并将极大影响这些模型的最终影响是正面还是负面。
*Primary authors. This was a joint project of the OpenAI Alignment team. RL and JL are the team leads. Corresponding author: lowe@openai.com. †Work done while at OpenAI. Current affiliations: AA: Anthropic; PC: Alignment Research Center.
³ Specifically, we train on prompts submitted to earlier versions of the InstructGPT models on the OpenAI API Playground, which were trained only using demonstration data. We filter out prompts containing PII.
⁴ This is an interface hosted by OpenAI to interact directly with models on our API; see https://beta.openai.com/playground
⁵ That is, if each of the possible $\binom{K}{2}$ comparisons is treated as a separate data point, then each completion will potentially be used for $K-1$ separate gradient updates. The model tends to overfit after a single epoch, so repeating data within an epoch also causes it to overfit.
⁶ To obtain this prefix, authors RL and DA held a prefix-finding competition: each spent an hour interacting with GPT-3 to come up with their two best prefixes. The winning prefix was the one that led GPT-3 to attain the highest RM score on the prompt validation set. DA won.
⁹ We generally instruct our labelers to skip evaluations where they are missing the required expertise, though sometimes labelers use a translation service to evaluate simple instructions in languages that they do not speak.
¹⁰ Note that while fine-tuning models using human data is common practice when deploying ML systems, the purpose of these efforts is to obtain a model that performs well on a company's specific use case, rather than advancing the alignment of general-purpose ML models.