
Deep dive into LLMs like ChatGPT by Andrej Karpathy (TL;DR)


Who is this deep dive for?

A few days ago, Andrej Karpathy released a video titled "Deep dive into LLMs like ChatGPT." It’s a goldmine of information, but it’s also 3 hours and 31 minutes long. I watched the whole thing and took a bunch of notes, so I figured why not put together a TL;DR version for anyone who wants the essential takeaways without the large time commitment.

If any of this sounds like you, this post (and the original video) is worth checking out:

  • You want to understand how LLMs actually work, not just at the surface level.
  • You want to understand confusing fine-tuning terms like chat_template and ChatML (especially if you're using Axolotl).
  • You want to get better at prompt engineering by understanding why some prompts work better than others.
  • You’re trying to reduce hallucinations and want to know how to keep LLMs from making things up.
  • You want to understand why DeepSeek-R1 is such a big deal right now.

I won’t be covering everything in the video, so if you have time, definitely watch the whole thing. But if you don’t, this post will give you the key takeaways.

Note: If you are looking for the Excalidraw diagram that Andrej made for the video, you can download it here. He shared it through Google Drive, but that link expires after a certain time, which is why I have decided to host it on my CDN as well.

Pretraining Data

Internet


LLMs start by crawling the internet to build a massive text dataset. The problem? Raw data is noisy and full of duplicate content, low-quality text, and irrelevant information. Before training, it needs heavy filtering.

  • If you're building an English-only model, you'll need a heuristic to filter out non-English text (e.g., only keeping text with a high probability of being English).
  • One example dataset is FineWeb, which contains over 1.2 billion web pages.

Once cleaned, the data still needs to be compressed into something usable. Instead of feeding raw text into the model, it gets converted into tokens: a structured, numerical representation.

Tokenization

Tokenization is how models break text into smaller pieces (tokens) before processing it. Instead of storing raw words, the model converts them into IDs that represent repeating patterns.

  • A popular technique for this is Byte Pair Encoding (BPE).
  • There’s a sweet spot for the number of symbols (tokens) used for compression. GPT-4, for example, uses a vocabulary of 100,277 tokens; the exact size is up to the model creator.
  • You can visualize how this works using tools like Tiktokenizer.
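
To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (my own example; the video uses the Tiktokenizer web app instead):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # the encoding used by GPT-4
ids = enc.encode("Tokenization breaks text into pieces")
print(ids)                                      # a short list of integer token IDs
print([enc.decode([i]) for i in ids])           # the text chunk each ID stands for
print(enc.n_vocab)                              # 100277 possible tokens in this vocabulary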

Neural Network I/O


Once the data is tokenized, it’s fed into the neural network. Here’s how that process works:

  1. The model takes in a context window, a fixed number of tokens (e.g., 8,000 for some models, up to 128k for GPT-4).
  2. It predicts the next token based on the patterns it has learned.
  3. The weights in the model are adjusted using backpropagation to reduce errors.
  4. Over time, the model learns to make better predictions.

A longer context window means the model can "remember" more from the input, but it also increases computational cost.
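
Here is a heavily simplified sketch of that training loop, assuming PyTorch; the tiny model and random data below are illustrative stand-ins, not a real transformer:

import torch
import torch.nn as nn

vocab_size, context_window, dim = 1000, 8, 32

model = nn.Sequential(                      # toy stand-in for a real transformer
    nn.Embedding(vocab_size, dim),
    nn.Flatten(),
    nn.Linear(context_window * dim, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (4, context_window + 1))   # fake token IDs
inputs, targets = tokens[:, :-1], tokens[:, -1]                  # context -> next token

logits = model(inputs)                                # scores over the whole vocabulary
loss = nn.functional.cross_entropy(logits, targets)   # how wrong was the prediction?
loss.backward()                                       # backpropagation
optimizer.step()                                      # adjust the weights
optimizer.zero_grad()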

Neural Network Internals


Inside the model, billions of parameters interact with the input tokens to generate a probability distribution for the next token.

  • This process is defined by complex mathematical equations optimized for efficiency.
  • Model architectures are designed to balance speed, accuracy, and parallelization.
  • You can see a production-grade LLM architecture example here.
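
In essence (notation mine, not the video's), the network is a parameterized function f_theta that maps the context tokens to a score (logit) for every token in the vocabulary, and a softmax turns those scores into probabilities:

P(t_{n+1} = i \mid t_1, \dots, t_n) = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}, \qquad z = f_\theta(t_1, \dots, t_n)

where theta denotes the billions of parameters and V is the vocabulary size (e.g., 100,277 for GPT-4).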

Inference


LLMs don’t generate deterministic outputs; they are stochastic. This means the output varies slightly every time you run the model.

  • The model doesn’t just repeat what it was trained on; it generates responses based on probabilities.
  • In some cases, the response will match something in the training data exactly, but most of the time, it will generate something new that follows similar patterns.

This randomness is why LLMs can be creative, but also why they sometimes hallucinate incorrect information.
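
A minimal sketch of how that sampling step looks in practice (my own example, assuming PyTorch; real inference stacks add tricks like top-k and top-p filtering):

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)    # higher temperature = flatter distribution
    return torch.multinomial(probs, num_samples=1).item()  # draw one token at random

logits = torch.randn(100_277)             # pretend logits over a GPT-4-sized vocabulary
print(sample_next_token(logits, 0.7))     # a (usually) different token ID on every run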

GPT-2

GPT-2, released by OpenAI in 2019, was an early example of a transformer-based LLM.

Here’s what it looked like:

  • 1.6 billion parameters
  • 1024-token context length
  • Trained on ~100 billion tokens
  • The original GPT-2 training cost was $40,000.

Since then, efficiency has improved dramatically. Andrej Karpathy managed to reproduce GPT-2 using llm.c for just $672. With optimized pipelines, training costs could drop even further to around $100.

Why is it so much cheaper now?

  • Better pre-training data extraction techniques → Cleaner datasets mean models learn faster.
  • Stronger hardware and optimized software → Less computation needed for the same results.

Open Base Models

Disclaimer: The models under discussion do not strictly follow the Open Source Initiative (OSI) definitions of open-source AI (OSI AI Definition). Instead, we use the term open base models to describe models where the weights are publicly accessible, but training data and full reproducibility may not be provided.

Some companies train massive language models (LMs) and release the base models for free. A base model is essentially a raw, pre-trained LM—it still requires fine-tuning or alignment to be practically useful.

  • Base models are trained on unfiltered internet-scale data, meaning they generate raw completions but lack alignment with human intent.
  • OpenAI released GPT-2, an open-weight and source-available model, but not fully open-source under OSI’s definition since its training data was not released.
  • Meta released Llama 3.1 (405B parameters), an open-weight but not open-source model.

To release a base model, two key components are required:

  1. Inference code → Defines the steps the model follows to generate text. Typically written in Python.
  2. Model weights → The billions of parameters that encode the model’s knowledge.

How Base Models Work

  • They generate token-level internet-style text.
  • Every run produces a slightly different output (stochastic behavior).
  • They can regurgitate parts of their training data.
  • The parameters are like a lossy zip file of internet knowledge.
  • You can already use them for applications like:
    • Translation → Using in-context examples (see the sketch after this list).
    • Basic assistants → Prompting them in a structured way.
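
For example, a purely pre-trained model asked to continue the prompt below will usually complete the pattern with a plausible translation, even though it was never explicitly trained to translate (the prompt is my own illustration, not one from the video):

English: Good morning -> French: Bonjour
English: Thank you very much -> French: Merci beaucoup
English: Where is the train station? -> French:

A good base model will typically continue with "Où est la gare ?" simply because that is the most likely continuation of the pattern.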

Want to experiment with one? Try the Llama 3 (405B base model) here.

At its core, a base model is just an expensive autocomplete. It still needs fine-tuning.


Pre-Training to Post-Training

So far, we’ve looked at base models, which are just pre-trained text generators. But to make an actual assistant, you need post-training.

  • Base models hallucinate a lot → They generate text, but it’s not always useful.
  • Post-training fixes this by fine-tuning the model to respond better.
  • The good news? Post-training is way cheaper than pre-training (months of pre-training vs. hours of post-training).

Supervised Fine-Tuning (SFT)

Data Conversations

Once the base model is trained on internet data, the next step is post-training. This is where we replace the internet dataset with human/assistant conversations to make the model more conversational and useful.

  • Pre-training takes months, but post-training is much faster. It can take as little as a few hours.
  • The model's algorithm stays the same; we’re just fine-tuning the existing parameters.

To teach a model how to handle back-and-forth conversations, we use chat templates. These define the structure of a conversation and let the model know which part is user input and which part is an assistant response. You can read more about them here.

Example template:

<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
<|im_start|>user<|im_sep|>What is 4 + 4?<|im_end|>
<|im_start|>assistant<|im_sep|>4 + 4 = 8<|im_end|>
  • <|im_start|> and <|im_end|> are special tokens that help structure conversations.
  • The model didn’t see these new tokens during pre-training; they’re introduced during post-training.
  • OpenAI has discussed fine-tuning LLMs for conversation in the InstructGPT paper.

To visualize this, go to tiktokenizer.
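
If you are fine-tuning with the Hugging Face ecosystem, the chat template is applied for you by the tokenizer; here is a minimal sketch (the model name is just a placeholder for whichever chat-tuned model you are using):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")   # placeholder name

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 4 + 4?"},
]

# Wraps each turn in the model's own special tokens (ChatML-style
# <|im_start|>/<|im_end|> for some models, different markers for others).
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))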

One such post-training dataset is OASST1. Early post-training datasets were hand-curated by humans. Now, models like UltraChat can generate synthetic conversations, allowing models to improve without as much human input. You can visualize these mostly synthetic datasets here.

Hallucinations, Tool Use, and Memory

One major issue with LLMs is hallucination, where the model confidently generates incorrect or made-up information.

Why does this happen?

  • During post-training, models learn that they must always give an answer.
  • Even if the question doesn’t make sense, the model tries to generate a response instead of saying, “I don’t know.”

How Meta Deals with Hallucinations

Meta’s research on factuality (from their Llama 3 paper) describes a way to improve this:

  1. Extract a snippet of training data.
  2. Generate a factual question about it using Llama 3.
  3. Have Llama 3 generate an answer.
  4. Score the response against the original data.
  5. If incorrect, train the model to recognize and refuse incorrect responses.

Essentially, this process teaches models to recognize their own knowledge limits.

Using Tools to Reduce Hallucinations

One way to fix hallucinations is to train models to use tools when they don’t know the answer. This approach follows the pattern:

<|im_start|>user<|im_sep|>Who is Orson Kovacs?<|im_end|>
<|im_start|>assistant<|im_sep|><SEARCH_START>Who is Orson Kovacs?<SEARCH_END><|im_end|>

[...search results...]

<|im_start|>assistant<|im_sep|>Orson Kovacs is ....<|im_end|>

With repeated training, models learn that if they don’t know something, they should look it up instead of making things up.

"Vague Recollection" vs. "Working Memory"

  • Model parameters store vague recollections (like remembering something from a month ago).
  • Context tokens function as working memory, giving models access to fresh information.

This is why retrieval-augmented generation (RAG) works so well: if the model has direct access to relevant documents, it doesn’t need to guess.
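
A toy illustration of the idea (pure-Python keyword matching stands in for a real embedding search; the documents and the retrieve helper are made up for this example):

# Toy retrieval-augmented generation: fetch relevant text first, then let the
# model answer from its working memory (the context window) instead of guessing.
documents = [
    "Orson Kovacs is a fictional person used as an example in the video.",
    "Pelicans are large water birds with a distinctive throat pouch.",
]

def retrieve(question: str) -> str:
    # stand-in for embedding search: pick the document sharing the most words
    words = set(question.lower().replace("?", "").split())
    return max(documents, key=lambda d: len(words & set(d.lower().split())))

question = "Who is Orson Kovacs?"
prompt = (
    "Answer using only the context below.\n\n"
    f"Context: {retrieve(question)}\n\n"
    f"Question: {question}"
)
print(prompt)   # this grounded prompt is what gets sent to the LLM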

Knowledge of Self

If you prompt an untuned base model about who it is, it will likely hallucinate. For example, a non-OpenAI model might still say it was created by OpenAI simply because most internet data links AI models to OpenAI.

How to Fix This

  • Hardcode self-identity into training data → Example: Olmo-2 dataset.
  • Use system messages → At the start of every conversation, include a reminder of its identity (see the example after this list).
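
For example, an identity-setting conversation might look like this (an illustrative snippet in the ChatML-style template shown earlier, not taken from any particular model's training set):

<|im_start|>system<|im_sep|>You are Olmo 2, an open language model built by Ai2. You are not affiliated with OpenAI.<|im_end|>
<|im_start|>user<|im_sep|>Who built you?<|im_end|>
<|im_start|>assistant<|im_sep|>I was built by Ai2, the Allen Institute for AI.<|im_end|>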

By default, LLMs have no real knowledge of themselves. Without specific training, they default to generic AI responses.

Models Need Tokens to Think

LLMs don’t reason like humans. They generate tokens sequentially, meaning they need structured generation to think properly.

Example: Bad vs. Good Model Output

Bad Model Output:

Human: Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of apples?
Assistant: The answer is $3.

The model jumps to the answer without breaking it down.

Good Model Output:

Human: Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of apples?
Assistant: The total cost of the oranges is $4. 13 - 4 = 9, so the cost of the 3 apples is $9. 9/3 = 3, so each apple costs $3.

Here, the model works through the reasoning step-by-step.

Why This Matters

  • If a model jumps straight to an answer, it might just be guessing.
  • If it walks through the solution step-by-step, it’s more reliable.
  • Breaking the problem into smaller steps matters because the model has a finite number of layers, so only a limited amount of computation can go into producing any single token. Spreading the reasoning across many tokens gives the model a much better chance of reaching the correct answer.

For math and logic tasks, it’s best to ask the model to use external tools rather than relying on its own reasoning.
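
For example, a tool-augmented conversation might look like this (the <CODE_START>/<CODE_END> tokens are illustrative, mirroring the search example above, not an actual API):

<|im_start|>user<|im_sep|>What is 4517 * 2389? Use the code tool, don't do it in your head.<|im_end|>
<|im_start|>assistant<|im_sep|><CODE_START>print(4517 * 2389)<CODE_END><|im_end|>

[...tool output: 10791113...]

<|im_start|>assistant<|im_sep|>4517 * 2389 = 10,791,113<|im_end|>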


Reinforcement Learning


Once a model is trained on internet data, it still doesn’t know how to use its knowledge effectively.

  • Supervised fine-tuning teaches it to mimic human responses.
  • Reinforcement learning (RL) helps it improve by trial and error.

How RL Works


Instead of relying on human-created datasets, RL lets the model experiment with different solutions and figure out what works best.

Example Process:

We generated 15 solutions.
Only 4 of them got the right answer.
Take the top solution (one that is both right and short).
Train on it.
Repeat many, many times.

No human is involved in this process. The model generates different solutions to the same problem, sometimes millions of them, then compares them, keeps the ones that reached the correct answer, and trains on those winning solutions.
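
In code-shaped pseudocode, the loop looks roughly like this (model, sample_problem, generate_solution, is_correct, and train_on are all hypothetical stand-ins, not a real library; real pipelines are far more involved):

# Rejection-sampling flavour of RL: sample many attempts, keep the winners,
# train on them, and repeat. All helpers below are hypothetical stand-ins.
def reinforcement_step(model, problem, num_samples=15):
    attempts = [generate_solution(model, problem.prompt) for _ in range(num_samples)]
    winners = [a for a in attempts if is_correct(a, problem.answer)]   # e.g. 4 of 15
    if winners:
        best = min(winners, key=len)           # prefer solutions that are right AND short
        train_on(model, problem.prompt, best)  # reinforce the winning behaviour

for step in range(1_000_000):                  # "repeat many, many times"
    reinforcement_step(model, sample_problem())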

The pre-training and post-training (SFT) stages are well established, but the RL stage is still an area of very active research. Andrej talks about this here as well. Companies like OpenAI do a lot of RL research, but little of it is public, which is why the release of DeepSeek-R1 was such a big deal: their paper discusses RL and fine-tuning for LLMs very openly and shows how it brings out strong reasoning capabilities.

An example from the DeepSeek paper shows that as training progresses, the model learns to use more tokens to reason better.


You can see that the model has an "aha" moment here; this isn't something you can explicitly teach it by training on a dataset. It's something the model has to figure out on its own through reinforcement learning. The upside of this technique is that the model gets better at reasoning; the downside is that it consumes more and more tokens to do so.

One thing we can learn from the research paper on mastering the game of Go is that RL can make a model better at reasoning than its human counterparts. The model isn't just trying to imitate humans; it comes up with its own strategies through trial and error to win the game.


One unique moment during AlphaGo's games was a move known as Move 37. It wasn't part of the training data; the model came up with it as part of its own winning strategy, and researchers estimated the chance of a human playing it at about 1 in 10,000. It shows just how capable the model is of inventing strategies of its own.

RL is still largely unexplored, and there is a lot of ongoing research in this area. It's entirely possible that, given the chance, an LLM might even come up with a language of its own because it discovers that this is the best way to express its thoughts and ideas.

Learning in unverifiable domains, a.k.a. Reinforcement Learning from Human Feedback (RLHF)

It is easy to exclude humans from the RL process in verifiable domains, since an LLM's answers can be checked automatically (the LLM can even act as a judge of its own performance).

However, in unverifiable domains, we need to include humans in the loop.

For example, for a prompt like "Write a joke about pelicans", there is no easy way to automatically judge the quality of the joke. The LLM will happily generate jokes, but judging their quality at scale is not possible.

Furthermore, including humans in this process at scale is not feasible. This is where RLHF comes in. You can read more about it in this paper.


In order to do RLHF at scale, you train a separate reward model (typically another transformer with a scalar scoring head). Humans rank a set of candidate responses, and those rankings are used to train the reward model until you are satisfied with how closely its scores match human preferences. Once that is done, the reward model can judge the quality of LLM responses at scale, standing in for the human.
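
A minimal sketch of the usual pairwise-ranking objective for such a reward model, assuming PyTorch (the tiny scoring network and random inputs below are toy stand-ins for a transformer over real response tokens):

import torch
import torch.nn as nn

embedding_dim = 64
reward_model = nn.Sequential(nn.Linear(embedding_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# stand-ins for encoded versions of two candidate responses to the same prompt,
# where a human labeller preferred `chosen` over `rejected`
chosen, rejected = torch.randn(1, embedding_dim), torch.randn(1, embedding_dim)

r_chosen, r_rejected = reward_model(chosen), reward_model(rejected)
loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()   # score chosen above rejected
loss.backward()
optimizer.step()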


RLHF Upside

  • Enables RL in unverifiable domains like joke-writing or summarization.
  • Often improves models by reducing hallucinations and making responses more human-like.
  • Exploits the "discriminator-generator gap"—humans find it easier to evaluate an answer than generate one.
    • Example: "Write a poem" vs. "Which of these 5 poems is best?"

RLHF Downside

  • The reward model is just a simulation of human preferences, not an actual human. This can be misleading.
  • RL can game the system, producing adversarial examples that exploit weaknesses in the reward model.
    • Example: After 1,000 updates, the model’s "top joke about pelicans" might be complete nonsense (e.g., "the the the the the the the the").
  • This is known as Adversarial Machine Learning. Since there are infinite ways to game the system, filtering out bad responses isn’t straightforward.
  • To prevent this, reward model training is capped at a few hundred iterations—beyond that, models start over-optimizing and performance declines.

Preview of Things to Come

Future LLMs will expand in several key areas:

  • Multimodal Capabilities → Not just text, but also understanding and generating images, audio, and video.
  • Agent-Based Models → Moving beyond single tasks to long-term memory, reasoning, and correction of mistakes.
  • Pervasive & Invisible AI → AI will be integrated into workflows in a way that becomes second nature.
  • Computer-Using AI → AI models that interact with software and take actions beyond just text generation.
  • Test-Time Training → AI adapting itself in real-time to improve accuracy on the fly.

Keeping Track of LLMs

If you're interested in following developments in this space, here are some great resources:

  • LM Arena → Benchmarking new language models.
  • AI News → A newsletter covering AI research.
  • X (Twitter) → Many researchers share updates here.

Where to Find LLMs

Want to try different LLMs? Here’s where to find them:

  • Proprietary Models → OpenAI (GPT-4), Google (Gemini), Anthropic (Claude), etc.
  • Open-Weight Models → DeepSeek, Meta (Llama), etc. Try them via Together.ai.
  • Run Locally → Use Ollama or LM Studio.
  • Base Models → Explore Hyperbolic.