On Language Models

Language models typically predict outputs sequentially, a process commonly called "next-token prediction." In the text domain, these models generate content token by token; in the image domain, they produce outputs patch by patch. There are meaningful differences between language models for text and for images, but this discussion will not delve into them. A significant limitation of next-token prediction is that once a token has been generated, it cannot be revised; the model can only act on subsequent tokens.
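The append-only nature of this process can be made concrete with a minimal sketch. The toy bigram table below is a hypothetical stand-in for a real neural model; the point is only the control flow: each step predicts one token from the current context, and emitted tokens are frozen forever.

```python
# Minimal sketch of autoregressive ("next-token") generation.
# A toy bigram lookup table stands in for a real model's
# next-token distribution (this table is purely illustrative).
BIGRAMS = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<eos>",
}

def generate(max_tokens=10):
    tokens = []
    prev = "<s>"  # start-of-sequence marker
    for _ in range(max_tokens):
        # "Predict" the next token from the current context only.
        nxt = BIGRAMS.get(prev, "<eos>")
        if nxt == "<eos>":
            break
        # Once appended, a token is never revisited or revised.
        tokens.append(nxt)
        prev = nxt
    return tokens

print(generate())  # → ['the', 'cat', 'sat']
```

Note that nothing in the loop ever writes to an earlier position of `tokens`; that structural constraint, not the choice of model, is the limitation discussed above.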

However, is this next-token prediction strategy truly effective? Does it align with human behavior? It seems unlikely. Consider the drawing process: humans do not typically draw an image patch by patch. Instead, they start with a global sketch and refine it over multiple iterations, free to revise any part of the image at any point. Similarly, when writing an article, humans are not confined to composing sentences word by word; they can revisit and revise previously written words whenever they find them unsatisfactory. Often we cannot judge whether our current words are adequate until we have written the ones that follow.

This prompts a reconsideration of the next-token prediction algorithm, which is often considered a cornerstone for the development of Artificial General Intelligence (AGI).
