Letting humans steer it

Writing perfect example answers by hand is slow, and it can never cover every situation. So researchers added a cleverer step: instead of always showing the model the ideal answer, let it write several answers and have humans simply point at the better one.

The full recipe has three stages. First, the model learns language from internet text (the giant training loop we started with). Second, it is fine-tuned on those hand-written good conversations. Third, and this is the new part, the model produces a range of answers, human raters mark which they prefer, and the model is trained to lean toward the kind of answer people kept choosing. Picking a winner is much faster than writing one from scratch, so this scales to far more feedback.

This third stage has a name: RLHF, short for reinforcement learning from human feedback. Reinforcement learning just means learning from a reward rather than from a labelled answer; here the reward is "humans liked this one more." It is how ChatGPT was built on top of a GPT base model. It is why the assistant tends to be cooperative, why it apologises, and why it declines certain requests. Those are not rules someone typed in. They are the shape left behind by thousands of small human preferences.
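For the curious, here is a minimal sketch of how "humans liked this one more" can become a number a model can learn from. It is a toy, not how any production system is built: the three answer styles, the rater counts, and the function name fit_rewards are all made up for illustration. It uses a standard pairwise-preference model (Bradley-Terry): each style gets a score, and the scores are nudged until they predict the raters' choices.

```python
import math

def fit_rewards(comparisons, styles, steps=2000, lr=0.5):
    """Fit one reward score per answer style from (preferred, rejected) pairs.

    Toy Bradley-Terry fit: P(a preferred over b) = sigmoid(reward[a] - reward[b]).
    """
    reward = {s: 0.0 for s in styles}
    for _ in range(steps):
        grad = {s: 0.0 for s in styles}
        for winner, loser in comparisons:
            # Probability the current scores give to the rater's actual choice.
            p = 1.0 / (1.0 + math.exp(-(reward[winner] - reward[loser])))
            # Gradient of log p: raise the winner's score, lower the loser's.
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for s in styles:
            reward[s] += lr * grad[s] / len(comparisons)
    return reward

# Hypothetical rater data (invented for this sketch): (preferred, rejected).
comparisons = (
    [("polite", "curt")] * 8 + [("curt", "polite")] * 2
    + [("polite", "rambling")] * 9 + [("rambling", "polite")] * 1
    + [("curt", "rambling")] * 6 + [("rambling", "curt")] * 4
)

rewards = fit_rewards(comparisons, ["polite", "curt", "rambling"])
ranking = sorted(rewards, key=rewards.get, reverse=True)
print(ranking)  # ['polite', 'curt', 'rambling']
```

Nobody typed in "be polite." The ordering falls out of the tally of choices. In a real pipeline the scores would come from a learned reward model over whole answers, and a further training step would push the assistant toward high-scoring answers; this sketch covers only the preferences-to-reward part.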

One caution to carry forward. RLHF does not make a model correct. It makes a model that produces answers humans tend to prefer, and what humans prefer is usually a confident, fluent, helpful-sounding reply. Hold that thought, because the next slide is about what that preference teaches the model.