Making the model helpful
A pretrained language model predicts the next token. That's it. Its goal isn't to be helpful, honest, or safe — those concepts don't exist in a next-token prediction objective.
A raw model such as GPT-3 would complete your prompt in whatever direction its training data pointed. If you asked a leading question, it might simply agree with the premise. If the training data contained harmful content, it might reproduce it. It would imitate whatever pattern was most probable.
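To make "imitate whatever pattern was most probable" concrete, here is a deliberately tiny sketch. It is a toy bigram counter, not how GPT-3 works internally; the corpus and the function names `train_bigram` and `complete` are made up for illustration. The point is only that the completion follows the majority pattern in the data, whether or not that pattern is true.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def complete(counts, prompt, max_new_tokens=5):
    """Greedily append the most frequent next token.

    Nothing here knows about truth or helpfulness; it only follows counts.
    """
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # no continuation was ever seen for this token
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical toy corpus in which the false statement is the majority pattern.
corpus = [
    "the earth is flat",
    "the earth is flat",
    "the earth is round",
]
counts = train_bigram(corpus)
completion = complete(counts, "the earth")
print(completion)
```

With this corpus the completion ends in "flat" simply because that continuation appears more often than "round".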
To turn a capable but undirected model into a useful assistant, researchers developed a technique called RLHF: Reinforcement Learning from Human Feedback.
The process has three steps. First, the model is fine-tuned on examples of good conversations (supervised fine-tuning). Second, human raters compare model outputs and indicate which responses are better. Third, the model is trained with reinforcement learning using those preferences as a reward signal, typically through a reward model fitted to the ratings: behaviors humans rated highly are reinforced, and the ones they rated poorly are discouraged.
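The second and third steps meet at the point where human comparisons become a number the model can be trained against. One common way to do that is a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The sketch below assumes that setup; the text above does not specify a loss, and the two features, the linear scorer, and the training pairs are all hypothetical. It covers only the reward-fitting part, not the reinforcement learning step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def reward(weights, features):
    """Hypothetical linear reward model: a weighted sum of response features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_weights(pairs, dim, lr=0.5, epochs=200):
    """Fit weights so that chosen responses score higher than rejected ones."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (chosen - rejected); step against it.
            scale = 1.0 - sigmoid(margin)
            w = [wi + lr * scale * (c - r)
                 for wi, c, r in zip(w, chosen, rejected)]
    return w

# Made-up features per response: [is_polite, answers_the_question].
# Each pair is (response the rater preferred, response the rater rejected).
pairs = [
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 0.0]),
]
weights = train_reward_weights(pairs, dim=2)
print(weights)
```

After fitting, `reward(weights, ...)` ranks responses the way the toy raters did. In a full pipeline a learned score of this kind would serve as the reward signal for the reinforcement learning step.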
This is how ChatGPT was built. The underlying model is a GPT-series model, fine-tuned with RLHF to be helpful and to avoid obvious harms. It's why the model declines certain requests, apologizes for mistakes, and tends toward a cooperative tone.
RLHF doesn't make a model correct. It makes a model oriented toward being useful — which is a different thing.
<!-- TODO: a simple three-step diagram of the RLHF process would make this concrete: supervised fine-tuning → human ratings → RL fine-tuning -->