Two ways to use a transformer

The original transformer paper (2017) described an architecture for translation: an encoder that reads the source sentence and a decoder that generates the translation. But the architecture was general, and researchers quickly found two different ways to specialize it.

BERT (2018) used the encoder half. It was trained to fill in masked words in a sentence — given "The cat sat on the [MASK]," predict the missing word. Because it can see context on both sides of a word at once, BERT became very good at tasks that require understanding: answering questions, classifying text, recognizing named entities.
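To make the fill-in-the-blank objective concrete, here is a minimal sketch of how one training example could be built. It is an illustration, not BERT's actual preprocessing: the function name `make_masked_example` is hypothetical, the sentence is split into whole words rather than subword tokens, the completion "mat" is an assumed answer for the example sentence, and the masked position is passed in by hand (real training picks positions at random).

```python
MASK = "[MASK]"

def make_masked_example(tokens, mask_positions):
    """Return (inputs, targets) for a fill-in-the-blank training example.

    inputs:  a copy of tokens with the chosen positions replaced by MASK.
    targets: maps each masked position to the original word the model
             is asked to recover.
    """
    inputs = list(tokens)
    targets = {}
    for pos in mask_positions:
        targets[pos] = tokens[pos]
        inputs[pos] = MASK
    return inputs, targets

# "mat" is an assumed completion, chosen only for illustration.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
inputs, targets = make_masked_example(tokens, [5])
print(" ".join(inputs))  # The cat sat on the [MASK]
print(targets)           # {5: 'mat'}
```

The model receives `inputs` in full, including any words to the right of a mask, and is scored only on how well it recovers the entries in `targets`.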

GPT (2018) used the decoder half. It was trained to predict the next word in a sequence, reading left to right. This made it a generative model: given a prompt, it would continue writing. It couldn't see future context, but it could keep extending a text one word at a time, feeding each prediction back in as input, up to the limit of its context window.
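The next-word objective and the generation loop it enables can be sketched in a few lines. Everything here is a stand-in for illustration: `next_word_examples`, `generate`, and `toy_predict_next` are hypothetical names, and the "model" is just a lookup table keyed on the last word, where a real model would be a trained network that looks at the whole left-hand context.

```python
def next_word_examples(tokens):
    """Return (context, target) pairs: each prefix predicts the word after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def generate(predict_next, prompt, max_new_words):
    """Extend the prompt one word at a time, feeding each prediction back in."""
    words = list(prompt)
    for _ in range(max_new_words):
        words.append(predict_next(words))
    return words

# Hypothetical stand-in for a trained model: a lookup on the last word only.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def toy_predict_next(context):
    return TOY_TABLE[context[-1]]

for context, target in next_word_examples(["the", "cat", "sat"]):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat

print(generate(toy_predict_next, ["the"], 4))
# ['the', 'cat', 'sat', 'on', 'the']
```

Note that a single sentence yields one training example per word, and that generation is nothing more than the training task run in a loop.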

Both were transformers. Both were trained on large amounts of text. The difference was in the training objective and in how much of the surrounding context each model could see, and that shaped what each was good at.
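The contrast between seeing context on both sides and seeing only what came before can be written down as a visibility rule. The sketch below is a simplification under stated assumptions: `visible_positions` is a hypothetical helper, positions are whole words, and the rule is shown as plain lists rather than the attention masks a real implementation would use.

```python
def visible_positions(length, causal):
    """For each position, list the positions it is allowed to look at."""
    rows = []
    for i in range(length):
        if causal:
            rows.append(list(range(i + 1)))   # itself and everything to its left
        else:
            rows.append(list(range(length)))  # the whole sentence, both directions
    return rows

words = ["The", "cat", "sat"]
for name, causal in [("BERT-style", False), ("GPT-style", True)]:
    print(name)
    for i, seen in enumerate(visible_positions(len(words), causal)):
        print(f"  {words[i]!r} sees {[words[j] for j in seen]}")
```

In the BERT-style case every word sees the full sentence; in the GPT-style case each word sees only itself and the words to its left, which is what lets the model be run forward to generate text.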

Most modern language models — GPT-4, Claude, Gemini, LLaMA — are descendants of the decoder-only GPT approach, scaled up enormously.

<!-- TODO: a simple split diagram of encoder vs. decoder architectures, with examples of what each is used for -->