Two ways to use it

The original paper built the transformer for translation, with two halves: one to read a sentence and one to write its translation. But people quickly noticed each half was useful on its own, and the two halves grew into the two great families of language model. They split on one question: can a word see what comes after it, or only what comes before?

The reader that sees both sides. The first family is best known through a model called BERT, built from the reading half, and you have already played its training game. In the last module you read "The waiter brought us a bottle of wibble," and guessed "wibble" was a drink purely from the words around it, on both sides. That is exactly how BERT learns: hide a word, show it the whole sentence with a blank, make it guess the missing word. Because it sees context on both sides at once, it builds a deep understanding of a sentence as a whole, which makes it strong at reading tasks: search, answering questions, sorting text. It is not built for writing new text, because it always assumed it could peek at both sides, and when you are generating, there is no "after" yet.
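To make the hide-a-word game concrete, here is a minimal sketch of how one training example could be built. The function name make_masked_example and the "[MASK]" placeholder are illustrative assumptions, and the sketch hides one whole word per sentence for simplicity. It is not a description of BERT's actual data pipeline.

```python
import random

def make_masked_example(sentence, rng, mask_token="[MASK]"):
    """Hide one word and return (sentence with a blank, hidden word, position).

    Hypothetical helper for illustration only.
    """
    words = sentence.split()
    position = rng.randrange(len(words))   # pick which word to hide
    target = words[position]               # the answer the model must guess
    # The model still sees everything on BOTH sides of the blank.
    visible = words[:position] + [mask_token] + words[position + 1:]
    return " ".join(visible), target, position

rng = random.Random(0)  # fixed seed so the example is repeatable
masked, target, position = make_masked_example(
    "The waiter brought us a bottle of wibble", rng
)
print(masked)   # the sentence with one word replaced by [MASK]
print(target)   # the word the model is asked to recover
```

The point to notice is that the blank can land anywhere, and the words after it stay visible. That is the "peek at both sides" rule in action.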

The writer that sees only behind. The second family is the one you have actually chatted with, known through GPT, built from the writing half. Its training game is even simpler, and it is the one from the learning module: predict the next word. Show it text up to some point, let it guess what comes next, correct it, repeat across an ocean of text. It only ever sees what came before the word it is guessing, never after, because that mirrors the real task of writing: you produce the next word with only the past to go on. That one restriction is what makes it a generator. Give it a start and it predicts the next word, then the next, then the next, writing on indefinitely.
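The predict-then-append loop can be sketched in a few lines. This is a deliberately tiny stand-in, not a transformer: it counts which word follows which and looks only at the single previous word, whereas a real writer model looks at everything before the current position. The names train_next_word_counts and generate are made up for this illustration.

```python
from collections import Counter, defaultdict

def train_next_word_counts(text):
    """Count, for each word, which words were seen right after it."""
    words = text.split()
    counts = defaultdict(Counter)
    for previous, following in zip(words, words[1:]):
        counts[previous][following] += 1
    return counts

def generate(counts, start, max_words=5):
    """Write by repeatedly predicting the next word from the past only."""
    output = [start]
    for _ in range(max_words):
        candidates = counts.get(output[-1])
        if not candidates:
            break  # nothing was ever seen after this word, so stop
        next_word = candidates.most_common(1)[0][0]  # most frequent follower
        output.append(next_word)  # the guess becomes part of the past
    return " ".join(output)

counts = train_next_word_counts("the waiter brought us a bottle")
print(generate(counts, "the", max_words=3))  # the waiter brought us
```

Each guess is appended to the text and becomes part of the past for the next guess. That loop is the same shape a GPT-style model uses to write, only with a far richer predictor inside it.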

Both are transformers. Both are towers of attention trained on huge piles of text. The only real difference is that single rule, can a word peek ahead or not, and it decides whether you get a brilliant reader or a fluent writer. The chatbots that made AI a household word are the writer, the GPT line, scaled up beyond anything the 2017 authors imagined.
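That single rule can be written down as a grid of yes/no answers: for each word, which other words is it allowed to look at? The sketch below is an illustration under simple assumptions (one row per word, positions counted from zero). The name visibility_mask is hypothetical and does not come from any library.

```python
def visibility_mask(length, can_peek_ahead):
    """Row i lists which positions the word at position i may look at."""
    return [
        [can_peek_ahead or seen <= current for seen in range(length)]
        for current in range(length)
    ]

reader = visibility_mask(3, can_peek_ahead=True)   # sees both sides
writer = visibility_mask(3, can_peek_ahead=False)  # sees only behind

for row in writer:
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

The reader's grid is all True. The writer's grid is a staircase: each word sees itself and everything before it, and nothing after.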
