Attention is all you need

In 2017, eight researchers at Google published a paper with a title that was half claim, half dare: "Attention Is All You Need." The message was blunt. Drop the old single-file reading. Drop every other trick. Take attention, the part from the last chapter, stack it up, and you have all you need.

The machine they described is the transformer, and the way to picture it is a tower.

At the bottom, the words come in, each one a position in the space of meaning from the last module. Then comes the first floor: a full round of attention, every word reaching out, weighing every other word, blending in what matters. Out the top of that floor comes a new version of each word, now coloured by its context: "it" carrying a little "trophy," "bank" leaning toward "river" or "money" depending on its neighbours.

Then that improved set of words is fed straight into the next floor, which runs attention all over again, on words that already understand their context a little better than before. And then a floor above that, and another, often dozens high.
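
For readers who like to see the moving parts, the tower can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not the paper's machine: the names `attention_floor` and `tower` are made up for illustration, the word positions are random numbers, and the learned weights and the rest of a real transformer's machinery are left out. It shows only the stacking, where whatever comes out of one floor goes into the next.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_floor(words):
    """One floor (hypothetical helper): each word weighs every word, then blends.

    Assumption: no learned weights. A real transformer first transforms each
    word with trained matrices; here the raw positions are compared directly.
    """
    dim = len(words[0])
    new_words = []
    for query in words:
        # How strongly does this word match each word in the sentence?
        scores = [
            sum(q * k for q, k in zip(query, other)) / math.sqrt(dim)
            for other in words
        ]
        weights = softmax(scores)
        # The new version of the word is a weighted blend of all the words.
        blended = [
            sum(w * other[i] for w, other in zip(weights, words))
            for i in range(dim)
        ]
        new_words.append(blended)
    return new_words


def tower(words, floors):
    """Stack the floors: the output of one becomes the input of the next."""
    for _ in range(floors):
        words = attention_floor(words)
    return words


random.seed(0)
# Five made-up "words", each a position in a 4-dimensional space of meaning.
sentence = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
top = tower(sentence, floors=3)
print(len(top), len(top[0]))  # 5 4
```

The same five words come out of the top, each still a position in the same space, only moved by what it blended in on the way up. Without trained weights, each floor here merely pulls the words toward one another; the sharper behaviour described in this chapter comes from training.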

This is the same stacking idea the whole course keeps circling back to. Remember AlexNet from the going-deeper module: early layers found edges, the next combined them into shapes, the next into whole objects, simple pieces building into complex ones, floor by floor. The transformer is that exact shape, aimed at language. The bottom floors catch simple things, like which word refers to which. Higher floors, working on what the lower ones figured out, catch grammar, then tone, then the drift of a whole argument. Meaning is assembled in stages, each floor standing on the one beneath it.

And because each word runs several attention readers at once, the heads from the last chapter, every floor tracks many kinds of relationship in parallel: reference, grammar, topic, all at the same time. Nobody assigned those jobs. As always in this story, the structure was not designed in. It was discovered, by training.
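
The several-readers-at-once idea can be sketched the same way. Again this is an illustration under stated assumptions, not the real recipe: `head_attention` and `multi_head_floor` are hypothetical names, and each head is simply handed its own slice of every word's numbers, where a real transformer would give each head its own trained view of the whole word.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(words):
    """One reader (hypothetical helper): weigh every word, then blend."""
    dim = len(words[0])
    blended_words = []
    for query in words:
        scores = [
            sum(q * k for q, k in zip(query, other)) / math.sqrt(dim)
            for other in words
        ]
        weights = softmax(scores)
        blended_words.append(
            [sum(w * other[i] for w, other in zip(weights, words))
             for i in range(dim)]
        )
    return blended_words


def multi_head_floor(words, num_heads):
    """Run several readers side by side and join their findings.

    Assumption: heads are formed by slicing each word's numbers into equal
    pieces. Trained transformers use learned projections instead.
    """
    dim = len(words[0])
    if dim % num_heads != 0:
        raise ValueError("word size must divide evenly among the heads")
    size = dim // num_heads
    outputs = [[] for _ in words]
    for h in range(num_heads):
        # Each head sees only its own slice of every word.
        piece = [w[h * size:(h + 1) * size] for w in words]
        for out, read in zip(outputs, head_attention(piece)):
            out.extend(read)  # glue this head's reading onto the word
    return outputs


# Three made-up words, four numbers each, read by two heads at once.
words = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
result = multi_head_floor(words, num_heads=2)
print(len(result), len(result[0]))  # 3 4
```

Each head computes its own weights from its own slice, so two heads can attend to different neighbours for the same word. What each head ends up tracking is not written anywhere in this code, which matches the point above: in a trained model those jobs emerge from training.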
