Module VII

Paying Attention

Chapter III

The Transformer

Now and then a new design does not just beat what came before. It makes everything before it look like a long detour.

The last chapter built one part: attention, a way for every word to reach out, weigh every other word, and blend in the ones that matter. This chapter is the machine built entirely out of that part. Its claim, made in a paper with a title that doubled as a dare, was startling in its plainness. You do not need the old single-file reading. You do not need anything clever bolted on. Stack attention, and that is enough.

That machine is the transformer. Give it nothing but a mountain of text and one absurdly simple thing to practise, and it soaks up grammar, facts, and the texture of language on its own. Make it bigger, and it starts doing things nobody trained it to do. Almost every AI system you have heard of is, underneath, one of these.

This chapter is how the part becomes the machine, plus the two small pieces attention needed before it could stand on its own.