What to carry out of this chapter

Want-ads, chest-cards, folders, paint blends. If the parts have blurred, let them. Here is the one thing to carry out.

Attention is a way for each word to pull in the other words that matter to it, and ignore the ones that don't. It works by blending: every word becomes a weighted mix of the words around it, heavy on the relevant ones, faint on the rest. The weights come from matching, each word holds up a want-ad for what it is looking for, every other word wears a card advertising what it is, and good matches get a big pour into the blend.

That single mechanism solved the last chapter's problem. Any word can reach any other word in one step, so distance stops mattering, the chain is gone. And because every match is independent, the machine can compute them all at once, which fits a GPU's thousands of parallel workers perfectly. Run several rounds of this side by side, each chasing a different kind of relevance, and you have the raw material of the machine itself.

So we have the part. The next chapter assembles it. Stack these attention layers up, add a couple of missing pieces, give it nothing but text to learn from, and you get the transformer: the architecture almost everything since has been built on.

Want-ads, chest-cards, folders, paint blends. If the parts have blurred, let them. Here is the one thing to carry out.