A stamp for position

There is a hole in the machine we just built, and it is worth seeing clearly, because the fix is clever in how ordinary it is.

Attention, by its nature, does not care about order. It compares every word to every other word as a great unordered heap, weighing how well each matches, with no built-in sense of which came first. That was a feature for solving the distance problem: it let any word reach any other in one hop. But it throws away something language cannot live without.

Order. "The dog bit the man" and "The man bit the dog" use the exact same words. Pour them into an order-blind heap and they look identical, yet they mean opposite things. A reader who ignores order cannot tell who did the biting.
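To see the blindness in miniature, here is a rough sketch rather than a real transformer. The two-number "meanings" in EMBED are invented for illustration, and attend is a stripped-down, assumed version of the matching-and-weighing step. It asks what "dog" gathers from each sentence when nothing records position.

```python
import math

# Hypothetical two-number "meanings", made up for this sketch.
EMBED = {
    "the": [0.1, 0.3],
    "dog": [0.9, 0.2],
    "bit": [0.4, 0.8],
    "man": [0.7, 0.6],
}

def attend(query, vectors):
    """Order-blind attention: weigh every vector by how well it matches the query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = [math.exp(s) for s in scores]  # softmax numerators
    total = sum(weights)
    return [
        sum(w * vec[d] for w, vec in zip(weights, vectors)) / total
        for d in range(len(query))
    ]

def view_from(word, sentence):
    """What `word` gathers from the sentence when position is never recorded."""
    vectors = [EMBED[w] for w in sentence.lower().split()]
    return attend(EMBED[word], vectors)

a = view_from("dog", "The dog bit the man")
b = view_from("dog", "The man bit the dog")

# The two views agree up to floating-point rounding: the heap looks the same.
same = all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12) for x, y in zip(a, b))
print(same)
```

From where "dog" stands, the two sentences are indistinguishable. Nothing in the arithmetic knows who came first.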

The old single-file approach got order for free, because it literally read left to right; the sequence was baked into the reading. The transformer gave that up for speed. So it has to put order back some other way.

The fix is a position stamp. Before the words enter the tower, each one gets stamped with a little extra pattern of numbers that says where it sits: a stamp for "first," a different stamp for "second," another for "fortieth." The stamp is simply added on top of the word's meaning, so each word now carries two things at once: what it is and where it is.
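As a minimal sketch of the stamping step: the text above does not say what the pattern of numbers looks like, so this example assumes one common choice, a mix of sine and cosine waves of different speeds. The names position_stamp and stamp_word, and the four-number meaning for "dog," are hypothetical.

```python
import math

def position_stamp(position, dims):
    """A pattern of `dims` numbers unique to `position` (sinusoidal; an assumed choice)."""
    stamp = []
    for i in range(dims):
        # Each pair of slots shares one wave speed; later pairs oscillate more slowly.
        rate = 1.0 / (10000 ** ((i - i % 2) / dims))
        angle = position * rate
        stamp.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return stamp

def stamp_word(meaning, position):
    """Add the stamp on top of the meaning: what the word is, plus where it is."""
    return [m + s for m, s in zip(meaning, position_stamp(position, len(meaning)))]

dog = [0.9, 0.2, 0.5, 0.1]      # hypothetical meaning of "dog"
dog_second = stamp_word(dog, 2)  # "The dog bit the man"
dog_fifth = stamp_word(dog, 5)   # "The man bit the dog"

print(dog_second != dog_fifth)
```

The same word now arrives as two different lists of numbers depending on where it sat, which is all the order-blind comparison needs in order to tell the two sentences apart.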

Think of name tags at a long table. Everyone wears a tag that says both their name and their seat number. Now you can shuffle the guests into any heap you like and still reconstruct who sat where. Attention can scramble the words into its unordered comparison, exactly as it wants to, and the position stamps mean nothing is lost. "Dog-in-seat-2" and "dog-in-seat-5" are no longer the same word.

So the transformer is not order-blind after all. It just carries order as a stamp riding along with each word, rather than getting it from the act of reading in sequence. It is a small piece, easy to skim past, yet the machine would be useless without it.
