All at once, not one at a time

Attention had a second gift, and it had nothing to do with meaning. It fit the hardware perfectly, and that turned out to matter just as much.

Look back at the RNN. It was stuck reading in order: word one, then word two, then word three, each step waiting on the one before, because the running summary had to be updated in sequence. You could not start word one hundred until you had trudged through the ninety-nine before it. The work was a single-file line, and a line can only move as fast as one worker.

Attention has no such waiting. Scoring "it" against "trophy" needs nothing from "suitcase" or "because." Every word-to-word match is independent of every other, so the machine can compute all of them at the same instant, the whole web of connections in one go, instead of crawling along a line.

And you have met exactly the machine that loves this kind of work. Remember the orchestra from the going-deeper module: the , thousands of small workers all computing side by side, brilliant whenever a task breaks into many independent pieces that don't depend on each other. The RNN handed that orchestra a single-file line and wasted it. Attention hands it thousands of independent little matches, exactly what it was built to chew through all at once.

So attention was not just a better way to read. It was a better way to read that happened to run beautifully fast on the hardware the field already had. The same lucky fit we saw with images, an idea and a machine meeting at the right moment, happened again here. That combination, direct connections plus blazing parallel speed, is what made attention worth building an entire architecture around. The transformer was the result.

All at once, not one at a time

Attention had a second gift, and it had nothing to do with meaning. It fit the hardware perfectly, and that turned out to matter just as much.