All at once, not one at a time
There's a practical advantage to attention that was just as important as its representational power: it can be computed in parallel.
An RNN processes words one at a time — each step depends on the previous step. You can't compute step 100 until you've done steps 1 through 99. That sequential dependency made training slow and made it hard to use modern hardware efficiently.
Attention has no such dependency. Computing the relevance score between word 1 and word 100 requires only those two words, not any result from words 2 through 99. The scores for all the pairs can therefore be computed simultaneously, in practice as one large matrix multiplication.
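To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than a real model: the four toy word vectors, the helper names (rnn_states, score, all_scores), the weights 0.5 and 0.9, and the use of a plain dot product as the relevance score (real attention first passes each word through learned query and key projections). The point is only the shape of the computation: the recurrent loop has to run in order, while the score table can be filled in any order.

```python
import math
import random

# Toy word vectors: made-up numbers, one vector per word in a 4-word sentence.
words = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

def rnn_states(vectors, w_in=0.5, w_rec=0.9):
    """Toy recurrence: each state is built from the previous one, so the loop must run in order."""
    h = 0.0
    states = []
    for v in vectors:
        h = math.tanh(w_in * sum(v) + w_rec * h)  # needs the h from the step before
        states.append(h)
    return states

def score(a, b):
    """Relevance score for one pair of words: a plain dot product (an assumption)."""
    return sum(x * y for x, y in zip(a, b))

def all_scores(vectors, pairs):
    """Fill the score table, visiting the pairs in whatever order is given."""
    n = len(vectors)
    table = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        table[i][j] = score(vectors[i], vectors[j])  # uses only words i and j
    return table

n = len(words)
pairs = [(i, j) for i in range(n) for j in range(n)]
in_order = all_scores(words, pairs)

# Visit the same pairs in a scrambled order.
shuffled = pairs[:]
random.Random(0).shuffle(shuffled)
any_order = all_scores(words, shuffled)

print(in_order == any_order)  # True: the visiting order makes no difference
```

Because no entry in the table waits on any other, the work can be handed to many processors at once. The recurrent loop offers no such freedom: shuffling its steps would change every state after the first.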
GPUs, as we saw in Module 5, are designed for exactly this kind of parallel computation: many independent multiply-and-add operations running at the same time. Attention was a natural fit for that hardware, so training could run far faster than it did with RNNs.
This parallelism — combined with the direct long-range connections — is what made attention the foundation for the architecture that came next.