Recurrent networks processed language one word at a time, so anything learned from an early word had to survive every step that followed. Early context faded. Long-range connections weakened. The bottleneck was structural, not something you could fix by adding more layers.

What was needed was an architecture where every word could look directly at every other word, all at once, with no chain of intermediate steps in between. That idea is called attention.
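
To make "every word looks at every other word" concrete, here is a minimal sketch in plain Python. It rests on several assumptions: each word is already a small vector, the same vectors act as queries, keys, and values (real models learn separate projections for each), and scores are scaled dot products passed through a softmax. The names `softmax`, `self_attention`, and `words` are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Let every position attend to every position, itself included.

    Simplification (assumption): the same vectors serve as queries,
    keys, and values; no learned projections are applied.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sequence directly.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all the word vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors standing in for three words.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
```

Each row of `weights` says how much one word draws from every other word, and the distance between two words in the sequence plays no part in the calculation. The loop is written out for readability; nothing in one row depends on another, so all rows could be computed at once.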