Attention is a mechanism for building context-aware representations. Each word asks: which other words are relevant to me right now? The answers shape how that word gets represented going forward.

It's parallel, it's flexible, and it removes the distance problem entirely. Any word can directly attend to any other word.

The architecture built entirely around this idea — attention stacked in layers, no recurrence at all — is the transformer. That's next.