Attention is all you need

In 2017, eight researchers at Google published a paper titled "Attention Is All You Need." The title was deliberate and provocative. You don't need recurrence. You don't need convolution. Stack attention layers, and that's enough.

The architecture they described was the transformer.

The key move: instead of processing tokens sequentially, the transformer processes the entire sequence at once. Every token attends to every other token in parallel. The output of one attention layer feeds into the next, each building a richer, more abstract representation of the sequence than the one before.

Multiple "heads" of attention run in parallel — each one free to learn different kinds of relationships. One head might focus on grammatical structure. Another might learn semantic similarity. Another might track long-distance coreference. None of these roles are assigned by humans. They emerge from training.

<!-- TODO: a layered architecture diagram showing tokens → attention layer → richer representations → next attention layer would orient the reader well -->