Module VII
Paying Attention
Chapter III
The Transformer
Sometimes a new architecture doesn't just improve on what came before. It makes everything before it look like a detour.
The transformer arrived with a claim that seemed almost too simple: you don't need recurrence, you don't need sequence. Attention, applied in parallel across all positions at once, stacked into layers, is enough. Let every part of the text relate to every other part simultaneously, and let the network figure out which relationships matter.
Trained on enough data, this architecture could do things no one had explicitly trained it for. Scaled further, it became the foundation for almost everything that followed.
What made it powerful was not just the mechanism. It was that the mechanism happened to fit the hardware, the data, and the ambitions of the field all at the same moment. Some ideas arrive at exactly the right time.