What happens when you just make it bigger
The transformer architecture existed by 2017. The obvious next question was: how big can we make it?
Bigger, here, means more parameters. More layers. Larger attention matrices. More dimensions in the embedding space. And more data to train on.
In 2020, researchers at OpenAI published a paper documenting a striking observation: as you scale a language model with more parameters, more compute, and more data, performance doesn't just improve. It improves along predictable curves. The model's prediction error fell along smooth power-law trends that held across several orders of magnitude of scale, pointing toward consistent laws governing how capability scales with resources.
The relationship between scale and capability was becoming legible. Researchers could predict, roughly, how much better a model would get if you gave it ten times as much compute. That predictability was enormously useful for deciding where to invest.
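To make that kind of prediction concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that loss follows a simple power law in compute: loss = (c_ref / compute) ** alpha. The function names, the reference constant `C_REF`, and the exponent `ALPHA = 0.05` are hypothetical choices for this example, not values taken from the text; real exponents have to be fitted to measured training runs.

```python
def predicted_loss(compute, c_ref, alpha):
    """Assumed power-law form: loss = (c_ref / compute) ** alpha."""
    return (c_ref / compute) ** alpha


def loss_ratio(compute_multiplier, alpha):
    """Factor by which loss changes when compute is multiplied.

    Under the power law above, the reference constant cancels out,
    so the ratio depends only on the multiplier and the exponent.
    """
    return compute_multiplier ** (-alpha)


# Illustrative values only (assumptions, not fitted to any real model).
ALPHA = 0.05   # hypothetical scaling exponent
C_REF = 1.0    # hypothetical reference compute, arbitrary units

before = predicted_loss(1.0, C_REF, ALPHA)
after = predicted_loss(10.0, C_REF, ALPHA)

print(f"loss at 1x compute:  {before:.3f}")
print(f"loss at 10x compute: {after:.3f}")
print(f"10x compute multiplies loss by about {loss_ratio(10, ALPHA):.3f}")
```

Under these assumed numbers, ten times the compute buys roughly an 11% reduction in loss. The step is modest, but it is dependable, and that dependability is what makes a curve like this useful for planning.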
But scale didn't just make models better at things they were already doing. It introduced something stranger.
<!-- TODO: a simple graph showing smooth capability curves as scale increases — drives home the predictability before the surprise of emergence -->