Train once, adapt many times
The transformer era introduced a workflow that has defined AI development ever since: pretrain, then finetune.
Pretraining is expensive and general. A model is trained on a massive corpus — books, websites, code, articles — with a self-supervised objective like next-word prediction or masked word prediction. This requires enormous compute and takes weeks on hundreds or thousands of GPUs. The result is a model that has absorbed a broad understanding of language and the world from text.
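To make "self-supervised" concrete, here is a minimal sketch of how next-word prediction turns raw text into training examples without any human labeling. The function name next_word_examples and the context_size parameter are illustrative assumptions, and real systems operate on subword tokens rather than whitespace-separated words.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs for next-word prediction.

    Hypothetical helper for illustration: the "label" for each example is
    simply the word that follows the context in the original text.
    """
    words = text.split()  # simplification: real models use subword tokens
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]  # up to context_size previous words
        target = words[i]                            # the word the model must predict
        examples.append((context, target))
    return examples


examples = next_word_examples("the cat sat on the mat")
for context, target in examples:
    print(context, "->", target)
```

Every target comes from the text itself, which is why this kind of objective can scale to a very large corpus.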
Finetuning is cheap and specific. You take a pretrained model and train it further on a small, targeted dataset for a particular task — medical question answering, legal document summarization, code completion. The model already knows language; it just needs to learn the specific domain.
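The idea of starting from pretrained weights instead of from scratch can be sketched with a deliberately tiny stand-in: a one-parameter model y = w * x trained by gradient descent. This is a toy analogy under stated assumptions, not a language model; the datasets, learning rate, and step counts below are all made up for illustration.

```python
def loss(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, steps, lr=0.1):
    """Plain gradient descent on the mean squared error, starting from w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical "general" data: many examples, true slope 2.0.
general_data = [(i / 100, 2.0 * i / 100) for i in range(1, 101)]
# Hypothetical "domain" data: only five examples, slightly different slope 2.5.
domain_data = [(x, 2.5 * x) for x in (0.2, 0.4, 0.6, 0.8, 1.0)]

# "Pretraining": a long run on the large dataset.
w_pretrained = train(0.0, general_data, steps=200)

# "Finetuning": a short run on the small dataset, starting from the pretrained weight.
w_finetuned = train(w_pretrained, domain_data, steps=10)

# Baseline: the same short run on the small dataset, but starting from scratch.
w_scratch = train(0.0, domain_data, steps=10)

print("finetuned loss:", loss(w_finetuned, domain_data))
print("scratch loss:  ", loss(w_scratch, domain_data))
```

With the same small budget of ten steps, the run that starts from the pretrained weight ends closer to the domain target than the run that starts from zero, because most of the work was already done.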
This division changed everything. You no longer needed to train a model from scratch for every application. One expensive pretraining run could produce a foundation that dozens of teams could specialize cheaply.
It also raised a question: what exactly did the model learn during pretraining, and how capable is it before any finetuning at all? The answer surprised many researchers.