Train once, adapt many times

The transformer era introduced a workflow that has defined AI development ever since: pretrain, then finetune.

Pretraining is expensive and general. A model is trained on a massive corpus — books, websites, code, articles — with a self-supervised objective like next-word prediction or masked word prediction. This requires enormous compute and takes weeks on hundreds or thousands of GPUs. The result is a model that has absorbed a broad understanding of language and the world from text.
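
To make "self-supervised" concrete, here is a minimal sketch of how training examples can be derived from raw text with no human labels. The function names and the `[MASK]` placeholder are illustrative assumptions, and real systems operate on subword tokens rather than whitespace-split words.

```python
def next_word_pairs(tokens):
    """Each prefix of the text becomes an input; the word that follows is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, position, mask_token="[MASK]"):
    """Hide one word; the label is the hidden word, recovered from both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


# Hypothetical one-sentence "corpus", split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()

pairs = next_word_pairs(tokens)             # 5 (context, next word) examples
masked, target = masked_example(tokens, 2)  # hides the third word, "sat"
```

In both cases the label comes from the text itself, which is why a corpus of books, websites, and code can be used for training without annotation.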

Finetuning is cheap and specific. You take a pretrained model and train it further on a small, targeted dataset for a particular task — medical question answering, legal document summarization, code completion. The model already knows language; it just needs to learn the specific domain.
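
The idea of "training further" can be illustrated with a toy analogy. The sketch below is an assumption-laden stand-in, not how real finetuning is implemented: it uses a word-pair counting model instead of a neural network, and the two tiny corpora are invented for illustration. What it does share with the real workflow is that the second phase starts from the state left by the first rather than from scratch.

```python
from collections import Counter, defaultdict


def train(model, text):
    """Update word-pair counts in place; the same routine serves both phases."""
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def predict(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None


# Invented corpora: a "general" one and a small "medical" one.
general = "the patient reader finished the long book . the reader enjoyed the book ."
medical = "the patient presented with fever . the patient presented with cough ."

model = defaultdict(Counter)
train(model, general)                # stands in for pretraining
before = predict(model, "patient")   # "reader"

train(model, medical)                # stands in for finetuning: same model, new data
after = predict(model, "patient")    # "presented"
still_known = predict(model, "long") # "book", learned in the first phase
```

After the second phase the model prefers the domain-specific continuation while keeping what it learned from the general text, which is the effect finetuning aims for at far larger scale.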

This division changed the economics of building AI systems. You no longer needed to train a model from scratch for every application. One expensive pretraining run could produce a foundation that dozens of teams could specialize cheaply.

It also raised a question: what exactly did the model learn during pretraining, and how capable is it before any finetuning at all? The answer surprised many researchers.