Here's the key insight that unlocks the whole problem:

You don't know what a word means by looking it up in a dictionary. You know what it means by seeing how it's used — which words appear near it, in which kinds of sentences, in which contexts.

"You shall know a word by the company it keeps." That line was written by a linguist named John Firth in 1957, decades before anyone trained neural networks on large amounts of text. It turned out to be right in a way he could hardly have predicted.

If you train a model to predict which words appear near other words, across millions of sentences, the positions it learns for each word tend to cluster by meaning. Not because you told it what words mean. Because meaning is in the patterns of use.
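
To make that concrete, here is a minimal sketch in Python. It is a deliberate simplification: instead of training a model to predict neighbors, it just counts which words show up near each word and compares the counts. The six-sentence corpus is invented for illustration, and the helper names (`cooccurrence_vectors`, `cosine`) are made up for this example, not taken from any library.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus, invented for illustration. A real system would use millions of sentences.
SENTENCES = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
    "the truck needs fuel",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word itself
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(SENTENCES)

# "cat" and "dog" keep similar company (drinks, chases), so they score higher
# than "cat" and "car", which share only the word "the".
print("cat vs dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat vs car:", cosine(vectors["cat"], vectors["car"]))
```

Nobody told this code that cats and dogs are both animals. The similarity comes entirely from the words that surround them. A trained model does something richer than counting, but the signal it learns from is the same.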

The next chapter is about tokens: how text gets broken up before any of this can happen. Then we'll come back to embeddings and see what that learned space actually looks like.