A space of meaning

Each token in the vocabulary maps to a vector — a list of numbers, typically around 300–1600 values long, depending on the model.

You can think of these numbers as coordinates in a high-dimensional space. Just like a point on a map has an x and a y coordinate, each word has hundreds of coordinates placing it somewhere in this space.

What makes embeddings useful is where things end up. Words that are used in similar ways — that appear in similar contexts, that tend to surround similar other words — end up near each other in this space. Not because anyone designed it that way. Because the model was trained to predict words from their context, and similar words have similar contexts.

The result is a map of language. Navigate it and you find neighborhoods: animals in one region, verbs of motion in another, countries and capitals arranged in clusters that reflect their relationships.

<!-- TODO: 2D projection visualization of a word embedding space would be perfect here — show a few clusters and let it sink in that this geometry emerged from training -->