What we actually need

We know what we are running from. Now let's pin down what we are running toward, because "numbers in between the two failures" is still vague. What would a genuinely good representation have to do? Three things.

First, similar words should sit close together. This is the property one-hot threw away. We want "dog" and "puppy" to land near each other, and "Paris" and "London" to land near each other, off in the same neighbourhood as other capital cities. Think of it like a map: things that belong together are drawn close, things that don't are drawn far apart. Distance should finally mean something again, the way it did with pixels two slides ago.
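To make "close together" concrete, here is a minimal sketch in Python. The three-number vectors are invented by hand for this illustration, not learned from any data, and cosine similarity is only one common way to measure closeness, assumed here for convenience. The one_hot helper is included purely for contrast.

```python
import math

# Hand-picked toy vectors (hypothetical values, NOT learned embeddings).
toy = {
    "dog":    [0.9, 0.8, 0.1],
    "puppy":  [0.8, 0.9, 0.2],
    "Paris":  [0.1, 0.2, 0.9],
    "London": [0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: near 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def one_hot(index, size):
    """A vector of zeros with a single 1 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Dense toy vectors: related words score high, unrelated words score low.
print(cosine(toy["dog"], toy["puppy"]))
print(cosine(toy["dog"], toy["Paris"]))
print(cosine(toy["Paris"], toy["London"]))

# One-hot vectors: every pair of different words scores exactly 0.
print(cosine(one_hot(0, 4), one_hot(1, 4)))
```

With these toy numbers, "dog" and "puppy" score close to 1, "dog" and "Paris" score much lower, and any two different one-hot vectors score exactly 0, which is the sense in which one-hot threw distance away.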

Second, the relationships should be consistent. This is a stronger demand, so take it slowly. It is not enough that related words sit near each other. The way they relate should show up too.

Take "king" and "queen." The only thing that changes between them is male to female. Now take "man" and "woman": the only thing that changes is, again, male to female.

If our representation is any good, the step that carries you from "king" to "queen" should be the same kind of step that carries you from "man" to "woman": same size, same direction. The change in meaning becomes an actual, repeatable move you could make on the map.
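The "same step" idea can be written as a subtraction: the step from one word to another is the difference between their positions. A minimal sketch, again with hand-made numbers. The two axes (think of them loosely as "royalty" and "gender") are assumptions for illustration only; learned representations do not come with labelled axes.

```python
# Hypothetical 2-D positions, chosen by hand so the arithmetic is easy to follow.
# Axis 0 ~ "royalty", axis 1 ~ "gender" (labels are illustrative assumptions).
toy = {
    "king":  [1.0, 0.25],
    "queen": [1.0, 0.75],
    "man":   [0.25, 0.25],
    "woman": [0.25, 0.75],
}

def step(src, dst):
    """The move that carries you from src to dst: dst minus src, slot by slot."""
    return [d - s for s, d in zip(toy[src], toy[dst])]

royal_step = step("king", "queen")
common_step = step("man", "woman")

print(royal_step)
print(common_step)
print(royal_step == common_step)  # same size, same direction
```

In this toy layout the two steps come out identical. Real representations would only be expected to match approximately.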

Hold onto that idea. It leads somewhere in the final chapter.

Third, it should be compact. Not 10,000 mostly-empty slots per word. A few hundred, maybe 300, every one of them carrying real information. Packed, not padded. Dense, not sparse.
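To see the difference in size, here is a small sketch comparing the two shapes. The dense values are random placeholders standing in for learned numbers, and the word index 42 is arbitrary; both are assumptions for illustration.

```python
import random

VOCAB_SIZE = 10_000  # slots per word in the one-hot scheme
DENSE_SIZE = 300     # "a few hundred, maybe 300"

def one_hot(index, size=VOCAB_SIZE):
    """Sparse: one slot set to 1, every other slot 0."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def placeholder_dense(size=DENSE_SIZE, seed=0):
    """Dense: every slot holds a value (random here, learned in practice)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]

def nonzero_count(vec):
    """How many slots actually carry a value."""
    return sum(1 for x in vec if x != 0.0)

sparse = one_hot(42)           # 42 is an arbitrary, hypothetical word index
dense = placeholder_dense()

print(len(sparse), nonzero_count(sparse))  # total slots, slots in use
print(len(dense), nonzero_count(dense))
```

The one-hot vector spends 10,000 slots to hold a single 1. The dense vector uses 300 slots and puts a value in each of them.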

A representation with those three properties has a name. It is called an embedding.

The word is worth unpacking. To embed a word is to give it a place to sit inside a shared space, surrounded by all the other words, near the ones it resembles and far from the ones it doesn't, the way every city has its own spot on a map, fixed by what it sits near.

It is no longer a name tag. It is a position, and the geometry of where things sit, what is near what, which direction leads from one to another, is what carries the meaning.

Which raises the obvious question: who decides where "dog" should sit? Nobody hand-places 10,000 words on a map; that would be the old hand-crafting trap all over again. The positions have to be discovered, worked out from data, the same way AlexNet worked out its features instead of being handed them. The next slide is the single clue that makes that possible.
