The gap between pixels and words

When a neural network looks at an image, it sees numbers. Each pixel has three channels (red, green, blue), and each channel is a value between 0 and 255. A 224×224 image is just a grid of about 150,000 numbers (224 × 224 × 3 = 150,528). The network can do arithmetic on those directly.
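To make that concrete, here is a minimal sketch in plain Python. The uniform gray image and the names `image`, `total_values`, and `darker` are made up for illustration; a real pipeline would load an actual photo and typically use an array library.

```python
# A hypothetical 224x224 RGB image stored as nested lists:
# rows -> pixels -> [red, green, blue], each value in 0..255.
HEIGHT, WIDTH, CHANNELS = 224, 224, 3

# Illustrative assumption: every pixel is the same mid-gray.
image = [[[128, 128, 128] for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Count the numbers a network would receive for this image.
total_values = HEIGHT * WIDTH * CHANNELS

# Arithmetic on the raw values has a visual meaning:
# halving every value produces a darker version of the same image.
darker = [[[value // 2 for value in pixel] for pixel in row] for row in image]

print(total_values)   # 150528
print(darker[0][0])   # [64, 64, 64]
```

The point of the sketch is that the numbers are already the content: making one larger or smaller changes the picture in a predictable way.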

Words don't work that way.

You could number every word in the English language: "the" = 1, "a" = 2, "dog" = 3, "cat" = 4, and so on. But 3 and 4 being neighbors in that list tells you nothing about whether "dog" and "cat" are related. The number is just a label. It carries no information about what the word means or how it relates to anything else.
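A small sketch of that numbering scheme shows the problem. The five-word vocabulary and the helper `id_distance` are hypothetical, chosen only to illustrate the point.

```python
# A hypothetical toy vocabulary; the order is arbitrary.
vocab = ["the", "a", "dog", "cat", "carburetor"]

# Assign each word an integer label, starting at 1.
word_to_id = {word: index for index, word in enumerate(vocab, start=1)}


def id_distance(first, second):
    """Absolute difference between two words' integer labels."""
    return abs(word_to_id[first] - word_to_id[second])


dog_cat = id_distance("dog", "cat")  # labels 3 and 4
dog_a = id_distance("dog", "a")      # labels 3 and 2

print(word_to_id["dog"], word_to_id["cat"])  # 3 4
print(dog_cat, dog_a)                        # 1 1
```

By this measure "dog" is exactly as close to "a" as it is to "cat", and reordering the list would change both distances without changing what any word means.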

For a network to reason about language, it needs something better than labels. It needs a representation where the numbers mean something.

<!-- TODO: maybe add a small visual showing an image as a grid of numbers vs. a word mapped to an arbitrary index — the contrast is the key insight -->