The gap between pixels and words

Let's start from the one fact this whole module hangs on: a neural network cannot see, read, or hear anything. It only ever works with numbers. So before it can learn from a thing, that thing has to be turned into numbers first. For images, that turned out to be easy. For words, it turned out to be the hard problem of the decade. This slide is about why those two cases are so different.

Start with images, because there the numbers come for free. A digital photo is made of tiny dots called pixels, and each pixel is stored as a number for how bright it is: a low number for a dark dot, a high number for a bright one. Pure black might be 0, pure white 255, a mid-grey something around 128. (A colour photo simply stores three such numbers per dot, one each for red, green, and blue.) Even a small photo is just a giant grid of these brightness numbers, around 150,000 of them, and the network can start computing on them straight away.

Here is the part that matters, and it is easy to skim past. Those numbers don't just label the pixels; they genuinely describe them. Two dots that look almost the same shade of grey really do get two almost-equal numbers, say 128 and 130. So the closeness of the numbers is not a coincidence: numbers that sit near each other belong to dots that look alike. The arithmetic lines up with the real world. That single property, near in number means near in reality, is the quiet reason deep learning cracked images first.
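To make that concrete, here is a minimal sketch in Python. The tiny grid and the helper name brightness_gap are invented for illustration; they are not taken from any real photo or library. The point is only that subtracting two brightness numbers gives an answer that matches what the eye sees.

```python
# A made-up 2x3 greyscale "image": each entry is a brightness
# from 0 (pure black) to 255 (pure white).
image = [
    [0, 128, 130],
    [255, 127, 64],
]

def brightness_gap(a: int, b: int) -> int:
    """Absolute difference between two brightness values (hypothetical helper)."""
    return abs(a - b)

# Two greys that look almost identical sit close together as numbers...
similar_greys = brightness_gap(image[0][1], image[0][2])   # 128 vs 130

# ...while black and white, which look nothing alike, sit far apart.
black_vs_white = brightness_gap(image[0][0], image[1][0])  # 0 vs 255

print(similar_greys, black_vs_white)  # prints: 2 255
```

A small gap means the dots look alike and a large gap means they do not, so the network can lean on the arithmetic.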

Words do not come with numbers like that, and any we invent ourselves lack that magic property.

Suppose we just number the dictionary in alphabetical order: "a" is 1, "an" is 2, "dog" is 3, "dollar" is 4, and on down the list. Now look at what we have accidentally claimed. "Dog" got 3 and "dollar" got 4, so by the only thing the network can read, the numbers, they are right next to each other, practically the same. Meanwhile "puppy," sitting far down the alphabet, gets some number like 6,000, miles away from "dog." We have placed "dog" next to "dollar" and far from "puppy," which is exactly backwards. The number is just a name tag, handed out by alphabetical accident. It says nothing real about what the word means.
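The same subtraction can be tried on the word-numbers. This is a rough sketch, assuming the toy numbering from the example above; the names word_id and id_gap are hypothetical, and 6,000 for "puppy" is just the illustrative figure from the text.

```python
# Toy dictionary numbering from the example in the text.
# The value for "puppy" is illustrative, not a real dictionary position.
word_id = {"dog": 3, "dollar": 4, "puppy": 6000}

def id_gap(first: str, second: str) -> int:
    """Absolute difference between two word IDs (hypothetical helper)."""
    return abs(word_id[first] - word_id[second])

# Unrelated words end up as neighbours...
print(id_gap("dog", "dollar"))  # prints: 1

# ...while closely related words end up far apart.
print(id_gap("dog", "puppy"))   # prints: 5997
```

Here the arithmetic runs without complaint, but the answers are meaningless: the gap reflects spelling, not meaning.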

With pixels, closeness of numbers meant closeness in reality, and the network could trust it. With these word-numbers, closeness means nothing at all. That broken link is the entire problem, and the next slide shows why it is fatal rather than merely untidy.
