Reading through a keyhole

The networks from the last few modules were built, first and best, for images. And an image is a comfortable thing to hand a network: a grid of pixels, every dot sitting in a fixed place, all of it present at once. The network can take in the whole picture in one glance.

Language is not like that. A sentence is a sequence: words in a particular order, arriving one after another, where the meaning of one word can hinge on another word far down the line. You cannot take it in at a single glance, because it is not laid out in space like a photo. It is strung out in a line, like beads on a thread.

So how do you feed a thread of words to a network built for grids? In the mid-2010s, before the transformer, the answer was a design called a recurrent neural network, an RNN. The idea is the one you would probably reach for yourself.

Read one word at a time, left to right. After each word, jot down a short summary of the story so far. Read the next word, update the summary, and carry on. By the end of the sentence, that summary is meant to hold everything that came before.
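For readers who like to see the loop written down, here is a minimal sketch of that read-and-update routine in plain Python. It is an illustration under stated assumptions, not a trained model: the weights are random, the names `SUMMARY_SIZE`, `word_vector`, `update`, and `read_sentence` are made up for this example, and each word is turned into numbers by a toy stand-in for the learned word vectors a real RNN would use.

```python
import math
import random

SUMMARY_SIZE = 4  # hypothetical: how many numbers the running summary holds


def make_weights(rows, cols, rng):
    # Random, untrained weights. A real RNN would learn these from data.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def word_vector(word, size=SUMMARY_SIZE):
    # Toy stand-in for a learned embedding: the same word always maps
    # to the same list of numbers, because the word itself is the seed.
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


def update(summary, vec, w_summary, w_word):
    # One step: blend the old summary with the new word, then squash
    # every entry into the range -1..1 with tanh.
    new_summary = []
    for i in range(len(summary)):
        total = sum(w_summary[i][j] * summary[j] for j in range(len(summary)))
        total += sum(w_word[i][j] * vec[j] for j in range(len(vec)))
        new_summary.append(math.tanh(total))
    return new_summary


def read_sentence(sentence, seed=0):
    rng = random.Random(seed)
    w_summary = make_weights(SUMMARY_SIZE, SUMMARY_SIZE, rng)
    w_word = make_weights(SUMMARY_SIZE, SUMMARY_SIZE, rng)
    summary = [0.0] * SUMMARY_SIZE  # nothing read yet
    for word in sentence.split():  # one word at a time, left to right
        summary = update(summary, word_vector(word), w_summary, w_word)
    return summary  # all that is left of the sentence at the end


short = read_sentence("the cat sat")
longer = read_sentence("the cat sat on the mat by the door")
print(len(short), len(longer))
```

Notice that the summary is the same size whether the sentence has three words or nine. The loop never looks back at an earlier word; everything it knows about the past has to fit in that one short list of numbers.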

It is reading through a keyhole. The network never sees the whole sentence at once. It sees one word, plus whatever it managed to scribble down about all the words before. For a short sentence, that is enough. But the scribble is where the trouble hides, and the next slide is about what it can and cannot hold.