Reading left to right, word by word

The deep learning architectures from the previous module were designed for images: grids of pixels, where position has a clear spatial meaning. Language is different. Language is a sequence — words in a specific order, where meaning depends on relationships that might span dozens of words.

To handle sequences, researchers in the mid-2010s used a class of networks called recurrent neural networks, or RNNs.

The idea is straightforward: process one word at a time, left to right. After processing each word, carry forward a summary of everything you've seen so far — a vector called the hidden state. When you get to the next word, update that summary. By the end of the sentence, the hidden state is supposed to encode everything that came before.

This worked. For short sequences, it worked reasonably well. But it had a fundamental limitation baked into its design.

<!-- TODO: simple animation of tokens being processed one at a time with a hidden state vector being passed along would work well here -->