The bottleneck

The hidden state is a single vector. It has to carry everything the model has taken in from the sentence up to that point, all of it compressed into a fixed-size list of numbers.
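
To make "fixed-size" concrete, here is a minimal sketch of a plain recurrent step in pure Python. The names (`HIDDEN_SIZE`, `rnn_step`, `encode`), the random untrained weights, and the scalar inputs are all assumptions chosen for illustration, not taken from any particular model. What it shows is that the state has the same length after three inputs as after a thousand.

```python
import math
import random

HIDDEN_SIZE = 4  # assumed size, picked only to keep the example small


def rnn_step(hidden, x, w_h, w_x):
    """One plain RNN step: new_hidden = tanh(w_h @ hidden + w_x * x)."""
    return [
        math.tanh(
            sum(w_h[i][j] * hidden[j] for j in range(len(hidden))) + w_x[i] * x
        )
        for i in range(len(hidden))
    ]


def encode(sequence, seed=0):
    """Fold a whole sequence of scalar inputs into one hidden vector."""
    rng = random.Random(seed)
    # Untrained, random weights: enough to show the shape of the computation.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_SIZE)]
           for _ in range(HIDDEN_SIZE)]
    w_x = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_SIZE)]
    hidden = [0.0] * HIDDEN_SIZE
    for x in sequence:
        hidden = rnn_step(hidden, x, w_h, w_x)
    return hidden


short_state = encode([0.1, 0.2, 0.3])
long_state = encode([0.01 * i for i in range(1000)])
print(len(short_state), len(long_state))  # 4 4
```

However long the input, everything downstream only ever sees those four numbers.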

For a short sentence, that's fine. But for a long document, early context gets squeezed out as new information pushes in. By the time the model reaches the end of a long paragraph, the first sentence may have faded almost entirely from the hidden state.
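
The fading can be seen in a toy setting. The sketch below uses a one-number hidden state and hand-picked weights (`W_REC`, `W_IN` are assumptions, not learned values), then measures how much the final state changes when only the first input is flipped. With a recurrent weight of 0.5, each step shrinks that difference by at least half, so the first input's influence decays roughly exponentially with sequence length. Trained networks are not this simple, but the mechanism is the same.

```python
import math

W_REC = 0.5  # hand-picked recurrent weight (assumption for illustration)
W_IN = 1.0   # hand-picked input weight (assumption for illustration)


def final_state(sequence):
    """Run a scalar RNN over the sequence and return the last hidden state."""
    h = 0.0
    for x in sequence:
        h = math.tanh(W_REC * h + W_IN * x)
    return h


def first_input_effect(length):
    """Change in the final state when only the first input is flipped."""
    rest = [0.1] * (length - 1)
    return abs(final_state([1.0] + rest) - final_state([-1.0] + rest))


effects = {n: first_input_effect(n) for n in (1, 5, 20)}
for n, effect in effects.items():
    print(f"length {n:>2}: first input shifts final state by {effect:.2e}")
```

By length 20 the two runs end in almost the same state, even though they started from opposite inputs.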

This created a real ceiling on what the models could do. Translation got worse as documents got longer. Summaries of multi-paragraph text lost coherence. Following an argument across many sentences required context the hidden state simply couldn't hold.

LSTMs (Long Short-Term Memory networks) were an improvement: they added a gating mechanism that let the network choose what to remember and what to forget. They helped significantly. But they didn't solve the fundamental problem. The information still had to pass through a narrow channel, one step at a time.
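
As a rough sketch of what the gates do, here is one LSTM step reduced to scalars, following the standard forget/input/output gate formulation. The `params` layout and the specific weights are hypothetical, hand-set so that the forget gate stays near 1 and the input gate near 0. Under that assumption the cell holds on to most of its starting value over ten steps, which a plain recurrent step would not do.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    """One scalar LSTM step.

    p maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell to expose as h
    g = math.tanh(pre("candidate"))  # candidate value to write
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


# Hypothetical hand-set parameters: forget gate near 1, input gate near 0.
params = {
    "forget": (0.0, 0.0, 4.0),
    "input": (0.0, 0.0, -4.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with something already stored in the cell
for x in [0.1] * 10:
    h, c = lstm_step(x, h, c, params)
print(c)  # still above 0.8 after ten steps
```

Note what has not changed: the state is still a fixed-size pair `(h, c)`, and the loop still walks the sequence one input at a time.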

<!-- TODO: a visual of a "bottleneck" metaphor — information flowing through a narrow vector — could land the constraint clearly -->