The bottleneck

The hidden state is one fixed-length list of numbers, and it has to hold the whole sentence so far. For six words, fine. There is room to spare. But watch what happens as the sentence grows.

Every new word has to be squeezed into the same three hundred numbers. Nothing gets bigger. So each new word that goes in pushes a little of the older stuff out, the way pouring more water into a full glass spills whatever was already near the brim. By the end of a long paragraph, the opening sentence has mostly been crowded out. The numbers that once held it have been written over, again and again, by everything that came after.
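To make the crowding-out concrete, here is a minimal sketch in Python. It is a toy, not a trained network: the `embed` function, the `KEEP` factor, and the test words are all made up for illustration, and each of the three hundred numbers is updated on its own, where a real network would mix them through learned weights. It shows only two things: the state never grows, and swapping the first word matters less and less as more words follow.

```python
import math

HIDDEN_SIZE = 300   # the "three hundred numbers" from the text
KEEP = 0.5          # assumed: how much of the old state feeds into each update


def embed(word):
    """Turn a word into HIDDEN_SIZE numbers in [-1, 1] (made-up, untrained encoding)."""
    seed = sum(ord(ch) for ch in word)
    return [math.sin(seed * (i + 1)) for i in range(HIDDEN_SIZE)]


def read(words):
    """Feed words in one at a time, overwriting the same fixed-size state."""
    state = [0.0] * HIDDEN_SIZE
    for word in words:
        x = embed(word)
        # Blend the old state with the new word; the list never gets longer.
        state = [math.tanh(KEEP * h + xi) for h, xi in zip(state, x)]
    return state


def first_word_influence(n_words):
    """Largest change in the final state when only the first word is swapped."""
    filler = ["word"] * (n_words - 1)
    a = read(["cat"] + filler)
    b = read(["dog"] + filler)
    return max(abs(x - y) for x, y in zip(a, b))


for n in (1, 6, 20, 40):
    state = read(["cat"] + ["word"] * (n - 1))
    print(n, len(state), first_word_influence(n))
```

Run it and the middle column stays at 300 however long the sentence gets, while the last column shrinks toward zero: by forty words, the state barely remembers whether the sentence began with "cat" or "dog".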

This is the bottleneck. Everything the model knows about the sentence has to pass through that one narrow, fixed-size summary, one word at a time, like a crowd trying to pour through a single doorway. The doorway never widens. So the longer the text, the more gets lost on the way through.

And you could see the damage in what these systems did. Translate a long sentence and the ending would drift, because the start had faded by the time the model got there. Summarise three paragraphs and the thread would wander. Anything that asked the model to hold an idea across a long stretch ran straight into the same wall: the summary was full, and the past was already leaking out of it.

People found clever patches. The best known was the LSTM (long short-term memory) network, which gave the network small gates to decide what was worth keeping and what to let slip, so the most important things survived longer. It genuinely helped. But it did not move the wall; it just used the cramped space more wisely. The summary was still one fixed-size notepad, and the sentence still had to file through it a word at a time.
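The gating idea can be sketched the same way. What follows is a deliberately simplified stand-in, not a real gated network: a single hand-set "keep" gate per word, where a real network learns several gates from data. The names `uniform_gate` and `selective_gate`, the filler word "um", and the gate values are assumptions chosen for illustration.

```python
import math

HIDDEN_SIZE = 300  # still one fixed-size state


def embed(word):
    """Made-up, untrained encoding: HIDDEN_SIZE numbers in [-1, 1]."""
    seed = sum(ord(ch) for ch in word)
    return [math.sin(seed * (i + 1)) for i in range(HIDDEN_SIZE)]


def uniform_gate(word):
    """No judgement: every word overwrites half of the state."""
    return 0.5


def selective_gate(word):
    """Hand-set stand-in for a learned gate: filler barely disturbs the state."""
    return 0.95 if word == "um" else 0.2


def read_gated(words, keep_gate):
    state = [0.0] * HIDDEN_SIZE
    for word in words:
        keep = keep_gate(word)  # share of the old state that survives this word
        x = embed(word)
        state = [keep * h + (1.0 - keep) * xi for h, xi in zip(state, x)]
    return state


def first_word_influence(n_words, keep_gate):
    """Largest change in the final state when only the first word is swapped."""
    filler = ["um"] * (n_words - 1)
    a = read_gated(["cat"] + filler, keep_gate)
    b = read_gated(["dog"] + filler, keep_gate)
    return max(abs(x - y) for x, y in zip(a, b))


for n in (40, 200):
    print(n, first_word_influence(n, uniform_gate),
          first_word_influence(n, selective_gate))
```

With the selective gate, the first word is still clearly visible after forty filler words, where the uniform version has all but erased it. Stretch the text to two hundred words, though, and it fades under the selective gate too. The space is used more wisely, but it is the same space.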
