What to carry out of this chapter

Keyholes, running summaries, whispers down a line. If the names have already blurred, that is fine. You are not meant to keep them.

Here is the one thing to carry out of this chapter.

Before the transformer, machines read language one word at a time, left to right, holding a single short summary of the story so far. It worked for short text. It broke on long text, for two linked reasons. The summary was too small to hold a whole passage, so early words got crowded out. And worse, any two words far apart could only reach each other by passing a message through every word in between, and that message faded with distance, the very same fading we met as the vanishing gradient. The chain itself was the problem.

So the bar for a real fix was clear: get rid of the chain. Let any word connect directly to any other word, however far apart, with nothing in between to fade through. Let every word see every other word at once.

That web of direct connections has a name, and it is the whole next chapter: attention. We will build it from the ground up, starting from a question you already know how to answer: which word in this sentence matters most right now?
