The distance problem

There's a second problem, related but distinct: to connect two words that are far apart, the signal has to travel through every word in between.

In "The animal didn't cross the street because it was too tired," understanding what "it" refers to means connecting "it" (near the end) to "animal" (near the beginning). In an RNN, that connection requires the signal to pass through "didn't," "cross," "the," "street," "because" — every intermediate word.
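To make the chain concrete, here is a minimal sketch that lists the words sitting between the two ends of that connection. The helper name `words_between` and the whitespace tokenization are assumptions made for illustration; real models use subword tokenizers, not `str.split`.

```python
def words_between(sentence: str, earlier: str, later: str) -> list[str]:
    """Return the tokens strictly between `earlier` and the next `later`."""
    tokens = sentence.split()  # naive whitespace tokenization (assumption)
    start = tokens.index(earlier)
    end = tokens.index(later, start + 1)  # search only after `earlier`
    return tokens[start + 1 : end]


sentence = "The animal didn't cross the street because it was too tired"
between = words_between(sentence, "animal", "it")

print(", ".join(between))  # didn't, cross, the, street, because
print(len(between))        # 5
```

In a recurrent network, each of those five words triggers its own update to the hidden state before "it" is ever read, so whatever the network stored about "animal" has to survive all of them.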

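Each step rewrites the hidden state, and some of what it held about earlier words is lost. The longer the path, the weaker the signal by the end. This is closely tied to the vanishing gradient problem from Module 5: the repeated transformations that shrink gradients during training also wash out information in the forward pass, so the problem shows up at inference time as well as training time.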
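A toy calculation shows how quickly a signal fades along a long path. This is a sketch under a strong simplifying assumption: every step keeps a fixed fraction of the signal. The `retain` value of 0.8 is an arbitrary illustrative number, and `surviving_signal` is a hypothetical helper; real recurrent networks do not decay at a constant rate.

```python
def surviving_signal(steps: int, retain: float = 0.8) -> float:
    """Fraction of the original signal left after `steps` updates,
    assuming each update keeps a fixed fraction `retain` (illustrative)."""
    signal = 1.0
    for _ in range(steps):
        signal *= retain  # one recurrent step: part of the signal is lost
    return signal


for steps in (1, 5, 20, 50):
    print(f"{steps:>2} steps: {surviving_signal(steps):.6f}")
```

Even with 80% retained at each step, only about 1% of the original signal is left after 20 steps. The loss compounds multiplicatively, which is why distance hurts so much more than a single lossy step would suggest.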

For language with long-range dependencies, such as a pronoun referring back to a noun from several sentences ago or a clause that modifies something stated much earlier, recurrent networks struggled, and the struggle was built into the design itself.

The solution couldn't be a minor tweak to the architecture. It needed a different kind of architecture — one where any word could connect directly to any other word, without passing through anything in between.

<!-- TODO: diagram showing two words connected through a long chain of intermediate steps vs. a direct connection — the contrast sets up attention perfectly -->