Which words matter right now?
Attention starts with a simple question: when you're trying to understand a particular word, which other words in the sentence are most relevant?
For the word "it" in "The trophy didn't fit in the suitcase because it was too big," the relevant word is "trophy." For "it" in "The trophy didn't fit in the suitcase because it was too small," the relevant word is "suitcase." Same pronoun, different context, different referent.
A human reader makes this connection instantly. The question is how to give a neural network the same ability — not to pass information through a chain of intermediate steps, but to look directly at any word in the sequence and ask: how relevant is that word to what I'm currently processing?
Attention computes exactly that. For each word being processed, it calculates a relevance score against every other word in the sequence. Those scores determine how much each word contributes to the final representation.
<!-- TODO: an interactive example where "it" lights up and shows which word gets the highest attention score in both sentences would be ideal here -->