Queries, keys, and values
The attention mechanism uses three sets of vectors, with names borrowed from database terminology: queries, keys, and values.
Here's the intuition:
The query is what you're looking for. When processing the word "it," the query encodes something like "what noun am I referring to?"
The keys are what each word advertises about itself. "Trophy" might have a key that says "I'm a noun, I'm an object, I appeared earlier in this sentence."
The values are what each word actually contributes. If a key matches the query well, the corresponding value gets incorporated into the output.
The mechanism computes a score between the query and every key in the sequence. A higher score means the corresponding word is more relevant to the query. Those scores are normalized into weights that sum to 1 (typically with a softmax), and the weighted sum of the values becomes the new representation for the word being processed.
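For readers who prefer code to prose, here is a minimal sketch of that score, weight, and sum pipeline for a single query. It assumes scaled dot-product scoring and a softmax, which is the common choice in Transformers but is not specified above. The function names, the word "suitcase," and every number in the toy vectors are hypothetical, chosen by hand so that the key for "trophy" lines up with the query for "it." In a real model the queries, keys, and values are produced by learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score one query against every key, then blend the values.

    Assumes scaled dot-product scoring: dot(query, key) / sqrt(dimension).
    Returns (weights, output).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output component at a time.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical 2-dimensional toy vectors, hand-picked for illustration.
words = ["trophy", "suitcase", "it"]
keys = [[4.0, 0.0], [1.0, 1.0], [0.0, 0.5]]    # what each word advertises
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # what each word contributes
query_it = [1.0, 0.0]                          # what "it" is looking for

weights, output = attend(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
print("new representation for 'it':", [round(x, 2) for x in output])
```

With these made-up numbers, "trophy" receives the largest weight, so the new representation for "it" is pulled mostly toward the value vector for "trophy."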
This is what makes attention so effective: it gives each word a direct, flexible way to build a representation from the context that matters to it. The result draws on neither all of the context equally nor none of it, but on each word in proportion to its relevance.
<!-- TODO: a diagram of the Q/K/V mechanism, showing one word's query scoring against all keys, would make this concrete without needing any math -->