What the model actually sees

When you type a message to a language model, you see words and sentences. The model sees a row of numbers, one per token.

Each token in the vocabulary has an ID number. In one common tokenizer, "The" is 464, " cat" is 3797, and " sat" is 3332. So the sentence "The cat sat" reaches the model as three numbers: 464, 3797, 3332. That row of numbers is the whole input. The model never gets the letters back.
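To make the text-to-numbers step concrete, here is a minimal sketch in Python. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`) with made-up IDs and a simple greedy longest-match rule. Real tokenizers learn much larger vocabularies and typically split text by byte-pair-encoding merges instead, so treat this as an illustration of the idea, not of any particular tokenizer.

```python
# Toy vocabulary: token text -> ID. These IDs are invented for illustration
# and do not come from any real tokenizer.
TOY_VOCAB = {"The": 0, " cat": 1, " sat": 2, "cat": 3}


def encode(text, vocab):
    """Turn text into a list of token IDs using greedy longest-match.

    This is a simplification: at each position we take the longest
    vocabulary entry that matches the text starting there.
    """
    ids = []
    pos = 0
    # Try longer tokens first so " cat" wins over any shorter overlap.
    candidates = sorted(vocab, key=len, reverse=True)
    while pos < len(text):
        match = next((t for t in candidates if text.startswith(t, pos)), None)
        if match is None:
            raise ValueError(f"no token covers the text at position {pos}")
        ids.append(vocab[match])
        pos += len(match)
    return ids


ids = encode("The cat sat", TOY_VOCAB)
print(ids)  # the model receives only this list, not the letters
```

Note that "cat" and " cat" have different IDs in this toy vocabulary, so `encode("cat", TOY_VOCAB)` and the middle token of the sentence above come out as different numbers.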

This simple fact has consequences that catch people out. We just met one: because the model only ever sees whole tokens, never the letters sealed inside them, it can fumble a question as easy as how many r's are in "strawberry."
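A small sketch can show why letter counting is awkward from the model's side. It assumes "strawberry" is split into three tokens, "str", "aw", and "berry", with made-up IDs. The actual split and IDs depend on the tokenizer, so both are hypothetical here.

```python
# Hypothetical split of "strawberry" into tokens, with invented IDs.
TOY_PIECES = {"str": 10, "aw": 11, "berry": 12}

word = "strawberry"
pieces = ["str", "aw", "berry"]
assert "".join(pieces) == word  # the pieces really do spell the word

# What the model receives: IDs only.
token_ids = [TOY_PIECES[p] for p in pieces]
print(token_ids)

# What a person can do with the letters in view: count the r's directly.
print(word.count("r"))

# The r's are spread across tokens, and nothing in the bare numbers
# 10, 11, 12 says how many letters of any kind each token contains.
r_per_token = [p.count("r") for p in pieces]
print(r_per_token)
```

Counting the r's here needs the token text, which lives in the vocabulary table. The ID sequence on its own carries no spelling.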

The next two slides show two more, and both matter in everyday use: why the same sentence can cost very different amounts in different languages, and why "cat" and " cat" are two separate tokens at all.
