What the model actually sees
When you type a message to a language model, you see words and sentences. The model sees a sequence of numbers — one per token.
Each token in the vocabulary is assigned an ID. "Hello" might be token 15496. A space followed by "world" might be token 995. The model receives a list of those IDs and works with them from there.
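To make the text-to-IDs step concrete, here is a minimal sketch. The vocabulary `TOY_VOCAB` and the `encode` helper are made up for illustration: the first two IDs echo the examples above, the ID for "!" is arbitrary, and greedy longest-match is a simplified stand-in for how a real tokenizer splits text.

```python
# Toy vocabulary (hypothetical). A real tokenizer has tens of thousands
# of entries learned from data, not three written by hand.
TOY_VOCAB = {"Hello": 15496, " world": 995, "!": 0}

def encode(text, vocab):
    """Greedy longest-match encoding: a simplified stand-in for a real tokenizer."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids

token_ids = encode("Hello world!", TOY_VOCAB)
print(token_ids)  # [15496, 995, 0]
```

The list of integers on the last line is all the model receives. The original characters are not passed along with it.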
This has some strange consequences.
The model doesn't see individual letters; it sees tokens, and a token usually covers several characters at once. That is why it can't reliably count the letters in a word. Ask a model how many r's are in "strawberry" and it may get it wrong. Not because it's bad at counting, but because it never saw the letters, only the IDs of the token or tokens that stand for the word.
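A small sketch of why the letters are out of reach. The split of "strawberry" into three pieces and the IDs 101 to 103 are invented for this example; a real tokenizer may split the word differently.

```python
# Hypothetical pieces and IDs, chosen only for illustration.
TOY_PIECES = {"str": 101, "aw": 102, "berry": 103}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_PIECES.items()}

# What the model would receive for "strawberry" under this toy split.
model_input = [101, 102, 103]

# The integers alone say nothing about spelling. Counting r's needs the
# ID-to-text table, which is not part of the model's input.
r_per_token = {token_id: ID_TO_PIECE[token_id].count("r") for token_id in model_input}
total_r = sum(r_per_token.values())

print(r_per_token)  # {101: 1, 102: 0, 103: 2}
print(total_r)      # 3
```

The count is trivial once each ID is mapped back to text. Without that mapping, the input is three opaque numbers.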
Similarly, different languages are tokenized with different efficiency. Languages that were well represented in the data used to build the vocabulary tend to have more dedicated tokens, so their words map to only a few tokens. Languages with less data get broken into smaller fragments, meaning the model uses more tokens to process the same amount of meaning.
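The effect can be sketched with a deliberately exaggerated toy model. It assumes a vocabulary (`WORD_TOKENS`, hypothetical) that has whole-word tokens for a few English words and falls back to one token per UTF-8 byte for everything else. Real tokenizers merge many of those bytes into larger pieces, so actual counts are usually lower, but the direction of the gap is the same.

```python
# Hypothetical vocabulary: whole-word tokens exist only for a few English words.
WORD_TOKENS = {"hello", "thank", "you"}

def count_tokens(word):
    """One token if the word is in the vocabulary, else one token per UTF-8 byte."""
    if word in WORD_TOKENS:
        return 1
    return len(word.encode("utf-8"))

# The same greeting in English, Russian, and Japanese.
counts = {word: count_tokens(word) for word in ["hello", "привет", "こんにちは"]}
print(counts)  # {'hello': 1, 'привет': 12, 'こんにちは': 15}
```

Each Cyrillic letter here takes two bytes in UTF-8 and each hiragana character takes three, which is why the fallback counts grow so quickly.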
<!-- TODO: show a specific example of surprising tokenization — "strawberry" split oddly, or a non-English word exploding into many tokens -->