Context windows are measured in tokens

One reason tokenization matters beyond the model's internals: the practical limits and costs of using a language model are all measured in tokens.

The context window, the total amount of text a model can hold in memory at once, is a token limit, not a word limit. A 100,000-token context window can hold roughly 75,000 words of English, which works out to about three words for every four tokens. The exact figure runs higher or lower depending on how efficiently the text tokenizes.
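The arithmetic behind the 100,000-token, 75,000-word figure can be sketched in a few lines. This is a rough estimate only: the 0.75 words-per-token ratio is an assumed rule of thumb for English, and the function names are hypothetical. A real count requires running the model's actual tokenizer.

```python
# Assumed rule of thumb: about 0.75 English words per token.
# The true ratio depends on the tokenizer and on the text itself.
WORDS_PER_TOKEN = 0.75


def estimate_words(context_tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough number of words that fit in a context window of the given size."""
    return int(context_tokens * words_per_token)


def fits_in_context(word_count: int, context_tokens: int,
                    words_per_token: float = WORDS_PER_TOKEN) -> bool:
    """Rough check: would a document of word_count words fit in the window?"""
    estimated_tokens = word_count / words_per_token
    return estimated_tokens <= context_tokens


print(estimate_words(100_000))           # 75000
print(fits_in_context(60_000, 100_000))  # True
print(fits_in_context(80_000, 100_000))  # False
```

Text that tokenizes less efficiently (code, or languages the tokenizer handles poorly) would call for a lower ratio, which shrinks the number of words that fit.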

Cost and speed are also per-token. When you use a commercial language model, you pay for tokens in and tokens out, and the response is generated one token at a time. Long conversations and large documents get expensive and slow partly because they're long in tokens.
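To make per-token pricing concrete, here is a minimal sketch of a cost estimate. The prices are made up for illustration and the function name is hypothetical. Real providers publish their own rates, which often differ for input and output tokens.

```python
# Hypothetical prices in US dollars per million tokens.
# These numbers are illustrative assumptions, not any provider's real rates.
PRICE_PER_MILLION_INPUT = 3.00
PRICE_PER_MILLION_OUTPUT = 15.00


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float = PRICE_PER_MILLION_INPUT,
                  output_price: float = PRICE_PER_MILLION_OUTPUT) -> float:
    """Estimated cost in dollars of one request, billed per token in and out."""
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


short_request = estimate_cost(input_tokens=2_000, output_tokens=500)
long_request = estimate_cost(input_tokens=100_000, output_tokens=500)

print(f"short prompt: ${short_request:.4f}")
print(f"long prompt:  ${long_request:.4f}")
```

Under these assumed prices, the two requests produce the same amount of output, yet the one with the long document attached costs many times more, purely because of its input token count.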

This framing also sets up what comes next. The model takes in a sequence of token IDs and needs to do something with them. Its first step is to look each ID up in an embedding table, turning the list of IDs into a list of vectors the network can actually compute with.
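As a preview, the lookup step can be sketched with a toy table. Everything here is an illustrative assumption: the table has four token IDs and three dimensions with made-up values, whereas a real model learns its values during training and uses a far larger vocabulary and vector size.

```python
# Toy embedding table: one row (vector) per token ID.
# Values are invented for illustration; a real model learns them.
EMBEDDING_TABLE = [
    [0.1, -0.3, 0.7],   # token ID 0
    [0.5, 0.2, -0.1],   # token ID 1
    [-0.4, 0.9, 0.0],   # token ID 2
    [0.3, 0.3, 0.3],    # token ID 3
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Turn a list of token IDs into a list of vectors by table lookup."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


vectors = embed([2, 0, 2, 3])
for token_id, vector in zip([2, 0, 2, 3], vectors):
    print(token_id, vector)
```

Note that the same token ID always maps to the same vector: both occurrences of ID 2 come back identical. The lookup itself knows nothing about where a token sits in the sequence.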

<!-- TODO: segue is natural into chapter 3 (embeddings) — might be worth a visual showing the pipeline: text → tokens → IDs → embedding vectors -->