Not words. Tokens.

Before a language model reads a single sentence, a tokenizer breaks that sentence into pieces. Those pieces are called tokens.

A token is not a word. It might be a whole word — "dog" could be one token. Or it might be part of a word — "unbelievable" might become "un", "believ", "able". Or it could be a punctuation mark, a space, or a number.

The specific splits come from a vocabulary of common subword fragments that was learned from a large sample of text. Frequent words get their own token. Rare words get broken into smaller pieces that do appear in the vocabulary.
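To make the idea concrete, here is a minimal sketch of splitting text against a fixed vocabulary. It is an illustration under stated assumptions, not how any particular model's tokenizer works: the vocabulary `TOY_VOCAB` is hand-picked and hypothetical, the function name `tokenize` is made up for this example, and the matching rule is a simple greedy longest-match. Production tokenizers usually learn their vocabulary from data (for example with byte-pair encoding) and apply more involved rules.

```python
# Hypothetical, hand-picked vocabulary for illustration only.
# A real vocabulary is learned from a large text sample.
TOY_VOCAB = {"dog", "un", "believ", "able", "the", " ", "."}


def tokenize(text, vocab):
    """Split text into pieces using greedy longest-match against vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i, then shrink it
        # until some piece is found in the vocabulary.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing in the vocabulary matches here:
            # fall back to emitting a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("dog", TOY_VOCAB))           # ['dog']
print(tokenize("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
print(tokenize("the dog.", TOY_VOCAB))      # ['the', ' ', 'dog', '.']
print(tokenize("cat", TOY_VOCAB))           # ['c', 'a', 't']
```

The word "dog" is in the toy vocabulary, so it stays whole. "unbelievable" is not, so it is assembled from smaller pieces that are. "cat" has no matching pieces at all, so the sketch falls back to single characters, which is one simple way to guarantee that any input can still be represented.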

This is a practical choice. A vocabulary of whole words would need to include every inflection, every proper noun, and every technical term, which is an endless list. A subword vocabulary can cover almost any text with a compact set of tokens, typically tens of thousands up to a few hundred thousand depending on the model.

<!-- TODO: interactive tokenizer showing a sentence being broken into colored token chunks would be the ideal illustration here -->