What these tools are bad at

Language models are genuinely impressive at a wide range of tasks. They are also structurally bad at specific things that are worth naming clearly.

Counting and precise arithmetic. The model processes tokens, not digits. It can often get arithmetic right because arithmetic patterns appeared in training data, but it doesn't perform arithmetic the way a calculator does. It can fail in surprising ways on simple calculations.

Knowing what it doesn't know. The model cannot reliably flag its own uncertainty. It's as fluent when it's wrong as when it's right.

Tasks requiring current information. The training cutoff is real. Anything that happened after that date doesn't exist to the model without retrieval tools.

Consistent behavior across long contexts. Instructions given early in a long conversation may fade as the context fills. The model doesn't reliably apply constraints from the beginning of a conversation to responses at the end.

Precise spatial or logical reasoning. Tasks requiring careful step-by-step logic — complex proofs, intricate puzzles — are where next-token prediction fails most visibly.

None of this means these tools are useless. It means knowing what they're bad at is just as important as knowing what they're good at.