Inference

Training is the expensive part. You adjust weights across millions of examples: running the network forward, measuring the error, pushing the error backward, and updating every weight. You do this thousands of times. For large models it takes enormous amounts of computation: weeks, sometimes months, on specialized hardware.
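To make the loop concrete, here is a minimal sketch of training, shrunk to a "network" with a single weight. The function name train, the learning rate, and the toy data are all assumptions chosen for illustration; a real network repeats the same four steps over millions of weights.

```python
# Toy training loop: learn w so that w * x matches the target.
# Everything here (names, learning rate, data) is illustrative.

def train(examples, learning_rate=0.01, epochs=1000):
    w = 0.0  # the single weight, starting from an arbitrary value
    for _ in range(epochs):                # repeat the whole pass many times
        for x, target in examples:
            prediction = w * x             # run the network forward
            error = prediction - target    # measure the error
            gradient = 2 * error * x       # push the error backward: d(error^2)/dw
            w -= learning_rate * gradient  # update the weight
    return w

examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # generated by target = 3 * x
w = train(examples)
print(round(w, 3))
```

The weight settles close to 3, the value that generated the data. Notice how much repetition it takes to learn even one number.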

Inference is the cheap part. You take a new input, run it through the fixed, frozen weights, and read the output. No learning happens. The weights don't change. The network just computes.
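A minimal sketch of inference, assuming a tiny model whose two trained numbers (a slope and a bias, both hypothetical) are already known. There is no error, no gradient, and no update: only the forward computation.

```python
# Inference with frozen weights: a forward pass and nothing else.
# WEIGHTS stands in for the result of training; the values are made up.
WEIGHTS = (3.0, 1.0)  # (slope, bias), stored as a tuple so they cannot be modified

def infer(x, weights=WEIGHTS):
    slope, bias = weights
    return slope * x + bias  # compute the output; nothing is learned

outputs = [infer(x) for x in (0.0, 2.0, 10.0)]
print(outputs)
print(WEIGHTS)  # unchanged after any number of calls
```

Compared with a training loop, the body is a single line of arithmetic, and calling it never alters the weights.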

This is why a deployed model can respond quickly, often in milliseconds. The hard work was done once, during training. What remains is a series of fast mathematical operations on a fixed set of numbers.

The same trained weights can serve millions of users at once. Each request costs one forward pass; the expensive work of training doesn't repeat.
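As a rough sketch of that sharing, the example below answers several requests concurrently from one read-only set of weights. The weights, the infer function, and the requests are hypothetical, and a thread pool is only one simple way to stand in for many users arriving at once.

```python
from concurrent.futures import ThreadPoolExecutor

# One set of trained weights, loaded once and never written to again.
WEIGHTS = (0.5, -1.0, 2.0)  # illustrative values

def infer(features):
    # Forward pass: a weighted sum of the input features.
    # It only reads WEIGHTS, so many callers can use it at the same time.
    return sum(w * f for w, f in zip(WEIGHTS, features))

requests = [(1.0, 2.0, 3.0), (0.0, 0.0, 1.0), (2.0, 2.0, 2.0)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the requests
    responses = list(pool.map(infer, requests))

print(responses)
```

Because the weights are never modified, no request has to wait for another to finish with them.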