What the network sees

The network sees only numbers.

An image enters as a grid of pixel values — each pixel a number representing brightness or color. Text enters as a sequence of token IDs (integers that stand in for words or word fragments). The network has no access to meaning, only to the numbers that represent it.

What the network learns is a series of transformations. Each layer takes the output of the previous layer and reshapes it — computing a more abstract version of the input that's more useful for the task at hand. The first layer might respond to raw pixel differences. By the middle layers, the representation has nothing obvious to do with pixels. By the final layer, what's left is a compressed signal optimized for making the right decision.

The original image is still in there, somewhere — but it's been transformed into something the network can act on directly.