By 1986, the method existed. You could take a network with many layers, show it examples, trace errors backwards, and adjust every weight. The algorithm worked.

But in practice, it barely held together. Go deep enough and the error signal faded before reaching the early layers. The hardware was too slow for anything ambitious. Labeled data was scarce and expensive to assemble.

The algorithm was right. What it needed was scale: more layers that could actually be trained, more data to train on, more computing power to make it practical. None of those were in place yet.

The next chapter asks a different question — not how a network learns, but what it has actually become once learning is done. What does a trained model contain? What does it know? What does "knowing" even mean when it's all just numbers?