By the mid-2000s, the tools existed. ReLU. Better initialization. Batch normalization. Skip connections. The vanishing gradient was a solvable problem, and researchers had solved it, piece by piece.

But solving it in theory wasn't the same as making it work in practice. Training a deep network on anything interesting still took too long. The hardware was too slow. The datasets were too small to let deep networks prove their advantage over the shallower alternatives.

The technique was ready. What was missing was scale.

The next chapter is about where that scale came from — and why it had been sitting, unused, inside gaming computers the whole time.