The vanishing gradient

Backpropagation works by sending an error signal backwards through the network: from the output all the way to the first layer. Each layer uses that signal to adjust its weights.

The problem: at each layer, the signal gets slightly weaker. Not because something goes wrong, but because of the math involved. By the time it reaches the early layers, after passing through ten, twenty, thirty layers, it has become so small it's almost nothing. The early layers barely adjust. They learn almost nothing.

Add more layers, and the problem compounds. The further from the output, the fainter the signal.

This was the vanishing gradient problem. The architecture worked in principle. In practice, the deeper you went, the harder it became to train the front of the network — and the front was where the basic, foundational pattern-recognition happened.

Deep networks were theoretically more powerful than shallow ones. But training them reliably was another matter entirely.