Limits in the 1990s

Backpropagation worked. But only for shallow networks — one or two hidden layers at most.

When researchers tried to go deeper, something went wrong. The error signal traveling backwards through many layers kept getting weaker. By the time it reached the early layers at the front of the network, it had almost disappeared. Those weights barely changed. Training stalled. The deeper you tried to go, the worse it got.

This had a name: the vanishing gradient. The signal that was supposed to teach the early layers just... faded out before it arrived.

Hardware was another problem. Training even a shallow network on a large dataset required more computing power than most researchers could access. And labeled data (examples with the correct answer attached) was hard to come by. Assembling enough of it meant enormous amounts of human effort.

The algorithm existed. The idea was right. But the pieces needed to make it work at scale weren't in place yet.