Other fixes

ReLU solved a big part of the vanishing gradient problem. But it wasn't the only fix that helped.

Researchers also discovered that the starting conditions mattered. If the initial weights were set too large or too small, the network would collapse before training even got going properly. Careful weight initialization (starting values in a range that kept the network stable) turned out to make a meaningful difference.

Then came batch normalization: a technique that keeps the values flowing through each layer within a well-behaved range during training. Without it, values could grow wildly large or shrink to near-zero as they moved through a deep network, destabilizing the whole process.

Perhaps the most elegant fix came later. Researchers at Microsoft built networks where the signal could take a shortcut, jumping over several layers and flowing directly to earlier parts of the network. These skip connections gave the gradient a fast path backwards, bypassing the very layers where it had been getting lost.

None of these were dramatic new ideas. Each one addressed the same underlying problem from a slightly different angle. Together, they made it possible to train networks with dozens, then hundreds, of layers. The path down was finally open.