Other fixes

ReLU solved a big part of the vanishing gradient problem. But it wasn't the only fix that helped.

Researchers also discovered that the starting conditions mattered. If the initial weights were set too large or too small, the network would collapse before training even got going properly. Careful weight initialization (starting values in a range that kept the network stable) turned out to make a meaningful difference.

Then came , which keeps the values flowing through each layer within a steady range. Without it, those values could balloon or collapse toward zero as they moved through a deep network, throwing the whole process off.

Perhaps the most elegant fix came later. Researchers at Microsoft built networks where the signal could take a shortcut, leaping over several layers to reach earlier parts of the network directly. These handed the backwards signal a fast path home, sidestepping the very layers where it had been getting lost.

None of these were dramatic new ideas. Each one addressed the same underlying problem from a slightly different angle. Together, they made it possible to train networks with dozens, then hundreds, of layers. The path down was finally open.

Other fixes

ReLU solved a big part of the vanishing gradient problem. But it wasn't the only fix that helped.