The vanishing gradient

You just recalled how a network learns. sends an error signal backwards through the network, from the output all the way to the first layer, and each layer uses that signal to adjust its weights. That is how the front of the network gets told what to do.

Here is the problem. At each layer the signal travels back through, it gets a little weaker. Not because anything malfunctions, but because of the arithmetic involved. By the time it reaches the early layers, after passing back through ten, twenty, thirty of them, it has shrunk to almost nothing.

So the early layers barely move. Remember what "learning" means here: adjusting the weights, the dials inside each neuron, in response to the error signal. A strong signal means a big, confident adjustment. A signal shrunk to almost nothing means almost no adjustment at all. The weights in the front of the network sit there, barely changing, guess after guess. The network can see it is getting things wrong, but the message telling the front how to fix itself never arrives with any strength.

Add more layers, and it gets worse. The further a layer sits from the output, the fainter the signal that reaches it.

This was the vanishing gradient problem, and it was the obstacle the whole chapter is named for. The design worked on paper. In practice, the deeper you built, the harder it became to train the front of the network. And the front was exactly where the most basic work happens.

Remember from the last module: in a network that looks at images, the first layer is the one that learns to spot edges, the tiny differences in brightness between neighboring pixels. Later layers combine those edges into shapes, shapes into parts, parts into whole concepts like "face." Every later layer is built on top of what the front detects. So if the front never learns to see edges properly, everything stacked above it is building on sand. The vanishing gradient was starving the network exactly where its foundation was supposed to form.

Deep networks were, in theory, more powerful than shallow ones. Training them reliably was another matter entirely.

The vanishing gradient

Add more layers, and it gets worse. The further a layer sits from the output, the fainter the signal that reaches it.

Deep networks were, in theory, more powerful than shallow ones. Training them reliably was another matter entirely.