The activation function problem

The root cause was something called the activation function — a small calculation each neuron performs after adding up its inputs. Early networks used a function called sigmoid, which squashes any value into a gentle S-curve, always producing a number between 0 and 1.

That sounds harmless. The problem is that when backpropagation passes through a sigmoid layer, the signal gets multiplied by a small number: the slope of the S-curve, which is never more than 0.25. Multiplied by another small number in the next layer. And the next. After ten layers, those slopes alone have shrunk the signal by a factor of about a million or more. It's effectively gone.
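To put a rough number on that shrinking, here is a minimal sketch in Python. It makes simplifying assumptions: it ignores the connection weights entirely and treats every layer as contributing the same slope, and the function names are made up for illustration. It tracks only how much signal the sigmoid slopes let through.

```python
import math

def sigmoid(x):
    """Squash any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Slope of the sigmoid at x; this is the factor backpropagation multiplies by."""
    s = sigmoid(x)
    return s * (1.0 - s)

def surviving_fraction(num_layers, pre_activation=0.0):
    """Fraction of the signal left after passing back through num_layers sigmoids.

    Assumption for illustration: every layer sees the same input value,
    and weights are ignored.
    """
    return sigmoid_slope(pre_activation) ** num_layers

# The slope peaks at an input of 0, so this is the best case for sigmoid.
print(sigmoid_slope(0.0))        # 0.25
print(surviving_fraction(10))    # roughly 9.5e-07, about one millionth
```

Away from an input of 0 the slope is smaller still, so a real network would usually lose the signal even faster than this best case suggests.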

The fix came from an idea so simple it seemed almost too obvious to work.

What if a neuron just passed its input straight through, unchanged, when the input was positive, and output zero when it was negative? No squashing. No shrinking. Just: if it's positive, let it through. If it's not, block it.

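This is called ReLU, short for Rectified Linear Unit. When a signal passes back through a ReLU neuron, it doesn't get multiplied by a small number. It gets multiplied by exactly 1 if the neuron let its input through, or by 0 if the neuron blocked it. Along any path of neurons that let their inputs through, the gradient doesn't fade.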
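The rule is short enough to write out directly. The sketch below is an illustration rather than how any particular library implements it; the helper chain_factor is a made-up name, and as before the connection weights are left out so that only the activation function's effect is visible.

```python
def relu(x):
    """Pass positive inputs through unchanged; output zero otherwise."""
    return x if x > 0 else 0.0

def relu_slope(x):
    """Factor backpropagation multiplies by: 1 if the neuron was active, else 0."""
    return 1.0 if x > 0 else 0.0

def chain_factor(slope_fn, inputs):
    """Multiply the slopes along one path of neurons, one input value per layer.

    Hypothetical helper for illustration; weights are ignored.
    """
    factor = 1.0
    for x in inputs:
        factor *= slope_fn(x)
    return factor

# Ten layers, every neuron on the path active: the signal arrives at full strength.
print(chain_factor(relu_slope, [0.5] * 10))            # 1.0

# One blocked neuron on the path stops the signal entirely.
print(chain_factor(relu_slope, [0.5] * 9 + [-0.5]))    # 0.0
```

The second case is the trade-off in the design: a blocked neuron passes nothing back at all, but wherever a neuron is active, the signal is not shrunk on its way through.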

One small change to how each neuron worked. It turned out to make all the difference for deep networks.