Depth vs. width

So we can finally train deep networks. But here is a fair question: why bother with depth at all? You could keep a single layer and just make it enormously wide, thousands of neurons all in one row. Why stack layers instead of widening one?

The answer is something you already understand, because this whole course has been about it.

Remember the picture from the last module: the first layer of an image network notices edges, the next combines those edges into shapes, the next combines shapes into parts, until the final layer is working with a whole face. Simple things combining into more complex things, one step at a time. You saw it again a few pages ago, when the front layers turned out to be the ones learning the basics that everything else is built on.

And it goes back further than that. It is the same shape as the very first chapters: single switches combined into logic, logic combined into arithmetic, arithmetic into a working machine. Almost everything powerful in this story is built the same way, small pieces stacked into bigger ones, layer upon layer.

That stacking is exactly what depth gives a network, and what a single wide layer cannot. A wide layer has only one storey, so it has to recognise a whole face in a single leap, with no chance to build up to it. A deep network gets a storey for each step: edges, then shapes, then parts, then the face. It builds the way the problem is actually built.

That is why depth wins. Because it matches the natural layering of the problem, a deep network can do the same job with far fewer neurons in total than a single wide layer, which has to make up for its missing storeys with sheer size. Going deeper was never a stylistic choice. It was the right shape for the task, the same shape the whole course keeps returning to.
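
To see the saving in neurons on something small enough to count, here is a minimal sketch in Python. It is a toy illustration, not the image network from the course: the function names (tent_layer, deep_sawtooth, count_linear_pieces) are made up for this example, and it assumes NumPy is installed. Each layer uses just two ReLU neurons to fold the interval from 0 to 1 in half. Stacking that same layer doubles the number of straight-line pieces in the output each time, whereas a single hidden layer of ReLU neurons with one input can add at most one new piece per neuron.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tent_layer(x):
    # Hypothetical toy layer with two ReLU neurons.
    # On [0, 1] it rises from 0 to 1, then falls back to 0.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    # Stack the same two-neuron layer `depth` times.
    for _ in range(depth):
        x = tent_layer(x)
    return x


def count_linear_pieces(depth):
    # Sample on a grid four times finer than the spacing of the kinks,
    # so every kink lands exactly on a grid point.
    x = np.linspace(0.0, 1.0, 2 ** (depth + 2) + 1)
    y = deep_sawtooth(x, depth)
    slope_signs = np.sign(np.diff(y))
    # Each change of slope direction starts a new straight-line piece.
    return 1 + int(np.count_nonzero(np.diff(slope_signs)))


if __name__ == "__main__":
    for depth in (1, 2, 3, 5, 10):
        pieces = count_linear_pieces(depth)
        deep_neurons = 2 * depth
        # One hidden ReLU layer with n neurons and one input
        # has at most n + 1 straight-line pieces.
        wide_neurons_needed = pieces - 1
        print(depth, pieces, deep_neurons, wide_neurons_needed)
```

At depth 10 the stacked version uses 20 neurons and produces 1024 pieces, while a single hidden layer would need at least 1023 neurons to draw the same zigzag. The gap doubles with every storey added. A zigzag is a far cry from a face, so treat this only as a small, countable instance of the same idea: when the problem is built in steps, building the network in steps is cheaper.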
