What made it work
AlexNet wasn't one breakthrough. It was several older ideas, combined and executed at the right scale for the first time.
The depth came from stacking multiple layers of neurons that each learned progressively more abstract features from the layer before — raw pixels in, high-level concepts out. The vanishing gradient fix came from ReLU, which let the training signal flow cleanly through all those layers without fading.
GPU training made the whole thing feasible. Without it, training on 1.2 million images would have taken months.
Two techniques helped the network generalize instead of memorize. The first was dropout: during training, neurons were randomly switched off, forcing the network to learn redundant pathways and not rely on any single route. The second was data augmentation: each training image was shown in slightly varied forms — flipped, cropped, shifted — so the network saw far more variety than the original dataset contained.
None of these ideas were brand new. The contribution was putting them together carefully, at scale, and showing that the combination worked dramatically better than the previous approach.
The result spoke for itself.