Deep learning had conquered vision. Show a network enough images, let it train long enough on GPU hardware, and it could recognize objects better than hand-crafted systems built by experts over decades.
The same techniques spread quickly. Speech recognition. Medical imaging. Drug discovery. The architecture was general enough to work across many domains.
Language was next. And language was harder.
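An image is a grid of numbers: pixel brightnesses, each one a numerical value, where nearby values mean similar brightness. A word is not a number. "Cat" and "dog" aren't close to or far from each other the way numbers are, and giving each word an arbitrary ID doesn't help, because the distance between two IDs carries no meaning. You can't just feed words into a neural network the same way you feed in pixels.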
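To make the contrast concrete, here is a minimal sketch. The pixel values, the five-word vocabulary, and the helper `id_gap` are all made up for illustration, and numbering words alphabetically is just one arbitrary scheme, not how any particular system does it.

```python
# Pixel brightnesses are already numbers, and numeric distance means something:
# 200 is closer in brightness to 210 than it is to 20.
pixels = [200, 210, 20]
pixel_gap_similar = abs(pixels[0] - pixels[1])
pixel_gap_different = abs(pixels[0] - pixels[2])

# Hypothetical toy vocabulary. IDs are assigned by alphabetical order,
# which is an arbitrary choice made only for this illustration.
vocabulary = sorted(["cat", "dog", "car", "kitten", "carpet"])
word_to_id = {word: index for index, word in enumerate(vocabulary)}


def id_gap(first: str, second: str) -> int:
    """Distance between two words measured only by their arbitrary IDs."""
    return abs(word_to_id[first] - word_to_id[second])


print("pixel gaps:", pixel_gap_similar, pixel_gap_different)
print("cat vs carpet:", id_gap("cat", "carpet"))
print("cat vs kitten:", id_gap("cat", "kitten"))
```

Under this numbering, "cat" ends up nearer to "carpet" than to "kitten", purely because of spelling. The numbers exist, but they say nothing about meaning.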
The next module is about that problem: how do you turn language into something a machine can actually work with?