The two key ideas of deep learning for computer vision, convolutional neural networks and backpropagation, were already well understood in 1989. The Long Short-Term Memory (LSTM) algorithm, which is fundamental to deep learning for
timeseries, was developed in 1997 and has barely changed since. So why did deep learning only take off after 2012? What changed in these two decades?
In general, three technical forces are driving advances in machine learning:
Hardware
Datasets and benchmarks
Algorithmic advances
The real bottlenecks throughout the 1990s and 2000s were data and hardware. But here’s what happened during that time: the internet took off, and high-performance graphics chips were developed for the needs of the gaming market.
Between 1990 and 2010, off-the-shelf CPUs became faster by a factor of approximately 5,000. Throughout the 2000s, companies like NVIDIA and AMD invested
billions of dollars in developing fast, massively parallel chips (graphical processing
units [GPUs]) to power the graphics of increasingly photorealistic video games—
cheap, single-purpose supercomputers designed to render complex 3D scenes on your
screen in real time. In 2007, NVIDIA launched CUDA, a programming interface for its line of GPUs. A small number of GPUs started replacing massive clusters of CPUs in various highly parallelizable applications.
Today, the NVIDIA TITAN X, a gaming GPU that cost $1,000 at the end of 2015,
can deliver a peak of 6.6 TFLOPS in single precision: 6.6 trillion float32 operations
per second. That’s about 350 times more than what you can get out of a modern laptop.
In 2016, at its annual I/O convention, Google revealed its tensor processing unit (TPU) project: a new chip design developed from the ground up to run deep neural networks, which is reportedly 10 times faster and far more energy efficient than top-of-the-line GPUs.
If deep learning is the steam engine of this revolution, then data is its coal: the raw material that powers our intelligent machines, without which nothing would be possible.
The game changer has been the rise of the internet, making it feasible to collect and distribute very large datasets for machine learning.
Flickr, for instance, has been a treasure trove of data for computer vision. So have YouTube videos. And Wikipedia is a key dataset for natural language processing.
In addition to hardware and data, until the late 2000s we were missing a reliable way to train very deep neural networks. As a result, neural networks were still fairly shallow, and thus unable to shine against more refined shallow methods such as SVMs and random forests.
This changed around 2009–2010 with the advent of several simple but important
algorithmic improvements that allowed for better gradient propagation:
Better activation functions for neural layers, such as the rectified linear unit (ReLU)
Better weight-initialization schemes, starting with layer-wise pretraining, which was quickly abandoned
Better optimization schemes, such as RMSProp and Adam
Only when these improvements began to allow for training models with 10 or more
layers did deep learning start to shine.
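To make these improvements concrete, here is a minimal NumPy sketch of the three ingredients in isolation: a ReLU activation, a variance-preserving ("He-style") weight initialization, and a single RMSProp update step. The layer sizes and hyperparameter values are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# ReLU: unlike saturating sigmoids, it passes gradients through
# unchanged wherever the input is positive.
def relu(x):
    return np.maximum(0.0, x)

# He-style initialization: scale weights by sqrt(2 / fan_in) so
# activation variance stays roughly stable as depth grows
# (hypothetical layer sizes).
fan_in, fan_out = 256, 128
w = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

# One RMSProp update: divide each gradient by a running RMS of its
# recent magnitudes, giving every weight an adaptive learning rate.
def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    cache = rho * cache + (1 - rho) * grad**2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

grad = rng.standard_normal(w.shape)  # stand-in gradient for the demo
cache = np.zeros_like(w)
w, cache = rmsprop_step(w, grad, cache)
```

In a real network these pieces work together: ReLU keeps gradients alive through many layers, sensible initialization keeps signals from exploding or vanishing at the start of training, and RMSProp (or Adam) keeps the step sizes usable throughout.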
Finally, in 2014, 2015, and 2016, even more advanced ways to help gradient propagation
were discovered, such as batch normalization, residual connections, and depthwise
separable convolutions. Today we can train from scratch models that are thousands of layers deep.
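The residual connection mentioned above has a very simple form: a layer's input is added back to its output, so gradients always have a direct identity path through the block. The sketch below, with a single hypothetical weight matrix standing in for a learned transformation, shows that even a deep stack of such blocks keeps the input signal reachable.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Residual block: output = input + f(input). The "+ x" identity term
# lets gradients flow straight through, however deep the stack.
# (Illustrative: a real block would learn `w` and often normalize.)
def residual_block(x, w):
    return x + relu(x @ w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))        # a small batch of activations
w = rng.standard_normal((64, 64)) * 0.01  # small init keeps the demo stable

# Stack 50 residual blocks; the signal stays finite and well-shaped.
y = x
for _ in range(50):
    y = residual_block(y, w)
```

Without the `x +` term, 50 ReLU layers initialized this way would shrink the signal toward zero; the identity shortcut is what makes training at such depths tractable.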