## What is deep learning?

*Deep learning* is a form of machine learning that models patterns in data as complex, multi-layered networks. Because deep learning is the most general way to model a problem, it has the potential to solve difficult problems (such as computer vision and natural language processing) that outstrip both conventional programming and other machine learning techniques.

Deep learning not only can produce useful results where other methods fail, but also can build more accurate models than other methods, and can reduce the time needed to build a useful model. However, training deep learning models requires a great deal of computing power. Another drawback to deep learning is the difficulty of interpreting deep learning models.

The defining characteristic of deep learning is that the model being trained has more than one *hidden layer* between the input and the output. In most discussions, deep learning means using deep neural networks. There are, however, a few algorithms that implement deep learning using other kinds of hidden layers besides neural networks.

## Deep learning vs. machine learning

I mentioned that deep learning is *a form of* machine learning. I’ll refer to non-deep machine learning as *classical machine learning*, to conform to common usage.

In general, classical machine learning algorithms run much faster than deep learning algorithms; one or more CPUs will often be sufficient to train a classical model. Deep learning models often need hardware accelerators such as GPUs for training, and also for deployment at scale. Without them, the models would take months to train.

For many problems, some classical machine learning algorithm will produce a “good-enough” model. For other problems, classical machine learning algorithms have not worked terribly well in the past.

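Language translation is one example: Google Translate switched from using its old phrase-based statistical machine translation algorithms (one kind of classical machine learning) to using a deep neural network built with Google’s own framework.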

Image recognition is another example. LeNet-5 is a seven-level *convolutional neural network* (CNN) for recognition of handwritten digits digitized in 32×32 pixel images. To analyze higher-resolution images, the LeNet-5 network would need to be expanded to more neurons and more layers.

According to one report written in 2016, deep learning has been used successfully to predict how molecules will interact in order to help pharmaceutical companies design new drugs, to search for subatomic particles, and to automatically parse microscope images used to construct a 3-D map of the human brain.

## Deep learning neural networks

The ideas for “artificial” neural networks go back to the 1940s. The essential concept is that a network of artificial neurons built out of interconnected threshold switches can learn to recognize patterns in the same way that an animal brain and nervous system (including the retina) does.

### Backpropagation

The learning in deep neural networks occurs by strengthening the connection between two neurons when both are active at the same time during training. In modern neural network software this is most commonly a matter of adjusting the weight values for the connections between neurons using a rule called *backpropagation of error*, also known as backprop or BP.

### Neurons

How are the neurons modeled? Each has a propagation function that transforms the outputs of the connected neurons, often with a weighted sum. The output of the propagation function passes to an activation function, which fires when its input exceeds a threshold value.
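As a minimal sketch of the neuron model just described, the Python below uses a weighted sum as the propagation function and a simple threshold test as the activation function. The function names, the bias term, and the sample weights are illustrative assumptions, not taken from any particular library.

```python
def propagate(inputs, weights, bias=0.0):
    """Propagation function: weighted sum of the upstream outputs plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def step_activation(net_input, threshold=0.0):
    """Activation function: fire (1.0) only when the input exceeds the threshold."""
    return 1.0 if net_input > threshold else 0.0


def neuron_output(inputs, weights, bias=0.0, threshold=0.0):
    """Chain propagation and activation for a single artificial neuron."""
    return step_activation(propagate(inputs, weights, bias), threshold)


# Hypothetical example: two upstream neurons feeding one neuron.
fired = neuron_output([1.0, 0.5], [0.4, 0.6], bias=-0.5)   # net input is about 0.2
silent = neuron_output([1.0, 0.5], [0.4, 0.6], bias=-1.0)  # net input is about -0.3
```

With the same inputs and weights, only the bias decides whether this neuron fires, which is one reason biases are trained along with the weights.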

### Activation functions

In the 1940s and 1950s artificial neurons used a step activation function and were called *perceptrons*. Modern neural networks may *say* they are using perceptrons, but they actually have smooth activation functions, such as the logistic or sigmoid function, the hyperbolic tangent, and the Rectified Linear Unit (ReLU). ReLU is usually the best choice for fast convergence, although it has an issue of neurons “dying” during training if the learning rate is set too high.
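To make the comparison concrete, here is a small sketch of the activation functions named above, written with only the Python standard library. The function names are my own, and the sigmoid is the plain textbook form, which can overflow for very large negative inputs.

```python
import math


def step(x, threshold=0.0):
    """Perceptron-style step: all-or-nothing output."""
    return 1.0 if x > threshold else 0.0


def sigmoid(x):
    """Logistic function: smooth, output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: smooth, output between -1 and 1."""
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: zero for negative input, identity otherwise."""
    return max(0.0, x)


# Print each function side by side for a few sample inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, step(x), round(sigmoid(x), 3), round(tanh(x), 3), relu(x))
```

ReLU's slope is zero for every negative input, so a neuron whose inputs are pushed far into the negative range receives no gradient and stops updating, which is the "dying" behavior mentioned above.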

The output of the activation function can pass to an output function for additional shaping. Often, however, the output function is the identity function, meaning that the output of the activation function is passed to the downstream connected neurons.

### Neural network topologies

Now that we know about the neurons, we need to learn about the common neural network topologies. In a feed-forward network, the neurons are organized into distinct layers: one input layer, any number of hidden processing layers, and one output layer, and the outputs from each layer go only to the next layer.
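A feed-forward pass can be sketched in a few lines of plain Python. The layer sizes, the hand-picked weights, and the helper names `dense_layer` and `feed_forward` are assumptions made for illustration; a real network would learn its weights during training.

```python
def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def dense_layer(inputs, weights, biases, activation):
    """One layer: each neuron takes a weighted sum of all inputs, then activates."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the signal through the layers in order; each output feeds only the next layer."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = dense_layer(signal, weights, biases, activation)
    return signal


# Hypothetical network: 2 inputs, 3 hidden neurons, 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),  # hidden layer
    ([[1.0, 1.0, 1.0]], [0.0], identity),                             # output layer
]
output = feed_forward([2.0, 1.0], layers)  # hidden activations are [1.0, 1.5, 0.0]
```

Here the third hidden neuron receives a negative weighted sum, so ReLU silences it and the output neuron simply adds the other two activations.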

In a feed-forward network with shortcut connections, some connections can jump over one or more intermediate layers. In recurrent neural networks, neurons can influence themselves, either directly, or indirectly through the next layer.

### Training

Supervised learning of a neural network is done just like any other machine learning. You present the network with groups of training data, compare the network output with the desired output, generate an error vector, and apply corrections to the network based on the error vector. Groups of training data that are run together before applying corrections are called batches (or mini-batches); one complete pass through the whole training set is called an epoch.

For those interested in the details, backpropagation uses the gradient of the error (or cost) function with respect to the weights and biases of the model to discover the correct direction to minimize the error. Two things control the application of corrections: the optimization algorithm, and the learning rate variable, which usually needs to be small to guarantee convergence and avoid causing dead ReLU neurons.
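The idea of following the gradient downhill can be shown on the smallest possible model: a single linear neuron with one weight and one bias. This is a sketch under stated assumptions, not a full backpropagation implementation: the toy data, the mean squared error cost, and the learning rate are all chosen by hand, and the gradient is written out directly. In a multi-layer network, backpropagation computes the same kind of gradient layer by layer using the chain rule.

```python
# Toy training set for the hypothetical target y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

w, b = 0.0, 0.0        # model parameters: weight and bias
learning_rate = 0.05   # small step size, picked by hand for this example


def mean_squared_error(w, b):
    """Cost function: average squared difference between output and target."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_error = mean_squared_error(w, b)

for epoch in range(500):
    # Gradient of the cost with respect to the weight and the bias.
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / len(data)
    # Step against the gradient, scaled by the learning rate.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mean_squared_error(w, b)
print(round(w, 4), round(b, 4))
```

Each pass through the loop uses the whole training set once, so it corresponds to one epoch with a single batch. Raising the learning rate far enough makes the same loop overshoot and diverge, which is why the value usually has to stay small.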

### Optimizers

Optimizers for neural networks typically use some form of gradient descent algorithm to drive the backpropagation, often with a mechanism to help avoid becoming stuck in local minima, such as optimizing randomly selected mini-batches (Stochastic Gradient Descent) and applying *momentum* corrections to the gradient. Some optimization algorithms also adapt the learning rates of the model parameters by looking at the gradient history (AdaGrad, RMSProp, and Adam).
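A hedged sketch of stochastic gradient descent with momentum follows, again on a one-weight model so the update rule stays visible. The data, the batch size of 2, and the hyperparameter values are illustrative assumptions; frameworks differ in exactly how they formulate the momentum term, and this is only one common textbook form.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy training set for the hypothetical target y = 3x.
data = [(x, 3.0 * x) for x in (-2.0, -1.0, 1.0, 2.0)]

w = 0.0               # the single weight being trained
velocity = 0.0        # running, decayed sum of past gradient steps
learning_rate = 0.05  # hand-picked for this example
momentum = 0.5        # fraction of the previous velocity that is kept

for step in range(300):
    # Stochastic part: estimate the gradient from a random mini-batch.
    batch = random.sample(data, 2)
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    # Momentum part: blend the new gradient step into the velocity.
    velocity = momentum * velocity - learning_rate * grad
    w += velocity

print(round(w, 4))
```

Adaptive optimizers such as AdaGrad, RMSProp, and Adam keep additional per-parameter statistics of the gradient history and use them to scale each parameter's step, which this sketch does not attempt.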

As with all machine learning, you need to check the predictions of the neural network against a separate validation data set. Without doing that you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.

### Real DNNs

A deep neural network for a real problem might have upwards of 10 hidden layers. Its topology might be simple or quite complex.

The more layers in the network, the more characteristics it can recognize. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train.

## Deep learning algorithms

As I mentioned earlier, most deep learning is done with deep neural networks. Convolutional neural networks (CNN) are often used for machine vision. Recurrent neural networks (RNN) are often used for natural language and other sequence processing, as are Long Short-Term Memory (LSTM) networks and attention-based neural networks. Random Forests, also known as Random Decision Forests, which are not neural networks, are useful for a range of classification and regression problems.

### CNN neural networks

Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically takes the integrals of many small overlapping regions. The pooling layer performs a form of non-linear downsampling. ReLU layers apply the non-saturating activation function **f(x) = max(0, x)**. In a fully connected layer, the neurons have connections to all activations in the previous layer. A loss layer computes how the network training penalizes the deviation between the predicted and true labels, using a Softmax or cross-entropy loss function for classification, or a Euclidean loss function for regression.
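The first three layer types can be sketched in plain Python on a tiny image. Everything here is an illustrative assumption: the 5×5 image, the 2×2 edge-detecting kernel, and the helper names. The convolution is written as a sliding weighted sum with an unflipped kernel, which is how deep learning libraries commonly implement it.

```python
def conv2d(image, kernel):
    """Slide the kernel over every overlapping region and take a weighted sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu_map(feature_map):
    """Apply f(x) = max(0, x) to every value."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2x2(feature_map):
    """Non-linear downsampling: keep the largest value in each 2x2 block."""
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, len(feature_map[0]) - 1, 2)]
        for i in range(0, len(feature_map) - 1, 2)
    ]


# Hypothetical image: a bright vertical stripe two pixels wide.
image = [[0, 1, 1, 0, 0] for _ in range(5)]
kernel = [[-1, 1],
          [-1, 1]]   # responds to dark-to-bright edges, left to right

features = conv2d(image, kernel)        # each row is [2, 0, -2, 0]
rectified = relu_map(features)          # each row is [2, 0, 0, 0]
pooled = max_pool2x2(rectified)         # [[2, 0], [2, 0]]
```

The kernel gives a positive response at the stripe's left edge and a negative one at its right edge; ReLU discards the negative response, and pooling halves the resolution while keeping the strongest signal.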

### RNN, LSTM, and attention-based neural networks

In feed-forward neural networks, information flows from the input, through the hidden layers, to the output. This limits the network to dealing with a single state at a time.

In recurrent neural networks, the information cycles through a loop, which allows the network to remember recent previous outputs. This allows for the analysis of sequences and time series. RNNs have two common issues: exploding gradients (easily fixed by clamping the gradients) and vanishing gradients (not so easy to fix).
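Clamping an exploding gradient is often done by rescaling it whenever its length exceeds a limit. The sketch below shows that idea in plain Python; the function name `clip_by_norm` and the limit of 5.0 are assumptions for illustration, and clamping each component separately is another common variant.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its Euclidean length is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)          # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in gradients]


exploding = [30.0, -40.0]               # length 50
clipped = clip_by_norm(exploding, max_norm=5.0)   # roughly [3.0, -4.0]
unchanged = clip_by_norm([0.3, -0.4], max_norm=5.0)
```

Rescaling keeps the direction of the gradient and only limits the step size, which is why it does not help with vanishing gradients: there is no useful direction left to rescale.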

In LSTMs, the network is capable of forgetting (gating) previous information or remembering it, in both cases by altering weights. This effectively gives an LSTM both long-term and short-term memory and solves the vanishing gradient problem. LSTMs can deal with sequences of hundreds of past inputs.
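The gating can be illustrated with a scalar LSTM cell, using the commonly published gate equations. This is a simplified sketch: real LSTM layers use weight matrices rather than single numbers, and the parameter dictionary, the gate names, and the hand-set values below are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell; params maps gate name to (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, bias = params[name]
        return squash(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)          # how much of the old cell state to keep
    write = gate("input", sigmoid)            # how much of the new candidate to store
    candidate = gate("candidate", math.tanh)  # proposed new content
    expose = gate("output", sigmoid)          # how much of the cell state to reveal

    c = forget * c_prev + write * candidate   # long-term memory (cell state)
    h = expose * math.tanh(c)                 # short-term memory (hidden state)
    return h, c


# Hypothetical weights: forget gate held open, input gate held shut.
remember = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 1.0
for x in (0.5, -0.3, 0.9):
    h, c = lstm_step(x, h, c, remember)
```

With these weights the cell state stays close to its starting value of 1.0 no matter what inputs arrive, which is the "remembering" behavior; flipping the sign of the forget gate's bias would make the cell discard its state instead.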

Attention modules are generalized gates that apply weights to a vector of inputs. A hierarchical neural attention encoder uses multiple layers of attention modules to deal with tens of thousands of past inputs.
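One way to picture an attention module is as a softmax-weighted average. The sketch below uses dot-product scoring, which is my choice for illustration rather than something the description above specifies; the query, keys, and values are made-up numbers.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score each input against the query, then return the weights and weighted sum."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Hypothetical inputs: three past items, the second matching the query best.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
weights, context = attend([0.0, 3.0], keys, values)
```

Unlike a hard gate, every input keeps some weight, so the module can shift its focus smoothly as the query changes.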

### Random Forests

Another kind of deep learning algorithm—not a deep neural network—is the Random Forest, or Random Decision Forest. A Random Forest is constructed from many layers, but instead of neurons it is constructed from decision trees, and outputs a statistical average (mode for classification or mean for regression) of the predictions of the individual trees. The randomized aspects of Random Forests are the use of bootstrap aggregation (a.k.a. *bagging*) for individual trees and taking random subsets of the features.
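The randomized ingredients and the final aggregation are easy to sketch with the Python standard library. This deliberately leaves out the training of the decision trees themselves; the helper names and sample values are illustrative assumptions.

```python
import random
from statistics import mean, mode


def bootstrap_sample(rows, rng):
    """Bagging: draw as many rows as the original set, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree (or one split) to consider."""
    return sorted(rng.sample(range(n_features), k))


def aggregate_classification(tree_predictions):
    """Forest output for classification: the most common class among the trees."""
    return mode(tree_predictions)


def aggregate_regression(tree_predictions):
    """Forest output for regression: the mean of the trees' predictions."""
    return mean(tree_predictions)


rng = random.Random(42)                  # seeded generator for repeatability
rows = list(range(10))                   # stand-ins for ten training rows
sample = bootstrap_sample(rows, rng)     # some rows repeat, some are left out
features = random_feature_subset(n_features=6, k=2, rng=rng)

label = aggregate_classification(["cat", "dog", "cat"])
estimate = aggregate_regression([1.0, 2.0, 3.0])
```

Because each tree sees a different bootstrap sample and a different slice of the features, the trees make different mistakes, and averaging their predictions cancels much of that error.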

## Deep learning frameworks

While you could write deep learning programs from first principles, it’s far more efficient to use deep learning frameworks, especially given that they have been optimized for use with GPUs and other accelerators. The pre-eminent framework is TensorFlow, which originated at Google. The favored high-level API for TensorFlow is Keras, which can also be used with other back-end frameworks.

PyTorch, from Facebook and others, is a strong alternative to TensorFlow, and has the distinction of supporting dynamic neural networks, in which the topology of the network can change from epoch to epoch. Fastai is a high-level third-party API that uses PyTorch as a back-end.

MXNet, from Amazon and others, is another strong alternative to TensorFlow, with a claim to better scalability. Gluon is the preferred high-level imperative API for MXNet.

Chainer, from IBM, Intel, and others, was in some ways the inspiration for PyTorch, given that it takes a define-by-run approach to building the neural network and supports dynamic neural networks.