In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams published a paper that introduced the backpropagation training algorithm.
Backpropagation is a technique for computing all the gradients automatically in just two passes through the network (one forward, one backward): it finds out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, the training algorithm performs a regular Gradient Descent step, and the whole process is repeated until the network converges to a solution.
Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some parameters.
Automatically computing gradients is called automatic differentiation, or autodiff. The autodiff technique used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss).
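The two-pass idea can be sketched in plain Python for a tiny scalar function (the function and all values here are illustrative): the forward pass stores every intermediate result, and the backward pass applies the chain rule from the single output back to each input.

```python
# Reverse-mode autodiff sketch for f(w1, w2) = (w1*x + w2)**2.

x = 3.0
w1, w2 = 2.0, 1.0

# forward pass: store every intermediate value
a = w1 * x        # a = 6.0
b = a + w2        # b = 7.0
f = b ** 2        # f = 49.0

# backward pass: propagate the derivative from the output to the inputs
df_db = 2 * b         # d(b^2)/db
df_da = df_db * 1.0   # b = a + w2  ->  db/da = 1
df_dw1 = df_da * x    # a = w1 * x  ->  da/dw1 = x
df_dw2 = df_db * 1.0  # db/dw2 = 1

print(df_dw1, df_dw2)  # 42.0 14.0
```

Note that one forward and one backward pass yield the gradient with respect to both inputs at once, which is why reverse mode suits functions with many inputs and a single output (the loss).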
Backpropagation consists of two passes: a forward pass and a backward pass.
Backpropagation handles one mini-batch at a time (for example containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.
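The epoch / mini-batch structure can be sketched as follows (the array shapes and loop variables here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 training instances, 4 features
batch_size = 32
n_epochs = 3

batches_seen = 0
for epoch in range(n_epochs):
    # shuffle once per epoch, then walk the full training set in mini-batches
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = X[order[start:start + batch_size]]
        batches_seen += 1  # forward and backward passes would happen here

print(batches_seen)  # ceil(100/32) = 4 batches per epoch, times 3 epochs = 12
```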
Step #1: Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer.
Step #2: The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer.
Step #3: Again, its output is computed and passed on to the next layer, and so on until we get the output of the last layer, the output layer.
This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
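The three steps above can be sketched as a forward pass through two dense layers, keeping every intermediate result in a cache for the backward pass (the layer sizes and the sigmoid activation are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 4))                  # one mini-batch of 32 instances
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8) # hidden layer parameters
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1) # output layer parameters

cache = {}                                    # intermediates, preserved for the backward pass
cache["z1"] = X @ W1 + b1                     # hidden layer pre-activations
cache["a1"] = sigmoid(cache["z1"])            # hidden layer outputs
cache["z2"] = cache["a1"] @ W2 + b2           # output layer pre-activations
cache["a2"] = sigmoid(cache["z2"])            # network output, shape (32, 1)
```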
Step #1- Loss Function: Next, the algorithm measures the network’s output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
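For example, with the mean squared error (one common choice of loss function), the error measure is just the average squared difference between desired and actual outputs:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # desired outputs
y_pred = np.array([0.9, 0.2, 0.8])   # actual network outputs

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.01 + 0.04 + 0.04) / 3 = 0.03
```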
Step #2- Chain Rule: After getting the output of the output layer, the algorithm computes how much each output connection contributed to the error. This is done analytically by applying the chain rule, which makes this step fast and precise.
Step #3: The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule — and so on until the algorithm reaches the input layer.
- As we explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
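Putting the forward pass, the chain-rule backward pass, and the Gradient Descent step together, one full training step looks roughly like this (a sketch for a tiny network with a sigmoid hidden layer, a linear output, and MSE loss; all sizes and the learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                        # mini-batch of inputs
y = rng.normal(size=(32, 1))                        # targets
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.1

def loss(W1, b1, W2, b2):
    a1 = sigmoid(X @ W1 + b1)
    return np.mean((a1 @ W2 + b2 - y) ** 2)

before = loss(W1, b1, W2, b2)

# forward pass (intermediates preserved)
z1 = X @ W1 + b1
a1 = sigmoid(z1)
y_hat = a1 @ W2 + b2

# backward pass: chain rule from the loss back to every weight and bias
n = len(X)
d_yhat = 2 * (y_hat - y) / n            # dL/d(y_hat) for MSE
dW2 = a1.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_a1 = d_yhat @ W2.T                    # propagate the gradient to the layer below
d_z1 = d_a1 * a1 * (1 - a1)             # sigmoid'(z) = a * (1 - a)
dW1 = X.T @ d_z1
db1 = d_z1.sum(axis=0)

# Gradient Descent step on every weight and bias
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

after = loss(W1, b1, W2, b2)            # with a small enough step, the loss drops
```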
It is important to initialize all the hidden layers’ connection weights randomly.
- For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical.
- In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be too smart.
- If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
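The symmetry problem can be demonstrated directly: with all-zero initialization, every hidden neuron receives the exact same gradient at every step, so the columns of the weight matrix stay identical, whereas random initialization breaks the tie (the tiny network and step count here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))
y = rng.normal(size=(16, 1))

def one_step(W1, W2):
    # forward pass
    a1 = sigmoid(X @ W1)
    y_hat = a1 @ W2
    # backward pass (chain rule), then a Gradient Descent update
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = a1.T @ d_yhat
    d_z1 = (d_yhat @ W2.T) * a1 * (1 - a1)
    dW1 = X.T @ d_z1
    return W1 - 0.1 * dW1, W2 - 0.1 * dW2

# zero init: after any number of steps, all 5 hidden neurons remain identical
W1, W2 = np.zeros((3, 5)), np.zeros((5, 1))
for _ in range(2):
    W1, W2 = one_step(W1, W2)
cols_identical = np.allclose(W1, W1[:, :1])   # every column equals the first

# random init: the symmetry is broken and the neurons diverge
W1r, W2r = one_step(rng.normal(size=(3, 5)), rng.normal(size=(5, 1)))
cols_differ = not np.allclose(W1r, W1r[:, :1])
```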
- During the backward pass, the stored input values (X) are used along with the current weight and bias values to compute the gradients.
When training any neural network, the aim is to minimize the loss, a function of the difference between the desired output y and the predicted output ŷ. Backpropagation does this job by adjusting each weight in the network in proportion to how much it contributes to the overall error.