In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams published a paper that introduced the backpropagation training algorithm.
Backpropagation is a technique for computing all the gradients automatically in just two passes through the network (one forward, one backward): it finds out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, the training algorithm performs a regular Gradient Descent step, and the whole process is repeated until the network converges to a solution.
Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some parameters.
Automatically computing gradients is called automatic differentiation, or autodiff. The autodiff technique used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss).
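The two-pass idea can be sketched in plain Python for a tiny scalar function (the function and all values here are illustrative): the forward pass stores every intermediate result, and the backward pass applies the chain rule from the single output back to each input.

```python
# Reverse-mode autodiff sketch for f(w1, w2) = (w1*x + w2)**2.

x = 3.0
w1, w2 = 2.0, 1.0

# forward pass: store every intermediate value
a = w1 * x        # a = 6.0
b = a + w2        # b = 7.0
f = b ** 2        # f = 49.0

# backward pass: propagate the derivative from the output to the inputs
df_db = 2 * b         # d(b^2)/db
df_da = df_db * 1.0   # b = a + w2  ->  db/da = 1
df_dw1 = df_da * x    # a = w1 * x  ->  da/dw1 = x
df_dw2 = df_db * 1.0  # db/dw2 = 1

print(df_dw1, df_dw2)  # 42.0 14.0
```

Note that one forward and one backward pass yield the gradient with respect to both inputs at once, which is why reverse mode suits functions with many inputs and a single output (the loss).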
Backpropagation consists of two passes: a forward pass and a backward pass.
Backpropagation handles one mini-batch at a time (for example containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.
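The epoch / mini-batch structure can be sketched as follows (the array shapes and loop variables here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 training instances, 4 features
batch_size = 32
n_epochs = 3

batches_seen = 0
for epoch in range(n_epochs):
    # shuffle once per epoch, then walk the full training set in mini-batches
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = X[order[start:start + batch_size]]
        batches_seen += 1  # forward and backward passes would happen here

print(batches_seen)  # ceil(100/32) = 4 batches per epoch, times 3 epochs = 12
```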
Step #1: Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer.
Step #2: The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer.
Step #3: Again, its output is computed and passed on to the next layer, and so on until we get the output of the last layer, the output layer.
This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
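The three steps above can be sketched as a forward pass through two dense layers, keeping every intermediate result in a cache for the backward pass (the layer sizes and the sigmoid activation are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 4))                  # one mini-batch of 32 instances
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8) # hidden layer parameters
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1) # output layer parameters

cache = {}                                    # intermediates, preserved for the backward pass
cache["z1"] = X @ W1 + b1                     # hidden layer pre-activations
cache["a1"] = sigmoid(cache["z1"])            # hidden layer outputs
cache["z2"] = cache["a1"] @ W2 + b2           # output layer pre-activations
cache["a2"] = sigmoid(cache["z2"])            # network output, shape (32, 1)
```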
Step #1- Loss Function: Next, the algorithm measures the network’s output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
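For example, with the mean squared error (one common choice of loss function), the error measure is just the average squared difference between desired and actual outputs:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # desired outputs
y_pred = np.array([0.9, 0.2, 0.8])   # actual network outputs

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.01 + 0.04 + 0.04) / 3 = 0.03
```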
Step #2- Chain Rule: After getting the output of the output layer, the algorithm computes how much each output connection contributed to the error. This is done analytically by applying the chain rule, which makes this step fast and precise.
Step #3: The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule — and so on until the algorithm reaches the input layer.
- As we explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
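Putting the forward pass, the chain-rule backward pass, and the Gradient Descent step together, one full training step looks roughly like this (a sketch for a tiny network with a sigmoid hidden layer, a linear output, and MSE loss; all sizes and the learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                        # mini-batch of inputs
y = rng.normal(size=(32, 1))                        # targets
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.1

def loss(W1, b1, W2, b2):
    a1 = sigmoid(X @ W1 + b1)
    return np.mean((a1 @ W2 + b2 - y) ** 2)

before = loss(W1, b1, W2, b2)

# forward pass (intermediates preserved)
z1 = X @ W1 + b1
a1 = sigmoid(z1)
y_hat = a1 @ W2 + b2

# backward pass: chain rule from the loss back to every weight and bias
n = len(X)
d_yhat = 2 * (y_hat - y) / n            # dL/d(y_hat) for MSE
dW2 = a1.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_a1 = d_yhat @ W2.T                    # propagate the gradient to the layer below
d_z1 = d_a1 * a1 * (1 - a1)             # sigmoid'(z) = a * (1 - a)
dW1 = X.T @ d_z1
db1 = d_z1.sum(axis=0)

# Gradient Descent step on every weight and bias
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

after = loss(W1, b1, W2, b2)            # with a small enough step, the loss drops
```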
It is important to initialize all the hidden layers’ connection weights randomly.
- For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical.
- In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be too smart.
- If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
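The symmetry problem can be demonstrated directly: with all-zero initialization, every hidden neuron receives the exact same gradient at every step, so the columns of the weight matrix stay identical, whereas random initialization breaks the tie (the tiny network and step count here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))
y = rng.normal(size=(16, 1))

def one_step(W1, W2):
    # forward pass
    a1 = sigmoid(X @ W1)
    y_hat = a1 @ W2
    # backward pass (chain rule), then a Gradient Descent update
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = a1.T @ d_yhat
    d_z1 = (d_yhat @ W2.T) * a1 * (1 - a1)
    dW1 = X.T @ d_z1
    return W1 - 0.1 * dW1, W2 - 0.1 * dW2

# zero init: after any number of steps, all 5 hidden neurons remain identical
W1, W2 = np.zeros((3, 5)), np.zeros((5, 1))
for _ in range(2):
    W1, W2 = one_step(W1, W2)
cols_identical = np.allclose(W1, W1[:, :1])   # every column equals the first

# random init: the symmetry is broken and the neurons diverge
W1r, W2r = one_step(rng.normal(size=(3, 5)), rng.normal(size=(5, 1)))
cols_differ = not np.allclose(W1r, W1r[:, :1])
```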
- During the backward pass, the stored input values (X) are used along with the current weight and bias values to compute the gradients.
When training any neural network, the aim is to minimize the loss, a function of the difference between the desired output y and the predicted output ŷ. Backpropagation does this job by adjusting each weight in the network in proportion to how much it contributes to the overall error.