Vanishing-Exploding Gradients and TBPTT
- Why is the big number $\large{\color{Purple}( \infty )}$ a problem?
- Why is the small number a problem?
- Gradient clipping for exploding gradients
- Solution for vanishing gradients
- Solution for expensive gradient computation: Truncated Back Propagation Through Time (TBPTT)
A Recurrent Neural Network is a type of ANN that contains loops, allowing information to be stored within the network. It is absolutely essential for sequence-like information (variable input size). A plain ANN, by contrast, has:
- No memory element.
- No dependence of the present data on the previous data.
In general, RNNs handle variably sized, sequential data: they combine an input vector with a state vector via a fixed function to produce a new state.
- Variably sized: the number of features is fixed; the length of the data is not.
- It looks very much like a feedforward neural network, except it also has connections pointing backward.
Weight matrix for a single hidden layer RNN | RNN- Many-To-Many | Total number of Layers = 'T'
We take a linear combination of
$\Huge{\color{Purple} h}$ and $\Huge{\color{Purple} x}$: there is a weight matrix $\Huge{\color{Purple} W}$ which we multiply by $\Huge{\color{Purple} h_{t-1}}$ of the previous layer, and some other weight matrix which we multiply by $\Huge{\color{Purple} x_{t}}$ of the same layer.
$\large{\color{Purple} g }$ need not be the same as $\large{\color{Purple} g^* }$; in fact $\large{\color{Purple} g^* }$ need not even be a non-linear function.
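Putting this together (using the W, U, V names introduced below for the three matrices, and writing b and c for the biases, which are labels added here for illustration), the two update rules presumably take this form:

$$\large{\color{Purple} h_t = g\big(W\,h_{t-1} + U\,x_t + b\big), \qquad \hat{y}_t = g^{*}\big(V\,h_t + c\big) }$$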
Step #1: Weight calculation for hidden layers and output layers
To train an RNN, the trick is to unroll it through time and then simply use regular backpropagation. This strategy is called backpropagation through time (BPTT).
Like regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows).
Note: plain backpropagation is the standard training procedure for an ANN or a Multi-Layer Perceptron.
In order to keep the expressions simple, we give aliases to the above two expressions, like
The matrices
$\large{\color{Purple} \textbf{W} }$, $\large{\color{Purple} \textbf{U} }$, $\large{\color{Purple} \textbf{V} }$ do not change with time (or across the layers); the same values are used at every time step. In an ANN, by contrast, each layer has its own weight matrix. These shared matrices still need to be updated during training; see Weight update in Backpropagation.
Answer: The following vectors do not change with time (or across the layers).
For the backpropagation we need to find the derivative of the loss function, let us say
the summation of all the intermediate losses through the layers.
- We assume that g is a non-linear function, and
- the loss function used is the least-squares error.
- We need to find out
[To be continued]
The gradients of that cost function are then propagated backward through the unrolled network (represented by the solid arrows).
- The model parameters are updated using the gradients computed during BPTT.
Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in the figure the cost function is computed using the last three outputs of the network, so gradients flow through those three outputs, but not through Y(0) and Y(1)).
- Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.
- Fortunately, tf.keras takes care of all of this complexity for you
Hidden Layers:
Answer: We always take a linear combination followed by a non-linearity. In RNNs we typically use tanh as the non-linearity in the hidden layers.
- So in this case the non-linear function is tanh, and we need a linear combination of h and x: there is some weight matrix W which we multiply by h, and some other weight matrix which we multiply by x.
- Those two weight matrices are different in general. Not only that, they also have different sizes.
- So this is the general formula for the hidden layer of an RNN; some people will replace this tanh by fw or by g.
The calculation for hidden layer h2 would be-
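Spelling this out (a sketch using the $W_{hh}$, $W_{xh}$ and bias notation from the answer further below):

$$\large{\color{Purple} h_2 = \tanh\big(W_{hh}\,h_1 + W_{xh}\,x_2 + b_h\big), \qquad \text{where } h_1 = \tanh\big(W_{hh}\,h_0 + W_{xh}\,x_1 + b_h\big) }$$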
Answer:
Now in some cases it simply makes sense for this function to be a linear function (for a regression output). In some cases it makes sense for the function to be a non-linear function (for a classification output).
- If it is a classification task, and let us say it is a binary classification task, then g will become a Sigmoid σ.
- If it is a multiclass classification task, you will use a Softmax.
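In other words, the output step is a linear combination of $h_t$ followed by a task-dependent $g$ (identity for regression, σ for binary classification, softmax for multiclass); the $W_{hy}$, $b_y$ names below are just illustrative labels:

$$\large{\color{Purple} \hat{y}_t = g\big(W_{hy}\,h_t + b_y\big) }$$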
Answer: The weights and bias in h3 are the same $W_{hh}$, $W_{xh}$ and $b_h$ as for h2.
RNN- Many-To-Many | The number of Layers = 'T'
- Now when you have multiple predicted values: let us say that 10 days ago you have the weather $\large{\color{Purple} h_0}$, or the temperature $\large{\color{Purple} {x}_0}$, in some city, let us say Chennai.
- Given that input, you would have the next day's temperature, let us say that is $\large{\color{Purple} \hat{y_1}}$, the next day's temperature $\large{\color{Purple} \hat{y_2}}$, and the next day's temperature $\large{\color{Purple} \hat{y_3}}$, until, let us say, today's temperature, which is $\large{\color{Purple} \hat{y_T}}$.
- Now for each one of them, you also have a corresponding ground truth, which would be $\large{\color{Purple} y_1,\ y_2,\ y_3,\ \dots,\ y_T }$. And whenever you have a ground truth and a prediction and these two differ, you will have a loss.
- So the total loss is -
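Written out, the total loss described above is just the sum of the per-time-step losses:

$$\large{\color{Purple} L = \sum_{t=1}^{T} L_t\big(y_t,\ \hat{y}_t\big) }$$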
- Now in terms of $L_t$ itself, or the local loss function, you again have many choices, but we have seen only 2 so far:
  - cross entropy - classification
  - least-squares error - regression or a numerical output
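For reference, the two local losses mentioned above are typically written as follows (the ½ in the least-squares form is just a common convention):

$$\large{\color{Purple} L_t^{\text{CE}} = -\sum_{c} y_{t,c}\,\log \hat{y}_{t,c}, \qquad L_t^{\text{LS}} = \tfrac{1}{2}\,\lVert y_t - \hat{y}_t \rVert^2 }$$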
The main problem with RNNs is unstable gradients.
The basic reason we had to do BPTT is that the W, U, V matrices are constant across time, because of which you get recursive expressions for the derivative of the loss with respect to W and with respect to U.
The main issues that come up are
- The gradient calculations either explode or vanish; neither is ideal.
- The gradient calculations are expensive.
- The solution for exploding gradients is gradient clipping.
- The solution for vanishing gradients is alternate architectures LSTM, GRU.
- The solution for expensive gradient calculations is Truncated Back Propagation Through Time(TBPTT).
There are a few ways by which you can get an idea of whether your model is suffering from exploding gradients or not. They are:
- If the model weights become unexpectedly large in the end.
- Your model has a poor loss.
- Or the model displays NaN loss whilst training.
- The gradient value for error persists over 1.0 for every subsequent iteration during training.
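One simple way to watch for this in practice is to print the global gradient norm during training. The sketch below is a minimal, hypothetical tf.keras example (the model, data and sizes are made up purely for illustration):

```python
import tensorflow as tf

# Hypothetical toy model and batch, just to show how to inspect gradient norms.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),          # variable-length sequences, 1 feature
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()

x_batch = tf.random.normal((8, 50, 1))        # (batch, time steps, features)
y_batch = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    y_pred = model(x_batch, training=True)
    loss = loss_fn(y_batch, y_pred)

grads = tape.gradient(loss, model.trainable_variables)
global_norm = tf.linalg.global_norm(grads)    # one number summarising all gradients
print("loss:", float(loss), "| gradient norm:", float(global_norm))
# If this norm keeps growing across iterations (or the loss turns NaN),
# the gradients are most likely exploding.
```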
Link: Detect Vanishing Gradients
TBPTT works with both the **vanishing gradient** and the **exploding gradient** problems, so it is sort of a compound solution.
Now remember the example that we had before this, that example had 65,000 time sequences.
- Now, would you really forward propagate through the full thing and come back through the full thing? By that time, almost any correction you give will lead to vanishing or exploding gradient problems.
- Plus, it would become potentially very expensive just to do one gradient update.
TBPTT- forward propagate through 'k1' steps and back propagate through 'k2' steps
Let's consider the boxes here: each box represents an input, hidden layer and output with its corresponding loss. Remember that we are assuming the relationship is the same everywhere, and in fact you can cut the sequence anywhere in the middle and you are going to get exactly the same W.
The basic idea is that instead of training on the whole sequence, you split it up into many mini-batches.
- I forward propagate through 2 steps, then back propagate through 2 steps. The W remains the same everywhere, so I get some new updated W.
- Then I step forward a little bit: forward propagate through another 2 steps, back propagate through 2 steps; my W is now updated again.
- With the W updated this way, I keep moving forward through the whole sequence, repeating the process.
We saw that you cannot simply calculate the gradient in the usual way: let us say I have L3; I cannot simply compute its derivative with respect to W in the usual way, because h3 itself depends on W through h2 and h1.
The above is applicable for the derivative with respect to U as well.
This is basically what we call backpropagation through time, because none of these terms is independent. Now, this kind of dependency creates several problems.
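To make the dependency explicit, here is the standard chain-rule expansion for this case, written as a sketch in the W, h notation used above (the same pattern holds for U):

$$\large{\color{Purple} \frac{\partial L_3}{\partial W} \;=\; \sum_{k=1}^{3} \frac{\partial L_3}{\partial h_3}\, \left( \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W} }$$

Each factor $\frac{\partial h_j}{\partial h_{j-1}}$ contains the same W (and a tanh derivative), which is exactly what makes the terms interdependent.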
Heuristic Description.(rough)
Now, all of these are heuristic arguments, but they turn out to be remarkably good approximations; unfortunately I cannot go further here.
- But if I take the norm (let us say the 2-norm) $\| h_{t+n} \|$ - notice $\vec{h}_t$ is a vector.
- If I take its norm, it will be some factor times the norm of $h_t$.
- A norm is a scalar, so this is a number: you are trying to find the size of $h_{t+n}$, and that will be some number times the size of $h_t$.
- And it turns out that it scales approximately as the eigenvalues (λ) of $W^n$, like the following:

$$\large{\color{Purple} \| h_{t+n} \| \;\sim\; \lambda^n \, \| h_t \| }$$
Another way to see this is to assume that W is diagonal. If W is diagonal, then $W^n$ will simply have all its eigenvalues, i.e. all its diagonal terms, raised to the power n.
Now, which eigenvalue? We will see shortly.
The eigenvalue will either be the largest or the smallest:
- The worst-case scenario is when the eigenvalue is the largest.
- The best-case scenario (or the smallest case) is when the eigenvalue is the smallest.
Because of this scaling, as long as I use the same W throughout time, which I do for an RNN, these vectors constantly get larger in magnitude or constantly get smaller in magnitude.
So, if you have a large number of time steps, this factor, even if it is close to 1, for example 1.01, is over time going to grow into a huge number (this is the power of the exponential function, or of the power function).
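A quick numeric sanity check of this heuristic (plain Python, illustrative numbers only):

```python
# A factor only slightly above or below 1, raised to the number of time steps,
# explodes or vanishes very quickly.
for lam in (1.01, 0.99):
    for n in (10, 100, 1000, 10000):
        print(f"lambda = {lam}, n = {n:5d}, lambda**n = {lam ** n:.3e}")
# 1.01**10000 is astronomically large, while 0.99**10000 is essentially zero.
```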
- Now this is simply for h. You can show (and I would request you to try this out by looking at the expressions in BPTT) that similar arguments hold true for $\frac{\partial \mathrm{L_3}}{\partial \mathrm{W}}$ also.
Answer: This is a problem because obviously you are never going to get exactly ∞, since you are still dealing with finite numbers. But the problem is that the moment a value goes above the largest number your machine can represent, it will actually show you NaN (not a number), or it will show you ∞, and so on. So, really speaking, finite-precision machines cannot handle exploding gradients.
Answer: Similarly, you will never actually get to 0. Something like $0.99^{1000}$ will just be a very, very small number. But the problem is that it might actually become smaller than about $10^{-16}$, which is roughly the smallest number that you can represent accurately on a finite-precision machine. At that point you will no longer train; that is called saturation. So you get a very small gradient and the update is practically gone.
There is another problem: notice this tanh; even the tanh is being repeated multiple times. You have $h_t = \tanh(W h_{t-1})$ and $h_{t+1} = \tanh(W h_t)$, so effectively you have tanh applied twice (call it $\tanh^2$); similarly you will have $\tanh^3$, and so on.
$\tanh^2$ is flat; $\tanh^3$ will look even flatter.
Look at tanh and $\tanh^2$: $\tanh^2$ looks flatter. And if you take $\tanh^{100}$, it will look flatter still. Notice that in all these cases the gradients become flatter and flatter and hence very small.
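A small NumPy sketch of this flattening effect (the point at which the slope is measured is an arbitrary illustration choice):

```python
import numpy as np

# Compose tanh with itself k times and watch the output range shrink and the
# slope away from the origin drop: gradients through many tanh steps get small.
x = np.linspace(-3, 3, 601)
for k in (1, 2, 3, 10, 100):
    y = x.copy()
    for _ in range(k):
        y = np.tanh(y)
    slope = np.gradient(y, x)                 # numerical derivative of the composition
    i = np.searchsorted(x, 1.0)               # index of x close to 1
    print(f"tanh composed {k:3d} times: output range ±{y.max():.3f}, slope near x=1: {slope[i]:.3f}")
```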
Now, all these problems put together lead to these 2 issues. The repeated-tanh problem leads only to the vanishing gradient issue, but a large number of layers (time steps) can lead either to exploding gradients or to vanishing gradients; both of these make training very difficult.
Answer: It is very simple: we decide on a maximum allowable gradient size. What do I mean by the value of a gradient? A gradient is a vector, so you cannot give it a single value; you can, however, give a value to the norm of the gradient.
- Say-
- My new gradient
$\vec{g^*}$ is in the same direction as the gradient you calculated, but I am cutting down its size.
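In symbols, if $\lVert \vec{g} \rVert$ exceeds the chosen threshold, we rescale: $\vec{g}^{*} = \frac{\text{threshold}}{\lVert \vec{g} \rVert}\,\vec{g}$. A minimal sketch of this (the threshold value 1.0 is arbitrary); in tf.keras the same thing is available as the clipnorm argument on optimizers:

```python
import tensorflow as tf

# Clipping by norm: keep the gradient's direction, cap its length.
max_norm = 1.0                               # the maximum allowable gradient size we decided on
g = tf.constant([3.0, 4.0])                  # toy gradient with norm 5
g_clipped = tf.clip_by_norm(g, max_norm)     # same direction, norm cut down to 1
print(g_clipped.numpy())                     # approximately [0.6, 0.8]

# In tf.keras this is a one-liner on the optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=max_norm)
```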
Answer: Unfortunately, no such simple solution exists; alternate architectures such as LSTM and GRU are the result.
Answer: This solution handles, to a certain extent, both the vanishing gradient and the exploding gradient problems.
So, suppose we have data with thousands and thousands of time steps, and you want to calculate backpropagation through time. Now how would you do it?
Forward propagate through the whole thing and calculate the whole of the loss.
Then you will back propagate through the whole thing.
Now we have 65,000 time steps. If you go forward through the full thing and come back through the full thing, by that time almost any correction you give will lead to vanishing or exploding gradient problems; plus, it would become potentially very expensive just to do one gradient update.
So the solution to that is truncated back propagation through time.
Since throughout the RNN you use exactly the same W, instead of training on the whole sequence you split it up into many mini-batches (similar to mini-batch gradient descent).
I will forward propagate through the first 2 steps and back propagate through 2 steps; this is one possibility. Since W is the same everywhere, I will get some new updated W.
Next I forward propagate through another 2 steps and back propagate through 2 steps; my W is now updated again. I keep on doing this through the whole sequence.
If you back propagate through only a small amount of data, your gradient will neither blow up nor vanish. Now, what is a good rule of thumb? It is actually hard to say: for some problems a hundred steps are good; for some problems 10 or 20 steps are good, etc.
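Below is a minimal sketch of this idea in tf.keras (the chunk length, layer sizes and toy data are all made-up illustration values); the key points are carrying the hidden state across chunks while cutting the gradient at each chunk boundary with tf.stop_gradient:

```python
import tensorflow as tf

k = 20                                          # truncation length (k1 = k2 = 20 here)
cell = tf.keras.layers.SimpleRNNCell(16)
readout = tf.keras.layers.Dense(1)
optimizer = tf.keras.optimizers.Adam(1e-3)

# Toy long series: 1 sequence, 1000 time steps, 1 feature, with per-step targets.
series = tf.random.normal((1, 1000, 1))
targets = tf.random.normal((1, 1000, 1))

state = [tf.zeros((1, 16))]                     # initial hidden state
for start in range(0, 1000, k):
    x_chunk = series[:, start:start + k, :]
    y_chunk = targets[:, start:start + k, :]
    with tf.GradientTape() as tape:
        loss = 0.0
        s = [tf.stop_gradient(state[0])]        # truncate: no gradient flows into earlier chunks
        for t in range(x_chunk.shape[1]):
            out, s = cell(x_chunk[:, t, :], s)  # forward propagate one step
            loss += tf.reduce_mean((readout(out) - y_chunk[:, t, :]) ** 2)
    variables = cell.trainable_variables + readout.trainable_variables
    grads = tape.gradient(loss, variables)      # back propagate through this chunk only
    optimizer.apply_gradients(zip(grads, variables))
    state = s                                   # carry the hidden state into the next chunk
```

In practice, the same effect is usually obtained simply by cutting the data into fixed-length windows before feeding it to an RNN layer (optionally with stateful=True to carry the state across batches).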
The deep RNNs are particularly important in language processing especially in language translation.
Now, what are deep RNNs? Let us look at just one of these: if I look at one of these boxes within the RNN, it is just an ANN, as we saw with normal RNNs. In a normal RNN all you had was one input layer, one hidden layer and one output layer. In a deep RNN, that single layer of the RNN becomes a deep neural network; that is the only difference between a deep RNN and a normal RNN.
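A minimal tf.keras sketch of the difference (layer widths are arbitrary illustration values): a normal RNN has one recurrent layer, a deep RNN just stacks several, with return_sequences=True on every recurrent layer except the last so each layer hands a full sequence to the next.

```python
import tensorflow as tf

deep_rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),                    # variable-length sequences, 1 feature
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32),                      # last recurrent layer: final state only
    tf.keras.layers.Dense(1),
])
deep_rnn.summary()
```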
Many-to-Many (top left), Many-to-One (top right), One-to-Many (bottom left), and Encoder–Decoder (bottom right) networks
- An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs.
- Example #1: Language Translation, Speech Recognition
- The output does not come out at the same time as the input, and the size of the output need not be the same as the input.
- Example #2: Video frame-by-frame analysis
- The output size is fixed by the input size.
- You could feed the network a sequence of inputs and ignore all outputs except for the last one.
- For example: Sentiment Analysis: you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score (a minimal sketch appears after this list).
- Conversely, you could feed the network the same input vector over and over again at each time step and let it output a sequence.
- For example: Image Captioning- the input could be an image (or the output of a CNN), and the output could be a caption(text) for that image.
- Lastly, you could have a sequence-to-vector network, called an *encoder*, followed by a vector-to-sequence network, called a *decoder*.
- For example, this could be used for translating a sentence from one language to another.
- You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language.
- This two-step model, called an *Encoder–Decoder*, works much better than trying to translate on the fly with a single *sequence-to-sequence RNN* (like the one represented at the top left): the last words of a sentence can affect the first words of the translation, so you need to wait until you have seen the whole sentence before translating it.
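As promised in the Many-to-One item above, here is a minimal sentiment-analysis-style sketch in tf.keras (the vocabulary size and layer widths are made-up illustration values): a sequence of word ids goes in, a single sentiment score comes out.

```python
import tensorflow as tf

vocab_size = 10_000                                      # assumed vocabulary size
many_to_one = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                       # variable-length sequence of word ids
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.SimpleRNN(32),                       # only the final hidden state is kept
    tf.keras.layers.Dense(1, activation="sigmoid"),      # sentiment score in [0, 1]
])
many_to_one.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```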
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Géron
- NPTEL(Dr. Balaji Srinivasan)
- YouTube 1